Race & Oscars Winners

Mikayla Edwards

11/22/2020

Hypothesis

I found two datasets with Oscars data. The first is Oscars nominated movies 2000-2017 (https://www.kaggle.com/vipulgote4/oscars-nominated-movies-from-2000-to-2017) and the second is a demographics dataset of Academy Awards winners (https://www.kaggle.com/fmejia21/demographics-of-academy-awards-oscars-winners). Given the #OscarsSoWhite movement of prior years, I hypothesize that movies with higher proportions of people of color are less likely to win an award in any of the five categories below:

- Best Picture won

- Best Director won

- Best Actor won

- Best Actress won

- People's Choice won

These categories were selected from the Oscars nominated movies dataset because they have both 'nomination' and 'winner' columns. They are also the most well-known categories, which maximizes the amount of data available for my analysis. To identify films with people of color, I used the race_simple column, which lists each person as either White or POC (person of color). I also used Excel to add a column called POC_percent. This column was created by grouping the data by movie and then dividing the number of POC by the total number of people listed for that movie.
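
The POC_percent calculation could also be done in R rather than Excel. The sketch below is only an illustration on made-up rows: the film titles are hypothetical, and it assumes one row per person with a race_simple value of "White" or "POC".

```r
# Hypothetical toy data: one row per person, with the film they are listed under.
people <- data.frame(
  film = c("Film A", "Film A", "Film A", "Film A", "Film B", "Film B"),
  race_simple = c("White", "POC", "White", "White", "POC", "POC")
)

# Flag each person as POC (1) or not (0).
is_poc <- as.numeric(people$race_simple == "POC")

# For every film, average the flag: number of POC / number of people listed.
# ave() returns one value per row, repeated within each film.
people$POC_percent <- ave(is_poc, people$film, FUN = mean)

print(people)
```

Note that the result is a proportion between 0 and 1, matching the division described above, not a value out of 100.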

Merging the Datasets

I noticed that the datasets have a common column, the title, written as 'movie' in one and 'film' in the other, so I used an inner join to combine them. An inner join keeps only the titles that appear in both datasets.

library(readxl)

# Load both spreadsheets (paths are relative to the working directory)
Oscar_nominated_Movies_2000_2017 <- read_excel("Desktop/Oscar nominated Movies 2000-2017.xlsx")
View(Oscar_nominated_Movies_2000_2017)

demographics <- read_excel("Desktop/demographics.xlsx")
View(demographics)


library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Oscar_demographics <- inner_join(Oscar_nominated_Movies_2000_2017, demographics, by = c("movie" = "film"))
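
The choice of join affects which movies survive the merge. Below is a minimal sketch on made-up tables (the film titles and values are hypothetical, not taken from the Kaggle data) that contrasts inner_join, used in the call above, with left_join.

```r
library(dplyr)

# Hypothetical toy tables; the real datasets have many more columns.
movies <- data.frame(
  movie = c("Film A", "Film B", "Film C"),
  Oscar_Best_Picture_won = c("Yes", "No", "No")
)
people <- data.frame(
  film = c("Film A", "Film A", "Film D"),
  race_simple = c("White", "POC", "White")
)

# inner_join keeps only titles present in both tables.
inner <- inner_join(movies, people, by = c("movie" = "film"))

# left_join keeps every row of `movies` and fills unmatched columns with NA.
left <- left_join(movies, people, by = c("movie" = "film"))

# anti_join lists the movies that the inner join drops.
dropped <- anti_join(movies, people, by = c("movie" = "film"))

print(inner)
print(left)
print(dropped$movie)
```

Checking the dropped titles this way is one option for seeing how much data is lost to the merge, for example because of spelling differences between the two title columns.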

Cleaning the data

The merged data had a lot of columns I wasn't interested in so I deleted the excess columns.

Oscar_demographics <- Oscar_demographics[-c(1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17)]

Oscar_demographics <- Oscar_demographics[-c(3,5,7,9,11)]

Oscar_demographics <- Oscar_demographics[-c(8,9,10,11,12,13,14,15,16,17)]

Oscar_demographics <- Oscar_demographics[-c(8:18)]

Oscar_demographics <- Oscar_demographics[-c(8:25)]

Oscar_demographics <- Oscar_demographics[-c(8:15)]

Oscar_demographics <- Oscar_demographics[-c(8:11)]

Oscar_demographics <- Oscar_demographics[-c(9:40)]

Oscar_demographics <- Oscar_demographics[-c(9:19)]

Oscar_demographics <- Oscar_demographics[-c(9)]
Oscar_demographics <- Oscar_demographics[-c(6,7)]
View(Oscar_demographics)

## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1
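
Dropping columns by position depends on the exact column order, so an alternative is to keep columns by name. This is only a sketch on a made-up data frame: the `year` and `duration` columns are hypothetical stand-ins for unwanted columns, and the `keep` list is an assumption about which columns matter rather than the exact final set.

```r
# Hypothetical stand-in for the merged data, with two columns we do not need.
merged <- data.frame(
  movie = c("Film A", "Film B"),
  year = c(2015, 2016),
  duration = c(120, 95),
  race_simple = c("POC", "White"),
  POC_percent = c(0.5, 0),
  People_Choice_won = c(1, 0)
)

# Columns to keep, listed by name.
keep <- c("movie", "race_simple", "POC_percent", "People_Choice_won")

# Fail early if an expected column is missing from the data.
missing_cols <- setdiff(keep, names(merged))
stopifnot(length(missing_cols) == 0)

# Subset by name; column order in the source no longer matters.
trimmed <- merged[, keep]
print(names(trimmed))
```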

Binary

Next, I converted all the columns of interest from Yes/No or White/POC to binary values. For race_simple, I coded POC = 1 and White = 0.

Oscar_demographics$race_simple <- ifelse(Oscar_demographics$race_simple == "White", 0, 1)
Oscar_demographics$Oscar_Best_Picture_won <- ifelse(Oscar_demographics$Oscar_Best_Picture_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actor_won <- ifelse(Oscar_demographics$Oscar_Best_Actor_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actress_won <- ifelse(Oscar_demographics$Oscar_Best_Actress_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Director_won <- ifelse(Oscar_demographics$Oscar_Best_Director_won == "No", 0, 1)

View(Oscar_demographics)

## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1
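
One thing to watch with ifelse(x == "No", 0, 1) is that any value other than "No", including a typo, becomes 1. The sketch below shows a stricter recoding on made-up values; `to_binary` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper: map two known labels to 1 and 0,
# and leave anything unexpected (including NA) as NA.
to_binary <- function(x, one, zero) {
  out <- rep(NA_real_, length(x))
  out[x %in% one] <- 1
  out[x %in% zero] <- 0
  out
}

# Made-up values, including a missing entry and a lowercase typo.
won <- c("Yes", "No", NA, "yes")

strict <- to_binary(won, one = "Yes", zero = "No")
loose <- ifelse(won == "No", 0, 1)

print(strict)  # the typo "yes" stays NA so it can be inspected
print(loose)   # the typo "yes" is silently coded as 1
```

Counting the NA values after the strict recoding is a quick way to spot labels that need cleaning before running the correlations.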

Correlation

Next, instead of running separate correlation analyses for each column with POC_percent and race_simple, I decided to run a correlation matrix to see them all at once.

library(GGally)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggcorr(Oscar_demographics, low = "steelblue", mid = "white", high = "brown")

## Warning in ggcorr(Oscar_demographics, low = "steelblue", mid = "white", : data
## in column(s) 'movie', 'race' are not numeric and were ignored

It appears that the strongest correlations for both POC_percent and race_simple are with the People_Choice_won category.
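
Because the plot encodes correlations as colors, it can help to look at the numbers directly. The sketch below uses made-up values (not the real data) and base R's cor() and cor.test(), computing Pearson correlations on pairwise-complete observations, which I believe is also what ggcorr draws by default.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  POC_percent = c(0.0, 0.1, 0.25, 0.5, 0.4, 0.0),
  race_simple = c(0, 0, 1, 1, 1, 0),
  People_Choice_won = c(0, 0, 1, 1, 0, 0)
)

# Pearson correlation matrix for all numeric columns.
corr <- cor(toy, use = "pairwise.complete.obs", method = "pearson")
print(round(corr, 2))

# Test a single pair to get a p-value alongside the estimate.
result <- cor.test(toy$POC_percent, toy$People_Choice_won)
print(result$estimate)
print(result$p.value)
```

With so few rows the p-value means little here; the point is only to show where the exact coefficient and its test come from.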

Example

An example consistent with this correlation can be seen in the graph above, where The Martian had the highest POC_percent and won the People's Choice Award.

Results & Further Research

My hypothesis was not supported: movies with higher proportions of people of color appear more likely to win one of the five awards, the People's Choice Award. This is interesting because it is the only one of the five categories decided by the general public and not the Academy. Further research with a larger dataset would be worthwhile, as the general public and the Academy may hold different standards for what counts as a 'good film'. It would also be interesting to explore the high positive correlation between the Best Actor and Best Director categories.

Updated 12/28/2022