I found two datasets with Oscars data. The first is called Oscars nominated movies 2000-2017 (https://www.kaggle.com/vipulgote4/oscars-nominated-movies-from-2000-to-2017) and the second is demographics (https://www.kaggle.com/fmejia21/demographics-of-academy-awards-oscars-winners). I hypothesize that, in light of the #OscarsSoWhite movement of prior years, movies with higher proportions of people of color are less likely to win an award in any of the five categories below:
- Best Picture (Oscar_Best_Picture_won)
- Best Director (Oscar_Best_Director_won)
- Best Actor (Oscar_Best_Actor_won)
- Best Actress (Oscar_Best_Actress_won)
- People's Choice (People_Choice_won)
These categories were selected from the Oscars nominated movies dataset because each has both a 'nomination' and a 'winner' column. They are also the most well-known categories. This selection lets me make the most of the data available for my analysis. To identify films with people of color, I used the race_simple column, which lists each person as either White or POC (person of color). I also used Excel to add a column called POC_percent. This column was created by filtering the data by movie and then dividing the number of POC by the total number of people listed for that movie.
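The POC_percent column could also be computed in R rather than Excel. The following is a minimal sketch, not the method actually used here: it assumes one row per person, a `film` title column, and that race_simple still holds the text labels "White" and "POC". The toy rows are invented for illustration, and the result is a proportion between 0 and 1, which may differ in scale from the Excel column.

```r
library(dplyr)

# Toy stand-in for the demographics data; these rows are invented.
demo <- data.frame(
  film = c("Film A", "Film A", "Film A", "Film B", "Film B"),
  race_simple = c("White", "POC", "POC", "White", "White"),
  stringsAsFactors = FALSE
)

# For each film, the share of listed people coded as POC (0 to 1).
demo <- demo %>%
  group_by(film) %>%
  mutate(POC_percent = mean(race_simple == "POC")) %>%
  ungroup()

print(demo)
```

Computing the column in code keeps the calculation reproducible if the source data is refreshed.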
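I noticed that the datasets have a common column, the title, written as 'movie' in one and 'film' in the other, so I used an inner join on that column to combine them. An inner join keeps only the titles that appear in both datasets.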
library(readxl)
# Read both Kaggle datasets, saved locally as Excel workbooks
Oscar_nominated_Movies_2000_2017 <- read_excel("Desktop/Oscar nominated Movies 2000-2017.xlsx")
View(Oscar_nominated_Movies_2000_2017)
demographics <- read_excel("Desktop/demographics.xlsx")
View(demographics)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Oscar_demographics <- inner_join(Oscar_nominated_Movies_2000_2017, demographics, by = c("movie" = "film"))
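To make the effect of the join type concrete, here is a small sketch on invented toy tables (the titles and values are hypothetical; only the `movie` and `film` key names come from the data above). It compares `inner_join` with `left_join` using the same `by = c("movie" = "film")` mapping.

```r
library(dplyr)

# Toy tables; titles and values are invented for illustration.
movies <- data.frame(
  movie = c("Film A", "Film B", "Film C"),
  Oscar_Best_Picture_won = c("Yes", "No", "No")
)
people <- data.frame(
  film = c("Film A", "Film A", "Film D"),
  race_simple = c("White", "POC", "White")
)

# inner_join keeps only titles present in both tables.
inner <- inner_join(movies, people, by = c("movie" = "film"))

# left_join keeps every row of `movies`; unmatched rows get NA.
left <- left_join(movies, people, by = c("movie" = "film"))

# Titles in `movies` with no match in `people`.
unmatched <- anti_join(movies, people, by = c("movie" = "film"))

nrow(inner)      # 2
nrow(left)       # 4
unmatched$movie  # "Film B" "Film C"
```

Because a film with several listed people matches several rows, the joined table has one row per person rather than one row per film, which is worth keeping in mind when reading film-level columns later.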
The merged data had a lot of columns I wasn't interested in, so I dropped them by position in several passes.
Oscar_demographics <- Oscar_demographics[-c(1, 3:17)]
Oscar_demographics <- Oscar_demographics[-c(3, 5, 7, 9, 11)]
Oscar_demographics <- Oscar_demographics[-c(8:17)]
Oscar_demographics <- Oscar_demographics[-c(8:18)]
Oscar_demographics <- Oscar_demographics[-c(8:25)]
Oscar_demographics <- Oscar_demographics[-c(8:15)]
Oscar_demographics <- Oscar_demographics[-c(8:11)]
Oscar_demographics <- Oscar_demographics[-c(9:40)]
Oscar_demographics <- Oscar_demographics[-c(9:19)]
Oscar_demographics <- Oscar_demographics[-c(9)]
Oscar_demographics <- Oscar_demographics[-c(6,7)]
View(Oscar_demographics)
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1
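Dropping columns by position works, but it breaks silently if the column order ever changes. A hedged alternative is to keep columns by name with `dplyr::select`. The sketch below uses an invented toy table in which `budget` and `rating` are hypothetical stand-ins for the unwanted columns; the kept names are taken from the columns used later in this analysis.

```r
library(dplyr)

# Toy merged table; `budget` and `rating` are hypothetical unwanted columns.
merged <- data.frame(
  movie = c("Film A", "Film B"),
  budget = c(10, 20),
  race_simple = c("POC", "White"),
  rating = c(7.1, 6.4),
  Oscar_Best_Picture_won = c("Yes", "No")
)

# Keep columns by name so the result does not depend on column order.
kept <- merged %>%
  select(movie, race_simple, Oscar_Best_Picture_won)

names(kept)
```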
Next, I converted all the columns of interest from Yes/No or White/POC to binary values. For race_simple, I coded POC = 1 and White = 0.
Oscar_demographics$race_simple <- ifelse(Oscar_demographics$race_simple == "White", 0, 1)
Oscar_demographics$Oscar_Best_Picture_won <- ifelse(Oscar_demographics$Oscar_Best_Picture_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actor_won <- ifelse(Oscar_demographics$Oscar_Best_Actor_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Actress_won <- ifelse(Oscar_demographics$Oscar_Best_Actress_won == "No", 0, 1)
Oscar_demographics$Oscar_Best_Director_won <- ifelse(Oscar_demographics$Oscar_Best_Director_won == "No", 0, 1)
View(Oscar_demographics)
## Warning in system2("/usr/bin/otool", c("-L", shQuote(DSO)), stdout = TRUE):
## running command ''/usr/bin/otool' -L '/Library/Frameworks/R.framework/Resources/
## modules/R_de.so'' had status 1
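The same recoding can be written once for every award column instead of one line per column. This is a sketch under stated assumptions, not the code used above: it assumes the award columns all end in `_won` and hold "Yes"/"No", and it runs on an invented toy table.

```r
library(dplyr)

# Toy table with the same kinds of labels; the values are invented.
toy <- data.frame(
  race_simple = c("White", "POC", "White"),
  Oscar_Best_Picture_won = c("Yes", "No", "No"),
  Oscar_Best_Director_won = c("No", "No", "Yes")
)

recoded <- toy %>%
  mutate(
    # POC = 1, White = 0
    race_simple = as.integer(race_simple == "POC"),
    # Yes = 1, anything else = 0, for every column ending in "_won"
    across(ends_with("_won"), ~ as.integer(.x == "Yes"))
  )

print(recoded)
```

One difference from the `ifelse(... == "No", 0, 1)` form: here an unexpected label becomes 0, whereas there it would become 1. Either way, checking the distinct values of each column before recoding is a sensible precaution.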
Next, instead of running a separate correlation analysis for each award column against POC_percent and race_simple, I plotted a correlation matrix to see them all at once.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(Oscar_demographics, low = "steelblue", mid = "white", high = "brown")
## Warning in ggcorr(Oscar_demographics, low = "steelblue", mid = "white", : data
## in column(s) 'movie', 'race' are not numeric and were ignored
It appears that the strongest correlations for both POC_percent and race_simple are with the People_Choice_won category.
One example consistent with this correlation can be seen in the graph above: The Martian had the highest POC_percent and won the People's Choice Award.
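A heatmap shows relative strength but not exact values. To read off a coefficient, one option is base R's `cor` and `cor.test`. The sketch below uses invented toy numbers, not the Oscars data, and assumes a Pearson correlation on pairwise-complete observations, which I believe matches the `ggcorr` defaults. With a 0/1 outcome, this is the point-biserial correlation.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  POC_percent = c(0.0, 0.1, 0.2, 0.5, 0.6, 0.8),
  People_Choice_won = c(0, 0, 0, 1, 0, 1)
)

# Pearson correlation between a numeric variable and a 0/1 outcome.
r <- cor(toy$POC_percent, toy$People_Choice_won,
         use = "pairwise.complete.obs")

# cor.test adds a confidence interval and a p-value.
result <- cor.test(toy$POC_percent, toy$People_Choice_won)

print(r)
print(result$conf.int)
print(result$p.value)
```

With a dataset this small, the confidence interval is likely to be wide, so the coefficient is best read alongside it rather than alone.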
My hypothesis was not supported: movies with higher proportions of people of color were more likely to win one of the five awards, the People's Choice Award. This is interesting because it is the only category decided by the general public rather than the Academy. Further research with a larger dataset would be worthwhile, as the general public and the Academy may hold different standards for what counts as a 'good film'. It would also be interesting to explore the strong positive correlation between Best Actor and Best Director wins.
Updated 12/28/2022