Tuesday, November 15, 2016

College Majors with Fewer Women tend to have Larger Pay Gaps

The gender pay gap has become a key talking point in politics in the past few years. It is commonly quoted that women make only 77¢ for every $1 earned by their male counterparts. This statistic is calculated by comparing the average salary of women to the average salary of men. It does not control for job title, degree, regional differences, company size, or hours worked. After controlling for these factors, women's pay rises from 77% to 98% of men's pay. (Payscale.com does a fantastic analysis and breakdown of the gender pay gap. I highly recommend you check it out.)

Using the salaries that controlled for these factors from the Payscale report, I decided to do further analysis on the pay gap. I wanted to know how the percentage of women in a field affected the pay difference between men and women. Since many companies try to hire equal numbers of men and women for similar roles, and there are significantly fewer women in science and engineering, I assumed these few women would be able to demand a higher salary because of their scarcity. However, I found the opposite.

The interactive graphic below shows that college majors with fewer women tend to have larger differences in pay. (The graphics may be difficult to see on mobile; please switch to a desktop or use the mobile-friendly version.)



Two of the largest outliers are Nursing and Accounting. Nursing is 92.3% women, yet men still make about $2,400 more on average. Accounting is 52.1% women and has a pay difference of +$2,400 for men. One point of interest is Elementary Education, where women make 1.4% more than men.

The following graph shows the same results as above, but instead of absolute difference in dollars, the commonly used metric of women's salaries as a percentage of men's salaries is plotted.
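To make the two metrics concrete, here is a minimal R sketch of how a dollar gap and a percentage-of-men's-salary figure relate. The salary values are placeholders made up for illustration; they are not numbers from the Payscale report.

```r
# Hypothetical median salaries for one major (illustrative values only)
men_salary   <- 60000
women_salary <- 57600

# Absolute gap in dollars (positive means men earn more)
gap_dollars <- men_salary - women_salary

# Women's salary as a percentage of men's salary
women_pct_of_men <- 100 * women_salary / men_salary

gap_dollars
women_pct_of_men
```

The same dollar gap translates into a different percentage depending on the base salary, which is why the two graphs can rank majors differently.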



Any guess as to why this occurs would be largely speculative, but I imagine a "boys' club" mentality could be to blame, where men like to hire men and are even willing to pay more to do so.

The links to the data sources can be found below and as always, add any suggestions in the comments.


Data Sources:

[1] http://www.payscale.com/career-news/2009/12/do-men-or-women-choose-majors-to-maximize-income
[2] https://docs.google.com/spreadsheets/d/1FDrXUk4t-RQekuKotqMD7pyGFmylHip-xarOawcVvqk/edit#gid=0

Monday, October 31, 2016

A Look at Lynching in the United States

With the current racial tension within the United States and the Halloween season, I have seen many posts on Facebook showing black mannequins and other human-like figures hanging from trees. These images spark debate on whether these are directly racist acts or just distasteful mistakes. The comment sections are filled with pseudo-lynching-experts defending both sides.

So I decided to do an analysis of lynching within the United States. I found data on lynching from the Tuskegee Institute [1-2]. This data set contains information on lynchings from 1882 to 1962. I tried to find lynching statistics from before 1882 and found that, during slavery, lynching Africans was fairly uncommon because slave owners had a vested interest in keeping their slaves alive.

The first bit of information I found shocking was that whites made up 27% of lynching victims. However, the percentage of whites vs. blacks changes wildly from state to state. To understand this, the percentage of black victims was plotted by state. As might be expected, the Deep South had the highest percentage of black victims, followed by the rest of the South.



However, a look at the total number of lynchings showed that states in the Deep South did not offend evenly. Mississippi, Georgia, and Texas had the most lynchings (581, 531, and 493, respectively), and then there was a large drop-off to Louisiana and Alabama at 391 and 347. New Hampshire had zero lynchings, and Delaware, Maine, and Vermont had one each. New York and New Jersey had only two each.

Over time, the number of people lynched has dropped to near zero. The next graph shows how lynching has decreased over time for whites and blacks. The decrease in black lynchings lags about 30 years behind the decrease in white lynchings.


In the first few years of the above chart, whites outnumber blacks among lynching victims. Below is a graph showing the percentage breakdown by year. Whites outnumber blacks for 4 years and then blacks outnumber whites for 60 years. In the last few years of the data set, whites again exceed blacks. Given how infrequent lynchings were in these last few years, two white people being lynched can account for 66.7% of the lynchings in a year.
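The small-sample effect in those last years is easy to see with a quick calculation. This is a minimal sketch with a made-up year of three victims, not a row taken from the Tuskegee data; `share_pct` is a helper name introduced here for illustration.

```r
# Share of each group among one year's victims, as a percentage
share_pct <- function(counts) round(100 * counts / sum(counts), 1)

# Hypothetical late-period year with only three recorded lynchings
late_year <- c(white = 2, black = 1)

share_pct(late_year)
```

With so few cases per year, a single additional victim swings the percentages by tens of points, so the late-year shares should not be read as a trend.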

I hope you learned something new reading this analysis. As always, please feel free to add ideas for additional graphs or analysis in the comment section.

Data Link and Notes:

[1] - http://law2.umkc.edu/faculty/projects/ftrials/shipp/lynchingsstate.html
[2] - http://law2.umkc.edu/Faculty/projects/ftrials/shipp/lynchingyear.html

Friday, October 14, 2016

Launching @BestOfData

[Image Credit: Gizmodo]

Since I started HallwayMathlete, many of my friends have taken an interest in data visualization and data journalism in general. They have asked for suggestions on other websites with top-quality stories told through data analysis. The easiest answer is FiveThirtyEight. However, there are many smaller blogs that turn out fantastic material, but in low volume. These smaller websites tend not to have Twitter accounts, Facebook pages, or other venues for fans to follow their most recent work.

For this reason, I am launching @BestOfData. Best of Data is an aggregation of the best smaller data journalism blogs on the web. Currently, Best of Data automatically pulls from a list of my favorite blogs, and I will manually add other articles I find interesting.

The complete list of websites automatically populating @BestOfData:
FlowingData.com
HowMuch.net
RandalOlson.com/Blog
Priceonomics.com
InformationIsBeautiful.net
ToddwSchneider.com
Insidesamegrain.com

Any websites you think I should add to this list? Please add them in the comments.

If you are not a Twitter user (you should really have a Twitter), the list of articles can be found at BestOfData.org.

Sunday, August 7, 2016

Which is the best Pokemon in Pokemon GO?


Below are interactive graphics that display the average max CP of each Pokemon by trainer level. The first graph shows only the top Pokemon in Pokemon GO. The second graph shows all Pokemon, and the remaining graphs are grouped by type.

Tips:
  • If you want to know which Pokemon a line represents, place your mouse on the line and the information will be displayed. 
  • To remove lines from the plot click the Pokemon's name in the legend.
  • To see a graph in greater detail, click and drag over the region you wish to observe. 
Note: The graphics look better on desktop.


Top Pokemon
Best Pokemon:
1. Mewtwo 
2. Dragonite 
3. Mew 
4. Moltres 
5. Zapdos 
6. Snorlax 
7. Arcanine 
8. Lapras 
9. Articuno 
10. Exeggutor 
11. Vaporeon 
12. Gyarados 
13. Flareon 
14. Muk 
15. Charizard 

Top Pokemon Currently Available In-Game 
1. Dragonite 
2. Snorlax 
3. Arcanine 
4. Lapras 
5. Exeggutor 
6. Vaporeon 
7. Gyarados 
8. Flareon 
9. Muk 
10. Charizard 

Some interesting notes: Blastoise and Venusaur are both missing from the top Pokemon. This is surprising for many who played the Game Boy games, because the starter Pokemon were always some of the most powerful. Also, Dragonite and Charizard are the only top 10 Pokemon with three evolution forms. All the others have two evolution forms, except Snorlax and Lapras.

After trainer level 30 the CP limit for each Pokemon grows at a slower rate than previous levels.

All Pokemon
Magikarp is all the way at the bottom and by far the worst Pokemon in Pokemon GO.

By Type 

The data for the above plots comes from Serebii.net. In the following weeks, Hallway Mathlete will post a web-scraping tutorial showing how the data was gathered.



Sunday, July 3, 2016

Analyzing the Annual Republicans vs. Democrats Congressional Baseball Game

Every year, the United States Congress takes a break from blocking each other's bills and plays a charity baseball game. The best part: the teams are split along party lines, Republicans vs. Democrats. The tradition was started in 1909 by Representative John Tener of Pennsylvania, a former professional baseball player. Last week was the annual game, and Republicans were able to break a 7-year winning streak by the Democrats.

The net wins over the series are shown below. The higher on the y-axis, the more Republican wins; the lower on the y-axis, the more Democratic wins. From this graph it is fairly obvious that each party has had long winning streaks. The gray dots represent years when the game was not held or when I could not find any information about the game. In 1935, 1937, 1938, 1939, and 1941, games were held between members of Congress and the press.
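For readers who want to reproduce the net-wins line, here is a minimal sketch of the idea. The `winner` vector is a made-up example, not the actual series; the real results would come from the Wikipedia table listed in the notes.

```r
# Hypothetical sequence of winners, one entry per year ("R" or "D");
# NA marks a year with no game or no record
winner <- c("D", "D", "R", NA, "R", "R", "D")

# +1 for a Republican win, -1 for a Democratic win, 0 when nothing was played
step <- ifelse(is.na(winner), 0, ifelse(winner == "R", 1, -1))

# Running net wins: above zero favors Republicans, below zero favors Democrats
net_wins <- cumsum(step)
net_wins
```

A long run of steps in the same direction shows up as a steady climb or fall in the line, which is how the winning streaks become visible.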


The following graph displays the points scored by each team over time. In the early years of the series, the games had much higher total scores than more recent years.

Next the point differentials were explored. The point differential is the difference between the scores of the two teams. Many of the closest games were held in the late seventies through the nineties. This time period also saw few winning streaks because the competition was fairly even between the parties.
A histogram was created to understand the distribution of point differences. The Democrats have some extremely large wins, with three wins by more than 20 points, while the Republicans have none. Another interesting finding is that only one game ended in a tie. This is surprising because the charity event does not have overtime, so it would be logical to expect more than one of the 81 games played to end in a tie.
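The point differential and its histogram can be sketched in a few lines. The scores below are invented for illustration, and the bin width of 5 is my own choice rather than the one used in the graphic.

```r
# Hypothetical final scores (illustrative only)
rep_score <- c(5, 12, 3, 7, 9)
dem_score <- c(8, 4, 3, 22, 6)

# Positive values are Republican wins, negative are Democratic wins, 0 is a tie
point_diff <- rep_score - dem_score

# Count games by outcome: -1 = Democratic win, 0 = tie, 1 = Republican win
outcomes <- table(sign(point_diff))
outcomes

# Bin the differentials without drawing (plot = FALSE returns the bin counts)
h <- hist(point_diff, breaks = seq(-25, 25, by = 5), plot = FALSE)
h$counts
```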


Over the years, the annual game has been held at many different locations. Each party has had different rates of success at each field. The winning percentage at each field was calculated to understand if either party has a home-field advantage at any park. Langley High School is a bit of an outlier because it was selected as the location after two rain delays and only hosted one game. American League Park II and Georgetown Field were the first two stadiums to host the game, and each only hosted one game. Memorial Stadium had the fourth fewest games with only four, but all other locations had nine or more games.

Ironically, RFK Stadium, named after the famous Democratic U.S. Senator, has given Republicans a strong home-field advantage. Republicans have won 13 out of the 14 games played at the stadium. Democrats have seen similar success at Nationals Park, winning 7 out of the 9 games.
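As a quick check on the two records quoted above, the winning percentages can be computed directly. This is only a sketch using the two stadiums mentioned in the text; the `records` data frame and its column names are my own.

```r
# Records quoted in the text: wins by the more successful party at each stadium
records <- data.frame(
  stadium = c("RFK Stadium", "Nationals Park"),
  party   = c("Republicans", "Democrats"),
  wins    = c(13, 7),
  games   = c(14, 9)
)

# Winning percentage, rounded to one decimal place
records$win_pct <- round(100 * records$wins / records$games, 1)
records
```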

Currently, I am planning to update these graphs each year after the annual game. Please feel free to add ideas for additional graphs or analysis in the comment section.


Notes:
  1. The data came from https://en.wikipedia.org/wiki/Congressional_Baseball_Game#Game_results
  2. Some of the stadiums were renamed over the years and the original data set contained both names. For the analysis, the same stadiums were combined with the most recent name.

Friday, June 24, 2016

How Gender and Race Affect Police Interactions

Recently, police violence has become the focus of a lot of media attention. It has sparked many protests and organizations aimed at reducing police violence. Many of the organizations are specifically focused on reducing violence toward blacks because it is a problem disproportionately affecting the black community. This post seeks to investigate some of these claims and understand the relationship between the violence each group experiences and its violence against police.

The following graphics come from a conversation about disparities between races when it comes to police killings. The discussion turned to the fact that only some disparities are thought of as problems of the system, while others are generally thought to be acceptable. For example, blacks make up about 11% of the population, but 29% of the people killed by police. This disparity is largely seen as racism in law enforcement and the overall justice system. Critics of this assumption usually point to the higher rates of crimes committed by blacks compared to whites and other races. However, using crime statistics from what some believe is a racist institution is not a good method for explaining the differences in police kill rates.

Another group that is disproportionately killed, compared to its percentage of the population, is men. Males make up a little less than half of the population (49.1%), but are 94.2% of the people killed by police. However, no one asserts that the justice department is sexist. The group discussing this matter largely agreed that men are disproportionately killed by police most likely because men kill police more often than women do.

The following graphic was created to compare the population, the proportion of people killed by police, and the number of police killed, broken down by gender. Men make up 94.2% of the people killed by police, but were also responsible for 97.5% of the killings of police. This means that, while only about half the population, men are 16 times more likely than women to be killed by police. However, the killer of a police officer is 39 times more likely to be a man than a woman.
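The percentages and ratios in this paragraph can be reproduced from the counts listed in the R code at the end of this post. This is a sketch of one way to get there: the ratio here divides the male count by everyone else (female plus "Not Reported"), which is an assumption on my part about how the 16 and 39 were derived. Dividing by the female count alone gives roughly 16 and 42 instead. The helper names `male_share` and `male_ratio` are introduced here for illustration.

```r
# Counts taken from the R code section of this post
killed_by_police <- c(Female = 177, Male = 2916, NotReported = 2)
killed_police    <- c(Female = 13,  Male = 551,  NotReported = 1)

# Male share of each total, in percent
male_share <- function(x) round(100 * x[["Male"]] / sum(x), 1)

# Ratio of the male count to all non-male counts
male_ratio <- function(x) x[["Male"]] / (sum(x) - x[["Male"]])

male_share(killed_by_police)
male_share(killed_police)
male_ratio(killed_by_police)
male_ratio(killed_police)
```

Because men and women are each close to half of the population, these raw count ratios are a reasonable stand-in for relative likelihood per person.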

A similar graphic was created broken down by race. (Note: http://killedbypolice.net/ does not classify Asians the same way as the population data and the FBI, so Asians were included in "Other" for the people killed by police. Also, the FBI defines Hispanics as a subset of whites rather than as their own category, which is why Hispanics are not represented in the "Killed Police" section.) The chart below shows blacks are much more likely to be killed by police than their share of the population would suggest. However, while blacks are 29.5% of the people killed by police and 11% of the population, they are responsible for 43% of police officers killed.

Personally, I do not believe you can say that one race can be expected to be killed more because it kills police more. I believe that, unlike gender, there are socio-economic differences between the groups that could lead to a greater likelihood of turning to crime because of a lack of economic opportunity. Another factor is that the populations are not perfectly comparable. White and Asian households have fewer children than black and Hispanic households [3]. This leads to a lower ratio of old people to young people in the black and Hispanic populations. Since the vast majority of people committing murders and/or being killed by police are young, populations with fewer old people will look like they commit more murders per capita.

Below is the R code used to generate the plots.


Sources
[1] http://killedbypolice.net/ (May 2, 2013)
[2] https://www.fbi.gov/about-us/cjis/ucr/leoka/2013/tables/table_44_leos_fk_race_and_sex_of_known_offender_2004-2013.xls (April 10, 2016)
[3] http://www.pewsocialtrends.org/2012/05/17/explaining-why-minority-births-now-outnumber-white-births/ (April 29, 2016)



########################### R Code ################################

# ggplot2 is required for the plots below
library(ggplot2)

######### By Race ###########
# People who killed police officers, by race of known offender
# https://www.fbi.gov/about-us/cjis/ucr/leoka/2013/tables/table_44_leos_fk_race_and_sex_of_known_offender_2004-2013.xls
cop_killer_race <- c("White", "Black", "Asian", "Other")
cop_killer_quantity <- c(289, 243, 9, 24)
variable <- rep("Killed Police", length(cop_killer_race))
percent <- cop_killer_quantity/sum(cop_killer_quantity)
cop_killer <- data.frame(race=cop_killer_race, quantity=cop_killer_quantity, type=variable, percent=percent)

# People killed by cops
# Source killedbypolice.net (May 2, 2013)
cop_killed_race <- c("White", "Black", "Hispanic", "Other")
cop_killed_quantity <- c(782, 464, 302, 26)
variable <- rep("Killed by Police", length(cop_killed_race))
percent <- cop_killed_quantity/sum(cop_killed_quantity)
killed_by_cop <- data.frame(race=cop_killed_race, quantity=cop_killed_quantity, type=variable, percent=percent)

# Population data ("Other" is the sum of three smaller categories)
population_race <- c("White", "Black", "Asian", "Hispanic", "Other")
population_quantity <- c(196817552, 37685848, 14465124, 50477594, 28116441+2932248+540013)
variable <- rep("Population", length(population_race))
percent <- population_quantity/sum(population_quantity)
population <- data.frame(race=population_race, quantity=population_quantity, type=variable, percent=percent)

# Bind the three data frames
df <- rbind(population, killed_by_cop, cop_killer)

# Keep the bars in the order they were bound instead of alphabetical order
df$type <- factor(df$type, levels = unique(as.character(df$type)))

# Plot; position_fill(vjust = 0.5) centers each percent label in its bar segment
ggplot(data=df, aes(x=type, y=quantity, fill=race, group=race, label=paste0(round(percent*100,1),"%"))) +
geom_bar(stat="identity", position = "fill") + labs(x = "", y = "Percent", fill = "Race") +
geom_text(position = position_fill(vjust = 0.5)) + theme_bw() +
annotate("text", label = "HallwayMathlete.com", x = 2, y = -.03, size = 4, colour = "gray")
 
######### By Gender ###########
# ggplot2 is required for the plot below
library(ggplot2)

# People who killed police officers, by sex of known offender
# https://www.fbi.gov/about-us/cjis/ucr/leoka/2013/tables/table_44_leos_fk_race_and_sex_of_known_offender_2004-2013.xls
cop_killer_gender <- c("Female", "Male", "Not Reported")
cop_killer_quantity <- c(13, 551, 1)
variable <- rep("Killed Police", length(cop_killer_gender))
percent <- cop_killer_quantity/sum(cop_killer_quantity)
cop_killer <- data.frame(gender=cop_killer_gender, quantity=cop_killer_quantity, type=variable, percent=percent)

# People killed by cops
# Source killedbypolice.net (May 2, 2013)
cop_killed_gender <- c("Female", "Male", "Not Reported")
cop_killed_quantity <- c(177, 2916, 2)
variable <- rep("Killed by Police", length(cop_killed_gender))
percent <- cop_killed_quantity/sum(cop_killed_quantity)
killed_by_cop <- data.frame(gender=cop_killed_gender, quantity=cop_killed_quantity, type=variable, percent=percent)

# Population data
population_gender <- c("Female", "Male")
population_quantity <- c(143368343, 138053563)
variable <- rep("Population", length(population_gender))
percent <- population_quantity/sum(population_quantity)
population <- data.frame(gender=population_gender, quantity=population_quantity, type=variable, percent=percent)

# Bind the three data frames
df <- rbind(population, killed_by_cop, cop_killer)

# Keep the bars in the order they were bound instead of alphabetical order
df$type <- factor(df$type, levels = unique(as.character(df$type)))

# Plot; position_fill(vjust = 0.5) centers each percent label in its bar segment
ggplot(data=df, aes(x=type, y=quantity, fill=gender, group=gender, label=paste0(round(percent*100,1),"%"))) +
geom_bar(stat="identity", position = "fill") + labs(x = "", y = "Percent", fill = "Gender") +
geom_text(position = position_fill(vjust = 0.5)) + theme_bw() +
annotate("text", label = "HallwayMathlete.com", x = 2, y = -.03, size = 4, colour = "gray")

Sunday, May 29, 2016

Salaries of Presidential Primary Voters by Candidate and State

The data used in this post comes from FiveThirtyEight and is put into easy-to-understand graphics. The first graphic shows the average salary of the supporters of each candidate by state. The states are ranked from the highest average salary, Maryland, to the lowest, Mississippi. The red line shows the average salary for each state. The first obvious conclusion is that John Kasich supporters make about $5-10 thousand more than supporters of other candidates. Second, in low-salary states Clinton and Sanders supporters have similar salaries, but in higher-salary states, Clinton supporters' salaries are even with Trump and Cruz supporters' salaries.


The following plot shows the distribution of average salaries for each Presidential Candidate. Again we see similar trends as before with Kasich having a high average income and Clinton having a mix of low and high income supporters.


The last plot shows the relationship between the average salary of a state and the average salary of a candidate's supporters. The red line is a perfect 1-to-1 ratio, and the closer a candidate is to the red line, the closer the candidate's supporters are to having the same salary as the average person in that state. Almost all the dots fall above the line because people with below-average salaries are less likely to vote.


If you have any suggestions for plots using this data, please share in the comment section.


Note:

[1] Not all states are included because not all states have held their primaries yet.