Welcome to Vizual Statistix! My name is Seth Kadish. I live in Portland, OR, where I work as a scientist. To learn more about me, visit my LinkedIn profile, and send an invitation to connect.
This blog is a product of my passion for data visualization. The data shown here are sourced from other websites, but all statistical operations on these data and the resulting graphics are original.
If you would like to use one of my graphics on your website or in a publication, please email me. I also take requests and am available for freelance work. Contact me if you have a suggestion for a graphic or need support on a project.
My dad is the crossword champion of our family – it’s rare that he can’t polish off the Saturday puzzle. I, on the other hand, am usually googling (i.e., cheating) by Wednesday. I’m just more of a KenKen guy.
After creating the chess square utilization graphic last week, I realized it might be interesting to run a similar analysis on crossword puzzle grids. I pulled the grids of just over 7400 crosswords from The New York Times – the database included all Will Shortz-era puzzles, starting November 21, 1993 through March 10, 2014. I then removed grids with non-standard dimensions. Standard for Monday to Saturday is 15x15. The majority of Sunday puzzles are 21x21; while there are some 23x23, they account for less than 10% of the Sunday puzzles.
I then calculated, separately for each day of the week, how often each box was black (a shaded square) versus how often it held a character (part of a solution). I’ve used the word “character” because, though the entry is almost always an individual letter, it can sometimes be something tricky like a number or multiple letters.
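For anyone who wants to try the same kind of tally, here is a minimal Python sketch of the per-box calculation. It is not the code behind the graphic: the `(weekday, rows)` input shape, the use of `#` for a black square, and the toy grids are all assumptions made for illustration, and they do not reflect how the source database stores its puzzles.

```python
from collections import defaultdict


def fill_frequency(grids):
    """Return, per weekday, the fraction of puzzles in which each box holds a character.

    `grids` is an iterable of (weekday, rows) pairs. Each row is a string in which
    '#' marks a black square (a hypothetical encoding chosen for this sketch).
    All grids for a given weekday are assumed to share the same dimensions.
    """
    counts = {}                 # weekday -> 2-D list of "box held a character" counts
    totals = defaultdict(int)   # weekday -> number of puzzles seen
    for weekday, rows in grids:
        if weekday not in counts:
            counts[weekday] = [[0] * len(row) for row in rows]
        for r, row in enumerate(rows):
            for c, cell in enumerate(row):
                if cell != "#":
                    counts[weekday][r][c] += 1
        totals[weekday] += 1
    return {
        day: [[n / totals[day] for n in row] for row in grid]
        for day, grid in counts.items()
    }


def average_fill(freq_grid):
    """Average share of boxes that require a character (the per-day headline number)."""
    cells = [value for row in freq_grid for value in row]
    return sum(cells) / len(cells)


# Two made-up 3x3 "Monday" grids, purely for illustration.
toy = [
    ("Mon", ["A#C", "DEF", "G#I"]),
    ("Mon", ["ABC", "D#F", "GHI"]),
]
freq = fill_frequency(toy)
monday_average = average_fill(freq["Mon"])
```

In the toy data the top-middle box is black in one of the two grids, so its value is 0.5, while the corner boxes are never black and come out at 1.0.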
The percentage shown after the day of the week represents the average percentage of boxes for that day that require characters. So, as the week goes by, the puzzles have fewer and fewer black boxes and more white boxes needing answers.
The shading scale ranges from 40% to 100%; the former indicates that a box contains a character (i.e., is not shaded black) in 40% of puzzles on that day of the week, while the latter specifies that a box has always required a character. The actual lowest value for any box is 39.8%, which appears as black squares in two locations on Monday. All values for individual boxes are written in blue.
Note the striking similarity between Monday through Thursday puzzles, with Monday having the most contrast and Thursday, the least. Friday and Saturday appear similar and have comparable percentages, while the larger Sunday grid is in a league of its own. I’ve also provided the average of the Monday through Saturday puzzles. This allows you to see which boxes have never been black on a non-Sunday puzzle.
Data source: http://www.xwordinfo.com/ (If you’ve never seen this site, check it out – they have some amazing statistics on the NYTimes Crossword!)
The distribution of households living at various levels of poverty has been extensively studied. Similarly, the distribution of households receiving food stamps (enrolled in SNAP) is also well known. Not surprisingly, these populations have considerable overlap; eligibility for SNAP is based largely on income, and therefore, correlates strongly with poverty level. But there are scenarios where a household may live below the poverty line and opt not to apply (or not qualify) for food stamps. Conversely, there are situations where household income raises a family above the 100% poverty level, but due to non-financial factors (e.g., disability, employment status), the household can still receive food stamps.
In the top two maps, I show the raw county data: percentage of households living below the poverty line and percentage of households receiving food stamps. Using subsets of data provided by the Census, I was able to calculate the overlap between the groups – those households that are both below the poverty line and enrolled in SNAP – and mapped the values as percentages of each respective group. The result (bottom two maps) provides the percentage of households below the poverty line that receive food stamps (left), and the percentage of households enrolled in SNAP that are below the poverty line (right).
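The arithmetic behind the bottom two maps is simple enough to sketch. The function below is a hedged illustration, not the original workflow: the function name, argument names, and the county counts are invented, and the actual Census tables may be organized differently.

```python
def overlap_shares(below_poverty, on_snap, both):
    """Express the overlap as a percentage of each group.

    below_poverty: households below the poverty line
    on_snap:       households receiving food stamps
    both:          households in both groups
    Returns (% of below-poverty households on SNAP,
             % of SNAP households below the poverty line).
    """
    if both > min(below_poverty, on_snap):
        raise ValueError("the overlap cannot exceed either group")
    return 100 * both / below_poverty, 100 * both / on_snap


# Invented county: 2,000 households below the poverty line,
# 1,500 receiving food stamps, 900 in both groups.
pct_of_poor_on_snap, pct_of_snap_poor = overlap_shares(2000, 1500, 900)
# pct_of_poor_on_snap == 45.0, pct_of_snap_poor == 60.0
```

Both percentages share the same numerator, so the group with fewer households always ends up with the larger percentage.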
The top maps show that it is more common for a household to be living below the poverty line than to receive food stamps; this is not surprising, as it is much easier for a household that is below the poverty line not to apply for SNAP than for a family above the poverty line to qualify for SNAP. So, in general, a household is more likely to be below the poverty line than to get food stamps, making the top left map darker than the top right map. Despite these differences, the variables are still well correlated (r^2 = 0.67).
The two calculated percentages, however, have only a weak correlation (r^2 = 0.17). This is where the data show some interesting results. If the trend described in the previous paragraph held true, the bottom right map should typically be darker than the bottom left map; because there are fewer households receiving food stamps than living below the poverty line, there should be a higher percentage of households enrolled in SNAP who are below the poverty line than the percentage of households below the poverty line who are enrolled in SNAP.
As a consequence, any region where the bottom left map is darker than the bottom right map contradicts the norm. States such as Oregon, Michigan, and Maine are regions where a household is more likely to be enrolled in SNAP than to be living below the poverty line. This suggests that these states have unusually high percentages of households above the poverty line who are receiving food stamps.
When I started this blog, one of my first posts addressed chess square utilization by Bobby Fischer. At the time, several people asked me how his move distribution compared to other GMs. So I finally decided to revisit the topic and do some additional exploration. Here are the results for square utilization for 12 masters, playing as white and black. In generating these, I calculated some other interesting stats that I thought were worth a few bar charts. Who knew queenside castling was so unpopular?
Data source: http://www.pgnmentor.com/files.html#players
I just got back from a vacation in Oceania (which is why I haven’t posted in a couple weeks). The flight path home included a leg from Melbourne to Los Angeles. During the 15 hours of sitting, I thought to myself that it must be one of the longest flights in the world. As it turns out, it isn’t even in the top 10; it’s the 12th longest. I’ve mapped the geodesics of the top 20 commercial flights here. With the recent problems at Qantas (and with airlines in general) some of these may change soon, but these are the current paths. Note that, because I’ve used a Robinson projection, some flights that go over the Arctic Circle appear longer than they are, so I’ve added a color scheme and listed the lengths on the map.
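If you want to check a route length yourself, a great-circle distance is a reasonable stand-in for the geodesic. The sketch below uses the haversine formula on a spherical Earth, which is an assumption: a true geodesic on an ellipsoid differs slightly, and this is not necessarily how the lengths on the map were computed. The airport coordinates are approximate values supplied for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometers between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Approximate airport coordinates (latitude, longitude), assumed for this example.
MEL = (-37.67, 144.84)
LAX = (33.94, -118.41)
mel_lax_km = great_circle_km(*MEL, *LAX)
```

With these inputs the result lands a little under 12,800 km, which gives a sense of scale for the 15 hours of sitting.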
In this series of maps, I’ve compared the geographic distribution of Planned Parenthood locations and crisis pregnancy centers. Planned Parenthood is considered a liberal non-profit, as it supports a woman’s choice to have an abortion. Crisis pregnancy centers (also known as pregnancy resource centers) are conservative non-profits that typically provide counseling to discourage a woman from having an abortion. Not all crisis pregnancy centers are licensed as medical centers, and as such, some provide counseling only, with no medical services.
The databases I generated include all centers of both types listed in the data sources. There were approximately three times as many crisis pregnancy centers as Planned Parenthood locations in the USA. The top two maps show the number of locations per one million state population. Note that the scales are different; they are stretched to highlight maximums and minimums for each type of health center.
The lower map compares the resource allocations – as defined by the percentage of locations – for Planned Parenthood and crisis pregnancy centers. To calculate these values, I derived separately the percentage of Planned Parenthood clinics in each state, and the percentage of crisis pregnancy centers in each state. Then, to compare how the two non-profits focus their efforts across the country, I mapped the difference between the percentages for each state. The example, as noted on the map, is that 8.1% of all Planned Parenthood locations are in New York, but only 5.7% of crisis pregnancy centers are in New York. The difference is 2.4 percentage points in favor of Planned Parenthood. The scale for this map shows the absolute value of the differences, with blue indicating a state that has a greater proportion of the Planned Parenthood centers, relative to its share of the crisis pregnancy centers, and red denoting states with the opposite balance in resource allocation.
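The share-difference calculation can be sketched in a few lines of Python. The location counts below are invented so that the New York shares match the 8.1% and 5.7% quoted above; they are not the real totals, and the function name and dictionary layout are assumptions for illustration.

```python
def share_difference(pp_by_state, cpc_by_state):
    """Difference, in percentage points, between each state's share of
    Planned Parenthood locations and its share of crisis pregnancy centers.
    Positive values mean the state holds a larger share of Planned Parenthood
    locations; negative values mean the opposite."""
    pp_total = sum(pp_by_state.values())
    cpc_total = sum(cpc_by_state.values())
    states = set(pp_by_state) | set(cpc_by_state)
    return {
        state: 100 * pp_by_state.get(state, 0) / pp_total
        - 100 * cpc_by_state.get(state, 0) / cpc_total
        for state in states
    }


# Invented counts: 81 of 1,000 is 8.1%, and 171 of 3,000 is 5.7%.
pp = {"NY": 81, "Everywhere else": 919}
cpc = {"NY": 171, "Everywhere else": 2829}
diff = share_difference(pp, cpc)
# diff["NY"] is about +2.4; the map shades by the absolute value
# and uses the sign to pick blue or red.
```

Because the shares in each group sum to 100%, the differences across all states always sum to zero, so every blue state is balanced by red elsewhere.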
Picking the perfect gaming alias is no easy task. I chose mine when I was 14, and while it’s too embarrassing to reveal, I’m not sure I’d come up with anything better today. But there are hosts of name generators now available online to assist you in your selection.
I was curious to see what aliases people choose, so I selected four popular PC games, and pulled at least 1500 names of the top-rated players for each game. I then ran a character and length analysis on the names. Players can, in some cases, use symbols, numerals, and non-English letters in their names. These were included in the length analysis, but I separated them from the names for frequency analysis. For those wondering why I didn’t include WoW, there were just too many non-English letters in the names for the results to be meaningful.
The usage analysis shows the frequency with which English letters are used; these values were normalized to how often the letters are used in the English language. This highlights the overuse of letters that are usually rare (e.g., Z, X), and the underuse of common letters (e.g., T, E). I have noted the percentage of characters that are not in the English alphabet (e.g., conventions, numerals, non-English letters).
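Here is a rough Python sketch of that kind of normalization. It assumes the normalization is a simple ratio (share of a letter in the names divided by its share in English text), which may not match the exact method used for the graphic. The baseline values are approximate, commonly cited English letter frequencies for just four letters, and the player names are made up.

```python
from collections import Counter


def letter_usage_ratio(names, baseline):
    """Compare letter usage in `names` against a baseline frequency table.

    Returns (ratios, non_letter_pct). A ratio above 1 means the letter is
    overused relative to the baseline; below 1 means underused.
    non_letter_pct is the percentage of non-space characters that are not
    one of the 26 English letters.
    """
    chars = [ch for name in names for ch in name if not ch.isspace()]
    letters = [ch.upper() for ch in chars if ch.isascii() and ch.isalpha()]
    if not letters:
        raise ValueError("no English letters found in the names")
    counts = Counter(letters)
    ratios = {
        letter: (counts[letter] / len(letters)) / share
        for letter, share in baseline.items()
    }
    non_letter_pct = 100 * (len(chars) - len(letters)) / len(chars)
    return ratios, non_letter_pct


# Approximate English frequencies (assumed values, as fractions of all letters).
BASELINE = {"E": 0.127, "T": 0.091, "X": 0.0015, "Z": 0.0007}

# Invented aliases for illustration.
names = ["Zex", "xX_Zed_Xx", "Tez99"]
ratios, non_letter_pct = letter_usage_ratio(names, BASELINE)
```

Even in this tiny sample, Z and X come out heavily overused and T underused, which is the same pattern the real data show.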
For the length analysis, I counted all characters (not including spaces). As shown in the graph, EVE Online names tend to be longer and SC2 names tend to be shorter; RIFT and SWTOR have similar distributions.
On a side note, if anyone wants to feel really good about their SC2 skills, come find me. I’ll be the really mediocre Protoss player building only void rays.
We’re told not to judge a book by its cover, but what about its length, or how long its sentences are? Turns out, that’s also probably not a good idea. I pulled word data for 20 well-known books. No comment on whether these are all good books - they’re just widely recognized!
The data, which are from Amazon, include the number of words and sentences. Amazon provides average sentence length, but they round it, so I recalculated the values. They also provide a value for “complex words,” which they define as the percentage of words in the book that have at least three syllables. I have included those values here, rounded to the nearest percent.
Just a fun Google Autocomplete graphic. I just searched for “____ is the new,” where ____ was one of the colors, and then recorded the first color listed on the Google Autocomplete options. And yes, I realize white and black aren’t technically colors :)
I thought it would be interesting to compare the countries in which the U.S. has embassies/consulates with the countries that have embassies/consulates in the U.S. Read the notes on the maps for details about counting, as deciding what to include or exclude required some thought.
For the most part, the maps are quite balanced, but there are a few differences. The U.S. places an equal emphasis on having embassies/consulates in Asia and Europe. After Mexico (13), the U.S. has an equal number in Canada, China, and France (8), and six each in Germany, Japan, and India. In contrast, there are 17 countries that have embassies/consulates in more than 25 U.S. cities. Thirteen of them are in Europe, with France leading the way; it has even more than Mexico, which comes in second. The other non-European countries in this group are Japan, Canada, and Guatemala.
In honor of Facebook’s 10th birthday, I’ve made a graphic on multi-platform social media use. The data, from a Pew Research Center study published at the end of 2013, show that Facebook still has a dominant presence in the social media world. Of the five social media platforms considered, at least 83% of people who use the other platforms also use Facebook. Twitter and Instagram users are the most likely to have accounts with more than one social media platform.
The CDC publishes HIV surveillance reports that include incidence (new infections) and prevalence (people living with diagnosed cases) data. These graphs show prevalence data based on estimates of the 2010 population diagnosed with HIV. The data include all stages of the disease; stage 3 is commonly known as AIDS. According to the CDC, as many as one in six people with HIV are unaware that they are infected. As such, the prevalence data, which include only diagnosed cases, slightly underestimate the actual rates of infection.
Afraid of heights? Here are some buildings to avoid! The map shows the height of the tallest building in each country, including buildings under construction that have topped out. The graph plots number of floors vs height. Because spires and other architectural features are included in the height measurements of the buildings, the height per floor statistics are upper limits for the typical height of each floor.
The tallest building by far is the Burj Khalifa in the United Arab Emirates (2717 ft or just over half a mile). Saudi Arabia, which has the second tallest building, is currently constructing the Kingdom Tower, which is projected to be at least 3281 ft tall. It plans to take over first place when construction is completed in 2019. Meanwhile, China has three buildings in construction that will exceed 2000 ft in height.
Data source: http://skyscrapercenter.com/create.php
It has long been known that the Boy Scouts of America is generally a conservative organization with significant religious involvement. While it aims to teach youth what it deems as good values, it also has a history of discrimination against homosexuals and atheists.
Politics aside, I thought it would be interesting to show, graphically, just how religious an organization it is. These graphs highlight the dominant presence of faith-based chartered organizations. The data represent the number of youth members, but had I used the numbers for scouting units, the religious bias would have been even more pronounced; more than 70% of all units are chartered to faith-based organizations.
There are more than 1.7M Boy Scout youths who are members of faith-based units. By comparison, 743k are in civic units, and only 313k have joined educational units. Almost one in every six Boy Scout youths is in a unit associated with the Mormon Church (The Church of Jesus Christ of Latter-day Saints)!
Most celebrities who choose a stage name make it something short and snappy. A few go for something longer than what they were given at birth.
This graph shows that effect quantitatively. The median change in length among these 47 celebrities is a shortening by four letters.
Julianne Moore has the largest increase, from Julie Smith (her middle name at birth was Anne). The most significant decreases belong to those who chose single stage names: Florian Armstrong to Dido and Ella Yelich-O’Connor to Lorde.
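The length change is easy to reproduce for the three examples named in this post. The sketch below assumes that only letters are counted, so spaces, hyphens, and apostrophes are ignored; that counting rule is a guess rather than a documented detail of the graph. The median here covers just these three pairs, not the full set of 47 celebrities.

```python
from statistics import median


def letter_count(name):
    """Count letters only, skipping spaces, hyphens, and apostrophes."""
    return sum(1 for ch in name if ch.isalpha())


def length_changes(pairs):
    """Stage-name length minus birth-name length for each (birth, stage) pair."""
    return [letter_count(stage) - letter_count(birth) for birth, stage in pairs]


pairs = [
    ("Julie Smith", "Julianne Moore"),
    ("Florian Armstrong", "Dido"),
    ("Ella Yelich-O'Connor", "Lorde"),
]
changes = length_changes(pairs)   # [3, -12, -12]
median_change = median(changes)   # -12 for these three pairs only
```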
The National Science Foundation keeps track of the doctorates granted by US universities each year. In this post, I’ve mapped recipients of doctoral degrees in 2012 (the most recent data available) by state and country. The top map shows the number of doctorates granted in each state, as well as the 20 universities that graduated the most doctoral students. The bottom map illustrates the citizenship of the students who received their doctorates in the USA. All countries that had at least 50 students are shown.
Both maps are influenced by population density, but there are some interesting outliers. The list of the 20 universities producing the most doctoral students may also surprise you…