6/15/15

Don't Just Look at the Ratings -- A Closer Look at Pitchfork's Numbers, Part 1: Genre Bias

If you clicked on this on facebook because of this bullshit MGMT picture, you got got.  Please read the article anyway.
I guess just a warning that this post will be unlike anything anyone is used to reading on Sad Moth.  Some of you will hate it.  Well, most of you will probably hate it.  But I enjoy doing it, and this is just the first of a series of these, so get used to it.  All I ask is that you give it a try, and don't immediately turn your brain off the first time you see a graph and go jag it instead.

It's kind of a spoiler to put my big conclusion in the second paragraph of the first post of this series, but here it is.  Don't just look at the numbers on Pitchfork.  You don't know what you're getting yourself into.  Rating music is completely subjective, and putting it on a numerical scale is even more so.  But there are some things that can be expected of these ratings.  If Pitchfork wants their ratings to mean something, they need to be unbiased in all their ratings.  In other words, if Pitchfork wants to rate every one of their albums on the same scale, they need to actually do that in practice.


Pitchfork rates hundreds of albums each year, which generates a lot of data.  I'll be using statistical analysis to test, in a few different ways, whether Pitchfork's ratings are actually impartial.  I was able to access this data through a great website called albumoftheyear.org and the program import.io.  In this post I will be looking specifically at Pitchfork's potential biases towards different genres.  The basic hypothesis is this: if Pitchfork's ratings are unbiased with respect to genre, the aggregate mean rating of a specific genre should not be significantly different from the mean of the rest of Pitchfork's ratings.  If you've taken intro stats, you know what this means: a lot of t-tests.  Pitchfork scores are more or less normally distributed, as the histogram of my data shows below, so such tests are valid.
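For the curious, here's roughly what one of these tests looks like in Python.  This is a sketch, not my actual code; the DataFrame df and its 'score' and 'genre' columns are stand-ins, and I'm showing Welch's version of the t-test, which doesn't assume equal variances (a plain Student's t-test with equal_var=True works the same way).

```python
# Sketch of one genre-vs-rest t-test.  `df` is a stand-in DataFrame with a
# numeric 'score' column and a consolidated 'genre' column.
from scipy import stats

def test_genre(df, genre, alpha=0.05):
    """Welch's two-sample t-test: one genre's scores vs. every other rating."""
    in_genre = df.loc[df["genre"] == genre, "score"]
    rest = df.loc[df["genre"] != genre, "score"]
    t_stat, p_value = stats.ttest_ind(in_genre, rest, equal_var=False)
    return {
        "genre": genre,
        "mean": in_genre.mean(),
        "comparison mean": rest.mean(),
        "difference": in_genre.mean() - rest.mean(),
        "p-value": p_value,
        "reject H0 at 5%": p_value < alpha,
    }

# Run it once per genre:
# results = [test_genre(df, g) for g in df["genre"].unique()]
```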

I'm looking at genre first, as it was one of the hardest data sets to generate, entirely because it involved going through each and every album and classifying its genre.  I decided to just use data from 2014, as doing this for anything over a year would have taken the entire goddamn summer. Albumoftheyear.org provided very useful genre classifications for about two-thirds of the albums Pitchfork rated in 2014, and I pretty much followed these to a tee.  I dropped the other third of the albums, which had no genre listed, as it would have been too time-intensive to look up each one individually.  This shrank my dataset from 926 albums to 610.  Since the albums with missing genres were more or less evenly distributed across the dataset, I don't think dropping them introduced any significant bias into the numbers.
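If you want to replicate the data prep, it boils down to something like this.  Again a sketch only; the file name and column names are placeholders for whatever your scrape spits out.

```python
import pandas as pd

# Placeholder file/column names: one row per 2014 Pitchfork review, with the
# raw AOTY.org genre tag where one exists.
df = pd.read_csv("pitchfork_2014.csv")
print(len(df))  # 926 reviews in the full scrape

# Drop the roughly one-third of albums AOTY.org never tagged with a genre.
df = df.dropna(subset=["aoty_genre"]).reset_index(drop=True)
print(len(df))  # 610 albums left
```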

I separated the data into 7 general genres: Rock, Pop, Hip-Hop, Electronic, Country, R&B/Soul/Jazz, and Other.  This is the largest potential flaw in my analysis, as it was pretty much just my own judgement as to what went where.  When two or more genres were a good fit, which happened mostly with electronic, pop and rock, I gave pop precedence over both electronic and rock.  So if AOTY.org listed an album as "electropop", it went down as Pop, and "indie pop" also went down as Pop.  Again, if you want to take issue with my classifications, I acknowledge this is probably the sketchiest part of my analysis. If you want to pick this apart, which I know no one will, you can find a link to my data at the bottom of the post.  Despite these problems, I think these judgement calls came up infrequently and had a negligible effect on the data.  In other words, if you think that Future Islands just has to be Rock, not Pop, suck my dick.
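In code, the consolidation is basically a precedence-ordered lookup, something like the sketch below.  The tag strings are illustrative, not the full list I actually used.

```python
def consolidate(tag):
    """Collapse a raw AOTY.org tag into one of the seven buckets.
    Pop is checked first so it wins over Electronic and Rock."""
    tag = tag.lower()
    if "pop" in tag:  # electropop, indie pop, art pop...
        return "Pop"
    if "hip hop" in tag or "hip-hop" in tag or "rap" in tag:
        return "Hip-Hop"
    if any(k in tag for k in ("electronic", "house", "techno", "ambient")):
        return "Electronic"
    if "country" in tag:
        return "Country"
    if any(k in tag for k in ("r&b", "soul", "jazz")):
        return "R&B/Soul/Jazz"
    if any(k in tag for k in ("rock", "punk", "metal", "folk", "shoegaze", "psych", "indie")):
        return "Rock"
    return "Other"

df["genre"] = df["aoty_genre"].map(consolidate)
```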

First, let's look at the frequency of Pitchfork's reviews in 2014, broken up by genre.
As is to be expected, there's not much Country, R&B or "Other" (which is mostly made up of what I would describe as contemporary classical).  What surprised me the most, and I sensed this while going through the data, is how few Hip-Hop albums Pitchfork rates.  This isn't really a criticism; after all, Pitchfork was mostly an indie rock site for a large part of its existence.  But I still find it odd that the most popular genre today is only 4th on this list.  Or maybe not as many significant Hip-Hop albums come out each year, relative to other genres.

Now it's time to look at what kind of scores Pitchfork gave to each of these genres:
This is a boxplot, which shows the upper and lower quartiles (the upper and lower ends of the "box") with the median as the black line in the middle of each.  The lines on the top and bottom show the extreme values of each distribution.  Any outliers are represented by dots (e.g., the dot high above the R&B/Soul/Jazz box is D'Angelo's "Black Messiah", an outlier because it was far and away the highest-rated album in that category).  This plot gives a graphical representation of the distributions that I will now be testing against each other.
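If you want to draw a plot like this yourself, matplotlib's boxplot does all the work.  Another sketch, using the same stand-in df from before.

```python
import matplotlib.pyplot as plt

genres = sorted(df["genre"].unique())
scores_by_genre = [df.loc[df["genre"] == g, "score"] for g in genres]

fig, ax = plt.subplots()
# Boxes show the quartiles, the middle line is the median, whiskers are the
# extremes, and anything past the whiskers gets drawn as an outlier dot.
ax.boxplot(scores_by_genre, labels=genres)
ax.set_ylabel("Pitchfork rating")
plt.show()
```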

H0: the sample mean of the genre equals the sample mean of the rest of the ratings.  Significance level: 5%.  (Difference = genre mean minus comparison mean; the 95% CI is for that difference.)

Genre          Mean      Comp. mean  Difference  95% CI                    p-value   Result
Rock           7.243939  7.163        0.080939   (-0.057, 0.2197)          0.2525    fail to reject
Electronic     7.309649  7.1724       0.137249   (-0.0315, 0.306)          0.1103    fail to reject
Pop            6.93271   7.254473    -0.321763   (-0.5062302, -0.1372955)  0.000732  reject
Hip-Hop        7.153425  7.204       -0.050575   (-0.2596, 0.1582)         0.6313    fail to reject
Country        7.3625    7.195847     0.166653   (-0.283, 0.6173)          0.415     fail to reject
R&B/Soul/Jazz  7.3       7.19         0.11       (-0.20575, 0.423996)      0.4885    fail to reject
Other          7.55      7.19571      0.35429    (-0.84811, 1.513392)      0.4055    fail to reject

Yeow! Exciting, isn't it?  The most important column to look at is the p-value: roughly, the probability of seeing a gap this big between the genre's mean and the mean of everything else if the two were actually rated on the same scale.  Most are, in fact, pretty high.  If a p-value is higher than 5%, I cannot say that the means are different, or conclude that Pitchfork is biased in its ratings of that genre.  Two genres, specifically Country and Other, have very high variances due to their small sample sizes, so those two tests are essentially meaningless.  However, one test stands out.  The p-value for Pop is 0.000732.  If Pitchfork really held Pop to the same standard as everything else, there would be only about a 0.07% chance of seeing a difference this large by chance.  In other words, the means are actually different.  A quick look at the other numbers shows that Pitchfork consistently rates Pop albums lower than other albums.  Going off the sample means, Pitchfork's standard for Pop albums is about .32 of a point lower than for other albums, and the 95% confidence interval puts that gap somewhere between .13 and .5 of a point.
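If you're wondering where that .13-to-.5 range comes from, it's the 95% confidence interval on the difference in means from Welch's t-test.  Here's a sketch of the calculation; pop_scores and rest_scores are stand-ins for the two groups of ratings.

```python
import numpy as np
from scipy import stats

def welch_ci(a, b, confidence=0.95):
    """Confidence interval for mean(a) - mean(b) under Welch's t-test."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    var_a, var_b = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    diff = a.mean() - b.mean()
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1))
    t_crit = stats.t.ppf((1 + confidence) / 2, dof)
    return diff - t_crit * se, diff + t_crit * se

# welch_ci(pop_scores, rest_scores) should come out close to (-0.51, -0.14),
# matching the Pop row in the table above.
```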

Why is this?  That I can't answer for certain.  It is possible, given the limited size of my data, that there were simply a lot of bad Pop albums in 2014.  Other possibilities are more nefarious.  It is possible that Pitchfork simply likes Pop albums less, and rates them lower.  It is also possible that selection bias is at play: some internal grudge against Pop music leads Pitchfork to pick bad Pop albums and give them bad ratings.  It is important to keep in mind that "Pop", as my data was constructed, is not a group of mainstream albums, but a group of albums in the pop style.  The albums in the Pop category in my dataset were given genres by AOTY.org like "indie pop", "art pop" and "dream pop".  Though there is the occasional Lana Del Rey or Charli XCX, this list is mostly made up of Foxygen and New Pornographers type bands.  The highest-rated album in it is Ariel Pink's "Pom Pom".  I think this dispels any idea that Pitchfork is rating these albums lower because they're "more mainstream" and thus "worse".

After I completed these tests, I realized that "rock" is really a very big category, too broad in my opinion.  So I broke it up into smaller subgenres, and did the same tests with those.

Rock subgenres.  Significance level: 5%.

Subgenre   Mean      Comp. mean  Difference  95% CI                   p-value     Result
Folk       7.275     7.192632     0.082368   (-0.2084, 0.3731745)     0.5711      fail to reject
Indie      7.160976  7.200703    -0.039727   (-0.308, 0.229)          0.7675      fail to reject
Metal      7.67487   7.1651       0.50977    (0.287332, 0.7413447)    0.00003662  reject
Psych      7.16      7.199322    -0.039322   (-0.3926113, 0.3139674)  0.8191      fail to reject
Shoegaze   6.9111    7.2023      -0.2912     (-1.03432, 0.45188)      0.3938      fail to reject
Punk       7.3653    7.19         0.1753     (-0.14663, 0.4962411)    0.2747      fail to reject

I want to be clear that these tests were not comparing the subgenres of rock to the other subgenres of rock, but to all other ratings in the entire dataset.  Once again, one subgenre is significantly different while the others are more or less in line with the overall mean.  It should be noted that shoegaze was an incredibly small subset, comprising only 9 data points, so its test is pretty much worthless.  However, Metal has a decent number of data points, and the t-test rejects the null with flying colors.  The p-value says that, if Metal albums were really rated on the same scale as everything else, there would be only a 0.003662% chance of seeing a gap this large by chance.  I can say pretty much all the same things about Metal that I said about Pop, except that Metal is biased upwards.  Pitchfork really likes Metal; the p-value here is a full order of magnitude smaller than Pop's.  The sample means estimate that the average Metal album is rated .5 of a point higher than other albums, and the 95% confidence interval puts this bias between 0.287 and 0.741 of a point.  Again, it is possible that there were just a lot of good Metal albums in 2014.  My personal hypothesis is that there is selection bias at work here.  As Metal is more or less a niche genre, it would make sense for Pitchfork to refrain from reviewing bad Metal albums entirely.  No one cares if you trash some obscure Metal album.  The boxplot above supports this idea, as the lower extreme and lower quartile for Metal are much higher than for other genres.  In fact the lower quartile for metal is higher than every other genre's median!  This means that most of Pitchfork's Metal ratings are concentrated in a very high, but not super high, 7.5-8.2 range, and not many ratings exist below that.  Draw your conclusions as you will, but it seems to me that Pitchfork cherrypicks the top Metal albums and ignores the rest.
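That quartile claim is easy to check from the data: just compare each genre's quartiles directly.  One more sketch with the stand-in df; for the Metal comparison you'd group by the rock-subgenre labels instead of (or alongside) the broad genre labels.

```python
# Compare quartiles across genres.  If Metal's Q1 really sits above every
# other genre's median, it shows up right in this table.
quartiles = (
    df.groupby("genre")["score"]
      .quantile([0.25, 0.50, 0.75])
      .unstack()
      .rename(columns={0.25: "Q1", 0.50: "median", 0.75: "Q3"})
)
print(quartiles.sort_values("Q1", ascending=False))
```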

In conclusion, I would have loved to have been funnier or more entertaining in this post, but it is what it is.  I just realized that I wrote the sentence "In fact the lower quartile for metal is higher than every other genre's median!" in genuine excitement.  My hope is that, at the very least, the next time you look at a Pitchfork review, you can look at the review rating and realize that it is possibly influenced by the type of music being reviewed, and not based on some immaculate genre-blind 1-10 scale that Pitchfork writers brought down engraved on stone tablets from Mt. Sinai.  This is (hopefully) the first of many things I'm going to be looking at.  If you weren't impressed by my conclusions, hopefully I'll come up with some better zingers about Pitchfork's bullshit at a later date.

Data used -- taken from albumoftheyear.org and acquired using import.io

2 comments:

  1. I definitely decided to go jag it instead of reading this

  2. In actuality I read the whole thing and was thoroughly intrigued and entertained by it. This must've taken a shit ton of time to do, so props on that.
    Music taste is entirely subjective, as you said, so it's really difficult to categorize, rate, and especially graph it like you did, which I think caused a lot of flaws in your analysis. However, I think you did the best job with what you had and it was a really thought-provoking read.
    Sad Moth is better than Pitchfork anyways ;)
