The Case Of The Flawed Metacritic Study

A study looking into the hidden formula that drives Metacritic made headlines this week, but Kotaku has discovered some critical errors that call it into question. Earlier this week, Full Sail University professor Adams Greenwood-Ericksen held a GDC session in San Francisco in which he shared some of his research on the effects of Metacritic, the aggregation site that takes media reviews from hundreds of outlets and outputs them as a single number, or Metascore.

Metacritic has taken some heat over the past few years for refusing to reveal the formula they use to produce their scores. It's not a simple average: Metacritic admits they give more weight to some outlets when crunching the numbers. But they've never said how that weighting system works.

So when Greenwood-Ericksen said he had a model that replicated Metacritic's scores, people took notice. Gamasutra ran an article titled "Metacritic's weighting system revealed," and it got a whole lot of video game developers and reporters talking. The system categorised outlets into six different "tiers" and gave heavy weight to sites like IGN and Wired (and significantly less weight to other big sites like Giant Bomb).

Shortly afterwards, Metacritic came out firing. They took to Facebook to shoot down the formula, calling it "wildly, wholly inaccurate," and they accused Gamasutra of running a misleading headline. (When reached by Kotaku for comment, Gamasutra editor Kris Graft apologised: "Yeah, I feel that the main issue was a poor headline, and we apologise for the confusion over this. It's also unfortunate that a session with inaccurate information like this got into the show.")

Some, however, have remained sceptical of Metacritic's accusations, as the aggregator still won't share the formula that they use.

Today, Kotaku discovered a flaw in Greenwood-Ericksen's formula: at least two of the listed weights -- for the outlets The Sixth Axis and Play UK -- are incorrect.

Let's start from the beginning. Greenwood-Ericksen's model -- built from Metacritic data spanning from 2005 or 2006 until 2011 -- assigns a numerical weight, such as 1.5 or 0.5, to each video game outlet. The formula: look at a video game's Metacritic page, take all of the review scores listed, multiply each one by the weight assigned to its outlet, add the products together, and divide by the total number of scores. This model successfully replicated something like 50 Metascores, Greenwood-Ericksen said.
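For anyone who wants to try the arithmetic themselves, here is a minimal sketch of that model in Python. The outlet weights below are invented placeholders for illustration, not values from Greenwood-Ericksen's actual 189-outlet table. (One aside: dividing by the number of reviews, rather than by the sum of the weights, only keeps the output in the usual 0-100 range if the weights average out to roughly 1.)

```python
# A sketch of the weighted-average model described above.
# The weights are hypothetical placeholders, NOT values from
# Greenwood-Ericksen's study.
WEIGHTS = {
    "IGN": 1.5,
    "Giant Bomb": 0.5,
    "Wired": 1.0,
}

def predicted_metascore(reviews):
    """Estimate a Metascore from a list of (outlet, score) pairs.

    Per the model described above: multiply each 0-100 score by its
    outlet's weight, sum the products, and divide by the number of
    reviews. Outlets missing from the table get a neutral 1.0.
    """
    total = sum(WEIGHTS.get(outlet, 1.0) * score for outlet, score in reviews)
    return round(total / len(reviews))

# Made-up review scores, purely for illustration:
print(predicted_metascore([("IGN", 80), ("Giant Bomb", 70), ("Wired", 90)]))  # 82
```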

Except, when I plugged in the numbers and tested the formula today, the maths just didn't work for the PS3 game Swords & Soldiers: my result was 7-8 points off the actual Metascore. (The maths did work for some of the other games I experimented with, like Venetica.)
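To see how a couple of bad weights could produce that kind of gap, here's a usage example continuing the hypothetical sketch above. The review scores and the two weight values are made up; they're not the actual Swords & Soldiers data.

```python
# Made-up review data, NOT the real Swords & Soldiers scores.
reviews = [("IGN", 80), ("Giant Bomb", 70),
           ("The Sixth Axis", 75), ("Play UK", 85)]

# With a hypothetical weight of 1.0 for The Sixth Axis:
WEIGHTS["The Sixth Axis"] = 1.0
print(predicted_metascore(reviews))  # 79

# With that single weight miscalibrated to 0.6, as a bad
# development case might cause:
WEIGHTS["The Sixth Axis"] = 0.6
print(predicted_metascore(reviews))  # 71 -- eight points lower
```

In this toy example, one miscalibrated outlet weight shifts the prediction by eight points, the same order of discrepancy described above.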

So I reached out to Greenwood-Ericksen, who I've been chatting with throughout the day.

"Looks like [Swords & Soldiers] was the development case for The Sixth Axis, and also for Play UK," he told me via Gchat. "So those two weights were actually set using that erroneous data."

I asked exactly what that means.

"It means you caught us making a mistake," he said. "It also means that at least one of those two weights of the 189 are probably off... So those particular weights are unreliable. The good news is that it suggests the process still works, one of us just made a mistake somewhere in applying it... It's embarrassing, certainly. On the one hand, I'm glad somebody spotted the issue. On the other hand, I wish we'd done it before we were so far into the public spotlight."

I asked what makes him think there are no other mistakes like this in the study.

"I don't think we'd have made a mistake like that one twice, but it's always possible," he said. "Certainly I'm going to have to check our work over again to make sure."

But the Full Sail professor doesn't believe these flaws invalidate the study: the point, he says, was never to pin down the exact value of each weight, but to show that it's possible to work out a weight for each outlet.

Greenwood-Ericksen and I had a long conversation on the phone this morning, before I started digging into this formula. He wanted to make it clear that these weights are just one part of a larger study -- a study that makes a number of other conclusions about Metacritic, like its strong connections to sales data -- and he told me that the goal was never to show off an accurate model of how Metacritic weights scores.

"One of the things that virtually everybody missed was that this was a model," he said. "We didn't go down under the basement with a flashlight and find out what the results were. A lot of words like 'revealed' and 'discovered' were all kinds of inaccurate."

The professor said he was pleased by Metacritic's Facebook response, even though the aggregator called his work inaccurate, because it offered new information: Metacritic said they use fewer than six tiers, for example, and that publication weights are much closer together than they were in Greenwood-Ericksen's model.

It seems like Greenwood-Ericksen is on the right track, even if the numbers weren't quite right in this case. As he continues crunching numbers and trying to figure out exactly how each Metascore works, the truth behind this formula could eventually come out.

Greenwood-Ericksen said he wishes Metacritic would be more transparent about the formula that they use. It'd certainly pre-empt situations like this.

"I think the community -- and Metacritic as well -- would be better served by transparency on this," he said. "Part of what makes them so unpopular and what creates so much resentment is that people have the sense that there's this sort of arbitrary magical process that produces this score. I don't think that's the case. I think Metacritic is actually trying very hard to get a reasonable score to represent the quality of the product.

"I think that's what comes across because they're opaque about this particular issue."

Picture: Gualtiero Boffi/Shutterstock


Comments

    Of course, this is only because we put too much stock in what Metacritic thinks...

    Academic FAIL.

      How? Everyone makes mistakes - as he said, this was a model - do you have any idea how many times we've created, refined and ultimately discarded economic and scientific models and principles? It just means that more refinement is needed.

        Because - as stated - his set of data was erroneous. Academic FAIL. Yes I do have an idea of... blah... blah... blah... blah. That is irrelevant. It is still a fail for PUBLISHING A THEOREM BASED ON INCORRECT DATA. A hypothesis can be refined etc... but it is a FAIL if the data is wrong.

          I've passed subjects with certain incorrect pieces of data before. Also, he isn't the biggest dickhead in the universe. That's at least one small thing he can be proud of that you can't.

          I hear the biggest dickheads are the ones who lose their shit when someone they've never heard of makes an honest mistake and then owns up to it.

            I have passed by using the correct data. I never lost my shit - I just made a comment. I hear the biggest dickheads are those that use incorrect data and then abuse others for stating a fact. If you're proud of your mistakes and scraping through... good luck to you and your sad methodology and praxis. By the way, I am proud of both my degrees and my post-grad work. I would not be proud of PUBLISHING anything that is proven wrong in a day. So go f yourself and your sad pass. That's not how you obtain a scholarship.

              I'm proud of my 4 PhDs and my Medal of Honor, Croix de Guerre, Nobel Peace Prize and a bunch of lesser commendations.

              Using that as a basis for my observation, I do believe you HAVE lost your shit. Being mad is the same thing.

            The kicker being that the ENTIRE point of the study is to weigh the data. LOL.

              EXACTLY!! So by definition the data CAN'T be erroneous, since he's working backwards from it. It's the formula that wasn't perfect, and that was because some of the assumptions he made along the way weren't the same as Metacritic's (the number of tiers they use, for instance).

                Which is my point. How is the data erroneous? The formulae are wrong, or the data has not been correctly entered/used. Further, the formula should and could only have been checked using the said data. Iteration upon re-iteration until a reliable and repeatable mathematical model was formed. To have your work deemed 'broken' in one day is a FAIL.

          *sigh* His data for one entry was wrong. AS POINTED OUT, he designed this model from scratch and it apparently works for a fair bit of the data - not all of it, true - but do you know what that means to a mathematician? It means that there was an assumption or variable wrong and he needs to continue to refine his model. WE'VE BEEN DOING THIS FOR SCIENCE AND ECONOMICS FOR CENTURIES. Good lord, the whole discipline of Macroeconomics was created because of flaws in microeconomic theory, which became noticed once microeconomics failed to fix the Great Depression. He had an assumption that there were six tiers - an assumption Metacritic says was wrong. His hypothesis was not wrong - Metacritic uses tiers to weight their data, thus using maths I can find the weighting used. Again, it was the ASSUMPTIONS he made. His data can't be erroneous since he's using Metacritic's OWN END DATA. He's working backwards FROM the data to find the formula.

          Also, you've passed subjects with models given to you which have been refined over time. This is a model created from scratch; obviously it's not going to be perfect right at the beginning, but again, it seems to work for most things, otherwise he wouldn't have released the damn thing.

            Well, the assumption that a six-tier system is used is WRONG. How is that an assumption? If his model uses it and produces the right results consistently, then it becomes part of the hypothesis. Which is WRONG. This means his hypothesis is WRONG and needs to be re-evaluated and changed. THEN TESTED PROPERLY.

            I would also like to point out he is not studying natural laws or quantum physics, with high bandwidths of variables and (to the observer) near-infinite variations of data. It is a simple, finite set of scores.

    Wait... people take metacritic seriously?
