Thursday, September 18, 2014

Recap: Iron Viz - Reviewing the Reviewers

I recently had the opportunity to compete in the 2014 Tableau Conference Iron Viz. For those who are unfamiliar, the Iron Viz is a competition which pits three data visualization enthusiasts against each other to create the best possible visualization in 20 minutes - live onstage in front of over a thousand people. It was the fastest 20 minutes of my life. After the dust had settled, my dashboard, which analyzed the Yelp reviewers (as opposed to businesses), won the championship! Without further ado, here was my final product:


Below is the story of my road to the Iron Viz championship…

I earned my spot in the finals by winning the feeder competition Elite 8 Sports Viz Contest with my dashboard It’s 4th Down. My two other competitors, Jeffrey Shaffer and Jonathan Trajkovic, earned their spots by winning the quantified self and storytelling competitions respectively. I got to know them over the course of the week in Seattle and they are both really great guys that create fantastic content for the Tableau community.

The inspiration for my viz was a seafood restaurant I found on Yelp years ago when we were vacationing in a new city and looking for dinner. I remember it only had a few reviews and, at 3 overall stars, wasn't looking that promising, but I clicked it anyway. There were two 4-star reviews and a single 1-star review. I don't remember the exact wording, but the 1-star review basically said 'I hate seafood and I hate this restaurant'. I remember thinking ‘Why should I trust someone’s review of a seafood restaurant when they don’t like seafood?’ When I was looking at the Yelp data for Iron Viz, this all came back to me and my idea was born: 'Who should you trust on Yelp?'

Tactically, my Iron Viz strategy involved taking advantage of the strengths of Tableau:
  • Easy exploration to see what data was available and how it was structured
  • ‘Speed-of-thought’ analysis to see what insights I could pull out of the data
  • Rapid creation and iteration to arrive at a final design
  • Rapid creation and iteration to start from scratch after I decided I didn’t like my ‘final’ design at the last minute



Ease of Exploration

Although we knew months in advance that we would be competing, to make it more challenging we were only given the data several days ahead of time - which left only a few days to explore and practice. Pulling the data apart quickly revealed just over 1 million Yelp reviews going back to 2004 for three cities (Las Vegas, Phoenix, and Edinburgh, UK). The structure had some implications: the data is one large denormalized table, so business-level fields repeat on every review row for that business. For example, summing a business-level field (like its overall score or review count) across those rows is meaningless. Understanding the data structure was critical to using the right aggregations for our metrics.
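
As a rough illustration of the pitfall (not the actual data - the column names and values here are made up), this is how the double counting shows up in a few lines of pandas:

import pandas as pd

# Toy slice of a denormalized reviews table: the business-level review
# count repeats on every one of that business's review rows.
reviews = pd.DataFrame({
    "business_id":           ["b1", "b1", "b1", "b2"],
    "business_review_count": [3, 3, 3, 1],
    "review_score":          [4, 4, 1, 5],
})

# Wrong: summing a business-level field across review rows triple-counts b1.
print(reviews["business_review_count"].sum())        # 10

# Better: collapse to one row per business before aggregating.
per_business = reviews.drop_duplicates("business_id")
print(per_business["business_review_count"].sum())   # 4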



Speed of Thought

I knew I wanted to evaluate the reviewers themselves, but how best to do that? I quickly created a few different calculated fields before settling on my key metric, Reviewer Error. Reviewer Error measures how far a particular reviewer's ratings vary from the overall Yelp consensus. For an individual review this isn't meaningful (people can have a bad experience at a good business), but in aggregate it gives you an idea of how close or far someone is from the overall consensus. Technically, this is the Root Mean Square Deviation. It was easy to create this metric in Tableau:

Reviewer Error = SQRT(SUM(([Review Score]-[Business Score])^2)/SUM([Number of Records]))
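
For anyone who wants to play with the idea outside of Tableau, here is a minimal sketch of the same calculation in pandas (the column names and sample values are made up, not the Yelp extract's actual field names):

import pandas as pd

# Toy flat table: one row per review, with the reviewer's rating and the
# business's overall Yelp score repeated on each row.
reviews = pd.DataFrame({
    "reviewer_id":    ["norm", "norm", "diana", "diana"],
    "review_score":   [4, 5, 1, 5],
    "business_score": [4.0, 4.5, 3.5, 4.0],
})

# Reviewer Error = root mean square deviation from the consensus,
# mirroring the Tableau calculation above.
squared_dev = (reviews["review_score"] - reviews["business_score"]) ** 2
reviewer_error = squared_dev.groupby(reviews["reviewer_id"]).mean() ** 0.5
print(reviewer_error)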

My first exploration was segmenting reviewers by a few key dimensions including that user’s overall review average, how many votes that user had accrued, how many fans they had, and how many years they were Yelp elite. There were some very clear trends in the data:


Key Takeaways:
  • Reviewers whose average rating was less than 3 (or, worse, less than 2) had a very large error. This is likely because some people go on Yelp, write a bad review, and never write any other reviews.
  • Both the number of votes a person had received and the number of fans they had were correlated with a reduction in error.
  • Reviewers with at least one year of Yelp elite status had a lower error, but additional elite years didn't lead to any significant further reduction of error.
In layman’s terms, a trusted reviewer typically has an average review greater than three stars, has many votes and fans, and has at least one year as Yelp elite. I practiced building these charts and a couple of detail charts in the two days leading up to the competition. After multiple rounds of practice I got it down to under 20 minutes. Of course, as you can see in my final dashboard, I did not include these charts.
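
For a feel of how that segmentation works outside of Tableau, here is a rough pandas sketch (every number and column name below is invented purely to show the shape of the analysis):

import pandas as pd

# Toy reviewer-level table: one row per reviewer with the attributes I
# segmented on and a precomputed Reviewer Error (all values made up).
reviewers = pd.DataFrame({
    "avg_rating":     [1.8, 2.6, 3.4, 4.1, 3.9, 4.6],
    "votes":          [2, 15, 120, 800, 40, 300],
    "elite_years":    [0, 0, 1, 4, 2, 1],
    "reviewer_error": [2.1, 1.4, 0.9, 0.7, 0.8, 0.9],
})

# Bucket reviewers by their own average rating and compare the mean error.
rating_bucket = pd.cut(reviewers["avg_rating"], bins=[0, 2, 3, 4, 5])
print(reviewers.groupby(rating_bucket, observed=True)["reviewer_error"].mean())

# Same idea for elite status: at least one elite year vs. none.
print(reviewers.groupby(reviewers["elite_years"] >= 1)["reviewer_error"].mean())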



Rapid Design and Iteration

About 18 hours before the competition, after looking at my dashboard for the 100th time, I decided the analysis was too static and didn’t tell the story of any individual reviewers. I decided to go back to the drawing board. I ended up rebuilding almost the entire dashboard the night before.

I kept my Reviewer Error metric and just started poking around, slicing the data different ways, until I settled on looking at individual reviewers in a scatterplot. Plotting Reviewer Error against Number of Reviews makes a great scatterplot: it shows the regression to the mean you would expect, but it also has a skew indicating that the frequent reviewers on Yelp really are more accurate than the overall average. More interestingly, the scatterplot quickly illuminates the outliers - the reviewers who are either very good or very bad at reviewing. I'll start with the good: Norm.
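
Structurally, the chart is nothing more than one dot per reviewer; a loose sketch with fake data (and obviously not the Tableau workbook itself) would look something like this:

import matplotlib.pyplot as plt
import pandas as pd

# Toy reviewer-level table with the metric described above (values made up).
reviewers = pd.DataFrame({
    "review_count":   [3, 8, 25, 60, 150, 400, 1000],
    "reviewer_error": [2.0, 1.6, 1.2, 1.0, 0.9, 0.8, 0.7],
})

plt.scatter(reviewers["review_count"], reviewers["reviewer_error"], alpha=0.5)
plt.xscale("log")  # review counts are heavily skewed, so a log axis helps
plt.xlabel("Number of Reviews")
plt.ylabel("Reviewer Error (stars)")
plt.show()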



Norm has over 1,000 reviews, an error of .72 stars, and minimal bias in either direction. We can see he has been very consistent over time, both in how many reviews he has left and in what his average review is. I feel that Norm has a good sense of what to expect from a business. If I pick one of his recent reviews (linked from the tooltip in the bottom-right chart), a 4-star review of ‘Delhi Indian Cuisine’, I can see he wrote a detailed review with pictures. Clicking on his profile reveals he has thousands of votes and many years at Yelp elite. Given my prior segmentation, this was not a surprise.



Now let's go to the other extreme, Diana. At nearly three times Norm's error, her error is 2.01 stars with a negative .76-star bias over 120 reviews. When we select her, we can see she is an 'all or nothing' type of reviewer. All of her reviews are either 5 stars or 1 star. From her average rating and bias we can tell she is, on aggregate, a harsh critic and really hands out 1-star reviews like it's going out of style. Selecting her recent review of Zoyo Neighborhood Yogurt, we can see she gives it 1 star because of ‘flies and bugs’. Clicking on her profile, we can see her four most recent reviews are all 1-star reviews and every single one of them complains about flies or insects. It makes you wonder whether you can really trust her reviews at all.


Closing Thoughts

In closing, I wanted to discuss a couple of comments I received from the judges:
  • I used a red/blue gradient instead of an amber/blue gradient
  • I did not include a color legend

Although we were pressed for time, I did consciously think about both of these aspects prior to creating the viz and wanted to share my thoughts here. Please disagree with me in the comments!
  • As for the red/blue gradient, I wanted my dashboard to have a ‘Yelp feel,’ and Yelp prominently uses red throughout its site, so I used red. I used blue for the low end so there was clear contrast between high and low reviews, even though Yelp itself uses grey/yellow hues there.
  • I didn't use a legend and instead opted for semantically resonant colors. High and low reviews are red and blue respectively, where red is 1) hot (popular), 2) aligned with Yelp, and 3) reinforced subtly through all three charts on the right of the dashboard. There could be some argument that red is ‘bad,’ as in a status report, but when it comes to reviews and stars specifically, I've often seen them shown in red.

I have a list of about a dozen more things I would do to improve this dashboard, but let me tell you, 20 minutes goes by fast! In the spirit of the Iron Viz, I have made no further updates to this workbook since I put the mouse down in the competition. The 20 minutes onstage belies the effort all three of us put into preparing for the showdown. My competitors put together great vizzes, and I thought Jeffrey's roulette wheel was a creative way to blend the data with the story of Vegas (though losing to a pie chart would have been rough). Most importantly, the Iron Viz was a tremendous amount of fun, I learned even more about Tableau, and I got to meet many fantastic people along the way.

John

9 comments:

  1. I thought your bar chart was obviously also serving as the color legend, so I didn't understand that criticism at all.

  2. I'd say you didn't need the legend - pixels are precious. Awesome work, and under that much pressure too.

  3. Great dashboard and congratulations on your win! I was thrown for a moment by the color, but I agree that the bar chart served as the legend for me. I also liked the examples you highlighted above. They were great examples of how to interpret the data.

  4. It was cool to watch the three of you compete and take different approaches to the data. I thought your design was great and I really like the concept of using reviewer error as a means of determining the quality of the reviews. As others mentioned, there was no need for a color legend. Thanks for sharing the ideas behind your approach. Congrats on the win!

  5. Congratulations on the win John! I'm bummed that I missed the session (the first one I've ever missed), but there were just too many sessions to choose from at this conference. :-) But I can see why you won...your analysis of the Yelp data offered a unique perspective that was conveyed with clarity in your dashboard.

    Did you consider reversing the Error axis on the scatter-plot? Having the most trusted users at the bottom vs. the top seems counter-intuitive, at least to me. I realize there might be those who complain about having the axis start at 3 and go "up" to 0, so perhaps this is one of those design coin tosses. :-)

    Replies
    1. Hey Michael,

      I actually reversed the axis when I was practicing, but neglected to do it in the finals! I totally agree that reversing the axis would make sense in the dashboard. The only thing I would add is that, if I had any bar charts showing error, I'd definitely keep the vertical axes in sync; otherwise in some charts 'high' values would be bad and in others they would be good.

  6. Hi John,

    I have another, more geeky question. What would the difference be in your analysis if you used the absolute bias (absolute value of the reviewer bias) vs. the reviewer error? When I plotted the two on a scatter-plot, the difference between them was quite low (e.g. the Avg Error is 1.096 and the Avg Absolute Bias is .9). If the goal of the analysis is to show those users who deviate the most from the consensus, then both measures achieve that, but Absolute Bias is much easier to understand, at least to the statistically naive like me. :-)

    I'd be interested to know whether the Reviewer Error formula provides a higher degree of accuracy in determining a reviewer's deviation from consensus, or whether both measures effectively achieve the same thing.

    -Mike

    Replies
    1. Great question. You're right, the absolute value of the bias is similar to the reviewer error, although slightly different. The practical effect of using RMSD instead is that it 'punishes' larger outliers more.

      For example, consider a reviewer who left four reviews of 2, 3, 4, and 5 stars, each for a business with an overall 5-star rating. The average of the absolute value of the bias would be 1.5 stars and the reviewer error would be about 1.9. Now consider a reviewer who instead left four reviews of 2, 2, 5, and 5 stars for the same businesses. The absolute value of their bias is still 1.5, but now the reviewer error is about 2.1. The second reviewer has a higher error because they have two reviews that are a full three stars off, despite the average absolute bias being the same.

      While I don't think this would materially change the analysis, I would rather punish those who are far off slightly more so I went with reviewer error.
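
      If it helps, here is the same arithmetic as a quick Python sketch (purely illustrative - nothing from the actual workbook):

      import math

      def mean_abs_bias(scores, business=5):
          return sum(abs(s - business) for s in scores) / len(scores)

      def reviewer_error(scores, business=5):
          return math.sqrt(sum((s - business) ** 2 for s in scores) / len(scores))

      print(mean_abs_bias([2, 3, 4, 5]), reviewer_error([2, 3, 4, 5]))  # 1.5, ~1.87
      print(mean_abs_bias([2, 2, 5, 5]), reviewer_error([2, 2, 5, 5]))  # 1.5, ~2.12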

    2. Thanks for the explanation, John. Yes, that does make sense and further highlights why I need to up my statistical literacy. :-) I appreciate the education.

