Thursday, September 27, 2007

Which movies did 305344 fail to rate?

Originally posted to a previous version of this blog 27 April 2007.

I expected that the 117 movies not rated by someone or something that seems to rate every movie would have few raters and an earliest rating date close to the cutoff date for the data. That would be consistent with a rating program of some sort that scores the entire database periodically. This did not prove to be the case. The list of movies customer 305344 failed to rate includes Doctor Zhivago, Citizen Kane and A Charlie Brown Christmas.

Unlike most of the recent questions, this one cannot be looked up in the rater signature or the movie signature because this information has been summarized away. Instead I used a query on the original training data that has all the rating transactions. Later, I looked up the earliest rating date for each movie not rated by the alpha movie geek to test my hypothesis that they would be movies only recently made available for rating.


-- For each movie, count how many of its ratings came from customer 305344
-- (in MySQL, the boolean custid = 305344 evaluates to 1 or 0, so sum() counts matches).
-- Movies where that count is zero were never rated by 305344.
select t.movid
from (select r.movid as movid, sum(r.custid = 305344) as geek
      from netflix.train r
      group by r.movid) t
where t.geek = 0;
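
To check the recency hypothesis, a follow-up query over the same training data can pull the rater count and earliest rating date for each of those movies. This is only a sketch: it assumes the training table also carries a rating date column, which I call ratedate here; the actual column name may differ.

-- Sketch: rater count and earliest rating date for every movie
-- that customer 305344 never rated. Assumes a ratedate column exists.
select t.movid,
       count(*)        as n_raters,
       min(t.ratedate) as earliest
from netflix.train t
where t.movid not in
      (select r.movid from netflix.train r where r.custid = 305344)
group by t.movid
order by n_raters desc;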


The most-rated movies not rated by the alpha movie geek

title                          n_raters    earliest
Mystic River                    143,682    2003-09-20
Collateral                      132,237    2004-05-10
Sideways                        117,270    2004-10-22
The Notebook                    115,990    2004-05-19
Ray                             108,606    2004-10-22
The Aviator                     108,354    2004-11-30
Million Dollar Baby             102,861    2004-11-16
Hotel Rwanda                     92,345    2004-12-09
The Hunt for Red October         83,249    1999-12-17
12 Monkeys                       76,475    1999-12-30
Crash                            65,074    2005-04-14
Citizen Kane                     61,758    2001-03-17
The Saint                        28,448    2000-01-05
Doctor Zhivago                   17,785    2000-01-12
Hackers                          17,452    2000-01-06
The Grapes of Wrath              16,392    2001-03-18
The Pledge                       10,969    2001-01-21
A Charlie Brown Christmas         7,546    2000-08-03
The Tailor of Panama              7,421    2001-03-28
The Best Years of Our Lives       7,031    2000-01-06

How weird is customer 305344, the rating champion?

Originally posted to a previous version of this blog on 27 April 2007.

[Chart: distribution of ratings given by customer 305344]

This most prolific of all customers has rated 17,653 of the 17,770 movies; all but 117 of them. As with the last super rater we examined, his ratings are heavily skewed toward the negative end of the scale. Of course, anyone forced to see every movie in the Netflix catalog would probably hate most of them. . .
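
The distribution behind that chart is easy to reproduce from the training data. A minimal sketch, assuming the training table has a rating column (the real column name may differ):

-- Sketch: how many 1s, 2s, ... 5s customer 305344 handed out.
select rating, count(*) as n
from netflix.train
where custid = 305344
group by rating
order by rating;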

How weird is customer 2439493?

Remember the 5 raters who rated more than 10,000 movies each? It is important to know whether their opinions are similar to those of less prolific raters because for rarely-rated movies, theirs may be the only opinion we have to work with.

Here is the distribution of 2439493's ratings:

[Chart: distribution of ratings given by customer 2439493]

Customer 2439493 rated 16,565 movies and hated 15,024 of them! If he is an actual human, why does he watch so many movies if he dislikes them so much? If he actually watched all of them, and the average running time was 1.5 hours, then he spent 22,536 hours watching movies he hated. At eight hours a day, every day, that works out to nearly eight years.


This customer did like 334 movies (about 2%) well enough to give them a five. Other than a fondness for Christmas- and Santa-themed movies, it is hard to discern what his favorites have in common. Here is a partial listing:
  • Rudolph the Red-Nosed Reindeer
  • Jingle All the Way
  • The Santa Clause 2
  • The Santa Clause
  • Back to the Future Part III
  • Frosty's Winter Wonderland / Twas the Night Before Christmas
  • Sleeping Beauty: Special Edition
  • Jack Frost
  • The Princess Diaries (Fullscreen)
  • The Little Princess
  • A Christmas Carol
  • The Passion of the Christ
  • Ernest Saves Christmas
  • Bambi: Platinum Edition
  • New York Firefighters: The Brotherhood of 9/11
  • Left Behind: World at War
  • The Year Without a Santa Claus
  • Miss Congeniality
  • National Lampoon's Christmas Vacation: Special Edition
  • Groundhog Day
  • Maid in Manhattan
  • Jesus of Nazareth
  • The Sound of Music
  • A Charlie Brown Christmas
  • Miracle on 34th Street
  • Mary Poppins
  • The Brave Little Toaster
  • The Grinch
This guy rated 16,565 movies and Miss Congeniality and National Lampoon's Christmas Vacation both made the top 2%! Representative? I hope not.
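
For the record, a list like that can be pulled with a join against a table of movie titles. This is a sketch only: it assumes the movie_titles file from the Prize data set has been loaded into a table I call netflix.movies, with movid and title columns.

-- Sketch: the movies customer 2439493 rated 5, by title.
-- Assumes netflix.movies maps movid to title.
select m.title
from netflix.train r
join netflix.movies m on m.movid = r.movid
where r.custid = 2439493
  and r.rating = 5
order by m.title;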

How does my ratings distribution compare?

Originally posted to a previous version of this blog on 18 April 2007.


[Chart: how my (mjab) scores are distributed]

[Chart: how scores are distributed in the population]

I have rated 110 movies--a few more than the median. My mode is 3 whereas the population mode is 4.

Thinking about my own data is not just narcissism; I have found it useful when looking at supermarket loyalty card data, telephone call data, and drug prescription data. I find it gets me thinking in more interesting ways.

For instance, I never give any movies a 1, so how come other people do? Couldn't they have guessed they wouldn't like that movie before seeing it? Personally, I never watch a movie unless I expect to like it at least a little. Ah, but that wasn't always the case! When my kids were young and living at home, I often suffered through movies that were not to my liking. Come to think of it, it is not only parents who sometimes watch things picked by other people. Roommates, spouses, and dates can all have terrible taste!

Outliers for number of movies rated

Originally posted to a previous version of this blog on 18 April 2007.

My rater signature has a column for the number of movies each subscriber has rated. In the J code below, this column is called n_ratings.

NB. count subscribers whose n_ratings exceeds each of three thresholds
+/ n_ratings >/ 1000 5000 10000

In English, this compares each subscriber's number of ratings with the three threshold values, creating a table with three columns. The table contains a 1 where the number of ratings is greater than the corresponding value and a 0 where it is less than or equal to it. These 1's and 0's are then summed down each column. The result vector is 13100 43 5.

The 13,100 people who rated more than a thousand movies are presumably legitimate movie buffs who have seen, and have opinions on, a lot of movies. Rating 10,000 movies does not seem like the expected behavior of a single human. Could these be the collective opinions of an organization? Or automatic ratings generated by a computer program? I don't know. What I do know is that such outliers should be treated with care. One concern is that for movies that have been rated by very few subscribers, the ratings will be dominated by these outliers.
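
For readers who prefer SQL to J, the same three counts could come from the rater signature, assuming it lives in a table with one row per subscriber and an n_ratings column (the table and column names here are mine):

-- Sketch: count subscribers above each ratings threshold.
-- In MySQL the boolean comparison evaluates to 1 or 0, so sum() counts them.
select sum(n_ratings > 1000)  as over_1000,
       sum(n_ratings > 5000)  as over_5000,
       sum(n_ratings > 10000) as over_10000
from rater_signature;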

There has been some discussion of this issue on the Netflix Prize Forum.

How many movies does each rater rate?

Originally posted to a previous version of this blog on 17 April 2007.

Part of my rater signature is how many movies the rater has rated. A few people (are they really people? Perhaps they are programs or organizations?) have rated nearly all the movies. Most people have rated very few.

[Chart: distribution of the number of movies rated per subscriber]

How Many People Rate Each Movie?

Originally posted to a previous version of this blog on 17 April 2007.

Part of my movie signature is a count of how many people have rated each movie. The number drops off very quickly after the most popular movies.

The mean number of raters is 5,654.5. The median number of raters is 561. As mentioned in an earlier post, the smallest number of raters for any movie is 3.
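
Those counts come straight out of the training data. A minimal sketch of the per-movie tally and the mean, using the same netflix.train table as above:

-- Sketch: number of raters per movie, most popular first.
select movid, count(*) as n_raters
from netflix.train
group by movid
order by n_raters desc;

-- Mean number of raters per movie (total ratings / number of movies).
select count(*) / count(distinct movid) as mean_raters
from netflix.train;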