Wednesday, February 26, 2014

Taking a Random Sample on Amazon Redshift

Recently, I was approached by Vicky, whom I'm working with at a client, to help with a particular problem. She wanted to calculate page view summaries for a random sample of visitors from a table containing about a billion page views. This is a common problem, especially as data gets larger and larger. Note that the sample itself is based on visitors, so a simple random sample of page views is not sufficient. We needed a sample of visitors and then all the pages for each visitor.

Along the way, I learned some interesting things about Redshift, taking random samples, and working with parallel and columnar databases.

For those not familiar with it, Redshift is an Amazon cloud data store that uses ParAccel, a parallel columnar database based on an older version of Postgres. Postgres is a standard-enough relational database, used by several database vendors as the basis of their products.

Columnar databases have interesting performance characteristics, because the database stores each column separately from other columns.  Although generally bad performance-wise for ACID-compliant transactions (if you don't know what ACID is, then you don't need to know), columnar databases are good for analysis.

However, your intuition about how things work may not apply.  A seemingly simple query such as this:

select *
from PageViews
limit 10;

takes a relatively long time (several minutes) because all the columns have to be read independently.  On the other hand, a query such as:

select min(BrowserId), max(BrowserId)
from PageViews;

goes quite fast (a few seconds), because only one column has to be read into memory. The more columns a query reads, the slower it runs -- other things being equal.

Back to the random sample.  A typical way of getting this type of random sample is to first find the reduced set of visitors and then join them back to the full page views.   This sounds cumbersome, but the strategy actually works well on many databases.  Applied to the query we were working with, the resulting query looks something like:

select pv.BrowserId
from (select distinct BrowserId
      from PageViews
      order by random()
      limit 100000
     ) list join
     PageViews pv
     on list.BrowserId = pv.BrowserId
group by pv.BrowserId;

This is a reasonable and standard approach to reduce the processing overhead. The subquery list produces all the BrowserIds and then sorts them randomly (courtesy of the random() function). The limit clause then takes a sample of one hundred thousand (out of many tens of millions). The join would normally use an indexed key, so it should go pretty fast. On Redshift, the subquery to get list performs relatively well. But the entire query did not finish (our queries time out after 15-30 minutes). We experimented with several variations, to no avail.

What finally worked?  Well, a much simpler query and this surprised us.  The following returned in just a few minutes:

select BrowserId
from PageViews pv
group by BrowserId
order by random()
limit 100000;

In other words, doing the full aggregation on all the data and then doing the sorting is actually faster than trying to speed up the aggregation by working on a subset of the data.

I've been working with parallel databases for over twenty years.  I understand why this works better than trying to first reduce the size of the data.  Nevertheless, I am surprised.  My intuition about what works well in databases can be inverted when using parallel and columnar databases.

One of Vicky's requirements was for a repeatable random sample, meaning that we get exactly the same sample when running the same query again. The random() function does not, by itself, provide repeatability. In theory, setting the seed should make it repeatable. In practice, this did not seem to work; I suspect that aspects of load balancing in the parallel environment cause problems.
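
For reference, seeding the generator looks something like the following (Redshift exposes a set seed command; plain Postgres uses the setseed() function instead). In theory, running this twice should return the same sample; in practice, it did not for us:

-- In theory, fixing the seed makes random() -- and hence the sample --
-- repeatable.  In practice, this did not behave repeatably for us.
set seed to 0.25;

select BrowserId
from PageViews
group by BrowserId
order by random()
limit 100000;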

Fortunately, Postgres supports the md5() function. This is a hash function that converts a perfectly readable string into a long string of hexadecimal digits. These digits have the property that two similar strings produce very different results, so hashing is a good way to randomize strings. It is not perfect, because two BrowserIds could have the same hash value, so they would always be included or excluded together. But we don't need perfection; we are not trying to land the little Curiosity rover in a small landing zone on a planet tens of millions of miles away.
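
A quick way to see both the determinism and the effect of the seed (the BrowserId value here is made up for illustration):

-- md5() is deterministic: the same input always produces the same
-- 32 hexadecimal digits, so the ordering -- and hence the sample --
-- is the same on every run.
select md5('seed' || 'browser-123');

-- A different seed string scrambles the ordering differently, giving
-- a different, but equally repeatable, sample.
select md5('seed2' || 'browser-123');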

The final form of the query was essentially:

select BrowserId
from PageViews pv
group by BrowserId
order by md5('seed' || BrowserId)
limit 100000;

The constant "seed" allows us to get a different, repeatable sample when necessary. And Vicky can extract her sample in just a few minutes, whenever she wants to.

Thursday, November 14, 2013

What will I do on my Caribbean vacation? Teach data mining, of course!

Monday, November 18th at the Radisson Hotel Barbados. Presented by Michael Berry of Tripadvisor and David Weisman of the University of Massachusetts.  Sponsored by Purple Leaf Communications. Registration and information here.

Wednesday, September 25, 2013

For Predictive Modeling, Big Data Is No Big Deal

That is what I will be speaking about when I give a keynote talk at the Predictive Analytics World conference on Monday, September 30th in Boston.
For one thing, data has always been big. Big is a relative concept and data has always been big relative to the computational power, storage capacity, and I/O bandwidth available to process it. I now spend less time worrying about data size than I did in 1980. For another, data size as measured in bytes may or may not matter depending on what you want to do with it. If your problem can be expressed as a completely data parallel algorithm, you can process any amount of data in constant time simply by adding more processors and disks.
This session looks at various ways that size can be measured such as number of nodes and edges in a social network graph, number of records, number of bytes, or number of distinct outcomes, and how the importance of size varies by task. I will pay particular attention to the importance or unimportance of data size to predictive analytics and conclude that for this application, data is powerfully predictive, whether big or relatively small. For predictive modeling, you soon reach a point where doubling the size of the training data has no effect on your favorite measure of model goodness. Once you pass that point, there is no reason to increase your sample size. In short, big data is no big deal.

Sunday, October 21, 2012

Catch our Webcast on November 15

Gordon and I rarely find ourselves in the same city these days, but on November 15 we will be in Cary, North Carolina with our friends at JMP for a webcast with Anne Milley.  The format will be kind of like the first presidential debate with Anne as the moderator, and kind of like the second one with questions from you, the audience.  Sign up here.

Tuesday, September 11, 2012

Upcoming Speaking Engagements

After taking a break from speaking at conferences for a while, I will be speaking at two in the next month. Both events are here in Boston.

This Friday (9/14)  I will be at Big Data Innovation talking about how Tripadvisor for Business models subscriber happiness and what we can do to improve a subscriber's probability of renewal.

On October 1 and 2 I will be at Predictive Analytics World in Boston. This has become my favorite data mining conference. On Monday, I will be visiting with my friends at JMP and giving a sponsored talk about how we use JMP for cannibalization analysis at Tripadvisor for Business. On Tuesday, I will present the details of that analysis in a regular conference talk.

Sunday, March 11, 2012

Measuring Site Engagement: Pages or Sessions

One of our clients is a large media website that faced a simple question: What is the best way to find the most engaged users on the web site? The goal was to focus a marketing effort on these users.

A media web site is challenging, because there is no simple definition of engagement or customer worth. The idea is that engagement can lead either to more advertising views or to longer subscriptions, depending on the business model for the site. For a retailing site, on the other hand, the question is simpler, because there is an obvious measure of who the best customers are: the amount of money they spend.

Engagement is a nice marketing concept, but how can it be defined in the real world? One way is to simply look at the number of page views during some period of time. Another is to look at the number of sessions (or alternatively days of activity if sessions are not available) during a specified period of time. Yet another is to measure breadth of usage of the site over a period of time: Does the user only go to one page? Is the user only coming in on referrals from Google?

The first analysis used one month of data to define engagement. The top users for one month were determined based on pages and sessions. Of course, there is a lot of overlap between the two groups -- about 60% of the top deciles overlapped.

Which group seems better for defining engagement, the top users by page views or by sessions? To answer this, let's borrow an idea from survival analysis and measure how many users are still around nine months later. (Nine months is arbitrary in this case.) The return rate for the top decile by sessions was 74.4%, but for the top decile by pages it was lower, at 73.8%. Not a big difference, but one that suggests that sessions are better.

Actually, the results are even more striking for visitors who are not in both top deciles. For the non-overlapping groups, the session return rate is 69.6% versus 67.9% for the page decile.
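
For the record, the deciles can be computed with window functions. Here is a sketch; the SessionId and ViewDate columns (and the dates) are illustrative, since the actual table layout will vary:

-- Assign each visitor to a decile by sessions and by page views for
-- one month of activity.
with activity as (
    select BrowserId,
           count(*) as pages,
           count(distinct SessionId) as sessions
    from PageViews
    where ViewDate >= '2011-01-01' and ViewDate < '2011-02-01'
    group by BrowserId
)
select BrowserId,
       ntile(10) over (order by sessions desc) as session_decile,
       ntile(10) over (order by pages desc) as page_decile
from activity;

Visitors with session_decile = 1 form the top decile by sessions; joining them to activity nine months later gives the return rate.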

For defining engagement, we then extended these results to three months instead of one to find the top one million most engaged users. The three measures are:

  1. Visitors that have the most page views over three months.
  2. Visitors that have the most sessions over three months.
  3. Visitors in the top tercile (that is, the top third) of sessions in each of the three months (see the sketch after this list).
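
The third measure is the trickiest to compute. Here is a sketch of one way to do it, again with illustrative column names and dates:

-- Compute each visitor's session tercile within each month, then keep
-- visitors who are in the top tercile in all three months.
with monthly as (
    select BrowserId,
           date_trunc('month', ViewDate) as activity_month,
           count(distinct SessionId) as sessions
    from PageViews
    where ViewDate >= '2011-01-01' and ViewDate < '2011-04-01'
    group by BrowserId, date_trunc('month', ViewDate)
)
select BrowserId
from (select BrowserId,
             ntile(3) over (partition by activity_month
                            order by sessions desc) as tercile
      from monthly
     ) t
group by BrowserId
having count(*) = 3 and max(tercile) = 1;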

Three months was a rather arbitrary length of time, chosen because the data was available. Holding it constant also lets us isolate the difference between sessions and page views.

These three methods all produced about the same number of visitors -- the goal was to find the top one million most engaged users.

By these measures, the top one million visitors chosen by the three methods had the following "return" rates, nine months later:

  1. Most page views over three months: 65.4%
  2. Most sessions over three months: 65.9%
  3. Top session tercile in each of the three months: 66.9%

The nine-month survival rates suggest that the third measure -- the month-by-month session tercile -- is the better approach for measuring engagement.

Tuesday, February 14, 2012

Using Matched Pairs to Test for Cannibalization

When a company introduces a new product into the same market served by an existing one, it is possible that the new product will achieve success at the expense of the first. For example, when Netflix introduced movie downloading, it knew it would put a dent in DVD subscriptions. This is called cannibalization. Here at TripAdvisor, I recently did a study to determine whether this was occurring on our sites.
A good methodology to use for this kind of study is matched pairs. This allows you to isolate the effects of a single variable while controlling for many others. The idea is simple: To measure the effect of a treatment, you take pairs of subjects who are similar in every way and give the treatment to one, but not the other. In medical studies, twins come in handy for this purpose. 

To simplify slightly, at TripAdvisor we have two ways to generate revenue from the millions of travelers who come to one of our sites to read reviews: they can click on link A to be taken to an on-line travel agency which pays us for the referral or they can click on link B to be taken directly to the site of a hotel that has subscribed to our business listing product. So the question is “Does the presence of link B have an effect on the number of clicks received by link A?” To answer this question, each property with a business listing is paired with a “twin” that does not have a business listing. The result is two cohorts with extremely similar distributions of average daily rate, number of reviews, amount of traffic on review page, number of rooms, and everything else I could think of that might influence clicks on link A. Since the only consistent difference between the cohorts is the presence or absence of link B, any statistically significant difference in Link-A clicks can be attributed to the presence of the business listing.

Why not just compare a random sample of hotels with links A and B with a random sample of hotels with only link A?  Such a comparison would be very flattering to link B; on average, hotels with a business listings subscription perform better than those without one on all kinds of metrics, including clicks on link A. This is not surprising. Business listings do not appeal to all properties equally, nor have they been marketed with equal vigor in all markets and market segments. Such a study cannot distinguish between a difference caused by link B and one that is merely correlated with link B. For example, perhaps link B appeals more to hotels in high-traffic destinations, and those same properties also attract more clicks of all kinds.

Why not do a longitudinal study? The goal would be to compare the click rate before and after link B goes live on a hotel’s review page. The problem with this approach is that though the change in click rate is easy to measure, it is hard to interpret. The quantity of clicks varies over time for all sorts of reasons that have nothing to do with the presence or absence of a business listing. In addition to seasonality, there is trend: The ever increasing number of TripAdvisor users means that clicks will tend to increase over time. Add to that the effects of marketing campaigns, competition, changing exchange rates, and political factors and there is a lot of noise obscuring whatever signal is in the data. A cross-sectional study controls for all that.

How is similarity measured?  The matched pairs methodology calls for each subscriber to be paired with the non-subscriber most similar to it. For this study, there is a list of features that must match exactly and another list of features which, as a group, must be “pretty close.” The exact match features are categorical. The pretty close features are numeric.

Exact match features
  - Same price business listing.
  - Same geographic region.
  - Same category (Hotel, B&B, Specialty Lodging).
  - Same chain status (a Hilton can match a Marriott, but neither can match an independent property).
  - Matching properties are both on the first page of listings for a destination or both on some other page.
  - Presence or absence of reviews supplied by our users.
Hotels that match on all of the above are candidates for matching. A hotel’s actual match is its closest neighbor as determined by the “pretty close” features. The exact match features control for many variables that are not mentioned explicitly. For example, the price charged for a business listing depends on the popularity of the destination and the size of the property so hotels in the same pricing slice are similar in size and traffic. Matching on geography controls for currency, climate, language, and much else.

Pretty close features
  - Average daily rate.
  - Number of rooms.
  - Popularity ranking.
  - Review page views.

The values of these features place each property at a point in a four-dimensional space so it is easy to calculate the Euclidean distance between any pair of properties. The closest candidate by Euclidean distance is picked as the match. Because the features are all measured on different scales, they must first be standardized to make distance along one dimension comparable to distance along any other.
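
As a concrete sketch, assuming a Hotels table with one row per property (all table and column names here are illustrative), the standardization and pairing might look like this:

-- Standardize the "pretty close" features, then pair each subscriber
-- with its nearest non-subscriber among the exact-match candidates.
with standardized as (
    select HotelId, IsSubscriber,
           Region, Category, ChainStatus, PriceSlice, OnFirstPage, HasReviews,
           (DailyRate - avg(DailyRate) over ()) / stddev_pop(DailyRate) over () as z_rate,
           (NumRooms  - avg(NumRooms)  over ()) / stddev_pop(NumRooms)  over () as z_rooms,
           (PopRank   - avg(PopRank)   over ()) / stddev_pop(PopRank)   over () as z_rank,
           (PageViews - avg(PageViews) over ()) / stddev_pop(PageViews) over () as z_views
    from Hotels
),
candidates as (
    select s.HotelId as SubscriberId, n.HotelId as CandidateId,
           power(s.z_rate - n.z_rate, 2) + power(s.z_rooms - n.z_rooms, 2) +
           power(s.z_rank - n.z_rank, 2) + power(s.z_views - n.z_views, 2) as dist2
    from standardized s join
         standardized n
         on s.Region = n.Region and s.Category = n.Category and
            s.ChainStatus = n.ChainStatus and s.PriceSlice = n.PriceSlice and
            s.OnFirstPage = n.OnFirstPage and s.HasReviews = n.HasReviews
    where s.IsSubscriber = 1 and n.IsSubscriber = 0
)
select SubscriberId, CandidateId as TwinId, sqrt(dist2) as distance
from (select c.*,
             row_number() over (partition by SubscriberId order by dist2) as rn
      from candidates c
     ) ranked
where rn = 1;

Note that this sketch matches with replacement: a popular non-subscriber could end up as the twin of more than one subscriber. Pairing without replacement takes a bit more work.
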
A few pairs are so well matched that, by this measure, they are at distance 0 from each other. For example, Podere Perelli (a business listing subscriber) and Agriturismo il Borghetto (its twin without a business listing) are a pair because each has 12 rooms, each got exactly 72 page views during the observation period, and each is seventh on its page.
The results
Deciding on the distance metric and creating the matched pairs was most of the work. Once I had the pairings, I loaded 36,000 closely matched pairs into JMP, a data exploration and analysis tool that includes a matched pairs module.

In JMP's diamond-shaped matched pairs chart, the horizontal axis represents an increasing number of clicks on link A ("commerce clicks"). To the left, where the number of clicks is low, there are some dots below the red line, indicating pairs where the non-subscriber got more link-A clicks; but as the number of clicks increases, the business listings subscriber nearly always wins.
In conclusion, after controlling for differences due to geography, traffic, popularity, hotel category, number of rooms, presence or absence of reviews, appearance on page one, and average daily rate, we counted the number of clicks each twin received during a fixed observation period. There was a statistically significant difference in the number of clicks on link A. The average number of clicks for business listing subscribers was 597.49. The average number for non-subscribers was 411.69. This is good news for our subscribing hoteliers: In addition to the traffic we drive directly to their sites, they see increased indirect traffic as well.