One example is movie recommendation where a movie site recommends movies to users based on their past viewing history. When a new movie arrives the market, there isn't enough viewing statistics about the movie and therefore the new movie will not have a strong match score and won't be picked as a recommendation. Because we never learn from those that we haven't recommended, the new movies will continuously not have any statistics and therefore will never be picked in future recommendations.

Another cold start example is online Ad serving when a new Ad enters the Ad repository.

### Multilevel Granularity Prediction

One solution of cold-start problem is to leverage existing items that are "SIMILAR" to the new item; "similarity" is based on content attributes (e.g. actors, genres). Notice that here we are using a coarser level of granularity (group of similar items) for prediction, which can be less accurate than a fine-grain model that use view statistics history for prediction.In other words, we can make recommendation based on two models of different granularity. Fine-grain model based on instance-specific history data is preferred because it usually has higher accuracy. For cold-start problem when the new items don't have history data available, we will fall back to use the coarse-grain model based on other similar items to predict user's interests on the new items.

A common approach is to combine both models of different granularity using different weights where the weights depends on the confidence level of the fine-grain model. For new items, the fine-grain model will have low confidence and therefore it gives more weights to the coarse-grain model.

However, in case we don't have a coarser level of granularity, or the coarse level is too coarse and doesn't give good prediction. We have to use the fine-grain model to predict. But how can we build up the instance-specific history for the fine-grain model when we are not sure if the new items are good recommendation for the user ?

### Optimization under Uncertainty

The core of our problem is we need to optimize under uncertainties. We have two approaches- Exploitation: Make the most optimal choice based on current data. Because of uncertainty (high variation) the current data may deviate from its true expected value, we may end up picking a non-optimal choice.
- Exploration: Make a random choice or choices that we haven't made before. The goal is to gather more data point and reduce the uncertainty. This may waste our cycles of picking the optimal choice.

Unfortunately, you don't know the underlying probability and has only a limited number of rounds to play. How would you choose which bandit to play to maximize the total number of rounds you win.

Our strategy should strike a good balance between exploiting and exploring. To measure how good the strategy is, there is a concept of "regret", which is the ratio of the two elements

- Value you obtain by following Batch Optimal strategy (after you have done batch analysis and have a clear picture in the underlying probability distribution)
- Value you obtain by following the strategy

### Epsilon-Greedy Strategy

In the**"epsilon-greedy"**strategy, at every play we throw a dice between explore and exploit.

With probability p(t) = k/t (where k is a constant and t is the number of tries so far),we pick a bandit randomly with equal chance (regardless of whether the bandit has been picked in the past). And with probability 1 - p(t), we pick the bandit that has the highest probability of win based on passed statistics.

Epsilon-greedy has the desirable property of spending more time to explore initially and gradually reduce the portion as time passes. However, it doesn't have a smooth transition between explore and exploit. Also while it explores, it picks each bandit uniformly without giving more weight to the unexplored bandits. While it exploits, it doesn't consider the confidence of probability estimation.

### Upper Confidence Bound: UCB

In the more sophisticated UCB strategy, each bandit is associated with an estimated mean with a confidence interval. In every play, we choose the bandit whose upper confidence bound (ie: mean + standard deviation) is the largest.Initially each bandit has a zero mean and a large confidence interval. As time goes, we estimated the mean p[i] of bandit i based on how many time it wins since we play the bandit i. We also adjust the confidence interval (reducing deviation) as we play the bandit.

e.g. standard deviation is (p.(1-p)/n)^0.5

Notice that the UCB model can be used in a more general online machine learning setting. We require the machine learning model be able to output its estimation based on a confidence parameter. As a concrete example, lets say a user is visiting our movie site and we want to recommend a movie to the user, based on a bunch of input features (e.g. user feature, query feature ... etc.).

We can do a first round selection (based on information retrieval technique) to identify movie candidate based on relevancy (ie: user's viewing history or user search query). For each movie candidate, we can invoke the ML model to estimated interest level, as well as the 68% confidence boundary (the confidence level is arbitrary and need to be hand-tuned, 68% is roughly one standard deviation of a Gaussian distribution). We then combine them by add the 68% confidence range as an offset to its estimation and recommend the movie that has the highest resulting value.

After recommendation, we monitor whether user click on it, view it ... etc. and the response will be fed back to our ML model as a new training data. Our ML model is an online learning setting and will update the model with this new training data. Over time the 68% confidence range will be reduced over time as more and more data is gathered.

### Relationship with A/B Testing

For most web sites, we run experiments continuously to improve user experience by trying out different layouts, or to improve user's engagement by recommending different types of contents, or by trying out different things. In general, we have an objective function that defines what aspects we are trying to optimize, and we run different experiments through A/B testing to try out different combinations of configuration to see which one will maximize our objective function.When the number of experiments (combinations of different configuration) is small, then A/B is exploration mainly. In a typical setting, we use the old user experience as a control experiment and use the new user experience as a treatment. The goal is to test if the treatment causes any significant improvement from the control. Certain percentage of production users (typically 5 - 10%) will be directed to the new experience and we measured whether the user engagement level (say this is our objective function) has increased significantly in a statistical sense. Such splitting is typically done by hashing the user id (or browser cookies), and based on the range of the hash code falls to determine whether the user should get the new experience. This hashing is consistent (same user will hash into the same bucket in subsequent request) and so the user will get the whole new user experience when visiting the web site.

When the number of experiments is large and new experiments comes out dynamically and unpredictable, traditional A/B testing model described above will not be able to keep track of all pairs of control and treatment combination. In the case, we need to use the dynamic exploration/exploitation mechanism to find out the best user experience.

Using the UCB approach, we can treat each user experience as a bandit that the A/B test framework can choose from. Throughout the process, A/B test framework will explore/exploit among different user experience to optimize for the objective function. At any time, we can query the A/B testing framework to find out the latest statistics of each user experience. This provides a much better way to look at large number of experiment result at the same time.