9 Mistakes Quants Make that Cause Backtests to Lie

Posted on April 27, 2015 by

1


What really happened: If, instead we use the members of the S&P 500 starting in 2008, we find that more than 10% of the listed companies failed.

“I’ve never seen a bad backtest” — Dimitris Melas, head of research at MSCI.

About backtests

A backtest is a simulation of a trading strategy used to evaluate how effective the strategy might have been if it were traded historically. Backtestesting is used by hedge funds and other researchers to test strategies before real capital is applied. Backtests are valuable because they enable quants to quickly test and reject trading strategy ideas.

All too often strategies look great in simulation but fail to live up to their promise in live trading. There are a number of reasons for these failures, some of which are beyond the control of a quant developer. But other failures are caused by common, insidious mistakes.

An over optimistic backtest can cause a lot of pain. I’d like to help you avoid that pain by sharing 9 of the most common pitfalls in trading strategy development and testing that can result in overly optimistic backtests:

1. In-sample backtesting

Many strategies require refinement, or model training of some sort. As one example, a regression-based model that seeks to predict future prices might use recent data to build the model. It is perfectly fine to build a model in that manner, but it is not OK to test the model over that same time period. Such models are doomed to succeed.

Don’t trust them.

Solution: Best practices are to build procedures to prevent testing over the same data you train over. As a simple example you might use data from 2007 to train your model, but test over 2008-forward.

By the way, even though it could be called “out-of-sample” testing it is not a good practice to train over later data, say 2014, then test over earlier data, say 2008-2013. This may permit various forms of lookahead bias.

2. Using survivor-biased data

 Suppose I told you I have created a fantastic new blood pressure medicine, and that I had tested it using the following protocol:

  1. Randomly select 500 subjects
  2. Administer my drug to them every day for 5 years
  3. Measure their blood pressure each day

At the beginning of the study the average blood pressure of the participants was 160/110, at the end of the study the average BP was 120/80 (significantly lower and better).

Those look like great results, no? What if I told you that 58 of the subjects died during the study? Maybe it was the ones with the high blood pressure that died! This is clearly not an accurate study because it focused on the statistics of survivors at the end of the study.

This same sort of bias is present in backtests that use later lists of stocks (perhaps members of the S&P 500) as the basis for historical evaluations over earlier periods. A common example is to use the current S&P 500 as the universe of stocks for testing a strategy.

Why is this bad? See the two figures below for illustrative examples.

The green lines show historical performance of stocks that were members of the S&P 500 in 2012.  Note that all of these stocks came out of the 2008/2009 downturn very nicely.

The green lines show historical performance of stocks that were members of the S&P 500 in 2012. Note that all of these stocks came out of the 2008/2009 downturn very nicely.

What really happened: If, instead we use the members of the S&P 500 starting in 2008, we find that more than 10% of the listed companies failed.

What really happened: If, instead we use the members of the S&P 500 starting in 2008, we find that more than 10% of the listed companies failed.

In our work at Lucena Research, we see an annual 3% to 5% performance “improvement” with strategies using survivor biased data.

Solution: Find datasets that include historical members of indices, then use those lists to sample from for your strategies.

 3. Observing the close & other forms of lookahead bias

In this failure mode, the quant assumes he can observe market closing prices in order to compute an indicator, and then also trade at the close. As an example, one might use closing price/volume to calculate a technical factor used in the strategy, then trade based on that information.

This is a specific example of lookahead bias in which the strategy is allowed to peek a little bit into the future. In my work I have seen time and again that even a slight lookahead bias can provide fantastic (and false) returns.

Other examples of lookahead bias have to do with incorrect registration of data such as earnings reports or news. Assuming for instance that one can trade on the same day earnings are announced even though earnings are usually announced after the close.

Solution: Don’t trade until the open of the next day after information becomes available.

4. Ignoring market impact

The very act of trading affects price. Historical pricing data does not include your trades and is therefore not an accurate representation of the price you would get if you were trading.

Consider the chart below that describes the performance of a real strategy I helped develop. Consider the region A, the first part of the upwardly sloping orange line. This region was the performance of our backtest. The strategy had a Sharpe Ratio over 7.0! Based on the information we had up until that time (the end of A), it looked great so we started trading it.

When we began live trading we saw the real performance illustrated with the green “live” line in region B– essentially flat. The strategy was not working, so we halted trading it after a few weeks. After we stopped trading it, the strategy started performing well again in paper trading (Region C, Arg!).

Performance of a strategy that looked great in backtesting (region A). When traded live, it didn't work well (region B).  When we stopped trading it it went back to working well (region C).

Performance of a strategy that looked great in backtesting (region A). When traded live, it didn’t work well (region B). When we stopped trading it it went back to working well (region C).

How can this be? We thought perhaps that the error was in our predictive model, so we backtested again over the “live” area and the backtest showed that same flat area. The only difference between the nice 7.0 Sharpe Ratio sections and the flat section was that we were engaged in the market in the flat region.

What was going on? The answer, very simply, is that by participating in the market we were changing the prices to our disadvantage. We were not modeling market impact in our market simulation. Once we added that feature more accurately, our backtest appropriately showed a flat, no-return result for region A. If we had had that in the first place we probably would never have traded the strategy.

Solution: Be sure to anticipate that price will move against you at every trade. For trades that are a small part of overall volume, a rule of thumb is about 5 bps for S&P 500 stocks and up to 50 bps for more thinly traded stocks. It depends of course on how much of the market your strategy is seeking to trade.

5. Buy $10M of a $1M company

Naïve backtesters will allow a strategy to buy or sell as much of an asset as it likes. This may provide a misleadingly optimistic backtest because large allocations to small companies are allowed.

There often is real alpha in thinly traded stocks, and data mining approaches are likely to find it. Consider for a moment why it seems there is alpha there. The reason is that the big hedge funds aren’t playing there because they can’t execute their strategy with illiquid assets. There are perhaps scraps of alpha to be collected by the little guy, but check to be sure you’re not assuming you can buy $10M of a $1M company.

Solution: Have your backtester limit the strategy’s trading to a percentage of the daily dollar volume of the equity. Another alternative is to filter potential assets to a minimum daily dollar volume.

6. Overfit the model

An overfit model is one that models in-sample data very well. It predicts the data so well that it is likely modeling noise rather than the underlying principle or relationship in the data that you are hoping it will discover.

Here’s a more formal definition of overfitting: As the degrees of freedom of the model increase, overfitting occurs when in-sample prediction error decreases and out-of-sample prediction error increases.

What do we mean by “degrees of freedom?” Degrees of freedom can take many forms, depending on the type of model being created: Number of factors used, number of parameters in a parameterized model and so on.

Degrees of freedom (X), versus error (Y).  Overfitting is occurring in the region to the right of the yellow symbol as out of sample error increases.

Degrees of freedom (X), versus error (Y). Overfitting is occurring in the region to the right of the yellow symbol as out of sample error increases.

Solution: Don’t repeatedly “tweak” and “refine” your model using in-sample data. And always compare in-sample error versus out-of-sample error.

7. Trust complex models

Complex models are often overfit models. Simple approaches that arise from a basic idea that makes intuitive sense lead to the best models. A strategy built from a handful of factors combined with simple rules is more likely to be robust and less sensitive to overfitting than a complex model with lots of factors.

Solution: Limit the number of factors considered by a model, use simple logic in combining them.

8. Trust stateful strategy luck

A stateful strategy is one whose holdings over time depend on which day in history it was started. As an example, if the strategy rapidly accrues assets, it may be quickly fully invested and therefore miss later buying opportunities. If the strategy had started one day later, it’s holdings might be completely different.

Sometimes such strategies’ success vary widely if they are started on a different day. I’ve seen, for instance, a difference in 50% return for the same strategy started on two days in the same week.

Solution: If your strategy is stateful, be sure to test it starting on many difference days. Evaluate the variance of the results across those days. If is large you should be concerned.

9. Data mining fallacy

 Even if you avoid all of the pitfalls listed above, if you generate and test enough strategies you’ll eventually find one that works very well in a backtest. However, the quality of the strategy cannot be distinguished from a lucky random stock picker.

How can this pitfall be avoided? It can’t be avoided. However you can and should forward test before committing significant capital.

Solution: Forward test (paper trade) a strategy before committing capital.

Summary

It is better to view backtesting as a method for rejecting strategies than as a method for validating strategies. One thing is for sure: If it doesn’t work in a backtest, it won’t work in real life. The converse is not true: Just because it works in a backtest does not mean you can expect it to work in live trading.

If you avoid the pitfalls listed above, your backtests stand a better chance of more accurately representing real life performance.

Tucker Balch, Ph.D. is a professor of Interactive Computing at Georgia Tech. He is also CTO of Lucena Research, Inc., a financial decision support technology company. You can read more essays by Tucker at http://augmentedtrader.com

Advertisements