Intro
Backtesting is every systematic trader's basic tool, and Python is fast becoming the lingua franca of programming. So using Python for backtesting, and getting fast results, should be possible!
Yes and no!
In this article, we’ll cover how to really improve your Python backtesting and boost your speeds by several orders of magnitude!
First, a quick table of the pros and cons of using Python:
Pros | Cons
---|---
Quick implementation time [Python's forte!] | You'll grow really old waiting for the results
Gazillions of libraries for fancy output |
Gazillions of libraries for fancy analysis |
And it's that one con that's the real biggie.
You’re a trader, so by definition you already have the attention span of a goldfish. It stands to reason, therefore, that waiting for a couple of seconds for a backtesting result to come back is an eternity.
It really is when you’re looking at a portfolio of strategies, a portfolio of assets, a portfolio of both, or if you’re running any sort of optimization.
So, in this article we’ll cover some simple and more sophisticated ways of improving our timing. We’ll start out with pure Python solutions and in Part 2 of this series we’ll cover the more sophisticated Cython module set, to squeeze the last ounce out of our code.
To keep ourselves on the straight and narrow we'll use Java and C implementations as a benchmark. Of course, those languages trounce Python. But by the end of our journey you'll agree that we don't have to give up the comfortable life Python offers to get massive speed improvements.
Background
The above statements might cause people to immediately say “Vectorize your Code using Pandas and NumPy!”
Agreed, in many circumstances you can speed up execution by calling vectorized maths functions from Pandas and NumPy.
But, let's face it, the challenge in trading is that your decision now depends on a bunch of state variables from the last n time steps. This is where vectorization fails and you're now in the world of having to write for-do loops.
And even Event Driven backtesters reduce to for-do loops when you run simulations on historical data.
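To make the distinction concrete, here is a small illustrative sketch (not from the article; the DataFrame and its column name are assumptions): an indicator such as a moving average vectorizes trivially, but a position that depends on yesterday's position has to be produced one step at a time.

```python
# Assume df is a pandas DataFrame of daily bars with a 'close' column.

# Vectorized: each value depends only on a fixed window of past prices.
ma = df['close'].rolling(200).mean()

# Stateful: today's position depends on yesterday's position, so the values
# have to be generated sequentially in a plain Python loop.
position = [0.0]
for close, ma_val in zip(df['close'].iloc[1:], ma.iloc[1:]):
    if position[-1] == 0 and close > ma_val:
        position.append(1.0)            # enter long
    elif position[-1] > 0 and close < ma_val:
        position.append(0.0)            # exit
    else:
        position.append(position[-1])   # carry yesterday's state forward
```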
So, the real challenge is running fast for-do loops in Python.
Is this possible?
Setup
To experiment with and validate the various methods of speeding up our backtesting for-do loops we'll use a straightforward trading system: the RSI(2) applied to the SPY from January 1993 until now.
Written in pseudo code it looks like:
```
// cash and position are arrays; day-by-day indexing is left implicit
for all history do:
    cash today     = cash yesterday
    position today = position yesterday

    // Exit long
    if long and C[0] > MA[Close, 5]:
        cash today     = cash today + position today * close
        position today = 0

    // Enter long
    if no position and C[0] > MA[Close, 200] and RSI[Close, 2] < 20:
        position today = cash today / close
        cash today     = cash today - position today * close
```
Since we simply want to focus on the efficiency of the for loops, we'll pre-calculate the moving averages and the Relative Strength Index and just look up the values in their corresponding arrays.
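As a rough sketch of that pre-calculation (one common RSI formulation, not necessarily the exact one behind the timings below; the column names ma_short, ma_long and rsi are assumptions carried through the rest of the examples):

```python
# Assume df is a pandas DataFrame of SPY daily bars with a 'close' column.
df['ma_short'] = df['close'].rolling(5).mean()
df['ma_long'] = df['close'].rolling(200).mean()

# RSI(2) via Wilder-style smoothing; other RSI variants exist.
delta = df['close'].diff()
gain = delta.clip(lower=0).ewm(alpha=1/2, adjust=False).mean()
loss = (-delta.clip(upper=0)).ewm(alpha=1/2, adjust=False).mean()
df['rsi'] = 100 - 100 / (1 + gain / loss)
```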
Python Backtesting – DataFrames
The usual way to implement this in Python is to:
- load the data into a DataFrame using the ubiquitous Pandas library
- add columns to the DataFrame for moving averages and relative strength
- loop over the rows of the data frame to calculate cash and position values over the lifetime of the system
So it's worth seeing how looping is done with DataFrames, since in and of itself it's not the most obvious thing, and it's the first place to remove bottlenecks.
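A hypothetical version of the first step might look like this (the file name, date column and layout are assumptions, not taken from the article); the second step is the indicator sketch in the Setup section, and the third step, the loop, is what the rest of this article is about.

```python
import pandas as pd

# Load SPY daily bars into a DataFrame indexed by date.
df = pd.read_csv('spy.csv', parse_dates=['date'], index_col='date').sort_index()
```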
Method 1: Indexing using dates
A naïve method would use the date index of the DataFrame to retrieve the values from the matrix, while looping over all dates.
For 7,000 rows this gives an execution time of 7.3 seconds.
This is pretty impressive slowness.
Here’s the code skeleton:
```python
dt_range = df.index
for d in dt_range:
    if df.loc[d, 'close'] > df.loc[d, 'ma_short']:
        ...
```
You get the picture.
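To show the whole pattern once, here is a hedged sketch of what the full date-indexed loop might look like, including the cash and position bookkeeping from the pseudo code and a simple timing wrapper (the starting cash, the column names and the perf_counter timing are illustrative assumptions, not the article's exact code):

```python
import time

# Assume df already holds 'close', 'ma_short', 'ma_long' and 'rsi' columns.
cash, position = 100_000.0, 0.0

start = time.perf_counter()
for d in df.index:
    close = df.loc[d, 'close']
    # Exit long: close has popped above the short moving average
    if position > 0 and close > df.loc[d, 'ma_short']:
        cash += position * close
        position = 0.0
    # Enter long: long-term uptrend filter plus oversold RSI(2)
    elif position == 0 and close > df.loc[d, 'ma_long'] and df.loc[d, 'rsi'] < 20:
        position = cash / close
        cash -= position * close
print(f"loop time: {time.perf_counter() - start:.2f} s")
```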
Method 2: Using the iterators from DataFrames [iterrows]
The above example might seem intuitive, since it queries the data for specific days rather than using integer indices to pull out array / list / series values.
However, a more natural way of accessing values in the DataFrame while looping is to use its built-in iteration constructs.
The first one you come across is the built-in DataFrame method iterrows.
For the same 7,000 rows the time taken to complete the loop is 1.3 seconds.
Here’s the code skeleton:
```python
for ix, row in df.iterrows():
    if long and (row.close > row.ma_short):
        ...
```
So, firstly, there is hope: just by changing the approach we've sped the loop up by more than a factor of five. But it's still pretty lousy. Imagine looping over the stocks in the S&P 500 while optimizing even a modest set of, say, 50 parameter combinations per stock. That works out to roughly 9 hours.
Who has time to sit around for 9 hours!!
Is there a better way?
Method 3: A second iterator from DataFrames [itertuples]
It’s really remarkable that there are two methods which are so similar in behavior yet in terms of performance are light years apart.
Replace df.iterrows() by df.itertuples(). Here is the code skeleton:
```python
for row in df.itertuples():
    if long and row.close > row.ma_short:
        ...
```
The syntax change is minimal; however, the speed up goes from 1.3 seconds to … wait for it … 0.03 seconds.
Yes, you've read that right! Same code, same logic, and the same container [the DataFrame], and we've sped up the code by a factor of 43.
This is pretty insane, right?
Method 4: Forget about DataFrames and pack everything into lists
So, if DataFrames can work this well, and DataFrames are really nothing more than complex wrappers around simple arrays, what happens if we just throw out the wrapping and use the arrays themselves? That is, shove the data into Python lists and loop over those?
The overhead in programming is a bit more, since we now need to explicitly code for each individual list we want to use, but…
… the time for performing the 7,000 loops now has become 0.003 seconds!
You read that right! So, analyzing all of the S&P 500 stocks would take a total of 1.5 seconds, and even the 50-combination optimization sweep from before would finish in a bit over a minute rather than 9 hours. Which is much better than lounging in front of the screen all day.
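The article doesn't print the list-based loop itself, so here is a minimal sketch of the idea, reusing the same assumed column names as above: pull each column out of the DataFrame once, then loop over plain lists with an integer index.

```python
# One-off conversion: DataFrame columns to plain Python lists.
closes = df['close'].tolist()
ma_short = df['ma_short'].tolist()
ma_long = df['ma_long'].tolist()
rsi = df['rsi'].tolist()

cash, position = 100_000.0, 0.0

# The hot loop touches only native lists and floats, no per-row pandas objects.
for i in range(len(closes)):
    close = closes[i]
    if position > 0 and close > ma_short[i]:
        cash += position * close
        position = 0.0
    elif position == 0 and close > ma_long[i] and rsi[i] < 20:
        position = cash / close
        cash -= position * close
```

The speed-up comes from paying the conversion cost once and keeping label lookups and row-object creation out of the hot loop entirely.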
Conclusion
Part 1 of this series took a monstrous 7.3-second loop backtesting a simple system in Python and reduced it to 0.003 seconds. That's an improvement of roughly 2,400 times. Nothing to be sniffed at, and it only involved some basic rewriting of code!
All we did was acknowledge that DataFrames are great for storing data and for applying maths functions to the columns (or rows) in a vectorized fashion. But when it comes to looping, we might as well go the old-fashioned route of using plain arrays (known in Python as lists).
So where to next?
In Part 2 of speeding up Python backtesting we'll delve into the Cython toolset. This does something funky: it takes your Pythonesque source [a file that ends in .pyx] and transpiles it to C. In doing so it can perform a bunch of optimizations your Python interpreter wasn't built to do, since the interpreter has to handle the most generic use cases. Cython, on the other hand, lets you give very specific indications as to how your source code is meant to be used.
The end result is an even bigger speed up!
Do we come close to Java and C on this simple loop?
Check in to Part 2, where we unveil the Cython results as well as provide a link to the GitHub code so you can check it out for yourself!
Here’s the summary of the speed-ups to-date of our Python backtesting with the corresponding comparisons to a Java / C implementation:
Implementation | Time for RSI(2) backtest
---|---
Python – DataFrame, date indexing | 7.3 s
Python – DataFrame, iterrows | 1.3 s
Python – DataFrame, itertuples | 0.03 s
Python – Lists | 0.003 s
Java | 0.00005 s
C | 0.00002 s