Why every quant fund rewrites the same backtester

Skelf-Research · November 4, 2025

backtesting · engineering · infra

There is a moment in every quant team’s life when somebody looks at the third copy of compute_momentum.py in three different research folders and asks the inevitable question: should we just have one of these? The answer is yes, of course, and the answer is also no, because each copy is slightly different and the differences matter. A team that has lived through this once tends to build the same thing the next time: a typed signal language with a deterministic execution layer. We built sigc because we got tired of building it from scratch.

This is not novel. It is so unoriginal that it is almost embarrassing to write down. Every prop shop and quant fund of any size eventually has an internal DSL, an internal scheduler, an internal cache, and an internal way of pretending that research and production are the same code. The names change. The shape does not.

The four failure modes

Why does this keep happening? Because notebooks fail in four specific ways and a quant team cannot ship without solving all four.

The first failure is calendar drift. Three CSVs land in the warehouse. One is in NYSE time, another is in UTC, and the third has weekends in it because the vendor decided to forward-fill Friday closes into Saturday. The pandas join silently succeeds, the date index is technically aligned, and the resulting signal looks fine. It is not fine. There is no compile step that would have caught this, because there is no compile step at all.
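The missing check can be approximated even without a full compiler. This Rust sketch, using only the standard library, tags each daily series with the clock its dates were stamped in and refuses to join series whose tags differ or that contain weekend rows. `TimeBasis`, `DailySeries` and `align` are hypothetical names, not sigc's API, and a real exchange calendar would also need holidays and half-days, which this sketch ignores.

```rust
use std::collections::BTreeMap;

/// Hypothetical tag recording which clock a series' dates were stamped in.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeBasis {
    NyseLocal,
    Utc,
}

/// A daily series keyed by (year, month, day), plus its time basis.
#[derive(Debug, Clone)]
pub struct DailySeries {
    pub basis: TimeBasis,
    pub values: BTreeMap<(i32, u32, u32), f64>,
}

#[derive(Debug, PartialEq)]
pub enum AlignError {
    BasisMismatch(TimeBasis, TimeBasis),
    WeekendRow((i32, u32, u32)),
}

/// Day of week via Sakamoto's method: 0 = Sunday, ..., 6 = Saturday.
pub fn day_of_week(y: i32, m: u32, d: u32) -> u32 {
    const T: [i32; 12] = [0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4];
    let y = if m < 3 { y - 1 } else { y };
    (y + y / 4 - y / 100 + y / 400 + T[(m - 1) as usize] + d as i32).rem_euclid(7) as u32
}

/// Inner-join two daily series on date, but only after checking that both
/// use the same time basis and that neither contains weekend rows.
pub fn align(
    a: &DailySeries,
    b: &DailySeries,
) -> Result<Vec<((i32, u32, u32), f64, f64)>, AlignError> {
    if a.basis != b.basis {
        return Err(AlignError::BasisMismatch(a.basis, b.basis));
    }
    for s in [a, b] {
        let weekend = s.values.keys().find(|&&(y, m, d)| {
            let w = day_of_week(y, m, d);
            w == 0 || w == 6
        });
        if let Some(&date) = weekend {
            return Err(AlignError::WeekendRow(date));
        }
    }
    Ok(a.values
        .iter()
        .filter_map(|(date, &va)| b.values.get(date).map(|&vb| (*date, va, vb)))
        .collect())
}
```

The point is where the failure happens: a mismatched file becomes an error at join time instead of a plausible-looking signal.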

The second failure is non-deterministic backtests. You ran the notebook yesterday and got a Sharpe of 1.4. You ran it today and got 1.6. The data did not change. Something about cache state, kernel state, or library order made a difference, and now you do not know which run to trust. Reviewing the result is impossible because you cannot reproduce the inputs.

The third failure is the fragile join at three in the morning. The signal references a feature that was renamed two days ago. The pandas merge does not raise; it just produces NaN rows, and a fillna later in the pipeline papers over them. The strategy trades a position it should not, and the operations team gets paged.
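One way to turn the renamed-feature case into a loud failure is to resolve every column a signal depends on before computing anything. A minimal Rust sketch under that assumption; `resolve_features` and `MissingFeature` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;

/// Error raised when a signal asks for a feature column that does not exist.
#[derive(Debug, PartialEq)]
pub struct MissingFeature {
    pub feature: String,
    pub available: Vec<String>,
}

/// Resolve every feature a signal depends on before computing anything.
/// A renamed column becomes an error naming the missing feature, instead of
/// a column of NaNs that a later fill step can hide.
pub fn resolve_features<'a>(
    table: &'a HashMap<String, Vec<f64>>,
    required: &[&str],
) -> Result<Vec<&'a [f64]>, MissingFeature> {
    required
        .iter()
        .map(|name| {
            table.get(*name).map(|col| col.as_slice()).ok_or_else(|| {
                let mut available: Vec<String> = table.keys().cloned().collect();
                available.sort(); // HashMap order is unspecified; sort for a stable message
                MissingFeature { feature: name.to_string(), available }
            })
        })
        .collect()
}
```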

The fourth failure is the research-to-production gap. The backtest was a Python notebook. Production is a Java process. The signal logic was translated, and the translation has a subtle off-by-one in shift(). Live PnL diverges from paper. Nobody can prove which version is “correct” because there are now two implementations.
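An off-by-one in a lag is easy to write and hard to spot in review. This Rust sketch pins down one convention, matching the direction of pandas' `shift(n)` for positive `n`: the output at date t is the input at t - n. `lagged_pnl` then applies yesterday's signal to today's return. Both functions are hypothetical illustrations; the point is that a single definition, shared by research and production, leaves no second translation to drift.

```rust
/// Lag a series by `n` periods: out[t] = x[t - n], and None for t < n.
/// This matches the direction of pandas' `shift(n)` for positive n.
pub fn shift(x: &[f64], n: usize) -> Vec<Option<f64>> {
    (0..x.len())
        .map(|t| if t >= n { Some(x[t - n]) } else { None })
        .collect()
}

/// PnL of holding yesterday's signal through today's return:
/// position[t] = signal[t - 1], pnl = sum over t of position[t] * returns[t].
pub fn lagged_pnl(signal: &[f64], returns: &[f64]) -> f64 {
    shift(signal, 1)
        .iter()
        .zip(returns)
        .filter_map(|(pos, &r)| pos.map(|p| p * r))
        .sum()
}
```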

A team that has been through any of these once builds something to stop it. A team that has been through all four ends up building all of sigc.

What the rebuild always looks like

The internal rebuild always starts in the same place: a small DSL. The team picks a syntax (some go full Python-with-decorators, some pick a YAML thing, some go further and write a real parser). They add a “signal” concept that is a named, typed computation. They add a “portfolio” concept that turns signals into weights. They add a backtest harness that consumes weights and returns Sharpe-and-friends.
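The three concepts in that first version can be sketched as a trait for signals, a function from scores to weights, and a harness that turns weights into a daily PnL series. Everything here (`Signal`, `Momentum`, `to_weights`, `backtest`) is a hypothetical illustration in Rust, not sigc's internal types; prices are assumed to be a dense date-by-asset matrix with no missing values.

```rust
/// A named, typed computation producing one score per asset per date.
/// `prices[t][i]` is the price of asset i on date t.
pub trait Signal {
    fn name(&self) -> &str;
    fn compute(&self, prices: &[Vec<f64>]) -> Vec<Vec<f64>>;
}

/// Hypothetical example signal: simple `lookback`-day price momentum.
pub struct Momentum {
    pub lookback: usize,
}

impl Signal for Momentum {
    fn name(&self) -> &str {
        "momentum"
    }
    fn compute(&self, prices: &[Vec<f64>]) -> Vec<Vec<f64>> {
        prices
            .iter()
            .enumerate()
            .map(|(t, row)| {
                row.iter()
                    .enumerate()
                    .map(|(i, &p)| {
                        if t >= self.lookback {
                            p / prices[t - self.lookback][i] - 1.0
                        } else {
                            0.0 // not enough history yet: no view
                        }
                    })
                    .collect()
            })
            .collect()
    }
}

/// Portfolio step: demean one date's scores and scale to unit gross exposure.
pub fn to_weights(scores: &[f64]) -> Vec<f64> {
    if scores.is_empty() {
        return Vec::new();
    }
    let mean = scores.iter().sum::<f64>() / scores.len() as f64;
    let demeaned: Vec<f64> = scores.iter().map(|s| s - mean).collect();
    let gross: f64 = demeaned.iter().map(|w| w.abs()).sum();
    if gross == 0.0 {
        return vec![0.0; scores.len()];
    }
    demeaned.iter().map(|w| w / gross).collect()
}

/// Backtest harness: weights formed from date t's scores earn the t -> t+1 return.
pub fn backtest(signal: &dyn Signal, prices: &[Vec<f64>]) -> Vec<f64> {
    let scores = signal.compute(prices);
    (0..prices.len().saturating_sub(1))
        .map(|t| {
            to_weights(&scores[t])
                .iter()
                .enumerate()
                .map(|(i, w)| w * (prices[t + 1][i] / prices[t][i] - 1.0))
                .sum::<f64>()
        })
        .collect()
}
```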

The next thing they build is a cache. Without one, every researcher recomputes the same factor on the same window every time they iterate, and the iteration loop becomes minutes instead of seconds. The cache key is usually a hash of the inputs plus the strategy source. The cache value is the materialised result. If the strategy source has not changed and the inputs have not changed, the result does not change. That is the deterministic property, and it is what makes review possible.
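A content-addressed cache reduces to: key = hash(strategy source, input bytes); if the key is present, return the stored value. This Rust sketch uses a hand-written FNV-1a hash and an in-memory `HashMap` only to stay dependency-free. They stand in for a real content hash and a persistent store; FNV-1a is not collision-resistant and should not be used as a cache key in practice. The names `cache_key` and `ResultCache` are hypothetical.

```rust
use std::collections::HashMap;

const FNV_OFFSET: u64 = 0xcbf29ce484222325;
const FNV_PRIME: u64 = 0x100000001b3;

/// FNV-1a (64-bit), continuing from state `h`. Deterministic across runs,
/// but NOT collision-resistant: a stand-in for a real content hash.
fn fnv1a(bytes: &[u8], mut h: u64) -> u64 {
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(FNV_PRIME);
    }
    h
}

/// Cache key = hash(strategy source, inputs). Length prefixes keep
/// ("ab", "c") and ("a", "bc") from hashing identically.
pub fn cache_key(source: &str, inputs: &[f64]) -> u64 {
    let mut h = FNV_OFFSET;
    h = fnv1a(&(source.len() as u64).to_le_bytes(), h);
    h = fnv1a(source.as_bytes(), h);
    h = fnv1a(&(inputs.len() as u64).to_le_bytes(), h);
    for x in inputs {
        h = fnv1a(&x.to_bits().to_le_bytes(), h); // exact bit pattern of each input
    }
    h
}

pub struct ResultCache {
    store: HashMap<u64, f64>,
    pub misses: usize,
}

impl ResultCache {
    pub fn new() -> Self {
        ResultCache { store: HashMap::new(), misses: 0 }
    }

    /// Return the cached result, or run `compute` once and remember it.
    pub fn get_or_compute(
        &mut self,
        source: &str,
        inputs: &[f64],
        compute: impl FnOnce(&[f64]) -> f64,
    ) -> f64 {
        let key = cache_key(source, inputs);
        if let Some(&v) = self.store.get(&key) {
            return v;
        }
        self.misses += 1;
        let v = compute(inputs);
        self.store.insert(key, v);
        v
    }
}
```

Hashing the raw bit patterns of the inputs (`f64::to_bits`) means any change to the data, however small, produces a new key, which is what makes reusing a cached result safe.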

Then they build a runtime. Initially this is just “loop over the operator graph in topological order”. Then somebody points out that the operator graph is embarrassingly parallel along the time axis for some operators and along the cross-section for others, and the runtime grows a parallelism story. Then it grows a SIMD story. Then it grows a memory-mapping story for parquet, because pulling five years of daily data for five hundred names into memory naively is slow.
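The first runtime, “loop over the operator graph in topological order”, is Kahn's algorithm. A minimal Rust sketch, assuming the graph is given as a map from each hypothetical node name to the names of its inputs. Using `BTreeMap` makes tie-breaking between independent nodes, and therefore the evaluation order, identical on every run.

```rust
use std::collections::{BTreeMap, VecDeque};

/// `deps` maps each node to the nodes it reads from. Returns an evaluation
/// order in which every node follows its inputs, or Err with the nodes that
/// could not be scheduled (members of a cycle or downstream of one).
pub fn topo_order(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, Vec<String>> {
    // in-degree = number of inputs each node is still waiting on
    let mut indegree: BTreeMap<&str, usize> = BTreeMap::new();
    let mut consumers: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&node, inputs) in deps {
        indegree.entry(node).or_insert(0);
        for &inp in inputs {
            indegree.entry(inp).or_insert(0);
            *indegree.get_mut(node).unwrap() += 1;
            consumers.entry(inp).or_default().push(node);
        }
    }
    // Start from nodes with no inputs (raw data columns), in sorted order.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|&(_, &d)| d == 0)
        .map(|(&n, _)| n)
        .collect();
    let mut order = Vec::new();
    while let Some(n) = ready.pop_front() {
        order.push(n.to_string());
        if let Some(cs) = consumers.get(n) {
            for &c in cs {
                let d = indegree.get_mut(c).unwrap();
                *d -= 1;
                if *d == 0 {
                    ready.push_back(c);
                }
            }
        }
    }
    if order.len() == indegree.len() {
        Ok(order)
    } else {
        Err(indegree
            .iter()
            .filter(|&(_, &d)| d > 0)
            .map(|(&n, _)| n.to_string())
            .collect())
    }
}
```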

Eventually somebody asks the production question: how do we run the same logic live? The right answer is the same binary, same operator implementations, same cache. The wrong answer is to translate to a second codebase. Teams that pick the wrong answer regret it the first time research and prod disagree.

What sigc actually is

sigc is the version of this rebuild we keep ending up with, made public.

The DSL is a four-block file: data, params, signal, portfolio. Data declarations name external sources and dtypes. Param declarations name tunables. Signals are named, typed computations that produce a cross-sectional or time-series output and emit it. The portfolio block names a weighting scheme (rank-based long-short with caps, vol-targeted scaling), a rebalance cadence, a transaction-cost model, an optional benchmark like SPY, and a date range.
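The post does not spell out sigc's exact weighting formula, so what follows is one common construction of a rank-based long-short book with caps, sketched in Rust under stated assumptions: rank the cross-section, centre the ranks at zero, scale to unit gross exposure, then clip each weight to ±cap. The function names are hypothetical. Clipping is not followed by renormalisation, so gross exposure can end up below one.

```rust
/// Cross-sectional ranks 0..n-1. `sort_by` is stable, so ties are broken by
/// asset index; `total_cmp` keeps the ordering total even if a NaN slips in.
pub fn rank(scores: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    idx.sort_by(|&a, &b| scores[a].total_cmp(&scores[b]));
    let mut ranks = vec![0.0; scores.len()];
    for (r, &i) in idx.iter().enumerate() {
        ranks[i] = r as f64;
    }
    ranks
}

/// Rank-based long-short weights with a per-name cap (assumes cap > 0).
pub fn long_short_weights(scores: &[f64], cap: f64) -> Vec<f64> {
    let n = scores.len();
    if n < 2 {
        return vec![0.0; n]; // cannot form a long-short book from one name
    }
    let mid = (n as f64 - 1.0) / 2.0;
    // Centre ranks at zero: top names positive, bottom names negative.
    let centred: Vec<f64> = rank(scores).iter().map(|r| r - mid).collect();
    let gross: f64 = centred.iter().map(|w| w.abs()).sum();
    // Scale to unit gross exposure, then clip each weight to [-cap, cap].
    centred.iter().map(|w| (w / gross).clamp(-cap, cap)).collect()
}
```

Because the centred ranks are symmetric around zero and the clip is symmetric, the book stays dollar-neutral after capping.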

The compiler is sig_compiler. It parses the file, checks dtypes and indices during type checking, and lowers the result to an IR. Shape errors fail at compile time, so there are no surprise NaN columns at run time. The IR is what gets hashed.
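To make “shape errors fail at compile time” concrete: a type checker walks the IR once and assigns every node a shape before any data is touched. This Rust sketch uses a hypothetical three-shape system (`TimeSeries`, `CrossSection`, `Panel`) and three hypothetical operators; sigc's real type system is not described in this post and is likely richer.

```rust
/// Hypothetical shapes: one asset over time, many assets on one date, or both.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Shape {
    TimeSeries,
    CrossSection,
    Panel,
}

/// Hypothetical expression IR produced by the parser.
pub enum Expr {
    Column(&'static str, Shape),
    Rank(Box<Expr>),               // cross-sectional: needs an asset axis
    RollingMean(Box<Expr>, usize), // time-series: needs a time axis
    Sub(Box<Expr>, Box<Expr>),     // elementwise: shapes must match
}

/// Assign a shape to every node, or report the first mismatch.
/// Runs before any data is loaded, so errors surface at "compile time".
pub fn check(e: &Expr) -> Result<Shape, String> {
    match e {
        Expr::Column(_, shape) => Ok(*shape),
        Expr::Rank(inner) => match check(inner)? {
            s @ (Shape::CrossSection | Shape::Panel) => Ok(s),
            s => Err(format!("rank needs an asset axis, got {:?}", s)),
        },
        Expr::RollingMean(inner, window) => {
            if *window == 0 {
                return Err("rolling_mean window must be positive".to_string());
            }
            match check(inner)? {
                s @ (Shape::TimeSeries | Shape::Panel) => Ok(s),
                s => Err(format!("rolling_mean needs a time axis, got {:?}", s)),
            }
        }
        Expr::Sub(a, b) => {
            let (sa, sb) = (check(a)?, check(b)?);
            if sa == sb {
                Ok(sa)
            } else {
                Err(format!("shape mismatch in subtraction: {:?} vs {:?}", sa, sb))
            }
        }
    }
}
```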

The cache is sig_cache. It is content-addressed with blake3 over sled, so an identical IR plus identical input data produces an identical, cacheable result. Reproducibility stops being a wish and becomes a property.

The runtime is sig_runtime. It executes the operator graph on Polars/Arrow columns with Rayon parallelism and SIMD kernels where the CPU supports AVX2 or AVX-512. It ships 120+ operators (zscore, rank, rolling_mean, ema, rsi, macd, atr, vwap, neutralize, winsor, the long-short and vol-scale weight constructors, the tc.bps and slippage models). New operators are normal Rust functions with a defined input/output shape.
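The post says new operators are normal Rust functions with a defined input/output shape. As an illustration only, here are plain-Rust sketches of three operators from that list. The signatures are hypothetical, not sigc's operator API, which runs on Polars/Arrow columns rather than slices; `winsor` here clips to fixed bounds, although quantile-based winsorisation is also common.

```rust
/// Cross-sectional z-score of one date's values (population std).
/// Returns all zeros when the cross-section has no dispersion.
pub fn zscore(xs: &[f64]) -> Vec<f64> {
    if xs.is_empty() {
        return Vec::new();
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let sd = var.sqrt();
    if sd == 0.0 {
        return vec![0.0; xs.len()];
    }
    xs.iter().map(|x| (x - mean) / sd).collect()
}

/// Clip values to [lo, hi] (fixed-bound winsorisation; panics if lo > hi).
pub fn winsor(xs: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    xs.iter().map(|x| x.clamp(lo, hi)).collect()
}

/// Trailing rolling mean over `window` periods; None until the window is full.
pub fn rolling_mean(xs: &[f64], window: usize) -> Vec<Option<f64>> {
    assert!(window > 0, "window must be positive");
    let mut out = Vec::with_capacity(xs.len());
    let mut sum = 0.0;
    for t in 0..xs.len() {
        sum += xs[t];
        if t >= window {
            sum -= xs[t - window]; // drop the value that left the window
        }
        out.push(if t + 1 >= window { Some(sum / window as f64) } else { None });
    }
    out
}
```

The rolling mean keeps a running sum for O(n) cost; over very long series, recomputing each window or using compensated summation limits floating-point drift.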

The CLI is sigc. sigc run momentum.sig is the research command; it prints Total Return, Sharpe, Max Drawdown, and Turnover. sigc daemon is the production command; it is the same binary, holding the cache, answering compile and run requests over nng on tcp://127.0.0.1:7240.
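`sigc run` reports four numbers. The post does not give sigc's exact definitions, so this Rust sketch uses common ones, all of them assumptions: compounded total return; Sharpe as the mean over the sample standard deviation of daily returns, annualised by √252 with a zero risk-free rate; maximum drawdown as the largest peak-to-trough fall of the equity curve; and turnover as the average sum of absolute weight changes between rebalances (some desks halve this figure). `Metrics` and `metrics` are hypothetical names.

```rust
/// Summary statistics for one backtest run.
#[derive(Debug)]
pub struct Metrics {
    pub total_return: f64,
    pub sharpe: f64,
    pub max_drawdown: f64,
    pub turnover: f64,
}

/// `daily_returns[t]` is the portfolio return on day t; `weights[t]` is the
/// weight vector held after rebalance t.
pub fn metrics(daily_returns: &[f64], weights: &[Vec<f64>]) -> Metrics {
    // Compounded total return over the whole period.
    let total_return = daily_returns.iter().fold(1.0_f64, |acc, r| acc * (1.0 + *r)) - 1.0;

    // Annualised Sharpe: mean / sample std * sqrt(252), zero risk-free rate.
    let n = daily_returns.len();
    let sharpe = if n < 2 {
        0.0
    } else {
        let mean = daily_returns.iter().sum::<f64>() / n as f64;
        let var = daily_returns.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n - 1) as f64;
        if var > 0.0 { mean / var.sqrt() * 252f64.sqrt() } else { 0.0 }
    };

    // Max drawdown: largest fall from a running peak, as a positive fraction.
    let (mut equity, mut peak, mut max_drawdown) = (1.0f64, 1.0f64, 0.0f64);
    for r in daily_returns {
        equity *= 1.0 + *r;
        peak = peak.max(equity);
        max_drawdown = max_drawdown.max(1.0 - equity / peak);
    }

    // Turnover: average of sum_i |w[t][i] - w[t-1][i]| over consecutive rebalances.
    let changes: Vec<f64> = weights
        .windows(2)
        .map(|pair| pair[0].iter().zip(&pair[1]).map(|(a, b)| (b - a).abs()).sum::<f64>())
        .collect();
    let turnover = if changes.is_empty() {
        0.0
    } else {
        changes.iter().sum::<f64>() / changes.len() as f64
    };

    Metrics { total_return, sharpe, max_drawdown, turnover }
}
```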

That is the whole pitch. If you have rewritten a backtester before, none of this is news. If you are about to rewrite one, this is the thing you would build anyway.

When you should still write it yourself

There are good reasons to not adopt a tool like sigc. If your strategies are tick-level event-driven and you need order-book microstructure modelling rather than daily-or-coarser signal-to-weight pipelines, sigc is the wrong altitude. If your team’s competitive advantage is a proprietary cache layout or scheduler, the open-source one is by definition not differentiated. If your stack is heavily Python and you have no tolerance for a Rust binary in the deployment, the impedance mismatch is real and pysigc bindings are a partial answer at best.

We think those are honest reasons. We also think they cover fewer teams than people assume.

What we want sigc to be

The goal is not to be the only quant DSL. Other teams will keep building their own, and they should: the in-house version always wins on integration with the rest of the in-house stack. The goal is to be the default version, the thing you reach for when you are starting a new desk and do not yet know which corners you will care about. Postgres is the default relational database in the same way: not because it is the best for every workload, but because it gets out of the way until you have a reason to leave it.

If you are about to write your tenth backtester, try ours first. cargo install sigc. Read the quickstart. Write one .sig file. See if it makes the next bug less likely.