Article
Jun 17, 2024

Parsing Python ASTs 20x Faster with Rust

Parsing Python ASTs 20x Faster with Rust

Tach was too slow. It’s a CLI tool that lets you define and enforce import boundaries between Python modules in your project. But when it ran on Sentry’s codebase (about 3k Python files), it took almost 10 seconds!

For reference, ruff runs in well under 1 second on the same set of files. Given that ruff is responsible for a lot more linting than our tool, this was unacceptable. Tach is meant to ‘shift left’ architectural concerns as much as possible, and users want to have Tach running checks in their IDE continuously. Developers don’t wait around for 10 seconds after saving a file to see if it passes lint checks.

The plan? Clearly we need to rewrite it in Rust. 🦀

If you want to see the final results, jump straight to the summary!

Profiling

First I needed to figure out what was taking so long. It’s a well-established meme that tools written in Rust are faster, but I wanted to know specifically why our Python implementation was so slow.

The snippet below shows the core logic behind tach check. You can see that it’s a relatively straightforward loop over all the relevant Python files in the project, checking each ‘project import’ against the project’s configuration.

Python implementation of 'tach check' (version 0.2.6)

Using py-spy and speedscope for profiling, I got the flamegraph below, showing a pretty clear culprit taking up most of the runtime.

'py-spy record -f speedscope -o profile.speedscope -- tach check' on speedscope.app

Roughly 90% of the total time is taken by get_project_imports , with about 2/3rds of that time spent parsing ASTs (parse on the far left) and the remaining 1/3rd spent traversing them (visit on the right). The code for this method is shown below.

Python implementation of 'get_project_imports' (version 0.2.6)

The slowness is already a bit of a surprise, since the ast library uses a C extension to implement the AST parsing. Also, the vast majority of the time spent traversing ASTs appears to be spent on essentially no-ops, since we only have a small amount of logic for checking Import, ImportFrom, and If nodes. It’s also worth noting that the AST visiting logic is entirely in Python (via subclassing ast.NodeVisitor).

So why is the C implementation so slow? The short answer is that Python is forcing it to be slow. For the long answer, we’ll have to dig into the code!

Why is C so slow?

Instead of dragging you along the meandering path that I took, I’m going to give you the answer immediately and then show you the evidence. So what is taking the C implementation so long? Building PyObjects.

In retrospect, this difference was staring me in the face the whole time. The ast module fundamentally has more work to do than an isolated extension. Not only does it need to tokenize the Python source and then build an AST, it must also return that AST to the Python interpreter.

Let’s look at some of the relevant source code from CPython.

This function turns Python source code into an ast.AST object

If you haven’t read much C (like me), it’s probably not clear what you’re looking at. The important points to see here are:

  • _PyParser_ASTFromString turns an input string into a native mod_ty struct representing the AST
  • PyAST_mod2obj turns the mod_ty struct into a PyObject which is returned as the result
  • We can ignore the final three lines since we will have the PyCF_ONLY_AST flag set when calling ast.parse

When I run this through Valgrind with all of Sentry’s python code as input, the results show a massive number of memory allocations (malloc) and a significant amount of time spent on garbage collection. This is when ‘everything is an object’ really bites you.

Valgrind output showing ~59M mallocs, ~3k GCs, ~40M traversal calls

The primary benefit we get from extracting get_project_imports to Rust is not faster tokenization or AST building. The key advantage is that we can keep the entire AST in compact Rust structs instead of allocating millions of PyObjects. Plus we essentially avoid GC traversals for free!

Rewriting it in Rust

Now that the performance issues were thoroughly understood, I just needed to figure out how to develop, build, and distribute a Rust extension alongside Tach. After a quick search, I found PyO3 and maturin to be stable tools which would handle these concerns for me.

To make things simple, I followed the directory structure suggested in the docs, and updated my pyproject.toml to use maturin as the build backend. This let me use maturin develop to build my Rust extension locally and make it available to the rest of my Python code.

Configuration for maturin in pyproject.toml
Cargo.toml configuration

This was also the first time I needed to build wheels and distribute them through PyPI, but luckily maturin can generate its own build workflow in Github Actions! It automatically builds wheels that target a variety of OS/architecture combinations, which means users don’t need a Rust toolchain to build our package.

For the rewrite itself, I only needed to find a suitable Python AST parser and visitor implementation, and the rest of my ‘business logic’ and glue code could be translated pretty much directly from Python into Rust.

AST Parsing is a Solved Problem?

As problems go, parsing an AST for a popular language like Python seems extraordinarily well-scoped. That said, it’s clearly not trivial, and I found that there are not a ton of options for an off-the-shelf Python AST parser written in Rust. My first choice was to use the RustPython parser, since it was published as a standalone crate, and was used by ruff initially. I actually tried using this parser first, but had to rip it out after I discovered it failed to parse some f-string syntax in the CPython codebase. It also seems less actively maintained, and may not support newer versions of Python.

I had two realistic options:

I chose the second option, and ended up vendoring ruff’s parser and ast crates directly into the repo as path dependencies. Then (much later) I realized that Cargo actually supports this natively through git dependencies! This feature let me specify individual crates within the ruff Cargo workspace as dependencies, and everything just worked. As a Python dev it didn’t even occur to me that a package manager might be able to reference a subset of a git repo intelligently.

Luckily, the interface for the ruff parser was roughly the same as the RustPython parser, so switching implementations was straightforward. Compared to Tach’s original Python implementation, the new Rust implementation looks almost identical, but the performance is totally different.

Here’s what the final logic looked like in the Rust extension:

Rust implementation of 'get_project_imports'

The Results

Before the rewrite, the flamegraph of tach check looked like this:

Before: 'tach check' flamegraph

After extracting get_project_imports to Rust, the flamegraph looks like this:

After: 'tach check' flamegraph

The cumulative runtime of get_project_imports went from 8.7s to 530ms (a 16x speedup in this test, although I have seen it as high as 20x).

This leaves us with a roughly 1s total runtime, a 10x speedup overall!

With Valgrind we can also see that we got rid of the massive GC pauses, and dramatically reduced the memory pressure. While the native ast module made almost 60M malloc calls and spent 35% of its cycles on garbage collection, the new implementation shows no significant GC activity and makes only ~7M malloc calls (with estimated cycle count far below the actual lexing, tokenization, and AST building code).

Before: ~59M mallocs, ~3k GCs, ~40M traversal calls
After: ~7M mallocs, no GC pauses

These wins have come from essentially a direct translation of the Python logic into Rust, which is just the first step in this optimization journey. In the future I’ll be grabbing a few other low-hanging performance wins:

  • CPU parallelism, each core can check imports at the same time (potentially 8x speedup on a modern machine)
  • caching previous import information to avoid re-parsing every file
  • move configuration parsing to Rust (parse_project_config is taking over 100ms to parse a tiny yml file!!)
  • optimize Rust version of get_project_imports to make fewer allocations

Digging through CPython was a satisfying peek behind the curtain on a tool I’ve used for years. It reminds me how much is hidden in day-to-day Python programming, and gave me an excuse to learn a lot more about how the program actually runs at a lower level. It’s also great to get more familiar with powerful profiling tools like py-spy, speedscope, and Valgrind. I hope you enjoyed following the journey!

And if you’re working on a Python codebase that would benefit from better module boundaries (do your teammates keep screwing up your architecture?) then give Tach a try! We’re actively improving it on Github and are available to chat in Discord.