Decompression is up to 30% faster in CPython 3.15

tl;dr
As of Python 3.15, compression.zstd is the fastest set of Python Zstandard bindings. Changes to the code managing output buffers have led to a 25-30% performance uplift for Zstandard decompression and a 10-15% uplift for zlib decompression on data at least 1 MiB in size. This has broad implications, from faster wheel installation with pip to many other use cases.

Motivation

Since landing Zstandard support in CPython, I have wanted to explore the performance of CPython's compression modules to ensure they are well-optimized. Furthermore, the maintainer of pyzstd and backports.zstd (a backport of compression.zstd to Python versions before 3.14) benchmarked the new compression.zstd module against third-party Zstandard Python bindings such as pyzstd, zstandard, and zstd, and found that the standard library was slower than most of the other bindings!

Let's take a closer look at the benchmarks and how to read them:

The figures give a timing comparison. For example, +42% means that the library needs 42% more time than stdlib/backports.zstd. The reference time column indicates the average time for a single run.

Emoji scale: ❤️‍🩹 -25% 🟥 -15% 🔴 -5% ⚪ +5% 🟢 +15% 🟩 +25% 💚

Okay, so hopefully we don't see a lot of red, which would mean the reference standard library (stdlib) is slower...

CPython 3.14.0rc3

Case | stdlib | pyzstd | zstandard | zstd
---- | ------ | ------ | --------- | ----
compress 1k level 3 | <1ms | ⚪ - 3.81% | ⚪ - 1.17% | 🟢 + 5.86%
compress 1k level 10 | <1ms | ⚪ + 1.91% | 🟢 + 6.18% | 🟢 + 9.83%
compress 1k level 17 | <1ms | 🟢 + 6.33% | 🟢 + 7.67% | 🟢 +12.92%
compress 1M level 3 | 7ms | ⚪ + 0.60% | 🔴 - 7.37% | 🟢 +12.08%
compress 1M level 10 | 27ms | 🟢 +10.39% | ⚪ + 3.39% | 🟢 +12.46%
compress 1M level 17 | 174ms | ⚪ - 2.48% | ⚪ - 3.91% | ⚪ + 0.08%
compress 1G level 3 | 6.03s | 🟩 +16.17% | ⚪ - 2.94% | ⚪ + 2.25%
decompress 1k level 3 | <1ms | 🟥 -15.14% | 🔴 - 8.53% | ⚪ - 2.37%
decompress 1k level 10 | <1ms | 🟥 -15.41% | 🔴 - 9.22% | ⚪ - 3.35%
decompress 1k level 17 | <1ms | 🔴 -11.16% | 🔴 - 7.09% | ⚪ + 2.07%
decompress 1M level 3 | 1ms | 🔴 - 6.88% | ⚪ - 4.03% | 💚 +26.88%
decompress 1M level 10 | 1ms | 🔴 - 6.69% | ⚪ - 4.86% | 💚 +25.63%
decompress 1M level 17 | 1ms | 🔴 - 7.99% | ⚪ - 4.96% | 💚 +25.58%
decompress 1G level 3 | 1.49s | 🟥 -19.41% | 🟥 -17.58% | 🟢 + 6.98%
decompress 1G level 10 | 1.62s | ❤️‍🩹 -27.65% | ❤️‍🩹 -26.48% | 🔴 - 6.92%
decompress 1G level 17 | 1.67s | 🟥 -24.01% | 🟥 -23.04% | ⚪ - 4.43%

Ouch. 10-25% slower is quite unfortunate! A silver lining is that most of the performance difference is in decompression, so that narrows the area that is in need of optimization.

After sitting down and thinking about it for a while, I came up with a few theories as to why compression.zstd would be slower compared to pyzstd and zstandard. My thinking focused on differences I knew existed between the implementations of the various bindings. First, both pyzstd and zstandard build against their own copies of libzstd (the C library implementing Zstandard compression and decompression). Meanwhile, CPython builds against the system-installed libzstd, which is older on my system. Maybe there is a performance improvement in the newer libzstd versions? Second, most of the performance difference is in decompression speed. Perhaps the implementation of compression.zstd.decompress() is inefficient? It uses multiple decompression instances to handle multi-frame input where pyzstd uses one, so perhaps that's the issue? Finally, maybe the handling of output buffers is slow? When decompressing data, CPython needs to provide an output buffer (a location in memory to write to) to store the uncompressed data. If the creation/allocation of that output buffer is slow, it could bottleneck the decompressor.

Premature Optimizations

These attempted optimizations didn't work, so if you'd like to skip ahead to the one that did, please move on to the next section!

I decided to tackle these one at a time. First, I built pyzstd and zstandard against the system libzstd. Unfortunately, after re-running the benchmark, this yielded zero performance difference. Darn.

Next, I was pretty confident that compression.zstd.decompress() was at least partially responsible for the worse performance. The current decompress() implementation is written in Python and creates multiple decompression contexts, joining the results together. Surely that had to lead to some performance degradation? I ended up re-implementing the decompress() function in C using a single decompression context to see if my theory was correct. To my chagrin, there was no performance uplift, and it may have even performed worse! For the curious, you can see my hacked-together branch here. It goes to show that you can never be sure where a performance bottleneck is just by reading the code!

Properly Profiling CPython

With my first two attempts at optimizing Zstandard decompression in CPython unsuccessful, I realized that I should do what I probably should have done from the beginning: profile the code! I decided to use the standard library support for the perf profiler, as it would allow me to see both native/C frames such as inside libzstd or the bindings module _zstd, as well as Python frames.

So I went ahead and compiled CPython with some flags to improve perf data and ran a simple script which called compression.zstd.decompress() on a variety of data sizes. I highly recommend reading the Python documentation about perf support for more details but essentially what I ran was:

# in a cpython checkout: keep frame pointers so perf can unwind native stacks
./configure --enable-optimizations --with-lto CFLAGS="-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer"
make -j$(nproc)
# run the script under perf; -X perf enables CPython's perf trampoline so Python frames show up
cd ../compression-benchmarks
perf record -F 9999 -g -o perf.data ../cpython/python -X perf profile_zstd.py

After analyzing the profile with perf report --stdio -n -g, I noticed a significant bottleneck in the output buffer management code! Let's take a brief detour to discuss what the output buffer management code does and why it was the decompression bottleneck.

(Fast) Buffer Handling is Hard

When decompressing data, you feed the decompressor (libzstd in our case) a buffer (bytes in Python) that is then decompressed and needs to be written to a new buffer. Since this all happens in C, we need to allocate some memory for libzstd to write the decompressed data into. But how much memory? Well, in many cases, we don't know! So we need to dynamically resize the output buffer as it fills up.
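
To make that concrete, here is a minimal sketch (my own illustration, not CPython's actual code) of a streaming decompression loop using libzstd's C API. Note how the caller owns the output buffer and has to do something with whatever lands in it on every iteration:

/* Minimal illustration of streaming decompression with libzstd.
 * Error handling and end-of-frame bookkeeping trimmed for brevity;
 * a real binding stores the output somewhere that grows, it does not
 * write to stdout. */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int decompress_to_stdout(const void *src, size_t src_size)
{
    ZSTD_DCtx *dctx = ZSTD_createDCtx();
    if (dctx == NULL) {
        return -1;
    }

    size_t out_capacity = ZSTD_DStreamOutSize();  /* recommended chunk size */
    void *out = malloc(out_capacity);
    if (out == NULL) {
        ZSTD_freeDCtx(dctx);
        return -1;
    }

    ZSTD_inBuffer input = { src, src_size, 0 };
    while (input.pos < input.size) {
        ZSTD_outBuffer output = { out, out_capacity, 0 };
        size_t ret = ZSTD_decompressStream(dctx, &output, &input);
        if (ZSTD_isError(ret)) {
            break;
        }
        /* output.pos bytes of decompressed data now sit in `out`. */
        fwrite(out, 1, output.pos, stdout);
    }

    free(out);
    ZSTD_freeDCtx(dctx);
    return 0;
}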

Managing this output buffer is actually a pretty challenging problem because there are several constraints and trade-offs to consider. The buffer management needs to be fast for a wide range of output sizes. If you allocate too much memory up front, you'll waste time allocating unused memory and slow down decompression of small amounts of data. On the other hand, if you don't allocate enough, you'll have to make a lot of calls to the allocator, which will also slow things down, as each allocation has overhead and fragments the output data. The memory use should not grow exponentially for large outputs, otherwise you could run out of memory for tasks that would normally fit into memory. Finally, each output from the decompressor can vary in size, given that it may need to buffer data internally.
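
One common way to balance these constraints is to pick allocation sizes from a schedule that starts small and stops growing past a cap. The block sizes below are purely illustrative (the real table in pycore_blocks_output_buffer.h uses its own values), but they show the shape of the idea:

/* Illustrative only: a growth schedule that starts small so tiny outputs stay
 * cheap, then levels off so large outputs don't grow exponentially.
 * The real schedule in pycore_blocks_output_buffer.h uses different values. */
#include <stddef.h>

static const size_t BLOCK_SIZES[] = {
    32 * 1024,          /* first block: covers most small outputs */
    64 * 1024,
    256 * 1024,
    1024 * 1024,
    4 * 1024 * 1024,
    16 * 1024 * 1024,
    32 * 1024 * 1024,   /* cap: every later block stays this size */
};

static size_t
next_block_size(size_t block_index)
{
    size_t last = sizeof(BLOCK_SIZES) / sizeof(BLOCK_SIZES[0]) - 1;
    return BLOCK_SIZES[block_index < last ? block_index : last];
}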

Because of the complexity in managing an output buffer, there is code shared across the compression modules in CPython to manage the buffer. This code lives in pycore_blocks_output_buffer.h. The code was modified four years ago to use an implementation which writes to a series of bytes objects stored in a list to hold the output of decompress calls. When finished, the bytes objects get concatenated together in _BlocksOutputBuffer_Finish, returning the final bytes object containing the decompressed data. When profiling Zstandard decompression, I found that greater than 50% (!) of decompression time was spent in _BlocksOutputBuffer_Finish! This seemed inordinately long; ideally this function should just be a few memcpys. So with this knowledge in hand, I tried to think of how best to optimize the output buffer code.
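
Conceptually (this is a simplified sketch of my own, not the actual _BlocksOutputBuffer_Finish code), the finishing step has to allocate one final bytes object and copy every block into it:

/* Simplified sketch of the "join all blocks" step; error handling trimmed.
 * Assumes every block in the list has already been shrunk to its filled
 * length, so total_size is exactly the sum of the block sizes. */
#include <Python.h>
#include <string.h>

static PyObject *
join_blocks(PyObject *blocks /* list of bytes */, Py_ssize_t total_size)
{
    PyObject *result = PyBytes_FromStringAndSize(NULL, total_size);
    if (result == NULL) {
        return NULL;
    }
    char *dst = PyBytes_AS_STRING(result);
    for (Py_ssize_t i = 0; i < PyList_GET_SIZE(blocks); i++) {
        PyObject *block = PyList_GET_ITEM(blocks, i);
        memcpy(dst, PyBytes_AS_STRING(block), (size_t)PyBytes_GET_SIZE(block));
        dst += PyBytes_GET_SIZE(block);
    }
    return result;
}

For large outputs that is a second full copy of every byte the decompressor already wrote once, which lines up with so much of the time showing up in this one function.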

Sometimes Timing Works Out

Right around the time that I was working on this, PEP 782 was accepted. This PEP introduces a new PyBytesWriter API to CPython which makes it easier to incrementally build up bytes data in a safe and performant way at the Python C API level. It seemed like a natural fit for what the blocks output buffer code was doing, so I wanted to experiment with using it for the output buffer code; a rough sketch of the idea is below. After modifying pycore_blocks_output_buffer.h to use PyBytesWriter, I re-ran the original benchmark to see if we had closed the performance gap (results follow the sketch):
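
Here is that rough sketch of a decompression loop written against PyBytesWriter. The function names follow PEP 782 as I read it (the shipped API may differ slightly), the DecompressorState type and the run_decompressor/decompressor_finished helpers are hypothetical stand-ins for the real libzstd calls, and this is not the actual code that landed in pycore_blocks_output_buffer.h:

/* Rough sketch only: a single growable buffer managed by PyBytesWriter,
 * instead of a list of bytes blocks that must be joined at the end.
 * API names follow PEP 782 and may differ slightly in the final release. */
#include <Python.h>

typedef struct DecompressorState DecompressorState;                /* hypothetical */
static Py_ssize_t run_decompressor(DecompressorState *state,
                                   char *dst, Py_ssize_t avail);   /* hypothetical */
static int decompressor_finished(const DecompressorState *state);  /* hypothetical */

static PyObject *
decompress_with_writer(DecompressorState *state)
{
    Py_ssize_t used = 0;
    Py_ssize_t capacity = 32 * 1024;  /* small initial guess */
    PyBytesWriter *writer = PyBytesWriter_Create(capacity);
    if (writer == NULL) {
        return NULL;
    }

    for (;;) {
        /* Hand the unused tail of the buffer to the decompressor. */
        char *buf = (char *)PyBytesWriter_GetData(writer);
        used += run_decompressor(state, buf + used, capacity - used);

        if (decompressor_finished(state)) {
            break;
        }
        /* Out of room: grow the one buffer in place. The real code picks the
         * increment from a capped schedule rather than a fixed rule. */
        capacity += (capacity < 32 * 1024 * 1024) ? capacity : 32 * 1024 * 1024;
        if (PyBytesWriter_Resize(writer, capacity) < 0) {
            PyBytesWriter_Discard(writer);
            return NULL;
        }
    }

    /* Shrink to the bytes actually produced and build the final object. */
    if (PyBytesWriter_Resize(writer, used) < 0) {
        PyBytesWriter_Discard(writer);
        return NULL;
    }
    return PyBytesWriter_Finish(writer);
}

With this shape there is no list of blocks to join at the end, which is exactly where the profile said the time was going.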

Note: this benchmark was run on my local machine and the wall times are not comparable to the previous benchmark.

Case | stdlib | zstandard
---- | ------ | ---------
compress 1k level 3 | <1ms | 💚 +61.02%
compress 1k level 10 | <1ms | 💚 +57.77%
compress 1k level 17 | <1ms | 💚 +364.86%
compress 1M level 3 | 5ms | 💚 +40.02%
compress 1M level 10 | 32ms | ⚪ - 0.99%
compress 1M level 17 | 126ms | 🟩 +15.93%
compress 1G level 3 | 4.47s | 💚 +48.69%
decompress 1k level 3 | <1ms | ⚪ + 4.67%
decompress 1k level 10 | <1ms | ⚪ + 4.79%
decompress 1k level 17 | <1ms | 🟢 + 5.38%
decompress 1M level 3 | 1ms | 💚 +50.23%
decompress 1M level 10 | 1ms | 💚 +41.94%
decompress 1M level 17 | 1ms | 💚 +47.37%
decompress 1G level 3 | 1.80s | 🟢 +12.87%
decompress 1G level 10 | 1.77s | 🟢 +12.54%
decompress 1G level 17 | 1.80s | 🟢 + 8.76%

WOW! Not only have we closed the gap, compression.zstd is now faster than the popular zstandard 3rd-party module.

Validating Our Results

Wanting to validate the speedup, I decided at this point to write my own minimal benchmark suite to compare revisions of the standard library code, using pyperf, the benchmarking toolkit used in the venerable pyperformance benchmark suite.

So I went ahead and wrote up a benchmark for zstd which tests compression and decompression using default parameters for sizes 1 KiB, 1 MiB, and 1 GiB. I ran these benchmarks on main and my branch which uses PyBytesWriter.

zstd.compress(1K): Mean +- std dev: [main_zstd_3] 3.01 us +- 0.03 us -> [pybyteswriter_zstd_3] 3.00 us +- 0.03 us: 1.01x faster
zstd.compress(1M): Mean +- std dev: [main_zstd_3] 2.92 ms +- 0.02 ms -> [pybyteswriter_zstd_3] 2.89 ms +- 0.02 ms: 1.01x faster
zstd.compress(1G): Mean +- std dev: [main_zstd_3] 2.72 sec +- 0.01 sec -> [pybyteswriter_zstd_3] 2.67 sec +- 0.01 sec: 1.02x faster
zstd.decompress(1K): Mean +- std dev: [main_zstd_3] 1.40 us +- 0.01 us -> [pybyteswriter_zstd_3] 1.38 us +- 0.01 us: 1.01x faster
zstd.decompress(1M): Mean +- std dev: [main_zstd_3] 734 us +- 4 us -> [pybyteswriter_zstd_3] 546 us +- 3 us: 1.34x faster
zstd.decompress(1G): Mean +- std dev: [main_zstd_3] 790 ms +- 4 ms -> [pybyteswriter_zstd_3] 634 ms +- 3 ms: 1.25x faster

Geometric mean: 1.10x faster

For inputs of at least 1 MiB, that's 25-30% faster decompression! In hindsight, this actually makes sense if you consider that libzstd's decompression implementation is exceptionally fast. lzbench, a popular compression library benchmark, found that libzstd can decompress data at greater than 1 GiB/s. This is much faster than bz2, lzma, or zlib, the other compression modules in the standard library. One of the motivations for adding Zstandard to CPython was its performance. So it is not too surprising that the output buffer code would be a bottleneck, given that the existing compression libraries don't write to the output buffer as quickly. This also explains why compression isn't faster after changing the output buffer code: compression is very CPU intensive, so more time is spent in the compressor rather than writing to the output buffer. It also explains why the speedup is non-existent for decompressing 1 KiB of data - the first 32 KiB block that is allocated is plenty to store all of the output data, meaning all of the time is spent in the decompressor.

One final validation I wished to do was to check the performance of zlib, to ensure that the change did not regress performance for other standard library compression modules. I wrote a similar benchmark for zlib to the one I wrote for zstd, and found that there was also a performance increase with the output buffer change!

zlib.compress(1M): Mean +- std dev: [main] 13.5 ms +- 0.1 ms -> [pybyteswriter] 13.4 ms +- 0.0 ms: 1.00x faster
zlib.compress(1G): Mean +- std dev: [main] 11.4 sec +- 0.0 sec -> [pybyteswriter] 11.3 sec +- 0.0 sec: 1.00x faster
zlib.decompress(1K): Mean +- std dev: [main] 1.42 us +- 0.01 us -> [pybyteswriter] 1.39 us +- 0.01 us: 1.02x faster
zlib.decompress(1M): Mean +- std dev: [main] 1.29 ms +- 0.00 ms -> [pybyteswriter] 1.17 ms +- 0.00 ms: 1.10x faster
zlib.decompress(1G): Mean +- std dev: [main] 1.36 sec +- 0.00 sec -> [pybyteswriter] 1.17 sec +- 0.00 sec: 1.17x faster

Benchmark hidden because not significant (1): zlib.compress(1K)

Geometric mean: 1.05x faster

10-15% faster decompression on data of at least 1 MiB for zlib is pretty significant, especially when you consider that zlib is used by pip to unpack files in almost every wheel package Python users install.

Conclusion

With the improvements to output buffer handling, I was able to improve not only the performance of compression.zstd, but the decompression code of all of the compression modules. After stumbling over a few failed optimization ideas, I definitely learned my lesson to profile code before jumping to conclusions! You won't know what the real bottleneck is unless you can measure it - just having a benchmark is not enough!

The original issue I opened goes into a bit more detail about the process of benchmarking the compression modules, and the commit with the improvement has the diff of changes to adopt PyBytesWriter. One thing I'm proud of is that the change not only improved performance, it also simplified the implementation of the output buffer code and removed 60 lines of code in the process!

I did some more profiling of zlib to see if there were any more performance gains to be made, but the profile I gathered seems to indicate that 95+% of the time is spent in zlib's inflate implementation (with the rest in the CPython VM), so there is little if any room for further optimization in CPython's bindings for zlib. I think this is good, as it indicates Python users are getting the best performance they can in 3.15!

Going forward, I am planning on profiling the compression code more, but the vast majority of the time there will probably be spent in the compressor, since compression is so CPU intensive. Finally, I want to investigate optimizations related to providing more information about the final size of the output data. In some cases the output buffer is initialized to a small value and dynamically resized as output is produced, but ideally users would be able to provide a hint about their workload and see a performance improvement as a result. I have a lot of other compression-related ideas I'd like to work on - check out my OSS TODO list for all of the random things I want to tackle in the future!

Finding a miscompilation in Rust/LLVM

Among my friends I have a reputation for stumbling across esoteric error messages. Whether that is SSL read: I/O error: Success (caused by a layered SSH connection hanging up on Windows), or the time I tried installing NixOS on my laptop and os-prober failed to start (this was several years ago, so I am sure it is no longer an issue). I attribute these oddities to my curiosity, particularly around trying things that may or may not work and seeing if they do. Recently, I was trying to complete an item from my OSS TODO list when I came across a bug that stumped me for several days. Turns out sometimes even compilers have bugs...

My goal was to build CPython with Rust implementations of common compression libraries to see if the Rust libraries could be supported. CPython relies on C code for many performance-sensitive tasks such as math and compression. I had recently read about the Trifecta Tech Foundation's initiative to rewrite popular compression libraries in Rust. As of September 2025, they have pure-Rust re-implementations of zlib (the library used for zip and gzip files) and bzip2 available for use.

These Rust libraries not only bring increased memory safety, they're also as fast as or faster than their C counterparts. Additionally, zlib-rs is widely deployed in Firefox, to the point that it may have tripped over a CPU hardware bug(!). So I had confidence that at least zlib-rs would work out of the box.

To add support for these libraries to CPython, I made a branch with changes to the autoconf script to search for the Rust libraries through pkg-config. I built zlib-rs's C library with RUSTFLAGS="-Ctarget-cpu=native" for maximum speed, and then pointed CPython's build process at the built zlib_rs library. Everything built just fine. Next, I wanted to run the CPython zlib test suite to verify zlib-rs was working correctly. I mostly did this to make sure I had built things properly; I had no doubts the tests would pass.

[Screenshot of test failures: the test_wbits and test_combine_no_iv tests in test_zlib failed.]

And yet. I was shocked! zlib-rs is used in Firefox, cargo, and many other widely used tools and applications. Hard to believe it would have a glaring bug that would be surfaced by CPython's test suite. At first I assumed I had somehow made a mistake when building. I realized I had used my system zlib header when building, so maybe there was some weirdness with symbol compatibility?? No, re-building CPython pointing to the zlib-rs include directory didn't fix it. I tried running cargo test in the zlib-rs directory to make sure there wasn't something wrong I could catch there. No failures occurred.

At this point I was convinced it was probably a bug with how I was building things, or a bug in the cdylib (Rust lingo for a C-compatible dynamic library) wrapping zlib-rs, since the Rust tests passed but the tests in CPython failed. To make my testing simpler, I captured the state of the test_zlib.test_combine_no_iv test using pdb and wrote a C program which does the same thing as the test, with deterministic inputs:

#include <stdio.h>
#include <string.h>
#include "zlib.h"

int main()
{
    unsigned char a[32] = {0x88, 0x64, 0x15, 0xce, 0x5e, 0x3b, 0x8d, 0x35,
                        0xdb, 0xd2, 0xb5, 0xfa, 0x8e, 0xa7, 0x73, 0x10,
                        0x66, 0x83, 0x1b, 0xd1, 0xde, 0x0f, 0x25, 0x86,
                        0xeb, 0xe5, 0x42, 0x44, 0xad, 0x62, 0xff, 0x11};
    uInt chk_a = crc32(0, a, 32);
    unsigned char b[64] = {0x31, 0xb8, 0xce, 0x94, 0x4d, 0x2b, 0xb9, 0x7e,
                        0xd5, 0x81, 0x7f, 0xc2, 0x40, 0xbf, 0x3d, 0xa5,
                        0x25, 0xa5, 0xf9, 0xdf, 0x53, 0x68, 0xc4, 0xf6,
                        0xbe, 0x06, 0x7d, 0xf3, 0xc7, 0xdc, 0x5b, 0x84,
                        0xce, 0xd2, 0xb2, 0xeb, 0x87, 0x62, 0x60, 0xe3,
                        0x10, 0x05, 0x64, 0x59, 0x15, 0xc4, 0x2d, 0x78,
                        0xc8, 0xf3, 0x14, 0x38, 0x87, 0x39, 0xb3, 0x58,
                        0xb5, 0x95, 0x07, 0x25, 0xd9, 0xc1, 0xac, 0x04};
    uInt chk_b = crc32(0, b, 64);
    unsigned char buff[96];
    memcpy(buff, a, 32);
    memcpy(buff + 32, b, 64);
    uInt chk = crc32(0, buff, 96);
    uInt chk_combine = crc32_combine(chk_a, chk_b, 64);
    printf("chk (%u) = chk_combine (%u)? %s\n", chk, chk_combine, chk == chk_combine ? "True" : "False");
    return (0);
}

This program also failed. Hm, okay, not an issue with CPython at least. I then translated the above test into Rust to add to the zlib-rs test suite, since the Rust tests passed. If it failed I could more easily debug the issue.

diff --git a/zlib-rs/src/crc32/combine.rs b/zlib-rs/src/crc32/combine.rs
index 40e3745..65c0143 100644
--- a/zlib-rs/src/crc32/combine.rs
+++ b/zlib-rs/src/crc32/combine.rs
@@ -66,6 +66,26 @@ mod test {

    use crate::crc32;

+    #[test]
+    fn test_crc32_combine_no_iv() {
+        for _ in 0..1000 {
+            let a: &[u8] = &[0x88, 0x64, 0x15, 0xce, 0x5e, 0x3b, 0x8d, 0x35, 0xdb, 0xd2, 0xb5, 0xfa, 0x8e, 0xa7, 0x73, 0x10, 0x66, 0x83, 0x1b, 0xd1, 0xde, 0x0f, 0x25, 0x86, 0xeb, 0xe5, 0x42, 0x44, 0xad, 0x62, 0xff, 0x11];
+            let b: &[u8] = &[0x31, 0xb8, 0xce, 0x94, 0x4d, 0x2b, 0xb9, 0x7e, 0xd5, 0x81, 0x7f, 0xc2, 0x40, 0xbf, 0x3d, 0xa5, 0x25, 0xa5, 0xf9, 0xdf, 0x53, 0x68, 0xc4, 0xf6, 0xbe, 0x06, 0x7d, 0xf3, 0xc7, 0xdc, 0x5b, 0x84, 0xce, 0xd2, 0xb2, 0xeb, 0x87, 0x62, 0x60, 0xe3, 0x10, 0x05, 0x64, 0x59, 0x15, 0xc4, 0x2d, 0x78, 0xc8, 0xf3, 0x14, 0x38, 0x87, 0x39, 0xb3, 0x58, 0xb5, 0x95, 0x07, 0x25, 0xd9, 0xc1, 0xac, 0x04];
+            let both: &[u8] = &[0x88, 0x64, 0x15, 0xce, 0x5e, 0x3b, 0x8d, 0x35, 0xdb, 0xd2, 0xb5, 0xfa, 0x8e, 0xa7, 0x73, 0x10, 0x66, 0x83, 0x1b, 0xd1, 0xde, 0x0f, 0x25, 0x86, 0xeb, 0xe5, 0x42, 0x44, 0xad, 0x62, 0xff, 0x11, 0x31, 0xb8, 0xce, 0x94, 0x4d, 0x2b, 0xb9, 0x7e, 0xd5, 0x81, 0x7f, 0xc2, 0x40, 0xbf, 0x3d, 0xa5, 0x25, 0xa5, 0xf9, 0xdf, 0x53, 0x68, 0xc4, 0xf6, 0xbe, 0x06, 0x7d, 0xf3, 0xc7, 0xdc, 0x5b, 0x84, 0xce, 0xd2, 0xb2, 0xeb, 0x87, 0x62, 0x60, 0xe3, 0x10, 0x05, 0x64, 0x59, 0x15, 0xc4, 0x2d, 0x78, 0xc8, 0xf3, 0x14, 0x38, 0x87, 0x39, 0xb3, 0x58, 0xb5, 0x95, 0x07, 0x25, 0xd9, 0xc1, 0xac, 0x04];
+
+            let chk_a = crc32(0, &a);
+            assert_eq!(chk_a, 101488544);
+            let chk_b = crc32(0, &b);
+            assert_eq!(chk_b, 2995985109);
+
+            let combined = crc32_combine(chk_a, chk_b, 64);
+            assert_eq!(combined, 2546675245);
+            let chk_both = crc32(0, &both);
+            assert_eq!(chk_both, 3010918023);
+            assert_eq!(combined, chk_both);
+        }
+    }
+
    #[test]
    fn test_crc32_combine() {
        ::quickcheck::quickcheck(test as fn(_) -> _);

Running cargo test passed! I was at my wits' end! How could the C code fail but the Rust code succeed??

I felt like I had enough information, so I reported the issue to zlib-rs. Let me interrupt this story to mention that I really want to thank Folkert de Vries (maintainer of zlib-rs) for help debugging this. They were extremely friendly and helpful in figuring out what was going wrong. Folkert responded to my issue that my C program sample worked for them! Why would my machine be any different? I was running in WSL at the time; maybe that could cause weirdness? I decided to write up a Containerfile to ensure I had a clean environment:

FROM ubuntu:24.04

RUN apt-get update && \
    apt-get install -y \
        build-essential \
        curl \
        git \
        pkg-config \
        libssl-dev

RUN curl https://sh.rustup.rs -sSf | bash -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
RUN curl -sSL https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
RUN echo "deb http://apt.llvm.org/noble/ llvm-toolchain-noble-20 main" > /etc/apt/sources.list.d/llvm.list
RUN apt-get update  && apt-get upgrade -y && apt-get install -y clang-20
RUN cargo install cargo-c
RUN mkdir /scratch
RUN git clone https://github.com/trifectatechfoundation/zlib-rs.git /scratch/zlib-rs
COPY ./test.c /scratch/zlib-rs/libz-rs-sys-cdylib/test.c
WORKDIR /scratch/zlib-rs/libz-rs-sys-cdylib
# comment out the next line to avoid the bug
ENV RUSTFLAGS="-Ctarget-cpu=native"
RUN cargo cbuild --release
RUN clang-20 -o test test.c -I ./include/ -static ./target/x86_64-unknown-linux-gnu/release/libz_rs.a
ENV LD_LIBRARY_PATH="target/x86_64-unknown-linux-gnu/release/"
ENTRYPOINT ["./test"]

While experimenting with setting up this container, I found a lead at last! If I compiled with RUSTFLAGS="-Ctarget-cpu=native", the program gave the wrong results. If I compiled without using native code generation, the program worked correctly. Bizarre!!

Backing up a bit, let me explain what RUSTFLAGS="-Ctarget-cpu=native" actually does (if you know already, please skip to the next paragraph). Compilers like rustc have feature flags for each target (i.e. OS + CPU architecture family) which allow them to optionally emit code that uses specific processor features. For example, most x86 processors have SSE2, and ARM64 processors have NEON or SVE. Newer processors usually come with newer features which provide optimized implementations of some useful operation; for example, some x86 processors have optimized implementations of SHA hashing. Since not all computers have every feature, these need to be opted into at compile time. In the case of RUSTFLAGS="-Ctarget-cpu=native" I'm telling Rust "use all the features of my current processor." This is a way to eke out the most performance from a program. But in this case, it meant I had a bug on my hands!

Folkert suggested I try to narrow down exactly which instruction set extension was causing the issue. After a bit of binary searching, I found out it was avx512vl. AVX is an x86 extension providing SIMD instructions, and AVX512-VL (Vector Length) is an extension which allows AVX-512 instructions to operate on 128- and 256-bit vectors in addition to the full 512-bit width. This made a lot of sense in some ways; after all, I have an AMD R9 9950X, and one of its features is AVX512 support! But how exactly did these AVX512 instructions get into the final binary?

NOTE:
As pointed out in a message on Mastodon, AVX512-VL is actually 11 years old! It was first introduced in Intel AVX512 implementations. However, AVX512 support in Rust is relatively new.

So enabling AVX512 was the culprit for the bug in crc32 calculations. Skimming over the zlib-rs code, I was a bit surprised to find that it does not explicitly use AVX-512 anywhere! In fact it uses the older SSE4.1 instruction set (presumably for maximum portability). So why was AVX512-VL causing these issues? Unfortunately, I don't know for sure. But I have a theory.

Rust uses LLVM as its default backend (the bit of the compiler that emits instructions/binaries). LLVM probably realized it could use AVX512-VL instructions (available on my machine) to speed up the SSE4.1 code that zlib-rs is using. However, AVX512-VL support in the compiler is new enough that there was a bug - a miscompilation - and the wrong code was emitted. I haven't found a smoking gun issue, but it is probably one of these.

I am happy to report that this issue does not present itself with Rust 1.90+ or the latest release of zlib-rs. Many thanks again to Folkert for not only helping figure out the source of the issue, but also adding a mitigation to zlib-rs and cutting a new release to work around the miscompilation! Now the CPython test suite passes when linked against zlib-rs and I can continue my experiments...