<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Emma's Blog - zstd</title><link>https://emmatyping.dev/</link><description/><atom:link href="https://emmatyping.dev/feeds/zstd/rss.xml" rel="self"/><lastBuildDate>Tue, 11 Nov 2025 00:00:00 -0800</lastBuildDate><item><title>Decompression is up to 30% faster in CPython 3.15</title><link>https://emmatyping.dev/decompression-is-up-to-30-faster-in-cpython-315.html</link><description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt;&lt;br&gt;
As of Python 3.15, &lt;code&gt;compression.zstd&lt;/code&gt; provides the fastest Python Zstandard bindings. Changes to the code that manages output
buffers have led to a 25-30% performance uplift for Zstandard decompression and a 10-15% performance uplift for &lt;code&gt;zlib&lt;/code&gt;
for data at least 1 MiB in size. This has broad implications, e.g. faster wheel installations with pip, among many
other use cases.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;Since &lt;a href="https://peps.python.org/pep-0784/"&gt;landing Zstandard support in CPython&lt;/a&gt;, I have wanted to explore
the performance of CPython's compression modules to ensure they were well-optimized. Furthermore, the maintainer of
&lt;a href="https://github.com/Rogdham/pyzstd/"&gt;pyzstd&lt;/a&gt; and &lt;a href="https://github.com/Rogdham/backports.zstd"&gt;backports.zstd&lt;/a&gt; (a backport of
&lt;code&gt;compression.zstd&lt;/code&gt; to Python versions before 3.14) benchmarked the new &lt;code&gt;compression.zstd&lt;/code&gt; module against 3rd party Zstandard
Python bindings such as &lt;a href="https://github.com/Rogdham/pyzstd/"&gt;pyzstd&lt;/a&gt;,
&lt;a href="https://github.com/indygreg/python-zstandard"&gt;zstandard&lt;/a&gt;, and &lt;a href="https://github.com/sergey-dryabzhinsky/python-zstd"&gt;zstd&lt;/a&gt;,
and found the standard library was slower than most other bindings!&lt;/p&gt;
&lt;p&gt;Let's take a closer look at &lt;a href="https://github.com/Rogdham/zstd-benchmark/blob/master/results/2025-09-22_linux.md"&gt;the benchmarks&lt;/a&gt;
and how to read them:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Figures give timing comparison. For example, +42% means that the library needs 42% more time than stdlib/backports.zstd.
The reference time column indicates an average time for a single run.&lt;/p&gt;
&lt;p&gt;Emoji scale: ❤️‍🩹 -25% 🟥 -15% 🔴 -5% ⚪ +5% 🟢 +15% 🟩 +25% 💚&lt;/p&gt;
&lt;/blockquote&gt;
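&lt;p&gt;To make the scale concrete, here is a minimal sketch of how such a figure could be computed from two timings. The helper name &lt;code&gt;relative_time&lt;/code&gt; is hypothetical and is not taken from the benchmark repository.&lt;/p&gt;

```python
def relative_time(library_seconds: float, stdlib_seconds: float) -> float:
    """Return how much more (+) or less (-) time a library needs, in percent,
    compared to the stdlib reference time."""
    return (library_seconds / stdlib_seconds - 1.0) * 100.0


# A library that takes 1.42 s where the stdlib takes 1.00 s needs 42% more time.
print(f"{relative_time(1.42, 1.00):+.2f}%")
# A negative figure means the library needs less time than the stdlib reference.
print(f"{relative_time(0.75, 1.00):+.2f}%")
```

&lt;p&gt;Read this way, a negative percentage in a library's column means the stdlib reference was the slower of the two.&lt;/p&gt;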
&lt;p&gt;Okay, so hopefully we don't see a lot of red, which would mean the reference standard library (stdlib) is slower than the other bindings...&lt;/p&gt;
&lt;blockquote&gt;
&lt;h2&gt;CPython 3.14.0rc3&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;stdlib&lt;/th&gt;
&lt;th&gt;pyzstd&lt;/th&gt;
&lt;th&gt;zstandard&lt;/th&gt;
&lt;th&gt;zstd&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 3&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;⚪ - 3.81%&lt;/td&gt;
&lt;td&gt;⚪ - 1.17%&lt;/td&gt;
&lt;td&gt;🟢 + 5.86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 10&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;⚪ + 1.91%&lt;/td&gt;
&lt;td&gt;🟢 + 6.18%&lt;/td&gt;
&lt;td&gt;🟢 + 9.83%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 17&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;🟢 + 6.33%&lt;/td&gt;
&lt;td&gt;🟢 + 7.67%&lt;/td&gt;
&lt;td&gt;🟢 +12.92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 3&lt;/td&gt;
&lt;td&gt;7ms&lt;/td&gt;
&lt;td&gt;⚪ + 0.60%&lt;/td&gt;
&lt;td&gt;🔴 - 7.37%&lt;/td&gt;
&lt;td&gt;🟢 +12.08%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 10&lt;/td&gt;
&lt;td&gt;27ms&lt;/td&gt;
&lt;td&gt;🟢 +10.39%&lt;/td&gt;
&lt;td&gt;⚪ + 3.39%&lt;/td&gt;
&lt;td&gt;🟢 +12.46%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 17&lt;/td&gt;
&lt;td&gt;174ms&lt;/td&gt;
&lt;td&gt;⚪ - 2.48%&lt;/td&gt;
&lt;td&gt;⚪ - 3.91%&lt;/td&gt;
&lt;td&gt;⚪ + 0.08%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1G level 3&lt;/td&gt;
&lt;td&gt;6.03s&lt;/td&gt;
&lt;td&gt;🟩 +16.17%&lt;/td&gt;
&lt;td&gt;⚪ - 2.94%&lt;/td&gt;
&lt;td&gt;⚪ + 2.25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 3&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;🟥 -15.14%&lt;/td&gt;
&lt;td&gt;🔴 - 8.53%&lt;/td&gt;
&lt;td&gt;⚪ - 2.37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 10&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;🟥 -15.41%&lt;/td&gt;
&lt;td&gt;🔴 - 9.22%&lt;/td&gt;
&lt;td&gt;⚪ - 3.35%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 17&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;🔴 -11.16%&lt;/td&gt;
&lt;td&gt;🔴 - 7.09%&lt;/td&gt;
&lt;td&gt;⚪ + 2.07%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 3&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;🔴 - 6.88%&lt;/td&gt;
&lt;td&gt;⚪ - 4.03%&lt;/td&gt;
&lt;td&gt;💚 +26.88%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 10&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;🔴 - 6.69%&lt;/td&gt;
&lt;td&gt;⚪ - 4.86%&lt;/td&gt;
&lt;td&gt;💚 +25.63%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 17&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;🔴 - 7.99%&lt;/td&gt;
&lt;td&gt;⚪ - 4.96%&lt;/td&gt;
&lt;td&gt;💚 +25.58%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 3&lt;/td&gt;
&lt;td&gt;1.49s&lt;/td&gt;
&lt;td&gt;🟥 -19.41%&lt;/td&gt;
&lt;td&gt;🟥 -17.58%&lt;/td&gt;
&lt;td&gt;🟢 + 6.98%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 10&lt;/td&gt;
&lt;td&gt;1.62s&lt;/td&gt;
&lt;td&gt;❤️‍🩹 -27.65%&lt;/td&gt;
&lt;td&gt;❤️‍🩹 -26.48%&lt;/td&gt;
&lt;td&gt;🔴 - 6.92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 17&lt;/td&gt;
&lt;td&gt;1.67s&lt;/td&gt;
&lt;td&gt;🟥 -24.01%&lt;/td&gt;
&lt;td&gt;🟥 -23.04%&lt;/td&gt;
&lt;td&gt;⚪ - 4.43%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;Ouch. 10-25% slower is quite unfortunate! A silver lining is that most of the performance difference is in decompression,
so that narrows the area that is in need of optimization.&lt;/p&gt;
&lt;p&gt;After sitting down and thinking about it for a while, I came up with a few theories as to why &lt;code&gt;compression.zstd&lt;/code&gt; would
be slower compared to pyzstd and zstandard. My thinking was focused on noting differences in implementation I knew
existed between the various bindings. First, both pyzstd and zstandard build against their own copies of libzstd (the C
library implementing Zstandard compression and decompression). Meanwhile, CPython builds against the
system-installed libzstd, which is older on my system. Maybe there is a performance improvement in the newer libzstd
versions? Second, most of the performance difference is in decompression speed. Perhaps the implementation of
&lt;code&gt;compression.zstd.decompress()&lt;/code&gt; is inefficient? It uses multiple decompression instances to handle multi-frame input
where pyzstd uses one, so perhaps that's the issue? Finally, maybe the handling of output buffers is slow? When
decompressing data, CPython needs to provide an output buffer (location in memory to write to) to store the
uncompressed data. If the creation/allocation of that output buffer is slow it could bottleneck the decompressor.&lt;/p&gt;
&lt;h2&gt;Premature Optimizations&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;These optimizations didn't work, so if you'd like to skip to the optimizations which worked, please move to the next
section!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I decided to tackle these one at a time. First, I built pyzstd and zstandard against the system libzstd. Unfortunately,
after re-running the benchmark, this yielded zero performance difference. Darn.&lt;/p&gt;
&lt;p&gt;Next, I was pretty confident that &lt;code&gt;compression.zstd.decompress()&lt;/code&gt; was at least partially the culprit of the worse
performance. The &lt;a href="https://github.com/python/cpython/blob/95f6e1275b1c9de550d978cb2b4351cc4ed24fe4/Lib/compression/zstd/__init__.py#L152-L172"&gt;current &lt;code&gt;decompress()&lt;/code&gt; implementation&lt;/a&gt;
is written in Python and creates multiple decompression contexts and joins the results together. Surely that had to
lead to some performance degradation? I ended up re-implementing the &lt;code&gt;decompress()&lt;/code&gt; function in C using a single
decompression context to see if my theory was correct. To my chagrin, there was no performance uplift, and it may have
even performed &lt;em&gt;worse&lt;/em&gt;! For the curious, you can see &lt;a href="https://github.com/emmatyping/cpython/tree/zstd-decompress-in-c"&gt;my hacked together branch here&lt;/a&gt;.
It goes to show that you can never be sure where a performance bottleneck is just from reading the code!&lt;/p&gt;
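&lt;p&gt;For readers unfamiliar with the "one decompression context per frame" pattern, here is a minimal sketch of the same idea. It uses &lt;code&gt;zlib&lt;/code&gt; as a stand-in because it is available on every Python version; the function name &lt;code&gt;decompress_concatenated&lt;/code&gt; is hypothetical, and this is an analogy rather than the actual &lt;code&gt;compression.zstd&lt;/code&gt; source.&lt;/p&gt;

```python
import zlib


def decompress_concatenated(data: bytes) -> bytes:
    """Decompress one or more zlib streams stored back to back."""
    results = []
    while True:
        # A fresh decompressor per stream, mirroring "one context per frame".
        decomp = zlib.decompressobj()
        results.append(decomp.decompress(data))
        if not decomp.eof:
            raise ValueError("compressed data ended before the end of the stream")
        # Whatever follows the finished stream is the start of the next one.
        data = decomp.unused_data
        if not data:
            break
    # The per-stream results are joined into one bytes object at the end.
    return b"".join(results)


payload = zlib.compress(b"hello ") + zlib.compress(b"world")
print(decompress_concatenated(payload))
```

&lt;p&gt;The loop creates a new decompressor for each stream and joins the pieces afterwards, which is the kind of overhead that looked suspicious on paper but turned out not to be the bottleneck.&lt;/p&gt;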
&lt;h2&gt;Properly Profiling CPython&lt;/h2&gt;
&lt;p&gt;With my first two attempts at optimizing Zstandard decompression in CPython unsuccessful, I realized that I should do
what I probably should have done from the beginning: profile the code! I decided to use the
&lt;a href="https://docs.python.org/3/howto/perf_profiling.html"&gt;standard library support for the perf profiler&lt;/a&gt;, as it would
allow me to see both native/C frames such as inside libzstd or the bindings module &lt;code&gt;_zstd&lt;/code&gt;, as well as Python frames.&lt;/p&gt;
&lt;p&gt;So I went ahead and compiled CPython &lt;a href="https://docs.python.org/3/howto/perf_profiling.html#how-to-obtain-the-best-results"&gt;with some flags to improve perf data&lt;/a&gt;
and ran a simple script which called &lt;code&gt;compression.zstd.decompress()&lt;/code&gt; on a variety of data sizes. I highly recommend
reading the Python documentation about perf support for more details, but essentially what I ran was:&lt;/p&gt;
&lt;div class="codehilite" style="background: #0d1117"&gt;&lt;pre style="line-height: 125%;"&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span style="color: #8b949e; font-style: italic"&gt;# in a cpython checkout&lt;/span&gt;
./configure&lt;span style="color: #6e7681"&gt; &lt;/span&gt;--enable-optimizations&lt;span style="color: #6e7681"&gt; &lt;/span&gt;--with-lto&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff"&gt;CFLAGS&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;=&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;&amp;quot;-fno-omit-frame-pointer -mno-omit-leaf-frame-pointer&amp;quot;&lt;/span&gt;
make&lt;span style="color: #6e7681"&gt; &lt;/span&gt;-j&lt;span style="color: #ff7b72"&gt;$(&lt;/span&gt;nproc&lt;span style="color: #ff7b72"&gt;)&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;cd&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;../compression-benchmarks
perf&lt;span style="color: #6e7681"&gt; &lt;/span&gt;record&lt;span style="color: #6e7681"&gt; &lt;/span&gt;-F&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;9999&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;-g&lt;span style="color: #6e7681"&gt; &lt;/span&gt;-o&lt;span style="color: #6e7681"&gt; &lt;/span&gt;perf.data&lt;span style="color: #6e7681"&gt; &lt;/span&gt;../cpython/python&lt;span style="color: #6e7681"&gt; &lt;/span&gt;-X&lt;span style="color: #6e7681"&gt; &lt;/span&gt;perf&lt;span style="color: #6e7681"&gt; &lt;/span&gt;profile_zstd.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
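&lt;p&gt;The script being profiled is not shown here, so the following is only a sketch of what a workload like &lt;code&gt;profile_zstd.py&lt;/code&gt; might look like. The sizes, the round count, the helper names, and the &lt;code&gt;zlib&lt;/code&gt; fallback for Python versions without &lt;code&gt;compression.zstd&lt;/code&gt; are all assumptions made for illustration.&lt;/p&gt;

```python
import os

try:
    from compression import zstd as codec  # Python 3.14 and newer
except ImportError:
    import zlib as codec  # stand-in so the sketch still runs on older Pythons

# Payload sizes to exercise; these names and sizes are illustrative.
SIZES = {"1K": 1024, "1M": 1024 * 1024}


def make_payload(size: int) -> bytes:
    """Build compressible data by repeating a random 4 KiB chunk."""
    chunk = os.urandom(4096)
    repeats = size // len(chunk) + 1
    return (chunk * repeats)[:size]


def run_profile_workload(rounds: int = 50) -> int:
    """Decompress each payload repeatedly; return the total bytes produced."""
    total = 0
    for size in SIZES.values():
        compressed = codec.compress(make_payload(size))
        for _ in range(rounds):
            total += len(codec.decompress(compressed))
    return total


if __name__ == "__main__":
    print(run_profile_workload())
```

&lt;p&gt;Repeating the decompression many times gives the sampling profiler enough samples inside the decompression path to make hot spots stand out.&lt;/p&gt;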

&lt;p&gt;After analyzing the profile with &lt;code&gt;perf report --stdio -n -g&lt;/code&gt;, I noticed a significant bottleneck in the output buffer
management code! Let's take a brief detour to discuss what the output buffer management code does and why it was the
decompression bottleneck.&lt;/p&gt;
&lt;h2&gt;(Fast) Buffer Handling is Hard&lt;/h2&gt;
&lt;p&gt;When decompressing data, you feed the decompressor (libzstd in our case) a buffer (&lt;code&gt;bytes&lt;/code&gt; in Python) that is then
decompressed and needs to be written to a new buffer. Since this all happens in C, basically we need to allocate some
memory for libzstd to write the decompressed data into. But how much memory? Well, in many cases, we don't know! So we
need to dynamically resize the output buffer as it is filled up.&lt;/p&gt;
&lt;p&gt;This is actually a pretty challenging problem because there are several constraints and considerations to be made. The
buffer management needs to be fast for a variety of output buffer sizes. If you allocate too much memory up front,
you'll waste time allocating unused memory and slow down decompressing small amounts of data. On the other hand, if you
don't allocate enough, you'll have to make a lot of calls to the allocator, which will also slow things down as each
allocation has overhead and leads to fragmenting the output data. The memory should not grow exponentially for large
outputs, otherwise you could run out of memory for tasks that would normally fit into memory. Finally, each output from
the decompressor can vary in size, given that it may need to buffer data internally.&lt;/p&gt;
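&lt;p&gt;One way to balance these constraints is a block-size schedule that starts small, grows quickly, and then stops growing. The sketch below models that idea; the specific sizes and the function names are illustrative assumptions, not CPython's actual table.&lt;/p&gt;

```python
KIB = 1024
MIB = 1024 * KIB

# Hypothetical schedule: small first block, fast growth, then a fixed cap.
BLOCK_SIZES = [32 * KIB, 64 * KIB, 256 * KIB, 1 * MIB, 4 * MIB, 8 * MIB, 16 * MIB]


def next_block_size(blocks_allocated: int) -> int:
    """Size of the next block, given how many blocks already exist."""
    index = min(blocks_allocated, len(BLOCK_SIZES) - 1)
    return BLOCK_SIZES[index]


def blocks_needed(output_size: int) -> tuple[int, int]:
    """Return (number of blocks, total bytes allocated) for an output size."""
    allocated = 0
    count = 0
    while output_size > allocated:
        allocated += next_block_size(count)
        count += 1
    return count, allocated


for label, size in [("1 KiB", KIB), ("1 MiB", MIB), ("1 GiB", 1024 * MIB)]:
    count, allocated = blocks_needed(size)
    print(f"{label}: {count} blocks, {allocated // KIB} KiB allocated")
```

&lt;p&gt;With a schedule like this, a tiny output costs a single small allocation, while a very large output grows in capped steps rather than doubling without bound.&lt;/p&gt;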
&lt;p&gt;Because of the complexity in managing an output buffer, there is code shared across compression modules in CPython to
manage the buffer. This code lives in
&lt;a href="https://github.com/python/cpython/blob/404425575c68bef9d2f042710fc713134d04c23f/Include/internal/pycore_blocks_output_buffer.h"&gt;pycore_blocks_output_buffer.h&lt;/a&gt;.
The code was &lt;a href="https://github.com/python/cpython/commit/f9bedb630e8a0b7d94e1c7e609b20dfaa2b22231"&gt;modified four years ago&lt;/a&gt;
to use an implementation which writes to a series of &lt;code&gt;bytes&lt;/code&gt; objects stored in a &lt;code&gt;list&lt;/code&gt; to hold the output of
decompress calls. When finished, the bytes objects get concatenated together in &lt;code&gt;_BlocksOutputBuffer_Finish&lt;/code&gt;,
returning the final &lt;code&gt;bytes&lt;/code&gt; object containing the decompressed data. When profiling Zstandard decompression, I found
that more than 50% (!) of decompression time was spent in &lt;code&gt;_BlocksOutputBuffer_Finish&lt;/code&gt;! This seemed inordinately
long; ideally, this function should amount to just a few &lt;code&gt;memcpy&lt;/code&gt; calls. With this knowledge in hand, I tried to think of how
best to optimize the output buffer code.&lt;/p&gt;
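&lt;p&gt;To picture the shape of the blocks approach, here is a toy Python model of it. The class &lt;code&gt;BlocksOutputBuffer&lt;/code&gt; and its methods are hypothetical names chosen for this sketch; the real implementation is C code and differs in detail.&lt;/p&gt;

```python
class BlocksOutputBuffer:
    """Toy model: output is written into fixed-size blocks kept in a list."""

    def __init__(self, block_size: int = 32 * 1024) -> None:
        self.block_size = block_size
        self.blocks: list[bytearray] = []
        self.used_in_last = 0  # bytes filled in the most recent block

    def write(self, data: bytes) -> None:
        view = memoryview(data)
        while len(view) > 0:
            # Allocate a new block when there is none or the last one is full.
            if not self.blocks or self.used_in_last == self.block_size:
                self.blocks.append(bytearray(self.block_size))
                self.used_in_last = 0
            room = self.block_size - self.used_in_last
            chunk = view[:room]
            start = self.used_in_last
            self.blocks[-1][start:start + len(chunk)] = chunk
            self.used_in_last += len(chunk)
            view = view[len(chunk):]

    def finish(self) -> bytes:
        if not self.blocks:
            return b""
        # Every block is copied once more into the final contiguous result.
        full_blocks = self.blocks[:-1]
        tail = self.blocks[-1][:self.used_in_last]
        return b"".join(full_blocks + [tail])


buf = BlocksOutputBuffer(block_size=4)
buf.write(b"hello ")
buf.write(b"world")
print(buf.finish())
```

&lt;p&gt;In this model, &lt;code&gt;finish&lt;/code&gt; has to walk the list and copy all of the data a second time, which gives a sense of why the final concatenation step can become expensive for large outputs.&lt;/p&gt;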
&lt;h2&gt;Sometimes Timing Works Out&lt;/h2&gt;
&lt;p&gt;Right around the time that I was working on this, &lt;a href="https://peps.python.org/pep-0782/"&gt;PEP 782&lt;/a&gt; was accepted. This PEP
introduces a new &lt;code&gt;PyBytesWriter&lt;/code&gt; API to CPython which makes it easier to incrementally build up &lt;code&gt;bytes&lt;/code&gt; data in a safe
and performant way at the Python C API level. It seemed like a natural fit for what the blocks output buffer code was
doing, so I wanted to experiment with using it for the output buffer code. After modifying
&lt;code&gt;pycore_blocks_output_buffer.h&lt;/code&gt; to use &lt;code&gt;PyBytesWriter&lt;/code&gt;, I re-ran the original benchmark to see if we had closed the
performance gap:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: this benchmark was run on my local machine and the wall times are not comparable to the previous benchmark.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;stdlib&lt;/th&gt;
&lt;th&gt;zstandard&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 3&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;💚 +61.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 10&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;💚 +57.77%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1k level 17&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;💚 +364.86%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 3&lt;/td&gt;
&lt;td&gt;5ms&lt;/td&gt;
&lt;td&gt;💚 +40.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 10&lt;/td&gt;
&lt;td&gt;32ms&lt;/td&gt;
&lt;td&gt;⚪ - 0.99%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1M level 17&lt;/td&gt;
&lt;td&gt;126ms&lt;/td&gt;
&lt;td&gt;🟩 +15.93%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compress 1G level 3&lt;/td&gt;
&lt;td&gt;4.47s&lt;/td&gt;
&lt;td&gt;💚 +48.69%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 3&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;⚪ + 4.67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 10&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;⚪ + 4.79%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1k level 17&lt;/td&gt;
&lt;td&gt;&amp;lt;1ms&lt;/td&gt;
&lt;td&gt;🟢 + 5.38%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 3&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;💚 +50.23%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 10&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;💚 +41.94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1M level 17&lt;/td&gt;
&lt;td&gt;1ms&lt;/td&gt;
&lt;td&gt;💚 +47.37%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 3&lt;/td&gt;
&lt;td&gt;1.80s&lt;/td&gt;
&lt;td&gt;🟢 +12.87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 10&lt;/td&gt;
&lt;td&gt;1.77s&lt;/td&gt;
&lt;td&gt;🟢 +12.54%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;decompress 1G level 17&lt;/td&gt;
&lt;td&gt;1.80s&lt;/td&gt;
&lt;td&gt;🟢 + 8.76%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;/blockquote&gt;
&lt;p&gt;WOW! Not only have we closed the gap, &lt;code&gt;compression.zstd&lt;/code&gt; is now &lt;em&gt;faster&lt;/em&gt; than the popular zstandard 3rd-party module.&lt;/p&gt;
&lt;h2&gt;Validating Our Results&lt;/h2&gt;
&lt;p&gt;To validate the speedup, I decided to write my own minimal benchmark suite to compare
revisions of the standard library code. It uses &lt;a href="https://pyperf.readthedocs.io/en/latest/"&gt;&lt;code&gt;pyperf&lt;/code&gt;&lt;/a&gt;,
a benchmarking toolkit used in the venerable &lt;a href="https://github.com/python/pyperformance"&gt;pyperformance benchmark suite&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So I went ahead and wrote up a &lt;a href="https://github.com/emmatyping/compression-benchmarks/blob/fab8806f3af89b369e40e77be291dd37f3223b7c/bench_zstd.py"&gt;benchmark for zstd&lt;/a&gt;
which tests compression and decompression using default parameters for sizes 1 KiB, 1 MiB, and 1 GiB. I ran these
benchmarks on main and my branch which uses &lt;code&gt;PyBytesWriter&lt;/code&gt;.&lt;/p&gt;
&lt;div class="codehilite" style="background: #0d1117"&gt;&lt;pre style="line-height: 125%;"&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;K)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;3.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.03&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: 
#e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;3.00&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.03&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;us&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.01&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;M)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;2.92&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.02&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; 
&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;2.89&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.02&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;ms&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.01&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;G)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;2.72&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; 
&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;2.67&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;sec&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.02&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;K)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.40&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; 
&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.38&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;us&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.01&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;M)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;734&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;4&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; 
&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;546&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;3&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;us&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.34&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zstd.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;G)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;790&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;4&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter_zstd_3&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; 
&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;634&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;3&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;ms&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.25&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;

&lt;span style="color: #e6edf3"&gt;Geometric&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;mean&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.10&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;For inputs of at least 1 MiB, that's 25-30% faster decompression! In hindsight, this actually makes sense if you
consider that libzstd's decompression implementation is exceptionally fast.
&lt;a href="https://github.com/inikep/lzbench"&gt;lzbench&lt;/a&gt;, a popular compression library benchmark, found that libzstd can
decompress data at greater than 1 GiB/s. This is much faster than bz2, lzma, or zlib, the other compression modules in
the standard library. One of the motivations for adding Zstandard to CPython was its performance. So it is not too
surprising that the output buffer code would be a bottleneck, given that the existing compression libraries don't write
as quickly to the output buffer. This also explains why compression isn't faster after changing the output buffer
code: compression is very CPU intensive, so more time is spent in the compressor than in writing to the output
buffer. The same reasoning explains why there is no speedup for decompressing 1 KiB of data: the first 32 KiB block that
is allocated is plenty to store all of the output data, meaning all of the time is spent in the decompressor.&lt;/p&gt;
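&lt;p&gt;As a rough illustration of that last point, here is a toy model of a block-based output buffer. Only the 32 KiB
first block comes from the discussion above; the doubling growth factor and the &lt;code&gt;blocks_needed&lt;/code&gt; helper are
assumptions made up for this sketch and are not CPython's actual allocation schedule.&lt;/p&gt;

```python
# Toy model of a block-based output buffer.
# The 32 KiB first block is taken from the text; the doubling growth factor
# is an assumption for illustration, not CPython's real schedule.
FIRST_BLOCK = 32 * 1024


def blocks_needed(output_size, first_block=FIRST_BLOCK, growth=2):
    """Count the blocks allocated before output_size bytes fit."""
    blocks = 1
    block = first_block
    remaining = max(output_size - block, 0)  # bytes that do not fit yet
    while remaining:
        block *= growth  # hypothetical policy: each new block doubles
        remaining = max(remaining - block, 0)
        blocks += 1
    return blocks


for label, size in [("1 KiB", 1024), ("1 MiB", 1024**2), ("1 GiB", 1024**3)]:
    print(label, blocks_needed(size))
```

&lt;p&gt;Under these assumptions a 1 KiB output never leaves the first block, while 1 MiB and 1 GiB outputs need 6 and 16
blocks respectively, so only the larger outputs exercise the buffer management code at all.&lt;/p&gt;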
&lt;p&gt;One final validation I wished to do was to check the performance of &lt;code&gt;zlib&lt;/code&gt;, to ensure that the change did not regress
performance for other standard library compression modules. I wrote
&lt;a href="https://github.com/emmatyping/compression-benchmarks/blob/fab8806f3af89b369e40e77be291dd37f3223b7c/bench_zlib.py"&gt;a similar benchmark for zlib&lt;/a&gt;
to the one I wrote for zstd, and found that there was also a performance increase with the output buffer change!&lt;/p&gt;
&lt;div class="codehilite" style="background: #0d1117"&gt;&lt;pre style="line-height: 125%;"&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;M)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;13.5&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.1&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: 
#e6edf3"&gt;pybyteswriter&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;13.4&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.0&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;ms&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.00&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;G)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;11.4&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.0&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span 
style="color: #a5d6ff"&gt;11.3&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.0&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;sec&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.00&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;K)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.42&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span 
style="color: #a5d6ff"&gt;1.39&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;us&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.01&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;us&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.02&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;M)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.29&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.00&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span 
style="color: #a5d6ff"&gt;1.17&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;ms&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.00&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;ms&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.10&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;decompress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;G)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;Mean&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;std&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;dev&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;main&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.36&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.00&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;-&amp;gt;&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;[&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;pybyteswriter&lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;]&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span 
style="color: #a5d6ff"&gt;1.17&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;sec&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;+-&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;0.00&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;sec&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.17&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;

&lt;span style="color: #e6edf3"&gt;Benchmark&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;hidden&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;because&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #ff7b72; font-weight: bold"&gt;not&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;significant&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;)&lt;/span&gt;&lt;span style="color: #f85149"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;zlib.&lt;/span&gt;&lt;span style="color: #d2a8ff; font-weight: bold"&gt;compress&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;(&lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;K)&lt;/span&gt;

&lt;span style="color: #e6edf3"&gt;Geometric&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #79c0ff; font-weight: bold"&gt;mean&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;:&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #a5d6ff"&gt;1.05&lt;/span&gt;&lt;span style="color: #e6edf3"&gt;x&lt;/span&gt;&lt;span style="color: #6e7681"&gt; &lt;/span&gt;&lt;span style="color: #e6edf3"&gt;faster&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;10-15% faster decompression on data of at least 1 MiB for zlib is pretty significant, especially when you consider that
zlib is used by pip to unpack files in almost every wheel package Python users install.&lt;/p&gt;
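&lt;p&gt;To see where zlib enters the picture, recall that a wheel is a zip archive, and the standard library's
&lt;code&gt;zipfile&lt;/code&gt; module inflates deflate-compressed members through zlib. The following is a minimal sketch of that path,
not pip's actual unpacking code; the member name and payload are hypothetical stand-ins for a real wheel.&lt;/p&gt;

```python
import io
import zipfile

# Hypothetical member name and contents standing in for a file in a wheel.
MEMBER = "example_pkg/__init__.py"
payload = b"print('hello from a wheel')\n" * 1000

# Build a small deflate-compressed archive in memory.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr(MEMBER, payload)

# Reading the member back inflates it through zlib.
with zipfile.ZipFile(archive) as zf:
    info = zf.getinfo(MEMBER)
    unpacked = zf.read(info)

print(info.compress_type == zipfile.ZIP_DEFLATED)
print(info.compress_size, len(unpacked))
```

&lt;p&gt;This payload is far below 1 MiB, so it would not benefit much on its own; the gains apply to the larger files
inside a wheel.&lt;/p&gt;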
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;With the improvements to output buffer handling, I was able to improve the performance of not only &lt;code&gt;compression.zstd&lt;/code&gt;,
but the decompression code of all of the compression modules. After stumbling over a few optimization ideas, I definitely
learned my lesson to profile code before jumping to conclusions! You won't know what is a real bottleneck unless you
can test it! Just having a benchmark is not enough!&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/python/cpython/issues/139877"&gt;The original issue I opened&lt;/a&gt; goes into a bit more detail about the
process of benchmarking the compression modules, and &lt;a href="https://github.com/python/cpython/commit/f262297d525e87906c5e4ab28e80284189641c9e"&gt;the commit with the improvement&lt;/a&gt;
has the diff of changes to adopt &lt;code&gt;PyBytesWriter&lt;/code&gt;. One thing I'm proud of is that not only did the change improve
performance, it also simplified the implementation of the output buffer code and removed 60 lines of code in the
process!&lt;/p&gt;
&lt;p&gt;I did some more profiling of zlib to see if there were any more performance gains to be made, but the profile I
gathered seems to indicate that 95+% of the time is spent in zlib's inflate implementation (with the rest in the
CPython VM), so there is little if any room for further optimization in CPython's bindings for zlib. I think this
is good, as it indicates Python users are getting the best performance they can in 3.15!&lt;/p&gt;
&lt;p&gt;Going forward, I am planning on profiling compression code more, but the vast majority of the time spent
there will probably be in the compressor since compression is so CPU intensive. Finally, I want to investigate
optimizations related to providing more information about the final size of the output data. In some cases the output
buffer is initialized to a small size and dynamically resized as output is produced, but ideally users would be able
to provide more information about their workflow and see a performance improvement from it. I have a lot of other ideas
related to compression I'd like to work on; check out &lt;a href="https://notes.emmatyping.dev/share/ossTODO"&gt;my OSS TODO list&lt;/a&gt;
for all of the random ideas I want to work on in the future!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Emma Smith</dc:creator><pubDate>Tue, 11 Nov 2025 00:00:00 -0800</pubDate><guid>tag:emmatyping.dev,2025-11-11:/decompression-is-up-to-30-faster-in-cpython-315.html</guid><category>misc</category><category>python</category><category>compression</category><category>zstd</category></item></channel></rss>