Making PyPI's test suite 81% faster – The Trail of Bits Blog

(blog.trailofbits.com)

78 points | by rbanffy 4 days ago

22 comments

  • cocoflunchy 6 hours ago

    I don't understand why pytest's collection is so slow.

    On our test suite (a big Django app) it takes about 15s to collect tests. So much so that we added a util that uses ripgrep to find the file containing the test and pass it as an argument to pytest when using `pytest -k <testname>`.
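
    A wrapper like that might look like this (a sketch; the `pt` name and the `tests/` directory are assumptions, and `grep -rl` is a slower drop-in where ripgrep's `rg -l` isn't available):

```shell
# pt: run a single named test without paying the full collection cost.
# Find the file that defines the test, then hand pytest only that file.
pt() {
    file=$(grep -rl "def $1" tests/ | head -n1)   # with ripgrep: rg -l "def $1" tests/
    if [ -z "$file" ]; then
        echo "no test named $1 found" >&2
        return 1
    fi
    pytest "$file" -k "$1"
}
```

    Called as `pt test_user_login`, this makes pytest collect only the one file instead of the whole tree.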

    • Galanwe 37 minutes ago

      From my experience speeding up pytest with Django:

      - Creating and migrating the test DB is slow. There is no shame in storing and committing a premigrated sqlite test DB generated upon release; it's often small and will save time for everyone.

      - Stash old migrations that nobody uses anymore.

      - Use python -X importtime and paste the result in an online viewer. Sometimes moving heavy imports to functions instead of the global scope will make individual tests slower, but collection will be faster.

      - Use pytest-xdist

      - Disable transactions / rollback on readonly tests. Ideally you want most of your non-inserting tests to work on the migrated/preloaded features in your sqlite DB.

      We can go into more detail if you want, but the premigrated DB + xdist alone allowed me to speed up tests on a huge project from 30m to 1m.
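
      The `-X importtime` flag mentioned above is built into CPython and easy to try; a minimal sketch (`json` stands in for whatever heavy module your conftest pulls in, and the log file name is arbitrary):

```shell
# Each stderr line has the form "import time: <self us> | <cumulative us> | <module>".
python3 -X importtime -c "import json" 2> importtime.log
# Show the biggest cumulative offenders first (field 2 is cumulative microseconds):
sort -t '|' -k2 -rn importtime.log | head
```

      Tools like tuna can render the same log as a flame graph, which is likely the kind of "online viewer" meant above.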

      • imp0cat 2 minutes ago

        Is there a way to use pytest-xdist and still keep the regular output?

    • boxed 3 hours ago

      I've done some work on making pytest faster, and it's mostly a case of death by a thousand paper cuts. I wrote hammett as an experimental benchmark to compare to.

    • kinow 5 hours ago

      In their case I think they were not specifying any test path, which would cause pytest to search for tests in multiple directories.

      Another thing that can slow down pytest collection and bootstrap is how fixtures are loaded, so reducing the number or scope of fixtures may help too.

    • piokoch 5 hours ago

      Ehhh, those pesky Python people, complaining and complaining; an average Spring Boot application takes 15s to even start checking whether the code compiled ;)

      • thom 5 hours ago

        Lest we start to malign the JVM as a whole, my Clojure test suite, which includes functional tests running headless browsers against a full app hitting real Postgres databases, runs end to end in 20s.

        • ffsm8 2 hours ago

          The Spring tests are generally quicker than the equivalent Python tests, so IME the JVM is mostly to blame.

          How much time actually goes by after you click "run test" (or run the equivalent CLI command) until the test finishes running?

          Every JVM project I've ever worked on (none of which were Clojure, admittedly) has taken at least 10-15s before the pre-phases finished and the actual test setup began.

          • thom 2 hours ago

            If I completely clear all cached packages, maybe, but I never do that locally or in CI/CD, and that's true of Python too (though no doubt uv is faster than Maven). Clojure/JVM startup time is less than half a second; obviously that's still infinitely more than Python or a systems language, but tolerable to me. First test runs after about 2s? And day to day these things run instantly because they're already loaded in a REPL/IPython. Maybe it's unfair to compare an interpreted language to a compiled one: building an uberjar would add 10 seconds, but I'd never do that during development, which is part of the selling point I guess. Either way, I don't think JVM startup time is really a massive issue in 2025, and I feel like whatever ecosystem you're in, you can always attack these slow test suites and improve your quality of life.

  • NeutralForest 3 hours ago

    Pretty good article. It's really a challenge to properly isolate DB operations during testing, so having a different instance per worker is nice. I remember trying to use different schemas (not instances), but I had a hard time isolating roles as well.

  • bgwalter 4 hours ago

    I get that pytest has features that unittest does not, but how is scanning for test files in a directory considered appropriate for what is called a high security application in the article?

    For high security applications the test suite should be boring and straightforward. pytest is full of magic, which makes it so slow.

    Python in general has become so complex, informally specified and bug ridden that it only survives because of AI while silencing critics in their bubble.

    The complexity includes PSF development processes, which lead to:

    https://www.schneier.com/blog/archives/2024/08/leaked-github...

    • williamdclt 3 hours ago

      > it only survives because of AI

      I don't disagree that it's "complex, informally specified" (idk about bug ridden or silencing critics), but it's just silly to say it only survives because of AI. It was a top-used language for web development, data science and all sorts of scientific analysis before AI got big, and those fields haven't gone away: I don't expect Python lost much ground there, if any.

      • bgwalter 3 hours ago

        Dropbox moved parts from Python to Go as early as 2014. Google fired its Python team last year, and I hear it does not use Python for new code. Instagram is kept afloat by gigantic hacks.

        The scientific ecosystem was always there, but relied on heavy marketing to academics, who (sadly) in turn indoctrinate new students to use Python as a first language.

        I did forget about sysadmin use cases in Linux distributions, but those could easily be replaced by even Perl, as leaner BSD distributions already show.

        • guappa 31 minutes ago

          You'd be right if go wasn't an awful language designed by someone who clearly failed their compiler class at university.

  • throwme_123 4 hours ago

    Is Trail of Bits transitioning out of "crypto"?

    Imho, they are one of the best auditors out there for smart contracts. It wouldn't be surprising to see some of these talented teams find bigger markets.

  • ustad 6 hours ago

    The article uses pytest - does anyone have similar tips for Python's builtin unittest?

    • masklinn 6 hours ago

      The sys.monitoring and import optimisation suggestions apply as-is.

      If you use standard unittest discovery the third item might apply as well, though probably not to the same degree.

      I don’t think unittest has any support for distribution so the xdist stuff is a no.

      On the other hand, you could use unittest as the API with pytest as your test runner. Then you can also use xdist, and eventually migrate to the pytest test API because it's so much better.
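
      This mixed mode works because pytest collects plain `unittest.TestCase` classes unchanged; a minimal sketch (the class and test names here are invented):

```python
import unittest

# A stock unittest test: pytest will collect and run this class as-is,
# so a suite can adopt pytest (and pytest-xdist) as the runner first
# and migrate individual tests to plain pytest functions later.
class TestMath(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(2 + 2, 4)
```

      Saved as e.g. `test_math.py`, both `python -m unittest test_math.py` and `pytest test_math.py` (or `pytest -n auto` with xdist) run the same class.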

      • kinow 5 hours ago

        I wasn't familiar with this sys.monitoring option for coverage. Going to give it a try in my test suite. At the moment, with Docker testcontainers, a GH Actions test matrix for multiple Python versions, and unit + regression + integration tests, it takes about 3-5 minutes.

    • anticodon 3 hours ago

      I profiled a huge legacy test collection using cProfile and found lots of low-hanging fruit. For example, some tests were creating a 4000x3000 Pillow image in memory just to test how image-saving code works (checking that the filename and extension are correct). And hundreds of tests created this huge image for every test (in the setUp method) because of unittest's reliance on inheritance. Reducing the image size to 10x5 made the test suite faster by something like 5-7% (it was a long time ago, so I don't remember the exact statistics).

      So, I'd run the tests under cProfile first.
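
      A self-contained sketch of that approach, using only the standard library (`make_image` and `run_suite` are invented stand-ins for the real fixtures and suite):

```python
import cProfile
import io
import pstats

def make_image(width, height):
    # Stand-in for an expensive setUp step, e.g. building a 4000x3000 image;
    # here it just allocates a width*height byte buffer.
    return bytearray(width * height)

def run_suite():
    # Simulate hundreds of tests each rebuilding the heavy fixture.
    for _ in range(200):
        make_image(400, 300)

profiler = cProfile.Profile()
profiler.enable()
run_suite()
profiler.disable()

# Sort by cumulative time so expensive setup helpers surface at the top.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

      Against a real suite you could run something like `python -m cProfile -o prof.out -m pytest tests/` and inspect `prof.out` with `pstats` the same way.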

      • dmurray an hour ago

        But the changes in TFA were of the order of a 75% improvement, from "dumb" changes that were agnostic to the details of the tests being run.

        Saying you got a 5-7% improvement from a single change, discovered using the profiler, that took understanding of the test suite and the domain to establish it was OK, and that actually changed the functionality under test - that's all an argument for doing exactly the opposite of what you recommend.