Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster

(fidget-spinner.github.io)

170 points | by lumpa 4 hours ago

38 comments

  • mananaysiempre 3 hours ago

    The money shot (wish this were included in the blog post):

      #   if defined(_MSC_VER) && !defined(__clang__)
      #      define Py_MUSTTAIL [[msvc::musttail]]
      #      define Py_PRESERVE_NONE_CC __preserve_none
      #   else
      #       define Py_MUSTTAIL __attribute__((musttail))
      #       define Py_PRESERVE_NONE_CC __attribute__((preserve_none))
      #   endif
    
    https://github.com/python/cpython/pull/143068/files#diff-45b...

    Apparently(?) this also needs to be attached to the function declarator and does not work as a function specifier: `static void *__preserve_none slowpath();` and not `__preserve_none static void *slowpath();` (unlike GCC attribute syntax, which tends to be fairly gung-ho about this sort of thing, sometimes with confusing results).
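
    For context, a minimal sketch of how the two macros end up being used (hypothetical handler names, not the actual CPython code), with the calling-convention macro placed on the declarator as described above:

      /* Hypothetical sketch, not the actual CPython code: Py_PRESERVE_NONE_CC
         sits on the declarator next to the function name, and Py_MUSTTAIL goes
         on the return statement so the dispatch compiles to a jump rather than
         a call. */
      static void *Py_PRESERVE_NONE_CC next_handler(void *frame, int oparg);

      static void *Py_PRESERVE_NONE_CC current_handler(void *frame, int oparg)
      {
          /* ... do the work for this opcode ... */
          Py_MUSTTAIL return next_handler(frame, oparg);
      }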

    Yay to getting undocumented MSVC features disclosed if Microsoft thinks you’re important enough :/

    • kenjin4096 18 minutes ago

      So it seems I was wrong, [[msvc::musttail]] is documented! I will update the blog post to reflect that.

      https://news.ycombinator.com/item?id=46385526

    • publicdebates 3 hours ago

      Important enough, or benefits them directly? I have no good guess as to how improving Python's performance would benefit them, but I'd guess that's the real reason.

      • pjmlp 20 minutes ago

        Microsoft was the one that hired Guido out of retirement and, alongside Facebook, finally kicked off the CPython JIT efforts.

        Python is one of the Microsoft-blessed languages on their devblogs.

      • andix 2 hours ago

        I guess there are some Python workloads on Azure; Microsoft provides a lot of data analysis and LLM tools as a service (not billed by the CPU minute). Saving CPU cycles there translates directly into financial savings.

      • acdha 39 minutes ago

        Think about how much effort they have put into things like Pylance and general Python support in VS Code. Clearly they think enough users care about this that a first-class experience is worth having.

      • HPsquared 3 hours ago

        I wonder if this is related to Python in Excel. You'll have lots of people running numerical stuff written in Python, running on Microsoft servers.

      • mkoubaa 2 hours ago

        A lot of commercial engineering and scientific software runs on Windows.

  • jtrn 2 hours ago

    I'm a bit out of the loop on this, but I hope it's not like that time with Python 3.14, when a geometric mean speedup of about 9-15% over the standard interpreter was claimed when built with Clang 19. It turned out the results were inflated due to a bug in LLVM 19 that prevented proper "tail duplication" optimization in the baseline interpreter's dispatch loop. Actual gains were approximately 4%.

    Edit: I read through it and have come to the conclusion that the post is 100% OK and properly framed: he explicitly says his approach is one of "sharing early and making a fool of myself," prioritizing transparency and rapid iteration over ironclad verification upfront.

    One could make an argument that he should have cross-compiler checks, independent audits, or delayed announcements until results are bulletproof across all platforms. But given that he is 100% transparent with his thinking and how he works, it's all good in the hood.

    • kenjin4096 an hour ago

      Thanks :), that was indeed my intention. I think the previous 3.14 mistake was actually a good one in hindsight, because if I hadn't publicized our work early, I wouldn't have caught the attention of Nelson. Nelson also probably wouldn't have spent a month digging into the Clang 19 bug. That in turn would have meant the bug wouldn't have been caught in the betas, and it might've shipped with the actual release, which would have been way worse. So this was all a happy accident in hindsight that I'm grateful for, as it means overall CPython still benefited!

      Also, this time I'm pretty confident because there are two perf improvements here: the dispatch logic and the inlining. MSVC can actually convert switch-case interpreters to threaded code automatically if some conditions are met [1]. However, it does not seem to do that for the current CPython interpreter; I suspect the interpreter loop is just too complicated to meet those conditions. The key point is also that we would be relying on MSVC again to do its magic, whereas this tail-calling approach gives more control to the writers of the C code. The inlining is pretty much impossible to convince MSVC to do except with `__forceinline` or changing things to use macros [2]. However, we can't just mark every function as forceinline in CPython, as it might negatively affect other compilers.

      [1]: https://github.com/faster-cpython/ideas/issues/183
      [2]: https://github.com/python/cpython/issues/121263
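
      To make the difference concrete, here is a rough sketch (illustrative names only, not CPython's real code) of switch-based dispatch versus tail-call dispatch, reusing the Py_MUSTTAIL macro quoted at the top of the thread:

        /* Style A: classic switch dispatch. The whole interpreter is one
           enormous function, so inlining and register-allocation heuristics
           have to cope with a single ~12k-line body. */
        void run_switch(unsigned char *ip)
        {
            for (;;) {
                switch (*ip++) {
                case 0: /* OP_FOO: do the work */ break;
                case 1: /* OP_BAR: do the work */ break;
                /* ... hundreds more cases ... */
                default: return;
                }
            }
        }

        /* Style B: tail-call dispatch. Each opcode handler is its own small
           function, and the forced tail call keeps dispatch a jump, so every
           handler gets optimised in isolation. */
        typedef void *(*handler_t)(void *frame, unsigned char *ip);
        extern handler_t handlers[256];

        static void *op_foo(void *frame, unsigned char *ip)
        {
            /* ... do the work for this opcode ... */
            Py_MUSTTAIL return handlers[*ip](frame, ip + 1);
        }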

      • jtrn 11 minutes ago

        I wish all self-promoting scientists and sensationalizing journalists had a fraction of the honesty and dedication to actual truth and proper communication that you show. You seem to care more about being transparent regarding these kinds of technical details than some people are about their claims in clinical medical research. Thank you so much for all you do and the way you communicate about it.

        Also, I'm not that familiar with the whole process, but I just wanted to say that I think you were too hard on yourself during the last performance drama. So thank you again, and remember not to hold yourself to an impossible standard that no one else meets.

  • gozzoo 10 minutes ago

    I have a question - slightly off topic, but related. I was wondering why the Python interpreter is so much slower than the V8 JavaScript interpreter, when both JavaScript and Python are dynamic interpreted languages.

    • bheadmaster 5 minutes ago

      I can think of two possible reasons:

      The first is Google's manpower. Google somehow succeeds in writing fast software; most Google products I use are fast, in contrast to the rest of the ecosystem. It's possible that Google simply did a better job.

      The second is CPython's legacy. There are faster implementations of Python that fully implement the language (PyPy comes to mind), but there's a huge ecosystem of C extensions written against the CPython bindings, which makes it virtually impossible to break compatibility. It is possible that this legacy prevents many potential optimizations. On the other hand, V8 only needs to keep compatibility at the code level, which lets them swap out practically the whole internals in an incremental search for a faster version.

      I might be wrong, so take what I said with a grain of salt.

  • redox99 3 hours ago

    This seems like very low-hanging fruit. How is the core loop not already hyper-optimized?

    I'd have expected it to be hand-rolled assembly for the major ISAs, with a C fallback for less common ones.

    How much energy has been wasted worldwide because of a relatively unoptimized interpreter?

    • Calavar 36 minutes ago

      Quite to the contrary, I'd say this update is evidence of the inner loop being hyperoptimized!

      MSVC's support for musttail is hot off the press:

      > The [[msvc::musttail]] attribute, introduced in MSVC Build Tools version 14.50, is an experimental x64-only Microsoft-specific attribute that enforces tail-call optimization. [1]

      MSVC Build Tools version 14.50 was released last month, and it only took a few weeks for the CPython crew to turn that around into a performance improvement.

      [1] https://learn.microsoft.com/en-us/cpp/cpp/attributes?view=ms...

    • kccqzy 3 hours ago

      Python's goal was never really to be fast. If that were its goal, it would've had a JIT long ago instead of toying with optimizing the interpreter. Guido prioritized code simplicity over speed. A lot of speed improvements, including the JIT (PEP 744 – JIT Compilation), came about after he stepped down.

      • davidkhess an hour ago

        Should probably mention that Guido ended up on the team working on a pretty credible JIT effort, though Microsoft subsequently threw a wrench into it with layoffs. Not sure of the status now.

    • pjc50 an hour ago

      This is (a) wildly beyond expectations for open source, (b) a massive pain to maintain, and (c) not even the biggest timewaster of python, which is the packaging "system".

      • loeg 31 minutes ago

        > not even the biggest timewaster of python, which is the packaging "system".

        For frequent, short-running scripts: start-up time! Every import has to scan a billion different directories for where the module might live, even for standard modules included with the interpreter.

    • WD-42 an hour ago

      Probably because anyone concerned with performance wasn’t running workloads on Windows to begin with.

      • pjmlp 18 minutes ago

        Games and Proton.

        Apparently people that care about performance do run Windows.

        • nilamo 2 minutes ago

          Games are made for Windows because that's where the device drivers have historically been. Any other viewpoint is ignoring reality.

      • loeg 30 minutes ago

        They weren't using Python, anyway.

    • mkoubaa an hour ago

      Software has gotten so slow we've forgotten how fast computers are

    • LtWorf 2 hours ago

      If you want fast, just use PyPy and forget about CPython.

  • g947o 3 hours ago

    > This has caused many issues for compilers in the past, too many to list in fact. I have a EuroPython 2025 talk about this.

    Looks like it refers to this:

    https://youtu.be/pUj32SF94Zw

    (wish it were linked in the article)

  • acemarke an hour ago

    I've never seen this kind of benchmark graph before, and it looks really neat! How was this generated? What tool was used for the benchmarks?

    (I actually spent most of Sep/Oct working on optimizing the Immer JS immutable update library, and used a benchmarking tool called `mitata`, so I was doing a lot of this same kind of work: https://github.com/immerjs/immer/pull/1183 . Would love to add some new tools to my repertoire here!)

  • Hendrikto 3 hours ago

    TLDR: The tail-calling interpreter is slightly faster than computed goto.

    > I used to believe that the tail-calling interpreters get their speedup from better register use. While I still believe that now, I suspect that is not the main reason for speedups in CPython.

    > My main guess now is that tail calling resets compiler heuristics to sane levels, so that compilers can do their jobs.

    > Let me show an example, at the time of writing, CPython 3.15’s interpreter loop is around 12k lines of C code. That’s 12k lines in a single function for the switch-case and computed goto interpreter.

    > […] In short, this overly large function breaks a lot of compiler heuristics.

    > One of the most beneficial optimisations is inlining. In the past, we’ve found that compilers sometimes straight up refuse to inline even the simplest of functions in that 12k loc eval loop.
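
    For comparison, the computed-goto flavour mentioned above relies on the GCC/Clang labels-as-values extension (MSVC has no equivalent) and looks roughly like this simplified sketch, not CPython's actual eval loop:

      /* Simplified computed-goto sketch (GNU labels-as-values extension). */
      void run_computed_goto(unsigned char *ip)
      {
          static void *targets[] = { &&op_foo, &&op_bar /* , ... */ };

          goto *targets[*ip++];    /* initial dispatch */

      op_foo:
          /* ... do the work for OP_FOO ... */
          goto *targets[*ip++];    /* jump straight to the next opcode's label */

      op_bar:
          /* ... do the work for OP_BAR ... */
          return;
      }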

    • kccqzy an hour ago

      I think in the protobuf example the musttail did in fact benefit from better register use. All the functions are called with the same arguments, so there is no need to shuffle the registers. The same six register-passed arguments are reused from one function to the next.
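
      A sketch of that idea (hypothetical signature, loosely modelled on the protobuf-style parsers, not actual upb or CPython code; Py_MUSTTAIL is the macro quoted at the top of the thread): every handler shares one parameter list, so the same values stay parked in the same argument registers across each tail call.

        /* All handlers share one signature, so arguments pass straight
           through in the same registers with no shuffling or spilling. */
        typedef struct Ctx Ctx;
        typedef void *(*Handler)(Ctx *ctx, const char *ptr, const char *end,
                                 unsigned long data, const void *table);

        static void *handle_field(Ctx *ctx, const char *ptr, const char *end,
                                  unsigned long data, const void *table)
        {
            /* ... decode one field ... */
            Py_MUSTTAIL return ((const Handler *)table)[(unsigned char)*ptr](
                ctx, ptr + 1, end, data, table);
        }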

    • cma an hour ago

      Does MSVC support computed goto?

  • forrestthewoods 16 minutes ago

    Is there a Clang-based build for Windows? I've been slowly moving my Windows builds from MSVC to Clang, which still uses the Microsoft STL implementation.

    So far I think using Clang instead of the MSVC compiler is a strict win? Not a huge difference, mind you, but a win nonetheless.

  • bgwalter 2 hours ago

    MSVC mostly generates slower code than gcc/clang, so maybe this trick reduces the gap.

    • metaltyphoon 2 hours ago

      Is this backed by real evidence?

      • bluecalm an hour ago

        In my experience it was 10-15% slower than GCC. That was 10 years ago, though.

  • develatio 3 hours ago

    If the author of this blog reads this: can we get an RSS feed, please?

    • kenjin4096 3 hours ago

      Got it. I'll try to set one up this weekend.

  • machinationu 3 hours ago

    The Python interpreter core loop sounds like the perfect problem for AlphaEvolve, or its open-source equivalent OpenEvolve if DeepMind doesn't want to speed up Python for the competition.