I don't like relying on (release-only) LLVM optimizations for a number of reasons, but primarily: a) they break between releases, more often than you'd think; b) they're part of the reason why debug builds of Rust software are so much slower (at runtime) than release builds; c) they're much harder to verify (and very opaque).
For non-performance-sensitive code, sure, go ahead and rely on the Rust compiler to compile away the allocation of a whole new vector of a different type to convert from T to AtomicT. But where the performance matters, for my money I would go with the transmute 100% of the time (assuming the underlying type was decorated with #[repr(transparent)], though it would be nice if we could statically assert that). It'll perform better in debug mode, it's obvious what you are doing, it's guaranteed not to break in a minor rustc update, and it'll work with &mut [T] instead of an owned Vec<T> (which is a big one).
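As a sketch of what that could look like (SymbolId is a stand-in for the article's newtype; the const assertions are the compile-time check the comment wishes for):

    use std::mem::{align_of, size_of, transmute};
    use std::sync::atomic::AtomicU32;

    // Stand-in newtype; #[repr(transparent)] guarantees it has exactly
    // the layout of its single field.
    #[repr(transparent)]
    struct SymbolId(u32);

    fn as_atomics(ids: &mut [SymbolId]) -> &mut [AtomicU32] {
        // The "static assert": evaluated at compile time, so the build
        // fails if the layout assumption is ever violated.
        const _: () = assert!(size_of::<SymbolId>() == size_of::<AtomicU32>());
        const _: () = assert!(align_of::<SymbolId>() == align_of::<AtomicU32>());
        // SAFETY: same size and alignment (checked above), and AtomicU32
        // accepts any bit pattern that a u32 does.
        unsafe { transmute::<&mut [SymbolId], &mut [AtomicU32]>(ids) }
    }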
Every one of these "performance tricks" is describing how to convince Rust's borrow checker that you're allowed to do a thing. It's more like "performance permission slips".
You don't have to play this game - you can always write within unsafe { ... } like in plain old C or C++. But people do choose to play this game because it helps them to write code that is also correct, where "correct" has an old-school meaning of "actually doing what it is supposed to do and not doing what it's not supposed to".
That just makes it seem like there's no point in using this language in the first place.
Don't let perfect be the enemy of good.
Software is built on abstractions - if all your app code is written without unsafe and you have one low-level unsafe block to allow for something, you get the value of Rust for all your app logic, and you know the actual bug is in the unsafe code.
...Except that Rust is thread-safe, so expressing your algorithm in terms that the borrow checker accepts makes safe parallelism possible, as shown in the example using Rayon to trivially parallelize an operation. This is the whole point of Rust, and to say that C and C++ fail at thread-safety would be the understatement of the century.
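For anyone who hasn't seen it, the canonical Rayon illustration (not necessarily the article's exact example) is a one-word change:

    use rayon::prelude::*;

    fn sum_of_squares(input: &[i64]) -> i64 {
        // iter() -> par_iter() is the entire diff; the compiler's
        // Send/Sync checks guarantee the parallel version is free of
        // data races.
        input.par_iter().map(|&x| x * x).sum()
    }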
This is an issue that you would face in any language with strong typing.
It only rears its head in Rust because Rust tries to give you both low-level control and strong types.
For example, in something like Go (which has a weaker type system than Rust), you wouldn't think twice about paying for the re-allocation in the buffer-reuse example.
Of course, in something like C or C++ you could do these things via simple pointer casts, but then you run the risk of invoking undefined behaviour.
In C I wouldn't use such a fluffy high-level approach in the first place. I wouldn't use contiguous unbounded vec-slices. And no, I wouldn't attempt trickery with overwriting input buffers. That's a bad, inflexible approach that will bite at the next refactor. Instead, I would first make sure there's a way to cheaply allocate fixed-size buffers (like 4K buffers or whatever) and stream into those. Memory should be used in an allocate/write-once/release fashion whenever possible. This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
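Translated into this thread's language, a minimal sketch of that allocate/write-once/release pattern might look like this (the 4K chunk size and the sink callback are illustrative, not anything from the article):

    use std::io::Read;

    const CHUNK: usize = 4096;

    fn stream<R: Read>(mut src: R, mut sink: impl FnMut(Vec<u8>)) -> std::io::Result<()> {
        loop {
            let mut buf = vec![0u8; CHUNK]; // allocate
            let n = src.read(&mut buf)?;    // write once
            if n == 0 {
                return Ok(());
            }
            buf.truncate(n);
            sink(buf); // hand off; the consumer releases it
        }
    }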
> In C I wouldn't use such a fluffy high-level approach in the first place.
Sure, though that's because C has abstraction like Mars has a breathable atmosphere.
> This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
This claim is wild considering that Rust code is more bug-free than C code while being just as efficient, and keeping in mind that Rust makes parallelism so much easier than C that it stops being funny and starts being tragic.
> in something like C or C++ you could do these things via simple pointer casts
No you don't. You explicitly start a new object lifetime at the address, either of the same type or a different type. There are standard mechanisms for this.
Developers who can't be bothered to do things correctly are why languages like Rust exist.
Yup -- yet another article only solving language-level problems instead of teaching something about real constraints (i.e. hardware performance characteristics). Booooring. This kind of article is why I still haven't mustered the energy to get up to date with Rust. I'm still writing C (or C-in-C++) and having fun, most of the time feeling like I'm solving actual technical problems.
The rayon thing is neat.
> Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.
I liked most of the tricks but this one seems pointless. This is no different from a transmute, as accessing the borrowed contents requires an assume_init, which I believe is technically UB when called on an uninit value. Unless the point is that you're going to be working with Owned but want to just transmute the Vec safely.
Overall I like the into_iter/collect trick to avoid unsafe. It was also most of the article, just various ways to apply this trick in different scenarios. Very neat!
You misunderstood the purpose of that trick. The vector is not going to be accessed again, the idea is to move it to another thread so it can be dropped in parallel (never accessed).
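If that reading is right, a minimal version of the trick might look like this (not the article's exact code). Note there is no assume_init anywhere, because the laundered contents are never read:

    use std::mem::MaybeUninit;

    fn drop_elsewhere(v: Vec<&str>) {
        // Replace every borrowed element with an uninitialised
        // placeholder; same size and alignment, so the heap allocation
        // is reused rather than reallocated.
        let laundered: Vec<MaybeUninit<&'static str>> =
            v.into_iter().map(|_| MaybeUninit::uninit()).collect();
        // The Vec now has no non-'static lifetimes, so it can be moved
        // to another thread and freed there in parallel, never accessed.
        std::thread::spawn(move || drop(laundered));
    }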
> Even if it were stable, it only works with slices of primitive types, so we’d have to lose our newtypes (SymbolId etc).
That's weird. I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".
I don't like magical stdlibs and I don't like user types being less powerful than built-in ones.
Clever workaround doing a no-op transformation of the whole vector though! Very nearly zero-cost.
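For reference, the workaround being praised is presumably something along these lines (sketched with plain u32 rather than the article's newtypes):

    use std::sync::atomic::AtomicU32;

    // Same size and alignment on both sides, so the standard library's
    // in-place collect specialization reuses the original heap
    // allocation instead of allocating a new vector.
    fn into_atomics(v: Vec<u32>) -> Vec<AtomicU32> {
        v.into_iter().map(AtomicU32::new).collect()
    }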
> It would be possible to ensure that the proper Vec was restored for use-cases where that was important, however it would add extra complexity and might be enough to convince me that it’d be better to just use transmute.
Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.
> The optimisation in the Rust standard library that allows reuse of the heap allocation will only actually work if the size and alignment of T and U are the same
Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.
> I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".
This might be related in part to the fact that Rust chose to create specific AtomicU8/AtomicU16/etc. types instead of going for Atomic<T> like in C++. The reasoning for forgoing the latter is [0]:
> However the consensus was that having unsupported atomic types either fail at monomorphization time or fall back to lock-based implementations was undesirable.
That doesn't mean that one couldn't hypothetically try to write from_mut_slice<T> where T is a transparent newtype over one of the supported atomics, but I'm not sure whether that function signature is expressible at the moment. Maybe if/when safe transmutes land, since from_mut_slice is basically just doing a transmute?
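One way the signature might be expressible today is with an unsafe marker trait. Everything below is made up for illustration, not a std API:

    use std::sync::atomic::AtomicU32;

    // SAFETY contract: implementors promise to be #[repr(transparent)]
    // over u32. A safe-transmute feature could derive this automatically.
    unsafe trait TransparentU32 {}

    #[repr(transparent)]
    struct SymbolId(u32);
    unsafe impl TransparentU32 for SymbolId {}

    fn from_mut_slice<T: TransparentU32>(v: &mut [T]) -> &mut [AtomicU32] {
        // SAFETY: T is layout-identical to u32 by the trait's contract,
        // and AtomicU32 has u32's size and alignment on targets where it
        // exists. The cast preserves the slice length.
        unsafe { &mut *(v as *mut [T] as *mut [AtomicU32]) }
    }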
> Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.
I think this optimization does what you say? A quick skim of the source code [1] seems to show that the alignments don't have to exactly match:
//! # Layout constraints
//! <snip>
//! Alignments of `T` must be the same or larger than `U`. Since alignments are always a power
//! of two _larger_ implies _is a multiple of_.
And later:
    const fn in_place_collectible<DEST, SRC>(
        step_merge: Option<NonZeroUsize>,
        step_expand: Option<NonZeroUsize>,
    ) -> bool {
        if const { SRC::IS_ZST || DEST::IS_ZST || mem::align_of::<SRC>() < mem::align_of::<DEST>() } {
            return false;
        }
        // Other code that deals with non-alignment conditions
    }

[0]: https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...
[1]: https://github.com/rust-lang/rust/blob/c58a5da7d48ff3887afe4...
Just want to know some hacking tricks.