It's nice to see some high-performance linear algebra code done in a modern language! Would love to see more!
Is your approach specific to the case where the matrix fits inside cache, but the memory footprint of the basis causes performance issues? Most of the communication-avoiding Krylov works I've seen, e.g. [0, 1], seem to assume that if the matrix fits, so will its basis, and so end up doing some row-wise partitioning for the 'large matrix' case; I'm curious what your application is.
[0] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-... (e.g. page 25)
[1] https://www2.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-...
How accurate is this two-pass approach in general? From my outsider's perspective, it always looked like most of the difficulty in implementing Lanczos was reorthogonalization, which will be hard to do with the two-pass algorithm.
Or is this mostly a problem when you actually want to calculate the eigenvectors themselves, and not just matrix functions?
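For context, my rough mental model of the two-pass idea is the generic scheme sketched below (a hedged sketch, not the post's actual code; second_pass, matvec and friends are made-up names): a first pass runs the Lanczos recurrence keeping only the scalars of the tridiagonal T, you solve the small problem y = ||b|| f(T) e1, and a second pass reruns the same recurrence to regenerate each basis vector and accumulate x = V_k y on the fly. Since the basis is never stored, full reorthogonalization has nothing to orthogonalize against, which is where my question comes from.

    // Sketch of the second pass only: `alphas`/`betas` are the entries of T
    // saved during the first pass, and y = ||b|| * f(T) * e1 has already been
    // computed on the small k-by-k tridiagonal problem.
    fn second_pass(
        matvec: impl Fn(&[f64]) -> Vec<f64>, // applies the Hermitian matrix A
        b: &[f64],
        alphas: &[f64], // diagonal of T (length k)
        betas: &[f64],  // off-diagonal of T (length k - 1)
        y: &[f64],      // y = ||b|| * f(T) * e1 (length k)
    ) -> Vec<f64> {
        let n = b.len();
        let norm = b.iter().map(|v| v * v).sum::<f64>().sqrt();
        let mut v_prev = vec![0.0; n];                                   // v_{j-1}
        let mut v_curr: Vec<f64> = b.iter().map(|v| v / norm).collect(); // v_j
        let mut x = vec![0.0; n];
        let mut beta_prev = 0.0;
        for j in 0..y.len() {
            // accumulate x = V_k * y one regenerated Lanczos vector at a time
            for i in 0..n {
                x[i] += y[j] * v_curr[i];
            }
            if j + 1 == y.len() {
                break; // v_{k+1} is never needed
            }
            // regenerate v_{j+1} from the stored recurrence coefficients;
            // only two length-n vectors are ever held in memory
            let w = matvec(&v_curr);
            let next: Vec<f64> = (0..n)
                .map(|i| (w[i] - alphas[j] * v_curr[i] - beta_prev * v_prev[i]) / betas[j])
                .collect();
            beta_prev = betas[j];
            v_prev = std::mem::replace(&mut v_curr, next);
        }
        x
    }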
I leafed through your thesis and will now set aside some time in the future to learn more about succinct data structures.
I hope you get your payday; your blog is great!
Thanks!! I'm currently working on expanding that work. I will post something for sure when it's done.
Nice result! Arnoldi is a beautiful algorithm, and this is a good application of it.
What are you using this for and why are you working on it?
I admit I'm not personally convinced of the value of Rust in numerics, but that's just me, I guess...
Hi there, thanks! I started doing this for a university exam and got carried away a bit.
Regarding Rust for numerical linear algebra, I kinda agree with you. I think that, theoretically, it's a great language for writing low-level "high-performance mathematics." That's why I chose it in the first place.
The real wall is that the past four decades of research in this area have primarily been conducted in C and Fortran, making it challenging for other languages to catch up without relying heavily on BLAS/LAPACK and similar libraries.
I'm starting to notice that more people are trying to move to Rust for this stuff, so it's worth keeping an eye on libraries like the one I used, faer.
Nice. I'd be curious to see if this has already been done in the literature. It is a very nice and useful result, but it is also kind of an obvious one---so I have to assume people who do work on computing matrix functions are aware of it... (This is not to take anything away from the hard work you've done! You may just appreciate having a reference to any existing work that is already out there.)
Of course, what you're doing depends on the matrix being Hermitian, which reduces the upper Hessenberg matrix in the Arnoldi iteration to tridiagonal form. Trying to do a similar streaming computation on a general matrix is going to run into problems.
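To spell the observation out (standard Arnoldi notation): the iteration gives A V_k = V_k H_k + h_{k+1,k} v_{k+1} e_k^T with orthonormal V_k, so

    H_k = V_k^* A V_k                          (upper Hessenberg by construction)
    A = A^*  =>  H_k^* = V_k^* A^* V_k = H_k   (Hermitian as well)

and a matrix that is both upper Hessenberg and Hermitian is tridiagonal. That is what collapses the full Gram-Schmidt sweep w = A v_j - sum_{i<=j} h_{ij} v_i into the three-term recurrence beta_j v_{j+1} = A v_j - alpha_j v_j - beta_{j-1} v_{j-1}, so only the two most recent vectors are ever needed. For a general A every h_{ij} can be nonzero, and you're back to holding the whole basis.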
That said... one area of numerical linear algebra research which is very active is randomized numerical linear algebra. There is a paper by Nakatsukasa and Tropp ("Fast and accurate randomized algorithms for linear systems and eigenvalue problems") which presents some randomized algorithms, including a "randomized GMRES" which IIRC is compatible with streaming. You might find it interesting trying to adapt the machinery this algorithm is built on to the problem you're working on.
As for Rust, having done a lot of this research myself... there is no problem relying on BLAS or LAPACK, and I'm not sure this could be called a "wall". There are also many alternative libraries actively being worked on. BLIS, FLAME, and MAGMA are examples that come to mind... but there are so many more. Obviously Eigen is also available in C++. So, I'm not sure this alone justifies using Rust... Of course, use it if you like it. :)
Sorry for the late answer.
The blog post is a simplification of the actual work; you can check out the full report here [1], where I also reference the literature about this algorithm.
On the cache effects: I haven't seen this "engineering" argument made explicitly in the literature either. There are other approaches to the basis storage problem, like the compression technique in [2]. Funny enough, the authors gave a seminar at my university literally this afternoon about exactly that.
I'm also unfamiliar with randomised algorithms for numerical linear algebra beyond the basics. I'll dig into that, thanks!
On the BLAS point, let me clarify what I meant by "wall": when you call BLAS from Rust, you're essentially making a black-box call to pre-compiled Fortran or C code. The compiler loses visibility into what happens across that boundary. You can't inline, can't specialise for your specific matrix shapes or use patterns, can't let the compiler reason about memory layout across the whole computation. You get the performance of BLAS, sure, but you lose the ability to optimise the full pipeline.
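To make that concrete, compare what the compiler can see in the two cases (a hedged illustration, not code from the post; matvec is a made-up toy kernel):

    // An FFI declaration for a prebuilt BLAS routine is a black box: rustc/LLVM
    // know the symbol and its signature, and nothing about what it reads,
    // writes, or could be fused with. (Fortran dgemv: y <- alpha*A*x + beta*y.)
    extern "C" {
        fn dgemv_(trans: *const std::os::raw::c_char, m: *const i32, n: *const i32,
                  alpha: *const f64, a: *const f64, lda: *const i32,
                  x: *const f64, incx: *const i32,
                  beta: *const f64, y: *mut f64, incy: *const i32);
    }

    // A native generic kernel, by contrast, is visible to the optimiser: it can
    // be inlined into the caller, monomorphised for the element type, and
    // vectorised together with whatever code surrounds the call site.
    fn matvec<T>(a: &[T], x: &[T], y: &mut [T], n: usize)
    where
        T: Copy + Default + std::ops::Mul<Output = T> + std::ops::AddAssign,
    {
        for (i, yi) in y.iter_mut().enumerate() {
            let mut acc = T::default();
            for j in 0..n {
                acc += a[i * n + j] * x[j]; // dense row-major n-by-n matrix
            }
            *yi = acc;
        }
    }

Obviously a hand-rolled matvec won't beat a tuned BLAS kernel in isolation; the point is only about what the optimiser can reason about across that boundary.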
Also, Rust's compilation model flattens everything into one optimisation unit: your code, dependencies, all compiled together from source. The compiler sees the full call graph and can inline, specialise generics, and vectorise across what would be library boundaries in C/C++. The borrow checker also proves at compile time that operations like our pointer swaps are safe and that no aliasing occurs, which enables more aggressive optimisations; the compiler can reorder operations and keep values in registers because it has proof about memory access patterns. With BLAS, you're calling into opaque binaries where none of this analysis is possible.
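The pointer-swap point, for instance, boils down to something like this (simplified sketch, not the actual code from the post; iterate and step are made-up names):

    // Ping-pong buffers in an iterative kernel. The two `&mut Vec<f64>` borrows
    // are statically guaranteed to be disjoint, so rustc marks them `noalias`
    // and LLVM may keep values in registers and reorder memory operations
    // across the loop body instead of conservatively reloading.
    fn iterate(step: impl Fn(&[f64], &mut [f64]), v: &mut Vec<f64>, w: &mut Vec<f64>, iters: usize) {
        for _ in 0..iters {
            step(v.as_slice(), w.as_mut_slice()); // read v, write w: provably non-overlapping
            std::mem::swap(v, w);                 // O(1) swap of the Vec handles, no data copied
        }
    }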
My point is that if the core computation just calls out to pre-compiled C or Fortran, you lose much of what makes Rust interesting for numerical work in the first place. That's why I hope to see more efforts directed towards expanding the Rust ecosystem in this area in the future :)
[1] https://github.com/lukefleed/two-pass-lanczos/raw/master/tex...
[2] https://arxiv.org/abs/2403.04390
Fantastic post; I'm not much of a mathematician, but the writing and logical progression were so clearly articulated, I was able to follow the gist the whole way through. Kudos!
Nice work. I have gone through the paper, which is fairly straightforward to follow.
May I ask what you've used to confirm the cache hit/miss rate? Thanks!
The comments here might be a good precursor to defending your thesis -- good luck with that, btw!