Introduction
In part 1 of this series, we reproduced the speedup over Python from part 1 of a Modular blog post, but in Rust. We achieved a 46x speedup over the base case. In this post, we will try to reproduce parts 2 and 3 of the Modular blog.
SIMD
We will first explore SIMD instructions. SIMD stands for Single Instruction, Multiple Data: a single instruction operates on multiple data elements in parallel.
Currently, Rust nightly is required to use SIMD instructions. It can be installed with the following command:
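```bash
rustup toolchain install nightly
```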
The nightly toolchain can be invoked with the `+nightly` argument to the `cargo` command. Since we invoke `cargo` indirectly through the `maturin` tool, we instead add a `rust-toolchain.toml` file in the root of the project with the following content:
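```toml
[toolchain]
channel = "nightly"
```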
`maturin` is the build tool for PyO3, used to build Rust packages for Python.
With SIMD, we achieved an execution time of 30.2 ms, which is a 213x single-core speedup over the base case.
Parallelization
Part of the speedup achieved in the Modular blog is through parallelization, which is not a fair comparison to the single-core Python baseline. It is worth noting that Python also has parallel processing libraries such as Ray and Dask. For the sake of comparison, we tried parallelization in Rust using Rayon.
Using my Mac Pro with 6 cores and 12 hyper-threads, we achieved an execution time of 18.2 ms without SIMD, which is a 354x speedup over the base case. With SIMD, we achieved an execution time of 4.8 ms, which is a 1342x speedup over the base case.
Summary
| Method | Time (s) | Speedup |
|---|---|---|
| Baseline | 6.44 | 1.00 |
| Numpy vectorization | 6.37 | 1.01 |
| Numpy parallelization | 1.71 | 3.77 |
| Numba | 0.681 | 9.46 |
| Rust | 0.141 | 45.74 |
| Rust SIMD | 0.0302 | 213.24 |
| Rust + Rayon | 0.0182 | 354.40 |
| Rust SIMD + Rayon | 0.0048 | 1342.50 |
See the source code for the full implementation.
- We achieved a single-core speedup of 213x: 46x from converting Python to Rust and a further 4.7x from SIMD.
- We do not believe it is fair to compare multi-core Rust to the single-core Python baseline, as Python itself has some excellent parallel processing libraries. Nevertheless, we demonstrated that Rust code can be parallelized by changing a single method call. A speedup of 6-8x is achievable on a 6-core CPU (12 hyper-threads). Parallelization itself is an interesting subject, especially for tasks of uneven lengths. We will explore this in future posts.
- Our SIMD code has a separate code base. We believe it is possible to combine the SIMD and non-SIMD code bases using Rust generics and traits. We will explore this in future posts.
Appendix 1: Rayon
Rayon is an amazing library for parallel processing. It takes just one method change to parallelize the code. In the example below, we literally just change `chunks_mut` to `par_chunks_mut` and the code runs in parallel.
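Below is a minimal sketch of this pattern, not the actual implementation: the buffer size is arbitrary and `kernel` is a hypothetical stand-in for the real per-element computation.

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the real per-element computation.
fn kernel(i: usize) -> i64 {
    (i as i64) % 7
}

fn main() {
    let width = 1_000;
    let mut out = vec![0i64; width * width];

    // Serial version:
    // out.chunks_mut(width).enumerate().for_each(|(row, chunk)| { ... });

    // Parallel version: the only change is `chunks_mut` -> `par_chunks_mut`.
    out.par_chunks_mut(width)
        .enumerate()
        .for_each(|(row, chunk)| {
            for (col, px) in chunk.iter_mut().enumerate() {
                *px = kernel(row * width + col);
            }
        });
}
```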
The Rayon code can be further optimized to account for the fact that the kernel function can exit early, so chunks can take uneven amounts of time. We will explore this in future posts. For now, I will refer readers to this project for inspiration.
Appendix 2: SIMD
SIMD introduces a few new concepts. The first is the SIMD vector types, for example:
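The snippets in this appendix are sketches written in the style of the `packed_simd` crate mentioned at the end of this post; the newer portable `std::simd` module offers the same concepts under slightly different names (for example `mask32x8` instead of `m32x8`).

```rust
use packed_simd::{f32x8, i32x8, m32x8};

// f32x8: 8 lanes of f32
// i32x8: 8 lanes of i32
// m32x8: 8 lanes of 32-bit masks (vector booleans)
```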
In the above example, each type is a vector of length 8. `m32` is a mask type that we will cover later.
To create a vector from a scalar, we use the `splat` method:
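```rust
// All 8 lanes hold the same value.
let x = f32x8::splat(1.5); // [1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
```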
We can use arithmetic operators on vectors like we do on scalars:
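```rust
let a = f32x8::splat(2.0);
let b = f32x8::splat(3.0);

// Element-wise operations, just like scalar arithmetic:
let sum = a + b;     // every lane is 5.0
let product = a * b; // every lane is 6.0
```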
Now let us go back to `mask`. A `mask` can be used as a vector of booleans, and comparisons between vectors return a `mask`. We can use the `select` method as `if ... else ...` logic for vectors:
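```rust
let a = f32x8::splat(2.0);
let b = f32x8::splat(3.0);

// Lane-wise comparison returns a mask (an m32x8 here).
let mask = a.gt(b);

// For each lane: if the mask is true take the lane from `a`,
// otherwise take it from `b` -- `if ... else ...` for vectors.
let max = mask.select(a, b); // every lane is 3.0
```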
There is a lot more to SIMD. I recommend the Rust SIMD book for more details.
Although `packed-simd` is being replaced by portable-simd, the former still has better documentation.
Last modified on 2023-11-19