How Rust gets an x speedup over Python - Part 2

Introduction

In the part 1 of this series, we tried to reproduce the speed up over Python in part 1 of a Modular blog, but in `Rust``. We achieved 46x speedup over the base case. In this blog, we will try to reproduce part 2 and part 3 of the modular blog.

SIMD

We will first explore the SIMD instructions. SIMD stands for Single Instruction Multiple Data. SIMD instructions are a type of instructions that can be executed on multiple data in parallel.

Currently, it is necessary to use Rust nightly to use SIMD instructions. Rust nightly can be installed with the following command:

1
rustup toolchain install nightly

The nightly toolchain can be invoked with the +nightly argument in the cargo command. Since we invoke the cargo command indirectly through the maturin tool, we instead add a rust-toolchain.toml file in the root of the project with the following content:

1
2
[toolchain]
channel = "nightly"

maturin is the build tool for PyO3 used to build Rust packages for Python.

We achieved an execution time of 30.2ms which is a 213x single-core speedup over the base case.

Parallelization

Part of speedup achieved in the Modular blog is through parallelization. This is not a fair comparison to the single-core Python baseline. It is worth nothing Python also have parallel processing libraries like Ray or Dask. For the sake of comparison, we tried parallelization in Rust using Rayon.

Using my macPro with 6 cores and 12 hyperthreads, we achieved an execution time of 18.2ms without SIMD, which is a 354x speedup over the base case. With SIMD, we achieved an execution time of 4.8ms, which is a 1342x speedup over the base case.

Summary

MethodTime (s)Speedup
Baseline6.441.00
Numpy vectorization6.371.01
Numpy parallelization1.713.77
Numba0.6819.46
Rust0.14145.74
Rust SIMD0.0302213.24
Rust + Rayon0.0182354.40
Rust SIMD + Rayon0.00481342.50

See the source code for the full implementation.

  1. We have achieved single core speed up of 213. Of them, 46x is from converting Python to Rust and 4.7x from SIMD.
  2. We did not believe it is a fair to compare multi-core Rust to single core Python baseline as Python itself has some excellent parallel processing libraries. Nevertheless, we demonstrated that Rust code can be parallelized with the change of a single method call. A speed-up of 6-8x is achievable on a 6 core CPU (12 hyper-thread). Parallelization itself is an interesting subject, especially on tasks of different lengths. We will explore this in the future posts.
  3. Out SIMD code has a separate code base. We believe that it is possible to combine the SIMD and non-SIMD code base using Rust generics and traits. We will explore this in the future posts.

Appendix 1: Rayon

Rayon is an amazing library for parallel processing. It takes just one method change to parallelize the code. In the example below, we literally just change chunks_mut to par_chunks_mut and the code is running in parallel.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
    if parallel {
        out.par_chunks_mut(width_in_blocks)
            .enumerate()
            .for_each(|(i, row)| {
                let y = f32s::splat(min_y + dy * (i as f32));
                row.iter_mut().enumerate().for_each(|(j, count)| {
                    let x = xs[j];
                    let z = Complex { real: x, imag: y };
                    *count = MandelbrotIter::new(z).count(iters);
                });
            });
    } else {
        out.chunks_mut(width_in_blocks)
            .enumerate()
            .for_each(|(i, row)| {
                let y = f32s::splat(min_y + dy * (i as f32));
                row.iter_mut().enumerate().for_each(|(j, count)| {
                    let x = xs[j];
                    let z = Complex { real: x, imag: y };
                    *count = MandelbrotIter::new(z).count(iters);
                });
            });
    }

The Rayon code can be further optimized to account for the fact that the kernel function can exist early, and thus uneven length. We will explore this in the future posts. For now, I will refer users to this project for inspiration.

Appendix 2: SIMD

SIMD does have a few new concepts. The first is the SIMD vector types, for example:

1
2
3
type u32s = u32x8;
type f32s = f32x8;
type m32s = m32x8;

In the above example, each type is a vector of length 8. m32 is a mask type that we will cover later.

To create a vector from a scalar, we use the splat method:

1
32s::splat(4.0)

We can use arithmetic operators on vectors like we do on scalars:

1
2
3
let xx = x * x;
let yy = y * y;
let sum = xx + yy;

Now let us go back to mask. mask can be used as a vector of boolean. Comparisons between vectors return a mask. we can use the select method like if ... else ... logic for vectors:

1
2
let mask = sum.le(f32s::splat(4.0));
let count = mask.select(count + 1, count);

There are a lot more to SIMD. I recommend the Rust SIMD book for more details. Although packed-simd is getting replaced by portable-simd. The former is still has the better documentation.


Last modified on 2023-11-19