Unleash the Power of Auto-Vectorization in Rust with LLVM

Photo by Simon Berger

When it comes to software performance, every CPU cycle counts. That’s why understanding the capabilities of your compiler is crucial. If you’re coding in Rust, you’re in luck: the LLVM compiler backend offers a potent optimization called auto-vectorization. In this post, we’ll delve into what auto-vectorization is, how it works in Rust, and how you can make the most of this powerful optimization technique.

Demystifying Auto-Vectorization

Auto-vectorization is an advanced optimization mechanism employed by LLVM and other cutting-edge compilers. Its core function? Transforming scalar operations—those that deal with individual data elements—into vector operations, which manipulate multiple data elements simultaneously. This transformation is guided by a singular objective: to capitalize on the Single Instruction Multiple Data (SIMD) architecture that modern CPUs are equipped with, thereby accelerating computation.

Vectorized code has the remarkable ability to conduct multiple operations concurrently within a single CPU cycle. This stands in stark contrast to scalar code, which executes these operations in a linear fashion, potentially incurring performance bottlenecks. During the compilation process, the compiler scrutinizes your code to pinpoint loops or instruction sequences that can be seamlessly converted into their vectorized counterparts without altering the program’s intended behavior.

Today’s processors come with specialized SIMD units, purpose-built for vector operations. By turning scalar operations into their vectorized equivalents, LLVM effectively taps into these SIMD units, unlocking a substantial speed boost for your Rust code.
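
To make this concrete, here is the transformation expressed in ordinary Rust. This is a conceptual sketch of my own, not compiler output: it shows how a scalar loop maps onto the four f32 lanes of a 128-bit SSE register.

fn scale_scalar(data: &mut [f32], factor: f32) {
    // Scalar: one multiplication per loop iteration.
    for x in data.iter_mut() {
        *x *= factor;
    }
}

fn scale_conceptually_vectorized(data: &mut [f32], factor: f32) {
    let split = data.len() / 4 * 4;
    let (vector_part, remainder) = data.split_at_mut(split);
    for chunk in vector_part.chunks_exact_mut(4) {
        // Conceptually, a single `mulps` instruction handles all four lanes at once.
        for lane in chunk.iter_mut() {
            *lane *= factor;
        }
    }
    // Leftover elements are handled by a scalar remainder loop.
    for x in remainder.iter_mut() {
        *x *= factor;
    }
}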

The Inner Workings of Auto-Vectorization in Rust

At the heart of Rust’s compilation process is its native compiler, rustc, which lowers your Rust code into LLVM’s Intermediate Representation (LLVM IR). LLVM then runs a series of optimization passes over this IR, and auto-vectorization is a notable contender among them.

The auto-vectorization pipeline unfolds in a three-step choreography:

  1. Legality analysis: LLVM verifies that the loop can be rewritten with vector instructions without changing the program’s observable behavior.

  2. Profitability analysis: a cost model estimates whether the vectorized version would actually run faster on the target CPU.

  3. Transformation: if both checks pass, LLVM rewrites the scalar loop into its vector form, typically adding a scalar remainder loop for any leftover elements.

To make the most out of auto-vectorization, it’s imperative to craft your code in a manner that’s conducive for LLVM to identify and optimize. This often translates to employing idiomatic Rust patterns and ensuring that your loops exhibit transparent and predictable structures, as the sketch below illustrates.
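
For instance, indexed loops force LLVM to reason about bounds checks and potential panics, while the equivalent iterator form hands it a trip count it can trust. A minimal sketch of the two shapes (the function names are mine):

// Indexed form: every a[i] and b[i] carries a bounds check, and the
// possibility of a panic mid-loop can keep LLVM from vectorizing.
pub fn add_indexed(a: &[f32], b: &[f32], out: &mut [f32]) {
    for i in 0..out.len() {
        out[i] = a[i] + b[i];
    }
}

// Iterator form: zip caps the trip count at the shortest slice, so there
// are no bounds checks and no early exits. This shape vectorizes readily.
pub fn add_zipped(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((x, y), z) in a.iter().zip(b.iter()).zip(out.iter_mut()) {
        *z = x + y;
    }
}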

Nonetheless, optimizing for auto-vectorization is far from trivial. The ability to vectorize a loop is contingent upon a myriad of factors—ranging from the loop’s internal structure and data dependencies to the compiler’s proficiency in deducing that the vectorization process won’t adulterate the program’s intended functionality.

In the sections that follow, we’ll explore hands-on examples to demystify the mechanics of auto-vectorization in Rust.

Case 1: When Auto-Vectorization Eludes Us

To better understand the intricacies of auto-vectorization, let’s start by examining a Rust code snippet that, at first glance, might seem a candidate for this optimization but isn’t.

pub fn complex_calculation(a: &mut [f32], b: &mut [f32]) {
    // Running product: each iteration's value depends on the previous one.
    let mut acc: f32 = 1.0;
    for (x, y) in a.iter_mut().zip(b.iter_mut()) {
        acc *= *x;
        *y = acc;
    }
}

This Rust function, named complex_calculation, accepts two mutable slices of floating-point numbers (f32). It walks both slices in lockstep, multiplying a running product acc by each element of the first slice and storing the product’s current value into the corresponding element of the second slice.

When this code is subjected to LLVM’s scrutiny, the loop within complex_calculation is quickly dismissed as a candidate for auto-vectorization. The culprit? A loop-carried dependency between the accumulator acc and the slice element y: the value written to *y in each iteration depends on the acc computed by every preceding iteration, so the iterations cannot run in parallel SIMD lanes, and LLVM categorically rules out safe vectorization.

To delve deeper, let’s compile this code and inspect the resultant assembly output on an x86-64 machine. For brevity, we’ll skip parts of the assembly code:

.LBB0_8:
        mulss   xmm0, dword ptr [rdi + 4*rsi]
        movss   dword ptr [rdx + 4*rsi], xmm0
        mulss   xmm0, dword ptr [rdi + 4*rsi + 4]
        movss   dword ptr [rdx + 4*rsi + 4], xmm0
        mulss   xmm0, dword ptr [rdi + 4*rsi + 8]
        movss   dword ptr [rdx + 4*rsi + 8], xmm0
        mulss   xmm0, dword ptr [rdi + 4*rsi + 12]
        lea     r8, [rsi + 4]
        movss   dword ptr [rdx + 4*rsi + 12], xmm0
        mov     rsi, r8
        cmp     rcx, r8
        jne     .LBB0_8

Notice the mulss and movss instructions: the trailing ss stands for “scalar, single-precision,” meaning each one handles a single f32, in contrast to their packed counterparts mulps and movups. LLVM has unrolled the loop four times, but every operation remains scalar, confirming that the loop has indeed escaped vectorization.
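
For reference, assembly like this can be produced with rustc directly. One possible invocation, assuming the function lives in a file named lib.rs (the llvm-args flag simply switches the output to the Intel syntax shown above):

rustc --crate-type=lib -C opt-level=3 --emit=asm -C llvm-args=-x86-asm-syntax=intel lib.rs

The resulting lib.s contains the assembly for the whole crate.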

For further confirmation, we can compile the code with the --emit=llvm-ir flag to examine the LLVM IR generated:

  %_14 = load float, ptr %_3.i.i, align 4, !dbg !68
  %4 = fmul float %acc.07, %_14, !dbg !70
  store float %4, ptr %_3.i5.i, align 4, !dbg !71
  %5 = or i64 %iter.sroa.8.06, 2, !dbg !47
  %_3.i.i.1 = getelementptr inbounds float, ptr %a.0, i64 %3, !dbg !49
  %_3.i5.i.1 = getelementptr inbounds float, ptr %b.0, i64 %3, !dbg !65
  %_14.1 = load float, ptr %_3.i.i.1, align 4, !dbg !68
  %6 = fmul float %4, %_14.1, !dbg !70
  store float %6, ptr %_3.i5.i.1, align 4, !dbg !71
  %7 = or i64 %iter.sroa.8.06, 3, !dbg !47
  %_3.i.i.2 = getelementptr inbounds float, ptr %a.0, i64 %5, !dbg !49
  %_3.i5.i.2 = getelementptr inbounds float, ptr %b.0, i64 %5, !dbg !65
  %_14.2 = load float, ptr %_3.i.i.2, align 4, !dbg !68
  %8 = fmul float %6, %_14.2, !dbg !70
  store float %8, ptr %_3.i5.i.2, align 4, !dbg !71
  %9 = add nuw i64 %iter.sroa.8.06, 4, !dbg !47
  %_3.i.i.3 = getelementptr inbounds float, ptr %a.0, i64 %7, !dbg !49
  %_3.i5.i.3 = getelementptr inbounds float, ptr %b.0, i64 %7, !dbg !65
  %_14.3 = load float, ptr %_3.i.i.3, align 4, !dbg !68
  %10 = fmul float %8, %_14.3, !dbg !70

LLVM IR is admittedly verbose. However, two telltale signs of successful auto-vectorization are vector types such as <4 x float> and shufflevector instructions, which are employed for rearranging vector elements. Our code snippet shows neither: every load, fmul, and store operates on a lone float, reinforcing the conclusion that LLVM has opted not to vectorize the loop.

As this case study illustrates, auto-vectorization is fraught with subtleties. The loop in question was not a candidate for vectorization due to the inherent data dependency between acc and y: each iteration needs the product computed by the one before it, which makes the loop inherently sequential and impossible for LLVM to vectorize safely.

Case 2: A Prime Candidate for Auto-Vectorization

Now, let’s pivot to another Rust code snippet. Will LLVM deem this loop worthy of auto-vectorization? Take a moment to ponder.

pub fn mul_arrays(a: &mut [f32], b: &[f32], c: f32) {
    for (el_1, el_2) in a.iter_mut().zip(b.iter()) {
        *el_1 = el_2 * c;
    }
}

Here, we have a function named mul_arrays that accepts a mutable f32 slice, an immutable f32 slice, and a scalar f32 value. The function traverses both slices in tandem, multiplies each element of the second slice by the scalar, and places the result into the corresponding element of the first slice.

Upon examining this code, LLVM earmarks the loop within mul_arrays as a strong candidate for auto-vectorization. Why? Because there are no data dependencies hindering the process. The value of el_1 is overwritten in each iteration and doesn’t rely on previous iterations, paving the way for LLVM to vectorize the loop safely.

The golden rule here is straightforward: LLVM will only venture into vectorizing loops when it can unambiguously ascertain that the vectorized rendition will preserve the original program’s semantics.

To validate this, let’s compile the code and examine the resulting assembly code on an x86-64 machine. You’ll notice that LLVM has indeed vectorized the loop. The assembly output flaunts instructions like mulps and movups, which are categorized as packed SIMD instructions.
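
As an illustration, the hot loop can look like the excerpt below. This is a representative sketch rather than output you should expect verbatim; labels, registers, and the unroll factor all vary with compiler version and target. The scalar c has already been broadcast into xmm0 before the loop:

.LBB0_4:
        movups  xmm1, xmmword ptr [rsi + 4*rax]
        movups  xmm2, xmmword ptr [rsi + 4*rax + 16]
        mulps   xmm1, xmm0
        mulps   xmm2, xmm0
        movups  xmmword ptr [rdi + 4*rax], xmm1
        movups  xmmword ptr [rdi + 4*rax + 16], xmm2
        add     rax, 8
        cmp     rcx, rax
        jne     .LBB0_4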

Taking it a step further, compiling the code with the --emit=llvm-ir flag reveals that LLVM has inserted shufflevector instructions and operations on vector types such as <4 x float> into the generated code. This is a definitive indicator that the loop has been auto-vectorized.
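
Schematically, the relevant IR takes the following shape (again an illustrative sketch; value names and vector widths vary). The scalar c is splatted across a vector with an insertelement/shufflevector pair, and the loop body then loads, multiplies, and stores four floats at a time:

  %broadcast.splatinsert = insertelement <4 x float> poison, float %c, i64 0
  %broadcast.splat = shufflevector <4 x float> %broadcast.splatinsert, <4 x float> poison, <4 x i32> zeroinitializer
  ...
  %wide.load = load <4 x float>, ptr %1, align 4
  %2 = fmul <4 x float> %wide.load, %broadcast.splat
  store <4 x float> %2, ptr %3, align 4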

The Tangible Benefits of Auto-Vectorization: A Performance Deep Dive

You’ve acquainted yourself with the nuts and bolts of auto-vectorization, but you might still be pondering, “Why should I care?” The succinct answer: it can dramatically turbocharge your code’s performance.

The LLVM project’s documentation has an illuminating section on its auto-vectorizers, showcasing real-world examples of auto-vectorization amplifying code performance. However, as Linus Torvalds famously quipped, “Talk is cheap. Show me the code.”

To that end, I’ve curated a GitHub repository that you can clone and run locally to empirically witness the performance gains conferred by auto-vectorization.

The repository contains a straightforward project featuring two functions that multiply arrays of f32 values:

pub fn mul_arrays_non_vectorized(a: &[f32], b: &[f32]) -> f32 {
    // Running sum: a loop-carried dependency. Floating-point addition is
    // not associative, so LLVM will not reorder it into SIMD lanes.
    let mut result = 0.0;
    for i in 0..a.len() {
        result += a[i] * b[i];
    }
    result
}

pub fn mul_arrays_vectorized(a: &[f32], b: &[f32], c: &mut [f32]) {
    // Each output element depends only on the current inputs,
    // leaving LLVM free to process several lanes per instruction.
    for ((&x, &y), z) in a.iter().zip(b.iter()).zip(c.iter_mut()) {
        *z = x * y;
    }
}

The function mul_arrays_non_vectorized accumulates a sum of products into a single result; the loop-carried dependency on result keeps LLVM from vectorizing it, since reordering the floating-point additions could change the outcome. In contrast, mul_arrays_vectorized writes each product to an independent output element, which LLVM happily compiles down to SIMD instructions. For the benchmarks, arrays containing over 1000 elements were utilized.

To run the benchmarks on your local machine, simply use cargo bench.
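
The Outliers column in the results below suggests a harness such as criterion. For orientation, a minimal, hypothetical bench of the two functions could look like this (your_crate is a placeholder for the repository’s actual crate name):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use your_crate::{mul_arrays_non_vectorized, mul_arrays_vectorized};

fn benches(c: &mut Criterion) {
    let a = vec![1.5f32; 1024];
    let b = vec![2.5f32; 1024];
    let mut out = vec![0.0f32; 1024];

    // black_box keeps the optimizer from folding the calls away entirely.
    c.bench_function("non_vectorized", |bench| {
        bench.iter(|| mul_arrays_non_vectorized(black_box(&a), black_box(&b)))
    });
    c.bench_function("vectorized", |bench| {
        bench.iter(|| mul_arrays_vectorized(black_box(&a), black_box(&b), &mut out))
    });
}

criterion_group!(bench_group, benches);
criterion_main!(bench_group);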

Here are the compelling results:

Loop Version      Time (ns), Lower Bound    Time (ns), Upper Bound    Outliers
Non-Vectorized    841.10                    845.29                    6
Vectorized        64.555                    64.858                    1

The vectorized version outperforms its non-vectorized counterpart by a striking margin: roughly thirteen times faster, well over an order of magnitude. Moreover, its performance is markedly more consistent, as evidenced by fewer outliers.

While this example is admittedly simplistic, it serves as a compelling testament to how auto-vectorization can breathe new life into your code’s performance.

Strategies for Unleashing the Full Power of Auto-Vectorization in Rust

To optimize your Rust code for auto-vectorization, here are some actionable guidelines:

  1. Embrace Simplicity in Loop Structure: LLVM favors loops with transparent control flows and uncomplicated boundaries for auto-vectorization. Eschew elaborate loop conditions and transformations that could confound the compiler.

  2. Neutralize Loop-Carried Dependencies: Loop-carried dependencies—where one iteration’s outcome relies on preceding ones—can be a stumbling block for auto-vectorization. Work on mitigating these dependencies or refactoring your code to enable parallel execution; see the sketch after this list.

  3. Adopt Idiomatic Rust Practices: Using idiomatic Rust constructs like iterators and functional programming paradigms can not only make your code more efficient but also enhance its suitability for auto-vectorization.

  4. Experiment with Optimization Levels: Don’t shy away from trying out different compiler optimization levels (e.g., -C opt-level=2 or -C opt-level=3). Higher levels may trigger more aggressive auto-vectorization at the cost of increased compilation time.

  5. Investigate Explicit SIMD Libraries: When auto-vectorization falls short of your performance expectations, consider explicit SIMD APIs such as std::simd (Rust’s portable SIMD module, nightly-only at the time of writing) or the platform intrinsics in std::arch; the older packed_simd crate has been deprecated in std::simd’s favor. These APIs grant you granular control over vectorization, allowing you to fine-tune your code’s performance.
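
To illustrate point 2, here is a sketch of one common refactoring, under the usual caveat that it reorders floating-point additions and may therefore change the result in its last bits. A dot product with a single accumulator is inherently sequential; splitting it into four independent accumulators gives LLVM lanes it can map onto SIMD registers:

// Single accumulator: every iteration waits on the previous sum,
// so LLVM cannot vectorize without reordering float additions.
pub fn dot_sequential(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for (x, y) in a.iter().zip(b.iter()) {
        sum += x * y;
    }
    sum
}

// Four independent accumulators: each lane depends only on itself,
// a dependency-free pattern that typically auto-vectorizes.
pub fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    let len = a.len().min(b.len());
    let mut acc = [0.0f32; 4];
    for i in 0..len / 4 {
        for lane in 0..4 {
            acc[lane] += a[i * 4 + lane] * b[i * 4 + lane];
        }
    }
    // Fold the lanes together, then handle the scalar remainder.
    let mut sum: f32 = acc.iter().sum();
    for i in (len / 4) * 4..len {
        sum += a[i] * b[i];
    }
    sum
}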

Epilogue: The Compounding Benefits of Auto-Vectorization in Rust

Auto-vectorization stands as an indispensable optimization in LLVM’s arsenal, capable of imparting substantial performance gains to your Rust applications. Armed with a deeper understanding of its inner workings and the strategies outlined above, you’re now well-equipped to leverage the SIMD capabilities of modern CPUs to push your Rust code to new performance heights.