osa1

When is inlining useful?

December 7, 2024 - Tagged as: en, plt.

Especially in high-level languages, inlining is most useful when it leads to:

Optimizing the callee’s body based on the arguments passed.
Optimizing the call site based on the callee’s return value.

Let’s look at some examples.

Example: avoiding redundant bounds checks

Suppose we have a library for decoding some format of binary files with length-prefixed vectors of 32-bit integers, with all integers encoded as little-endian.

A simple implementation would be:

/// Decode a length-prefixed 32-bit unsigned integer vector.
///
/// # Panics
///
/// Panics if the buffer does not have enough bytes.
pub fn decode_u32_vec(buffer: &[u8]) -> Vec<u32> {
    let len = decode_32_le(buffer, 0) as usize;
    let mut vec = vec![0; len];
    for i in 0..len {
        vec[i] = decode_32_le(buffer, (i + 1) * 4);
    }
    vec
}

/// Decode a single 32-bit unsigned integer, encoded as little-endian.
///
/// # Panics
///
/// Panics if the buffer does not have 4 bytes starting at `offset`.
#[inline(never)]  // this version isn't inlined
pub fn decode_32_le(buffer: &[u8], offset: usize) -> u32 {
    let b1 = buffer[offset];
    let b2 = buffer[offset + 1];
    let b3 = buffer[offset + 2];
    let b4 = buffer[offset + 3];
    u32::from_le_bytes([b1, b2, b3, b4])
}

This version is not ideal, because decode_32_le checks bounds on each byte access. (compiler explorer)

We can improve it by checking the bounds for all of the reads once:

#[inline(never)]  // this version isn't inlined
pub fn decode_32_le(buffer: &[u8], offset: usize) -> u32 {
    if buffer.len() < 4 || buffer.len() - 4 < offset {
        panic!();
    }
    let b1 = buffer[offset];
    let b2 = buffer[offset + 1];
    let b3 = buffer[offset + 2];
    let b4 = buffer[offset + 3];
    u32::from_le_bytes([b1, b2, b3, b4])
}

The conditional makes sure that in the rest of the function the slice indices are all within bounds, so the compiler doesn’t generate bounds checks for the accesses. The compiler is also able to optimize further now to load just one 32-bit word from the memory, instead of reading one byte at a time. (compiler explorer)

decode_32_le is quite good now, but it still has to do one bounds check, and since decode_u32_vec calls it in each iteration, it does a bounds check in each iteration.

Ideally we want to be able to do one bounds check before the loop, just like we did in decode_32_le, and then omit bounds checks within the loop:

pub fn decode_u32_vec(buffer: &[u8]) -> Vec<u32> {
    let len = decode_32_le(buffer, 0) as usize;
    if buffer.len() < (len + 1) * 4 {
        panic!();
    }
    let mut vec = vec![0; len];
    for i in 0..len {
        vec[i] = decode_32_le(buffer, (i + 1) * 4);
    }
    vec
}

But this cannot make the bounds checks in decode_32_le disappear, as it may have other call sites that may not always check the bounds before calling it.

Inlining decode_32_le in the use effectively lets us propagate the information in the call site to the callee’s code, and optimize it based on this information. If we change the inline(never) to inline in decode_32_le, with the extra bounds check in decode_u32_vec, we now check bounds once before the loop and don’t check it again in the loop. (compiler explorer)

Example: avoiding redundant error checks

Dart doesn’t have unsigned integers, and many standard library functions throw an ArgumentError when they are passed negative numbers.

One example of these functions is the unsigned right shift operator. In many call sites, the shift amount is a constant positive number. If we call the standard library function in these cases, the library function will check the sign of the arguments that we already know in the call site to be positive.

When we inline the operator in a call site where the shift argument is constant, the conditional becomes if (constant < 0) throw .... The compiler can then simplify the condition as true or false, and then simplify this code to just throw ... or eliminate the conditional.

The mix64 function linked above when compiled to Wasm, without inlining the right shift operator:

(func $mix64 (param $var0 i64) (result i64)
  ...
  local.tee $var0
  i64.const 24
  call $BoxedInt.>>>
  ...)

(func $BoxedInt.>>> (param $var0 i64) (param $var1 i64) (result i64)
  local.get $var1
  i64.const 64
  i64.lt_u
  if
    local.get $var0
    local.get $var1
    i64.shr_u
    return
  end
  local.get $var1
  i64.const 0
  i64.lt_s
  if
    i32.const 64
    local.get $var1
    struct.new $BoxedInt
    call $ArgumentError
    call $Error._throwWithCurrentStackTrace
    unreachable
  end
  i64.const 0)

With the shift operator inlined:

(func $mix64 (param $var0 i64) (result i64)
  ...
  local.get $var0
  i64.const 24
  i64.shr_u
  ...)

The call to BoxedInt.>>> with error checking is optimized to a single shr_u (shift right, unsigned) instruction.

Example: avoiding boxing

In languages where most values are passed around as boxed, inlining can eliminate boxing.

A common use case where this happens is FFI: pointers/references obtained from FFI calls need to be wrapped in a struct/class/etc. to make them the same “shape” as the language’s native values.

When you have a function that gets a reference from an FFI call, and pass it around to more FFI calls, inlining these FFI calls can avoid boxing the pointer/reference value.

Somewhat silly example, in Dart:

import 'dart:ffi';

@Native<Pointer<Int64> Function()>()
external Pointer<Int64> intPtr();

@Native<Int64 Function(Pointer<Int64>)>()
external int derefIntPtr(Pointer<Int64> ptr);

void main() {
  Pointer<Int64> ptr = intPtr();
  ptr += 1;
  int i = derefIntPtr(ptr);
  print(i);
}

Relevant parts of the generated code when compiled to Wasm:

(func $main
  ...
  call $intPtr
  struct.get $Pointer $field2
  i32.const 8
  i32.add
  call $ffi.derefIntPtr (import)
  ...)

(func $intPtr (result (ref $Pointer))
  i32.const 71
  i32.const 0
  call $ffi.intPtr (import)
  struct.new $Pointer)

intPtr allocates a struct, which the call site directly unpacks (reads the field). Inlining intPtr eliminates this allocation:

(func $main
  ...
  call $ffi.intPtr (import)
  i32.const 8
  i32.add
  call $ffi.derefIntPtr (import)
  ...)

Example: avoiding polymorphism

When a monomorphic type is passed to a polymorphic function, the polymorphic function can often be inlined to avoid polymorphic access to the monomorphic type.

An example, again in Dart, is Int64List, which is a monomorphic List<int>. It stores the integers unboxed, and when used directly, the integer arguments and return values do not need to be boxed.

When used in a polymorphic site though, the integer elements need to be boxed. Example:

import 'dart:typed_data';

int sum(List<int> list) {
  int ret = 0;
  for (int i = 0; i < list.length; i += 1) {
    ret += list[i];
  }
  return ret;
}

void main() {
  Int64List intList = Int64List.fromList([1, 2, 3, 4]);
  sum(intList);
  sum([1, 2, 3, 4]);
}

Relevant parts of the output when compiled to Wasm:

(func $main
  ;; Allocate `Int64List`, call `sum`:
  ...
  call $sum

  ;; Allocate the other `List<int>`, call `sum`:
  ...
  call $sum)

(func $sum (param $var0 (ref $Object))
  (local $var1 i64)
  (local $var2 i64)
  loop $label0
    ...
    if
      ...
      ;; Virtual call to `operator []`:
      struct.get $Object $field0
      i32.const 747
      i32.add
      call_indirect (param (ref $Object) i64) (result (ref null $#Top))

      ;; The virtual call returns a boxed integer, which we directly unbox:
      ref.cast $BoxedInt
      struct.get $BoxedInt $field1
      i64.add
      ...
    end
  end $label0)

If we inline sum, the loop that iterates the Int64List accesses the unboxed integers directly:

(func $main
  ...
  loop $label1
      ...
      local.get $var3
      local.get $var4
      local.get $var1
      i32.wrap_i64
      ;; Array access is now direct, no boxing.
      array.get $Array<WasmI64>
      i64.add
      local.set $var3
      local.get $var1
      i64.const 1
      i64.add
      local.set $var1
      br $label1
    end
  end $label1)

A similar case is when a monomorphic type is used directly, but via a polymorphic interface. In the Int64List type above, List64List.[] is an override of List<E>.[], and all overrides of a method need to have the same type.¹

So when not inlined, it needs to return a boxed integer:

(func $Int64List.[] (param $var0 (ref $Object)) (param $var1 i64) (result (ref null $#Top))
  ...
  array.get $Array<WasmI64>
  struct.new $BoxedInt)

Similar to the previous example, inlining it eliminates the boxing in monomorphic use sites, as the allocated struct is immediately unboxed.

Example: eliminating closures and indirect calls

In languages without monomorphisation, higher-order function arguments need to be allocated as closures.

When calling a closure, the caller needs to get the function pointer from the closure and call the function via the pointer.

These closure allocations and indirect function calls can be too slow in hot loops.

An example of this is in Dart protobuf library. In protobufs, packed fields are encoded as a length prefix followed by the elements. In the Dart library, we decode these fields using this helper:

void _readPacked(CodedBufferReader input, void Function() readFunc) {
  input._withLimit(input.readInt32(), () {
    while (!input.isAtEnd()) {
      readFunc();
    }
  });
}

Here the caller of _readPacked passes readFunc as a closure (allocation). _readPacked then allocates another closure, to be passed to _withLimit.

That’s two closure allocations for every packed field in the input. Calling these closures are slow too, because they’re indirect.

Inlining _withLimit in _readPacked, and _readPacked in its call sites (in _mergeFromCodedBufferReader) eliminates closure allocations, and calls to the closures become direct function calls. This improves packed field decoding significantly.

A trick for effective inlining: outlining

Consider Int64List.[] again. The implementation needs to check that the index is in bounds of the array, and throw an exception if not: (slightly simplified)

class Int64List ... {
  ...

  @override
  int operator [](int index) {
    if (length.leU(index)) { // shorthand for `index < 0 || index >= length`
      ... // throw exception
    }
    return _data.read(index);
  }
}

To avoid boxing when calling this function we want to always inline it, but if we’re not careful with the error throwing code path (the if body above), the function can get quite large, and when inlined the binary can bloat up significantly.

Ideally we want to inline the happy path that can lead to improvements when inlined, and leave the error path separate in a function, so that we can have the benefits of inlining without adding to the binary size too much.

This can be done by moving the error handling code to a separate function, and making sure that separate function is never inlined (ideally with an annotation to the compiler). In the example above, this may look like:

class Int64List ... {
  @override
  @pragma('inline')
  int operator [](int index) {
    if (length.leU(index)) { // shorthand for `index < 0 || index >= length`
      fail(index, length);
    }
    return _data.read(index);
  }
}

@pragma('never-inline')
void fail(int index, int length) { ... }

With this it doesn’t matter how large the error handling code is, because we never inline it.

This way of separating inlineable parts of a function from the parts we don’t want to inline is sometimes called “outlining” or “partial inlining”. We can do it manually (as in the example above), but it can also be done by a compiler based on heuristics.

An example transformation to this is GHC’s worker/wrapper transformation, which splits a function into parts that (1) handle unboxing and boxing before calling the function’s body, and (2) the function’s body that work on the unboxed representations of the arguments. (1) is then inlined to avoid redundant boxing of the arguments and return values.

Another example is wasm-opt which does partial inlining.

A tip for effective inlining: avoid inlining in slow paths

In the protobuf decoding loop that we mentioned above, we have a big type switch to determine how to decode a field. It looks like this:

switch (fieldType) {
  case PbFieldType._OPTIONAL_BOOL:
    fs._setFieldUnchecked(meta, fi, input.readBool());
    break;
  ...
  case PbFieldType._REPEATED_UINT64:
    final list = fs._ensureRepeatedField(meta, fi);
    if (wireType == WIRETYPE_LENGTH_DELIMITED) {
      _readPacked(input, () => list.add(input.readUint64()));
    } else {
      list.add(input.readUint64());
    }
    break;
  ...
  // 37 cases in total. 
}

Assume that there’s a function that throws an exception (e.g. _ensureRepeatedField above), and it’s used in only once, in one of the cases of the switch.

Because the function is only used once, it may look like inlining it makes sense, as that would eliminate function call overhead, shave at least a few bytes off the binary, and potentially allow optimizations in the call site.

However since this switch is in a hot loop, and the inlined code branch is taken very rarely (unless the application is receiving a lot of malformed input), inlining this slow path can make the instruction cache usage much worse and slow down the decoder.

This kind of inlining can commonly happen when optimizing for size, e.g. with wasm-opt -Os. Because inlining single-use functions reduce binary sizes, optimization modes that aim to make the final binaries smaller can inline slow-path error throwing functions.

If we really want to inline a slow-path function, a way to avoid making instruction cache usage worse is to move the slow-path code to the end of the function, away from the common code paths.

This is often done with branch prediction hints, such as clang’s __builtin_expect. When a branch is annotated as “not likely to be taken”, the compiler can move the branch target (the basic blocks) to the end of the current function, away from the hot code. This gives us the binary size benefits of inlining single-use functions, without filling the instruction cache with instructions that will never be executed.

Remarks

The main reason why the examples above are mostly in Dart is because I’ve been spending most of my time this year optimizing Dart’s Wasm backend’s standard library implementation, so the examples are still fresh in my memory.

The principles apply to most languages: inlining a function makes any information about the arguments available to the function’s body, and any information on its return value to the call site.

A big part of optimizing high-level statically-typed languages is about avoiding polymorphism, boxing, and redundant error checks. I’m not aware of any cases where inlining a function in a high-level language, when it doesn’t result in improving one of these, is worthwhile.

In a lower-level language with monomorphisation and control over allocations (e.g. Rust, C++), monomorphisation eliminates polymorphism in compile time, boxing is explicit, and redundant checks can be avoided by using unchecked (unsafe) functions.

In these cases where programs are often much faster by default, inlining to avoid stack/register shuffling and simplify control flow can make a difference.

One example of simplified control flow making a big difference can be seen in the previous post, where implementing a recursive parsing function in a non-recursive way improved performance by 22%.

Updates

2024-12-07: Added link to wasm-opt in partial inlining section.
2025-02-12: Added a higher-order function example.
2025-02-14: Added a section about inlining in slow paths.

More accurately, a method that can be called in a polymorphic call site needs to have a type that is a subtype of the least-upper-bound of the types of all functions that can be called in the polymorphic call sites.

Or more briefly, all methods in an override group need to have the same type.↩︎

(Show comments)