SieveTables

This post starts by explaining what are Rust traits, and how they are implemented. Feel free to skip to issue description if you are familiar with the Rust language. Or you can even skip to proposed solution if you are also familiar with the issue being discussed here.

Lately, I’ve been looking at some Rust. In that context, I got nerd-snipped by an interesting language-design issue: rust-lang#2035.
I tried to propose a solution. I don’t think I have seen yet in other languages. I dubbed it sievetables.

Code is on GitHub (this are benchmarks, not production-quality).

Intro to Rust traits #

Rust has “traits” which for the purpose of the discussion can be thought of as C++ pure virtual classes, or Java interfaces ( more details).

// This snippet defines three traits: First, Middle and Last.
trait First { fn first_name(&self) -> String; }
trait Middle { fn middle_name(&self) -> String; }
trait Last { fn last_name(&self) -> String; }
// Now we can define a person holding the names.
struct Person {
    first: String, middle: String, last: String
}
// And finally, implement the interface.
impl First for Person { /*···*/ }
impl Middle for Person { /*···*/ }
impl Last for Person { /*···*/ }

A Rust function can specify its argument’s types to be traits. It can then accept any compatible object:

fn last_uppercase(x: &dyn Last) -> String {
    x.last_name().to_uppercase()
}

We must use the dyn keyword to tell Rust to generate a single function that can take multiple types. & means the function is taking a reference (like a pointer, but safer as it is guaranteed to outlive the function).

So far so good. We can call for example:

let alice = Person{ /*···*/ };
last_uppercase(&alice);

Under the hood, our &alice is really a pointer. That pointer is called a “fat pointer” because it is larger than a normal pointer. It contains:

The address of the object (64 bit on x86-64, but 32 bits in the example).
The address of the vtable (likewise).

C++ in comparison repeats the vtable pointers at the beginning of each object, which is less memory-efficient but does simplify interoperability¹.

What’s rust-lang#2035 about? #

Now , we’d like to do this:

fn full_name(x: &dyn First+Last+Middle) -> String {
    format("{} {}", first_two_names(x), x.last_name())
}
fn first_two_names(x: &dyn First+Middle) -> String {
    format("{} {}", x.first_name(), x.middle_name())
}
full_name(&alice);

First full_name needs x to be a pointer that has the methods of all traits. Then it needs to somehow change x to have fewer traits, in order to call first_two_names.

This can’t work with our previous fat pointer, because it contained only one vtable. If that vtable contained all methods, it wouldn’t be suitable for first_two_names. If it contained only the first two methods, it wouldn’t be suitable for full_name.

One way to make it work would be to give full_name a “super-fat” pointer, with three vtables. It could then pop the last vtable to call first_two_names.

Problem is, with just tree traits our super-fat pointer is already 16 bytes wide. On a 64 bit system, it’d be 32 bytes wide. This impact performance, And having variable-sized pointers further increases complexity.

So super-fat pointers are seeing push back on the issue. Can we do better?

How does it work in other languages? #

In Golang, the following works:

type First interface { first_name() string }
type Middle interface { middle_name() string }
type Last interface { last_name() string }
type FM interface { First ; Last }
type FML interface { First ; Middle ; Last }
func full_name(x FML) string {
	return fmt.Sprintf("%s %s", first_two_names(x), x.last_name())
}
func first_two_names(x FM) string {
    return fmt.Sprintf("%s %s", x.first_name(), x.middle_name())
}

How? Under the hood, Go dynamically allocates vtables in a hash table.

Convenient. However, this dynamic allocation of vtables requires dynamic memory allocation. Thus Go forfeits the ability to run on systems such as kernels and microncontrollers, which can’t allocate heap freely.²

For Rust, this tradeoff wouldn’t be acceptable.
So we must come up with something new.

Proposed solution #

Unoptimised format overview #

Instead of pointing to a vtable, the pointer points to a sievetable. A sievetable contains the functions for all traits, in segments that are all the same size. In the example above, segment_size is 1. When different traits have different numbers of methods, segment_size is the maximum of all traits.

A pointer to a sievetable is called a sievepointer.
sievetables have a 8 byte alignment because the last 8 bits of the sievepointer have a special meaning: We use them to encode which of the traits are relevant for the function.

The positive of N-th sieve bit with value 1 is the N-th trait the function expects.

When calling full_name, all three traits are available:

first_two_names expects only two traits. When calling first_two_names, the last trait is discarded:

We can discard any trait, not just the first or last one. In this example, we need to discard the Middle trait:

func first_and_last_names(x: &dyn First+Last) -> String {
    fmt!("{} {}", x.first_name(), x.last_name())
}

At the call site #

fn full_name(x: &dyn First+Last+Middle) -> String {
    let ptr_first_plus_middle = remove_trait!(x, 3);
    let last = ptr_from_sieveptr!(Last, x, 3);
    format("{} {}", first_two_names(ptr_first_plus_middle), last.last_name());
}

At compile time, each function knows how many traits it expects. It can’t however know in advance how many traits are present in the sievetable.

There are two operations the compiler needs to perform:

ptr_from_sieveptr: In order to call a function on the n-th trait, it needs to find the position of the n-th unset bit, multiply by segment_size, and to use that as an offset in the sievetable.
remove_trait: In order to call a function that expects fewer traits, it needs to clear some bits from the sievepointer.

Performance #

Calls to ptr_from_sieveptr and remove_trait will degrade performances. By how much exactly?

To set the landscape, calling a function at an address in RAM “costs” on the order of 20-40 cycles (about 5 for the call itself and then we’re waiting for RAM).

ptr_from_sieveptr is the slowest. It needs to:

find the n-th set bit (3-5 cycles)
multiply by segment_size (about 3 cycles)

I ran some benchmarks using a recursive function (unmemoised recursive fibonacci) that calls through the v-table for each of its recursions.

	Size	1 trait	2 traits	3 traits	4 traits	5 traits
v-table pointers	2 words	402ns	-	-	-	-
sievepointers (`PackedSieve`)	2 words	-	464 ns	488 ns	482 ns	495 ns
Extra-fat pointers	N+1 words	-	472 ns	753 ns	764 ns	831 ns

Can we do better? #

Yes! What I described above is just what’s portable.
For example, if we can rely on having larger pointers and PDEP (as is the case on X86-64), we can avoid the strip multiplication by encoding the method offset directly in the sieve. This doesn’t always perform beter, but performance then no longer depends on the number of traits:

	Size	1 trait	2 traits	3 traits	4 traits	5 traits
sievepointers (`InlineSieve`)	2 words	-	510 ns	510 ns	510 ns	510 ns

Also, the solution above limits sums to 8 traits. I don’t expect that to be a problem.
But if it it were, there are solutions. v-tables could be mapped in a specific memory region, such that we get more bits to bitpack our sieve. We could also these extra bits to put the segment_size in the sieve, saving a memory access.

Furthermore, mapping sievetable in a specific memory region could give us 40 bits sieves on 64-bit machines. So we could have 40 segments. With 40 segments, it could become reasonable to have a fixed segment_size (larger traits would be sudivided) and save on the multiplication.

This being said, I think I spent enough time exploring this issue :) I’m leaving this on the Internet in case someone later tries to solve a similar problem.

FFI-friendly vtables can be formed in rust, eg with the vtable crate. ↩︎
There is tinygo, which does support microntrollers and webassembly, but at time of writing doesn’t support calling methods on interfaces. Actually, I wonder if the technique discussed here could be used there. ↩︎