SieveTables
Lately, I’ve been looking at some Rust. In that context, I got nerd-sniped by an interesting language-design issue: rust-lang#2035. I tried to propose a solution that I don’t think I’ve seen yet in other languages. I dubbed it sievetables.
Code is on GitHub (these are benchmarks, not production-quality).
Intro to Rust traits #
Rust has “traits”, which for the purpose of this discussion can be thought of as C++ pure virtual classes or Java interfaces (more details).
// This snippet defines three traits: First, Middle and Last.
trait First { fn first_name(&self) -> String; }
trait Middle { fn middle_name(&self) -> String; }
trait Last { fn last_name(&self) -> String; }
// Now we can define a person holding the names.
struct Person {
first: String, middle: String, last: String
}
// And finally, implement the interface.
impl First for Person { /*···*/ }
impl Middle for Person { /*···*/ }
impl Last for Person { /*···*/ }
A Rust function can specify its argument’s types to be traits. It can then accept any compatible object:
fn last_uppercase(x: &dyn Last) -> String {
x.last_name().to_uppercase()
}
We must use the dyn keyword to tell Rust to generate a single function that can take multiple types. & means the function takes a reference (like a pointer, but safer, as the referent is guaranteed to outlive the function).
So far so good. We can call for example:
let alice = Person{ /*···*/ };
last_uppercase(&alice);
Under the hood, our &alice is really a pointer. That pointer is called a “fat pointer” because it is larger than a normal pointer. It contains:
- The address of the object (64 bits on x86-64, but 32 bits in the example).
- The address of the vtable (likewise).
C++, in comparison, repeats the vtable pointers at the beginning of each object, which is less memory-efficient but does simplify interoperability¹.
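The two-word layout is easy to verify in plain Rust. A quick standalone sketch (the names here are illustrative, not from the post’s benchmark repo; the byte values in the comments assume a 64-bit target, but the word arithmetic holds everywhere):

```rust
trait Last { fn last_name(&self) -> String; }

struct Person { last: String }

impl Last for Person {
    fn last_name(&self) -> String { self.last.clone() }
}

fn main() {
    let alice = Person { last: "Liddell".to_string() };
    let thin: &Person = &alice;  // plain reference: one word
    let fat: &dyn Last = &alice; // trait object: data pointer + vtable pointer

    let word = std::mem::size_of::<usize>();
    assert_eq!(std::mem::size_of_val(&thin), word);    // 8 bytes on x86-64
    assert_eq!(std::mem::size_of_val(&fat), 2 * word); // 16 bytes on x86-64
    println!("{}", fat.last_name().to_uppercase());    // dispatch via the vtable
}
```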
What’s rust-lang#2035 about? #
Now, we’d like to do this:
fn full_name(x: &dyn First+Last+Middle) -> String {
format!("{} {}", first_two_names(x), x.last_name())
}
fn first_two_names(x: &dyn First+Middle) -> String {
format!("{} {}", x.first_name(), x.middle_name())
}
full_name(&alice);
First, full_name needs x to be a pointer that has the methods of all three traits. Then it needs to somehow change x to have fewer traits, in order to call first_two_names.
This can’t work with our previous fat pointer, because it contained only one vtable. If that vtable contained all the methods, it wouldn’t be suitable for first_two_names. If it contained only the methods of the first two traits, it wouldn’t be suitable for full_name.
One way to make it work would be to give full_name a “super-fat” pointer, with three vtable pointers. It could then pop the last one to call first_two_names.
Problem is, with just three traits our super-fat pointer is already 16 bytes wide. On a 64-bit system, it’d be 32 bytes wide. This impacts performance, and having variable-sized pointers further increases complexity.
So super-fat pointers are seeing pushback on the issue. Can we do better?
How does it work in other languages? #
In Golang, the following works:
type First interface { first_name() string }
type Middle interface { middle_name() string }
type Last interface { last_name() string }
type FM interface { First ; Middle }
type FML interface { First ; Middle ; Last }
func full_name(x FML) string {
return fmt.Sprintf("%s %s", first_two_names(x), x.last_name())
}
func first_two_names(x FM) string {
return fmt.Sprintf("%s %s", x.first_name(), x.middle_name())
}
How? Under the hood, Go dynamically allocates vtables in a hash table.
Convenient. However, building vtables at runtime requires dynamic memory allocation. Go thus forfeits the ability to run on systems such as kernels and microcontrollers, which can’t freely allocate from a heap.²
For Rust, this tradeoff wouldn’t be acceptable. So we must come up with something new.
Proposed solution #
Unoptimised format overview #
Instead of pointing to a vtable, the pointer points to a sievetable. A sievetable contains the functions for all traits, in segments that are all the same size. In the example above, segment_size is 1. When different traits have different numbers of methods, segment_size is the maximum over all traits.
A pointer to a sievetable is called a sievepointer. Sievetables have a 256-byte alignment because the low 8 bits of the sievepointer have a special meaning: we use them to encode which of the traits are relevant for the function.
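To make the layout concrete, here is a toy sievetable for Person with segment_size = 1, built by hand. Everything here (the struct, the 256-byte alignment freeing 8 tag bits, the fixed trait order First, Middle, Last) is my own sketch of the idea, not the design from the issue:

```rust
struct Person { first: String, middle: String, last: String }

fn first_name(p: &Person) -> String { p.first.clone() }
fn middle_name(p: &Person) -> String { p.middle.clone() }
fn last_name(p: &Person) -> String { p.last.clone() }

// One segment per trait, in a fixed global order.
// align(256) keeps the low 8 bits of any pointer to the table at zero,
// so they are free to hold the "which traits are relevant" mask.
#[repr(C, align(256))]
struct SieveTable {
    segments: [fn(&Person) -> String; 3],
}

static TABLE: SieveTable = SieveTable {
    segments: [first_name, middle_name, last_name],
};

fn main() {
    let addr = &TABLE as *const SieveTable as usize;
    assert_eq!(addr & 0xFF, 0); // alignment guarantees the tag bits are free
    let sievepointer = addr | 0b111; // tag: all three traits relevant
    println!("mask = {:#05b}", sievepointer & 0xFF);
}
```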
When calling full_name, all three traits are available:
first_two_names expects only two traits. When calling first_two_names, the last trait is discarded:
We can discard any trait, not just the first or last one. In this example, we need to discard the Middle trait:
fn first_and_last_names(x: &dyn First+Last) -> String {
format!("{} {}", x.first_name(), x.last_name())
}
At the call site #
fn full_name(x: &dyn First+Last+Middle) -> String {
let ptr_first_plus_middle = remove_trait!(x, 3);
let last = ptr_from_sieveptr!(Last, x, 3);
format!("{} {}", first_two_names(ptr_first_plus_middle), last.last_name())
}
At compile time, each function knows how many traits it expects. It can’t, however, know in advance how many traits are present in the sievetable.
There are two operations the compiler needs to perform:
- ptr_from_sieveptr: in order to call a function on the n-th trait, it needs to find the position of the n-th set bit, multiply it by segment_size, and use that as an offset into the sievetable.
- remove_trait: in order to call a function that expects fewer traits, it needs to clear some bits from the sievepointer.
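Both operations come down to a few bit tricks. A portable sketch, assuming the low 8 bits of the sievepointer hold one presence bit per trait (function names and the segment-size choice are mine, for illustration):

```rust
const SIEVE_MASK: usize = 0xFF;

/// Address of the n-th relevant trait's segment (n counts set bits from 0).
fn ptr_from_sieveptr(sieveptr: usize, n: u32, segment_size: usize) -> usize {
    let mut sieve = sieveptr & SIEVE_MASK;
    for _ in 0..n {
        sieve &= sieve - 1; // clear the lowest set bit, n times
    }
    let position = sieve.trailing_zeros() as usize; // index of the n-th set bit
    let table = sieveptr & !SIEVE_MASK; // strip the tag to recover the table address
    table + position * segment_size
}

/// Drop the n-th relevant trait, so the callee sees one trait fewer.
fn remove_trait(sieveptr: usize, n: u32) -> usize {
    let mut sieve = sieveptr & SIEVE_MASK;
    for _ in 0..n {
        sieve &= sieve - 1;
    }
    let bit = sieve & sieve.wrapping_neg(); // isolate the n-th set bit
    sieveptr & !bit
}

fn main() {
    let table = 0x1000; // pretend sievetable address (256-byte aligned)
    let sieveptr = table | 0b111; // three traits: First, Middle, Last

    // Segment of the third trait (n = 2), with 8-byte segments:
    assert_eq!(ptr_from_sieveptr(sieveptr, 2, 8), table + 2 * 8);

    // Discard the Middle trait (n = 1): mask 0b111 becomes 0b101.
    assert_eq!(remove_trait(sieveptr, 1) & SIEVE_MASK, 0b101);
    println!("ok");
}
```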
Performance #
Calls to ptr_from_sieveptr and remove_trait will degrade performance. By how much, exactly?
To set the landscape, calling a function at an address in RAM “costs” on the order of 20-40 cycles (about 5 for the call itself and then we’re waiting for RAM).
ptr_from_sieveptr is the slowest. It needs to:
- find the n-th set bit (3-5 cycles)
- multiply by segment_size (about 3 cycles)
I ran some benchmarks using a recursive function (unmemoised recursive fibonacci) that calls through the v-table for each of its recursions.
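The benchmark kernel looks roughly like this, shown here with today’s single-trait dyn dispatch (a simplified stand-in, not reproduced verbatim from the benchmark repo):

```rust
trait Fib {
    fn fib(&self, n: u64) -> u64;
}

struct Naive;

impl Fib for Naive {
    fn fib(&self, n: u64) -> u64 {
        if n < 2 {
            n
        } else {
            // Re-borrow as a trait object so each recursion is written as a
            // vtable call (the optimiser may still devirtualise this toy version).
            let this: &dyn Fib = self;
            this.fib(n - 1) + this.fib(n - 2)
        }
    }
}

fn main() {
    let f: &dyn Fib = &Naive;
    assert_eq!(f.fib(10), 55);
    println!("fib(10) = {}", f.fib(10));
}
```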
| | Size | 1 trait | 2 traits | 3 traits | 4 traits | 5 traits |
|---|---|---|---|---|---|---|
| v-table pointers | 2 words | 402 ns | - | - | - | - |
| sievepointers (PackedSieve) | 2 words | - | 464 ns | 488 ns | 482 ns | 495 ns |
| Extra-fat pointers | N+1 words | - | 472 ns | 753 ns | 764 ns | 831 ns |
Can we do better? #
Yes! What I described above is just the portable version.
For example, if we can rely on having larger pointers and PDEP (as is the case on x86-64), we can avoid the segment_size multiplication by encoding the method offset directly in the sieve. This doesn’t always perform better, but performance then no longer depends on the number of traits:
| | Size | 1 trait | 2 traits | 3 traits | 4 traits | 5 traits |
|---|---|---|---|---|---|---|
| sievepointers (InlineSieve) | 2 words | - | 510 ns | 510 ns | 510 ns | 510 ns |
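The inline-offset idea can be sketched portably, without PDEP: each relevant trait gets a fixed-width field in the tag holding its segment offset directly, so the lookup is a shift-and-mask with no bit counting and no multiplication. The 4-bit field width and the layout below are my assumptions, not the benchmarked InlineSieve format:

```rust
/// Offset (in table slots) of the n-th trait: just read its 4-bit field.
fn inline_offset(sieve: u64, n: u32) -> u64 {
    (sieve >> (4 * n)) & 0xF
}

/// Drop the n-th trait's field; the fields above it slide down.
/// (This compaction step is where PDEP/PEXT would help on x86-64.)
fn remove_trait_inline(sieve: u64, n: u32) -> u64 {
    let low = (1u64 << (4 * n)) - 1;
    (sieve & low) | ((sieve >> 4) & !low)
}

fn main() {
    // Three traits with segment offsets 0, 3 and 5 (arbitrary example).
    let sieve = 0x530;
    assert_eq!(inline_offset(sieve, 1), 3);

    // Discard the middle trait: the fields become [0, 5].
    let smaller = remove_trait_inline(sieve, 1);
    assert_eq!(inline_offset(smaller, 1), 5);
    println!("ok");
}
```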
Also, the solution above limits trait sums to 8 traits. I don’t expect that to be a problem. But if it were, there are solutions: vtables could be mapped into a specific memory region, so that we get more free bits to bitpack our sieve. We could also use these extra bits to put the segment_size in the sieve itself, saving a memory access.
Furthermore, mapping sievetables into a specific memory region could give us 40-bit sieves on 64-bit machines, so we could have 40 segments. With 40 segments, it could become reasonable to have a fixed segment_size (larger traits would be subdivided) and save on the multiplication.
This being said, I think I’ve spent enough time exploring this issue :) I’m leaving this on the Internet in case someone later tries to solve a similar problem.
FFI-friendly vtables can be formed in Rust, e.g. with the vtable crate. ↩︎
There is TinyGo, which does support microcontrollers and WebAssembly, but at the time of writing doesn’t support calling methods on interfaces. Actually, I wonder if the technique discussed here could be used there. ↩︎