I've been using this recently in WASM, in particular for the counters feature. It's really great and makes it super easy to track the evolution of your app's memory usage!

In my console, I have something akin to this:

  TRACE client_wasm::plugins::allocation: Memory stats counters=Counters { allocation_count: 165454, total_allocation_count: 18756119, allocated_bytes: 34654828, total_allocated_bytes: 3185258585, available_bytes: 82802636, fragment_count: 5026, heap_count: 1, total_heap_count: 1, claimed_bytes: 118423552, total_claimed_bytes: 118423552 }
I haven't carefully benchmarked dlmalloc (Rust's default WASM allocator, https://github.com/alexcrichton/dlmalloc-rs), but it's nothing special (to my knowledge). The swap to Talc is pretty trivial and it's clear that the author is paying attention to its performance.
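
For reference, the swap is basically just declaring Talc as the global allocator. A sketch along the lines of the crate's README (exact type and method names may differ between versions, and the arena size here is arbitrary):

  use talc::*;

  // Static arena backing the allocator; sized arbitrarily for illustration.
  // Talc can also claim more memory on demand via its OOM handler system.
  static mut ARENA: [u8; 1 << 20] = [0; 1 << 20];

  #[global_allocator]
  static ALLOCATOR: Talck<spin::Mutex<()>, ClaimOnOom> = Talc::new(unsafe {
      ClaimOnOom::new(Span::from_const_array(core::ptr::addr_of!(ARENA)))
  }).lock();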
[Author of talc] Glad this feature is proving useful. Seeing this makes me think I should implement a better-looking Display implementation than the default Debug impl though. Something for the next update ^-^
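
Something along these lines, maybe (a hypothetical sketch using the fields from the log above, not actual code from the crate):

  use core::fmt;

  // Hypothetical hand-rolled Display for the counters shown above.
  impl fmt::Display for Counters {
      fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
          writeln!(f, "allocations: {} live / {} total",
              self.allocation_count, self.total_allocation_count)?;
          writeln!(f, "allocated:   {} B live / {} B total",
              self.allocated_bytes, self.total_allocated_bytes)?;
          writeln!(f, "available:   {} B across {} fragments",
              self.available_bytes, self.fragment_count)?;
          write!(f, "claimed:     {} B over {} heap(s)",
              self.claimed_bytes, self.heap_count)
      }
  }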
That's a pretty big heap size for a wasm bundle! What are you doing in wasm that allocates so much and so often?
34 MB? Doesn't seem that big to me.
I was confused too at first, because I was looking at total_allocated_bytes, but I guess that includes allocations that have since been freed.
Opened an issue [1] to add TLSF to the benchmarks, as it's likely to be faster in a single-threaded environment according to the rlsf crate [2].

[1] https://github.com/SFBdragon/talc/issues/26 [2] https://github.com/yvt/rlsf

Thanks for opening the issue. The allocator looks pretty interesting. Happy to try adding it to the benchmarks, although doing apples-to-apples tests with its limitations might not be possible without some changes.
Some extra context for comparison: Talc is faster than Frusa when there is no contention, but slower when there are concurrent allocations. Both are much slower than Rust's system allocator. Benchmark here: https://crates.io/crates/frusa.
Your results caught me off guard. In particular, the (Linux) system allocator seems suspiciously fast. I think the simplicity of the benchmark (allocating and immediately deallocating) might be causing issues... perhaps unwanted optimizations? I'm not sure.
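
The shape I have in mind is roughly this (a hypothetical sketch, not Frusa's actual benchmark code):

  use std::hint::black_box;

  // Each iteration allocates and immediately frees the same size, so the
  // allocator keeps serving the same block from its hottest fast path, and
  // without black_box the compiler may elide the alloc/dealloc pair entirely.
  fn bench_alloc_dealloc(iters: usize, size: usize) {
      for _ in 0..iters {
          let v: Vec<u8> = Vec::with_capacity(black_box(size));
          black_box(&v);
          // v dropped here: immediate deallocation
      }
  }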

On my random-actions benchmarks (these arguably resemble real allocation patterns somewhat better):

- 1 thread: Talc is faster than Frusa and System, Frusa is comparable to System

- 4 threads: System is fastest, Frusa does about half as well, and Talc about half as well as Frusa

Our benchmarks agree on the Frusa vs Talc comparison.

Benchmarks aside, Frusa seems neat. In particular, I had some misconceptions about how to tackle concurrency in Talc which Frusa's code demonstrates not to be true. I may give writing a concurrent version of Talc another shot soon.

Apologies, the benchmark is fine. The reason the system allocator is faster than I expected is that Linux's slab allocator takes over for especially small allocation sizes, and it's terrifically fast.

I'm changing up my random-actions benchmark to display results over various allocation sizes, as some allocators do much better than others at different sizes. As a heads up, Frusa takes a large hit at higher allocation sizes. Perhaps tuning bucket sizes or something could help? I'll try to have the benchmarks on GitHub this weekend so you can play around with them, if you'd like to investigate.

As a guy who lives in the JVM most days and mostly ignores allocation optimizations: what are some examples of things that are actually new in a project like this? Isn't allocation a mostly solved problem? Does something about WebAssembly or no_std actually require different features?
In a no_std Rust environment there is no default allocator and no heap, so you have to bring your own allocator to use things like `Vec` or `String`.

This is very common in embedded contexts, where you can take nothing for granted.

What's the point of using `no_std` if you're just going to add an allocator anyway? You may as well just use `std` at that point no? (You can use `std` on embedded devices with a small amount of work.)
To get std you need some kind of libc replacement. You don't need that if you just want to use alloc.
Yeah, just adding: in Rust there are three main levels:

1. no_std (like no libc + no malloc in C)

2. no_std + alloc (like no libc + malloc in C)

3. std (like a full libc + malloc in C)

The difference between 1 and 2 is like three lines. The difference between 2 and 3 is a change to the entire standard library. ATM only ESP32 devices support option 3 (they build a standard library implementation on top of FreeRTOS/ESP-IDF).
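
Those "three lines" are roughly the following (a sketch; the allocator is illustrative, e.g. the embedded-alloc crate shown here, or talc itself, and embedded-alloc still needs its arena initialized at startup):

  extern crate alloc; // opt in to Vec, String, Box, etc. under no_std

  // Any GlobalAlloc implementation will do here; embedded_alloc is just
  // one common choice on embedded targets.
  #[global_allocator]
  static HEAP: embedded_alloc::Heap = embedded_alloc::Heap::empty();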

The `core` library is still available on `no_std` and contains a lot of useful stuff, so it’s not exactly like no libc on C. That would be `no_core` which is pretty hardcore (heh). The big things missing in `no_std` are

* file and other I/O, including filesystem ops

* access to system time

* threads

* collections and some other things that require an allocator (not many things actually do in Rust’s stdlib!)

* floating-point functions (the types themselves and builtin operators work fine)

`alloc` gives you `Vec`, `String`, `Box`, `BTreeMap/Set`, ref-counted pointers, and a few `Vec`-derived collections like `VecDeque`. Very annoyingly not `HashMap/Set` though, due to a (literally single-line) dependence on a system entropy source that happens not to be easily factorable out, because reasons.
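
That said, `std`'s `HashMap` is itself a thin wrapper around the hashbrown crate, which does work under `no_std` + `alloc` if you accept a non-random default hasher (sketch below; feature and hasher details vary by hashbrown version):

  extern crate alloc;

  // hashbrown is the implementation behind std's HashMap; its default
  // hasher feature sidesteps the std entropy dependency, at some
  // DoS-resistance cost on no_std.
  use hashbrown::HashMap;

  fn demo() {
      let mut map: HashMap<&str, u32> = HashMap::new();
      map.insert("talc", 1);
      assert_eq!(map.get("talc"), Some(&1));
  }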

You don't. It currently requires `-Z build-std=std,panic_abort` and some nightly features (e.g. `#![feature(restricted_std)]`), but you can build `std` programs for bare-metal targets. I can't remember exactly what it does if you try to open files or start threads (probably panics?), but you can compile and run it. If you don't do any of those things, it works fine.

Currently the `sys` crate implementation is hard-coded into the compiler, but eventually you will be able to provide it without modifying the compiler, so you can e.g. target an RTOS.

It looks like that work started really recently actually:

https://github.com/rust-lang/rust/commit/99128b7e45f8b95d962...

Maybe you're implementing your own "std".
Small binary size (and thus a simpler implementation) is among the main things you want for wasm and embedded targets. For wasm this is because it's being sent over the network; for embedded, because the device may have very little RAM and storage available.

It's also the case that such targets often can't take advantage of the advanced features (like multithreading optimisations) that "full fat" allocators provide.

An allocator and deallocator go hand in hand. For the JVM, most of the GCs are compacting, so they can use bump allocators; in that case, yes, it is a solved problem. However, it depends on the allocator/deallocator pair, and traditional malloc/free implementations involve many trade-offs, so allocation work continues.
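
To make the bump-allocator point concrete: allocation under a compacting GC is essentially a pointer increment (minimal single-threaded sketch; freeing is a no-op because the GC reclaims the whole region by compacting):

  // Minimal bump allocator sketch: why allocation under a compacting GC
  // is nearly free. There is no per-object free; the region is reclaimed
  // (or compacted) all at once.
  struct Bump {
      next: usize, // current bump pointer (as an address)
      end: usize,  // end of the region
  }

  impl Bump {
      // `align` must be a nonzero power of two.
      fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
          let start = self.next.checked_add(align - 1)? & !(align - 1); // align up
          let new_next = start.checked_add(size)?;
          if new_next > self.end {
              return None; // out of space: time to collect/compact
          }
          self.next = new_next;
          Some(start) // address of the newly allocated object
      }
  }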
It's only a solved problem if you can live with the downsides of automatic memory management. For instance: do you know exactly how much time your garbage collector spends on memory management and when that time is spent, where objects are located in memory relative to each other, and how much time is wasted on cache misses when accessing those objects? If those questions aren't important, then automatic memory management is good enough. For other applications they may matter, and automatic memory management is then usually harder to optimise than coming up with specialised manual allocation strategies.
I also live with GC (mostly .NET and Go) but recently took a look at Zig.

Zig treating allocators as totally opaque interfaces is interesting, and you get to learn a lot if you dig deeper. Tons of different designs: allocators on top of allocators, tiny ones for small bundles, non-deallocating ones for one-shot apps. Stuff you never care about in a GC environment.

I suggest checking it out if it interests you.

I’d say it’s solved if you want a garbage collector. If you don’t, there’s plenty of room for innovation.
Why isn't it benchmarked against the vanilla glibc allocator? Is it similar enough to dlmalloc that the differences don't matter?
Because the glibc allocator is designed for hosted systems with threading (it uses pthreads) and memory-management facilities not found on bare metal or other smaller platforms. You shouldn't be using Talc where MiMalloc, Jemalloc, the glibc allocator, etc. would be used instead, besides some very particular situations. (Correct me if I'm wrong.)

I could add these benchmarks. They were there at one point in the past, but it's a disingenuous comparison unless the reader understands the particulars of the workload and of the trade-offs each allocator makes. Talc will probably beat these allocators in single-threaded allocation, but it will suffer under heavily multithreaded loads and does not currently have the system integration to release unused blocks of memory back to the OS (this can be achieved, to a degree, via the OOM handler system, but I haven't implemented anything like that yet), nor will it make syscalls like mmap/sbrk at all.

There is the case where you'd want a faster single-threaded allocation pool within a larger application, though, which is an argument for using Talc even when you have access to the system allocator or mimalloc/jemalloc. Perhaps I'll set up something for that.
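
For the curious, the OOM handler system mentioned above is a trait you implement to try to recover memory when an allocation fails. Very roughly (a hypothetical sketch from memory, not the crate's exact API; names and signatures may differ between versions):

  use core::alloc::Layout;
  use talc::{OomHandler, Span, Talc};

  // Hypothetical: hand the allocator a spare static arena on OOM. A real
  // handler must avoid claiming the same span twice, and might instead
  // grow WASM memory or mmap a new region.
  struct ClaimSpareArena;

  static mut SPARE: [u8; 1 << 20] = [0; 1 << 20];

  impl OomHandler for ClaimSpareArena {
      fn handle_oom(talc: &mut Talc<Self>, _layout: Layout) -> Result<(), ()> {
          // claim() gives the allocator a new span to manage; returning
          // Ok(()) tells talc to retry the failed allocation.
          unsafe { talc.claim(Span::from_const_array(core::ptr::addr_of!(SPARE))) }
              .map(|_| ())
      }
  }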

Pardon my ignorance, but is this something that a language with a GC could potentially use to run in a WebAssembly environment?
A GC language would be better served by Wasm-GC.
