Case Studies

I'm comparing the performance of various ways to allocate memory against plain malloc.

In order to gather some data on this, two case studies were done.

  1. Specialised benchmarks - Some benchmark code designed to exhibit the best case for replacement of malloc with an alternative
  2. cURL benchmarks - A real world example from cURL where malloc was replaced with what the specialised benchmarks call dynamic and external allocation

Specialised Benchmarks

There are two benchmarks; the source for both is in the project repo.

  1. The sort benchmark is light on allocations, and performs the following a fixed number of times (10-20 thousand times per allocation type); a rough sketch of the per-iteration work is given after this list.
    It performs a single allocation, then copies input data (a struct containing a random int and a sequential ID that is populated but never used) into that buffer before sorting the data, using the random int as the key.
    To sort, it uses a separate qsort implementation (nqsort), as heaptrack indicated that the stdlib qsort (from glibc 2.26-11 on Arch Linux) was performing a significant number of allocations, which were affecting the benchmark. I'll be investigating this later, should I have time.
  2. The parallel benchmark is heavy on allocations and parallelism, and light on other behaviour. It executes for a fixed amount of time (executing once, then checking a flag for whether it should continue, and so on).
    Each execution allocates a buffer, then copies all the values over from the input buffer. This is done by 4 threads in parallel (all copying from the same input buffer to different target buffers), to maximise contention on malloc's lock.
    The number of times it executes is tracked, and the result is reported as operations per second rather than as a raw count (in case execution continued slightly past the fixed amount of time, e.g. if the timing thread was slow to wake).
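A rough sketch of the sort benchmark's per-iteration work is below. This is not the actual benchmark code (which is in the project repo); the struct layout and names are my own illustration of the description above, shown for the malloc variant. The parallel benchmark wraps a similar allocate-and-copy step in four threads.

    #include <stdlib.h>
    #include <string.h>

    /* Assumed layout: a random sort key plus a sequential ID that is
     * populated but never read afterwards. */
    struct item {
        int key;   /* random int used as the sort key */
        int id;    /* sequential ID, populated but unused */
    };

    /* Comparison function: sort by the random key. */
    static int compare_items(const void *a, const void *b)
    {
        const struct item *ia = a, *ib = b;
        return (ia->key > ib->key) - (ia->key < ib->key);
    }

    /* One iteration of the sort benchmark using the malloc strategy; the
     * stack, dynamic and external variants differ only in where buf comes
     * from. The real benchmark sorts with its own qsort (nqsort) rather
     * than the stdlib one. */
    static void sort_iteration(const struct item *input, size_t n)
    {
        struct item *buf = malloc(n * sizeof(*buf));
        if (!buf)
            return;

        memcpy(buf, input, n * sizeof(*buf));        /* copy the input data */
        qsort(buf, n, sizeof(*buf), compare_items);  /* sort by random key */

        free(buf);
    }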

The types of allocation are explained below:

Predictions:

For 64 items or fewer, performance is expected to be ordered roughly as follows:

  1. External: As the control, performing no allocations, it should perform fastest as it's plainly doing less work
  2. Stack: Allocating on the stack should be fast, and for the benchmark allocations are simply performed using alloca
  3. Dynamic: Should perform similarly to stack, or possibly faster depending on how efficient alloca is (as the space can be allocated along with the rest of the stack frame)
  4. Malloc: Has to perform all the usual work in order to produce space, so should be slowest

Once the number of items exceeds that threshold, the dynamic allocator falls back to malloc, with the added (small) overhead of determining whether to use malloc or the stack. However, with the benchmark running multiple times under the same conditions, the branch predictor should be able to effectively optimise that check away, leaving it equivalent to malloc. The rest should remain relatively unchanged.
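To make the dynamic strategy concrete, the sketch below shows the pattern I understand it to use, assuming the cutoff is 64 items and that the small case is a fixed array reserved as part of the stack frame (the names and the exact boundary are assumptions; struct item is the illustrative type from the earlier sketch).

    #include <stdlib.h>
    #include <string.h>

    #define DYNAMIC_THRESHOLD 64   /* assumed cutoff for staying on the stack */

    static void process_dynamic(const struct item *input, size_t n)
    {
        /* Space for the small case is reserved along with the rest of the
         * stack frame, so there is no per-call allocation work at all. */
        struct item small[DYNAMIC_THRESHOLD];
        struct item *buf = small;

        if (n > DYNAMIC_THRESHOLD) {
            /* Larger requests fall back to malloc; when the benchmark repeats
             * one size, the branch predictor should learn this branch. */
            buf = malloc(n * sizeof(*buf));
            if (!buf)
                return;
        }

        memcpy(buf, input, n * sizeof(*buf));
        /* ... the benchmark's work on buf happens here ... */

        if (buf != small)
            free(buf);
    }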

In general, performance of all allocators should converge as the number of items increases and relatively less time is spent in allocation.

Results for sort benchmark:

Note: values less than 1 outperform malloc.

Unexpectedly, alloca performs worse than malloc. The others perform as expected. From inspecting the generated assembly, it looks like -O0 produces very inefficient code for alloca.
Everything performs mostly as expected. Stack and dynamic seem to perform very similarly, with dynamic unexpectedly underperforming malloc even before falling back to malloc itself. This happens around the same point as the other methods converge with malloc.
Everything is mostly as expected. Dynamic shouldn't be faster than stack, but they're close enough after the first item that I'm not too concerned.
This one's weird: dynamic performs even worse than alloca did with -O0. I tracked it down to something to do with -finline-functions. It turns out that the dynamic benchmark is split into two cases - one for when the stack is used, another for when malloc is used - and the correct case is jumped to at the start. When the perform_busy_work function gets inlined, it gets optimised into a call to memcpy followed by a call to nqsort, but only in the malloc case. This means the stack case performs worse by comparison, as it's using a less efficient method to copy the buffers over.
Same as the above with -fno-inline-functions.

Results for parallel benchmark:

Note: values greater than 1 outperform malloc.

alloca underperforms again; the other two perform as expected.
Unusually, stack and dynamic perform almost identically.
The benchmark functions other than external get optimised out, so this bit's irrelevant.
The benchmark functions other than external get optimised out, so this bit's irrelevant.

cURL Benchmarks

Comparing two commits mentioned in the Fewer mallocs in curl blog post to their parent commits, in order to isolate their effect on performance.

In the blog post, the benchmarking lumps these changes together with a number of other changes made between cURL releases, which could obscure the impact of the changes of interest.

In these benchmarks, the changes are compared directly against their parent commit in order to isolate their impact. The binary built from a given commit performs a download from a local server, which is then compared to the same download performed by the binary built from the parent commit.

multi-wait changes

The multi-wait library function was changed to use the stack instead of malloc in a common case where a small amount of data is allocated. This cut down (according to Stenberg's own measurements) the number of mallocs in a simple test by 99.67% (order of tens of thousands to order of hundreds).
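As I understand the commit, the shape of the change is roughly the following; the identifiers here are illustrative rather than copied from cURL's source.

    #include <poll.h>
    #include <stdlib.h>

    #define SMALL_POLL_SET 10   /* illustrative name for the on-stack limit */

    static int wait_on_fds(const struct pollfd *extra, unsigned int nfds,
                           int timeout_ms)
    {
        struct pollfd on_stack[SMALL_POLL_SET];
        struct pollfd *ufds = on_stack;
        unsigned int i;
        int rc;

        /* The common case (a handful of file descriptors) now lives on the
         * stack; only larger sets pay for a malloc. */
        if (nfds > SMALL_POLL_SET) {
            ufds = malloc(nfds * sizeof(struct pollfd));
            if (!ufds)
                return -1;
        }

        for (i = 0; i < nfds; i++)   /* fill the poll set */
            ufds[i] = extra[i];

        rc = poll(ufds, nfds, timeout_ms);

        if (ufds != on_stack)
            free(ufds);
        return rc;
    }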

In order to test this, the multi-single.c example was modified to download from 1 or 11 file descriptors (as multi-wait's limit allows up to 10 file descriptors on the stack), then used to download a 1GB empty file from a local Python server, writing it to /dev/null. This method is the same as that used by Stenberg in his blog post, except that he made requests to only two URLs (one local, one remote).

The times are then compared against the times taken to do the same with the parent commit. The case using 11 file descriptors should perform identically to the parent commit.

llist changes

cURL's linked list implementation was modified so that the generic linked list functions perform no allocation. This was done by redesigning the structs so that the linked list node struct is merged with the data struct, allowing the allocation for both to be done at once. According to Stenberg's measurements, this reduced the number of mallocs in a simple benchmark by 16% (order of hundreds to order of tens).
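A minimal sketch of the struct change as I understand it is below; the names are illustrative, not cURL's exact identifiers. With the node embedded in the stored struct, inserting into a list needs no allocation of its own and therefore cannot fail.

    /* Before: the list node and the stored data are separate allocations. */
    struct list_node_old {
        struct list_node_old *prev;
        struct list_node_old *next;
        void *payload;             /* points at a separately malloc'd struct */
    };

    /* After: the node lives inside the stored struct, so a single malloc
     * covers both, and the generic list functions never allocate. */
    struct list_node {
        struct list_node *prev;
        struct list_node *next;
        void *ptr;                 /* back-pointer to the containing struct */
    };

    struct queued_transfer {       /* illustrative payload struct */
        int id;
        struct list_node node;     /* embedded node, allocated with the payload */
    };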

However, the commit has other effects too: the generic linked list functions can no longer fail for lack of memory, which removes a number of error checks, and there are other small changes, such as an initialisation function switching from a guarded malloc+memset to calloc. All of these could affect the measured performance difference.
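For reference, the malloc+memset to calloc simplification looks roughly like this (a generic sketch, not cURL's actual diff):

    #include <stdlib.h>
    #include <string.h>

    struct state { int flags; void *buf; };   /* illustrative struct */

    /* Before: guarded malloc followed by memset to zero the struct. */
    static struct state *state_init_old(void)
    {
        struct state *s = malloc(sizeof(*s));
        if (!s)
            return NULL;
        memset(s, 0, sizeof(*s));
        return s;
    }

    /* After: calloc hands back already-zeroed memory in one call. */
    static struct state *state_init_new(void)
    {
        return calloc(1, sizeof(struct state));
    }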

Tests for curl-llist were run against a local python server, downloading a single 25GB empty file and writing it to /dev/null.

Results

Note: values less than 1 outperform the parent commit.

Times are as given by the time bash builtin.

When producing graphs of the results, only the four median values (out of 100 runs) are taken for each type of time.

Nothing varies by more than 0.5%. The only case that performs better is the one that should perform identically, suggesting that any performance benefit is drowned out by random outside effects.
Sys time should in theory decrease in the first two cases, as fewer mallocs should mean less time spent in kernel code (such as brk/sbrk/mmap). Once again, the maximum variation is 0.6%, but in the wrong direction compared to the predictions.
The llist case should decrease in time thanks to simplified code from the removal of mallocs and the use of more efficient library functions (calloc instead of malloc and memset). For the multi-wait case, user time shouldn't really change. This is mostly met, particularly in the llist case.