I'm comparing the performance of various ways to allocate memory against malloc.
To gather some data on this, two case studies were done.
Specialised benchmarks - Benchmark code designed to exhibit the best case for replacing malloc with an alternative
cURL benchmarks - A real world example from cURL where malloc was replaced with what the specialised benchmarks call dynamic and external allocation
Specialised Benchmarks
There are two benchmarks, the source of which is in the project repo.
The sort benchmark is light on allocations, and performs the following a fixed number of times (10-20 thousand times per allocation type): it makes a single allocation, copies input data (a struct containing a random int and a sequential ID, which is populated but never used) into that buffer, and then sorts the data, using the random int as the key.
To sort, it uses this qsort, as heaptrack indicated that the stdlib qsort (from glibc 2.26-11 on Arch Linux) was performing a significant number of allocations, which were affecting the benchmark. I'll be investigating this later, should I have time.
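The core of one iteration can be sketched as follows (names and the struct layout are my own illustration, not the benchmark's actual code; the real benchmark substitutes its own qsort for the stdlib one):

```c
#include <stdlib.h>
#include <string.h>

/* Each item carries a random sort key plus a sequential ID that is
 * populated but never read back, mirroring the description above. */
struct item {
    int key; /* random value, used as the sort key */
    int id;  /* sequential ID, set but never used */
};

static int compare_items(const void *a, const void *b) {
    const struct item *ia = a;
    const struct item *ib = b;
    return (ia->key > ib->key) - (ia->key < ib->key);
}

/* One iteration: a single allocation, a copy of the input, then a sort.
 * The malloc call is the part each allocation type replaces. */
static void sort_once(const struct item *input, size_t n) {
    struct item *buf = malloc(n * sizeof *buf);
    if (!buf)
        return;
    memcpy(buf, input, n * sizeof *buf);
    qsort(buf, n, sizeof *buf, compare_items);
    free(buf);
}
```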
The parallel benchmark is heavy on allocations and parallelism, light on other behaviour. It executes for a fixed amount of time (executing once, then checking a flag for whether it should continue, and so on).
Each execution allocates a buffer, and then copies all the values over from the input buffer. This is done by 4 threads in parallel (all copying from the same input buffer to different target buffers), to maximise contention of malloc's lock.
The number of times it executes is tracked, and the number of operations per second is then determined (in case execution continued past the fixed amount of time because the timing thread was still asleep).
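The hot loop of the parallel benchmark can be sketched like this (a simplified version with hypothetical names and sizes; the real code also measures elapsed time to compute operations per second):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 4   /* four copying threads, as in the benchmark */
#define BUF_ITEMS 4096  /* hypothetical input size */

static int input[BUF_ITEMS];            /* shared source buffer */
static atomic_bool keep_running = true; /* cleared by the timing thread */
static atomic_ulong iterations;         /* total executions, all threads */

/* Each execution allocates a fresh buffer and copies the shared input
 * into it; running this on four threads at once maximises contention on
 * malloc's lock. The iteration count later becomes operations/second. */
static void *worker(void *arg) {
    (void)arg;
    while (atomic_load(&keep_running)) {
        int *buf = malloc(sizeof input); /* replaced per allocation type */
        if (!buf)
            break;
        memcpy(buf, input, sizeof input);
        free(buf);
        atomic_fetch_add(&iterations, 1);
    }
    return NULL;
}
```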
The types of allocation are explained below:
Dynamic - Uses a fixed-size stack-allocated buffer, falling back to malloc if the input data doesn't fit. The stack buffer in these benchmarks holds 64 items
External - Rather than allocating its own buffer, uses one that's passed in. Intended as a control, as no allocation is done
Malloc - Uses malloc
Stack - Uses alloca
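The dynamic strategy above can be sketched roughly as follows (names and the summing work are mine, added just so the result is observable; the 64-item threshold matches the benchmarks):

```c
#include <stdlib.h>
#include <string.h>

#define STACK_ITEMS 64 /* size of the fixed stack buffer */

/* Dynamic allocation: small inputs use a stack buffer, anything larger
 * falls back to malloc. Only the heap path needs freeing. */
static int sum_with_dynamic_buffer(const int *input, size_t n) {
    int stack_buf[STACK_ITEMS];
    int *buf = (n <= STACK_ITEMS) ? stack_buf : malloc(n * sizeof *buf);
    int sum = 0;
    if (!buf)
        return 0;
    memcpy(buf, input, n * sizeof *buf);
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    if (buf != stack_buf)
        free(buf);
    return sum;
}
```

The `buf != stack_buf` check is the small branch whose cost (and predictability) the predictions below discuss.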
Predictions:
Performance is expected to be roughly ordered as follows for 64 or fewer items:
External: As the control, performing no allocations, it should perform fastest as it's plainly doing less work
Stack: Allocating on the stack should be fast, and for the benchmark allocations are simply performed using alloca
Dynamic: Should perform similarly to stack, or possibly faster depending on how efficient alloca is (as the space can be allocated along with the rest of the stack frame)
Malloc: Has to perform all the usual work in order to produce space, so should be slowest
For more than 64 items, the dynamic allocator falls back to malloc, with the added (small) overhead of determining whether to use malloc or the stack. However, since the benchmark runs many times under the same conditions, the branch predictor should be able to effectively optimise that check away, leaving it equivalent to malloc. The rest should remain relatively unchanged.
In general, performance of all allocators should converge as the number of items increases and relatively less time is spent in allocation.
Results for sort benchmark:
Note: values less than 1 outperform malloc.
Results for parallel benchmark:
Note: values greater than 1 outperform malloc.
cURL Benchmarks
Comparing two commits mentioned in the Fewer mallocs in curl blog post to their parent commits, in order to isolate their effect on performance.
In the blog post's benchmarking, the changes are lumped together with a number of other changes made between cURL releases, which could obscure the impact of the changes of interest.
In these benchmarks, the changes are compared directly to their parent commits in order to isolate their impact. The binary built from a given commit performs a download from a local server, which is then compared to the same download performed by the binary built from the parent commit.
multi-wait changes
The multi-wait library function was changed to use the stack instead of malloc in a common case where a small amount of data is allocated. This cut down (according to Stenberg's own measurements) the number of mallocs in a simple test by 99.67% (order of tens of thousands to order of hundreds).
In order to test this, the multi-single.c example was modified to download from 1 or 11 file descriptors (as the limit on multi-wait puts up to 10 file descriptors on the stack), then used to download a 1GB empty file from a local Python server, writing it to /dev/null. This method is the same as the one used by Stenberg in his blog post, except that he made requests to only two URLs (one local, one remote).
The times are then compared to the times taken to do the same in the parent commit. The case using 11 items should perform identically to the parent commit.
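The shape of the change is the same stack-or-malloc pattern as the specialised benchmarks' dynamic allocator; a rough sketch (illustrative names and return value, not curl's actual code):

```c
#include <poll.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_POLLS 10 /* up to 10 file descriptors stay on the stack */

/* Small pollfd arrays live on the stack; larger ones fall back to
 * malloc. Returns 0 on success, -1 on allocation failure. */
static int wait_on_fds(size_t nfds, int timeout_ms) {
    (void)timeout_ms;
    struct pollfd small[SMALL_POLLS];
    struct pollfd *ufds = small;
    if (nfds > SMALL_POLLS) {
        ufds = malloc(nfds * sizeof *ufds);
        if (!ufds)
            return -1;
    }
    memset(ufds, 0, nfds * sizeof *ufds);
    /* ... populate ufds and call poll(ufds, nfds, timeout_ms) ... */
    if (ufds != small)
        free(ufds);
    return 0;
}
```

This is why the 11-descriptor case should match the parent commit: it takes the malloc path either way.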
llist changes
cURL's linked list implementation was modified so that the generic linked list functions perform no allocation. This was done by redesigning the structs so that the linked list node struct was merged with the data struct, allowing the allocation for both to be done at once. According to Stenberg's measurements, this reduced the number of mallocs in a simple benchmark by 16% (order of hundreds to order of tens).
However, there are some other impacts of the changes made in that commit, such as the fact that the generic linked list functions can no longer fail for lack of memory, which reduces the number of checks required. There are also some other small changes, such as an initialisation function being changed from a guarded malloc+memset to calloc. All of these changes could impact the performance difference.
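The shape of that redesign can be sketched as follows (illustrative names; curl's real structs differ). Because the node is embedded in the element, appending never allocates, which is exactly why the failure checks disappear:

```c
#include <stdlib.h>

struct list_node {
    struct list_node *prev;
    struct list_node *next;
    void *ptr; /* back-pointer to the element owning this node */
};

struct list {
    struct list_node *head;
    struct list_node *tail;
    size_t size;
};

/* The element embeds its node, so element and node are one allocation. */
struct element {
    int payload;
    struct list_node node;
};

/* Append never allocates: the caller hands over the embedded node, so
 * there is no failure path and no return code to check. */
static void list_append(struct list *l, void *elem, struct list_node *node) {
    node->ptr = elem;
    node->prev = l->tail;
    node->next = NULL;
    if (l->tail)
        l->tail->next = node;
    else
        l->head = node;
    l->tail = node;
    l->size++;
}
```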
Tests for curl-llist were run against a local Python server, downloading a single 25GB empty file and writing it to /dev/null.
Results
Note: values less than 1 outperform the parent commit.
Times are as given by the bash time builtin.
Real - Wall clock time elapsed (i.e. actual time from start to finish)
Sys - CPU time spent in the kernel
User - CPU time spent outside the kernel
When producing graphs for the results, only the four median values (out of 100 runs) for each type of time are used.