I'm comparing the performance of various ways to allocate memory against malloc.
In order to gather some data on this, two case studies were done.
The first case study replaces malloc with an alternative: in a pair of specialised benchmarks, malloc is swapped for what the benchmarks call dynamic and external allocation. There are two benchmarks, the source of which is in the project repo. Part of the per-call cost the alternatives avoid is taking malloc's lock.

Each benchmark runs for a fixed amount of time. The number of times it executes is tracked, then the number of operations per second is determined (in case execution continued past the fixed amount of time because the timing thread was asleep).
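As a rough sketch of how that measurement loop could be structured (the names and the placeholder operation here are my own, not the benchmark's actual code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static atomic_bool stop_flag = false;

/* Placeholder for one allocate/use/free cycle under test. */
static void one_operation(void)
{
    volatile char *p = malloc(64);
    if (p) {
        p[0] = 1;
        free((void *)p);
    }
}

/* Timing thread: sleep for the nominal duration, then tell the worker to stop. */
static void *timer_main(void *arg)
{
    sleep(*(unsigned *)arg);
    atomic_store(&stop_flag, true);
    return NULL;
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    unsigned duration = 5;  /* nominal benchmark length, in seconds */
    pthread_t timer;
    pthread_create(&timer, NULL, timer_main, &duration);

    double start = now_seconds();
    unsigned long long ops = 0;
    while (!atomic_load(&stop_flag)) {
        one_operation();
        ops++;
    }
    double elapsed = now_seconds() - start;
    pthread_join(timer, NULL);

    /* Divide by the measured elapsed time rather than the nominal duration,
       in case the timing thread overslept and the loop ran slightly longer. */
    printf("%.0f ops/sec\n", ops / elapsed);
    return 0;
}
```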
The types of allocation are explained below:

- malloc: the buffer is allocated with malloc on every operation.
- external: the buffer is allocated externally, outside the code being measured.
- dynamic: small requests use a fixed-size buffer, falling back to malloc instead for larger ones. The static buffer in these benchmarks holds 64 items.
- alloca: the buffer is allocated on the stack with alloca.
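A minimal sketch of the dynamic strategy, assuming a 64-item threshold and made-up names:

```c
#include <stdlib.h>
#include <string.h>

#define STACK_ITEMS 64

typedef int item_t;  /* assumed element type */

/* "dynamic": use a fixed-size stack buffer when the request fits,
   otherwise fall back to malloc. */
void process_items(const item_t *src, size_t n)
{
    item_t stack_buf[STACK_ITEMS];
    item_t *buf = (n <= STACK_ITEMS) ? stack_buf
                                     : malloc(n * sizeof(item_t));
    if (!buf)
        return;

    memcpy(buf, src, n * sizeof(item_t));
    /* ... do the benchmarked work on buf ... */

    if (buf != stack_buf)  /* only the fallback path owns heap memory */
        free(buf);
}
```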
For 64 or fewer items, all of the alternatives are expected to outperform malloc, since they avoid a general-purpose heap allocation on every operation. For 64 or more items, the dynamic allocator falls back to malloc, along with the added (small) overhead of determining whether to use malloc or the stack. However, with the benchmark running multiple times under the same conditions, the branch predictor should be able to effectively optimise that check away, leaving it equivalent to malloc. The rest should remain relatively unchanged.
In general, performance of all allocators should converge as the number of items increases and relatively less time is spent in allocation.
Note: values less than 1 outperform malloc.
At -O0, alloca performs worse than malloc; the others perform as expected. From inspecting the generated assembly, it looks like -O0 produces very inefficient code for alloca. The dynamic allocator converges with malloc even before falling back to malloc itself; this happens around the same time as the other methods converge with malloc.

At higher optimisation levels, the dynamic allocator underperformed in much the same way alloca did with -O0. Tracked it down to something to do with -finline-functions. It turns out that the dynamic benchmark is split into two cases - one for when the stack is used, another for when malloc is used - and the correct case is jumped to at the start. When the perform_busy_work function gets inlined, it gets optimised to a call to memcpy followed by a call to nqsort, but only in the malloc case. This means that the stack case performs worse by comparison, as it's using a less efficient method to copy the buffers over. Compiling with -fno-inline-functions avoids this.
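Roughly the shape of the code involved; perform_busy_work and nqsort are names from the benchmark, but the bodies below are stand-ins:

```c
#include <stdlib.h>

/* Stand-in for the benchmark's sort routine. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static void nqsort(int *buf, size_t n)
{
    qsort(buf, n, sizeof(int), cmp_int);
}

/* Copies the input, then works on the copy.  Once inlined, the copy loop
   can be recognised as memcpy -- but in the benchmark that only happened
   in the malloc branch, leaving the stack branch with the slower copy. */
static void perform_busy_work(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    nqsort(dst, n);
}

void run_case(const int *src, size_t n)
{
    if (n <= 64) {            /* stack case */
        int buf[64];
        perform_busy_work(buf, src, n);
    } else {                  /* malloc case */
        int *buf = malloc(n * sizeof(int));
        if (!buf)
            return;
        perform_busy_work(buf, src, n);
        free(buf);
    }
}
```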
Note: values greater than 1 outperform malloc.
alloca underperforms again; the other two perform as expected.

The second case study compares two commits mentioned in the Fewer mallocs in curl blog post to their parent commits, in order to isolate their effect on performance.
In the blog post, the benchmarking lumps these changes together with a number of other changes made between cURL releases, which could obscure the impact of the changes of interest.
In these benchmarks, the changes are compared directly to their parent commit in order to isolate their impact. The binary built from a given commit performs a download from a local server, which is then compared to the same download performed by the binary built from the parent commit.
The multi-wait library function was changed to use the stack instead of malloc in a common case where a small amount of data is allocated. This cut down (according to Stenberg's own measurements) the number of mallocs in a simple test by 99.67% (order of tens of thousands to order of hundreds).
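The change boils down to something like the following pattern (a sketch, not cURL's actual code; the threshold of 10 matches the limit mentioned below):

```c
#include <poll.h>
#include <stdlib.h>
#include <string.h>

#define NUM_ON_STACK 10   /* keep up to this many fds on the stack */

/* Small fd sets (the common case) use a stack array; larger ones
   still fall back to malloc. */
int wait_on_fds(const struct pollfd *fds, unsigned int nfds, int timeout_ms)
{
    struct pollfd on_stack[NUM_ON_STACK];
    struct pollfd *ufds = on_stack;

    if (nfds > NUM_ON_STACK) {
        ufds = malloc(nfds * sizeof(struct pollfd));
        if (!ufds)
            return -1;
    }
    memcpy(ufds, fds, nfds * sizeof(struct pollfd));

    int rc = poll(ufds, nfds, timeout_ms);

    if (ufds != on_stack)
        free(ufds);
    return rc;
}
```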
In order to test this, the multi-single.c example was modified to download from 1 or 11 file descriptors (as the limit on multi-wait puts up to 10 file descriptors on the stack), then used to download a 1GB empty file from a local Python server, writing it to /dev/null. This method is the same as that used by Stenberg in his blog post, other than the fact that he made requests to only two URLs (one local, one remote). The times are then compared to the times taken to do the same with the parent commit. The case using 11 file descriptors should perform identically to the parent commit.
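A sketch of roughly what the modified example reduces to, using the libcurl multi interface (the URL, port, and transfer count are placeholders):

```c
#include <curl/curl.h>
#include <stdio.h>

#define NUM_TRANSFERS 11   /* 1 or 11 in the benchmark described above */

/* Discard the body -- equivalent to writing it to /dev/null. */
static size_t discard(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr; (void)userdata;
    return size * nmemb;
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURLM *multi = curl_multi_init();
    CURL *easy[NUM_TRANSFERS];

    for (int i = 0; i < NUM_TRANSFERS; i++) {
        easy[i] = curl_easy_init();
        /* placeholder local server serving a large empty file */
        curl_easy_setopt(easy[i], CURLOPT_URL, "http://localhost:8000/1GB.bin");
        curl_easy_setopt(easy[i], CURLOPT_WRITEFUNCTION, discard);
        curl_multi_add_handle(multi, easy[i]);
    }

    int running = 0;
    do {
        curl_multi_perform(multi, &running);
        if (running) {
            int numfds = 0;
            /* The commit under test changes how curl_multi_wait allocates
               the fd array it waits on internally. */
            curl_multi_wait(multi, NULL, 0, 1000, &numfds);
        }
    } while (running);

    for (int i = 0; i < NUM_TRANSFERS; i++) {
        curl_multi_remove_handle(multi, easy[i]);
        curl_easy_cleanup(easy[i]);
    }
    curl_multi_cleanup(multi);
    curl_global_cleanup();
    return 0;
}
```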
cURL's linked list implementation was modified so that the generic linked list functions perform no allocation. This was done by redesigning the structs so that the linked list node struct was merged with the data struct, allowing the allocation for both to be done at once. According to Stenberg's measurements, this reduced the number of mallocs by 16% (order of hundreds to order of tens) in a simple benchmark.
However, the commit has some other impacts as well: the generic linked list functions can no longer fail for lack of memory, which reduces the number of checks required, and there are some other small changes, such as an initialisation function being changed from a guarded malloc+memset to calloc. All of these changes could affect the performance difference.
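A sketch of the idea behind the change, with invented struct and function names rather than cURL's actual definitions:

```c
#include <stdlib.h>

/* Before: a generic list node is allocated separately for every element,
   so insertion can fail and every caller has to check for that. */
struct old_node {
    struct old_node *next;
    void *ptr;                 /* separately allocated element */
};

/* After: the node is embedded in the element it links, so one allocation
   covers both, and the generic list functions never allocate (and so can
   never fail for lack of memory). */
struct llist_node {
    struct llist_node *next;
    void *ptr;                 /* back-pointer to the containing element */
};

struct element {
    int value;                 /* the element's own data */
    struct llist_node node;    /* list linkage lives inside the element */
};

/* One calloc covers element + node; this also mirrors the commit's switch
   from a guarded malloc+memset to calloc in an initialisation function. */
struct element *element_new(int value)
{
    struct element *e = calloc(1, sizeof(*e));
    if (e)
        e->value = value;
    return e;
}

/* Pure pointer manipulation -- cannot fail. */
void list_push(struct llist_node **head, struct element *e)
{
    e->node.ptr = e;
    e->node.next = *head;
    *head = &e->node;
}
```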
Tests for curl-llist were run against a local Python server, downloading a single 25GB empty file and writing it to /dev/null.
Note: values less than 1 outperform the parent commit.
Times are as given by the time bash builtin.
When producing graphs for the results, only the four median values for each type of time are taken from 100 runs.
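Assuming "the four median values" means the middle four of the sorted runs, the selection looks something like this:

```c
#include <stdlib.h>

#define RUNS 100

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort the measured times and keep only the middle four values,
   discarding the faster and slower outliers on either side. */
void middle_four(double times[RUNS], double out[4])
{
    qsort(times, RUNS, sizeof(double), cmp_double);
    for (int i = 0; i < 4; i++)
        out[i] = times[RUNS / 2 - 2 + i];
}
```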
Fewer mallocs should mean less time spent in kernel code (such as brk/sbrk/mmap), so system time should decrease. Once again, the maximum variation is 0.6%, but in the wrong direction compared to predictions.

User time should drop slightly in the llist case, thanks to fewer mallocs and the use of more efficient library functions (calloc instead of malloc and memset). For the multi-wait case, user time shouldn't really change. This is mostly met, particularly in the llist case.