Carlos O'Donell wrote on Tue, Dec 14, 2021 at 11:07:42AM -0500:
>> So I guess we're just chasing after artifacts from the allocator, and
>> it'll be hard to tell which it is when I happen to see pipewire-pulse
>> with high memory later on...
> It can be difficult to tell the difference between:
> (a) allocator caching
> (b) application usage
> To help with this we developed some additional tracing utilities:
> https://pagure.io/glibc-malloc-trace-utils
Thanks for the pointer, I knew something existed for this but couldn't
remember what it was.
I don't see this in any package, maybe it'd be interesting to ship these
for easy use?
(yes, it's not difficult to git clone and configure/make locally, but
I'll forget about it again whereas a package might be easier to
remember)
For now, I can confirm that all memory is indeed freed in a timely
manner as far as pipewire-pulse knows.
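
For anyone who wants to double-check this kind of thing without the
trace utilities, glibc also exposes malloc_stats(), malloc_info() and
malloc_trim() in <malloc.h>: if RSS drops sharply after a trim, the
memory was cached free chunks rather than live data. A rough sketch
(the helper names below are made up, this is not something
pipewire-pulse does today):

/* Sketch: distinguish allocator caching from live application usage.
 * If RSS drops a lot after malloc_trim(), the memory was free chunks
 * cached by glibc malloc, not live application data. */
#include <malloc.h>
#include <stdio.h>

static void dump_malloc_state(const char *tag)
{
        fprintf(stderr, "=== %s ===\n", tag);
        malloc_stats();              /* brief per-arena summary on stderr */
        malloc_info(0, stderr);      /* detailed XML dump, glibc-specific */
}

void debug_allocator_cache(void)
{
        dump_malloc_state("before trim");
        malloc_trim(0);              /* return cached free memory to the kernel */
        dump_malloc_state("after trim");
}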
>> From what I can see the big allocations are (didn't look at lifetime
>> of each alloc):
>> - load_spa_handle for audioconvert/libspa-audioconvert allocs 3.7MB
>> - pw_proxy_new allocates 590k
>> - reply_create_playback_stream allocates 4MB
>> - spa_buffer_alloc_array allocates 1MB from negotiate_buffers
>> - spa_buffer_alloc_array allocates 256K x2 + 128K
>>   from negotiate_link_buffers
> On a 64-bit system the maximum dynamic allocation size is 32MiB.
> As you call malloc with ever larger values the dynamic scaling will
> scale up to at most 32MiB (half of a 64MiB heap). So it is possible
> that all of these allocations are placed on the mmap/sbrk'd heaps and
> stay there for future usage until freed back.
Yes, that's my guess as well - as they're all different sizes the cache
can blow up.
> Could you try running with this env var:
> GLIBC_TUNABLES=glibc.malloc.mmap_threshold=131072
> Note: See `info libc tunables`.
With this the peak usage moved down from ~300-600MB to 80-150MB, and it
comes back down to 80-120MB instead of ~300MB.
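
For reference, that tunable caps the mmap threshold that glibc
otherwise scales dynamically (up to the 32MiB mentioned above), so
large allocations keep going through mmap() and get returned to the
kernel on free() instead of being cached on the heap. The same cap can
be set from inside the process with mallopt(); a sketch, assuming one
would even want to hard-code it:

/* Sketch: roughly the in-process equivalent of
 * GLIBC_TUNABLES=glibc.malloc.mmap_threshold=131072.
 * Pinning the threshold also disables glibc's dynamic scaling of it,
 * so allocations >= 128KiB are served by mmap() and unmapped on free()
 * instead of being cached on the heap. */
#include <malloc.h>

int setup_malloc_limits(void)
{
        return mallopt(M_MMAP_THRESHOLD, 128 * 1024);
}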
>> maybe some of these buffers sticking around for the duration of the
>> connection could be pooled and shared?
> They are pooled and shared if they are cached by the system memory
> allocator. All of tcmalloc, jemalloc, and glibc malloc attempt to
> cache the userspace requests with different algorithms that match
> given workloads.
Yes, I didn't mean pooling as in a pooling allocator, but pooling at
the usage level, e.g. all the objects could share the same buffers when
they need them.
I can understand buffers being allocated per client, so an overhead of
1-2MB per client is acceptable, but the bulk of the spa handle seems to
be spent storing many big ports?
$ pahole -y impl spa/plugins/audioconvert/libspa-audioconvert.so.p/merger.c.o
struct impl {
        ...
        struct port            in_ports[64];     /*     256 1153024 */
        /* --- cacheline 18020 boundary (1153280 bytes) --- */
        struct port            out_ports[65];    /* 1153280 1171040 */
        /* --- cacheline 36317 boundary (2324288 bytes) was 32 bytes ago --- */
        struct spa_audio_info  format;           /* 2324320     284 */
        ...
$ pahole -y impl spa/plugins/audioconvert/libspa-audioconvert.so.p/splitter.c.o
struct impl {
        ...
        struct port            in_ports[1];      /*     184   18056 */
        /* --- cacheline 285 boundary (18240 bytes) --- */
        struct port            out_ports[64];    /*   18240 1155584 */
        /* --- cacheline 18341 boundary (1173824 bytes) --- */
        ...
Which themselves have a bunch of buffers:
struct port {
        ...
        struct buffer          buffers[32];      /*     576   17408 */
(pahole also prints useful hints that the structures have quite a bit of
padding, so some optimization there could save some scraps, but I think
it's more fundamental than this)
I understand that allocating once in bulk is ideal for latency, so I
have no problem with overallocating a bit, but I'm not sure we need so
many buffers lying around when clients are muted and probably not using
most of them :)
(I also understand that this isn't an easy change I'm asking for, and
it doesn't have to be immediate)
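
To make the previous point a bit more concrete, here is the rough
direction I have in mind: keep the port arrays as pointers and allocate
each port (and its buffers) lazily on first use. The names and layout
below are invented for illustration and don't match the actual
audioconvert code:

/* Sketch: allocate ports (and their buffer arrays) only when a port is
 * actually used, instead of embedding 64+ fully-sized ports in every
 * impl.  Purely illustrative, not the real spa/audioconvert layout. */
#include <stdlib.h>

#define MAX_PORTS 64

struct buffer {
        void *data;
        /* ... */
};

struct port {
        struct buffer *buffers;   /* allocated on first use, not [32] inline */
        unsigned int n_buffers;
        /* ... */
};

struct impl {
        struct port *in_ports[MAX_PORTS];    /* NULL until the port is used */
        struct port *out_ports[MAX_PORTS];
        /* ... */
};

static struct port *get_in_port(struct impl *impl, unsigned int id)
{
        if (id >= MAX_PORTS)
                return NULL;
        if (impl->in_ports[id] == NULL)
                impl->in_ports[id] = calloc(1, sizeof(struct port));
        return impl->in_ports[id];
}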
BTW I think we're getting into the nitty-gritty here, which might be
fine for the list but probably leaves some pipewire devs out. Perhaps
it's time to move this to a new pipewire issue?
--
Dominique