Re: [PATCH RFC] mm: mmap: Change DEFAULT_MAX_MAP_COUNT to INT_MAX


 



On 30. 08. 24 14:01, Lorenzo Stoakes wrote:
On Fri, Aug 30, 2024 at 12:41:37PM GMT, Lorenzo Stoakes wrote:
On Fri, Aug 30, 2024 at 11:56:36AM GMT, Petr Spacek wrote:
From: Petr Spacek <pspacek@xxxxxxx>

Raise default sysctl vm.max_map_count to INT_MAX, which effectively
disables the limit for all sane purposes. The sysctl is kept around in
case there is some use-case for this limit.

[snip]

NACK.

Sorry this may have come off as more hostile than intended... we are
welcoming of patches, promise :)

[snip]

Understood. The RFC in the subject was honest - and we are having the discussion now, so all's good!

I also apologize for not Cc'ing the right people. This is my first patch here, and I'm still trying to grasp the process.


It is only because we want to be _super_ careful about things like this
that can have potentially problematic impact if you have a buggy program
that allocates too many VMAs.

Now I understand your concern. From the docs and code comments I've seen, it was not clear that the limit serves _another_ purpose than being a mere compatibility shim for old ELF tools.

It is a NACK, but it's a NACK because of the limit being so high.

With steam I believe it is a product of how it performs allocations, and
unfortunately this causes it to allocate quite a bit more than you would
expect.

FTR, data points from select non-game applications:

ElasticSearch and OpenSearch insist on at least 262144.
DNS server BIND 9.18.28 linked to jemalloc 5.2.1 was observed with usage around 700000.
OpenJDK GC sometimes weeps about values < 737280.
SAP docs I was able to access use 1000000.
MariaDB is being tested by their QA with 1048576.
The Fedora, Ubuntu, NixOS, and Arch distros went with 1048576.

Is it worth sending a patch with the default raised to 1048576?


With jemalloc that seems strange, perhaps buggy behaviour?

Good question. In the case of the BIND DNS server, jemalloc handles mmap() and we only keep statistics about bytes requested from malloc().

When we hit the max_map_count limit, the ratio
(sum of not-yet-freed malloc(size)) / (vm.max_map_count)
gives an average mmapped block size of ~100 kB.

Is ~100 kB way too low, or does it indicate a bug? It does not seem terrible to me - the application is handling ~100-1500 B packets at a rate somewhere between 10-200 k packets per second, so it is expected to do lots of small, short-lived allocations.
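As a sanity check of that arithmetic (illustrative numbers only; the outstanding-bytes total below is a hypothetical figure I picked to match the ratio, not a measurement):

```python
# Back-of-envelope check of the average mmapped block size.
# The outstanding byte total is hypothetical; the mapping count
# matches the ~700000 figure observed with BIND + jemalloc above.
outstanding_bytes = 70 * 10**9   # hypothetical sum of not-yet-freed malloc(size)
map_count = 700_000              # observed number of mappings at the limit
avg_block = outstanding_bytes / map_count
print(f"average mmapped block: ~{avg_block / 1000:.0f} kB")
```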

A complicating factor is that the process itself does not see the current counter value (unless BPF is involved), so it is hard to monitor how close we are until the limit is actually hit.
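(For completeness: a rough userspace approximation, my own sketch rather than an official interface, is to count lines in /proc/<pid>/maps, since the kernel prints one line per VMA - though this is costly to do often and may not exactly match the kernel's internal map_count.)

```python
# Sketch: approximate a process's VMA count by counting lines in
# /proc/self/maps (one line per VMA) and compare it against the
# system-wide vm.max_map_count limit. Linux-only, and only an
# approximation of the kernel's internal counter.
def vma_usage():
    with open("/proc/self/maps") as f:
        used = sum(1 for _ in f)
    with open("/proc/sys/vm/max_map_count") as f:
        limit = int(f.read())
    return used, limit

used, limit = vma_usage()
print(f"VMAs in use: {used} / limit: {limit}")
```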

It may be reasonable to adjust the default limit higher, and I'm not
opposed to that, but it might be tricky to find a level that is sensible
across all arches including ones with significantly smaller memory
availability.

Hmm... Thinking aloud:

Are VMA sizes included in cgroup v2 memory accounting? Maybe the safety limit can be handled there?


If sizing based on available memory is a concern, then a single fixed value is probably already wrong? I mean, current boxes range from a dozen MB to 512 GB of RAM.

For a box with 16 MB of RAM we get ~16M / (sizeof VMA ~ 184 B) ≈ 91 180 VMAs to fill RAM, while the current limit is 65 530 _per process_.

A threat model which allows an attacker to mmap() but not fork() seems theoretical to me. I.e. an insane (or rogue) application can eat up to
(max # of processes) * (max_map_count) * (sizeof VMA)
bytes of memory, not just the
(max_map_count) * (sizeof VMA)
we were talking about before.
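Putting rough numbers on that bound (my arithmetic; the VMA size and the process cap are approximations, not authoritative figures):

```python
# Sketch: compare the per-process VMA memory bound with the
# fork()-amplified bound. The 184-byte VMA size is the approximate
# sizeof(struct vm_area_struct) mentioned above; the process cap
# assumes kernel.pid_max raised to its 4194304 limit.
SIZEOF_VMA = 184            # approx. bytes per struct vm_area_struct
MAX_MAP_COUNT = 65530       # current default vm.max_map_count
MAX_PROCESSES = 4194304     # e.g. kernel.pid_max at its upper limit

per_process = MAX_MAP_COUNT * SIZEOF_VMA
with_fork = MAX_PROCESSES * per_process

print(f"single process: {per_process / 2**20:.1f} MiB")
print(f"with fork():    {with_fork / 2**40:.1f} TiB")
```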

Apologies for having more questions than answers. I'm trying to understand what purpose the limit serves and whether we can improve the user experience.

Thank you for your patience, and have a great weekend!

--
Petr Špaček
Internet Systems Consortium




