Re: [PATCH] mm: fix maxnode for mbind(), set_mempolicy() and migrate_pages()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 23.07.24 19:37, David Hildenbrand wrote:
On 23.07.24 18:33, Jerome Glisse wrote:
On Mon, 22 Jul 2024 at 06:09, David Hildenbrand <david@xxxxxxxxxx> wrote:

On 20.07.24 19:35, Jerome Glisse wrote:
Because maxnode bug there is no way to bind or migrate_pages to the
last node in multi-node NUMA system unless you lie about maxnodes
when making the mbind, set_mempolicy or migrate_pages syscall.

Manpage for those syscall describe maxnodes as the number of bits in
the node bitmap ("bit mask of nodes containing up to maxnode bits").
Thus if maxnode is n then we expect to have a n bit(s) bitmap which
means that the mask of valid bits is ((1 << n) - 1). The get_nodes()
decrement lead to the mask being ((1 << (n - 1)) - 1).

The three syscalls use a common helper get_nodes() and first things
this helper do is decrement maxnode by 1 which leads to using n-1 bits
in the provided mask of nodes (see get_bitmap() an helper function to
get_nodes()).

The lead to two bugs, either the last node in the bitmap provided will
not be use in either of the three syscalls, or the syscalls will error
out and return EINVAL if the only bit set in the bitmap was the last
bit in the mask of nodes (which is ignored because of the bug and an
empty mask of nodes is an invalid argument).

I am surprised this bug was never caught ... it has been in the kernel
since forever.

Let's look at QEMU: backends/hostmem.c

       /*
        * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
        * as argument to mbind() due to an old Linux bug (feature?) which
        * cuts off the last specified node. This means backend->host_nodes
        * must have MAX_NODES+1 bits available.
        */

Which means that it's been known for a long time, and the workaround
seems to be pretty easy.

So I wonder if we rather want to update the documentation to match reality.

[Sorry resending as text ... gmail insanity]

I think it is kind of weird if we ask to supply maxnodes+1 to work
around the bug. If we apply this patch qemu would continue to work as
is while fixing users that were not aware of that bug. So I would say
applying this patch does more good. Long term qemu can drop its
workaround or keep it for backward compatibility with old kernel.

Not really, unfortunately. The thing is that it requires a lot more
effort to sense support than simply pass maxnodes+1. So unless you know
exactly on which minimum kernel version your software runs (barely
happens), you will simply apply the workaround.

I would assume that each and every sane user out there does that
already, judging that even that QEMU code is 10 years old (!).

In any case, we have to document that behavior that existed since the
very beginning. Because it would be even *worse* if someone would
develop against a new kernel and would get a bunch of bug reports when
running on literally every old kernel out there :)

So my best guess is that long-term it will create more issues when we
change the behavior ... but in any case we have to update the man pages.

I'll add a pointer to a discussion from 20 (!) year ago [1].

The conclusion [2] / request from Andi there was that N65 is the right thing to do, and we (re)added the --maxnode. This was when this code was introduced.

Assuming MAX_NODES is 128, we have to pass in 129 -- one more than the number of bits.

So the man page might be actively misleading (unless I misinterpret it): "nodemask points to a bit mask of nodes containing up to maxnode bits. The bit mask size is rounded to the next multiple of sizeof(unsigned long), but the kernel will use bits only up to maxnode."

That documentation should be fixed.

[1] https://lkml.indiana.edu/hypermail/linux/kernel/0409.1/1538.html
[2] https://lkml.indiana.edu/hypermail/linux/kernel/0409.1/1569.html

--
Cheers,

David / dhildenb





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux