Re: [PATCH] mm: fix maxnode for mbind(), set_mempolicy() and migrate_pages()

Jerome Glisse <jglisse@xxxxxxxxxx> · Tue, 23 Jul 2024 09:19:07 -0700

On Mon, 22 Jul 2024 at 06:09, David Hildenbrand <david@xxxxxxxxxx> wrote:
On 20.07.24 19:35, Jerome Glisse wrote:

> Because of the maxnode bug there is no way to bind or migrate_pages to the
> last node in a multi-node NUMA system unless you lie about maxnode
> when making the mbind, set_mempolicy or migrate_pages syscall.
>
> The man pages for those syscalls describe maxnode as the number of bits in
> the node bitmap ("bit mask of nodes containing up to maxnode bits").
> Thus if maxnode is n we expect an n-bit bitmap, which
> means that the mask of valid bits is ((1 << n) - 1). The get_nodes()
> decrement leads to the mask being ((1 << (n - 1)) - 1).
>
> The three syscalls use a common helper, get_nodes(), and the first thing
> this helper does is decrement maxnode by 1, which leads to using n-1 bits
> in the provided mask of nodes (see get_bitmap(), a helper function of
> get_nodes()).
>
> This leads to two bugs: either the last node in the provided bitmap will
> not be used by any of the three syscalls, or the syscalls will error
> out and return EINVAL if the only bit set in the bitmap was the last
> bit in the mask of nodes (which is ignored because of the bug, and an
> empty mask of nodes is an invalid argument).
>
> I am surprised this bug was never caught ... it has been in the kernel
> since forever.
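
For reference, the mask arithmetic from the description above can be sketched in a few lines of C. This is only an illustration of the two formulas, not kernel code: the helper names documented_mask() and effective_mask() are made up for this example, and the bitmap is assumed to fit in a single 64-bit word.

```c
#include <stdint.h>

/*
 * Hypothetical helper: mask of valid bits the man page wording implies
 * for a bitmap of maxnode bits, i.e. ((1 << n) - 1).
 */
static uint64_t documented_mask(unsigned int maxnode)
{
    /* Avoid an undefined shift when all 64 bits are requested. */
    return maxnode >= 64 ? UINT64_MAX : (UINT64_C(1) << maxnode) - 1;
}

/*
 * Hypothetical helper: mask left over once maxnode has been decremented
 * by one, i.e. ((1 << (n - 1)) - 1).
 */
static uint64_t effective_mask(unsigned int maxnode)
{
    return maxnode == 0 ? 0 : documented_mask(maxnode - 1);
}
```

With n = 4, documented_mask(4) is 0xf while effective_mask(4) is 0x7, so a bitmap that selects only the last node (0x8) is reduced to an empty mask.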

Let's look at QEMU: backends/hostmem.c

     /*
      * We can have up to MAX_NODES nodes, but we need to pass maxnode+1
      * as argument to mbind() due to an old Linux bug (feature?) which
      * cuts off the last specified node. This means backend->host_nodes
      * must have MAX_NODES+1 bits available.
      */

Which means that it's been known for a long time, and the workaround
seems to be pretty easy.

So I wonder if we rather want to update the documentation to match reality.

I think it is kind of weird to ask callers to supply maxnode+1 to work around the bug. If we apply this patch, QEMU would continue to work as is, while users that were not aware of the bug get fixed. So I would say applying this patch does more good. Long term, QEMU can drop its workaround or keep it for backward compatibility with old kernels.
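
To make the compatibility argument concrete, here is a simplified model (not the actual get_nodes() code) of which bits of a user-supplied nodemask end up being considered. The function nodes_seen() and its `fixed` flag are hypothetical names for this sketch, and it assumes the bitmap fits in one 64-bit word and ignores the kernel's other validity checks.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model, not kernel code: return the bits of user_mask that
 * are considered for a given maxnode argument.
 *   fixed == false: maxnode is decremented first (current behaviour)
 *   fixed == true:  maxnode bits are used as documented (proposed patch)
 */
static uint64_t nodes_seen(uint64_t user_mask, unsigned int maxnode, bool fixed)
{
    unsigned int bits = fixed ? maxnode : (maxnode ? maxnode - 1 : 0);
    uint64_t valid = bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;

    return user_mask & valid;
}
```

On a hypothetical 4-node system with nodes 0 and 3 requested (mask 0x9), a caller passing maxnode = 4 gets 0x1 today and 0x9 with the fix, while a caller that already passes maxnode = 5 as a workaround gets 0x9 in both cases, since the extra bit it exposes is never set.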