On Mon, Jan 27, 2025 at 8:45 AM Fedor Pchelkin <pchelkin@xxxxxxxxx> wrote: > > +Cc Mina Almasry and Mike Kravetz > +Cc current Hugetlb maintainer as well. > On Mon, 27. Jan 14:46, Lorenzo Stoakes wrote: > > Thanks for the report. > > > > On Mon, Jan 27, 2025 at 05:32:01PM +0300, Daniil Dulov wrote: > > > In copy_vma() allocation of maple tree nodes may fail. Since page accounting > > > takes place at the close() operation for hugetlb, it is called at the error > > > path against the new_vma to account pages of the vma that was not successfully > > > copied and that shares the page_counter with the original vma. Then, when the > > > process is being terminated, vm_ops->close() is called once again against the > > > original vma, which results in a page_counter underflow. > > > > This seems like a bug in hugetlb. > > > > I really hate the solution here, it's hacky and assumes only these fields are > > meaningful for 'close twice' scenarios. > > > > We now use vma_close(), which assigns vma->vm_ops to vma_dummy_vm_ops, meaning > > no further close() invocations can occur. > > Does the "close twice" scenario exactly mean ->close() is called twice > for the same object of struct vm_area_struct? > > For the observed case I think that's not true. ->close() is called for > two different objects of type vm_area_struct - the first time for the > new_vma on error path of copy_vma(), the second time for the original > vma. It turns out then these objects share the same reservation map > holding page_counters at this point of time. > > > > > If hugetlb is _still_ choosing to internally invoke this, it seems like it > > should have some if (vma->vm_ops == hugetlb_vm_ops) { ... } check first? That > > way it'll account for the closing twice issue. > > > > Can you easily repro in order to check a solution like that fixes your problem? > > I don't see why it shouldn't > > > > Seems that wouldn't fix the problem (again, two different vma objects). > There's presumably no obvious place in hugetlb internals where this may > be fixed elegantly, at the quick glance. But yep, it does look like a bug > for hugetlb to care about.. Perhaps somehow defer the reservation map > copying? > > > > > > > page_counter underflow: -1024 nr_pages=1024 > > > WARNING: CPU: 1 PID: 1086 at mm/page_counter.c:55 page_counter_cancel+0xd6/0x130 mm/page_counter.c:55 > > > Modules linked in: > > > CPU: 1 PID: 1086 Comm: syz-executor200 Not tainted 6.1.108-syzkaller-00078-g9ce77c16947b #0 > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 > > > Call Trace: > > > <TASK> > > > page_counter_uncharge+0x2e/0x70 mm/page_counter.c:158 > > > hugetlb_cgroup_uncharge_counter+0xd2/0x420 mm/hugetlb_cgroup.c:430 > > > hugetlb_vm_op_close+0x435/0x700 mm/hugetlb.c:4886 > > > remove_vma+0x84/0x130 mm/mmap.c:140 > > > exit_mmap+0x32f/0x7a0 mm/mmap.c:3249 > > > __mmput+0x11e/0x430 kernel/fork.c:1199 > > > mmput+0x61/0x70 kernel/fork.c:1221 > > > exit_mm kernel/exit.c:565 [inline] > > > do_exit+0xa4a/0x2790 kernel/exit.c:858 > > > do_group_exit+0xd0/0x2a0 kernel/exit.c:1021 > > > __do_sys_exit_group kernel/exit.c:1032 [inline] > > > __se_sys_exit_group kernel/exit.c:1030 [inline] > > > __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:1030 > > > do_syscall_x64 arch/x86/entry/common.c:51 [inline] > > > do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 > > > entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > > > </TASK> > > > > > > > > Since there is no sense in vm accounting for a bad copy of vma, set vm_start > > > to be equal vm_end and vm_pgoff to be equal 0. Previously, a similar issue > > > has been fixed in __split_vma() in the same way [1]. > > > > > > [1]: https://lore.kernel.org/all/20220719201523.3561958-1-Liam.Howlett@xxxxxxxxxx/T/ > > > > Understood that we do this elsewhere, I think equally we should not do this > > there either! :) > > > > > > > > Found by Linux Verification Center (linuxtesting.org) with Syzkaller. > > > > > > Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree") > > > Cc: stable@xxxxxxxxxxxxxxx > > > Signed-off-by: Daniil Dulov <d.dulov@xxxxxxxxxx> > > > --- > > > mm/vma.c | 3 +++ > > > 1 file changed, 3 insertions(+) > > > > > > diff --git a/mm/vma.c b/mm/vma.c > > > index bb2119e5a0d0..dbc68b7cd0ec 100644 > > > --- a/mm/vma.c > > > +++ b/mm/vma.c > > > @@ -1772,6 +1772,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, > > > return new_vma; > > > > > > out_vma_link: > > > + /* Avoid vm accounting in close() operation */ > > > + new_vma->vm_start = new_vma->vm_end; > > > + new_vma->vm_pgoff = 0; > > > vma_close(new_vma); > > > > > > if (new_vma->vm_file) > > > -- > > > 2.34.1 > > > -- Thanks, Mina