On Sun, Aug 25, 2024 at 01:06:40PM GMT, zhiguojiang wrote:
>
>
> On 2024/8/25 0:26, Lorenzo Stoakes wrote:
> >
> > On Fri, Aug 23, 2024 at 11:02:06PM GMT, Zhiguo Jiang wrote:
> > > After CoWed by do_wp_page, the vma established a new mapping relationship
> > > with the CoWed folio instead of the non-CoWed folio. However, regarding
> > > the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
> > > not same, the avc binding relationship between them will no longer be
> > > needed, so it is issue for the avc binding relationship still existing
> > > between them.
> > >
> > > This patch will remove the avc binding relationship between vma and the
> > > non-CoWed folio's anon_vma, which each has their own independent
> > > anon_vma. It can also alleviates rmap overhead simultaneously.
> > >
> > > Signed-off-by: Zhiguo Jiang <justinjiang@xxxxxxxx>
> >
> > NACK (until fixed). This is broken (see below).
> >
> Hi Lorenzo Stoakes,
>
> Thank you for your comments.
> > I'm not seeing any numbers to back anything up here as to why we want to
> > make changes to this incredibly sensitive code?
> I added a debug trace log (as follows) in wp_page_copy() and observed
> that a large number of these orphan avc-objects still exist. I believe
> this will have a certain redundant overhead impact on anonymous folios'
> rmap avcs, so I want to remove it, which is also the most essential
> value of this patch.

Sorry, NACK to that idea unless you can provide actual _data_ to
demonstrate an overhead.

And even if you did, given the original patch was so completely broken, and
in such a sensitive area, I'm going to need to be VERY confident you didn't
break anything, so we're going to need tests.

>
> -- the vital part of debug trace patch:

Thanks for providing!
Will snip for sake of making it easier to reply.

> >
> > Also anon_vma logic is very complicated and confusing, this commit message
> > feels about 3 paragraphs too light.
> >
> > Under what circumstances will vma->anon_vma be different from
> > folio_anon_vma(non_cowed_folio)? etc.
> In anon_vma_fork() --> anon_vma_clone(), child vma is bound with parent
> vma's anon_vma firstly.
>         /*
>          * First, attach the new VMA to the parent VMA's anon_vmas,
>          * so rmap can find non-COWed pages in child processes.
>          */
>         error = anon_vma_clone(vma, pvma);
>
> When child vma->anon_vma is NULL in anon_vma_fork(),
>         /* An existing anon_vma has been reused, all done then. */
>         if (vma->anon_vma)
>                 return 0;
>
>         /* Then add our own anon_vma. */
>         anon_vma = anon_vma_alloc();
>
> new anon_vma will be alloced and filled in this child vma->anon_vma.
> Then during CoWed in do_wp_page() --> wp_page_copy(), this child vma's
> new anon_vma will be different from folio_anon_vma(non_cowed_folio).

Thanks for the explanation, but I was suggesting you put this in the
commit message rather than in a reply to me :)

> > Confusing topics strongly require explanations that help (somewhat)
> > compensate. This is one of them.
> >
> > > ---
> > >
> > > -v2:
> > >  * Solve the kernel test robot noticed "WARNING"
> > >    Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > >    Closes: https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@xxxxxxxxx
> > It doesn't.
> >
> > Saw a bunch of warning output in dmesg when running in qemu, bisected it to
> > this commit. The below assert is being fired (did you build this kernel
> > with CONFIG_DEBUG_VM?):
> >
> >         VM_WARN_ON(anon_vma->num_children);
> >
> > From what I saw, these appear to all be cases where anon_vma->num_children == 0...
> >
> >
> > [    1.905603] ------------[ cut here ]------------
> > [    1.905604] WARNING: CPU: 2 PID: 231 at mm/rmap.c:443 unlink_anon_vmas+0x181/0x1c0
> > [    1.905605] Modules linked in:
> > [    1.905605] CPU: 2 UID: 1000 PID: 231 Comm: zsh Tainted: G W 6.11.0-rc4+ #49
> > [    1.905606] Tainted: [W]=WARN
> > [    1.905606] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > [    1.905607] RIP: 0010:unlink_anon_vmas+0x181/0x1c0
> > [    1.905608] Code: 48 83 7f 40 00 75 1c f0 ff 4f 30 75 ab e8 d7 fd ff ff eb a4 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 90 0f 0b 90 eb de 90 <0f> 0b 90 eb d1 90 0f 0b 90 48 83 c7 08 e8 4d 7c ea ff e9 fc fe ff
> > [    1.905608] RSP: 0018:ffffc90000547cb0 EFLAGS: 00010286
> > [    1.905609] RAX: ffff88817b265390 RBX: ffff88817b265380 RCX: ffff88817b2cb790
> > [    1.905609] RDX: ffff88817b265380 RSI: ffff88817b2cb790 RDI: ffff888179e08888
> > [    1.905610] RBP: dead000000000122 R08: 000000000000000c R09: 0000000000000010
> > [    1.905610] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88817b2cb790
> > [    1.905611] R13: dead000000000100 R14: ffff88817b2cb780 R15: ffff888179e08888
> > 00000000000
> > [    1.905613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [    1.905613] CR2: 0000555bc5d97390 CR3: 000000017c12c000 CR4: 0000000000750ef0
> > [    1.905614] PKRU: 55555554
> > [    1.905614] Call Trace:
> > [    1.905614]  <TASK>
> > [    1.905615]  ? unlink_anon_vmas+0x181/0x1c0
> > [    1.905615]  ? __warn.cold+0x8e/0xe8
> > [    1.905616]  ? unlink_anon_vmas+0x181/0x1c0
> > [    1.905617]  ? report_bug+0xff/0x140
> > [    1.905618]  ? handle_bug+0x3b/0x70
> > [    1.905619]  ? exc_invalid_op+0x17/0x70
> > [    1.905620]  ? asm_exc_invalid_op+0x1a/0x20
> > [    1.905621]  ? unlink_anon_vmas+0x181/0x1c0
> > [    1.905622]  free_pgtables+0x11f/0x250
> > [    1.905622]  exit_mmap+0x15e/0x380
> > [    1.905624]  mmput+0x54/0x110
> > [    1.905625]  do_exit+0x27e/0xa10
> > [    1.905626]  ? __x64_sys_close+0x37/0x80
> > [    1.905626]  do_group_exit+0x2b/0x80
> > [    1.905628]  __x64_sys_exit_group+0x13/0x20
> > [    1.905629]  x64_sys_call+0x14af/0x14b0
> > [    1.905630]  do_syscall_64+0x9e/0x1a0
> > [    1.905630]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > [    1.905631] RIP: 0033:0x7f4416ae33ad
> > [    1.905632] Code: Unable to access opcode bytes at 0x7f4416ae3383.
> > 000e7
> > [    1.905633] RAX: ffffffffffffffda RBX: 00007f4416d5e3c0 RCX: 00007f4416ae33ad
> > [    1.905633] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000000
> > [    1.905633] RBP: 0000555b8eed1378 R08: 0000000000000000 R09: 0000000000000007
> > [    1.905634] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
> > [    1.905634] R13: 0000000000000000 R14: 00007ffe7dbe9190 R15: 00007ffe7dbe9110
> > [    1.905635]  </TASK>
> > [    1.905635] ---[ end trace 0000000000000000 ]---
> > [    1.905638] ------------[ cut here ]------------
> >
> >
> > >  * Update comments to more accurately describe this patch.
> > >
> > > -v1:
> > > https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@xxxxxxxx/
> > >
> > >  include/linux/rmap.h |  1 +
> > >  mm/memory.c          |  8 +++++++
> > >  mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
> > >  3 files changed, 62 insertions(+)
> > >
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 91b5935e8485..8607d28a3146
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
> > >  	folio_remove_rmap_ptes(folio, page, 1, vma)
> > >  void folio_remove_rmap_pmd(struct folio *, struct page *,
> > >  		struct vm_area_struct *);
> > > +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
> > >
> > >  void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
> > >  		unsigned long address, rmap_t flags);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 93c0c25433d0..4c89cb1cb73e
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> > >  			 * old page will be flushed before it can be reused.
> > >  			 */
> > >  			folio_remove_rmap_pte(old_folio, vmf->page, vma);
> > > +
> > > +			/*
> > > +			 * If the new_folio's anon_vma is different from the
> > > +			 * old_folio's anon_vma, the avc binding relationship
> > > +			 * between vma and the old_folio's anon_vma is removed,
> > > +			 * avoiding rmap redundant overhead.
> > What overhead? Worth spelling out for instance if it's unnecessary to
> > traverse avc's.
> I think this will have a certain redundant overhead impact on anonymous
> folios rmap traverse avcs process.

This is again nowhere near detailed enough, and again I'm asking you to
write this _IN THE COMMENT_ not in review.

I already understand what you're trying to do (I think the fact I provided
a _working_ version of your patch as an attachment in this thread should
give a clue ;), this is for the benefit of people coming to read this code.

> >
> > > +			 */
> > > +			folio_remove_anon_avc(old_folio, vma);
> > >  		}
> > >
> > >  		/* Free the old page.. */
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 1103a536e474..56fc16fcf2a9
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1522,6 +1522,59 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
> > >  #endif
> > >  }
> > >
> > > +/**
> > > + * folio_remove_anon_avc - remove the avc binding relationship between
> > > + * folio and vma with different anon_vmas.
> > > + * @folio: The folio with anon_vma to remove the binded avc from
> > > + * @vma: The vm area to remove the binded avc with folio's anon_vma
> > > + *
> > > + * The caller is currently used for CoWed scene.
> > Strange turn of phrase,
> >
> > > + */
> > > +void folio_remove_anon_avc(struct folio *folio,
> > I think this should be 'oldfolio'. You're not looking at the copied folio,
> > but the unCoW'd original folio.
> Yes, thanks.
> >
> > > +			struct vm_area_struct *vma)
> > > +{
> > > +	struct anon_vma *anon_vma = folio_anon_vma(folio);
> > > +	pgoff_t pgoff_start, pgoff_end;
> > > +	struct anon_vma_chain *avc;
> > > +
> > > +	/*
> > > +	 * Ensure that the vma's anon_vma and the folio's
> > > +	 * anon_vma exist and are not same.
> > > +	 */
> > > +	if (!folio_test_anon(folio) || unlikely(!anon_vma) ||
> > The folio_test_anon() is already implied by folio_anon_vma() != NULL and
> > doesn't preclude KSM.
> >
> > > +	    anon_vma == vma->anon_vma)
> > > +		return;
> > This is all super confusing, the 'parent' is actually anon_vma
> > (oldfolio). The newly created 'child' anon_vma is vma->anon_vma. Should
> > probably rename each accordingly.
> OK.
> >
> >
> > > +
> > > +	pgoff_start = folio_pgoff(folio);
> > > +	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
> > > +
> > > +	if (!anon_vma_trylock_write(anon_vma))
> > > +		return;
> > > +
> > > +	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
> > > +			pgoff_start, pgoff_end) {
> > > +		/*
> > > +		 * Find the avc associated with vma from the folio's
> > > +		 * anon_vma and remove it.
> > > +		 */
> > This is a meaningless comment.
> >
> > This should be something like 'anon_vma_chain objects bind VMAs and
> > anon_vma's. Find the avc binding the unCoW'd folio's anon_vma to the new
> > VMA, and remove it, as it is now redundant.'
> >
> > > +		if (avc->vma == vma) {
> > In testing I found that a lot of the time this isn't found at all... is
> > that expected?
> >
> > > +			anon_vma_interval_tree_remove(avc, &anon_vma->rb_root);
> > > +			/*
> > > +			 * When removing the avc with anon_vma that is
> > > +			 * different from the parent anon_vma from parent
> > > +			 * anon_vma->rb_root, the parent num_children
> > > +			 * count value is needed to reduce one.
> > > +			 */
> > This is a really confusing comment. You're not explaining the 'why' you're
> > just essentially asserting that you need to do this, and clearly this is
> > broken.
> >
> > > +			anon_vma->num_children--;
> > So we know this is broken to start due to VM_WARN_ON() failures.
> >
> > As per above dmesg analysis, sometimes this is zero, so you're
> > underflowing. We definitely need a:
> >
> >         VM_WARN_ON(anon_vma->num_children == 0);
> >
> > At least.
> >
> > But also the naming is broken here too, anon_vma is actually the parent
> > (oldfolio's) anon_vma...
> >
> >
> > This is also just not correct on any level - the anon_vma->num_children
> > field indicates how many child anon_vma objects point at it via
> > anon_vma->parent, NOT avc.
> >
> > You're removing an avc, not disconnecting an anon_vma.
> >
> > So it seems to me you should have logic to remove the avc AND logic to
> > disconnect vma->anon_vma from (parent) anon_vma if it points to it.
> >
> > You'll need to be careful about locking when you do that too, as anon_vma's
> > lock on the root anon_vma, but in isolating the child anon_vma you'd lose
> > this lock.
> >
> > I've tried to write code to fix this but haven't been able to yet, this is
> > fiddly stuff.
> >
> > (I think this might have seemed to work at some point in testing because
> > unlink_anon_vmas() uses the avc list to determine what to unlink, rather
> > than looking at individual anon_vma's but still).
> >
> > > +
> > > +			list_del(&avc->same_vma);
> > > +			anon_vma_chain_free(avc);
> > > +			break;
> > > +		}
> > > +	}
> > > +	anon_vma_unlock_write(anon_vma);
> > > +}
> > > +
> > >  static __always_inline void __folio_remove_rmap(struct folio *folio,
> > >  		struct page *page, int nr_pages, struct vm_area_struct *vma,
> > >  		enum rmap_level level)
> > > --
> > > 2.39.0
> > >
> > Again I question the value of this change. Are we REALLY seeing a big
> > problem due to unneeded avc's hanging around? This is very sensitive,
> > fiddly, confusing code, do we REALLY want to be playing with it?
> Thank you for helping to identify many issues with this patch. However,
> I think this will have a certain benefits for anonymous folio rmap
> traverse avc overhead.
> >
> > It'd be good to get some tests though unless you move this to vma.c with
> > its userland testing (probably a good idea actually as Andrew suggested)
> > this might be tricky.
> This patch belongs to anon_vma rmap's content, and it seems more
> appropriate in mm/rmap.c?
> >
> > NACK until the issues are fixed and the approach at least seems more
> > correct.
> Thanks
> Zhiguo
>

Please see the attachment in thread for an example of a working version of
this; the patch as posted is sadly fundamentally broken.

But you're going to really need to sell this a lot better, provide some
numbers, and provide extensive testing and a much, much better test for
this to stand any chance.

I appreciate what you're trying to do here, and it's not totally crazy, but
we have to be so, so careful around this code. anon_vma code is
horrendously subtle and confusing (I actually had to reference my
unpublished book to remind myself how this stuff works :)), so we have to
tread very carefully.

I definitely think we need ASCII diagrams if we were to go ahead with a new
version of this. But then again I'm a bit of a fan of ASCII diagrams...

Please cc me on future revisions of this series, thanks :)
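
P.S. For anyone reading along, the num_children point raised in this
review can be sketched in userland. This is a toy model under stated
assumptions, not kernel code: the toy_* types and helpers and the num_avcs
counter are all made up for illustration, and the real locking, interval
trees and root anon_vma accounting are left out entirely. It only tries to
show that dropping an avc and detaching a child anon_vma are separate
operations, that only the latter should touch num_children, and that the
decrement wants a guard so the unsigned counter cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_anon_vma {
	struct toy_anon_vma *parent;	/* models anon_vma->parent */
	unsigned long num_children;	/* child anon_vmas whose ->parent is us */
	unsigned long num_avcs;		/* hypothetical: avcs added by toy_fork() */
};

struct toy_vma {
	struct toy_anon_vma *anon_vma;	/* models vma->anon_vma */
};

/*
 * Model fork: the child VMA is first bound to the parent's anon_vma via
 * an avc, then gets its own anon_vma whose ->parent is the parent's.
 */
static void toy_fork(struct toy_anon_vma *parent_av,
		     struct toy_anon_vma *child_av,
		     struct toy_vma *child_vma)
{
	assert(parent_av && child_av && child_vma);

	parent_av->num_avcs++;		/* avc: child VMA <-> parent anon_vma */

	child_av->parent = parent_av;	/* child anon_vma hangs off the parent */
	child_av->num_children = 0;
	child_av->num_avcs = 1;		/* avc: child VMA <-> its own anon_vma */
	parent_av->num_children++;

	child_vma->anon_vma = child_av;
}

/* Dropping an avc touches only avc bookkeeping, never num_children. */
static bool toy_remove_avc(struct toy_anon_vma *av)
{
	if (av->num_avcs == 0)
		return false;
	av->num_avcs--;
	return true;
}

/* Detaching a child anon_vma is the only place num_children drops. */
static bool toy_detach_child(struct toy_anon_vma *child_av)
{
	struct toy_anon_vma *parent_av = child_av->parent;

	/* Refuse rather than wrap the unsigned counter below zero. */
	if (!parent_av || parent_av->num_children == 0)
		return false;

	parent_av->num_children--;
	child_av->parent = NULL;
	return true;
}
```

In this model a fork leaves the parent with num_children == 1 and one
extra avc; toy_remove_avc() on the parent then leaves num_children alone,
and a second toy_detach_child() on the same child is refused instead of
underflowing.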