On 8/26/24 05:56, zhiguojiang wrote: > > > 在 2024/8/26 2:13, Mika Penttilä 写道: >> [你通常不会收到来自 mpenttil@xxxxxxxxxx 的电子邮件。请访问 >> https://aka.ms/LearnAboutSenderIdentification,以了解这一点为什么很重要;] >> >> Hi, >> >> On 8/25/24 09:39, Lorenzo Stoakes wrote: >> >>> On Sun, Aug 25, 2024 at 01:06:40PM GMT, zhiguojiang wrote: >>>> 在 2024/8/25 0:26, Lorenzo Stoakes 写道: >>>>> [Some people who received this message don't often get email from >>>>> lorenzo.stoakes@xxxxxxxxxx. Learn why this is important at >>>>> https://aka.ms/LearnAboutSenderIdentification ] >>>>> >>>>> On Fri, Aug 23, 2024 at 11:02:06PM GMT, Zhiguo Jiang wrote: >>>>>> After CoWed by do_wp_page, the vma established a new mapping >>>>>> relationship >>>>>> with the CoWed folio instead of the non-CoWed folio. However, >>>>>> regarding >>>>>> the situation where vma->anon_vma and the non-CoWed folio's >>>>>> anon_vma are >>>>>> not same, the avc binding relationship between them will no >>>>>> longer be >>>>>> needed, so it is issue for the avc binding relationship still >>>>>> existing >>>>>> between them. >>>>>> >>>>>> This patch will remove the avc binding relationship between vma >>>>>> and the >>>>>> non-CoWed folio's anon_vma, which each has their own independent >>>>>> anon_vma. It can also alleviates rmap overhead simultaneously. >>>>>> >>>>>> Signed-off-by: Zhiguo Jiang <justinjiang@xxxxxxxx> >>>>> NACK (until fixed). This is broken (see below). >>>>> >>>> Hi Lorenzo Stoakes, >>>> >>>> Thank you for your comments. >>>>> I'm not seeing any numbers to back anything up here as to why we >>>>> want to >>>>> make changes to this incredibly sensitive code? >>>> I added a debug trace log (as follows) in wp_page_copy() and observed >>>> that a large number of these orphan avc-objects still exist. I believe >>>> this will have a certain redundant overhead impact on anonymous >>>> folios' >>>> rmap avcs, so I want to remove it, which is also the most essential >>>> value of this patch. >>> Sorry nack to that idea unless you can provide actual _data_ to >>> demonstrate >>> an overhead. >>> >>> And even if you did, given the original patch was so completely >>> broken, and >>> in such a sensitive area, I'm going to need to be VERY confident you >>> didn't >>> break anything, so we're going to need tests. >>> >>>> -- the vital part of debug trace patch: >>> Thanks for providing! Will snip for sake of making it easier to reply. >>> >>>>> Also anon_vma logic is very complicated and confusing, this commit >>>>> message >>>>> feels about 3 paragraphs too light. >>>>> >>>>> Under what circumstances will vma->anon_vma be different from >>>>> folio_anon_vma(non_cowed_folio)? etc. >>>> In anon_vma_fork() --> anon_vma_clone(), child vma is bound with >>>> parent >>>> vma's anon_vma firstly. >>>> /* >>>> * First, attach the new VMA to the parent VMA's anon_vmas, >>>> * so rmap can find non-COWed pages in child processes. >>>> */ >>>> error = anon_vma_clone(vma, pvma); >>>> >>>> When child vma->anon_vma is NULL in anon_vma_fork(), >>>> /* An existing anon_vma has been reused, all done then. */ >>>> if (vma->anon_vma) >>>> return 0; >>>> >>>> /* Then add our own anon_vma. */ >>>> anon_vma = anon_vma_alloc(); >>>> >>>> new anon_vma will be alloced and filled in this child vma->anon_vma. >>>> Then during CoWed in do_wp_page() --> wp_page_copy(), this child vma's >>>> new anon_vma will be different from folio_anon_vma(non_cowed_folio). >>> Thanks for the explanation, but I was suggesting you have to put >>> this in >>> the commit message rather than in repy to me :) >> >> You still have to locate the mappings of the other, non cow'ed folios in >> the child's vma *by* the parent's anon_vma, >> so don't think you can delete, on cow, the avc, which presents the >> whole vma? > Hi Mika, > > This patch is designed to remove the avc binding relationship that child > vma anon_vma different from parent vma anon_vma. You mentioned that "non > cow'ed folios in the child's vma *by* the parent's anon_vma" belongs to > another different situation. Of course, if it is possible to remove avc > from this situation, the effect should be better, but the implementation > is more complex. > As discussed off-list, you may not remove the avc in this situation here. It would make impossible for rmap to find references to other non cowed folios in the vma mapped by the child. > Thanks > Zhiguo >> But I could be missing something, it has been long since I went thru >> this code last time... >> >>>>> Confusing topics strongly require explanations that help (somewhat) >>>>> compensate. This is one of them. >>>>> >>>>>> --- >>>>>> >>>>>> -v2: >>>>>> * Solve the kernel test robot noticed "WARNING" >>>>>> Reported-by: kernel test robot <oliver.sang@xxxxxxxxx> >>>>>> Closes: >>>>>> https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@xxxxxxxxx >>>>> It doesn't. >>>>> >>>>> Saw a bunch of warning output in dmesg when running in qemu, >>>>> bisected it to >>>>> this commit. The below assert is being fired (did you build this >>>>> kernel >>>>> with CONFIG_DEBUG_VM?): >>>>> >>>>> VM_WARN_ON(anon_vma->num_children); >>>>> >>>>> From what I saw, these appear to all be cases where >>>>> anon_vma->num_children == 0... >>>>> >>>>> >>>>> [ 1.905603] ------------[ cut here ]------------ >>>>> [ 1.905604] WARNING: CPU: 2 PID: 231 at mm/rmap.c:443 >>>>> unlink_anon_vmas+0x181/0x1c0 >>>>> [ 1.905605] Modules linked in: >>>>> [ 1.905605] CPU: 2 UID: 1000 PID: 231 Comm: zsh Tainted: >>>>> G W 6.11.0-rc4+ #49 >>>>> [ 1.905606] Tainted: [W]=WARN >>>>> [ 1.905606] Hardware name: QEMU Standard PC (i440FX + PIIX, >>>>> 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 >>>>> [ 1.905607] RIP: 0010:unlink_anon_vmas+0x181/0x1c0 >>>>> [ 1.905608] Code: 48 83 7f 40 00 75 1c f0 ff 4f 30 75 ab e8 d7 >>>>> fd ff ff eb a4 5b 5d 41 5c 41 5d 41 5e 41 5f c3 cc cc cc cc 90 0f >>>>> 0b 90 eb de 90 <0f> 0b 90 eb d1 90 0f 0b 90 48 83 c7 08 e8 4d 7c >>>>> ea ff e9 fc fe ff >>>>> [ 1.905608] RSP: 0018:ffffc90000547cb0 EFLAGS: 00010286 >>>>> [ 1.905609] RAX: ffff88817b265390 RBX: ffff88817b265380 RCX: >>>>> ffff88817b2cb790 >>>>> [ 1.905609] RDX: ffff88817b265380 RSI: ffff88817b2cb790 RDI: >>>>> ffff888179e08888 >>>>> [ 1.905610] RBP: dead000000000122 R08: 000000000000000c R09: >>>>> 0000000000000010 >>>>> [ 1.905610] R10: 0000000000000001 R11: 0000000000000000 R12: >>>>> ffff88817b2cb790 >>>>> [ 1.905611] R13: dead000000000100 R14: ffff88817b2cb780 R15: >>>>> ffff888179e08888 >>>>> 00000000000 >>>>> [ 1.905613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 1.905613] CR2: 0000555bc5d97390 CR3: 000000017c12c000 CR4: >>>>> 0000000000750ef0 >>>>> [ 1.905614] PKRU: 55555554 >>>>> [ 1.905614] Call Trace: >>>>> [ 1.905614] <TASK> >>>>> [ 1.905615] ? unlink_anon_vmas+0x181/0x1c0 >>>>> [ 1.905615] ? __warn.cold+0x8e/0xe8 >>>>> [ 1.905616] ? unlink_anon_vmas+0x181/0x1c0 >>>>> [ 1.905617] ? report_bug+0xff/0x140 >>>>> [ 1.905618] ? handle_bug+0x3b/0x70 >>>>> [ 1.905619] ? exc_invalid_op+0x17/0x70 >>>>> [ 1.905620] ? asm_exc_invalid_op+0x1a/0x20 >>>>> [ 1.905621] ? unlink_anon_vmas+0x181/0x1c0 >>>>> [ 1.905622] free_pgtables+0x11f/0x250 >>>>> [ 1.905622] exit_mmap+0x15e/0x380 >>>>> [ 1.905624] mmput+0x54/0x110 >>>>> [ 1.905625] do_exit+0x27e/0xa10 >>>>> [ 1.905626] ? __x64_sys_close+0x37/0x80 >>>>> [ 1.905626] do_group_exit+0x2b/0x80 >>>>> [ 1.905628] __x64_sys_exit_group+0x13/0x20 >>>>> [ 1.905629] x64_sys_call+0x14af/0x14b0 >>>>> [ 1.905630] do_syscall_64+0x9e/0x1a0 >>>>> [ 1.905630] entry_SYSCALL_64_after_hwframe+0x77/0x7f >>>>> [ 1.905631] RIP: 0033:0x7f4416ae33ad >>>>> [ 1.905632] Code: Unable to access opcode bytes at 0x7f4416ae3383. >>>>> 000e7 >>>>> [ 1.905633] RAX: ffffffffffffffda RBX: 00007f4416d5e3c0 RCX: >>>>> 00007f4416ae33ad >>>>> [ 1.905633] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: >>>>> 0000000000000000 >>>>> [ 1.905633] RBP: 0000555b8eed1378 R08: 0000000000000000 R09: >>>>> 0000000000000007 >>>>> [ 1.905634] R10: 0000000000000000 R11: 0000000000000246 R12: >>>>> 0000000000000001 >>>>> [ 1.905634] R13: 0000000000000000 R14: 00007ffe7dbe9190 R15: >>>>> 00007ffe7dbe9110 >>>>> [ 1.905635] </TASK> >>>>> [ 1.905635] ---[ end trace 0000000000000000 ]--- >>>>> [ 1.905638] ------------[ cut here ]------------ >>>>> >>>>> >>>>>> * Update comments to more accurately describe this patch. >>>>>> >>>>>> -v1: >>>>>> >>>>>> https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@xxxxxxxx/ >>>>>> >>>>>> include/linux/rmap.h | 1 + >>>>>> mm/memory.c | 8 +++++++ >>>>>> mm/rmap.c | 53 >>>>>> ++++++++++++++++++++++++++++++++++++++++++++ >>>>>> 3 files changed, 62 insertions(+) >>>>>> >>>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h >>>>>> index 91b5935e8485..8607d28a3146 >>>>>> --- a/include/linux/rmap.h >>>>>> +++ b/include/linux/rmap.h >>>>>> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, >>>>>> struct page *, int nr_pages, >>>>>> folio_remove_rmap_ptes(folio, page, 1, vma) >>>>>> void folio_remove_rmap_pmd(struct folio *, struct page *, >>>>>> struct vm_area_struct *); >>>>>> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct >>>>>> *); >>>>>> >>>>>> void hugetlb_add_anon_rmap(struct folio *, struct >>>>>> vm_area_struct *, >>>>>> unsigned long address, rmap_t flags); >>>>>> diff --git a/mm/memory.c b/mm/memory.c >>>>>> index 93c0c25433d0..4c89cb1cb73e >>>>>> --- a/mm/memory.c >>>>>> +++ b/mm/memory.c >>>>>> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct >>>>>> vm_fault *vmf) >>>>>> * old page will be flushed before it can >>>>>> be reused. >>>>>> */ >>>>>> folio_remove_rmap_pte(old_folio, >>>>>> vmf->page, vma); >>>>>> + >>>>>> + /* >>>>>> + * If the new_folio's anon_vma is different >>>>>> from the >>>>>> + * old_folio's anon_vma, the avc binding >>>>>> relationship >>>>>> + * between vma and the old_folio's anon_vma >>>>>> is removed, >>>>>> + * avoiding rmap redundant overhead. >>>>> What overhead? Worth spelling out for instance if it's unnecessary to >>>>> traverse avc's. >>>> I think this will have a certain redundant overhead impact on >>>> anonymous >>>> folios rmap traverse avcs process. >>> This is again nowhere near detailed enough, and again I'm asking you to >>> write this _IN THE COMMENT_ not in review. >>> >>> I already understand what you're trying to do (I think the fact I >>> provided >>> a _working_ version of your patch as an attachment in this thread >>> should >>> give a clue ;), this is for the benefit of people coming to read >>> this code. >>> >>>>>> + */ >>>>>> + folio_remove_anon_avc(old_folio, vma); >>>>>> } >>>>>> >>>>>> /* Free the old page.. */ >>>>>> diff --git a/mm/rmap.c b/mm/rmap.c >>>>>> index 1103a536e474..56fc16fcf2a9 >>>>>> --- a/mm/rmap.c >>>>>> +++ b/mm/rmap.c >>>>>> @@ -1522,6 +1522,59 @@ void folio_add_file_rmap_pmd(struct folio >>>>>> *folio, struct page *page, >>>>>> #endif >>>>>> } >>>>>> >>>>>> +/** >>>>>> + * folio_remove_anon_avc - remove the avc binding relationship >>>>>> between >>>>>> + * folio and vma with different anon_vmas. >>>>>> + * @folio: The folio with anon_vma to remove the binded avc from >>>>>> + * @vma: The vm area to remove the binded avc with folio's >>>>>> anon_vma >>>>>> + * >>>>>> + * The caller is currently used for CoWed scene. >>>>> Strange turn of phrase, >>>>> >>>>>> + */ >>>>>> +void folio_remove_anon_avc(struct folio *folio, >>>>> I think this should be 'oldfolio'. You're not looking at the >>>>> copied folio, >>>>> but the unCoW'd original folio. >>>> Yes, thanks. >>>>>> + struct vm_area_struct *vma) >>>>>> +{ >>>>>> + struct anon_vma *anon_vma = folio_anon_vma(folio); >>>>>> + pgoff_t pgoff_start, pgoff_end; >>>>>> + struct anon_vma_chain *avc; >>>>>> + >>>>>> + /* >>>>>> + * Ensure that the vma's anon_vma and the folio's >>>>>> + * anon_vma exist and are not same. >>>>>> + */ >>>>>> + if (!folio_test_anon(folio) || unlikely(!anon_vma) || >>>>> The folio_test_anon() is already implied by folio_anon_vma() != >>>>> NULL and >>>>> doesn't preclude KSM. >>>>> >>>>>> + anon_vma == vma->anon_vma) >>>>>> + return; >>>>> This is all super confusing, the 'parent' is actually anon_vma >>>>> (oldfolio). The newly created 'child' anon_vma is vma->anon_vma. >>>>> Should >>>>> probably rename each accordingly. >>>> OK. >>>>>> + >>>>>> + pgoff_start = folio_pgoff(folio); >>>>>> + pgoff_end = pgoff_start + folio_nr_pages(folio) - 1; >>>>>> + >>>>>> + if (!anon_vma_trylock_write(anon_vma)) >>>>>> + return; >>>>>> + >>>>>> + anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, >>>>>> + pgoff_start, pgoff_end) { >>>>>> + /* >>>>>> + * Find the avc associated with vma from the folio's >>>>>> + * anon_vma and remove it. >>>>>> + */ >>>>> This is a meaningless comment. >>>>> >>>>> This should be something like 'anon_vma_chain objects bind VMAs and >>>>> anon_vma's. Find the avc binding the unCoW'd folio's anon_vma to >>>>> the new >>>>> VMA, and remove it, as it is now redundant.' >>>>> >>>>>> + if (avc->vma == vma) { >>>>> In testing I found that a lot of the time this isn't found at >>>>> all... is >>>>> that expected? >>>>> >>>>>> + anon_vma_interval_tree_remove(avc, >>>>>> &anon_vma->rb_root); >>>>>> + /* >>>>>> + * When removing the avc with anon_vma that is >>>>>> + * different from the parent anon_vma from >>>>>> parent >>>>>> + * anon_vma->rb_root, the parent num_children >>>>>> + * count value is needed to reduce one. >>>>>> + */ >>>>> This is a really confusing comment. You're not explaining the >>>>> 'why' you're >>>>> just essentially asserting that you need to do this, and clearly >>>>> this is >>>>> broken. >>>>> >>>>>> + anon_vma->num_children--; >>>>> So we know this is broken to start due to VM_WARN_ON() failures. >>>>> >>>>> As per above dmesg analysis, sometimes this is zero, so you're >>>>> underflowing. We definitely need a: >>>>> >>>>> VM_WARN_ON(anon_vma->num_children == 0); >>>>> >>>>> At least. >>>>> >>>>> But also the naming is broken here too, anon_vma is actually the >>>>> parent >>>>> (oldfolio's) anon_vma... >>>>> >>>>> >>>>> This is also just not correct on any level - the >>>>> anon_vma->num_children >>>>> field indicates how many child anon_vma objects point at it via >>>>> anon_vma->parent, NOT avc. >>>>> >>>>> You're removing an avc, not disconnecting an anon_vma. >>>>> >>>>> So it seems to me you should have logic to remove the avc AND >>>>> logic to >>>>> disconnect vma->anon_vma from (parent) anon_vma if it points to it. >>>>> >>>>> You'll need to be careful about locking when you do that too, as >>>>> anon_vma's >>>>> lock on the root anon_vma, but in isolating the child anon_vma >>>>> you'd lose >>>>> this lock. >>>>> >>>>> I've tried to write code to fix this but haven't been able to yet, >>>>> this is >>>>> fiddly stuff. >>>>> >>>>> (I think this might have seemed to work at some point in testing >>>>> because >>>>> unlink_anon_vmas() uses the avc list to determine what to unlink, >>>>> rather >>>>> than looking at individual anon_vma's but still). >>>>> >>>>>> + >>>>>> + list_del(&avc->same_vma); >>>>>> + anon_vma_chain_free(avc); >>>>>> + break; >>>>>> + } >>>>>> + } >>>>>> + anon_vma_unlock_write(anon_vma); >>>>>> +} >>>>>> + >>>>>> static __always_inline void __folio_remove_rmap(struct folio >>>>>> *folio, >>>>>> struct page *page, int nr_pages, struct >>>>>> vm_area_struct *vma, >>>>>> enum rmap_level level) >>>>>> -- >>>>>> 2.39.0 >>>>>> >>>>> Again I question the value of this change. Are we REALLY seeing a big >>>>> problem due to unneeded avc's hanging around? This is very sensitive, >>>>> fiddly, confusing code, do we REALLY want to be playing with it? >>>> Thank you for helping to identify mang issues with this patch. >>>> However, >>>> I think this will have a certain benefits for anonymous folio rmap >>>> traverse avc overhead. >>>>> It'd be good to get some tests though unless you move this to >>>>> vma.c with >>>>> its userland testing (probably a good idea actually as Andrew >>>>> suggested) >>>>> this might be tricky. >>>> This patch belongs to anon_vma rmap's content, and it seems more >>>> appropriate in mm/rmap.c? >>>>> NACK until the issues are fixed and the approach at least seems more >>>>> correct. >>>> Thanks >>>> Zhiguo >>>> >>> Please see the attachment in thread for an example of a working >>> version of >>> this, this is sadly fundamentally broken. >>> >>> But you're going to really need to sell this a lot better, provide some >>> numbers, and provide extensive testing and a much, much better test for >>> this to stand any chance. >>> >>> I appreciate what you're trying to do here, and it's not totally >>> crazy, but >>> we have to be so, so careful around this code. >>> >>> anon_vma code is horrendously subtle and confusing (I actually had to >>> reference my unpublished book to remind myself how this stuff works >>> :)), so >>> we have to tread very carefully. >>> >>> I definitely think we need ASCII diagrams if we were to go ahead >>> with a new >>> version of this. But then again I'm a bit of a fan of ASCII diagrams... >>> >>> Please cc- me on future revisions of this series, thanks :) >>> >> Thanks, >> >> Mika >> >> >