在 2020/2/21 8:22, Mike Kravetz 写道: > On 2/19/20 6:30 PM, Longpeng (Mike) wrote: >> 在 2020/2/20 3:33, Mike Kravetz 写道: >>> + Kirill >>> On 2/18/20 5:58 PM, Sean Christopherson wrote: >>>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote: > <snip> >>>> The race and the fix make sense. I assumed dereferencing garbage from the >>>> huge page was the issue, but I wasn't 100% that was the case, which is why >>>> I asked about alternative fixes. >>>> >>>>> We change the code from >>>>> if (pud_huge(*pud) || !pud_present(*pud)) >>>>> to >>>>> if (pud_huge(*pud) >>>>> return (pte_t *)pud; >>>>> busy loop for 500ms >>>>> if (!pud_present(*pud)) >>>>> return (pte_t *)pud; >>>>> and the panic will be hit quickly. >>>>> >>>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this >>>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables). >>>>> >>>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from >>>>> pud twice and the *pud maybe change in a race, so if we only read the pud once. >>>>> I use READ_ONCE here is just for safe, to prevents the complier mischief if >>>>> possible. >>>> >>>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g. >>>> convert everything as a follow-up patch (or patches). I'm fairly confident >>>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly >>>> bet my life on it. I'd much rather the failing scenario be that KVM uses >>>> a sub-optimal page size as opposed to exploding on a bad pointer. >>> >>> Longpeng(Mike) asked in another e-mail specifically about making similar >>> changes to lookup_address_in_mm(). Replying here as there is more context. >>> >>> I 'think' lookup_address_in_mm is safe from this issue. Why? IIUC, the >>> problem with the huge_pte_offset routine is that the pud changes from >>> pud_none() to pud_huge() in the middle of >>> 'if (pud_huge(*pud) || !pud_present(*pud))'. In the case of >>> lookup_address_in_mm, we know pud was not pud_none() as it was previously >>> checked. I am not aware of any other state transitions which could cause >>> us trouble. However, I am no expert in this area. > > Bad copy/paste by me. Longpeng(Mike) was asking about lookup_address_in_pgd. > >> So... I need just fix huge_pte_offset in mm/hugetlb.c, right? > > Let's start with just a fix for huge_pte_offset() as you can easily reproduce > that issue by adding a delay. > >> Is it possible the pud changes from pud_huge() to pud_none() while another CPU >> is walking the pagetable ? > All right, I'll send V2 to fix it, thanks :) > I believe it is possible. If we hole punch a hugetlbfs file, we will clear > the corresponding pud's. Hence, we can go from pud_huge() to pud_none(). > Unless I am missing something, that does imply we could have issues in places > such as lookup_address_in_pgd: > > pud = pud_offset(p4d, address); > if (pud_none(*pud)) > return NULL; > > *level = PG_LEVEL_1G; > if (pud_large(*pud) || !pud_present(*pud)) > return (pte_t *)pud; > > I hope I am wrong, but it seems like pud_none(*pud) could become true after > the initial check, and before the (pud_large) check. If so, there could be > a problem (addressing exception) when the code continues and looks up the pmd. > > pmd = pmd_offset(pud, address); > if (pmd_none(*pmd)) > return NULL; > > It has been mentioned before that there are many page table walks like this. > What am I missing that prevents races like this? Or, have we just been lucky? > That's what I worry about. Maybe there is no usecase to hit it. -- Regards, Longpeng(Mike)