On 2/19/20 6:30 PM, Longpeng (Mike) wrote: > 在 2020/2/20 3:33, Mike Kravetz 写道: >> + Kirill >> On 2/18/20 5:58 PM, Sean Christopherson wrote: >>> On Wed, Feb 19, 2020 at 09:39:59AM +0800, Longpeng (Mike) wrote: <snip> >>> The race and the fix make sense. I assumed dereferencing garbage from the >>> huge page was the issue, but I wasn't 100% that was the case, which is why >>> I asked about alternative fixes. >>> >>>> We change the code from >>>> if (pud_huge(*pud) || !pud_present(*pud)) >>>> to >>>> if (pud_huge(*pud) >>>> return (pte_t *)pud; >>>> busy loop for 500ms >>>> if (!pud_present(*pud)) >>>> return (pte_t *)pud; >>>> and the panic will be hit quickly. >>>> >>>> ARM64 has already use READ/WRITE_ONCE to access the pagetable, look at this >>>> commit 20a004e7 (arm64: mm: Use READ_ONCE/WRITE_ONCE when accessing page tables). >>>> >>>> The root cause is: 'if (pud_huge(*pud) || !pud_present(*pud))' read entry from >>>> pud twice and the *pud maybe change in a race, so if we only read the pud once. >>>> I use READ_ONCE here is just for safe, to prevents the complier mischief if >>>> possible. >>> >>> FWIW, I'd be in favor of going the READ/WRITE_ONCE() route for x86, e.g. >>> convert everything as a follow-up patch (or patches). I'm fairly confident >>> that KVM's usage of lookup_address_in_mm() is safe, but I wouldn't exactly >>> bet my life on it. I'd much rather the failing scenario be that KVM uses >>> a sub-optimal page size as opposed to exploding on a bad pointer. >> >> Longpeng(Mike) asked in another e-mail specifically about making similar >> changes to lookup_address_in_mm(). Replying here as there is more context. >> >> I 'think' lookup_address_in_mm is safe from this issue. Why? IIUC, the >> problem with the huge_pte_offset routine is that the pud changes from >> pud_none() to pud_huge() in the middle of >> 'if (pud_huge(*pud) || !pud_present(*pud))'. In the case of >> lookup_address_in_mm, we know pud was not pud_none() as it was previously >> checked. I am not aware of any other state transitions which could cause >> us trouble. However, I am no expert in this area. Bad copy/paste by me. Longpeng(Mike) was asking about lookup_address_in_pgd. > So... I need just fix huge_pte_offset in mm/hugetlb.c, right? Let's start with just a fix for huge_pte_offset() as you can easily reproduce that issue by adding a delay. > Is it possible the pud changes from pud_huge() to pud_none() while another CPU > is walking the pagetable ? I believe it is possible. If we hole punch a hugetlbfs file, we will clear the corresponding pud's. Hence, we can go from pud_huge() to pud_none(). Unless I am missing something, that does imply we could have issues in places such as lookup_address_in_pgd: pud = pud_offset(p4d, address); if (pud_none(*pud)) return NULL; *level = PG_LEVEL_1G; if (pud_large(*pud) || !pud_present(*pud)) return (pte_t *)pud; I hope I am wrong, but it seems like pud_none(*pud) could become true after the initial check, and before the (pud_large) check. If so, there could be a problem (addressing exception) when the code continues and looks up the pmd. pmd = pmd_offset(pud, address); if (pmd_none(*pmd)) return NULL; It has been mentioned before that there are many page table walks like this. What am I missing that prevents races like this? Or, have we just been lucky? -- Mike Kravetz