On Tue, Dec 5, 2023 at 6:43 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > > On 04/12/2023 17:26, James Houghton wrote: > > It is currently possible for a userspace application to enter a page > > fault loop when using HugeTLB pages implemented with contiguous PTEs > > when HAFDBS is not available. This happens because: > > 1. The kernel may sometimes write PTEs that are sw-dirty but hw-clean > > (PTE_DIRTY | PTE_RDONLY | PTE_WRITE). > > Hi James, > > Do you know how this happens? Hi Ryan, Thanks for taking a look! I do understand why this is happening. There is an explanation in the reproducer[1] and also in this cover letter (though I realize I could have been a little clearer). See below. > AFAIK, this is the set of valid bit combinations, and > PTE_RDONLY|PTE_WRITE|PTE_DIRTY is not one of them. Perhaps the real solution is > to understand how this is happening and prevent it? > > /* > * PTE bits configuration in the presence of hardware Dirty Bit Management > * (PTE_WRITE == PTE_DBM): > * > * Dirty Writable | PTE_RDONLY PTE_WRITE PTE_DIRTY (sw) > * 0 0 | 1 0 0 > * 0 1 | 1 1 0 > * 1 0 | 1 0 1 > * 1 1 | 0 1 x > * > * When hardware DBM is not present, the sofware PTE_DIRTY bit is updated via > * the page fault mechanism. Checking the dirty status of a pte becomes: > * > * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY) > */ Thanks for pointing this out. So (1) is definitely a bug. The second patch in this series makes it impossible to create such a PTE via pte_modify (by forcing sw-dirty PTEs to be hw-dirty as well). > > The second patch in this series makes step (1) less likely to occur. It makes it impossible to create this invalid set of bits via pte_modify(). Assuming all PTE pgprot updates are done via the proper interfaces, patch #2 might actually make this invalid bit combination impossible to produce (that's certainly the goal). So perhaps language stronger than "less likely" is appropriate. Here's the sequence of events to trigger this bug, via mprotect(): > > Without this patch, we can get the kernel to write a sw-dirty, hw-clean > > PTE with the following steps (showing the relevant VMA flags and pgprot > > bits): > > i. Create a valid, writable contiguous PTE. > > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > > PTE pgprot bits: PTE_DIRTY | PTE_WRITE > > ii. mprotect the VMA to PROT_NONE. > > VMA vmflags: VM_SHARED > > VMA pgprot bits: PTE_RDONLY > > PTE pgprot bits: PTE_DIRTY | PTE_RDONLY > > iii. mprotect the VMA back to PROT_READ | PROT_WRITE. > > VMA vmflags: VM_SHARED | VM_READ | VM_WRITE > > VMA pgprot bits: PTE_RDONLY | PTE_WRITE > > PTE pgprot bits: PTE_DIRTY | PTE_WRITE | PTE_RDONLY With patch #2, the PTE pgprot bits in step iii become PTE_DIRTY | PTE_WRITE (hw-dirtiness is set, as the PTE is sw-dirty). Thanks! > > [1]: https://gist.github.com/48ca/11d1e466deee032cb35aa8c2280f93b0