On Wed, Aug 23, 2017 at 11:57:33AM +0200, Jan Kara wrote:
> On Tue 22-08-17 16:24:35, Ross Zwisler wrote:
> > In DAX there are two separate places where the 2MiB range of a PMD is
> > defined.
> >
> > The first is in the page tables, where a PMD mapping inserted for a
> > given address spans from (vmf->address & PMD_MASK) to
> > ((vmf->address & PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB
> > boundary below the address to the 2MiB boundary above the address.
> >
> > So, for example, a fault at address 3MiB (0x30 0000) falls within the
> > PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
> >
> > The second PMD range is in the mapping->page_tree, where a given file
> > offset is covered by a radix tree entry that spans from one 2MiB
> > aligned file offset to another 2MiB aligned file offset.
> >
> > So, for example, the file offset for 3MiB (pgoff 768) falls within the
> > PMD range for the order 9 radix tree entry that ranges from 2MiB
> > (pgoff 512) to 4MiB (pgoff 1024).
> >
> > This system works so long as the addresses and file offsets for a
> > given mapping both have the same offsets relative to the start of each
> > PMD.
> >
> > Consider the case where the starting address for a given file isn't
> > 2MiB aligned - say our faulting address is 3MiB (0x30 0000), but that
> > corresponds to the beginning of our file (pgoff 0).  Now all the PMDs
> > in the mapping are misaligned so that the 2MiB range defined in the
> > page tables never matches up with the 2MiB range defined in the radix
> > tree.
> >
> > The current code notices this case for DAX faults to storage with the
> > following test in dax_pmd_insert_mapping():
> >
> > 	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
> > 		goto unlock_fallback;
> >
> > This test makes sure that the pfn we get from the driver is 2MiB
> > aligned, and relies on the assumption that the 2MiB alignment of the
> > pfn we get back from the driver matches the 2MiB alignment of the
> > faulting address.
> >
> > However, faults to holes were not checked and we could hit the problem
> > described above.
> >
> > This was reported in response to the NVML nvml/src/test/pmempool_sync
> > TEST5:
> >
> > 	$ cd nvml/src/test/pmempool_sync
> > 	$ make TEST5
> >
> > You can grab NVML here:
> >
> > 	https://github.com/pmem/nvml/
> >
> > The dmesg warning you see when you hit this error is:
> >
> > WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
> >
> > Where we notice in dax_insert_mapping_entry() that the radix tree
> > entry we are about to replace doesn't match the locked entry that we
> > had previously inserted into the tree.  This happens because the
> > initial insertion was done in grab_mapping_entry() using a pgoff
> > calculated from the faulting address (vmf->address), and the
> > replacement in dax_pmd_load_hole() => dax_insert_mapping_entry() is
> > done using vmf->pgoff.
> >
> > In our failure case those two page offsets (one calculated from
> > vmf->address, one using vmf->pgoff) point to different order 9 radix
> > tree entries.
> >
> > This failure case can result in a deadlock because the radix tree
> > unlock also happens on the pgoff calculated from vmf->address.  This
> > means that the locked radix tree entry that we swapped in to the tree
> > in dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so
> > all future faults to that 2MiB range will block forever.
> >
> > Fix this by validating that the faulting address's PMD offset matches
> > the PMD offset from the start of the file.  This check is done at the
> > very beginning of the fault and covers faults that would have mapped
> > to storage as well as faults to holes.  I left the COLOUR check in
> > dax_pmd_insert_mapping() in place in case we ever hit the insanity
> > condition where the alignment of the pfn we get from the driver
> > doesn't match the alignment of the userspace address.
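
For anyone reading along without the patch in front of them: the check
described above amounts to comparing the PMD colour of the faulting
address with the PMD colour of the file offset, and falling back to PTE
faults on a mismatch.  Roughly, at the top of dax_iomap_pmd_fault(),
reusing the PG_PMD_COLOUR mask from the existing pfn test - this is a
sketch of the idea, not necessarily the exact hunk:

	/*
	 * Make sure that the faulting address's PMD offset (colour)
	 * matches the PMD offset from the start of the file.  Otherwise a
	 * PMD range in the page tables can never line up with a single
	 * order 9 radix tree entry, so fall back to PTE faults.
	 */
	if ((vmf->pgoff & PG_PMD_COLOUR) !=
	    ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
		goto fallback;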
> Hum, I'm confused with all these offsets and their alignment requirements.
> So let me try to get this straight. We have three different things here:
>
> 1) virtual address A where we fault
> 2) file offset FA corresponding to the virtual address - computed as
>    linear_page_index(vma, A) = ((A - vma->vm_start) >> PAGE_SHIFT) +
>    vma->vm_pgoff - and stored in vmf->pgoff
> 3) physical address (or sector in filesystem terminology) the file offset
>    maps to.
>
> and then we have the aligned virtual address B = (A & PMD_MASK) and the
> corresponding file offset FB we've got from linear_page_index(vma, B).
> Now if I've correctly understood what you've written above, the problem
> is that although B is aligned, FB doesn't have to be. That can happen
> either when vma->vm_start is not aligned (which is the example you give
> above?) or when vma->vm_pgoff is not aligned. And as a result
> FA >> 9 != FB >> 9, leading to issues.
>
> OK, makes sense. You can add:
>
> Reviewed-by: Jan Kara <jack@xxxxxxx>

Yep, your understanding matches mine.  What was happening in one specific
failure in the NVML test was:

  vma->vm_pgoff = 0x1	/* the vma's page offset from the start of the file */
  file offset FA = vmf->pgoff = 0x1200
  address A = 0x7f7a8edff000

So, as you say, the 0x1200 pgoff is calculated via:

  vmf->pgoff = ((A - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff
  0x1200     = ((0x7f7a8edff000 - 0x7f7a8dc00000) >> PAGE_SHIFT) + 1

So, when we take the aligned virtual address B = (A & PMD_MASK) in
dax_iomap_pmd_fault() and use it to compute the aligned page offset:

  aligned pgoff FB = ((B - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff
  0x1001           = ((0x7f7a8ec00000 - 0x7f7a8dc00000) >> PAGE_SHIFT) + 1

The aligned pgoff FB of 0x1001 lands in a different order 9 radix tree
entry than the one selected by the vmf->pgoff FA of 0x1200.

- Ross
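
P.S. In case it's useful, the arithmetic above can be sanity checked with
a standalone program.  This is plain userspace C, not kernel code; 4KiB
pages and 2MiB PMDs are assumed, and the constants are just the addresses
from the failure above:

#include <stdio.h>

#define PAGE_SHIFT	12				/* 4KiB pages assumed */
#define PMD_SHIFT	21				/* 2MiB PMDs assumed */
#define PMD_MASK	(~((1UL << PMD_SHIFT) - 1))
#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)	/* 9 */

int main(void)
{
	unsigned long vm_start = 0x7f7a8dc00000UL;	/* vma->vm_start */
	unsigned long vm_pgoff = 0x1UL;			/* vma->vm_pgoff */
	unsigned long A = 0x7f7a8edff000UL;		/* faulting address */
	unsigned long B = A & PMD_MASK;			/* PMD aligned address */

	/* linear_page_index(vma, addr) for the raw and the aligned address */
	unsigned long FA = ((A - vm_start) >> PAGE_SHIFT) + vm_pgoff;
	unsigned long FB = ((B - vm_start) >> PAGE_SHIFT) + vm_pgoff;

	printf("FA = %#lx, FB = %#lx\n", FA, FB);	/* 0x1200 vs 0x1001 */
	printf("FA >> 9 = %#lx, FB >> 9 = %#lx (%s)\n",
	       FA >> PMD_ORDER, FB >> PMD_ORDER,
	       (FA >> PMD_ORDER) == (FB >> PMD_ORDER) ?
	       "same radix tree PMD entry" : "different radix tree PMD entries");
	return 0;
}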