On Mon, Apr 10, 2017 at 03:41:11PM +0200, Jan Kara wrote:
> On Thu 06-04-17 15:29:44, Ross Zwisler wrote:
> > While running generic/340 in my test setup I hit the following race.
> > It can happen with kernels that support FS DAX PMDs, so v4.10 thru
> > v4.11-rc5.
> >
> > Thread 1                                 Thread 2
> > --------                                 --------
> > dax_iomap_pmd_fault()
> >   grab_mapping_entry()
> >     spin_lock_irq()
> >     get_unlocked_mapping_entry()
> >     'entry' is NULL, can't call lock_slot()
> >     spin_unlock_irq()
> >     radix_tree_preload()
> >                                          dax_iomap_pmd_fault()
> >                                            grab_mapping_entry()
> >                                              spin_lock_irq()
> >                                              get_unlocked_mapping_entry()
> >                                              ...
> >                                              lock_slot()
> >                                              spin_unlock_irq()
> >                                            dax_pmd_insert_mapping()
> >                                              <inserts a PMD mapping>
> >     spin_lock_irq()
> >     __radix_tree_insert() fails with -EEXIST
> >     <fall back to 4k fault, and die horribly
> >      when inserting a 4k entry where a PMD exists>
> >
> > The issue is that we have to drop mapping->tree_lock while calling
> > radix_tree_preload(), but since we didn't have a radix tree entry to
> > lock (unlike in the pmd_downgrade case) we have no protection against
> > Thread 2 coming along and inserting a PMD at the same index.  For 4k
> > entries we handled this with a special-case response to -EEXIST coming
> > from the __radix_tree_insert(), but this doesn't save us for PMDs
> > because the -EEXIST case can also mean that we collided with a 4k
> > entry in the radix tree at a different index, but one that is covered
> > by our PMD range.
> >
> > So, correctly handle both the 4k and 2M collision cases by explicitly
> > re-checking the radix tree for an entry at our index once we reacquire
> > mapping->tree_lock.
> >
> > This patch has made it through a clean xfstests run with the current
> > v4.11-rc5 based linux/master, and it also ran generic/340 500 times in
> > a loop.  It used to fail within the first 10 iterations.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx> [4.10+]
>
> The patch looks good to me (and I can see Andrew already sent it to
> Linus), I'm just wondering where did things actually go wrong? I'd
> expect we would return VM_FAULT_FALLBACK from dax_iomap_pmd_fault() and
> then do PTE fault for the address which should just work out fine...

Yep, that's what I thought as well, and I think it does work for
processes that have separate page tables.  The second process will do a
4k fault (just as it would have if it had a VMA smaller than 2MiB, for
example), map the 4k page into its own page table and just dirty the
2MiB DAX entry in the radix tree.  I've tested this case manually in the
past.

I think the error case I was seeing was for threads that share page
tables.  In that case the second thread falls back to PTEs, but there is
already a PMD in the page table from the first fault.  When we try to
insert a PTE over the PMD we get the following BUG:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: do_raw_spin_trylock+0x5/0x40
PGD 8d6ee0067 PUD 8db6e8067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: dax_pmem nd_pmem dax nd_btt nd_e820 libnvdimm [last unloaded: scsi_debug]
CPU: 2 PID: 25323 Comm: holetest Not tainted 4.11.0-rc4 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
task: ffff880095492a00 task.stack: ffffc90014048000
RIP: 0010:do_raw_spin_trylock+0x5/0x40
RSP: 0000:ffffc9001404bb60 EFLAGS: 00010296
RAX: ffff880095492a00 RBX: 0000000000000018 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffffc9001404bb80 R08: 0000000000000001 R09: 0000000000000000
R10: ffff880095492a00 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8808d5fe4220 R14: ffff88004c3e3c80 R15: 8000000000000025
FS:  00007f7ed7dff700(0000) GS:ffff8808de400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000008d86f6000 CR4: 00000000001406e0
Call Trace:
 ? _raw_spin_lock+0x49/0x80
 ? __get_locked_pte+0x16b/0x1d0
 __get_locked_pte+0x16b/0x1d0
 insert_pfn.isra.68+0x3a/0x100
 vm_insert_mixed+0x64/0x90
 dax_iomap_fault+0xa41/0x1680
 ext4_dax_huge_fault+0xa9/0xd0
 ext4_dax_fault+0x10/0x20
 __do_fault+0x20/0x130
 __handle_mm_fault+0x9b3/0x1190
 handle_mm_fault+0x169/0x370
 ? handle_mm_fault+0x47/0x370
 __do_page_fault+0x28f/0x590
 trace_do_page_fault+0x58/0x2c0
 do_async_page_fault+0x2c/0x90
 async_page_fault+0x28/0x30
RIP: 0033:0x4014b2
RSP: 002b:00007f7ed7dfef20 EFLAGS: 00010216
RAX: 00007f7ec6c00400 RBX: 0000000000010000 RCX: 0000000001c00000
RDX: 0000000000001c01 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007f7ed7dff700 R08: 00007f7ed7dff700 R09: 00007f7ed7dff700
R10: 00007f7ed7dff9d0 R11: 0000000000000202 R12: 00007f7ec6c00000
R13: 00007ffe3ffb5b60 R14: 0000000000000400 R15: 00007f7ed7dff700
Code: 30 84 ee 81 48 89 df e8 4a fe ff ff eb 89 89 c6 48 89 df e8 7e e7 ff ff eb 8c 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <8b> 07 55 48 89 e5 85 c0 75 2b ba 01 00 00 00 f0 0f b1 17 85 c0
RIP: do_raw_spin_trylock+0x5/0x40 RSP: ffffc9001404bb60
CR2: 0000000000000000
---[ end trace 75d38250d89b67cd ]---