On Thu, Jun 27, 2019 at 11:58 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Thu, Jun 27, 2019 at 11:29 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 27, 2019 at 9:06 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> > > > > Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
> > > > > been encountering intermittent lockups. The backtraces always include
> > > > > the filesystem-DAX PMD path, multi-order entries have been a source of
> > > > > bugs in the past, and disabling the PMD path allows a test that fails in
> > > > > minutes to run for an hour.
> > > >
> > > > On May 4th, I asked you:
> > > >
> > > > Since this is provoked by a fatal signal, it must have something to do
> > > > with a killable or interruptible sleep. There's only one of those in the
> > > > DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do
> > > > I/O with write() or through a writable mmap()? I'd like to know before
> > > > I chase too far down this fault tree analysis.
> > >
> > > RocksDB in this case is using write() for writes and mmap() for reads.
> >
> > It's not clear to me that a fatal signal is a component of the failure
> > as much as it's the way to detect that the benchmark has indeed locked
> > up.
>
> Even though db_bench is run with the mmap_read=1 option:
>
> cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
>     --use_existing_db=1 \
>     --mmap_read=1 \
>     --num=$num_keys \
>     --threads=$num_read_threads \
>
> When the lockup occurs there are db_bench processes in the write fault path:
>
> [ 1666.635212] db_bench        D    0  2492   2435 0x00000000
> [ 1666.641339] Call Trace:
> [ 1666.644072]  ? __schedule+0x24f/0x680
> [ 1666.648162]  ? __switch_to_asm+0x34/0x70
> [ 1666.652545]  schedule+0x29/0x90
> [ 1666.656054]  get_unlocked_entry+0xcd/0x120
> [ 1666.660629]  ? dax_iomap_actor+0x270/0x270
> [ 1666.665206]  grab_mapping_entry+0x14f/0x230
> [ 1666.669878]  dax_iomap_pmd_fault.isra.42+0x14d/0x950
> [ 1666.675425]  ? futex_wait+0x122/0x230
> [ 1666.679518]  ext4_dax_huge_fault+0x16f/0x1f0
> [ 1666.684288]  __handle_mm_fault+0x411/0x1350
> [ 1666.688961]  ? do_futex+0xca/0xbb0
> [ 1666.692760]  ? __switch_to_asm+0x34/0x70
> [ 1666.697144]  handle_mm_fault+0xbe/0x1e0
> [ 1666.701429]  __do_page_fault+0x249/0x4f0
> [ 1666.705811]  do_page_fault+0x32/0x110
> [ 1666.709903]  ? page_fault+0x8/0x30
> [ 1666.713702]  page_fault+0x1e/0x30
>
> ...where __handle_mm_fault+0x411 is in wp_huge_pmd():
>
> (gdb) li *(__handle_mm_fault+0x411)
> 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
> 3795    static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
> 3796    {
> 3797            if (vma_is_anonymous(vmf->vma))
> 3798                    return do_huge_pmd_wp_page(vmf, orig_pmd);
> 3799            if (vmf->vma->vm_ops->huge_fault)
> 3800                    return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> 3801
> 3802            /* COW handled on pte level: split pmd */
> 3803            VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
> 3804            __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>
> This bug feels like we failed to unlock, or unlocked the wrong entry,
> and this hunk in the bisected commit looks suspect to me. Why do we
> still need to drop the lock now that the radix_tree_preload() calls
> are gone?

Nevermind, unmap_mapping_pages() takes a sleeping lock, but then I
wonder why we don't restart the lookup like the old implementation.