Re: [PATCH] filesystem-dax: Disable PMD support

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 27 Jun 2019 12:09:29 -0700

On Thu, Jun 27, 2019 at 11:58 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Thu, Jun 27, 2019 at 11:29 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Thu, Jun 27, 2019 at 9:06 AM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > >
> > > On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> > > > > Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
> > > > > been encountering intermittent lockups. The backtraces always include
> > > > > the filesystem-DAX PMD path, multi-order entries have been a source of
> > > > > bugs in the past, and disabling the PMD path allows a test that fails in
> > > > > minutes to run for an hour.
> > > >
> > > > On May 4th, I asked you:
> > > >
> > > > Since this is provoked by a fatal signal, it must have something to do
> > > > with a killable or interruptible sleep.  There's only one of those in the
> > > > DAX code; fatal_signal_pending() in dax_iomap_actor().  Does rocksdb do
> > > > I/O with write() or through a writable mmap()?  I'd like to know before
> > > > I chase too far down this fault tree analysis.
> > >
> > > RocksDB in this case is using write() for writes and mmap() for reads.
> >
> > It's not clear to me that a fatal signal is a component of the failure
> > as much as it's the way to detect that the benchmark has indeed locked
> > up.
>
> Even though db_bench is run with the mmap_read=1 option:
>
>   cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
>        --use_existing_db=1 \
>         --mmap_read=1 \
>        --num=$num_keys \
>        --threads=$num_read_threads \
>
> When the lockup occurs there are db_bench processes in the write fault path:
>
> [ 1666.635212] db_bench        D    0  2492   2435 0x00000000
> [ 1666.641339] Call Trace:
> [ 1666.644072]  ? __schedule+0x24f/0x680
> [ 1666.648162]  ? __switch_to_asm+0x34/0x70
> [ 1666.652545]  schedule+0x29/0x90
> [ 1666.656054]  get_unlocked_entry+0xcd/0x120
> [ 1666.660629]  ? dax_iomap_actor+0x270/0x270
> [ 1666.665206]  grab_mapping_entry+0x14f/0x230
> [ 1666.669878]  dax_iomap_pmd_fault.isra.42+0x14d/0x950
> [ 1666.675425]  ? futex_wait+0x122/0x230
> [ 1666.679518]  ext4_dax_huge_fault+0x16f/0x1f0
> [ 1666.684288]  __handle_mm_fault+0x411/0x1350
> [ 1666.688961]  ? do_futex+0xca/0xbb0
> [ 1666.692760]  ? __switch_to_asm+0x34/0x70
> [ 1666.697144]  handle_mm_fault+0xbe/0x1e0
> [ 1666.701429]  __do_page_fault+0x249/0x4f0
> [ 1666.705811]  do_page_fault+0x32/0x110
> [ 1666.709903]  ? page_fault+0x8/0x30
> [ 1666.713702]  page_fault+0x1e/0x30
>
> ...where __handle_mm_fault+0x411 is in wp_huge_pmd():
>
> (gdb) li *(__handle_mm_fault+0x411)
> 0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
> 3795    static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf,
> pmd_t orig_pmd)
> 3796    {
> 3797            if (vma_is_anonymous(vmf->vma))
> 3798                    return do_huge_pmd_wp_page(vmf, orig_pmd);
> 3799            if (vmf->vma->vm_ops->huge_fault)
> 3800                    return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
> 3801
> 3802            /* COW handled on pte level: split pmd */
> 3803            VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
> 3804            __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>
> This bug feels like we failed to unlock, or unlocked the wrong entry
> and this hunk in the bisected commit looks suspect to me. Why do we
> still need to drop the lock now that the radix_tree_preload() calls
> are gone?

Nevermind, unmapp_mapping_pages() takes a sleeping lock, but then I
wonder why we don't restart the lookup like the old implementation.