Fwd: Spurious SIGBUS when threads race to insert a DAX page

George Hodgkins <George.Hodgkins@xxxxxxxxxxxx> · Mon, 14 Mar 2022 14:04:35 -0600

NOTE: This question is about kernel 4.15. All line numbers and symbol
names correspond to the Git source at tag v4.15.

Hi all,
I've been running some benchmarks using ext4 files on PMEM (first-gen
Intel Optane) as "anonymous" memory, and I've run into a weird error.
For reference: at startup our runtime `fallocate`s a large PMEM-backed
file and maps the whole thing R/W with MAP_SYNC. It then interposes on
userspace calls to `mmap` and returns page-sized chunks of PMEM when
anonymous memory is requested.
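
A minimal sketch of the setup described above, for readers who want
something concrete to look at. It is not the actual runtime: the names
`pmem_pool`, `pmem_pool_open`, `pmem_pool_carve` and `POOL_PAGE` are
hypothetical, the `mmap` interposition itself is left out, and the
fallback flag values assume x86 headers that predate MAP_SYNC.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallbacks for older libc headers (x86 values, an assumption). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

#define POOL_PAGE 4096UL /* assumed base page size */

/* Hypothetical pool descriptor: one big mapping plus a bump offset. */
struct pmem_pool {
    char *base;
    size_t len;
    atomic_size_t next;
};

/* Preallocate `len` bytes in `path` and map them R/W with MAP_SYNC.
 * Returns 0 on success, -1 on any failure. */
int pmem_pool_open(struct pmem_pool *p, const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (fallocate(fd, 0, 0, (off_t)len) != 0) {
        close(fd);
        return -1;
    }
    /* MAP_SYNC is only honoured together with MAP_SHARED_VALIDATE. */
    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    close(fd);
    if (base == MAP_FAILED)
        return -1;
    p->base = base;
    p->len = len;
    atomic_init(&p->next, 0);
    return 0;
}

/* Hand out the next page-aligned chunk, in file order, or NULL when the
 * pool is exhausted. This stands in for what the interposed mmap returns. */
void *pmem_pool_carve(struct pmem_pool *p, size_t len)
{
    assert(p != NULL && p->base != NULL);
    size_t need = (len + POOL_PAGE - 1) & ~(POOL_PAGE - 1);
    size_t off = atomic_fetch_add(&p->next, need);
    if (need == 0 || off > p->len || need > p->len - off)
        return NULL;
    return p->base + off;
}
```

Because chunks are carved in file order, the first store to a freshly
carved chunk is also the first touch of that part of the file. On a
file that is not DAX-backed, the `mmap` call with MAP_SYNC is expected
to fail rather than silently fall back.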

The error I have encountered is the nondeterministic delivery of
SIGBUS on the first access to an untouched page of the mapped region.
Since the file is handed out to the application sequentially, that
page is also typically in the first uninitialized extent of the file
at the time of the crash. The accesses are aligned and fall within a
mapped region according to smaps, which rules out the only documented
causes of SIGBUS that I'm aware of.

I did a bit of digging with FTrace, and the course of events at a
crash seems to be as follows. Multiple (>2) threads start faulting in
the page, and go through the "synchronous page fault" path. They all
return error-free from the vfs_fsync_range() call at dax.c:1588 and call
dax_insert_pfn_mkwrite. The first thread to exit that function returns
VM_FAULT_NOPAGE (success) and the others all return VM_FAULT_SIGBUS,
and each of those raises the userspace signal on the return path.
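
The access pattern in that trace can be sketched as a small userspace
harness: several threads are released at once and each stores to the
same page. This is an illustrative sketch, not a confirmed reproducer.
`race_first_touch`, `struct race` and `RACERS` are hypothetical names,
and whether this alone is enough to trigger the SIGBUS is untested.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define RACERS 8 /* arbitrary thread count */

struct race {
    atomic_int go;        /* start gate, flipped once all threads exist */
    atomic_int next_slot; /* gives each thread its own byte in the page */
    volatile char *page;  /* page that every thread will fault on */
};

static void *racer(void *arg)
{
    struct race *r = arg;
    int slot = atomic_fetch_add(&r->next_slot, 1);

    while (!atomic_load(&r->go))
        sched_yield();    /* wait so all stores land close together */
    r->page[slot] = 1;    /* distinct bytes, same page: one write fault each */
    return NULL;
}

/* Returns 0 if all RACERS threads ran and stored to the page, -1 otherwise. */
int race_first_touch(void *page)
{
    struct race r;
    pthread_t tid[RACERS];
    int started = 0;

    atomic_init(&r.go, 0);
    atomic_init(&r.next_slot, 0);
    r.page = page;

    while (started < RACERS &&
           pthread_create(&tid[started], NULL, racer, &r) == 0)
        started++;

    atomic_store(&r.go, 1); /* release whatever threads were created */
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);

    return started == RACERS ? 0 : -1;
}
```

To match the scenario above, `page` would be a never-touched page of
the MAP_SYNC mapping. On ordinary memory the function simply returns 0.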

My best guess for why this occurs is that the unsuccessful calls all
bounce with -EBUSY (because of the successful one?) in insert_pfn
(reached through the call to vm_insert_mixed_mkwrite at dax.c:1548),
and then dax_fault_return maps that to SIGBUS. The signal is
definitely spurious -- as mentioned, one of the threads returns
success, and if I catch the signal with GDB, the faulting access can
be successfully performed after the signal is caught. The error is
also nondeterministic -- it happens in maybe one out of every five
runs. To clarify some other things that could make a difference, the
pages are normal-sized (not huge) and the SIGBUS isn't due to PMEM
failure (i.e. HWPOISON).
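
To make the suspected mapping concrete, here is a small userspace model
of the guess above. It is a paraphrase of the hypothesis, not kernel
code: the enum and both function names are hypothetical stand-ins, and
the real dax_fault_return at v4.15 should be checked against the tree.

```c
#include <errno.h>

/* Illustrative stand-ins for the kernel's VM_FAULT_* results. */
enum fault_result { FAULT_NOPAGE, FAULT_OOM, FAULT_SIGBUS };

/* Suspected behaviour: only 0 and -ENOMEM get special treatment, so any
 * other error, including -EBUSY, ends up as SIGBUS. */
enum fault_result model_dax_fault_return(int error)
{
    if (error == 0)
        return FAULT_NOPAGE;
    if (error == -ENOMEM)
        return FAULT_OOM;
    return FAULT_SIGBUS;
}

/* Hypothetical variant that reads -EBUSY as "another thread already
 * installed this PTE" and reports success instead. */
enum fault_result model_tolerant_return(int error)
{
    if (error == -EBUSY)
        error = 0;
    return model_dax_fault_return(error);
}
```

Under this model, every thread that loses the insertion race gets
FAULT_SIGBUS from the first function and FAULT_NOPAGE from the second,
which matches the observation that the access succeeds when retried.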

I'm on an old kernel (4.15) so if this is really an error in the
kernel code it may be fixed in the current series. If that's the case,
just point me to a patch or release number where it was fixed and I'll
be happy. It may also be an error in my code -- I will be less happy
in that case, but please still point it out or ask questions for
clarification if you think I'm doing something wrong to cause this.

Thanks,
George Hodgkins