Re: [PATCH 01/15] btrfs: create a mount option for dax

Adam Borowski <kilobyte@xxxxxxxxxx> · Wed, 27 Mar 2019 18:38:10 +0100

On Tue, Mar 26, 2019 at 12:10:01PM -0700, Matthew Wilcox wrote:
> On Tue, Mar 26, 2019 at 02:02:47PM -0500, Goldwyn Rodrigues wrote:
> > The dax option is restricted to non multi-device mounts.
> > dax interacts with the device directly instead of using bio, so
> > all bio-hooks which we use for multi-device cannot be performed
> > here. While regular read/writes could be manipulated with
> > RAID0/1, mmap() is still an issue.
> > 
> > Auto-setting free space tree, because dealing with free space
> > inode (specifically readpages) is a nightmare.
> > Auto-setting nodatasum because we don't get callback for writing
> > checksums after mmap()s.
> 
> Congratulations on getting the bear to dance.  But why?
> 
> To me, the point of btrfs is all the cool stuff it does with built-in
> checksumming and snapshots and RAID and so on.  DAX doesn't let you do
> any of that, so why would somebody want to use btrfs to manage DAX?

If I read this correctly (I merely glanced at it), this patchset _does_
provide the full snapshot functionality.  This is something other
filesystems don't allow: ext4 has no CoW at all, and IIRC on XFS reflinks
and DAX are mutually exclusive.

Obviously, the usual btrfs way of CoWing every write would remove all
(write) upsides of DAX, thus NOCOW (ie, CoW once) is the way to go: a page
fault should happen no more than once per page per snapshot.
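
To make the "CoW once" accounting concrete, here is a toy model (plain
Python, not btrfs code; the class and method names are made up for
illustration).  It assumes a snapshot marks every page as shared and that
only the first write to a shared page takes a fault:

```python
class NocowFile:
    """Toy model of 'CoW once' semantics; hypothetical, not btrfs code."""

    def __init__(self, num_pages):
        self.shared = [False] * num_pages  # True: page is shared with a snapshot
        self.faults = 0                    # faults that had to copy a page

    def snapshot(self):
        # Assumption: after a snapshot every page is shared with it.
        self.shared = [True] * len(self.shared)

    def write(self, page):
        if self.shared[page]:
            # First write since the snapshot: fault, copy, unshare.
            self.faults += 1
            self.shared[page] = False
        # Later writes hit the private copy directly, with no fault.


f = NocowFile(4)
f.snapshot()
for _ in range(3):
    f.write(0)   # only the first of these three writes faults
f.write(1)
print(f.faults)
```

Four writes after one snapshot touch two distinct pages, so the model
counts two faults; taking another snapshot would re-arm every page once.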

On the other hand, checksumming seems useless to me.  Data corruption can
happen either in transit or at rest.  For at rest, disks already have their
own checksums -- and [NV]DIMMs have ECC.  As for in transit: the majority of
the time when someone seeks help on the btrfs mailing list, it turns out to
be a matter of bad RAM, bad motherboard or bad cabling.  This doesn't apply
to pmem.  The usual path is:

   CPU
    |<--->memory
    |
  SATA controller
    |
    (SATA cable)
    |
  disk

The data goes to memory (it is very unlikely to remain in the cache before
getting checksummed), then has to travel all the way down.  By contrast,
the path on pmem is:

  CPU
   |---->memory

So the data written by userspace goes to memory... and that's it.
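
For illustration, this is roughly what that path looks like from userspace
(a sketch in Python; the helper name is made up, and the temporary file
here sits on whatever ordinary filesystem backs the temp directory, so it
still goes through the page cache).  The assumption is that on a DAX mount
the same stores through the mapping would land in pmem directly:

```python
import mmap
import os
import tempfile


def store_via_mmap(path, payload):
    """Write payload at offset 0 of the file through a shared mapping."""
    with open(path, "r+b") as f:
        # Length 0 maps the whole file; the Unix default is a shared,
        # read-write mapping.
        with mmap.mmap(f.fileno(), 0) as m:
            m[:len(payload)] = payload  # plain memory stores, no write() call
            m.flush()                   # msync; needed here as this is not DAX


# Back the mapping with one page of a temporary file.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, mmap.PAGESIZE)
os.close(fd)

store_via_mmap(path, b"meow")

with open(path, "rb") as f:
    data = f.read(4)
os.remove(path)
```

There is no bio and no writeback hook anywhere in the store itself, which
is where a checksum would normally get computed.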

As for multi-device, at least single block groups would be very nice (to
have a filesystem that spans regions) and easyish to implement, while RAID0
might spoil hugepage fun but may still be straightforward.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Did ya know that typing "test -j8" instead of "ctest -j8"
⢿⡄⠘⠷⠚⠋⠀ will make your testsuite pass much faster, and fix bugs?
⠈⠳⣄⠀⠀⠀⠀