Re: [PATCH] dax: remove the pmem_dax_ops->flush abstraction

Mikulas Patocka <mpatocka@xxxxxxxxxx> · Wed, 6 Sep 2017 02:30:17 -0400 (EDT)

On Thu, 31 Aug 2017, Dan Williams wrote:

> On Thu, Aug 31, 2017 at 8:11 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> > On Thu, Aug 31, 2017 at 7:49 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> >>
> >>
> >> On Thu, 31 Aug 2017, Dan Williams wrote:
> >>
> >>> On Thu, Aug 31, 2017 at 6:47 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> >>> > The patch abebfbe2f731 ("dm: add ->flush() dax operation support") is
> >>> > buggy. A dm device may be composed of multiple underlying devices and all
> >>> > of them need to be flushed. The patch just routes the flush request to the
> >>> > first device and ignores the other devices.
> >>> >
> >>> > It could be fixed by adding more complex logic to the device mapper. But
> >>> > there is only one implementation of the method pmem_dax_ops->flush - that
> >>> > is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
> >>> > don't need the pmem_dax_ops->flush abstraction at all, we can call
> >>> > arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
> >>> > couldn't ever reach anything different from arch_wb_cache_pmem().
> >>>
> >>> Unfortunately, this is not true, see usage of DAXDEV_WRITE_CACHE. This
> >>> is for platforms that arrange for cpu caches to be flushed on
> >>> power-fail, like standalone storage appliances, where it would be a
> >>> waste for the kernel to track and flush dirty cachelines for dax.
> >>> Theoretically this could be done on a per-address range basis (think
> >>> battery backing a subset of the system memory). I think we need to fix
> >>> the routing to flush the same dax_device that ->direct_access() was
> >>> invoked.
> >>
> >> With my patch, dax_flush checks DAXDEV_WRITE_CACHE and doesn't do anything
> >> if it's not set - it just doesn't go through device mapper.
> >
> > True, but there's no guarantee that arch_wb_pmem() will be sufficient
> > going forward and the reason we routed this to the driver in the first
> > place was to allow driver / platform specific flush implementations.
> >
> >>
> >>> > It should be also pointed out that for some uses of persistent memory it
> >>> > is needed to flush only very small amount of data (such as 1 cacheline),
> >>> > and it would be overkill if we go through that device mapper machinery for
> >>> > a single flushed cache line.
> >>>
> >>> I think this is more an argument to not enable DAX on that
> >>> device-mapper topology if operation routing impacts performance. DAX
> >>> is meant to get the kernel out of the way most of the time.
> >>
> >> I don't understand what do you mean? I am writing a device mapper target
> >> that uses persistent memory as a cache. When flushing the metadata,
> >> sometimes it is needed to flush single cache lines. Ideally, I would use
> >> the clwb instruction for this purpose, but that clwb is hidden behind a
> >> complicated pmem_dax_ops->flush abstraction (and that abstraction will get
> >> even more complicated if implemented correctly). How do you think I should
> >> write that target without DAX?
> >
> > I don't think you want DAX for this, I think you want direct control
> > of a persistent-memory address range.  For example the brd driver
> > supports DAX but I wouldn't recommend using it as the basis for a
> > directly managed cache.
> >
> > So I think we need to dive deeper into what an in kernel interface to
> > access a contiguous persistent memory range looks like. DAX was not
> > designed with that use case in mind.
> 
> For a cache driver using pmem, a strawman / starting point would be
> trying to do a ->direct_access() call for the full capacity of the
> device and then:
> 
> 1/ validate that it returns the full capacity (this will preclude
> drivers like brd, and make sure the dax range is contiguous)
> 
> 2/ use the returned pfn to walk the iomem resource tree and validate
> that you are talking to persistent memory and not some special
> dax-device range.

This is exactly what I am doing. (except that if direct_access returns 
less than the size of the device, I call direct_access again and remap the 
discontiguos pages to a contiguous linear region)

But the question is - how should I flush cache lines in my persistent 
memory driver?

* there is clwb() function, but it is specific to x86 architecture
* there is dax_flush(), but it has excessive overhead because it tries to 
  walk the device mapper tree (I measured it on a system with real 
  persistent memory and it is 1.9 times slower than clwb when used on 
  device mapper)
* in the kernel 4.12, there used to be wb_cache_pmem() - this function 
  would be perfect for this purpose, but the commit 
  4e4f00a9b51a1c52ebdd728a1caeb3b9fe48c39d removed it

Mikulas

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel