On Fri, 8 Dec 2017, Dan Williams wrote:

> > > What about memcpy_flushcache?
> >
> > but
> >
> > - using memcpy_flushcache is overkill if we need just one or two 8-byte
> > writes to the metadata area. Why not use movnti directly?
>
> The driver performs so many 8-byte moves that the cost of the
> memcpy_flushcache() function call significantly eats into your
> performance?
>
> > - on some architectures, memcpy_flushcache is just an alias for memcpy,
> > so there will still be some arch-specific ifdefs
>
> ...but those should all be hidden by arch code, not in drivers.
>
> > - there is no architecture-neutral way to guarantee ordering between
> > multiple memcpy_flushcache calls. On x86, we need wmb(); on Power we
> > don't; on ARM64 I don't know (arch_wb_cache_pmem calls dmb(osh),
> > memcpy_flushcache doesn't - I don't know what the implications of this
> > difference are); on other architectures, wmb() is insufficient to
> > guarantee ordering between multiple memcpy_flushcache calls.
>
> wmb() should always be sufficient to order stores on all architectures.

No, it isn't.
See this example:

uint64_t var_a, var_b;

void fn(void)
{
	uint64_t val = 3;
	memcpy_flushcache(&var_a, &val, 8);
	wmb();
	val = 5;
	memcpy_flushcache(&var_b, &val, 8);
}

On x86-64, memcpy_flushcache is implemented using the movnti instruction
(which writes the value to the write-combining buffer) and wmb() is compiled
into the sfence instruction (which flushes the write-combining buffer) - so
it's OK - it is guaranteed that the variable var_a is written to persistent
memory before the variable var_b.

However, on i686 (and most other architectures), memcpy_flushcache is just
an alias for memcpy - see this code in include/linux/string.h:

#ifndef __HAVE_ARCH_MEMCPY_FLUSHCACHE
static inline void memcpy_flushcache(void *dst, const void *src, size_t cnt)
{
	memcpy(dst, src, cnt);
}
#endif

So, memcpy_flushcache writes the variable var_a to the cache, wmb() flushes
the pipeline but otherwise does nothing (because on i686 all writes to the
cache are already ordered), and then memcpy_flushcache writes the variable
var_b to the cache. Both writes end up in the cache, and wmb() doesn't flush
the cache - so wmb() doesn't provide any guarantee that var_a is written to
persistent memory before var_b. If the cache sector containing the variable
var_b is more contended than the cache sector containing the variable var_a,
the CPU will flush the cache line containing var_b before the one containing
var_a.

We have the dax_flush() function that (unlike wmb()) guarantees that the
specified range is flushed from the cache and written to persistent memory.
But dax_flush() is very slow on x86-64.

So, we have a situation where wmb() is fast, but only guarantees ordering on
x86-64, and dax_flush() guarantees ordering on all architectures, but is
slow on x86-64. So, my driver uses wmb() on x86-64 and dax_flush() on all
the others.
You argue that the driver shouldn't use any per-architecture #ifdefs, but
there is no other way to do it - using dax_flush() on x86-64 kills
performance, and using wmb() on all architectures is unreliable because
wmb() doesn't guarantee cache flushing at all.

> > At least for Broadwell, the write-back memory type on persistent memory
> > has such horrible performance that it is not really usable.
>
> ...and my concern is that you're designing a pmem driver mechanism for
> Broadwell that predates the clwb instruction for efficient pmem
> access.

That's what we have for testing. If I get access to some Skylake server, I
can test whether using write combining is faster than cache flushing.

Mikulas

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel