On Mon, May 28, 2018 at 6:32 AM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: > > > On Sat, 26 May 2018, Dan Williams wrote: > >> On Sat, May 26, 2018 at 12:02 AM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: >> > >> > >> > On Fri, 25 May 2018, Dan Williams wrote: >> > >> >> On Fri, May 25, 2018 at 5:51 AM, Mike Snitzer <snitzer@xxxxxxxxxx> wrote: >> >> > On Fri, May 25 2018 at 2:17am -0400, >> >> > Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote: >> >> > >> >> >> On Thu, 24 May 2018, Dan Williams wrote: >> >> >> >> >> >> > I don't want to grow driver-local wrappers for pmem. You should use >> >> >> > memcpy_flushcache directly() and if an architecture does not define >> >> >> > memcpy_flushcache() then don't allow building dm-writecache, i.e. this >> >> >> > driver should 'depends on CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE'. I don't >> >> >> > see a need to add a standalone flush operation if all relevant archs >> >> >> > provide memcpy_flushcache(). As for commit, I'd say just use wmb() >> >> >> > directly since all archs define it. Alternatively we could introduce >> >> >> > memcpy_flushcache_relaxed() to be the un-ordered version of the copy >> >> >> > routine and memcpy_flushcache() would imply a wmb(). >> >> >> >> >> >> But memcpy_flushcache() on ARM64 is slow. >> >> >> >> Right, so again, what is wrong with memcpy_flushcache_relaxed() + >> >> wmb() or otherwise making memcpy_flushcache() ordered. I do not see >> >> that as a trailblazing requirement, I see that as typical review and a >> >> reduction of the operation space that you are proposing. >> > >> > memcpy_flushcache on ARM64 is generally wrong thing to do, because it is >> > slower than memcpy and explicit cache flush. >> > >> > Suppose that you want to write data to a block device and make it >> > persistent. So you send a WRITE bio and then a FLUSH bio. >> > >> > Now - how to implement these two bios on persistent memory: >> > >> > On X86, the WRITE bio does memcpy_flushcache() and the FLUSH bio does >> > wmb() - this is the optimal implementation. >> > >> > But on ARM64, memcpy_flushcache() is suboptimal. On ARM64, the optimal >> > implementation is that the WRITE bio does just memcpy() and the FLUSH bio >> > does arch_wb_cache_pmem() on the affected range. >> > >> > Why is memcpy_flushcache() is suboptimal on ARM? The ARM architecture >> > doesn't have non-temporal stores. So, memcpy_flushcache() is implemented >> > as memcpy() followed by a cache flush. >> > >> > Now - if you flush the cache immediatelly after memcpy, the cache is full >> > of dirty lines and the cache-flushing code has to write these lines back >> > and that is slow. >> > >> > If you flush the cache some time after memcpy (i.e. when the FLUSH bio is >> > received), the processor already flushed some part of the cache on its >> > own, so the cache-flushing function has less work to do and it is faster. >> > >> > So the conclusion is - don't use memcpy_flushcache on ARM. This problem >> > cannot be fixed by a better implementation of memcpy_flushcache. >> >> It sounds like ARM might be better off with mapping its pmem as >> write-through rather than write-back, and skip the explicit cache > > I doubt it would perform well - write combining combines the writes into a > larger segments - and write-through doesn't. > Last I checked write-through caching does not disable write combining >> management altogether. You speak of "optimal" and "sub-optimal", but >> what would be more clear is fio measurements of the relative IOPs and >> latency profiles of the different approaches. The reason I am >> continuing to push here is that reducing the operation space from >> 'copy-flush-commit' to just 'copy' or 'copy-commit' simplifies the >> maintenance long term. > > I measured it (with nvme backing store) and late cache flushing has 12% > better performance than eager flushing with memcpy_flushcache(). I assume what you're seeing is ARM64 over-flushing the amount of dirty data so it becomes more efficient to do an amortized flush at the end? However, that effectively makes memcpy_flushcache() unusable in the way it can be used on x86. You claimed that ARM does not support non-temporal stores, but it does, see the STNP instruction. I do not want to see arch specific optimizations in drivers, so either write-through mappings is a potential answer to remove the need to explicitly manage flushing, or just implement STNP hacks in memcpy_flushcache() like you did with MOVNT on x86. > 131836 4k iops - vs - 117016. To be clear this is memcpy_flushcache() vs memcpy + flush? -- dm-devel mailing list dm-devel@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/dm-devel