> -----Original Message-----
> From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> Sent: Monday, March 30, 2020 6:32 AM
> To: Dan Williams <dan.j.williams@xxxxxxxxx>; Vishal Verma
> <vishal.l.verma@xxxxxxxxx>; Dave Jiang <dave.jiang@xxxxxxxxx>; Ira
> Weiny <ira.weiny@xxxxxxxxx>; Mike Snitzer <msnitzer@xxxxxxxxxx>
> Cc: linux-nvdimm@xxxxxxxxxxxx; dm-devel@xxxxxxxxxx
> Subject: [PATCH v2] memcpy_flushcache: use cache flusing for larger
> lengths
>
> I tested dm-writecache performance on a machine with Optane nvdimm
> and it turned out that for larger writes, cached stores + cache
> flushing perform better than non-temporal stores. This is the
> throughput of dm-writecache measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
>
> block size    512        1024       2048       4096
> movnti        496 MB/s   642 MB/s   725 MB/s   744 MB/s
> clflushopt    373 MB/s   688 MB/s   1.1 GB/s   1.2 GB/s
>
> We can see that for smaller blocks, movnti performs better, but for
> larger blocks, clflushopt has better performance.

There are other interactions to consider... see threads from the last
few years on the linux-nvdimm list.

For example, software generally expects that read()s take a long time
and avoids re-reading from disk; the normal pattern is to hold the data
in memory and read it from there. With normal stores, the CPU caches
end up holding a lot of persistent-memory data that is probably not
going to be read again any time soon, bumping out more useful data. In
contrast, movnti avoids filling the CPU caches.

Another option is the AVX vmovntdq instruction (if available), the most
recent of which does 64-byte (cache-line-sized) transfers from zmm
registers. There is a hefty context-switching overhead (e.g., 304
clocks), and the CPU often runs AVX instructions at a slower clock
frequency, so it's hard to judge when it's worthwhile.

In user space, glibc faces similar choices for its memcpy() functions;
glibc memcpy() uses non-temporal stores for transfers > 75% of the L3
cache size divided by the number of cores. For example, with
glibc-2.26-16.fc27 (August 2017) on a Broadwell system with an E5-2699
(36 cores, 45 MiB L3 cache), non-temporal stores are used for memcpy()s
over 36 MiB.

It'd be nice if glibc, PMDK, and the kernel all used the same
algorithms.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
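
To make the size-based dispatch discussed above concrete, here is a
minimal user-space sketch (C with x86-64 intrinsics): small copies use
non-temporal movnti stores, larger copies use ordinary cached stores
followed by clflushopt on each touched cache line. The pmem_copy,
copy_nt, and copy_cached_flush names and the 1 KiB crossover are
hypothetical, chosen only for illustration; this is not the kernel's
memcpy_flushcache nor glibc's memcpy, and the real crossover point
depends on the platform.

/*
 * Hypothetical sketch of a size-based non-temporal vs. cached+flush
 * dispatch.  dst is assumed to point at (e.g., DAX-mapped) persistent
 * memory.  Build on x86-64 with: gcc -O2 -mclflushopt
 */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CACHELINE    64
#define NT_THRESHOLD 1024          /* hypothetical crossover point */

/* Non-temporal path: 8-byte movnti stores; assumes len % 8 == 0. */
static void copy_nt(void *dst, const void *src, size_t len)
{
	long long *d = dst;
	const long long *s = src;
	size_t i;

	for (i = 0; i < len / 8; i++)
		_mm_stream_si64(&d[i], s[i]);
	_mm_sfence();              /* order the weakly-ordered stores */
}

/* Cached path: plain memcpy, then flush every touched cache line. */
static void copy_cached_flush(void *dst, const void *src, size_t len)
{
	uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);

	memcpy(dst, src, len);
	for (; p < (uintptr_t)dst + len; p += CACHELINE)
		_mm_clflushopt((void *)p);
	_mm_sfence();              /* order the flushes before later writes */
}

/* Dispatch on size, mirroring the measurements quoted above. */
void pmem_copy(void *dst, const void *src, size_t len)
{
	if (len < NT_THRESHOLD)
		copy_nt(dst, src, len);
	else
		copy_cached_flush(dst, src, len);
}

The sfence at the end of each path keeps the weakly-ordered movnti
stores and clflushopt flushes from being reordered past subsequent
writes; without it, later stores could become visible or persistent
before the copied data.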