[ add x86 and LKML ]

On Tue, Mar 31, 2020 at 5:27 AM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
>
>
> On Tue, 31 Mar 2020, Elliott, Robert (Servers) wrote:
>
> >
> >
> > > -----Original Message-----
> > > From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> > > Sent: Monday, March 30, 2020 6:32 AM
> > > To: Dan Williams <dan.j.williams@xxxxxxxxx>; Vishal Verma
> > > <vishal.l.verma@xxxxxxxxx>; Dave Jiang <dave.jiang@xxxxxxxxx>; Ira
> > > Weiny <ira.weiny@xxxxxxxxx>; Mike Snitzer <msnitzer@xxxxxxxxxx>
> > > Cc: linux-nvdimm@xxxxxxxxxxxx; dm-devel@xxxxxxxxxx
> > > Subject: [PATCH v2] memcpy_flushcache: use cache flusing for larger
> > > lengths
> > >
> > > I tested dm-writecache performance on a machine with Optane nvdimm
> > > and it turned out that for larger writes, cached stores + cache
> > > flushing perform better than non-temporal stores. This is the
> > > throughput of dm-writecache measured with this command:
> > > dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
> > >
> > > block size    512         1024        2048        4096
> > > movnti        496 MB/s    642 MB/s    725 MB/s    744 MB/s
> > > clflushopt    373 MB/s    688 MB/s    1.1 GB/s    1.2 GB/s
> > >
> > > We can see that for smaller blocks, movnti performs better, but for
> > > larger blocks, clflushopt has better performance.
> >
> > There are other interactions to consider... see threads from the last
> > few years on the linux-nvdimm list.
>
> dm-writecache is the only linux driver that uses memcpy_flushcache on
> persistent memory. There is also the btt driver; it uses the "do_io"
> method to write to persistent memory, and I don't know where this
> method comes from.
>
> Anyway, if patching memcpy_flushcache conflicts with something else, we
> should introduce memcpy_flushcache_to_pmem.

> > For example, software generally expects that read()s take a long time and
> > avoids re-reading from disk; the normal pattern is to hold the data in
> > memory and read it from there. By using normal stores, CPU caches end up
> > holding a bunch of persistent memory data that is probably not going to
> > be read again any time soon, bumping out more useful data. In contrast,
> > movnti avoids filling the CPU caches.
>
> But if I write one cacheline and flush it immediately, it would consume
> just one associative entry in the cache.

> > Another option is the AVX vmovntdq instruction (if available), the
> > most recent of which does 64-byte (cache line) sized transfers to
> > zmm registers. There's a hefty context switching overhead (e.g.,
> > 304 clocks), and the CPU often runs AVX instructions at a slower
> > clock frequency, so it's hard to judge when it's worthwhile.
>
> The benchmark shows that 64-byte non-temporal avx512 vmovntdq is as good
> as 8-, 16- or 32-byte writes.
>
>                                          ram         nvdimm
> sequential write-nt 4 bytes              4.1 GB/s    1.3 GB/s
> sequential write-nt 8 bytes              4.1 GB/s    1.3 GB/s
> sequential write-nt 16 bytes (sse)       4.1 GB/s    1.3 GB/s
> sequential write-nt 32 bytes (avx)       4.2 GB/s    1.3 GB/s
> sequential write-nt 64 bytes (avx512)    4.1 GB/s    1.3 GB/s
>
> With cached writes (where each cache line is immediately followed by clwb
> or clflushopt), 8-, 16- or 32-byte writes perform better than non-temporal
> stores, and avx512 performs worse.
>
> sequential write 8 + clwb                5.1 GB/s    1.6 GB/s
> sequential write 16 (sse) + clwb         5.1 GB/s    1.6 GB/s
> sequential write 32 (avx) + clwb         4.4 GB/s    1.5 GB/s
> sequential write 64 (avx512) + clwb      1.7 GB/s    0.6 GB/s

This is indeed compelling straight-line data. My concern, similar to
Robert's, is what it does to the rest of the system.
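For anyone following along, the two flavors being compared above boil
down to loops roughly like the sketch below. This is only a user-space
illustration built on compiler intrinsics, not the kernel's
memcpy_flushcache() implementation; the helper names are made up,
alignment / remainder handling is omitted, and it assumes the length is
a multiple of the 64-byte cache line (build e.g. with gcc -O2 -mclwb):

#include <stddef.h>
#include <immintrin.h>  /* _mm_stream_si64, _mm_clwb, _mm_sfence */

/* movnti path: non-temporal 8-byte stores that bypass the CPU cache */
static void copy_nt(void *dst, const void *src, size_t len)
{
        long long *d = dst;
        const long long *s = src;

        for (; len >= 8; len -= 8)
                _mm_stream_si64(d++, *s++);
        _mm_sfence();   /* nt stores are weakly ordered; fence before relying on them */
}

/* cached-store path: plain 8-byte stores, then clwb each dirty line */
static void copy_clwb(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;
        size_t i;

        for (; len >= 64; len -= 64, d += 64, s += 64) {
                for (i = 0; i < 64; i += 8)
                        *(long long *)(d + i) = *(const long long *)(s + i);
                _mm_clwb(d);    /* write the dirty line back toward memory */
        }
        _mm_sfence();   /* order the flushes before declaring the data written */
}

The relevant difference for this discussion is that the second variant
allocates and dirties each destination cache line and relies on the
explicit clwb to write it back, while the first never brings the line
into the cache at all.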
In addition to increasing cache pollution, which I agree is difficult
to quantify, it may also increase read-for-ownership traffic. Could you
collect 'perf stat' data for this clwb vs nt comparison to check whether
any of that incidental overhead shows up in the numbers? Here is a
'perf stat' line that might capture it:

perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetch-misses,LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses,LLC-prefetch-misses -r 5 $benchmark

In both cases, nt stores and explicit clwb, there's nothing that prevents
the dirty cache line or the fill buffer from being written back / flushed
before the full line is populated, and maybe you are hitting that scenario
differently with the two approaches? I did not immediately see a perf
counter for events like this.

Going forward I think this gets better with the movdir64b instruction,
because that can guarantee full-line-sized store-buffer writes. Maybe the
perf data can help make a decision about whether we go with your patch in
the near term?

> >
> > In user space, glibc faces similar choices for its memcpy() functions;
> > glibc memcpy() uses non-temporal stores for transfers > 75% of the
> > L3 cache size divided by the number of cores. For example, with
> > glibc-2.216-16.fc27 (August 2017), on a Broadwell system with
> > E5-2699 36 cores 45 MiB L3 cache, non-temporal stores are used
> > for memcpy()s over 36 MiB.
>
> BTW. what does glibc do with reads? Does it flush them from the cache
> after they are consumed?
>
> AFAIK glibc doesn't support persistent memory - i.e. there is no function
> that flushes data and the user has to use inline assembly for that.

Yes, and I don't know of any copy routines that try to limit the cache
pollution of pulling the source data for a copy, only the destination.

>
> It'd be nice if glibc, PMDK, and the kernel used the same algorithms.

Yes, it would. Although I think PMDK would make a different decision than
the kernel when optimizing for highest bandwidth for the local application
vs bandwidth efficiency across all applications.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel