Dave Chinner wrote on 2016-08-05:
> [ cut to just the important points ]
> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>> If I drop the fsync from the
>>> buffered IO path, bandwidth remains the same but runtime drops to
>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
>>> while doing more work.
>>
>> I do not think the test results are relevant on this point because both
>> buffered and dax write() paths use uncached copy to avoid clflush.  The
>> buffered path uses cached copy to the page cache and then uses uncached copy to
>> PMEM via writeback.  Therefore, the buffered IO path also benefits from using
>> uncached copy to avoid clflush.
>
> Except that I tested without the writeback path for buffered IO, so
> there was a direct comparison for single cached copy vs single
> uncached copy.
>
> The undeniable fact is that a write() with a single cached copy with
> all the overhead of dirty page tracking is /faster/ than a much
> shorter, simpler IO path that uses an uncached copy. That's what the
> numbers say....
>
>> Cached copy (req movq) is slightly faster than uncached copy,
>
> Not according to Boaz - he claims that uncached is 20% faster than
> cached. How about you two get together, do some benchmarking and get
> your story straight, eh?
>
>> and should be
>> used for writing to the page cache.  For writing to PMEM, however, additional
>> clflush can be expensive, and allocating cachelines for PMEM leads to evict
>> application's cachelines.
>
> I keep hearing people tell me why cached copies are slower, but
> no-one is providing numbers to back up their statements. The only
> numbers we have are the ones I've published showing cached copies w/
> full dirty tracking is faster than uncached copy w/o dirty tracking.
>
> Show me the numbers that back up your statements, then I'll listen
> to you.

Here are some numbers for a particular scenario, and the code is below.

Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
(1M total memcpy()s).  For the cached+clflush case, the flushes are done
every 4MiB (which seems slightly faster than flushing every 16KiB):

                  NUMA local    NUMA remote
Cached+clflush       13.5          37.1
movnt                 1.0           1.3

In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <libpmem.h>

/*
 * gcc -Wall -O2 -m64 -mcx16 -o memcpyperf memcpyperf.c -lpmem
 *
 * Not sure if -mcx16 allows gcc to use faster memcpy bits?
 */

/*
 * our source buffer.  we'll copy this much at a time.
 * align it so that memcpy() doesn't have to do anything funny.
 */
char __attribute__((aligned(0x100))) src[4 * 4096];

int
main(
    int         argc,
    char*       argv[]
)
{
    char*       path;
    char        mode;
    int         nloops;
    char*       dstbase;
    size_t      dstsz;
    int         ispmem;
    int         cpysz;
    char*       dst;

    if (argc != 4) {
        fprintf(stderr, "ERROR: usage: "
                "memcpyperf [cached | nt] PATH NLOOPS\n");
        exit(1);
    }

    /* only the first character of the mode is checked: 'c' is cached */
    mode = argv[1][0];
    path = argv[2];
    nloops = atoi(argv[3]);

    /* len 0 and flags 0: map the whole existing file */
    dstbase = pmem_map_file(path, 0, 0, 0, &dstsz, &ispmem);
    if (NULL == dstbase) {
        perror(path);
        exit(1);
    }

    if (!ispmem)
        fprintf(stderr, "WARNING: %s is not pmem\n", path);

    if (dstsz < sizeof(src))
        cpysz = (int)dstsz;
    else
        cpysz = sizeof(src);

    /* dstsz is a size_t, so print it with %zu */
    fprintf(stderr, "INFO: dst %p src %p dstsz %zu cpysz %d\n",
            dstbase, src, dstsz, cpysz);

    dst = dstbase;
    while (nloops--) {
        if (mode == 'c') {
            memcpy(dst, src, cpysz);
            /*
             * we could do the clflush here on the 16KiB we just
             * wrote, but with a 4MiB file (dst buffer) and 16KiB
             * src buffer, it seems slightly faster to flush the
             * entire 4MiB below
             */
            //pmem_persist(dst, cpysz);
        } else {
            pmem_memcpy_persist(dst, src, cpysz);
        }

        dst += cpysz;
        /* wrap to the start once the next copy would not fit */
        if ((size_t)((dst + cpysz) - dstbase) > dstsz) {
            dst = dstbase;
            /* see note above */
            if (mode == 'c')
                pmem_persist(dst, dstsz);
        }
    }

    exit(0);
}
/* main() */
EOF

Sample runs:

$ numactl -N0 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f3f1a000000 src 0x601200 dstsz 4194304 cpysz 16384
real 13.53
user 13.53
sys 0.00

$ numactl -N0 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f2b54600000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.04
user 1.04
sys 0.00

$ numactl -N1 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f8f8c200000 src 0x601200 dstsz 4194304 cpysz 16384
real 37.13
user 37.15
sys 0.00

$ numactl -N1 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f77f7400000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.24
user 1.24
sys 0.00

Brian