Jan Kara wrote on 2016-08-08:
> On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
>> Dave Chinner wrote on 2016-08-05:
>>> [ cut to just the important points ]
>>> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>>>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>>>> If I drop the fsync from the
>>>>> buffered IO path, bandwidth remains the same but runtime drops to
>>>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
>>>>> while doing more work.
>>>>
>>>> I do not think the test results are relevant on this point because both
>>>> buffered and dax write() paths use uncached copy to avoid clflush. The
>>>> buffered path uses cached copy to the page cache and then uses uncached
>>>> copy to PMEM via writeback. Therefore, the buffered IO path also benefits
>>>> from using uncached copy to avoid clflush.
>>>
>>> Except that I tested without the writeback path for buffered IO, so
>>> there was a direct comparison for single cached copy vs single
>>> uncached copy.
>>>
>>> The undeniable fact is that a write() with a single cached copy with
>>> all the overhead of dirty page tracking is /faster/ than a much
>>> shorter, simpler IO path that uses an uncached copy. That's what the
>>> numbers say....
>>>
>>>> Cached copy (req movq) is slightly faster than uncached copy,
>>>
>>> Not according to Boaz - he claims that uncached is 20% faster than
>>> cached. How about you two get together, do some benchmarking and get
>>> your story straight, eh?
>>>
>>>> and should be
>>>> used for writing to the page cache. For writing to PMEM, however,
>>>> additional clflush can be expensive, and allocating cachelines for PMEM
>>>> leads to evict application's cachelines.
>>>
>>> I keep hearing people tell me why cached copies are slower, but
>>> no-one is providing numbers to back up their statements.
>>> The only numbers we have are the ones I've published showing cached
>>> copies w/ full dirty tracking is faster than uncached copy w/o dirty
>>> tracking.
>>>
>>> Show me the numbers that back up your statements, then I'll listen
>>> to you.
>>
>> Here are some numbers for a particular scenario, and the code is below.
>>
>> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
>> (1M total memcpy()s). For the cached+clflush case, the flushes are done
>> every 4MiB (which seems slightly faster than flushing every 16KiB):
>>
>>                   NUMA local    NUMA remote
>> Cached+clflush       13.5          37.1
>> movnt                 1.0           1.3
>
> Thanks for the test Brian. But looking at the current source of libpmem
> this seems to be comparing apples to oranges. Let me explain the details
> below:
>
>> In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
>> and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:
>
> Yes. libpmem does what you describe above and the name
> pmem_memcpy_persist() is thus currently misleading because it is not
> guaranteed to be persistent with the current implementation of DAX in
> the kernel.
>
> It is important to know which kernel version and what filesystem you have
> used for the test to be able to judge the details but generally
> pmem_persist() does properly tell the filesystem to flush all metadata
> associated with the file, commit open transactions etc. That's the full
> cost of persistence.

I used NVML 1.1 for the measurements. In this version and with the
hardware that I used, the pmem_persist() flow is:

pmem_persist()
  pmem_flush()
    Func_flush() == flush_clflush
      CLFLUSH
  pmem_drain()
    Func_predrain_fence() == predrain_fence_empty
      no-op

So, I don't think that pmem_persist() does anything to cause the filesystem
to flush metadata, as it doesn't make any system calls?
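
To make the two cases concrete, here is a rough sketch of the two copy
strategies being compared. It is neither the benchmark code nor libpmem's
implementation: the names copy_cached_clflush() and copy_movnt() are made up
for illustration, and it assumes x86-64 with SSE2 and a 64-byte cache line.
The first function follows the flow listed above (CLFLUSH per line, no
fence afterwards); the second uses non-temporal stores followed by an SFENCE.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_clflush, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHELINE 64    /* assumed cache line size */

/* Cached copy, then CLFLUSH every cache line that the range touches.
 * No fence follows, mirroring the no-op drain step described above. */
void copy_cached_clflush(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);

    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)dst + len;
    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
}

/* Uncached copy: 16-byte non-temporal stores, then one SFENCE.
 * Simplification: dst must be 16-byte aligned and len a multiple of 16. */
void copy_movnt(void *dst, const void *src, size_t len)
{
    assert((uintptr_t)dst % 16 == 0 && len % 16 == 0);

    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
    _mm_sfence();
}
```

Neither function makes a system call, so neither gives the filesystem any
opportunity to flush metadata.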

> pmem_memcpy_persist() makes sure the data writes have reached persistent
> storage but nothing guarantees associated metadata changes have reached
> persistent storage as well.

While metadata is certainly important, my goal with this specific test was
to measure the "raw" performance of cached+flush vs uncached, without
anything else in the way.

> To assure that, fsync() (or pmem_persist()
> if you wish) is currently the only way from userspace.

Perhaps you mean pmem_msync() here? pmem_msync() calls msync(), but
pmem_persist() does not.

> At which point
> you've lost most of the advantages using movnt. Ross researches into
> possibilities of allowing more efficient userspace implementation but
> currently there are none.

Apart from the current performance discussion, if the metadata for a file
is already established (file created, space allocated by explicit write()s,
and everything synced), then if I map it and do pmem_memcpy_persist(),
are there any "ongoing" metadata updates that would need to be flushed
(besides timestamps)?

Brian
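
For contrast with a pure user-space flush, here is a minimal sketch of the
msync() route mentioned above. It does not use libpmem at all; the helper
name store_with_msync() is hypothetical, and the file is an ordinary file
rather than a DAX mapping. What matters is that msync() is a system call,
so the kernel and filesystem are involved, which is not the case for a
CLFLUSH issued from user space.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create (or truncate) path, size it to len bytes, copy buf into a shared
 * mapping of it and write the range back with msync(MS_SYNC).
 * Returns 0 on success, -1 on any failure. */
int store_with_msync(const char *path, const void *buf, size_t len)
{
    assert(len > 0);    /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    int rc = -1;
    if (ftruncate(fd, (off_t)len) == 0) {
        void *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memcpy(map, buf, len);          /* plain stores into the mapping */
            rc = msync(map, len, MS_SYNC);  /* enters the kernel and waits */
            if (munmap(map, len) != 0)
                rc = -1;
        }
    }
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```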