On Fri, 2016-08-05 at 21:27 +1000, Dave Chinner wrote:
> [ cut to just the important points ]
> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> > On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> > > If I drop the fsync from the
> > > buffered IO path, bandwidth remains the same but runtime drops to
> > > 0.55-0.57s, so again the buffered IO write path is faster than DAX
> > > while doing more work.
> >
> > I do not think the test results are relevant on this point because both
> > buffered and dax write() paths use uncached copy to avoid clflush. The
> > buffered path uses cached copy to the page cache and then uses uncached
> > copy to PMEM via writeback. Therefore, the buffered IO path also benefits
> > from using uncached copy to avoid clflush.
>
> Except that I tested without the writeback path for buffered IO, so
> there was a direct comparison for single cached copy vs single
> uncached copy.

I agree that the result showed a tentative comparison of cached copy vs
uncached copy. My point, however, is that writes to PMEM need to persist,
unlike writes to the page cache. So for PMEM, the comparison should be
(cached copy + clflush) vs uncached copy.

> The undeniable fact is that a write() with a single cached copy with
> all the overhead of dirty page tracking is /faster/ than a much
> shorter, simpler IO path that uses an uncached copy. That's what the
> numbers say....

This cost evaluation needs to include the cost of clflush for cached copy.
Also, page cache allocation tends to be faster than disk block allocation.

> > Cached copy (rep movq) is slightly faster than uncached copy,
>
> Not according to Boaz - he claims that uncached is 20% faster than
> cached. How about you two get together, do some benchmarking and get
> your story straight, eh?

I vaguely remember seeing such results, but I may be wrong about that.
Here are the results of performance tests Robert Elliott conducted earlier:
https://lkml.org/lkml/2015/4/2/453

To quote the results relevant to this topic:

 - Cached copy                 2.5 M
 - Uncached copy w/ MOVNTI     2.6 M
 - Uncached copy w/ MOVNTDQ    3.5 M

Note that we use MOVNTI today, not MOVNTDQ. We instrumented a MOVNTDQ copy
function for this test. We can further improve the copy performance by
using MOVNTDQ.

> > and should be used for writing to the page cache. For writing to PMEM,
> > however, additional clflush can be expensive, and allocating cachelines
> > for PMEM leads to evicting the application's cachelines.
>
> I keep hearing people tell me why cached copies are slower, but
> no-one is providing numbers to back up their statements. The only
> numbers we have are the ones I've published showing cached copies w/
> full dirty tracking is faster than uncached copy w/o dirty tracking.
>
> Show me the numbers that back up your statements, then I'll listen
> to you.

Please see above. Cached copy requires clflush on top of that.

Thanks,
-Toshi
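
For reference, the relative differences between the three figures quoted
above can be worked out as in the sketch below. It is only arithmetic on
the numbers from Robert Elliott's results; the variable names are
illustrative, and the unit ("M") is carried over as given without
interpreting it.

```python
# Figures quoted in the message above (unit "M" as given in the source).
results = {
    "cached copy": 2.5,
    "uncached copy w/ MOVNTI": 2.6,
    "uncached copy w/ MOVNTDQ": 3.5,
}

baseline = results["cached copy"]

# Relative gain of each copy method over the cached copy, in percent,
# rounded to one decimal place to hide floating-point noise.
relative_gain = {
    name: round((value / baseline - 1.0) * 100.0, 1)
    for name, value in results.items()
}

for name, gain in relative_gain.items():
    print(f"{name}: {gain:+.1f}% vs cached copy")
```

On these figures, MOVNTI comes out about 4% ahead of the cached copy and
MOVNTDQ about 40% ahead. The cached-copy figure does not include the
clflush that a cached copy to PMEM would need on top.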