On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> On Mon, Aug 01, 2016 at 01:13:45PM +0300, Boaz Harrosh wrote:
> > On 07/30/2016 03:12 AM, Dave Chinner wrote:
> > <>
> > > If we track the dirty blocks from write in the radix tree like we
> > > do for mmap, then we can just use a normal memcpy() in dax_do_io(),
> > > getting rid of the slow cache bypass that is currently run. Radix
> > > tree updates are much less expensive than a slow memcpy of large
> > > amounts of data, and fsync can then take care of persistence, just
> > > like we do for mmap.
> > >
> >
> > No!
> >
> > mov_nt instructions, That "slow cache bypass that is currently run" above
> > is actually faster than cached writes by 20%, and if you add the dirty
> > tracking and cl_flush instructions it becomes x2 slower in the most
> > optimal case and 3 times slower in the DAX case.
>
> IOWs, we'd expect writing to a file with DAX to be faster than when
> buffered through the page cache and fsync()d, right?
>
> The numbers I get say otherwise.
> Filesystem on 8GB pmem block device:
>
> $ sudo mkfs.xfs -f /dev/pmem1
> meta-data=/dev/pmem1             isize=512    agcount=4, agsize=524288 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
> data     =                       bsize=4096   blocks=2097152, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> Test command that writes 1GB to the filesystem:
>
> $ sudo time xfs_io -f -c "pwrite 0 1g" -c "sync" /mnt/scratch/foo
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0:00:01.00 (880.040 MiB/sec and 225290.3317 ops/sec)
> 0.02user 1.13system 0:02.27elapsed 51%CPU (0avgtext+0avgdata 2344maxresident)k
> 0inputs+0outputs (0major+109minor)pagefaults 0swaps
>
> Results:
>
>          pwrite B/W (MiB/s)        runtime
> run      no DAX      DAX         no DAX    DAX
>  1       880.040     236.352     2.27s     4.34s
>  2       857.094     257.297     2.18s     3.99s
>  3       865.820     236.087     2.13s     4.34s
>
> It is quite clear that *DAX is much slower* than normal buffered
> IO through the page cache followed by a fsync().
>
> Stop and think why that might be. We're only doing one copy with
> DAX, so why is the pwrite() speed 4x lower than for a copy into the
> page cache? We're not copying 4x the data here. We're copying it
> once. But there's another uncached write to each page during
> allocation to zero each block first, so we're actually doing two
> uncached writes to the page. And we're doing an allocation per page
> with DAX, whereas we're using delayed allocation in the buffered IO
> case which has much less overhead.
>
> The only thing we can do here to speed the DAX case up is do cached
> memcpy so that the data copy after zeroing runs at L1 cache speed
> (i.e. 50x faster than it currently does).
>
> Let's take the allocation out of it, eh?
> Let's do overwrite instead,
> fsync in the buffered IO case, no fsync for DAX:
>
>          pwrite B/W (MiB/s)        runtime
> run      no DAX      DAX         no DAX    DAX
>  1       1119        1125        1.85s     0.93s
>  2       1113        1121        1.83s     0.91s
>  3       1128        1078        1.80s     0.94s
>
> So, pwrite speeds are no different for DAX vs page cache IO. Also,
> now we can see the overhead of writeback - a second data copy to
> the pmem for the IO during fsync. If I take the fsync() away from
> the buffered IO, the runtime drops to 0.89-0.91s, which is identical
> to the DAX code. Given the DAX code has a shorter IO path than
> buffered IO, it's not showing any speed advantage from using uncached
> IO....
>
> Let's go back to the allocation case, but this time take advantage
> of the new iomap based IO path in XFS to amortise the DAX allocation
> overhead by using a 16MB IO size instead of 4k:
>
> $ sudo time xfs_io -f -c "pwrite 0 1g -b 16m" -c sync /mnt/scratch/foo
>
>          pwrite B/W (MiB/s)        runtime
> run      no DAX      DAX         no DAX    DAX
>  1       1344        1028        1.63s     1.03s
>  2       1410        980         1.62s     1.06s
>  3       1399        1032        1.72s     0.99s
>
> So, pwrite bandwidth of the copy into the page cache is still much
> higher than that of the DAX path, but now the allocation overhead
> is minimised and hence the double copy in the buffered IO writeback
> path shows up. For completeness, let's just run the overwrite case
> here which is effectively just competing memcpy implementations,
> fsync for buffered, no fsync for DAX:
>
>          pwrite B/W (MiB/s)        runtime
> run      no DAX      DAX         no DAX    DAX
>  1       1791        1727        1.53s     0.59s
>  2       1768        1726        1.57s     0.59s
>  3       1799        1729        1.55s     0.59s
>
> Again, runtime shows the overhead of the double copy in the buffered
> IO/writeback path. It also shows the overhead in the DAX path of the
> allocation zeroing vs overwrite. If I drop the fsync from the
> buffered IO path, bandwidth remains the same but runtime drops to
> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> while doing more work.
>
> IOWs, the overhead of dirty page tracking in the page cache mapping
> tree is not significant in terms of write() performance. Hence
> I fail to see why it should be significant in the DAX path - it will
> probably have less overhead because we have less to account for in
> the DAX write path. The only performance penalty for dirty tracking
> is in the fsync writeback path itself, and that's a separate issue
> for optimisation.

I mostly agree with the analysis, but I have a few comments about the
use of cached / uncached (movnt) copy to PMEM.

I do not think the test results are relevant on this point because both
the buffered and DAX write() paths use uncached copy to avoid clflush.
The buffered path uses cached copy to the page cache and then uses
uncached copy to PMEM via writeback. Therefore, the buffered IO path
also benefits from using uncached copy to avoid clflush.

Cached copy (rep movsq) is slightly faster than uncached copy, and should
be used for writing to the page cache. For writing to PMEM, however,
the additional clflush can be expensive, and allocating cachelines for
PMEM evicts the application's cachelines.

Thanks,
-Toshi
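
To make the cached vs uncached distinction above concrete, here is a
rough userspace sketch of the two copy strategies. It is not the kernel
code: the function names copy_cached_then_flush() and copy_nontemporal()
are made up for illustration, the 64-byte cache line size is an
assumption, and on non-x86-64 builds both functions simply fall back to
memcpy(). Whether either variant actually makes data durable on real
PMEM depends on the platform, which this sketch does not address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h>  /* _mm_stream_si64, _mm_clflush, _mm_mfence, _mm_sfence */
#endif

#define CACHELINE 64    /* assumed cache line size, not queried from the CPU */

/* Cached copy, then flush every cache line the destination touches. */
static void copy_cached_then_flush(void *dst, const void *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));
    if (len == 0)
        return;
    memcpy(dst, src, len);              /* allocates cache lines for dst */
#if defined(__x86_64__)
    uintptr_t p   = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)dst + len;
    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);   /* the extra cost discussed above */
    _mm_mfence();                       /* order the flushes before later accesses */
#endif
}

/* Uncached copy: non-temporal stores bypass the cache, so no clflush pass. */
static void copy_nontemporal(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__)
    if (((uintptr_t)dst & 7) == 0) {    /* only stream to 8-byte aligned dst */
        unsigned char *d = dst;
        const unsigned char *s = src;
        size_t i = 0;
        for (; i + 8 <= len; i += 8) {
            long long v;
            memcpy(&v, s + i, 8);       /* safe load from a possibly unaligned src */
            _mm_stream_si64((long long *)(d + i), v);
        }
        _mm_sfence();                   /* make the streamed stores globally visible */
        if (i < len)                    /* tail shorter than 8 bytes */
            copy_cached_then_flush(d + i, s + i, len - i);
        return;
    }
#endif
    copy_cached_then_flush(dst, src, len);  /* fallback path */
}
```

Both functions leave the same bytes in the destination; the difference
is only in how the cache is used on the way there. The first pulls the
destination lines into the cache and then pays for a clflush pass, while
the second avoids allocating those lines at all.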