>> [ ... ] there has been quite some other metadata related
>> performance improvements. Thus IMHO reducing the recent
>> improvements in metadata performance is underselling XFS and
>> overselling delaylog.

[ ... ]

> That's a good way of putting it, and I am pleased that I finally
> get a reasonable comment on this story, and one that agrees with
> one of my previous points in this thread: [ ... ]

[ ... ]

> http://xfs.org/images/d/d1/Xfs-scalability-lca2012.pdf
> «* Ext4 can be up 20-50x times than XFS when data is also being
> written as well (e.g. untarring kernel tarballs).
> * This is XFS @ 2009-2010.
> * Unless you have seriously fast storage, XFS just won't
> perform well on metadata modification heavy workloads.»
> It is never mentioned that 'ext4' is 20-50x faster on metadata
> modification workloads because it implements much weaker
> semantics than «XFS @ 2009-2010», and that 'delaylog' matches
> 'ext4' because it implements similarly weaker semantics, by
> reducing the frequency of commits, as the XFS FAQ briefly
> summarizes: [ ... ]

As to this, I have realized that there is a very big detail that I
have taken as implicit but that perhaps at this point should be made
explicit, regarding the deliberately misleading propaganda that
«Ext4 can be up 20-50x times than XFS when data is also being
written as well (e.g. untarring kernel tarballs).»:

  Almost all «untarring kernel tarballs» "benchmarks" are done
  with GNU 'tar', and it does not 'fsync'.

This matters because XFS has done the "right thing" with 'fsync'
for a long time, and if the application does 'fsync' then 'ext4' and
XFS with and without 'delaylog' are mostly equivalent.

Conversely Schilling's 'tar' does 'fsync' and as a result it is
often considered (by the gullible crowd to which the presentation
propaganda referred to above is addressed) to have less
"performance" than GNU 'tar'.
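The difference between the two extraction styles can be sketched in a few lines of shell. This is only a rough illustration, not either program's actual code: the helper names 'write_cached' and 'write_synced' are made up here, and it assumes a 'dd' that supports 'conv=fsync' (GNU coreutils does).

```shell
# Hypothetical helpers for illustration only; not taken from either 'tar'.
# Assumes a 'dd' with 'conv=fsync' support (e.g. GNU coreutils).

# Write N files of 4KiB into DIR and return as soon as write(2) has
# copied the data into the page cache, the way GNU 'tar' extracts.
write_cached() {
    dir=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        dd if=/dev/zero of="$dir/f$i" bs=4096 count=1 2>/dev/null || return 1
        i=$((i + 1))
    done
}

# Same, but 'conv=fsync' makes 'dd' call fsync(2) on each file before
# it exits, the way Schilling's 'tar' extracts by default.
write_synced() {
    dir=$1
    n=$2
    i=0
    while [ "$i" -lt "$n" ]; do
        dd if=/dev/zero of="$dir/f$i" bs=4096 count=1 conv=fsync 2>/dev/null || return 1
        i=$((i + 1))
    done
}
```

Timing each function on the filesystem under test, followed by a timed 'sync', separates the time spent filling the page cache from the time spent actually committing to disk.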
To illustrate I have made a tiny test '.tar' file with a directory
and two files within, and this is what happens with Schilling's
'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write star xf d.tar
  open("d.tar", O_RDONLY) = 7
  read(7, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
  Process 8201 attached
  [ ... ]
  [pid 8200] lstat("d/", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid 8200] lstat("d/", 0x7fff174d9330) = -1 ENOENT (No such file or directory)
  [pid 8200] access("d", F_OK) = -1 ENOENT (No such file or directory)
  [pid 8200] mkdir("d", 0700) = 0
  [pid 8200] lstat("d/", {st_mode=S_IFDIR|0700, st_size=6, ...}) = 0
  [pid 8200] lstat("d/f1", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  [pid 8200] open("d/f1", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  [pid 8200] write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  [pid 8200] fsync(4 <unfinished ...>
  [pid 8201] <... write resumed> ) = 1
  [pid 8201] read(7, "", 10240) = 0
  Process 8201 detached
  <... fsync resumed> ) = 0
  --- SIGCHLD (Child exited) @ 0 (0) ---
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  utimes("d/f1", {{1332588240, 0}, {1332588240, 0}}) = 0
  lstat("d/f2", 0x7fff174d9490) = -1 ENOENT (No such file or directory)
  open("d/f2", O_WRONLY|O_CREAT|O_TRUNC, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  fsync(4) = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d/f2", {{1332588257, 0}, {1332588257, 0}}) = 0
  utimes("d", {{1332588242, 0}, {1332588242, 0}}) = 0
  write(2, "star: 1 blocks + 0 bytes (total "..., 58star: 1 blocks + 0 bytes (total of 10240 bytes = 10.00k).
  ) = 58

Compare with GNU 'tar':

  $ strace -f -e trace=file,fsync,fdatasync,read,write tar xf d.tar
  [ ... ]
  open("d.tar", O_RDONLY) = 3
  read(3, "d/\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 10240) = 10240
  [ ... ]
  mkdir("d", 0700) = -1 EEXIST (File exists)
  stat("d", {st_mode=S_IFDIR|0700, st_size=24, ...}) = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f1") = 0
  open("d/f1", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "3\275@&{U(\356\332\25z\250\236\256v\6U[5\334\265\313\206:\351\335\366Q\21\231\210H"..., 128) = 128
  close(4) = 0
  utimensat(AT_FDCWD, "d/f1", {{1332589368, 193330071}, {1332588240, 0}}, 0) = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
  unlink("d/f2") = 0
  open("d/f2", O_WRONLY|O_CREAT|O_EXCL, 0600) = 4
  write(4, "\377\325\253\257,\210\2719e\24\347*P\325x\357\345\220\375Ei\375\355\22063\17\355\312.\6\347"..., 4096) = 4096
  close(4) = 0
  utimensat(AT_FDCWD, "d/f2", {{1332589368, 193330071}, {1332588257, 0}}, 0) = 0
  close(3) = 0
  utimensat(AT_FDCWD, "d", {{1332589368, 193330071}, {1332588242, 0}}, 0) = 0
  close(1) = 0
  close(2) = 0

In effect running GNU 'tar x' is the same as running
'eatmydata tar x ...'; and indeed as its documentation says,
'eatmydata' is designed to achieve higher "performance" by turning
programs that behave like Schilling's 'tar' into programs that
behave like GNU 'tar'.

When GNU 'tar' is used as a "benchmark" for 'delaylog' and there
are no 'fsync's, the longer the interval between commits (and thus
the implicit unsafety) the higher the "performance", or at least
that's the argument I think propagandists and buffoons may be
using.

That's one important reason why I mentioned 'eatmydata' as one
performance enhancing technique in a group with 'nobarrier' and
'delaylog'; and why I was amused by this buffoonery:

  «So you're comparing delaylog's volatile buffer architecture to
  software that *intentionally and transparently disables fsync*?»

Because when the 'delaylog' propagandists write that:

  «Ext4 can be up 20-50x times than XFS when data is also being
  written as well (e.g.
  untarring kernel tarballs).»

it is they who are comparing "performance" using GNU 'tar', which
intentionally and transparently does not use 'fsync' at all.

To illustrate here are some "benchmarks", which hopefully should be
revealing as to the merit of the posturings of some of the buffoons
or propagandists that have been discontributing to this discussion
(note that there are somewhat subtle details both as to the setup
and the results):

--------------------------------------------------------------
  # uname -a
  Linux base.ty.sabi.co.uk 2.6.18-274.18.1.el5 #1 SMP Thu Feb 9 12:20:03 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
  # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
  none /tmp tmpfs rw 0 0
  /dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
  /dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
  vm.dirty_background_bytes = 900000000
  vm.dirty_bytes = 500000000
  vm.dirty_expire_centisecs = 2000
  vm.dirty_writeback_centisecs = 1000
--------------------------------------------------------------
  # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m1.027s
  user    0m0.105s
  sys     0m0.922s
  Dirty: 419700 kB
  Writeback: 0 kB

  real    0m5.163s
  user    0m0.000s
  sys     0m0.473s
--------------------------------------------------------------
  # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -no-fsync -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
  star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

  real    0m1.204s
  user    0m0.139s
  sys     0m1.270s
  Dirty: 419456 kB
  Writeback: 0 kB

  real    0m5.012s
  user    0m0.000s
  sys     0m0.458s
--------------------------------------------------------------
  # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
  star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

  real    23m29.346s
  user    0m0.327s
  sys     0m2.280s
  Dirty: 108 kB
  Writeback: 0 kB

  real    0m0.236s
  user    0m0.000s
  sys     0m0.199s
--------------------------------------------------------------
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m46.554s
  user    0m0.107s
  sys     0m1.271s
  Dirty: 415168 kB
  Writeback: 0 kB

  real    1m54.913s
  user    0m0.000s
  sys     0m0.325s
----------------------------------------------------------------
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
  star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

  real    60m15.723s
  user    0m0.442s
  sys     0m7.009s
  Dirty: 4 kB
  Writeback: 0 kB

  real    0m0.222s
  user    0m0.000s
  sys     0m0.194s
----------------------------------------------------------------

From the above my conclusion is that «XFS @ 2009-2010» has half the
performance of 'ext4' on this workload, and that «Ext4 can be up
20-50x times than XFS when data is also being written as well (e.g.
untarring kernel tarballs).» holds only when both data and metadata
are written to RAM by 'ext4'.

One can spend a lot of time changing parameters, as in using
'delaylog' or 'nobarrier' etc.
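The ratios behind that conclusion can be re-derived from the 'real' times reported above. What follows is a minimal sketch, assuming a POSIX shell with 'awk'; the helper names 'to_seconds' and 'ratio' are made up for illustration, and the times of the final 'sync' are left out to keep it short.

```shell
# Hypothetical helpers for illustration; the times are copied from the
# 'real' lines of the first set of runs above (final 'sync' not included).

# Convert a "XmY.YYYs" duration, as printed by the shell's 'time', to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print A divided by B with one decimal digit.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

star_ext4=$(to_seconds 23m29.346s)   # 'star -x' on 'ext4', with 'fsync'
star_xfs=$(to_seconds 60m15.723s)    # 'star -x' on XFS, with 'fsync'
tar_ext4=$(to_seconds 0m1.027s)      # GNU 'tar -x' on 'ext4', no 'fsync'
tar_xfs=$(to_seconds 0m46.554s)      # GNU 'tar -x' on XFS, no 'fsync'

echo "with fsync (star):   XFS/ext4 = $(ratio "$star_xfs" "$star_ext4")"
echo "without fsync (tar): XFS/ext4 = $(ratio "$tar_xfs" "$tar_ext4")"
```

With those figures it prints about 2.6 for the 'star' runs, where every file is 'fsync'ed, and about 45.3 for the GNU 'tar' runs, where nothing is.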

I have tried my favourite, rather "tighter", flusher parameters;
here are some comparisons that I find interesting:

----------------------------------------------------------------
  # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl vm | egrep '_(bytes|centisecs)' | sort
  none /tmp tmpfs rw 0 0
  /dev/sdd3 /tmp/ext4 ext4 rw,barrier=1,data=ordered 0 0
  /dev/sdd8 /tmp/xfs xfs rw,nouuid,attr2,inode64,logbsize=256k,sunit=8,swidth=8,noquota 0 0
  vm.dirty_background_bytes = 900000000
  vm.dirty_bytes = 100000
  vm.dirty_expire_centisecs = 200
  vm.dirty_writeback_centisecs = 100
  # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m6.776s
  user    0m0.107s
  sys     0m1.260s
  Dirty: 1776 kB
  Writeback: 0 kB

  real    0m0.231s
  user    0m0.000s
  sys     0m0.197s
----------------------------------------------------------------
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    2m25.805s
  user    0m0.135s
  sys     0m1.812s
  Dirty: 2372 kB
  Writeback: 84 kB

  real    0m1.683s
  user    0m0.000s
  sys     0m0.196s
----------------------------------------------------------------

That's a bit of a surprise, because the times to completion with
'eatmydata tar' were the same on both when the flusher parameters
allowed writing entirely to memory.
It looks like when flushing XFS still does a fair number of
implicit metadata commits, as switching off barriers shows:

----------------------------------------------------------------
  # mount -o remount,barrier=0 /dev/sdd8 /tmp/ext4
  # (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m7.388s
  user    0m0.127s
  sys     0m1.235s
  Dirty: 508 kB
  Writeback: 0 kB

  real    0m0.243s
  user    0m0.000s
  sys     0m0.199s
----------------------------------------------------------------
  # mount -o remount,nobarrier /dev/sdd3 /tmp/xfs
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m31.047s
  user    0m0.124s
  sys     0m1.880s
  Dirty: 2324 kB
  Writeback: 24 kB

  real    0m0.269s
  user    0m0.000s
  sys     0m0.195s
----------------------------------------------------------------

Meanwhile it seems likely that 'ext4' runs headlong without commits
on either metadata or data ('ext4' and 'ext3' in effect have a
rather loose 'delaylog').

XFS however seems to be at a bit of a disadvantage, as with
'nobarrier' and 'eatmydata tar' the time to completion should be
the same. The partition for XFS is on inner tracks, but that does
not make that much of a difference.

Also compare with 'ext4' using 'eatmydata tar' and using 'star', in
both cases with no barriers and also 'data=writeback':

----------------------------------------------------------------
  base# umount /tmp/ext4; mount -t ext4 -o defaults,barrier=0,data=writeback /dev/sdd3 /tmp/ext4
  base# (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m6.158s
  user    0m0.123s
  sys     0m1.233s
  Dirty: 1704 kB
  Writeback: 0 kB

  real    0m0.247s
  user    0m0.001s
  sys     0m0.194s
----------------------------------------------------------------
  base# (cd /tmp/ext4; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
  star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

  real    0m32.101s
  user    0m0.196s
  sys     0m1.718s
  Dirty: 24 kB
  Writeback: 48 kB

  real    0m0.217s
  user    0m0.000s
  sys     0m0.193s
----------------------------------------------------------------

Finally here is the same on XFS, with 'delaylog', on a system with
a 3.x kernel and a rather fast (especially on small random writes)
SSD drive (and my usual tighter flusher parameters):

----------------------------------------------------------------
  # uname -a
  Linux.ty.sabi.co.UK 3.0.0-15-generic #26~lucid1-Ubuntu SMP Wed Jan 25 15:37:10 UTC 2012 x86_64 GNU/Linux
  # egrep ' (/tmp|/tmp/(ext4|xfs))' /proc/mounts; sysctl -a 2>/dev/null | egrep '_(bytes|centisecs)' | sort
  none /tmp tmpfs rw,relatime,size=1024000k 0 0
  /dev/sda6 /tmp/xfs xfs rw,noatime,nodiratime,attr2,delaylog,discard,inode64,logbsize=256k,sunit=16,swidth=8192,noquota 0 0
  /dev/sda3 /tmp/ext4 ext4 rw,nodiratime,relatime,errors=remount-ro,user_xattr,acl,barrier=1,data=ordered,discard 0 0
  fs.xfs.age_buffer_centisecs = 1500
  fs.xfs.filestream_centisecs = 3000
  fs.xfs.xfsbufd_centisecs = 100
  fs.xfs.xfssyncd_centisecs = 3000
  vm.dirty_background_bytes = 900000000
  vm.dirty_bytes = 100000000
  vm.dirty_expire_centisecs = 200
  vm.dirty_writeback_centisecs = 100
----------------------------------------------------------------
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time tar -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)

  real    0m5.148s
  user    0m0.300s
  sys     0m2.876s
  Dirty: 50052 kB
  Writeback: 0 kB
  WritebackTmp: 0 kB

  real    0m0.784s
  user    0m0.000s
  sys     0m0.100s
----------------------------------------------------------------
  # (cd /tmp/xfs; rm -rf linux-2.6.32; sync; time star -x -f /tmp/linux-2.6.32.tar; egrep 'Dirty|Writeback' /proc/meminfo; time sync)
  star: 37343 blocks + 0 bytes (total of 382392320 bytes = 373430.00k).

  real    6m21.946s
  user    0m0.808s
  sys     0m11.321s
  Dirty: 0 kB
  Writeback: 0 kB
  WritebackTmp: 0 kB

  real    0m0.097s
  user    0m0.000s
  sys     0m0.044s
----------------------------------------------------------------

The effect of 'delaylog' is pretty obvious there.

The numbers above, with their wide variation depending on changes
in the level of safety requested, amply demonstrate that it takes
the skills of a propagandist or a buffoon to boast about the
"performance" of 'delaylog' and comparisons with 'ext4' without
prominently mentioning the big safety tradeoffs involved.