On Sun, May 01, 2011 at 03:38:25PM +0100, Peter Grandi wrote:
> > [ ... ]
>
> [ ... Extracting a kernel 'tar' with GNU tar on 'ext3': ]
> >>> real 0m21.769s
> [ ... Extracting a kernel 'tar' with GNU tar on XFS: ]
> >>> real 2m20.522s
>
> >> [ ... ] in most cases the wrong number is the one for 'ext3'
> >> on RAID1 (way too small). Even the number for XFS and RAID0
> >> 'delaylog' is a wrong number (somewhat small) in many cases.
>
> >> There are 38000 files in 440MB in 'linux-2.6.38.tar', ~40% of
> >> them are smaller than 4KiB and ~60% smaller than 8KiB. Also you
> >> didn't flush caches, and you don't say whether the filesystems
> >> are empty or full or at the same position on the disk.
> >>
> >> Can 'ext3' really commit 1900 small files per second (including
> >> directory updates) to a filesystem on a RAID1 that probably can
> >> do around 100 IOPS? That would be amazing news.
>
> In the real world 'ext3' as reported in my previous message can
> "really commit" around 50 "small files per second (including
> directory updates)" in near-optimal conditions to a storage
> device that can probably do around 100 IOPS; copying here the
> actual numbers:
>
> % mount -t ext3 -o relatime /dev/sdb /mnt/sdb
> % time sh -c 'cd /mnt/sdb; star -x -b 2048 -f /tmp/linux-2.6.38.tar; cd /; umount /mnt/sdb'
> star: 420 blocks + 81920 bytes (total of 440483840 bytes = 430160.00k).

Oh, you fsync every file. The problem the user reported did not involve
fsync at all, so your straw man isn't really relevant to the reported
problem. You're redefining the problem to suit your argument.

> > Why? Because the allocator is optimised to pack small files
> > written at the same time together on disk, and the elevator
> > will merge them into one large IO when they are finally
> > written to disk. With a typical 512k max IO size, that's 128
> > <=4k files packed into each IO,
>
> This is an argument based on a cunning or distracted or ignorant
> shift of the goalposts: because this is an argument about purely
> *writing* the *data* in those small files, while the bigger
> issue is *committing* the *metadata*, all of it "(including
> directory updates)". This argument is also based on the
> assumption that it is permissible to commit 128 small files when
> the last one gets closed, not when each gets committed.

I haven't confused anything - indeed, I explained exactly why the user
got the results they did with ext3.

You seem to be implying that the only way to provide data safety is:

	write file
	fsync file
	fsync parent dir
	write file
	fsync file
	fsync parent dir
	.....

Which is, quite frankly, a load of bollocks. The user doesn't care if
the untar is not complete because a crash occurred during it - they are
still going to have to redo it from scratch regardless of whether
file-by-file fsync is in use or not.

Indeed, doing this:

	write file
	write file
	write file
	write file
	write file
	.....
	sync

gives the same overall guarantees as your preferred method, but
completes much, much faster. Taking 30s to write the files
asynchronously and then another second or two for the sync to complete
is far more appropriate for this workload than doing a file-by-file
fsync.
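As a rough sketch of the difference between the two patterns (purely
illustrative, not from the original thread - the file names, counts and
the absence of error handling are made up for the example, and the
syncfs() note assumes a kernel new enough to have that syscall):

/*
 * Sketch of the two commit patterns above; error handling elided.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static char buf[4096];			/* stand-in file data */

/* fsync every file and its parent directory as we go */
static void per_file_fsync(int dirfd, int nfiles)
{
	char name[32];
	int i, fd;

	for (i = 0; i < nfiles; i++) {
		snprintf(name, sizeof(name), "safe.%d", i);
		fd = openat(dirfd, name, O_CREAT | O_WRONLY, 0644);
		write(fd, buf, sizeof(buf));
		fsync(fd);		/* commit file data + inode */
		close(fd);
		fsync(dirfd);		/* commit the directory entry */
	}
}

/* write everything asynchronously, issue one flush at the end */
static void batch_then_sync(int dirfd, int nfiles)
{
	char name[32];
	int i, fd;

	for (i = 0; i < nfiles; i++) {
		snprintf(name, sizeof(name), "fast.%d", i);
		fd = openat(dirfd, name, O_CREAT | O_WRONLY, 0644);
		write(fd, buf, sizeof(buf));
		close(fd);		/* no per-file flush */
	}
	sync();				/* or syncfs(dirfd), where available,
					 * to flush just this filesystem */
}

int main(void)
{
	int dirfd = open(".", O_RDONLY | O_DIRECTORY);

	per_file_fsync(dirfd, 100);
	batch_then_sync(dirfd, 100);
	close(dirfd);
	return 0;
}

After the final flush, both versions leave the same files on stable
storage; the first just pays for one or two synchronous metadata
commits per file to get there, which is where all the extra IO and
time goes.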
> In this discussion it is rather comical to make an argument
> based on the speed of IO using what is in effect EatMyData as
> described here:
>
> http://talk.maemo.org/showthread.php?t=67901
>
> but here it is:

/me starts laughing uncontrollably.

The source is http://www.flamingspork.com/projects/libeatmydata/ - it
was written to speed up database testing where fsync is not needed to
determine whether the test succeeded or not.

The fact is that the dpkg devs went completely nuts with fsync() when
ext4 came around, because it had problems with losing files when
crashes occurred shortly after upgrades. It was excessive and
unnecessary and didn't take into account the transactional grouping of
updates. This problem has since been fixed - there is now a sync issued
at the end of each package install so the data is on disk before the
"installation complete" entry is updated in the dpkg database. A single
sync rather than a sync-per-file is much, much faster, and matches the
intended "transaction grouping" of the dpkg operation. With the recent
addition of a "sync a single fs" syscall, it will get faster again....

> That's a fantastic result, somewhat over 1,300 small files per
> second (14 commits per nominal IOPS), but "fantastic" (as in
> fantasy) is the keyword, because it is for completely different
> and broken semantics, a point that should not be lost on anybody
> who can "understand IOPS and metadata and commits and caching".

Where's the "broken semantics" here? The filesystem did exactly what
you asked, and performed in exactly the way we'd expect it to.
Atomicity and stability guarantees are application dependent - they are
not defined by the filesystem.

Fundamentally, untarring a kernel tarball does not require the same
data safety semantics as a database, nor does it need to deal with
safely overwriting files. Sometimes people care more about performance
than they do about data safety, and untarring some huge tarball is
usually one of those cases. If they care about data safety, that is
what sync(1) is for after the untar...

> It is not as if the difference isn't widely known:
>
> http://cdrecord.berlios.de/private/man/star/star.1.html
>
> Star is a very fast tar(1) like tape archiver with improved
> functionality.
> On operating systems with slow file I/O (such as Linux), it
> may help to use -no-fsync in addition, but then star is
> unable to detect all error conditions; so use with care.

Ah, quoting Joerg Schilling's FUD about Linux. That's a good way to get
people to ignore you....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs