On Thu, Jan 31, 2013, at 08:43 AM, Theodore Ts'o wrote:
> On Thu, Jan 31, 2013 at 08:21:50AM +1100, Robert Mueller wrote:

(around now, was dropping the kids at school)

> > For that matter, one big question I have is why each of these results is
> > so different.
> >
> > [robm@imap14 conf]$ for i in 1 2 3 4 5 6 7 8 9 10; do fallocate -l 20m \
> >     testfile3; filefrag testfile3; /bin/rm testfile3; done
>
> The most likely reason is that it depends on transaction boundaries.
> After a block has been released, we can't reuse it until after the
> jbd2 transaction which contains the deletion of the inode has
> committed.  So even after you've deleted the file, we can't reuse the
> blocks right away.  The other thing which will influence the block
> allocation is which block group the last allocation was for that
> particular file.  So if blocks become available after a commit
> completes, and we've started allocating in another block group, we
> won't go back to the initial block group.

The particular directory we're doing this test in is a cyrus imapd
"conf" directory.  It contains mostly symlinks and subdirectories (some
of them quite hot), but it also contains mailboxes.db, which is a very
active database file.  In this case it's twoskip, a skiplist-based file
format.

When any change is made to a twoskip file, the IO pattern is:

1) rewrite the first 64 bytes (marking the file dirty) and fdatasync
2) append new change/delete records and update back pointers (involves
   between 1 and 20 random rewrites of between 32 and 200ish bytes per
   change)
3) fsync
4) rewrite the first 64 bytes (marking the file clean again) and
   fdatasync

So we get two fdatasyncs, one fsync (to save the metadata about the
file being longer now), a bunch of random updates throughout the file,
and some amount of new data appended to the file.

Every so often the file contains too many obsolete records, and it gets
repacked.
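For the archives, the four-step update sequence above would look roughly
like this at the syscall level.  This is just my sketch of the pattern,
not actual cyrus code - the names (HEADER_SIZE, twoskip_apply,
write_header) and the flattened error handling are illustrative:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define HEADER_SIZE 64          /* illustrative; the dirty/clean flag lives here */

/* Steps 1) and 4): rewrite the 64-byte header in place, then fdatasync.
 * fdatasync is enough here because the file size does not change. */
static int write_header(int fd, const unsigned char *hdr)
{
    if (pwrite(fd, hdr, HEADER_SIZE, 0) != HEADER_SIZE)
        return -1;
    return fdatasync(fd);
}

int twoskip_apply(int fd, const unsigned char *dirty_hdr,
                  const unsigned char *clean_hdr,
                  const struct iovec *records, int nrecords,
                  off_t append_off)
{
    /* 1) mark the file dirty */
    if (write_header(fd, dirty_hdr) < 0)
        return -1;

    /* 2) append the new change/delete records at the end of the file;
     *    the real format also does 1-20 small in-place back-pointer
     *    rewrites (pwrite), omitted here */
    if (lseek(fd, append_off, SEEK_SET) < 0)
        return -1;
    if (writev(fd, records, nrecords) < 0)
        return -1;

    /* 3) full fsync: the file grew, so the new length must hit disk too */
    if (fsync(fd) < 0)
        return -1;

    /* 4) mark the file clean again */
    return write_header(fd, clean_hdr);
}
```

So the allocator sees a tiny rewrite at offset 0, an extending write at
the tail, and the sync calls in between - which is the interleaving I
suspect is confusing block allocation during a repack.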
This involves creating a new database file (mailboxes.db.NEW) and
walking through the original database, copying each record to the new
database.  Finally, the new database is renamed over the old.

It uses flock on the entire file for serialisation, so there can only
be a single writer at a time.  Writes are done using seek and writev;
reads are done by mmap()ing the entire file.

More detail about twoskip here, if anyone cares:
http://opera.brong.fastmail.fm/talks/twoskip/

It's the twoskip files that we're particularly concerned about.  Not so
much that they fragment during use - that's kind of expected - but that
a repack doesn't result in a single contiguous file.  Apart from the
header, I can't see why it doesn't.

I could probably change the repack code to skip the first two
fdatasyncs and just do a final fsync before renaming, if you think that
initial fdatasync of just a couple of hundred bytes (header plus
initial dummy record) is likely to mess up block allocation.

Bron.

-- 
  Bron Gondwana
  brong@xxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html