On May 06, 2009  07:28 -0400, Theodore Ts'o wrote:
> So that's the good news.  However, the block allocation shows that we
> are doing something... strange.  Running an e2fsck -E fragcheck report,
> the large files seem to be written out in 8 megabyte chunks:
>
>  1313(f): expecting 51200 actual extent phys 53248 log 2048 len 2048
>  1351(f): expecting 53248 actual extent phys 57344 log 2048 len 2048
>  1351(f): expecting 59392 actual extent phys 67584 log 4096 len 4096
>  1351(f): expecting 71680 actual extent phys 73728 log 8192 len 2048
>  1351(f): expecting 75776 actual extent phys 77824 log 10240 len 2048
>  1574(f): expecting 77824 actual extent phys 81920 log 6144 len 2048
>  1574(f): expecting 83968 actual extent phys 86016 log 8192 len 12288
>  1574(f): expecting 98304 actual extent phys 100352 log 20480 len 32768

Two things might be involved here:
- IIRC mballoc limits its extent searches to 8MB, so that it doesn't
  waste a lot of cycles looking for huge free chunks when there aren't
  any.  For Lustre that didn't make much difference since the largest
  possible IO size at the server is 1MB.  That said, if we have huge
  delalloc files it might make sense to do some checking for more
  space, possibly whole free groups for files > 128MB in size.
  Scanning the buddy bitmaps isn't very expensive, but loading some
  10000's of them in a large filesystem IS.
- it might also relate to pdflush limiting the background writeout
  from a single file, and flushing the delalloc pages in round-robin
  manner.  Without delalloc the blocks would already have been
  allocated, so the writeout speed didn't matter.  With delalloc now
  we might have an unpleasant interaction between how pdflush writes
  out the dirty pages and how the files are allocated on disk.

> Thinking this was perhaps rsync's fault, I tried the experiment where I
> copied the files using tar:
>
>    tar -cf - -C /mnt2 . | tar -xpf - -C /mnt .
>
> However, the same pattern was visible.
> Tar definitely copies files one at a time, so this must be an
> artifact of the page writeback algorithms.

If you can run a similar test with fsync after each file I suspect the
layout will be correct.  Alternately, if the kernel did the equivalent
of "fallocate(KEEP_SIZE)" for the file as soon as writeout started, it
would avoid any interaction between pdflush and the file allocation.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
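
The "fsync after each file" test suggested above could be run with a small
copy script along the following lines.  This is only a sketch under stated
assumptions, not something from the thread: the function name
copy_tree_with_fsync and the 1MB read size are made up for illustration,
and it copies regular files only (no ownership, permissions or symlinks,
unlike tar -p).

```python
import os


def copy_tree_with_fsync(src_root, dst_root, chunk_size=1 << 20):
    """Copy regular files from src_root to dst_root one at a time.

    Each destination file is fsync()ed before the next file is opened,
    so its dirty pages are written out (and, with delalloc, its blocks
    allocated) before any other file in the copy starts writing.
    Returns the number of files copied.

    Hypothetical helper for illustration; name and chunk_size are
    assumptions, not part of any existing tool.
    """
    copied = 0
    for dirpath, dirnames, filenames in os.walk(src_root):
        # Mirror the directory layout under dst_root.
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(dst_dir, exist_ok=True)

        for name in sorted(filenames):
            src = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(src) or not os.path.isfile(src):
                continue
            dst = os.path.join(dst_dir, name)
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                while True:
                    buf = fin.read(chunk_size)
                    if not buf:
                        break
                    fout.write(buf)
                # Push Python's buffer to the kernel, then force the
                # kernel to write this file out before moving on.
                fout.flush()
                os.fsync(fout.fileno())
            copied += 1
    return copied
```

Calling copy_tree_with_fsync("/mnt2", "/mnt") in place of the tar pipeline
and then repeating the e2fsck -E fragcheck run should show whether the
8 megabyte pattern goes away when writeback of one file cannot interleave
with another.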