Hi,

We were recently trying to evaluate the trade-offs between update-in-place and no-overwrite file system designs, and on ext4 I produced some data that does not match my understanding of ext4 internals. I am wondering whether this is known behavior, and what is going on within the ext4 data structures that would lead to these results.

The experiment I was running had four phases:

  1. Write an 8 GiB file sequentially (in 4 MiB chunks).
  2. Read back the 8 GiB file sequentially (in 4 MiB chunks).
  3. Overwrite 10,000 4 KiB blocks within the 8 GiB file (block-aligned offsets chosen uniformly at random).
  4. Read back the 8 GiB file sequentially (in 4 MiB chunks).

We start with an empty file system on its own partition. Between phases, we drop the caches and unmount/remount the file system to ensure that all reads are cold-cache. We ran all experiments using Linux 3.11.10 on an ATA disk.

We expected the two sequential reads to perform indistinguishably, based on our assumption that data blocks are updated in place, so the random overwrites would have no impact on data placement. What we found instead was that the second sequential read was about 10% slower than the first.

When I looked at the blktrace output from the random writes, I did not notice anything that I thought was suspicious. When I looked at the blktrace output from the second sequential read, it appears as if some small reads are performed out of order with respect to LBA.

I have linked the seekwatcher I/O output for each of the phases (green indicates a write, and blue indicates a read). I have also included a zoomed-in detail of the second sequential read (at the finest granularity seekwatcher allowed); it covers the first second of the second sequential read.

Internally, what data structures are changed by an overwrite that would cause different read patterns?
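For anyone trying to reproduce this, the four phases above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the exact harness we used: the helper names (sequential_write, sequential_read, random_offsets, random_overwrite, drop_caches) are hypothetical, offsets are assumed to be drawn with replacement, and the unmount/remount step between phases is left out. drop_caches() needs root.

```python
import os
import random

CHUNK = 4 * 1024 * 1024  # 4 MiB sequential I/O size
BLOCK = 4 * 1024         # 4 KiB overwrite size (the ext4 block size here)


def sequential_write(path, size, chunk=CHUNK):
    """Phase 1: create the file and write `size` bytes in `chunk`-sized pieces."""
    buf = os.urandom(chunk)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < size:
            n = min(chunk, size - written)
            written += os.write(fd, buf[:n])
        os.fsync(fd)
    finally:
        os.close(fd)


def sequential_read(path, chunk=CHUNK):
    """Phases 2 and 4: read the whole file in `chunk`-sized pieces; return bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            total += len(data)
    finally:
        os.close(fd)
    return total


def random_offsets(size, count, block=BLOCK, seed=None):
    """Block-aligned offsets chosen uniformly at random (assumed: with replacement)."""
    rng = random.Random(seed)
    nblocks = size // block
    return [rng.randrange(nblocks) * block for _ in range(count)]


def random_overwrite(path, offsets, block=BLOCK):
    """Phase 3: overwrite one `block`-sized block at each offset."""
    buf = os.urandom(block)
    fd = os.open(path, os.O_WRONLY)  # no O_TRUNC/O_APPEND: the file size stays the same
    try:
        for off in offsets:
            os.pwrite(fd, buf, off)
        os.fsync(fd)
    finally:
        os.close(fd)


def drop_caches():
    """Between phases (root only): flush dirty data, then drop page cache, dentries and inodes."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")
```

At full scale the sequence would be sequential_write(path, 8 << 30), sequential_read(path), random_overwrite(path, random_offsets(8 << 30, 10000)), sequential_read(path), with drop_caches() and an unmount/remount between each step.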
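For scale, a back-of-the-envelope calculation (assuming the 10,000 offsets are drawn uniformly with replacement) shows that the overwrites touch only about 0.5% of the file's 4 KiB blocks, yet land in roughly 99% of the 4 MiB chunks that each sequential read issues. The helper name expected_coverage is hypothetical:

```python
def expected_coverage(file_bytes, unit_bytes, writes):
    """Expected fraction of `unit_bytes`-sized units hit by at least one of
    `writes` uniformly random block-aligned writes (drawn with replacement).

    Assumes unit_bytes is a multiple of the write size, so each write
    lands in any given unit with probability 1 / units."""
    units = file_bytes // unit_bytes
    return 1.0 - (1.0 - 1.0 / units) ** writes


FILE = 8 << 30  # 8 GiB
print(expected_coverage(FILE, 4 << 10, 10000))  # ≈ 0.0048 (4 KiB blocks)
print(expected_coverage(FILE, 4 << 20, 10000))  # ≈ 0.992  (4 MiB read chunks)
```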
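To quantify the out-of-order reads seen in blktrace, something like the following could be run over the blkparse output of each read phase. It is a sketch that assumes blkparse's default text format (action in the sixth field, RWBS in the seventh, then "sector + nsectors"); it keeps only dispatched ("D") reads and flags any request that starts before the end of the request dispatched just before it. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Request = Tuple[int, int]  # (start sector, number of sectors)


def parse_dispatched_reads(lines: Iterable[str]) -> List[Request]:
    """Extract (sector, nsectors) for reads dispatched to the driver.

    Assumes blkparse's default output, e.g.
      '  8,0    0        1     0.000000000  1234  D   R 2048 + 8 [dd]'
    """
    reqs = []
    for line in lines:
        f = line.split()
        if len(f) >= 10 and f[5] == "D" and "R" in f[6] and f[8] == "+":
            reqs.append((int(f[7]), int(f[9])))
    return reqs


def out_of_order(reqs: List[Request]) -> List[Tuple[int, Request]]:
    """Return (index, request) for each request that starts before the end
    of the immediately preceding request, i.e. a backward seek in LBA order."""
    flagged = []
    for i in range(1, len(reqs)):
        prev_start, prev_len = reqs[i - 1]
        start, _ = reqs[i]
        if start < prev_start + prev_len:
            flagged.append((i, reqs[i]))
    return flagged
```

Comparing len(out_of_order(...)) and the sizes of the flagged requests between the first and second sequential reads would give a number to go with the seekwatcher plots.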
The file is very large and spans many block groups, but my understanding is that its size and allocation information should not change when blocks are merely overwritten. The only changes to file system metadata that I can think of would be the inode's mtime, atime, and ctime. No extents should be split.

This is my first time posting to the list, so please let me know if there is anything else I should provide or if there is any etiquette I am violating. I appreciate any insights.

Graphs:
  Sequential write: https://drive.google.com/open?id=0B8HuLLVp2h86SmxmeVFGczFFaEU
  Sequential read of sequentially-written data: https://drive.google.com/open?id=0B8HuLLVp2h86SmxmeVFGczFFaEU
  Random 4 KiB-aligned overwrites: https://drive.google.com/open?id=0B8HuLLVp2h86Slo5X1BjUkVxRUE
  Sequential read of randomly-overwritten data: https://drive.google.com/open?id=0B8HuLLVp2h86UnlzNkJvdG1HYUk
    (Detail of first 1 second: https://drive.google.com/open?id=0B8HuLLVp2h86R0IzV01Sd3M5ejA)

Thank you,
Bill