On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
>
> I'm issuing sync + sleep(10) after the extraction, so the writes
> should all be flushed.
> Also, I repeated the test three times, with very similar results:

So that means the problem is not due to page cache writeback interfering with the discards. So it's most likely that the problem is due to how the blocks are allocated and laid out when using data=ordered vs data=writeback.

Some experiments to try next. After extracting the files with data=ordered and data=writeback on a freshly formatted file system, use "e2freefrag" to see how the free space is fragmented. This will tell us how the file system is doing from a holistic perspective, in terms of blocks allocated to the extracted files. (E2freefrag is showing you the blocks *not* allocated, of course, but that's a mirror image of the blocks that *are* allocated, especially if you start from an identical known state; hence the use of a freshly formatted file system.)

Next, we can see what individual files look like with respect to fragmentation. This can be done by running filefrag on all of the files, e.g.:

	find . -type f -print0 | xargs -0 filefrag

Another way to get similar (although not identical) information is by running "e2fsck -E fragcheck" on a file system. The difference between the two matters most on ext3 file systems without extents and flex_bg, since filefrag tries to take into account metadata blocks such as indirect blocks and extent tree blocks, and e2fsck -E fragcheck does not; but it's good enough for getting a good gestalt of the files' overall fragmentation. Note also that as long as the average fragment size is at least a megabyte or two, some fragmentation really isn't that much of a problem from a real-world performance perspective. People can get way too invested in trying to get to perfection with 100% fragmentation-free files.
The problem with doing this at the expense of all else is that you can end up making the overall free space fragmentation worse as the file system ages, at which point the file system performance really dives through the floor as the file system approaches 100%, or even 80-90% full, especially on HDD's. For SSD's, fragmentation doesn't matter quite so much, except when the average fragment size is *really* small, and when you are discarding freed blocks.

Even if the files are showing no substantial difference in fragmentation, and the free space is equally A-OK with respect to fragmentation, the other possibility is that the *layout* of the blocks is such that the order in which they are deleted using rm -rf ends up being less friendly from a discard perspective. This can happen if the directory hierarchy is big enough, and/or the journal size is small enough, that the rm -rf requires multiple journal transactions to complete. That's because with mount -o discard, we do the discards after each transaction commit, and it might be that even though the used blocks are perfectly contiguous, because of the order in which the files end up getting deleted, we end up needing to discard them in smaller chunks.

For example, one could imagine a case where you have a million 4k files, and they are allocated contiguously, but if you get super-unlucky, such that in the first transaction you delete all of the odd-numbered files, and in the second transaction you delete all of the even-numbered files, you might need to do a million 4k discards. If all of the deletes could fit into a single transaction, you would only need to do a single million-block discard operation.

Finally, you may want to consider whether mount -o discard really makes sense. For most SSD's, especially high-end SSD's, it probably doesn't make that much difference.
That's because when you overwrite a sector, the SSD knows (or should know; this might not be true on some really cheap, crappy low-end flash devices, but on those devices, discard might not be making much of a difference anyway) that the old contents of the sector are no longer needed. Hence an overwrite is effectively an "implied discard". So long as there is a sufficient number of free erase blocks, the SSD might be able to keep up doing the GC for those "implied discards", and so accelerating the process by sending explicit discards after every journal transaction might not be necessary. Or, maybe it's sufficient to run "fstrim" every week at Sunday 3am local time; or maybe even fstrim once a night or fstrim once a month. Your mileage may vary.

It's going to vary from SSD to SSD and from workload to workload, but you might find that mount -o discard isn't buying you all that much. If you run a random write workload, and you don't notice any performance degradation, and you don't notice an increase in the SSD's write amplification numbers (if they are provided by your SSD), then you might very well find that it's not worth it to use mount -o discard.

I personally don't bother using mount -o discard, and instead periodically run fstrim, on my personal machines. Part of that is because I'm mostly just reading and replying to emails, building kernels and editing text files, and that is not nearly as stressful on the FTL as a full-blown random write workload (for example, if you were running a database supporting a transaction processing workload).

Cheers,

					- Ted
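
As a rough illustration of the odd/even deletion example above, here is a minimal Python sketch. It is a toy model and not how ext4 is implemented: it assumes one 4k block per file, file number i living at block i, and that each transaction commit issues one discard per contiguous run of blocks freed in that transaction. The names coalesce and count_discards are made up for this sketch.

```python
def coalesce(freed_blocks):
    """Yield (start, length) for each contiguous run of freed block numbers."""
    start = length = None
    for blk in sorted(freed_blocks):
        if start is not None and blk == start + length:
            length += 1                      # extends the current run
        else:
            if start is not None:
                yield (start, length)        # close the previous run
            start, length = blk, 1           # open a new run
    if start is not None:
        yield (start, length)


def count_discards(transactions):
    """Count discard requests when each transaction is discarded at its own commit.

    Assumption of this toy model: blocks freed in different transactions are
    never merged into one discard, even if they are adjacent on disk.
    """
    return sum(1 for txn in transactions for _ in coalesce(txn))


N = 1_000_000  # hypothetical: a million single-block files, laid out contiguously

# Unlucky case: odd-numbered files in one transaction, even-numbered in the next.
odd_then_even = [range(1, N, 2), range(0, N, 2)]
# Lucky case: every delete fits into a single transaction.
single_txn = [range(N)]

print(count_discards(odd_then_even))  # 1000000 single-block discards
print(count_discards(single_txn))     # 1 discard covering all N blocks
```

Both cases free exactly the same blocks; only the grouping into transactions changes how many discard requests the device would see under these assumptions.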