On Tue, Dec 22, 2020 at 11:53 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Dec 22, 2020, at 9:34 AM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
> >
> > On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
> >>
> >> I'm issuing sync + sleep(10) after the extraction, so the writes
> >> should all be flushed.
> >> Also, I repeated the test three times, with very similar results:
> >
> > So that means the problem is not due to page cache writeback
> > interfering with the discards.  So it's most likely that the problem
> > is due to how the blocks are allocated and laid out when using
> > data=ordered vs data=writeback.
> >
> > Some experiments to try next.  After extracting the files with
> > data=ordered and data=writeback on a freshly formatted file system,
> > use "e2freefrag" to see how the free space is fragmented.  This will
> > tell us how the file system is doing from a holistic perspective, in
> > terms of blocks allocated to the extracted files.  (E2freefrag is
> > showing you the blocks *not* allocated, of course, but that's a mirror
> > image dual of the blocks that *are* allocated, especially if you start
> > from an identical known state; hence the use of a freshly formatted
> > file system.)
> >
> > Next, we can see how individual files look like with respect to
> > fragmentation.  This can be done via using filefrag on all of the
> > files, e.g:
> >
> >        find . -type f -print0 | xargs -0 filefrag
> >
> > Another way to get similar (although not identical) information is via
> > running "e2fsck -E fragcheck" on a file system.
> > How they differ is more of a big deal on ext3 file systems without
> > extents and flex_bg, since filefrag tries to take into account
> > metadata blocks such as indirect blocks and extent tree blocks, and
> > e2fsck -E fragcheck does not; but it's good enough for getting a good
> > gestalt for the files' overall fragmentation --- and note that as
> > long as the average fragment size is at least a megabyte or two, some
> > fragmentation really isn't that much of a problem from a real-world
> > performance perspective.  People can get way too invested in trying to
> > get to perfection with 100% fragmentation-free files.  The problem
> > with doing this at the expense of all else is that you can end up
> > making the overall free space fragmentation worse as the file system
> > ages, at which point the file system performance really dives through
> > the floor as the file system approaches 100%, or even 80-90% full,
> > especially on HDD's.  For SSD's fragmentation doesn't matter quite so
> > much, unless the average fragment size is *really* small, and when you
> > are discarding freed blocks.
> >
> > Even if the files are showing no substantial difference in
> > fragmentation, and the free space is equally A-OK with respect to
> > fragmentation, the other possibility is that the *layout* of the
> > blocks is such that the order in which they are deleted using rm -rf
> > ends up being less friendly from a discard perspective.  This can
> > happen if the directory hierarchy is big enough, and/or the journal
> > size is small enough, that the rm -rf requires multiple journal
> > transactions to complete.  That's because with mount -o discard, we do
> > the discards after each transaction commit, and it might be that even
> > though the used blocks are perfectly contiguous, because of the order
> > in which the files end up getting deleted, we end up needing to
> > discard them in smaller chunks.
> >
> > For example, one could imagine a case where you have a million 4k
> > files, and they are allocated contiguously, but if you get
> > super-unlucky, such that in the first transaction you delete all of
> > the odd-numbered files, and in the second transaction you delete all
> > of the even-numbered files, you might need to do a million 4k discards
> > --- but if all of the deletes could fit into a single transaction, you
> > would only need to do a single million block discard operation.
> >
> > Finally, you may want to consider whether or not mount -o discard
> > really makes sense.  For most SSD's, especially high-end SSD's,
> > it probably doesn't make that much difference.  That's because when
> > you overwrite a sector, the SSD knows (or should know; this might not
> > be true of some really cheap, crappy low-end flash devices; but on
> > those devices, discard might not be making much of a difference
> > anyway), that the old contents of the sector are no longer needed.
> > Hence an overwrite effectively is an "implied discard".  So long as
> > there is a sufficient number of free erase blocks, the SSD might be
> > able to keep up doing the GC for those "implied discards", and so
> > accelerating the process by sending explicit discards after every
> > journal transaction might not be necessary.  Or, maybe it's sufficient
> > to run "fstrim" every week at Sunday 3am local time; or maybe even
> > fstrim once a night or fstrim once a month --- your mileage may vary.
> >
> > It's going to vary from SSD to SSD and from workload to workload, but
> > you might find that mount -o discard isn't buying you all that much
> > --- if you run a random write workload, and you don't notice any
> > performance degradation, and you don't notice an increase in the SSD's
> > write amplification numbers (if they are provided by your SSD), then
> > you might very well find that it's not worth it to use mount -o
> > discard.
> >
> > I personally don't bother using mount -o discard, and instead
> > periodically run fstrim, on my personal machines.  Part of that is
> > because I'm mostly just reading and replying to emails, building
> > kernels and editing text files, and that is not nearly as stressful on
> > the FTL as a full-blown random write workload (for example, if you
> > were running a database supporting a transaction processing workload).
>
> The problem (IMHO) with "-o discard" is that if it is only trimming
> *blocks* that were deleted, these may be too small to effectively be
> processed by the underlying device (e.g. the "super-unlucky" example
> above where interleaved 4KB file deletes result in 1M separate 4KB
> trim requests to the device, even when the *space* that is freed by
> the unlinks could be handled with far fewer large trim requests).
>
> There was a discussion previously ("introduce EXT4_BG_WAS_TRIMMED ...")
>
> https://patchwork.ozlabs.org/project/linux-ext4/patch/1592831677-13945-1-git-send-email-wangshilong1991@xxxxxxxxx/
>
> about leveraging the persistent EXT4_BG_WAS_TRIMMED flag in the group
> descriptors, and having "-o discard" only track trim on a per-group
> basis rather than its current mode of doing trim on a per-block basis,
> and then use the same code internally as fstrim to do a trim of free
> blocks in that block group.
>
> Using EXT4_BG_WAS_TRIMMED and tracking *groups* to be trimmed would be
> a bit more lazy than the current "-o discard" implementation, but would
> be more memory efficient, and also more efficient for the device (fewer,
> larger trim requests submitted).  It would only need to track groups
> that have at least a reasonable amount of free space to be trimmed.  If
> the group doesn't have enough free blocks to trim now, it will be checked
> again in the future when more blocks are freed.
>

Hi,

I gave the patch a quick run after refreshing it for 5.10, but it
doesn't seem to help.
Are there actions needed other than the patch itself?
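
To put numbers on the "super-unlucky" case quoted above, here is a toy
model in Python.  It is only a sketch and not how ext4 is implemented:
it assumes one discard request per maximal contiguous run of blocks
freed within a transaction, with no merging across commits, and each
file occupies exactly one block.  The name count_discards and the
reduced file count n are made up for illustration.

```python
def count_discards(transactions):
    """Count discard requests under the toy model: after each commit,
    every maximal run of contiguous freed blocks costs one discard.
    Runs are never merged across different commits (an assumption)."""
    total = 0
    for freed in transactions:
        prev = None
        for block in sorted(set(freed)):
            # A new run starts whenever the block does not directly
            # follow the previous freed block in this transaction.
            if prev is None or block != prev + 1:
                total += 1
            prev = block
    return total


n = 1000  # hypothetical number of one-block files, laid out contiguously

# All deletes fit into a single transaction.
one_commit = [list(range(n))]

# Odd-numbered files in the first transaction, even-numbered in the second.
odd_then_even = [list(range(1, n, 2)), list(range(0, n, 2))]

print("single commit:", count_discards(one_commit))
print("odd then even:", count_discards(odd_then_even))
```

Under these assumptions the single commit needs one discard, while the
interleaved order needs one discard per file, which is the effect
described in the quoted text.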

Regards,
--
per aspera ad upstream