On Sat, Aug 08, 2020 at 10:33:08PM -0600, Andreas Dilger wrote:
> What about storing "s_min_freed_blocks_to_trim" persistently in the
> superblock, and then the admin can adjust this as desired?  If it is
> set =1, then the "lazy trim" optimization would be disabled (every
> FITRIM request would honor the trim requests whenever there is a
> freed block in a group).  I suppose we could allow =0 to mean "do not
> store the WAS_TRIMMED flag persistently", so there would be no change
> for current behavior, and it would require a tune2fs option to set the
> new value into the superblock (though we might consider setting this
> to a non-zero value in mke2fs by default).

Currently the minimum blocks to trim is passed in to FITRIM from
userspace, so we would need to define how the value passed in by the
fstrim program interacts with the value stored in the superblock.
Would we always ignore the value passed in from userspace?  That
doesn't seem right...

> The other thing we were thinking about was changing the "-o discard" code
> to leverage the WAS_TRIMMED flag, and just do bulk trim periodically
> in the filesystem as blocks are freed from groups, rather than tracking
> freed extents in memory and submitting trims actively during IO.  Instead,
> it would track groups that exceed "s_min_freed_blocks_to_trim", and trim
> the whole group in the background when the filesystem is not active.

Hmm, maybe.  That's an awful lot of complexity, which is my concern
with that approach.

Part of the problem here is that discard is being used for different
things, for different use cases, and on devices with different discard
speeds.  Right now, one of the primary uses of -o discard is for
people who have fast discard implementations and/or people who really
want to make sure every freed block is immediately discarded ---
perhaps to meet security / privacy requirements (such as HIPAA
compliance, etc.).  I don't want to break that.

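To make the quoted background-trim proposal concrete, here is a minimal
sketch of the bookkeeping it implies.  It is only an illustration: the
Group class and the helper names are hypothetical and do not correspond
to real ext4 structures, and it assumes a threshold of at least 1.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical per-block-group state; not the real ext4 structures."""
    freed_since_trim: int = 0  # blocks freed since the last whole-group trim
    was_trimmed: bool = True   # stands in for the WAS_TRIMMED flag


def free_blocks(group, count):
    """Record freed blocks; the group now has untrimmed free space."""
    group.freed_since_trim += count
    group.was_trimmed = False


def groups_to_trim(groups, min_freed_blocks_to_trim):
    """Return indexes of groups a background trim pass would visit.

    Assumes min_freed_blocks_to_trim >= 1; with a value of 1, any freed
    block makes its group eligible, so no trim is ever deferred.
    """
    return [
        i for i, g in enumerate(groups)
        if not g.was_trimmed and g.freed_since_trim >= min_freed_blocks_to_trim
    ]


def mark_trimmed(group):
    """Reset the bookkeeping after the whole group has been trimmed."""
    group.freed_since_trim = 0
    group.was_trimmed = True
```

With a threshold of 256, a group that has only freed 10 blocks is left
alone until more blocks accumulate, which is where the immediacy that
-o discard users expect today would be lost.
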
We now have a requirement from people who have very slow discards ---
I think at one point people mentioned something about devices backed
by HDDs, probably in some kind of dm-thin use case?  One solution that
we can use for those is to simply use fstrim -m 8M or some such.  But
it appears that part of the problem is that people do want more
precision than that?

Another solution might be to skip trimming a block group if it has
freshly freed blocks that are still pending a commit, and to keep
skipping that block group until the commit has completed.  That might
also help reduce contention on a busy file system.

Yet another solution might be to bias block allocations towards LBA
ranges that have been deleted recently --- since another way to avoid
trims is to simply overwrite those LBAs.  But then the question is how
much memory we are willing to dedicate towards tracking recently
released LBAs, and to what level of granularity?  Perhaps we just
track the freed extents, and if they don't get reused within a certain
period, or if we start getting put under memory pressure, we then send
the discards at that point.

Ultimately, though, this is a space full of trade-offs, and I'm
reminded of one of my father's favorite Chinese sayings: "You're
demanding a horse which can run fast, but which doesn't eat much
grass."  (又要马儿跑,又要马儿不吃草).  Or translated more
idiomatically, you can't have your cake and eat it too.  It seems this
desire transcends all cultures.  :-)

					- Ted
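
As a rough illustration of the "track the freed extents" idea above,
the sketch below defers discards until an extent has gone unused for
too long or the tracker grows past a budget.  Everything in it is an
assumption made for illustration: the PendingDiscards class, the
max_age and max_entries knobs (the latter a crude stand-in for memory
pressure), and the simplification that reuse matches a freed extent
exactly rather than partially overlapping it.

```python
from collections import OrderedDict


class PendingDiscards:
    """Hypothetical tracker for freed extents awaiting discard."""

    def __init__(self, max_age, max_entries):
        self.max_age = max_age          # seconds an extent may wait for reuse
        self.max_entries = max_entries  # crude stand-in for memory pressure
        self.pending = OrderedDict()    # (start, length) -> time freed, oldest first

    def extent_freed(self, start, length, now):
        """Remember a freed extent instead of discarding it right away."""
        self.pending[(start, length)] = now

    def extent_reused(self, start, length):
        """The allocator overwrote this extent, so no discard is needed."""
        self.pending.pop((start, length), None)

    def collect(self, now):
        """Return the extents that should be discarded at time `now`.

        Oldest entries are released first, either because they have
        waited at least max_age or because the tracker is over budget.
        """
        due = []
        while self.pending:
            extent, freed_at = next(iter(self.pending.items()))
            too_old = now - freed_at >= self.max_age
            over_budget = len(self.pending) > self.max_entries
            if not (too_old or over_budget):
                break
            del self.pending[extent]
            due.append(extent)
        return due
```

The trade-off shows up directly in the two knobs: a larger max_entries
avoids more discards through reuse but costs memory, and a larger
max_age delays the point at which a freed block is actually discarded.
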