Re: [PATCH v3 1/2] ext4: introduce EXT4_BG_WAS_TRIMMED to optimize trim

tytso@xxxxxxx · Mon, 10 Aug 2020 09:24:57 -0400

On Sat, Aug 08, 2020 at 10:33:08PM -0600, Andreas Dilger wrote:
> What about storing "s_min_freed_blocks_to_trim" persistently in the
> superblock, and then the admin can adjust this as desired?  If it is
> set =1, then the "lazy trim" optimization would be disabled (every
> FITRIM request would honor the trim requests whenever there is a
> freed block in a group).  I suppose we could allow =0 to mean "do not
> store the WAS_TRIMMED flag persistently", so there would be no change
> for current behavior, and it would require a tune2fs option to set the
> new value into the superblock (though we might consider setting this
> to a non-zero value in mke2fs by default).

Currently the minimum number of blocks to trim is passed in to FITRIM
from userspace, so we would need to define how the passed-in value from
the fstrim program interacts with the value stored in the superblock.
Would we always ignore the value passed-in from userspace?  That
doesn't seem right...
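
For reference, here is a rough sketch of how the minimum length reaches the
kernel today.  It is Python rather than C for brevity.  struct fstrim_range
(start, len, minlen, all 64-bit) and FITRIM come from <linux/fs.h>; the
numeric ioctl value is what _IOWR('X', 121, struct fstrim_range) works out
to on x86 and arm, and it, along with the helper names, is an assumption of
this sketch rather than anything from the patch.

```python
import fcntl
import os
import struct

# _IOWR('X', 121, struct fstrim_range) on x86/arm; other architectures
# encode the direction bits differently, so treat this value as an assumption.
FITRIM = 0xC0185879

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
FSTRIM_RANGE = struct.Struct("=QQQ")
U64_MAX = 2**64 - 1


def build_fstrim_range(start=0, length=U64_MAX, minlen=0):
    """Pack the argument that userspace hands to FITRIM (all values in bytes)."""
    return FSTRIM_RANGE.pack(start, length, minlen)


def fitrim(mountpoint, minlen=0):
    """Trim the whole filesystem; return the byte count the kernel reports.

    Needs CAP_SYS_ADMIN and a filesystem that supports discard.
    """
    buf = bytearray(build_fstrim_range(minlen=minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, FITRIM, buf)  # the kernel writes the trimmed size into len
    finally:
        os.close(fd)
    return FSTRIM_RANGE.unpack(bytes(buf))[1]
```

Calling fitrim("/mnt", minlen=8 << 20) would correspond roughly to
"fstrim -m 8M /mnt".  A minimum stored in the superblock would be a second
source for the same number, which is why the precedence has to be defined.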

> The other thing we were thinking about was changing the "-o discard" code
> to leverage the WAS_TRIMMED flag, and just do bulk trim periodically
> in the filesystem as blocks are freed from groups, rather than tracking
> freed extents in memory and submitting trims actively during IO.  Instead,
> it would track groups that exceed "s_min_freed_blocks_to_trim", and trim
> the whole group in the background when the filesystem is not active.

Hmm, maybe.  That's an awful lot of complexity, which is my concern
with that approach.

Part of the problem here is that discard is being used for different
things for different use cases and devices with different discard
speeds.  Right now, one of the primary uses of -o discard is for
people who have fast discard implementations and/or people who really
want to make sure every freed block is immediately discarded --- perhaps
to meet security / privacy requirements (such as HIPAA compliance,
etc.).  I don't want to break that.

We now have a requirement from people who have very slow discards --- I
think at one point people mentioned something about devices using
HDDs, probably in some kind of dm-thin use case?  One solution that we
can use for those is to simply use fstrim -m 8M or some such.  But it
appears that part of the problem is that people do want more precision
than that?

Another solution might be to skip trimming a block group if it has
freshly freed blocks that are still pending a commit, and to keep
skipping that block group until the commit has completed.  That might
also help reduce contention on a busy file system.
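
As a toy illustration of that idea (not ext4 code; the Group class and its
fields are made up for this sketch), a trim pass could simply partition the
groups by whether they have frees that are waiting on a commit:

```python
from dataclasses import dataclass


@dataclass
class Group:
    """Toy stand-in for a block group; not an ext4 structure."""
    number: int
    freed_pending_commit: int = 0  # blocks freed in a not-yet-committed transaction


def groups_to_trim(groups):
    """Split groups into (trim now, skipped until the commit completes)."""
    trim_now, skipped = [], []
    for g in groups:
        if g.freed_pending_commit > 0:
            skipped.append(g.number)
        else:
            trim_now.append(g.number)
    return trim_now, skipped
```

A later pass, run after the commit has cleared freed_pending_commit, would
pick up the groups that were skipped the first time around.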

Yet another solution might be to bias block allocations towards LBA
ranges that have been deleted recently --- since another way to avoid
trims is to simply overwrite those LBAs.  But then the question is
how much memory are we willing to dedicate towards tracking recently
released LBAs, and to what level of granularity?  Perhaps we just
track the freed extents, and if they don't get used within a certain
period, or if we start getting put under memory pressure, we then send
the discards at that point.
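
To make the last variant concrete, here is a minimal sketch under some loud
assumptions: whole extents only (no partial reuse), a simple cap on the
number of tracked extents standing in for memory pressure, and a class and
method names that are purely hypothetical.

```python
from collections import OrderedDict


class FreedExtentTracker:
    """Toy model: remember freed extents, discard only the ones left unused."""

    def __init__(self, max_age, max_tracked):
        self.max_age = max_age          # seconds an extent may wait for reuse
        self.max_tracked = max_tracked  # crude stand-in for memory pressure
        self.pending = OrderedDict()    # (start, length) -> time freed, oldest first

    def freed(self, start, length, now):
        """Record a freed extent; return extents that must be discarded now."""
        self.pending.pop((start, length), None)  # re-freed extents go to the back
        self.pending[(start, length)] = now
        to_discard = []
        while len(self.pending) > self.max_tracked:
            extent, _ = self.pending.popitem(last=False)  # evict the oldest
            to_discard.append(extent)
        return to_discard

    def reused(self, start, length):
        """The allocator overwrote this extent, so no discard is needed."""
        self.pending.pop((start, length), None)

    def expire(self, now):
        """Return extents that were not reused within max_age."""
        to_discard = []
        while self.pending:
            extent, freed_at = next(iter(self.pending.items()))
            if now - freed_at < self.max_age:
                break
            del self.pending[extent]
            to_discard.append(extent)
        return to_discard
```

The two knobs are exactly the open questions above: max_tracked bounds the
memory spent on tracking, and the (start, length) key fixes the granularity.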

Ultimately, though, this is a space full of tradeoffs, and I'm
reminded of one of my father's favorite Chinese sayings: "You're
demanding a horse which can run fast, but which doesn't eat much
grass." (又要马儿跑，又要马儿不吃草).  Or translated more
idiomatically, you can't have your cake and eat it too.  It seems this
desire transcends all cultures.  :-)

	       	   	      	   	- Ted