Re: [PATCH] ext4: Avoid trim error on fs with small groups

Lukas Czerner <lczerner@xxxxxxxxxx> · Mon, 22 Nov 2021 14:53:04 +0100

On Fri, Nov 19, 2021 at 04:43:53PM +0100, Martin Wilck wrote:
> Hi Martin, Lukas,
> 
> On Tue, 2021-11-16 at 23:35 -0500, Martin K. Petersen wrote:
> > 
> > Lukas,
> > 
> > > My understanding always was that the request needs to be both
> > > properly
> > > aligned and of a certain minimum size (discard_granularity) to take
> > > effect.
> > > 
> > > If what you're saying is true then I feel like the documentation of
> > > discard_granularity
> > > 
> > > Documentation/ABI/testing/sysfs-block
> > > Documentation/block/queue-sysfs.rst
> > > 
> > > should change because that's not how I understand the notion of
> > > internal allocation unit. However the sentence you quoted is not
> > > entirely clear to me either so I'll leave that decision to someone
> > > who
> > > understands it better than me.
> > > 
> > > Martin could you please clarify it for us ?
> > 
> > The rationale behind exposing the discard granularity to userland was
> > to
> > facilitate mkfs.* picking allocation units that were friendly wrt. the
> > device's internal granularity. The nature of that granularity depends
> > on
> > the type of device (thin provisioned, resource provisioned, SSD, etc.).
> > 
> > Just like the other queue limits the discard granularity was meant as a
> > hint (some of that hintiness got lost as a result of conflating zeroing
> > and deallocating but that has since been resolved).
> > 
> > It is true that some devices get confused if you submit discards that
> > are smaller than their internal granularity. However, those devices are
> > typically the ones that don't actually support reporting that
> > granularity in the first place! Whereas SCSI devices generally don't
> > care. They'll happily ignore any parts of the request that are smaller
> > than whatever size they use internally.
> > 
> > One of the problems with "optimizing" away pieces that are smaller than
> > the reported granularity is when you combine devices. Say you have a
> > RAID1 of an SSD that reports 4096 bytes and one that reports 256MB. The
> > stacked device will then have a discard granularity of 256MB and thus
> > the SSD with the smaller granularity won't receive any of the discards
> > that might otherwise help it manage its media.

Thanks you Martin P. for clarifying.

> 
> I've checked btrfs and xfs, and they treat the minlen like Jan's patch
> would - the minlen is set to the device's granularity, and discarding
> smaller ranges is silently not even attempted.
> 
> So this seems to be common practice among file system implementations,
> and thus Jan's patch would make ext4 behave like the others. But I'm
> still uncertain if this behavior is ideal, and your remarks seem to
> confirm that.

Yeah I modeled my change for ext4 on the work in xfs done by Christoph
and so it kind of spread organically to other file systems.

> 
> The fstrim man page text is highly misleading. "The default value is
> zero, discarding every free block" is obviously wrong, given the
> current actual behavior of the major filesystems.

Originally there was no discard granularity optimization so it is what
it meant. And it also says "fstrim will adjust the minimum if it's
smaller than the device's minimum". So I am not sure if it's necessarily
misleading.

Still I think it should be best effort from the fs, not guarantee.

> 
> The way this is currently implemented, the user has no way to actually
> "discard every free block" to the extent supported by the device. I
> think that being able to do this would be desirable, at least in
> certain situations.
> 
> If we changed this, with the default value of 0 used by fstrim, file
> systems would attempt to free every block, which is slow and would
> likely either fail or have no effect on may devices. Maybe we should
> treat the value "0" like "automatic", i.e. adjust it the minlen to the
> device's granularity like we do now, but leave the value unchanged if
> the user explicitly sets a small non-zero value? That way "fstrim -m
> 512" could be used to force discarding every block that can be
> discarded.

Perhaps, this sounds like a reasonable solution to me.

> 
> Regards
> Martin
>