Re: Splitting dirty fs folios

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 30 May 2023 08:39:32 +1000

On Mon, May 29, 2023 at 08:59:27PM +0100, Matthew Wilcox wrote:
> At the moment, when we truncate (also holepunch, etc) a file,
> the VFS attempts to split any large folios which overlap the newly
> created boundary in the file.  See mm/truncate.c for the callers of
> ->invalidate_folio.
> 
> We need the filesystem and the MM to cooperate on splitting a folio
> because there's FS metadata attached to folio->private.  We have per-folio
> state (uptodate, dirty) which the filesystem keeps per-block state for
> and uses the folio state as a summary (if every block is uptodate,
> the folio is uptodate.  if any block is dirty, the folio is dirty).
> If we just throw away that per-folio extra state we risk not writing
> back blocks which are dirty, or losing buffered writes as we re-read
> blocks which were more uptodate in memory than on disk.  There's no
> safe state to set the folio to.
> 
> This is fine if the entire folio is uptodate, and it generally is today
> because large folios are only created through readahead, which will
> bring the entire folio uptodate unless there is a read error.  But when
> creating a large folio in the write path, we can end up with large folios
> which are not uptodate under various circumstances.  For example, I've
> captured one where we write to pos:0x2a0e5f len:0xf1a1.  Because this is
> on a 1kB block size filesystem, we leave the first three blocks in the folio
> unread, and the uptodate bits are fffffffffffffff8.  That means that
> the folio as a whole is not uptodate.
> 
> Option 1: Read the start of the folio so we can set the whole folio
> uptodate.  In this case, we're already submitting a read for bytes
> 0x2a0c00-0x2a0fff (so we can overwrite the end of that block).  We could
> expand that to read 0x2a0000-0x2a0fff instead.  This could get tricky;
> at the moment we're guaranteed to have the iomap that covers the start
> of the block, but we might have to do a lookup to find the iomap(s)
> that covers the start of the folio.

Yeah, that seems like a problem - having to go back and get a
different iomap for read while we currently hold an iomap for write
is a potential deadlock for some filesystems. I think at this point,
we would be better off backing out of the write, getting a write
mapping for the entire large folio range and running the existing
"sub folio" zero-or-RMW code...

> Option 2: In the invalidate_folio implementation, writeback the folio
> so it is no longer dirty.  I'm not sure we have all the information we
> need to start writeback, and it'll annoy the filesystem as it has to
> allocate space if it wasn't already allocated.

I don't really like this - we want to throw away dirty data in range
being invalidated, not take a latency/performance hit to write it
back first.

FWIW, is the folio is dirty and the filesystem doesn't have
allocated space for it to be written back, then that is a bug that
needs fixing. Nothing should be dirtying a folio without giving the
filesystem the opportunity to allocate backing space for it
first....

> Option 3: Figure out a more complicated dance between the FS and the MM
> that allows the FS to attach state to the newly created folios before
> finally freeing the original folio.

Complex, but seems possible. Also seems tricky with respect to
making the entire swap appear atomic from an outside observer's
perspective. 

> Option 4: Stop splitting folios on holepunch / truncate.  Folio splits
> can fail, so we all have to cope with folios that substantially overhang
> a hole/data/EOF boundary.  We don't attempt to split folios on readahead
> when we discover we're trying to read from a hole, we just zero the
> appropriate chuks of the folio. 

That sounds like the best idea to me - it matches what we already in
terms of sub-page block size behaviour - zero the part of the page
that was invalidated.

> We do attempt to not allocate folios
> which extend more than one page past EOF, but that's subject to change
> anyway.

Yeah, i think we'll want to change that because if extending
sequential writes don't exactly match large folio alignment we'll
still end up with lots of small folios around the edges of the user
writes....

> Option 5: If the folio is both dirty and !uptodate, just refuse to split
> it, like if somebody else had a reference on it.  A less extreme version
> of #4.

Also seems like a reasonable first step.

> I may have missed some other option.  Option 5 seems like the least
> amount of work.

*nod*

Overall, I think the best way to approach it is to do the simplest,
most obviously correct thing first. If/when we observe performance
problems from the simple approach, then we can decide how to solve
that via one of the more complex approaches...

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx