Re: [PATCH] iomap: Set all uptodate bits for an Uptodate page

Matthew Wilcox <willy@xxxxxxxxxxxxx> · Thu, 24 Sep 2020 18:56:38 +0100

On Thu, Sep 24, 2020 at 01:26:53PM -0400, Brian Foster wrote:
> On Thu, Sep 24, 2020 at 04:22:11PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 24, 2020 at 11:12:59AM -0400, Brian Foster wrote:
> > > On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > > > For filesystems with block size < page size, we need to set all the
> > > > > > per-block uptodate bits if the page was already uptodate at the time
> > > > > > we create the per-block metadata.  This can happen if the page is
> > > > > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > > > > from the page cache.
> > > > > > 
> > > > > > This is a data corruption issue as page writeback skips blocks which
> > > > > > are marked !uptodate.
> > > > > 
> > > > > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > > > > will similarly prevent the problem, but I'll give this a test
> > > > > nonetheless. 
> > > > > 
> > > > > I am a little curious why we'd prefer to fill the iop here rather than
> > > > > just clear the page state if the iop data has been released. If the page
> > > > > is partially uptodate, then we end up having to re-read the page
> > > > > anyways, right? OTOH, I guess this behavior is more consistent with page
> > > > > size == block size filesystems where iop wouldn't exist and we just go
> > > > > by page state, so perhaps that makes more sense.
> > > > 
> > > > Well, it's _true_ ... the PageUptodate bit means that every byte in this
> > > > page is at least as new as every byte on storage.  There's no need to
> > > > re-read it, which is what we'll do if we ClearPageUptodate.
> > > 
> > > Yes, of course. I'm just noting the inconsistent behavior between a full
> > > and partially uptodate page.
> > 
> > Heh, well, we have no way of knowing.  We literally just threw away
> > the information about which blocks are uptodate.  So the best we can
> > do is work with the single bit we have.  We do know that there are no
> > dirty blocks left on the page at this point (... maybe we should add a
> > VM_BUG_ON(!PageUptodate && PageDirty)).
> > 
> 
> Right..
> 
> > Something we could do is summarise the block uptodate information in
> > the 32/64 bits of page_private without setting PagePrivate.  That would
> > cause us to still allocate an iop so we can track reads/writes, but we
> > might be able to avoid a few reads.
> > 
> > But I don't think it's worth it.  Partially uptodate pages are not what
> > we should be optimising for; we should try to get & keep pages uptodate.
> > After all, it's a page cache ;-)
> > 
> 
> Fair enough. I was thinking about whether we could ensure the page is
> released if releasepage() effectively invalidated the page content (or
> avoid the release if we know the mapping won't be removed), but that
> appears to be nontrivial given the refcount interdependencies between
> page private and removing the mapping. I.e., the private data can hold a
> reference on the page and remove_mapping() wants to assume that the
> caller and page cache hold the last references on the page.

We could fix that -- remove_mapping() could take into account
page_has_private() in its call to page_ref_freeze() -- ie:

-       refcount = 1 + compound_nr(page);
+	retcount = 1 + compound_nr(page) + page_has_private(page);

like some other parts of the VM do.  And then the filesystem could
detach_page_private() in its aops->freepage() (which XFS aops don't
currently use).

That change might be a little larger than would be appreciated for a
data corruption fix going back two years.  And there's already other
reasons for wanting to be able to create an iop for an Uptodate page
(ie the THP patchset).