On Thu, Sep 24, 2020 at 01:26:53PM -0400, Brian Foster wrote:
> On Thu, Sep 24, 2020 at 04:22:11PM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 24, 2020 at 11:12:59AM -0400, Brian Foster wrote:
> > > On Thu, Sep 24, 2020 at 02:59:00PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Sep 24, 2020 at 09:12:35AM -0400, Brian Foster wrote:
> > > > > On Thu, Sep 24, 2020 at 01:56:08PM +0100, Matthew Wilcox (Oracle) wrote:
> > > > > > For filesystems with block size < page size, we need to set all the
> > > > > > per-block uptodate bits if the page was already uptodate at the time
> > > > > > we create the per-block metadata. This can happen if the page is
> > > > > > invalidated (eg by a write to drop_caches) but ultimately not removed
> > > > > > from the page cache.
> > > > > >
> > > > > > This is a data corruption issue as page writeback skips blocks which
> > > > > > are marked !uptodate.
> > > > >
> > > > > Thanks. Based on my testing of clearing PageUptodate here I suspect this
> > > > > will similarly prevent the problem, but I'll give this a test
> > > > > nonetheless.
> > > > >
> > > > > I am a little curious why we'd prefer to fill the iop here rather than
> > > > > just clear the page state if the iop data has been released. If the page
> > > > > is partially uptodate, then we end up having to re-read the page
> > > > > anyways, right? OTOH, I guess this behavior is more consistent with page
> > > > > size == block size filesystems where iop wouldn't exist and we just go
> > > > > by page state, so perhaps that makes more sense.
> > > >
> > > > Well, it's _true_ ... the PageUptodate bit means that every byte in this
> > > > page is at least as new as every byte on storage. There's no need to
> > > > re-read it, which is what we'll do if we ClearPageUptodate.
> > >
> > > Yes, of course. I'm just noting the inconsistent behavior between a full
> > > and partially uptodate page.
> >
> > Heh, well, we have no way of knowing.
> > We literally just threw away
> > the information about which blocks are uptodate. So the best we can
> > do is work with the single bit we have. We do know that there are no
> > dirty blocks left on the page at this point (... maybe we should add a
> > VM_BUG_ON(!PageUptodate && PageDirty)).
> >
>
> Right..
>
> > Something we could do is summarise the block uptodate information in
> > the 32/64 bits of page_private without setting PagePrivate. That would
> > cause us to still allocate an iop so we can track reads/writes, but we
> > might be able to avoid a few reads.
> >
> > But I don't think it's worth it. Partially uptodate pages are not what
> > we should be optimising for; we should try to get & keep pages uptodate.
> > After all, it's a page cache ;-)
> >
>
> Fair enough. I was thinking about whether we could ensure the page is
> released if releasepage() effectively invalidated the page content (or
> avoid the release if we know the mapping won't be removed), but that
> appears to be nontrivial given the refcount interdependencies between
> page private and removing the mapping. I.e., the private data can hold a
> reference on the page and remove_mapping() wants to assume that the
> caller and page cache hold the last references on the page.

We could fix that -- remove_mapping() could take into account
page_has_private() in its call to page_ref_freeze() -- ie:

-	refcount = 1 + compound_nr(page);
+	refcount = 1 + compound_nr(page) + page_has_private(page);

like some other parts of the VM do. And then the filesystem could
detach_page_private() in its aops->freepage() (which XFS aops don't
currently use).

That change might be a little larger than would be appreciated for a data
corruption fix going back two years. And there's already other reasons
for wanting to be able to create an iop for an Uptodate page (ie the
THP patchset).
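
To illustrate the refcount arithmetic behind the remove_mapping() idea,
here is a minimal user-space sketch. It is not kernel code: struct
model_page and every model_* function are hypothetical stand-ins, and it
assumes the private data holds exactly one extra reference on the page,
as described in the thread. The freeze is modelled as a compare-and-swap
that only succeeds when the refcount matches the expected value.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not the kernel's layout. */
struct model_page {
	atomic_int refcount;	/* stands in for the page reference count */
	bool has_private;	/* stands in for page_has_private() */
	int nr;			/* stands in for compound_nr(); 1 for a base page */
};

/*
 * Models a refcount freeze: swap the count to 0 only if it equals
 * 'expected', i.e. only if nobody else holds a reference.
 */
static bool model_ref_freeze(struct model_page *page, int expected)
{
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/* Expected count before the change: caller + page cache references. */
static int model_expected_old(const struct model_page *page)
{
	return 1 + page->nr;
}

/* Expected count with the proposed extra term for private data. */
static int model_expected_new(const struct model_page *page)
{
	return 1 + page->nr + (page->has_private ? 1 : 0);
}

/*
 * Walks through the case from the thread: a base page with private data
 * still attached. Returns 0 if the old expectation fails to freeze and
 * the new one succeeds, nonzero otherwise.
 */
static int model_demo(void)
{
	/* caller (1) + page cache (1) + private data (1) = 3 references */
	struct model_page page = { .refcount = 3, .has_private = true, .nr = 1 };

	if (model_ref_freeze(&page, model_expected_old(&page)))
		return 1;	/* old count of 2 should not match 3 */
	if (!model_ref_freeze(&page, model_expected_new(&page)))
		return 2;	/* new count of 3 should match */
	return 0;
}
```

In this model, a page with private data attached can only be frozen once
the expected count includes the private reference. After a successful
freeze, the filesystem would still have to drop the private data, which
is where the freepage() suggestion comes in.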