On Thu 30-11-23 21:20:41, Ritesh Harjani wrote:
> Jan Kara <jack@xxxxxxx> writes:
> 
> > On Thu 30-11-23 16:29:59, Ritesh Harjani wrote:
> >> Jan Kara <jack@xxxxxxx> writes:
> >> 
> >> > On Thu 30-11-23 13:15:58, Ritesh Harjani wrote:
> >> >> Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes:
> >> >> 
> >> >> > Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes:
> >> >> >
> >> >> >> Christoph Hellwig <hch@xxxxxxxxxxxxx> writes:
> >> >> >>
> >> >> >>> On Wed, Nov 22, 2023 at 01:29:46PM +0100, Jan Kara wrote:
> >> >> >>>> writeback bit set. XFS plays the revalidation sequence counter games
> >> >> >>>> because of this so we'd have to do something similar for ext2. Not that I'd
> >> >> >>>> care as much about ext2 writeback performance but it should not be that
> >> >> >>>> hard and we'll definitely need some similar solution for ext4 anyway. Can
> >> >> >>>> you give that a try (as a followup "performance improvement" patch).
> >> >> 
> >> >> ok. So I am re-thinking why a filesystem like ext2 would require the
> >> >> sequence counter check. We don't have collapse range or COW sorts of
> >> >> operations; it is only truncate which can race, and that should be taken
> >> >> care of by folio_lock. And even if a partial truncate happens on a folio,
> >> >> since the logical to physical block mapping never changes, it should not
> >> >> matter if the writeback wrote data to a cached entry, right?
> >> > 
> >> > Yes, so this is what I think I've already mentioned. As long as we map just
> >> > the block at the current offset (or a block under the currently locked
> >> > folio), we are fine and we don't need any kind of sequence counter. But as
> >> > soon as we start caching any kind of mapping in iomap_writepage_ctx we need
> >> > a way to protect from races with truncate. So something like the sequence
> >> > counter.
> >> 
> >> Why do we need to protect from the race with truncate is my question here.
> >> IMO, truncate will truncate the folio cache first, before releasing the FS
> >> blocks. Truncation of the folio cache and the writeback path are protected
> >> using folio_lock(). Truncate will clear the dirty flag of the folio before
> >> releasing the folio_lock(), right, so writeback will not even proceed for
> >> folios which are not marked dirty (even if we have a cached wpc entry for
> >> which the folio was released from the folio cache).
> >> 
> >> Now coming to a stale cached wpc entry for which truncate is doing a
> >> partial truncation. Say truncate ended up calling
> >> truncate_inode_partial_folio(). Now for such a folio (it remains dirty
> >> after partial truncation, and there is a stale cached wpc entry for it),
> >> when writeback writes to the underlying stale block, there is no harm
> >> with that, right?
> >> 
> >> Also, this will "only" happen for a folio which was partially truncated.
> >> So why we need to have a sequence counter protecting against this race
> >> is my question.
> > 
> > That's a very good question and it took me a while to formulate my "this
> > sounds problematic" feeling into a particular example :) We can still have
> > a race like:
> > 
> > write_cache_pages()
> >   cache extent covering 0..1MB range
> >   write page at offset 0k
> >                                   truncate(file, 4k)
> >                                     drops all relevant pages
> >                                     frees fs blocks
> >                                   pwrite(file, 4k, 4k)
> >                                     creates dirty page in the page cache
> >   write page at offset 4k to a stale block
> 
> :) Nice!
> Truncate followed by an append write could cause this race with writeback
> happening in parallel. And this might cause data corruption.
> 
> Thanks again for clearly explaining the race :)
> 
> So I think just having a seq. counter (no other locks required) should be
> enough to prevent this race, since mostly what we are interested in is
> whether the stale cached block mapping got changed for the folio which
> writeback is going to continue writing to.
> 
> i.e. the race only happens when these 2 operations happen while we have a
> cached block mapping for a folio -
> - a new physical block got allocated for the same logical offset to which
>   the previous folio belongs
> - the folio got re-instantiated in the page cache as dirty with the new
>   physical block mapping at the same logical offset of the file.
> 
> Now when the writeback continues, it will iterate over the same dirty
> folio in question, lock it, and call ->map_blocks, which will notice the
> changed seq counter, do ->get_blocks again, and then we do submit_bio()
> to the right physical block.
> 
> So it should be ok if we simply update the seq counter every time a block
> is allocated or freed for a given inode... because when we check the seq
> counter, we know the folio is locked, and the other operation of
> re-instating a new physical block mapping for a given folio can also only
> happen while under a folio lock.

Yes.
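For illustration, a minimal, self-contained sketch of the sequence counter
idea described above: bump a per-inode counter whenever a block is allocated
or freed, and compare it under the folio lock before trusting a cached
mapping. All names below (inode_sketch, map_seq, cached_mapping, ...) are
made up for this example and are not the real ext2/iomap interfaces; an
actual kernel implementation would also need proper locking or
READ_ONCE/WRITE_ONCE annotations around the counter.

#include <stdbool.h>
#include <stdint.h>

struct inode_sketch {
	/* Bumped whenever a block is allocated to or freed from the inode. */
	uint32_t map_seq;
};

struct cached_mapping {
	uint64_t lblk_start;	/* first logical block covered */
	uint64_t lblk_end;	/* last logical block covered */
	uint64_t pblk;		/* physical block backing lblk_start */
	uint32_t seq;		/* inode->map_seq sampled at lookup time */
	bool     valid;
};

/* Block allocation / truncate paths call this under their own locks. */
void inode_bump_map_seq(struct inode_sketch *inode)
{
	inode->map_seq++;
}

/*
 * Writeback's ->map_blocks-style hook calls this with the folio locked.
 * Reuse the cached mapping only if it still covers @lblk and no block was
 * allocated or freed since the mapping was cached; otherwise the caller
 * must look the mapping up again before issuing the bio.
 */
bool cached_mapping_usable(const struct inode_sketch *inode,
			   const struct cached_mapping *map, uint64_t lblk)
{
	if (!map->valid)
		return false;
	if (lblk < map->lblk_start || lblk > map->lblk_end)
		return false;
	/* A truncate plus a new allocation in between changed map_seq. */
	return map->seq == inode->map_seq;
}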
> OR, one other approach to this can be...
> 
> IIUC, for a new block mapping for the same folio, the folio can be made to
> get invalidated once in between, right? So should this be handled in the
> folio cache somehow, to know for a given folio whether the underlying
> mapping might have changed, so that iomap should again query
> ->map_blocks()?
> ... can that also help against unnecessary re-validations of cached block
> mappings?

This would be difficult because the page cache handling code does not know
that someone has cached a block mapping somewhere...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR