Hi folks,

These patches address the data corruption first described here:

https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@xxxxxxxxxxxxxxxxxxx/

This data corruption has been seen in high profile production systems so there is some urgency to fix it. The underlying flaw is essentially a zero-day iomap bug, so whatever fix we come up with needs to be backportable to all supported stable kernels (i.e. ~4.18 onwards).

A combination of concurrent write()s, writeback IO completion and memory reclaim exposes the fact that the cached iomap held across an iomap_begin/iomap_end iteration can become stale without the iomap iterator actor being aware that the underlying filesystem extent map has changed.

Hence actions based on the iomap state (e.g. is unwritten or newly allocated) may actually be incorrect, as writeback actions may have changed that state (unwritten to written, delalloc to unwritten or written, etc). This affects partial block/page operations, where we may need to read from disk or zero cached pages depending on the actual extent state. Memory reclaim plays its part here in that it removes pages containing partial state from the page cache, exposing future partial page/block operations to incorrect behaviour.

Really, we should have known that this would be a problem - we have exactly the same issue with cached iomaps for writeback, and the ->map_blocks callback that occurs for every filesystem block we need to write back is responsible for validating that the cached iomap is still valid. The data corruption on the write() side is the result of not validating that the iomap is still valid before we initialise new pages and prepare them for data to be copied into them....

I'm not really happy with the solution I have for triggering remapping of an iomap when the current one is considered stale. Doing the right thing requires both iomap_iter() to handle stale iomaps correctly (esp. the "map is invalid before the first actor operation" case), and it requires the filesystem iomap_begin/iomap_end operations to co-operate and be aware of stale iomaps.

There are a bunch of *nasty* issues around handling failed writes in XFS that this has exposed - a failed write() that races with an mmap() based write to the same delalloc page will result in the mmap writes being silently lost if we punch out the delalloc range we allocated but didn't write to. generic/344 and generic/346 expose this bug directly if we punch out delalloc regions allocated by now-stale mappings.

Then, because we can't safely punch out the delalloc region we allocated when we have a stale iomap, we have to ensure that when we remap it the IOMAP_F_NEW flag is preserved, so that the iomap code knows it is uninitialised space that is being written into and will zero sub-page/sub-block ranges correctly. As a result, ->iomap_begin() needs to know if the previous iomap was IOMAP_F_STALE, and if so, it needs to know if that previous iomap was IOMAP_F_NEW so it can propagate the flag to the remap.

So the fix is awful, messy, and I really, really don't like it. But I don't have any better ideas right now, and the changes as presented fix the reproducer for the original data corruption and pass fstests without any XFS regressions for block size <= page size configurations.

Thoughts?

-Dave.
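
To make the revalidation idea above concrete, here is a minimal sketch - the same kind of check the XFS writeback ->map_blocks callback already performs - using stand-in types and names invented for the example, not code from the patches. The idea: the filesystem bumps a sequence number every time an inode's extent map changes, ->iomap_begin() samples it into the cached mapping, and the actor revalidates before acting on that mapping:

#include <stdbool.h>
#include <stdint.h>

struct toy_inode_fork {
	uint32_t if_seq;		/* bumped on every extent map change */
};

struct toy_iomap {
	uint64_t offset;
	uint64_t length;
	uint32_t flags;			/* TOY_IOMAP_F_NEW, TOY_IOMAP_F_STALE, ... */
	uint32_t sampled_seq;		/* if_seq value when this mapping was built */
};

/* ->iomap_begin(): record the extent map generation alongside the mapping. */
static void toy_iomap_begin(struct toy_iomap *iomap, const struct toy_inode_fork *fork)
{
	iomap->sampled_seq = fork->if_seq;
}

/*
 * Before the actor initialises pages based on the cached mapping, check
 * that writeback completion (unwritten->written conversion, delalloc
 * allocation, etc) hasn't changed the extent map underneath us.
 */
static bool toy_iomap_valid(const struct toy_iomap *iomap, const struct toy_inode_fork *fork)
{
	return iomap->sampled_seq == fork->if_seq;
}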
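
On the remapping side, a similarly rough sketch (reusing the toy_iomap above; the flag names and iterator layout are again invented purely for illustration) of how an iterator might react when an actor marks the mapping stale, carrying the IOMAP_F_NEW state across to the remap instead of punching out the delalloc space:

#define TOY_IOMAP_F_NEW		(1u << 0)
#define TOY_IOMAP_F_STALE	(1u << 1)

struct toy_iter {
	struct toy_iomap iomap;
	uint32_t stale_flags;		/* flags carried over from a stale mapping */
};

static void toy_iter_handle_stale(struct toy_iter *iter)
{
	if (!(iter->iomap.flags & TOY_IOMAP_F_STALE))
		return;

	/*
	 * The delalloc space backing a stale IOMAP_F_NEW mapping can't be
	 * punched out safely (racing mmap() writes to the same page would
	 * be silently lost), so the remap must still be treated as newly
	 * allocated space and have sub-page/sub-block ranges zeroed.
	 * Carry the flag across to the next ->iomap_begin() call.
	 */
	iter->stale_flags = iter->iomap.flags & TOY_IOMAP_F_NEW;

	/* ...then loop back to ->iomap_begin() for the unprocessed range. */
}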