On 2022/6/24 21:40, Jan Kara wrote:
On Fri 24-06-22 21:19:04, Qu Wenruo wrote:
On 2022/6/24 21:07, Jan Kara wrote:
On Fri 24-06-22 14:51:18, Christoph Hellwig wrote:
On Fri, Jun 24, 2022 at 08:30:00PM +0800, Qu Wenruo wrote:
But from my previous feedback on the subpage code, it looks like there are some
hardware archs (S390?) that cannot update page flags atomically.
I have tested a similar thing, with an extra ASSERT() to make sure the cow
fixup code never gets triggered.
At least for x86_64 and aarch64 it's OK here.
So I hope this time we can get a concrete reason why we need the
extra page Private2 bit in the first place.
I don't think atomic page flags are a thing here. I remember Jan
had chased a bug where we'd get into trouble in this area in
ext4 due to the way pages are locked down for direct I/O, but I
don't even remember seeing that on XFS. Either way the PageOrdered
check prevents a crash in that case and we really can't expect
data to properly be written back in that case.
I'm not sure I get the context 100% right but pages getting randomly dirty
behind the filesystem's back can still happen - most commonly with RDMA and
similar stuff which calls set_page_dirty() on the pages it got from
pin_user_pages() once the transfer is done.
Just curious: things like RDMA can mark those pages dirty without letting
the kernel know, but how could those pages come from the page cache? By
mmap()?
Yes, you pass a virtual address to the RDMA ioctl and it uses the memory at
that address as a target buffer for RDMA. If the target address happens to be
a mmapped file, the filesystem has problems...
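
Roughly, the problematic pattern on the driver side looks like this (a
simplified sketch, not any particular driver's code; pin_user_pages_fast()
and unpin_user_pages_dirty_lock() are the real GUP helpers, the two wrapper
functions are made up for illustration):

#include <linux/mm.h>

/*
 * Simplified sketch of what an RDMA-style driver does with a user buffer.
 * If 'uaddr' lies inside a mmap() of a file, these are page cache pages,
 * and the dirtying at unpin time happens behind the filesystem's back.
 */
static int pin_user_buffer(unsigned long uaddr, int npages,
			   struct page **pages)
{
	/* long-term pin for DMA; may grab file-backed page cache pages */
	return pin_user_pages_fast(uaddr, npages,
				   FOLL_WRITE | FOLL_LONGTERM, pages);
}

static void release_user_buffer(struct page **pages, unsigned long npages)
{
	/*
	 * The hardware may have written to the pages at any point while
	 * they were pinned; they are only marked dirty here, once the
	 * transfer is done.
	 */
	unpin_user_pages_dirty_lock(pages, npages, true);
}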
Oh my god, this is going to be a disaster.
RDMA is really almost a black box which can do anything to the pages.
If some RDMA drivers choose to mess with Private2, the btrfs
workaround is screwed up as well.
Another problem is related to subpage.
Btrfs (and iomap) both use page->private to store extra bitmaps for
subpage usage.
If RDMA changes page flags, the subpage bitmaps can easily get out of
sync with the real page flags.
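
Roughly the scheme looks like this (a simplified sketch only; the field
names and locking are approximate, not the exact in-tree btrfs_subpage
layout):

#include <linux/pagemap.h>
#include <linux/spinlock.h>

/* one bit per sector inside the page */
struct subpage_state {
	spinlock_t lock;
	unsigned long dirty_bitmap;
};

static void subpage_set_dirty(struct page *page, unsigned int sector_bit)
{
	struct subpage_state *sp = (struct subpage_state *)page->private;

	spin_lock(&sp->lock);
	set_bit(sector_bit, &sp->dirty_bitmap);
	spin_unlock(&sp->lock);
	/* the page flag is only ever set here, in sync with the bitmap... */
	SetPageDirty(page);
	/*
	 * ...so a bare SetPageDirty()/set_page_dirty() from outside the
	 * filesystem leaves the flag and the bitmap out of sync.
	 */
}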
I can no longer sleep well knowing this...
Thanks,
Qu
page_maybe_dma_pinned() should
be usable within filesystems to detect such cases and protect the
filesystem, but so far neither I nor John Hubbard has gotten around to
implementing this in the generic writeback infrastructure + some filesystem
as a sample case others could copy...
So the generic idea is just to detect whether the page is marked dirty by
traditional means, and if not, skip writeback for it and wait for a
proper notification to the fs?
Kind of. The idea is to treat page_maybe_dma_pinned() pages as permanently
dirty (because we have no control over when the hardware decides to modify
the page contents by DMA). So skip the writeback if we can (e.g. memory
cleaning type of writeback) and use bounce pages to do data integrity
writeback (which cannot be skipped).
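
As a rough illustration of that idea (a hypothetical sketch, not existing
kernel code; write_bounce_copy() and do_normal_writepage() are made-up
helpers):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int writepage_with_pin_check(struct page *page,
				    struct writeback_control *wbc)
{
	if (page_maybe_dma_pinned(page)) {
		if (wbc->sync_mode != WB_SYNC_ALL) {
			/* memory-cleaning writeback: skip, keep the page dirty */
			redirty_page_for_writepage(wbc, page);
			unlock_page(page);
			return 0;
		}
		/*
		 * Data-integrity writeback cannot be skipped, so write out
		 * a bounce copy instead of the pinned page itself, since
		 * the hardware may still modify the page at any time.
		 */
		return write_bounce_copy(page, wbc);	/* hypothetical */
	}
	return do_normal_writepage(page, wbc);		/* hypothetical */
}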
Honza