Re: [ISSUE] split_folio() and dirty IOMAP folios

David Hildenbrand <david@xxxxxxxxxx> · Thu, 21 Nov 2024 13:15:21 +0100

On 11.11.24 16:19, David Hildenbrand wrote:
On 08.11.24 10:11, David Hildenbrand wrote:
On 07.11.24 21:20, Matthew Wilcox wrote:
On Thu, Nov 07, 2024 at 05:34:40PM +0100, David Hildenbrand wrote:
On 07.11.24 17:09, Matthew Wilcox wrote:
On Thu, Nov 07, 2024 at 04:07:08PM +0100, David Hildenbrand wrote:
I'm debugging an interesting problem: split_folio() will fail on dirty
folios on XFS, and I am not sure who will trigger the writeback in a timely
manner so code relying on the split to work at some point (in sane setups
where page pinning is not applicable) can make progress.

You could call something like filemap_write_and_wait_range()?

Thanks, have to look into some details of that.

Looks like the folio_clear_dirty_for_io() is buried in
folio_prepare_writeback(), so that part is taken care of.

Guess I have to go from folio to "mapping,lstart,lend" such that
__filemap_fdatawrite_range() would look up the folio again. Sounds doable.

(I assume I have to drop the folio lock+reference before calling that)
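
The folio-to-byte-range step mentioned above can be sketched in plain userspace C. This is only an illustration, not kernel code: struct byte_range, folio_byte_range() and MODEL_PAGE_SHIFT are hypothetical names, and a 4 KiB base page size is assumed. The one detail worth noting is that filemap_write_and_wait_range() takes an inclusive end offset, so the last byte of the folio is start + size - 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages; the real value is architecture-specific. */
#define MODEL_PAGE_SHIFT 12

/* Hypothetical helper type: byte range covered by one folio. */
struct byte_range {
	int64_t lstart; /* first byte covered by the folio */
	int64_t lend;   /* last byte covered by the folio (inclusive) */
};

/*
 * index: page-cache index of the folio's first page (in base pages)
 * order: log2 of the number of base pages in the folio
 */
static struct byte_range folio_byte_range(uint64_t index, unsigned int order)
{
	struct byte_range r;
	int64_t size;

	/* Keep the shift well inside the range of a signed 64-bit offset. */
	assert(MODEL_PAGE_SHIFT + order < 63);

	size = (int64_t)1 << (MODEL_PAGE_SHIFT + order);
	r.lstart = (int64_t)(index << MODEL_PAGE_SHIFT);
	r.lend = r.lstart + size - 1; /* inclusive end */
	return r;
}
```

For example, an order-9 folio (2 MiB with 4 KiB pages) starting at page index 512 covers bytes 2097152 through 4194303.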

I was thinking you'd do it higher in the callchain than
gmap_make_secure().  Presumably userspace says "I want to make this
256MB range secure" and we can start by writing back that entire
256MB chunk of address space.

That doesn't prevent anybody from dirtying it in-between, of course,
so you can still get -EBUSY and have to loop round again.

I'm afraid that won't really work.

On the one hand, we might be allocating these pages (+disk blocks)
during the unpack operation -- where we essentially trigger page faults
first using gmap_fault() -- so the pages might not even exist before the
gmap_make_secure() during unpack. One workaround would be to
preallocate+writeback from user space, but it doesn't sound quite right.

But the bigger problem I see is that the initial "unpack" operation is
not the only case where we trigger this conversion to "secure" state.
Once the VM is running, we can see calls on arbitrary guest memory even
during page faults, when gmap_make_secure() is called via
gmap_convert_to_secure().

I'm still not sure why we see essentially no progress being made, even
though we temporarily drop the PTL, mmap lock, folio lock, folio ref ...
maybe related to us triggering a write fault that somehow ends up
setting the folio dirty :/ Or because writeback is simply too slow /
backs off.

I'll play with handling -EBUSY from split_folio() differently: if the
folio is under writeback, wait on that. If the folio is dirty, trigger
writeback. And I'll look into whether we really need a writable PTE, I
suspect not, because we are not actually "modifying" page content.

The following hack makes it fly:

          case -E2BIG:
                  folio_lock(folio);
                  rc = split_folio(folio);
+               if (rc == -EBUSY) {
+                       if (folio_test_dirty(folio) && !folio_test_anon(folio) &&
+                           folio->mapping) {
+                               struct address_space *mapping = folio->mapping;
+                               loff_t lstart = folio_pos(folio);
+                               loff_t lend = lstart + folio_size(folio);
+
+                               folio_unlock(folio);
+                               /* Mapping can go away ... */
+                               filemap_write_and_wait_range(mapping, lstart, lend);
+                       } else {
+                               folio_unlock(folio);
+                       }
+                       folio_wait_writeback(folio);
+                       folio_lock(folio);
+                       split_folio(folio);
+                       folio_unlock(folio);
+                       folio_put(folio);
+                       return -EAGAIN;
+               }
                  folio_unlock(folio);
                  folio_put(folio);

I think the reason why we don't make any progress on s390x is that the writeback will
mark the folio clean and turn the folio read-only in the page tables as well. So when we
look up the folio again in the page table, we see that the PTE is not writable and
trigger a write fault ...

... the write fault will mark the folio dirty again, so the split will never succeed.

In the above diff, we really must try the split_folio() a second time after waiting, otherwise we
run into the same endless loop.
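
The loop described above, and why the immediate second split attempt breaks it, can be modelled in a few lines of userspace C. This is a toy sketch under simplifying assumptions, not kernel code: struct model_folio and every model_* function are hypothetical names, and the model only tracks three booleans (dirty, PTE writable, split done).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state; none of these names are kernel APIs. */
struct model_folio {
	bool dirty;        /* folio has data that was not written back yet */
	bool pte_writable; /* page table entry currently permits writes */
	bool split;        /* the large folio has been split */
};

/* Writeback cleans the folio and write-protects its mapping. */
static void model_writeback(struct model_folio *f)
{
	f->dirty = false;
	f->pte_writable = false;
}

/* The conversion path wants a writable PTE; fault one in if needed. */
static void model_lookup_writable(struct model_folio *f)
{
	if (!f->pte_writable) {
		f->pte_writable = true;
		f->dirty = true; /* the write fault dirties the folio again */
	}
}

/* Splitting fails (think -EBUSY) while the folio is dirty. */
static bool model_split(struct model_folio *f)
{
	if (f->dirty)
		return false;
	f->split = true;
	return true;
}

/*
 * Each pass: look up the folio, try to split, write back on failure and
 * optionally retry the split right away, before the next lookup.
 * Returns the pass in which the split succeeded, or -1 if it never did.
 */
static int model_convert(struct model_folio *f, bool retry_after_writeback,
			 int max_passes)
{
	assert(max_passes > 0);

	for (int pass = 1; pass <= max_passes; pass++) {
		model_lookup_writable(f);
		if (model_split(f))
			return pass;
		model_writeback(f);
		if (retry_after_writeback && model_split(f))
			return pass;
	}
	return -1;
}
```

Starting from a dirty folio with a writable PTE, the model never splits without the retry, because every new lookup dirties the folio again right after writeback cleaned it. With the retry, the split succeeds in the first pass, while the folio is still clean.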

I'm still not 100% sure if we need a writable PTE; after all we are not modifying page content.
But that's just a side effect of not being able to wait for the split_folio() to make progress
in the writeback case so we can retry the split again.

After discussing this with Darrick and Willy yesterday, I think the 
reason we need a writable PTE is because we *might* modify page content:

"Requests the Ultravisor to make a page accessible to a guest. If it's 
brought in the first time, it will be cleared. If it has been exported 
before, it will be decrypted and integrity checked."

So we'll be effectively modifying the page content we will read when the 
(now secure) page is in the unprotected/exported state.

That makes things more complicated, unfortunately :)

--
Cheers,

David / dhildenb