Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

On Fri, 11 Oct 2024 at 06:06, Chris Mason <clm@xxxxxxxx> wrote:
>
> - Linus's starvation observation.  It doesn't feel like there's enough
> load to cause this, especially given us sitting in truncate, where it
> should be pretty unlikely to have multiple procs banging on the page in
> question.

Yeah, I think the starvation can only possibly happen in
fdatasync-like paths where it's waiting for existing writeback without
holding the page lock. And while Christian has had those backtraces
too, the truncate path is not one of them.
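
To make the "fdatasync-like path" concrete, this is roughly the shape
of the __filemap_fdatawait_range() pattern - a simplified sketch, not
the exact code, with "mapping"/"index"/"end" being the range we wait
on:

	struct folio_batch fbatch;

	folio_batch_init(&fbatch);
	while (filemap_get_folios_tag(mapping, &index, end,
				PAGECACHE_TAG_WRITEBACK, &fbatch)) {
		unsigned i;

		for (i = 0; i < folio_batch_count(&fbatch); i++) {
			/* waits on PG_writeback, no folio lock held */
			folio_wait_writeback(fbatch.folios[i]);
		}
		folio_batch_release(&fbatch);
		cond_resched();
	}

so if new writeback keeps getting started on the same folio, that
waiter is the one that could in theory be starved.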

That said, just because I wanted to see how nasty it is, I looked into
changing the rules for folio_wake_bit().

Christian, just to clarify, this is not for you to test - this is
very experimental - but maybe Willy has comments on it.

Because it *might* be possible to do something like the attached:
do the page flag changes atomically but without any locks if there
are no waiters, and if there is a waiter on the page, always clear
the page flag bit atomically under the waitqueue lock as we wake up
the waiter.

I changed the name (and the sense of the return value) of the
folio_xor_flags_has_waiters() function just to rule out any
possibility of semantic mixup: instead of doing the xor atomically
and unconditionally (and returning whether we had waiters), it now
does it only if we do *not* have waiters, and returns true if
successful.

And if there were waiters, it moves the flag clearing into the wakeup function.

That in turn means that the "while writeback" loop can go back to
being just a non-looping "if writeback", and folio_wait_writeback()
can no longer be starved by new writebacks constantly showing up.

The reason I say it *might* be possible to do something like this is
that it changes __folio_end_writeback() to no longer necessarily clear
the writeback bit under the XA lock. If there are waiters, we'll clear
it later (after releasing the lock) in the caller.
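
For context, the caller here is folio_end_writeback(), which roughly
looks like this (simplified, leaving out the reclaim/rotation
handling):

	void folio_end_writeback(struct folio *folio)
	{
		VM_BUG_ON_FOLIO(!folio_test_writeback(folio), folio);

		folio_get(folio);
		/* "had waiters": folio_wake_bit() now also clears PG_writeback */
		if (__folio_end_writeback(folio))
			folio_wake_bit(folio, PG_writeback);
		acct_reclaim_writeback(folio);
		folio_put(folio);
	}

so the "there were waiters" return value already funnels into
folio_wake_bit(), and the only change is who does the actual bit
clearing.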

Willy? What do you think? Clearly this now makes PG_writeback not
synchronized with the PAGECACHE_TAG_WRITEBACK tag, but the reason I
think it might be ok is that the code that *sets* the PG_writeback bit
in __folio_start_writeback() only ever starts with a page that isn't
under writeback, and has a

        VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);

at the top of the function even outside the XA lock. So I don't think
these *need* to be synchronized under the XA lock, and I think the
folio flag wakeup atomicity might be more important than the XA
writeback tag vs folio writeback bit.
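
For reference, this is roughly what the setter side looks like,
stripped of all the bdi/wb accounting - a sketch, not the exact code:

	void __folio_start_writeback(struct folio *folio, bool keep_write)
	{
		struct address_space *mapping = folio_mapping(folio);

		/* checked on the folio flags alone, no XA lock held */
		VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);

		if (mapping && mapping_use_writeback_tags(mapping)) {
			XA_STATE(xas, &mapping->i_pages, folio_index(folio));
			unsigned long flags;

			xas_lock_irqsave(&xas, flags);
			xas_load(&xas);
			folio_test_set_writeback(folio);
			xas_set_mark(&xas, PAGECACHE_TAG_WRITEBACK);
			if (!keep_write)
				xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
			xas_unlock_irqrestore(&xas, flags);
		} else {
			folio_test_set_writeback(folio);
		}
	}

so nothing on the setter side depends on seeing the XA tag and the
folio flag change together under the lock.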

But I'm not really going to argue for this patch at all - I wanted
to see how bad it was, so I wrote it. I'm actually running it on my
machine now, and it didn't *immediately* blow up in my face, so it
*may* work just fine.

The patch is fairly simple, and apart from the XA tagging issue it
seems very straightforward. I'm just not sure it's worth synchronizing
one part only to de-synchronize another at the same time...

                   Linus
From 9d4f0d60abc4dce5b7cfbad4576a2829832bb838 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Date: Sat, 12 Oct 2024 09:34:24 -0700
Subject: [PATCH] Test atomic folio bit waiting

---
 include/linux/page-flags.h | 26 ++++++++++++++++----------
 mm/filemap.c               | 28 ++++++++++++++++++++++++++--
 mm/page-writeback.c        |  6 +++---
 3 files changed, 45 insertions(+), 15 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1b3a76710487..b30a73e1c2c7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -730,22 +730,28 @@ TESTPAGEFLAG_FALSE(Ksm, ksm)
 u64 stable_page_flags(const struct page *page);
 
 /**
- * folio_xor_flags_has_waiters - Change some folio flags.
+ * folio_xor_flags_no_waiters - Change folio flags if no waiters
  * @folio: The folio.
- * @mask: Bits set in this word will be changed.
+ * @mask: Which flags to change.
  *
- * This must only be used for flags which are changed with the folio
- * lock held.  For example, it is unsafe to use for PG_dirty as that
- * can be set without the folio lock held.  It can also only be used
- * on flags which are in the range 0-6 as some of the implementations
- * only affect those bits.
+ * This does the optimistic fast case of changing page flag bits on a
+ * folio that has no waiters. Only flags in the first word can be modified,
+ * and the old value must be stable (typically this clears the
+ * locked or writeback bit or similar).
  *
- * Return: Whether there are tasks waiting on the folio.
+ * Return: true if it succeeded
  */
-static inline bool folio_xor_flags_has_waiters(struct folio *folio,
+static inline bool folio_xor_flags_no_waiters(struct folio *folio,
 		unsigned long mask)
 {
-	return xor_unlock_is_negative_byte(mask, folio_flags(folio, 0));
+	const unsigned long waiter_mask = 1ul << PG_waiters;
+	unsigned long *flags = folio_flags(folio, 0);
+	unsigned long val = READ_ONCE(*flags);
+	do {
+		if (val & waiter_mask)
+			return false;
+	} while (!try_cmpxchg_release(flags, &val, val ^ mask));
+	return true;
 }
 
 /**
diff --git a/mm/filemap.c b/mm/filemap.c
index 664e607a71ea..5fbaf6cea964 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1164,6 +1164,14 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	return (flags & WQ_FLAG_EXCLUSIVE) != 0;
 }
 
+/*
+ * Clear the folio bit and wake waiters atomically under
+ * the folio waitqueue lock.
+ *
+ * Note that the fast-path alternative to calling this is
+ * to atomically clear the bit and check that the PG_waiters
+ * bit was not set.
+ */
 static void folio_wake_bit(struct folio *folio, int bit_nr)
 {
 	wait_queue_head_t *q = folio_waitqueue(folio);
@@ -1175,6 +1183,7 @@ static void folio_wake_bit(struct folio *folio, int bit_nr)
 	key.page_match = 0;
 
 	spin_lock_irqsave(&q->lock, flags);
+	clear_bit_unlock(bit_nr, folio_flags(folio, 0));
 	__wake_up_locked_key(q, TASK_NORMAL, &key);
 
 	/*
@@ -1507,7 +1516,7 @@ void folio_unlock(struct folio *folio)
 	BUILD_BUG_ON(PG_waiters != 7);
 	BUILD_BUG_ON(PG_locked > 7);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	if (folio_xor_flags_has_waiters(folio, 1 << PG_locked))
+	if (!folio_xor_flags_no_waiters(folio, 1 << PG_locked))
 		folio_wake_bit(folio, PG_locked);
 }
 EXPORT_SYMBOL(folio_unlock);
@@ -1535,10 +1544,25 @@ void folio_end_read(struct folio *folio, bool success)
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
 
+	/*
+	 * Try to clear 'locked' at the same time as setting 'uptodate'
+	 *
+	 * Note that if we have lock bit waiters and this fast-case fails,
+	 * we'll have to clear the lock bit atomically under the folio wait
+	 * queue lock, so then we'll set 'uptodate' separately.
+	 *
+	 * Note that this is purely an "avoid multiple atomics in the
+	 * common case" - while the locked bit needs to be cleared
+	 * synchronously wrt waiters, the uptodate bit has no such
+	 * requirements.
+	 */
 	if (likely(success))
 		mask |= 1 << PG_uptodate;
-	if (folio_xor_flags_has_waiters(folio, mask))
+	if (!folio_xor_flags_no_waiters(folio, mask)) {
+		if (success)
+			set_bit(PG_uptodate, folio_flags(folio, 0));
 		folio_wake_bit(folio, PG_locked);
+	}
 }
 EXPORT_SYMBOL(folio_end_read);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fcd4c1439cb9..3277bc3ceff9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -3081,7 +3081,7 @@ bool __folio_end_writeback(struct folio *folio)
 		unsigned long flags;
 
 		xa_lock_irqsave(&mapping->i_pages, flags);
-		ret = folio_xor_flags_has_waiters(folio, 1 << PG_writeback);
+		ret = !folio_xor_flags_no_waiters(folio, 1 << PG_writeback);
 		__xa_clear_mark(&mapping->i_pages, folio_index(folio),
 					PAGECACHE_TAG_WRITEBACK);
 		if (bdi->capabilities & BDI_CAP_WRITEBACK_ACCT) {
@@ -3099,7 +3099,7 @@ bool __folio_end_writeback(struct folio *folio)
 
 		xa_unlock_irqrestore(&mapping->i_pages, flags);
 	} else {
-		ret = folio_xor_flags_has_waiters(folio, 1 << PG_writeback);
+		ret = !folio_xor_flags_no_waiters(folio, 1 << PG_writeback);
 	}
 
 	lruvec_stat_mod_folio(folio, NR_WRITEBACK, -nr);
@@ -3184,7 +3184,7 @@ EXPORT_SYMBOL(__folio_start_writeback);
  */
 void folio_wait_writeback(struct folio *folio)
 {
-	while (folio_test_writeback(folio)) {
+	if (folio_test_writeback(folio)) {
 		trace_folio_wait_writeback(folio, folio_mapping(folio));
 		folio_wait_bit(folio, PG_writeback);
 	}
-- 
2.46.1.608.gc56f2c11c8

