Re: 6.9/BUG: Bad page state in process kswapd0 pfn:d6e840

David Hildenbrand <david@xxxxxxxxxx> · Wed, 29 May 2024 08:57:48 +0200

On 28.05.24 16:24, David Hildenbrand wrote:
On 28.05.24 15:57, David Hildenbrand wrote:
On 28.05.24 08:05, Mikhail Gavrilov wrote:
On Thu, May 23, 2024 at 12:05 PM Mikhail Gavrilov
<mikhail.v.gavrilov@xxxxxxxxx> wrote:

On Thu, May 9, 2024 at 10:50 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

The only known workload that causes this is updating a large
container. Unfortunately, not every container update reproduces the
problem.

Is it possible to add more debugging information to make it clearer
what's going on?

If we knew who originally allocated that problematic page, that might help.
Maybe page_owner could give some hints?


BUG: Bad page state in process kcompactd0  pfn:605811
page: refcount:0 mapcount:0 mapping:0000000082d91e3e index:0x1045efc4f pfn:0x605811
aops:btree_aops ino:1
flags: 0x17ffffc600020c(referenced|uptodate|workingset|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc600020c dead000000000100 dead000000000122 ffff888159075220
raw: 00000001045efc4f 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping

Seems to be an order-0 page, otherwise we would have another "head: ..." report.

It's not an anon/ksm/non-lru migration folio, because we clear the page->mapping
field for them manually on the page freeing path. Likely it's a pagecache folio.

So one option is that something does not properly set folio->mapping to
NULL. But wouldn't that problem then also show up without page migration? Hmm.
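
As a cross-check of the "likely a pagecache folio" reading, the low bits of the
raw mapping word can be decoded by hand. The following is only a userspace
sketch, not kernel code. It assumes the 6.9 struct page layout, in which the
fourth word of the first "raw:" line is page->mapping, and it mirrors the
PAGE_MAPPING_ANON (0x1) and PAGE_MAPPING_MOVABLE (0x2) bit values from
include/linux/page-flags.h. The names classify_mapping(), mapping_word() and
the KIND_* enum are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match PAGE_MAPPING_* in include/linux/page-flags.h (6.9). */
#define MAPPING_ANON    UINT64_C(0x1)
#define MAPPING_MOVABLE UINT64_C(0x2)
#define MAPPING_FLAGS   (MAPPING_ANON | MAPPING_MOVABLE)

/* Hypothetical names, for illustration only. */
enum mapping_kind { KIND_PLAIN, KIND_ANON, KIND_MOVABLE, KIND_KSM };

/* Assumption: word 3 of the raw dump is page->mapping. */
static uint64_t mapping_word(const uint64_t *raw)
{
	assert(raw != NULL);
	return raw[3];
}

/* Classify a raw mapping value by its two low tag bits. */
static enum mapping_kind classify_mapping(uint64_t raw_mapping)
{
	switch (raw_mapping & MAPPING_FLAGS) {
	case MAPPING_ANON:
		return KIND_ANON;
	case MAPPING_MOVABLE:
		return KIND_MOVABLE;
	case MAPPING_FLAGS:
		return KIND_KSM;	/* both bits set */
	default:
		return KIND_PLAIN;	/* untagged address_space pointer */
	}
}
```

Feeding it the value from the dump above, ffff888159075220, gives KIND_PLAIN,
because both low bits are clear. That is consistent with an untagged
address_space pointer, but it does not prove anything about who left it set.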

Hardware name: ASUS System Product Name/ROG STRIX B650E-I GAMING WIFI,
BIOS 2611 04/07/2024
Call Trace:
   <TASK>
   dump_stack_lvl+0x84/0xd0
   bad_page.cold+0xbe/0xe0
   ? __pfx_bad_page+0x10/0x10
   ? page_bad_reason+0x9d/0x1f0
   free_unref_page+0x838/0x10e0
   __folio_put+0x1ba/0x2b0
   ? __pfx___folio_put+0x10/0x10
   ? __pfx___might_resched+0x10/0x10

I suspect we come via
      migrate_pages_batch()->migrate_folio_unmap()->migrate_folio_done().

Maybe this is the "Folio was freed from under us. So we are done." path
when "folio_ref_count(src) == 1".

Alternatively, we might come via
      migrate_pages_batch()->migrate_folio_move()->migrate_folio_done().

For ordinary migration, move_to_new_folio() will clear src->mapping if
the folio was migrated successfully. That's the very first thing that
migrate_folio_move() does, so I doubt that is the problem.

So I suspect we are in the migrate_folio_unmap() path. But for
a !anon folio, who would be freeing the folio concurrently without clearing
folio->mapping? After all, we have to hold the folio lock while migrating.

In khugepaged:collapse_file() we manually set folio->mapping = NULL, before
dropping the reference.

Something to try, to see if the problem goes away:

diff --git a/mm/migrate.c b/mm/migrate.c
index dd04f578c19c..45e92e14c904 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1124,6 +1124,13 @@ static int migrate_folio_unmap(new_folio_t get_new_folio,
                  /* Folio was freed from under us. So we are done. */
                  folio_clear_active(src);
                  folio_clear_unevictable(src);
+               /*
+                * Anonymous and movable src->mapping will be cleared by
+                * free_pages_prepare so don't reset it here for keeping
+                * the type to work PageAnon, for example.
+                */
+               if (!folio_mapping_flags(src))
+                       src->mapping = NULL;
                  /* free_pages_prepare() will clear PG_isolated. */
                  list_del(&src->lru);
                  migrate_folio_done(src, reason);

But it does feel weird: who freed the page concurrently and didn't clear
folio->mapping ...

We don't hold the folio lock of src, though, but have the only reference. So
another possible thing might be folio refcount mis-counting: folio_ref_count()
== 1 but there are other references (e.g., from the pagecache).

Hmm, your original report mentions kswapd, so I'm getting the feeling someone
does one folio_put() too many and we are freeing a pagecache folio that is still
in the pagecache and, therefore, has folio->mapping set ... bisecting would
really help.


A little bird just told me that I missed an important piece in the dmesg 
output: "aops:btree_aops ino:1" from dump_mapping():

This is btrfs, i_ino is 1, and we don't have a dentry. Is that 
BTRFS_BTREE_INODE_OBJECTID?

Summarizing what we know so far:
(1) Freeing an order-0 btrfs folio where folio->mapping
    is still set
(2) Triggered by kswapd and kcompactd; not triggered by other means of
    page freeing so far

Possible theories:
(A) folio->mapping not cleared when freeing the folio. But shouldn't
    this also happen on other freeing paths? Or are we simply lucky to
    never trigger that for that folio?
(B) Messed-up refcounting: freeing a folio that is still in use (and
    therefore has folio->mapping still set)
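
To make theory (B) concrete, here is a toy userspace model of the refcounting.
It is a sketch only and has nothing to do with the real folio code: struct
toy_folio and the toy_*() helpers are invented names, the "pagecache" is
reduced to a single reference plus a mapping pointer, and the bad-state check
only mimics the "non-NULL mapping" test done when a page is freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio; all names here are hypothetical. */
struct toy_folio {
	int refcount;
	const void *mapping;	/* non-NULL while "in the pagecache" */
	bool freed;
	bool bad_state;		/* freed while mapping was still set */
};

static void toy_get(struct toy_folio *f)
{
	assert(!f->freed);
	f->refcount++;
}

/* Drop one reference; the last put "frees" and checks the mapping. */
static void toy_put(struct toy_folio *f)
{
	assert(f->refcount > 0);
	if (--f->refcount == 0) {
		f->freed = true;
		f->bad_state = (f->mapping != NULL);
	}
}

/* The pagecache takes its own reference and sets the mapping. */
static void toy_add_to_cache(struct toy_folio *f, const void *mapping)
{
	toy_get(f);
	f->mapping = mapping;
}

/* Correct removal: clear the mapping first, then drop the cache reference. */
static void toy_remove_from_cache(struct toy_folio *f)
{
	f->mapping = NULL;
	toy_put(f);
}
```

With this model, a folio that is in the cache and sees one put too many ends
up freed with its mapping still set (bad_state is true), while the path
through toy_remove_from_cache() frees it cleanly. In the real kernel the extra
put could come from anywhere, which is why bisecting would help.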

I was briefly wondering if large folio splitting could be involved.

CCing btrfs maintainers.

--
Cheers,

David / dhildenb