Re: [BUG ?] Offline Memory gets stuck in offline_pages()

David Hildenbrand <david@xxxxxxxxxx> · Mon, 1 Jul 2024 09:14:29 +0200

On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote:
Hi all

Overview:
While testing CXL memory hot-remove, we noticed that `daxctl offline-memory dax0.0`
sometimes gets stuck forever. `daxctl offline-memory dax0.0` writes "offline" to
/sys/devices/system/memory/memoryNNN/state.

Hi,

See

Documentation/admin-guide/mm/memory-hotplug.rst

"
Further, when running into out of memory situations while migrating 
pages, or when still encountering permanently unmovable pages within 
ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it 
eventually succeeds.

When offlining is triggered from user space, the offlining context can 
be terminated by sending a signal. A timeout based offlining can easily 
be implemented via::

	% timeout $TIMEOUT offline_block | failure_handling
"
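
The `timeout` one-liner quoted above can also be done from C. This is a minimal sketch, assuming the write to the state file returns -EINTR once a signal is pending, as the documentation describes. The function name and the SIGALRM-based timer are illustrative choices, not part of daxctl or of the kernel documentation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Empty handler: its only job is to interrupt the blocked write(). */
static void on_alarm(int sig) { (void)sig; }

/*
 * Hypothetical helper: write "offline" to a memory block state file and
 * give up after timeout_sec seconds.
 * state_path is assumed to look like
 * /sys/devices/system/memory/memoryNNN/state.
 * Returns 0 on success or a negative errno (-EINTR if the alarm fired).
 */
int offline_block_with_timeout(const char *state_path, unsigned int timeout_sec)
{
    static const char cmd[] = "offline";
    struct sigaction sa, old_sa;
    int fd, ret = 0;

    assert(state_path);

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;   /* no SA_RESTART, so write() is not restarted */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) < 0)
        return -errno;

    fd = open(state_path, O_WRONLY);
    if (fd < 0) {
        ret = -errno;
        goto restore;
    }

    alarm(timeout_sec);         /* deliver SIGALRM if offlining takes too long */
    if (write(fd, cmd, strlen(cmd)) < 0)
        ret = -errno;
    alarm(0);                   /* cancel the timer if the write finished first */
    close(fd);
restore:
    sigaction(SIGALRM, &old_sa, NULL);
    return ret;
}
```

A caller would treat -EINTR as "timed out, retry or report failure", which matches the Ctrl-C workaround described in the report.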

Workaround:
When it happens, we can press Ctrl-C to abort it and then retry.
Then the CXL memory is able to offline successfully.

Where the kernel gets stuck:
After digging into the kernel, we found that when the issue occurs, the kernel
is stuck in the outer loop of offline_pages(). Below is a piece of the
highlighted offline_pages():

```
int __ref offline_pages()
{
    do { // outer loop
        pfn = start_pfn;
        do {
            ret = scan_movable_pages(pfn, end_pfn, &pfn);  // Returns -ENOENT
            if (!ret)
                do_migrate_range(pfn, end_pfn);            // Not reached
        } while (!ret);
        ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
    } while (ret);                                         // ret is -EBUSY
}
```
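
To make the stuck state concrete, here is a toy user-space model of the two loops. It is only a sketch: `struct toy_page`, the `toy_*` helpers and the `max_rounds` bound are hypothetical and do not exist in the kernel, where the outer loop has no such bound. A page that is neither found by the scan nor accepted by the isolation test keeps the outer loop spinning.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-page state for this toy model. */
struct toy_page {
    bool movable_in_use;  /* the scan step would pick this page up */
    bool isolated_free;   /* the isolation test would accept this page */
};

/* Returns 0 if there is something to migrate, -ENOENT otherwise. */
static int toy_scan_movable(const struct toy_page *p, int n)
{
    for (int i = 0; i < n; i++)
        if (p[i].movable_in_use)
            return 0;
    return -ENOENT;
}

/* "Migrate" every in-use page: afterwards it counts as isolated and free. */
static void toy_migrate(struct toy_page *p, int n)
{
    for (int i = 0; i < n; i++) {
        if (p[i].movable_in_use) {
            p[i].movable_in_use = false;
            p[i].isolated_free = true;
        }
    }
}

/* Returns 0 if every page is isolated and free, -EBUSY otherwise. */
static int toy_test_isolated(const struct toy_page *p, int n)
{
    for (int i = 0; i < n; i++)
        if (!p[i].isolated_free)
            return -EBUSY;
    return 0;
}

/* Same loop shape as the excerpt, plus an artificial bound on the outer loop. */
int toy_offline(struct toy_page *pages, int n, int max_rounds, int *rounds_used)
{
    int ret, rounds = 0;

    assert(n > 0 && max_rounds > 0);
    do {                                    /* outer loop */
        do {                                /* inner loop */
            ret = toy_scan_movable(pages, n);
            if (!ret)
                toy_migrate(pages, n);
        } while (!ret);                     /* leaves with -ENOENT */
        ret = toy_test_isolated(pages, n);  /* 0 or -EBUSY */
        rounds++;
    } while (ret && rounds < max_rounds);

    if (rounds_used)
        *rounds_used = rounds;
    return ret;
}
```

With one page set to `{ false, false }`, `toy_offline()` uses up all of its rounds and returns -EBUSY, which mirrors the behavior described above.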

In this case, we dumped the first page that cannot be isolated (see dump_page below); its
content does not change between iterations:
```
Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
```

Are you sure that's the problematic page?

refcount:0

Indicates that the page is free. But maybe it does not have PageBuddy() set.

It could also be that this is a "tail" page of a PageBuddy() page, and 
somehow we always end up on the tail in test_pages_isolated().

Which kernel + architecture are you testing with?

Every time the issue occurs, the content of the page structure is similar.

Questions:
Q1. Is this behavior expected? At least for an OS administrator, it should return
      promptly (success or failure) instead of hanging indefinitely.

It's expected that it might take a long time (possibly forever) in 
corner cases. See documentation.

But it's likely unexpected that we have some problematic page here.

Q2. Regarding the offline_pages() function, encountering such a page indeed causes
      an endless loop. Shouldn't another part of the kernel change the state
      of this page in a timely manner?

There are various things that can go wrong. One issue might be that we 
try migrating a page but continuously fail to allocate memory to be used 
as a migration target. It seems unlikely with the page you dumped above, 
though.

Is that CXL memory perhaps on a separate "fake" NUMA node, with your 
workload mbind()ing itself to that NUMA node and possibly refusing to 
migrate somewhere else?

      When I use the workaround mentioned above (Ctrl-C and try offline again), I find
      that the page state changes (see dump_page below):
```
Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
```
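
The visible difference between the two dumps is in raw words 1 and 2. A small decoder makes that easier to check. This sketch assumes a 64-bit little-endian x86-64 `struct page` layout (word 0 flags, words 1 and 2 the list pointers, low half of word 6 the page type / mapcount, high half the refcount); the architecture has not been confirmed in this thread. The names `raw_page`, `TOY_LIST_POISON1` and `TOY_LIST_POISON2` are made up here; the poison values are the x86-64 ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86-64 list poison values (assumption: POISON_POINTER_DELTA is 0xdead000000000000). */
#define TOY_LIST_POISON1 UINT64_C(0xdead000000000100)
#define TOY_LIST_POISON2 UINT64_C(0xdead000000000122)

/* The eight "raw:" words of one dump_page() output, in printed order. */
struct raw_page {
    uint64_t w[8];
};

/* True if the list pointers look like they were poisoned by list_del(). */
bool lru_is_poisoned(const struct raw_page *p)
{
    assert(p);
    return p->w[1] == TOY_LIST_POISON1 && p->w[2] == TOY_LIST_POISON2;
}

/* Low 32 bits of word 6: page type / mapcount field (assumed layout). */
uint32_t raw_page_type(const struct raw_page *p)
{
    assert(p);
    return (uint32_t)(p->w[6] & UINT64_C(0xffffffff));
}

/* High 32 bits of word 6: reference count (assumed layout). */
uint32_t raw_refcount(const struct raw_page *p)
{
    assert(p);
    return (uint32_t)(p->w[6] >> 32);
}
```

Fed with the values above, the first dump has ordinary-looking list pointers while the second has poisoned ones, and both show refcount 0 with a page type field of 0xffffffff. One possible reading is that the page was still linked on some list while offlining was stuck and had been unlinked by the time of the retry, but that is an inference from the dumps and has not been confirmed.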

What our test does:
We have a CXL memory device, which is configured as kmem and onlined into the MOVABLE
zone as NUMA node 2. We run two processes, consume-memory and offline-memory, in parallel;
see the pseudocode below:

```
main()
{
      if (fork() == 0)
          numactl -m 2 ./consume-memory
```

What exactly does "consume-memory" do? Does it involve hugetlb maybe?

--
Cheers,

David / dhildenb