[The following attachments were deleted from the original message: radixcheck.py]

Original Message:

On 9/18/24 2:37 AM, Jens Axboe wrote:
> On 9/17/24 7:25 AM, Matthew Wilcox wrote:
>> On Tue, Sep 17, 2024 at 01:13:05PM +0200, Chris Mason wrote:
>>> On 9/17/24 5:32 AM, Matthew Wilcox wrote:
>>>> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>>>>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>>>>> trying to bash on the ENOMEM for readahead case.  There's a GFP_NOWARN
>>>>> on those, and our systems do run pretty short on ram, so it feels right
>>>>> at least.  We'll see.
>>>>
>>>> I've been running with some variant of this patch the whole way across
>>>> the Atlantic, and not hit any problems.  But maybe with the right
>>>> workload ...?
>>>>
>>>> There are two things being tested here.  One is whether we have a
>>>> cross-linked node (ie a node that's in two trees at the same time).
>>>> The other is whether the slab allocator is giving us a node that
>>>> already contains non-NULL entries.
>>>>
>>>> If you could throw this on top of your kernel, we might stand a chance
>>>> of catching the problem sooner.  If it is one of these problems and
>>>> not something weirder.
>>>>
>>>
>>> This fires in roughly 10 seconds for me on top of v6.11.  Since array
>>> seems to always be 1, I'm not sure if the assertion is right, but
>>> hopefully you can trigger yourself.
>>
>> Whoops.
>>
>> $ git grep XA_RCU_FREE
>> lib/xarray.c:#define XA_RCU_FREE	((struct xarray *)1)
>> lib/xarray.c:		node->array = XA_RCU_FREE;
>>
>> so you walked into a node which is currently being freed by RCU.  Which
>> isn't a problem, of course.  I don't know why I do that; it doesn't seem
>> like anyone tests it.  The jetlag is seriously kicking in right now,
>> so I'm going to refrain from saying anything more because it probably
>> won't be coherent.
>
> Based on a modified reproducer from Chris (N threads reading from a
> file, M threads dropping pages), I can pretty quickly reproduce the
> xas_descend() spin on 6.9 in a vm with 128 cpus.  Here's some debugging
> output with a modified version of your patch too, that ignores
> XA_RCU_FREE:

Jens and I are running slightly different versions of reader.c, but we're
seeing the same thing.  v6.11 lasts all night long, and reverting those two
commits falls over in about 5 minutes or less.

I switched from a VM to bare metal, and managed to hit an assertion I'd
added to filemap_get_read_batch() (should look familiar):

	{
		struct address_space *fmapping = READ_ONCE(folio->mapping);
		BUG_ON(fmapping && fmapping != mapping);
	}

Walking the xarray in the crashdump shows that it's probably the same
corruption I saw in 5.19.  drgn is printing like so:

print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" %
      (page.address_of_(), page.mapping.value_(), index, page.index,
       page.flags, decode_page_flags(page), folio._folio_nr_pages))

And I attached radixcheck.py if you want to see the full script.
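[Since the list gateway stripped radixcheck.py, here is a rough sketch of the kind of walk that produces output like the below; it only runs under drgn against a crash dump or live kernel, and the function name `dump_mapping` plus how the script finds its `mapping` are my guesses, not the original script. The helpers (`xa_for_each`, `decode_page_flags`) are drgn's Linux kernel helpers.]

```python
# Sketch only -- requires drgn with a kernel/crashdump target.
# Not the original radixcheck.py; just the shape of the walk.
from drgn import cast
from drgn.helpers.linux.mm import decode_page_flags
from drgn.helpers.linux.xarray import xa_for_each

def dump_mapping(mapping):
    # Walk every slot of the address_space's xarray (i_pages) and print
    # the page each slot points at; a slot whose page->mapping disagrees
    # with the mapping being walked is a cross-linked entry.
    for index, entry in xa_for_each(mapping.i_pages.address_of_()):
        folio = cast("struct folio *", entry)
        page = folio.page
        print("0x%x mapping 0x%x radix index %d page index %d "
              "flags 0x%x (%s) size %d" %
              (page.address_of_(), page.mapping.value_(), index,
               page.index, page.flags, decode_page_flags(page),
               folio._folio_nr_pages))
```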
These are all from the correct mapping:

0xffffea0088b17200 mapping 0xffff88a22a9614e8 radix index 53 page index 53 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 59472
0xffffea008773e940 mapping 0xffff88a22a9614e8 radix index 54 page index 54 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4244589144
0xffffea0084ad1d00 mapping 0xffff88a22a9614e8 radix index 55 page index 55 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4040059330
0xffffea0088c9d840 mapping 0xffff88a22a9614e8 radix index 56 page index 56 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 5958
0xffffea00879c6300 mapping 0xffff88a22a9614e8 radix index 57 page index 57 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 112
0xffffea0086630980 mapping 0xffff88a22a9614e8 radix index 58 page index 58 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4025236287
0xffffea0008eb6580 mapping 0xffff88a22a9614e8 radix index 59 page index 59 flags 0x5ffff000000012c (PG_referenced|PG_uptodate|PG_lru|PG_active|PG_reported) size 269
0xffffea00072db000 mapping 0xffff88a22a9614e8 radix index 60 page index 60 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4
0xffffea000919b600 mapping 0xffff88a22a9614e8 radix index 64 page index 64 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4

These last 3 are not:

0xffffea0008fa7000 mapping 0xffff888124910768 radix index 208 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 224 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 240 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64

I think the bug was in __filemap_add_folio()'s usage of xas_split_alloc()
and the tree changing before taking the lock.  It's just a guess, but that
was always my biggest suspect.

To reproduce, I used:

mkfs.xfs -f <some device>
mount <some device> /xfs
for x in `seq 1 8` ; do
	fallocate -l100m /xfs/file$x
	./reader /xfs/file$x &
done

New reader.c attached.  Jens changed his so that every reader thread was
using its own offset in the file, and he found that reproduced more
consistently.

-chris
/*
 * gcc -Wall -o reader reader.c -lpthread
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>
#include <errno.h>
#include <err.h>
#include <pthread.h>

struct thread_data {
	int fd;
	int read_size;
	size_t size;
};

/* forever tell the kernel to drop this file's page cache */
static void *drop_pages(void *arg)
{
	struct thread_data *td = arg;
	int ret;

	while (1) {
		ret = posix_fadvise(td->fd, 0, td->size, POSIX_FADV_DONTNEED);
		if (ret < 0)
			err(1, "fadvise dontneed");
	}
	return NULL;
}

#define READ_BUF (2 * 1024 * 1024)

/* forever read read_size bytes from a fixed offset in the file */
static void *read_pages(void *arg)
{
	struct thread_data *td = arg;
	char buf[READ_BUF];
	ssize_t ret;
	loff_t offset = 8192;

	while (1) {
		ret = pread(td->fd, buf, td->read_size, offset);
		if (ret < 0)
			err(1, "read");
		if (ret == 0)
			break;
	}
	return NULL;
}

int main(int ac, char **av)
{
	int fd;
	int ret;
	struct stat st;
	/* first two slots are the drop_pages threads, which ignore read_size */
	int sizes[9] = { 0, 0, 8192, 16384, 32768, 65536,
			 128 * 1024, 256 * 1024, 1024 * 1024 };
	int nr_tids = 9;
	struct thread_data tds[9];
	int i;
	int sleeps = 0;
	pthread_t tids[nr_tids];

	if (ac != 2)
		errx(1, "usage: reader filename");

	fd = open(av[1], O_RDONLY, 0600);
	if (fd < 0)
		err(1, "unable to open %s", av[1]);

	ret = fstat(fd, &st);
	if (ret < 0)
		err(1, "stat");

	for (i = 0; i < nr_tids; i++) {
		struct thread_data *td = tds + i;

		td->fd = fd;
		td->size = st.st_size;
		td->read_size = sizes[i];
		if (i < 2)
			ret = pthread_create(tids + i, NULL, drop_pages, td);
		else
			ret = pthread_create(tids + i, NULL, read_pages, td);
		if (ret)
			err(1, "pthread_create");
	}

	for (i = 0; i < nr_tids; i++)
		pthread_detach(tids[i]);

	while (1) {
		sleep(122);
		sleeps++;
		fprintf(stderr, ":%d:", sleeps * 122);
	}
}