Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

[List filter notice: the attachment radixcheck.py was stripped from the original message (rule module.access.rule.exestrip_notify).]

Original Message:

On 9/18/24 2:37 AM, Jens Axboe wrote:
> On 9/17/24 7:25 AM, Matthew Wilcox wrote:
>> On Tue, Sep 17, 2024 at 01:13:05PM +0200, Chris Mason wrote:
>>> On 9/17/24 5:32 AM, Matthew Wilcox wrote:
>>>> On Mon, Sep 16, 2024 at 10:47:10AM +0200, Chris Mason wrote:
>>>>> I've got a bunch of assertions around incorrect folio->mapping and I'm
>>>>> trying to bash on the ENOMEM for readahead case.  There's a GFP_NOWARN
>>>>> on those, and our systems do run pretty short on ram, so it feels right
>>>>> at least.  We'll see.
>>>>
>>>> I've been running with some variant of this patch the whole way across
>>>> the Atlantic, and not hit any problems.  But maybe with the right
>>>> workload ...?
>>>>
>>>> There are two things being tested here.  One is whether we have a
>>>> cross-linked node (ie a node that's in two trees at the same time).
>>>> The other is whether the slab allocator is giving us a node that already
>>>> contains non-NULL entries.
>>>>
>>>> If you could throw this on top of your kernel, we might stand a chance
>>>> of catching the problem sooner.  If it is one of these problems and not
>>>> something weirder.
>>>>
>>>
>>> This fires in roughly 10 seconds for me on top of v6.11.  Since array seems
>>> to always be 1, I'm not sure if the assertion is right, but hopefully you
>>> can trigger it yourself.
>>
>> Whoops.
>>
>> $ git grep XA_RCU_FREE
>> lib/xarray.c:#define XA_RCU_FREE        ((struct xarray *)1)
>> lib/xarray.c:   node->array = XA_RCU_FREE;
>>
>> so you walked into a node which is currently being freed by RCU.  Which
>> isn't a problem, of course.  I don't know why I do that; it doesn't seem
>> like anyone tests it.  The jetlag is seriously kicking in right now,
>> so I'm going to refrain from saying anything more because it probably
>> won't be coherent.
> 
> Based on a modified reproducer from Chris (N threads reading from a
> file, M threads dropping pages), I can pretty quickly reproduce the
> xas_descend() spin on 6.9 in a vm with 128 cpus. Here's some debugging
> output with a modified version of your patch too, that ignores
> XA_RCU_FREE:

Jens and I are running slightly different versions of reader.c, but we're
seeing the same thing.  v6.11 lasts all night long, and with those two
commits reverted it falls over in about 5 minutes or less.

I switched from a VM to bare metal, and managed to hit an assertion I'd
added to filemap_get_read_batch() (should look familiar):

{
	struct address_space *fmapping = READ_ONCE(folio->mapping);
	BUG_ON(fmapping && fmapping != mapping);
}

Walking the xarray in the crashdump shows what is probably the same
corruption I saw in 5.19.  The drgn print statement looks like this:

print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" % (page.address_of_(), page.mapping.value_(), index, page.index, page.flags, decode_page_flags(page), folio._folio_nr_pages))

And I attached radixcheck.py if you want to see the full script.
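
The gist of radixcheck.py is just a walk over the file's i_pages xarray,
roughly like this (a from-memory sketch, not the exact script; it leans on
drgn's xa_for_each() and decode_page_flags() helpers and skips shadow
entries, so treat it as untested):

#!/usr/bin/env drgn
# Rough stand-in for radixcheck.py: walk an address_space's i_pages xarray
# and dump every page/folio entry.  Pass the struct address_space address:
#   drgn radixcheck.py 0xffff88a22a9614e8
import sys

from drgn import Object, cast
from drgn.helpers.linux.mm import decode_page_flags
from drgn.helpers.linux.xarray import xa_for_each

mapping = Object(prog, "struct address_space *", int(sys.argv[1], 16))

for index, entry in xa_for_each(mapping.i_pages.address_of_()):
    if entry.value_() & 1:
        continue  # skip shadow/value entries
    page = cast("struct page *", entry)
    folio = cast("struct folio *", page)
    print("0x%x mapping 0x%x radix index %d page index %d flags 0x%x (%s) size %d" %
          (page.value_(), page.mapping.value_(), index, page.index,
           page.flags, decode_page_flags(page), folio._folio_nr_pages))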

These are all from the correct mapping:
0xffffea0088b17200 mapping 0xffff88a22a9614e8 radix index 53 page index 53 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 59472
0xffffea008773e940 mapping 0xffff88a22a9614e8 radix index 54 page index 54 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4244589144
0xffffea0084ad1d00 mapping 0xffff88a22a9614e8 radix index 55 page index 55 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4040059330
0xffffea0088c9d840 mapping 0xffff88a22a9614e8 radix index 56 page index 56 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 5958
0xffffea00879c6300 mapping 0xffff88a22a9614e8 radix index 57 page index 57 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 112
0xffffea0086630980 mapping 0xffff88a22a9614e8 radix index 58 page index 58 flags 0x15ffff000000000c (PG_referenced|PG_uptodate|PG_reported) size 4025236287
0xffffea0008eb6580 mapping 0xffff88a22a9614e8 radix index 59 page index 59 flags 0x5ffff000000012c (PG_referenced|PG_uptodate|PG_lru|PG_active|PG_reported) size 269
0xffffea00072db000 mapping 0xffff88a22a9614e8 radix index 60 page index 60 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4
0xffffea000919b600 mapping 0xffff88a22a9614e8 radix index 64 page index 64 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 4

These last 3 are not:
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 208 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 224 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64
0xffffea0008fa7000 mapping 0xffff888124910768 radix index 240 page index 192 flags 0x5ffff000000416c (PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|PG_private|PG_reported) size 64

I think the bug was in __filemap_add_folio()'s use of xas_split_alloc()
and the tree changing before we take the lock.  It's just a guess, but that
was always my biggest suspect.
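
To be concrete, the pattern I'm suspicious of is roughly this (paraphrased
sketch, not the exact __filemap_add_folio() code):

old = xa_load(xas.xa, xas.xa_index);		/* unlocked lookup */
order = xa_get_order(xas.xa, xas.xa_index);	/* ditto */
if (order > folio_order(folio))
	xas_split_alloc(&xas, old, order, gfp);	/* nodes sized for 'old' */

xas_lock_irq(&xas);
/*
 * Another insert/split can have changed the tree between the unlocked
 * lookup above and this point, so the preallocated split nodes may no
 * longer match what is actually in the tree when xas_split() runs.
 */
xas_split(&xas, old, order);
xas_unlock_irq(&xas);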

To reproduce, I used:

mkfs.xfs -f <some device>
mount <some device> /xfs
for x in `seq 1 8` ; do
	fallocate -l100m /xfs/file$x
	./reader /xfs/file$x &
done

New reader.c attached.  Jens changed his so that every reader thread uses
its own offset in the file, and he found that it reproduced more
consistently (a rough sketch of that variant follows the reader.c listing).

-chris
/*
 * gcc -Wall -o reader reader.c -lpthread
 */
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <unistd.h>
#include <errno.h>
#include <err.h>
#include <pthread.h>

struct thread_data {
	int fd;
	int read_size;
	size_t size;
};

static void *drop_pages(void *arg)
{
	struct thread_data *td = arg;
	int ret;

	/* Keep evicting the file's cached pages as fast as we can */
	while (1) {
		ret = posix_fadvise(td->fd, 0, td->size, POSIX_FADV_DONTNEED);
		if (ret) {
			/* posix_fadvise() returns an error number, not -1/errno */
			errno = ret;
			err(1, "fadvise dontneed");
		}
	}
	return NULL;
}

#define READ_BUF (2 * 1024 * 1024)
static void *read_pages(void *arg)
{
	struct thread_data *td = arg;
	char buf[READ_BUF];
	ssize_t ret;
	loff_t offset = 8192;

	while (1) {
		ret = pread(td->fd, buf, td->read_size, offset);
		if (ret < 0)
			err(1, "read");
		if (ret == 0)
			break;
	}
	return NULL;
}

int main(int ac, char **av)
{
	int fd;
	int ret;
	struct stat st;
	int sizes[9] = { 0, 0, 8192, 16384, 32768, 65536, 128 * 1024, 256 * 1024, 1024 * 1024 };
	int nr_tids = 9;
	struct thread_data tds[9];
	int i;
	int sleeps = 0;
	pthread_t tids[nr_tids];

	if (ac != 2)
		errx(1, "usage: reader filename");

	fd = open(av[1], O_RDONLY, 0600);
	if (fd < 0)
		err(1, "unable to open %s", av[1]);

	ret = fstat(fd, &st);
	if (ret < 0)
		err(1, "stat");


	for (i = 0; i < nr_tids; i++) {
		struct thread_data *td = tds + i;

		td->fd = fd;
		td->size = st.st_size;
		td->read_size = sizes[i];

		if (i < 2)
			ret = pthread_create(tids + i, NULL, drop_pages, td);
		else
			ret = pthread_create(tids + i, NULL, read_pages, td);
		if (ret)
			err(1, "pthread_create");
	}
	for (i = 0; i < nr_tids; i++) {
		pthread_detach(tids[i]);
	}
	while (1) {
		sleep(122);
		sleeps++;
		fprintf(stderr, ":%d:", sleeps * 122);
	}
}
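
For reference, Jens' per-thread-offset change is roughly along these lines
(my paraphrase, not his actual patch):

/*
 * Hypothetical variant of read_pages(): give each reader thread its own
 * starting offset instead of the shared 8192.
 */
struct thread_data {
	int fd;
	int read_size;
	size_t size;
	loff_t offset;		/* per-thread read offset */
};

static void *read_pages(void *arg)
{
	struct thread_data *td = arg;
	char buf[READ_BUF];
	ssize_t ret;

	while (1) {
		ret = pread(td->fd, buf, td->read_size, td->offset);
		if (ret < 0)
			err(1, "read");
		if (ret == 0)
			break;
	}
	return NULL;
}

/* ...and in main(), something like: td->offset = (loff_t)i * READ_BUF; */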
