Re: generic/095 failing in ext4 and xfs




On Mon, Oct 04, 2021 at 11:15:59AM +0100, Luis Henriques wrote:
> On Mon, Oct 04, 2021 at 11:08:29AM +0100, Luis Henriques wrote:
> > On Sat, Oct 02, 2021 at 08:59:57AM -0600, Jens Axboe wrote:
> > > On 10/2/21 4:16 AM, Luis Henriques wrote:
> > > > "Theodore Ts'o" <tytso@xxxxxxx> writes:
> > > > 
> > > >> On Fri, Oct 01, 2021 at 02:46:09PM -0600, Jens Axboe wrote:
> > > >>>
> > > >>> Hmm, do older versions fail? I see Ted suggested that 3.27 doesn't, can
> > > >>> you give that a go? If that does work, would be great if you could try
> > > >>> and bisect it.
> > > >>
> > > >> I just tried fio 3.28, and it worked for me.  So I don't think it's
> > > >> fio.
> > > > 
> > > > Awesome, thank you both for checking it out.  So, it's definitely
> > > > something in my test environment.
> > > > 
> > > >> Luis, could it be related to a  kernel config option?
> > > > 
> > > > Yeah, it could be.  I've tested this on a rolling release (openSUSE TW),
> > > > so it's definitely quite different from Debian 10.  It may take me a bit
> > > > to figure out what's going on, but I'll start with this kernel config and
> > > > report back any finding.
> > > > 
> > > > Again, thank you both for confirming it's working on your side.
> > > 
> > > Do you have a core file from fio? Would be interesting to get a
> > > backtrace from it.
> > 
> > Ok, not a lot of progress from my end yet, but here's some info gathered
> > with gdb from the core file:
> > 
> > #0  0x000056505966b361 in io_completed (td=0x7f2b0c5437a0, io_u_ptr=0x7ffec2403e48, icd=0x7ffec2403e60) at /usr/src/debug/fio-3.28-1.1.x86_64/io_u.c:2012
> > #1  0x000056505966b922 in ios_completed (icd=0x7ffec2403e60, td=0x7f2b0c5437a0) at /usr/src/debug/fio-3.28-1.1.x86_64/io_u.c:2086
> > #2  io_u_queued_complete (td=0x7f2b0c5437a0, min_evts=<optimized out>) at /usr/src/debug/fio-3.28-1.1.x86_64/io_u.c:2145
> > #3  0x0000565059680e88 in do_io (td=0x7f2b0c5437a0, bytes_done=0x7ffec2404070) at /usr/src/debug/fio-3.28-1.1.x86_64/backend.c:1176
> > #4  0x000056505968a8ee in thread_main (data=data@entry=0x56505ae43510) at /usr/src/debug/fio-3.28-1.1.x86_64/backend.c:1870
> > #5  0x000056505968ca48 in run_threads (sk_out=0x0) at /usr/src/debug/fio-3.28-1.1.x86_64/backend.c:2460
> > #6  0x000056505968cb55 in fio_backend (sk_out=0x0) at /usr/src/debug/fio-3.28-1.1.x86_64/backend.c:2597
> > #7  fio_backend (sk_out=0x0) at /usr/src/debug/fio-3.28-1.1.x86_64/backend.c:2558
> > #8  0x000056505962fd97 in main (argc=4, argv=0x7ffec240c448, envp=<optimized out>) at /usr/src/debug/fio-3.28-1.1.x86_64/fio.c:60
> > 
> > And here's the io_completed() code where the crash occurs:
> > 
> >    2007                 if (io_u->resid) {
> >    2008                         io_u->xfer_buflen = io_u->resid;
> >    2009                         io_u->xfer_buf += bytes;
> >    2010                         io_u->offset += bytes;
> >    2011                         td->ts.short_io_u[io_u->ddir]++;
> >    2012                         if (io_u->offset < io_u->file->real_file_size) {
> >    2013                                 requeue_io_u(td, io_u_ptr);
> >    2014                                 return;
> >    2015                         }
> >    2016                 }
> 
> I forgot to include the kernel log.  The page cache error seems relevant,
> and, as I said before, I'm seeing it both on ext4 and xfs:
> 
> [   38.014790] fio[762]: segfault at 30 ip 000056505966b361 sp 00007ffec2403df0 error 4 in fio[56505962e000+84000]
> [   38.016320] Code: c1 48 85 c0 74 2e 48 89 45 68 48 8b 45 40 48 63 55 2c 4c 01 4d 60 4c 01 c8 48 89 45 40 49 83 84 d4 70 5d 02 00 01 48 8b 55 20 <48> 3b 42 30 0f 82 75 026
> [   38.016839] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!
> [   38.019520] fio[760]: segfault at 30 ip 000056505966b361 sp 00007ffec2403df0 error 4 in fio[56505962e000+84000]
> [   38.020543] File: /mnt/scratch/file1 PID: 754 Comm: fio
> [   38.022056] Code: c1 48 85 c0 74 2e 48 89 45 68 48 8b 45 40 48 63 55 2c 4c 01 4d 60 4c 01 c8 48 89 45 40 49 83 84 d4 70 5d 02 00 01 48 8b 55 20 <48> 3b 42 30 0f 82 75 026
> [   38.052142] fio[761]: segfault at 30 ip 000056505966b361 sp 00007ffec2403df0 error 4 in fio[56505962e000+84000]
> [   38.053545] Code: c1 48 85 c0 74 2e 48 89 45 68 48 8b 45 40 48 63 55 2c 4c 01 4d 60 4c 01 c8 48 89 45 40 49 83 84 d4 70 5d 02 00 01 48 8b 55 20 <48> 3b 42 30 0f 82 75 026
> [   38.058111] fio[759]: segfault at 30 ip 000056505966b361 sp 00007ffec2403df0 error 4 in fio[56505962e000+84000]
> [   38.059511] Code: c1 48 85 c0 74 2e 48 89 45 68 48 8b 45 40 48 63 55 2c 4c 01 4d 60 4c 01 c8 48 89 45 40 49 83 84 d4 70 5d 02 00 01 48 8b 55 20 <48> 3b 42 30 0f 82 75 026
> [   38.065638] fio[758]: segfault at 30 ip 000056505966b361 sp 00007ffec2403df0 error 4 in fio[56505962e000+84000]
> [   38.067055] Code: c1 48 85 c0 74 2e 48 89 45 68 48 8b 45 40 48 63 55 2c 4c 01 4d 60 4c 01 c8 48 89 45 40 49 83 84 d4 70 5d 02 00 01 48 8b 55 20 <48> 3b 42 30 0f 82 75 026

Ok, I may have narrowed it down a bit more.  The disks used in my testing
were zram-based (I know, I should have mentioned it before :-/ ).  If I use
file-based disks instead, the test passes and I see no crashes in fio.

Cheers,
--
Luís


