Re: Test generic/299 stalling forever

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jens, ping?

Have you had a chance to take a look at the stack traces?  Is there
any other debugging information I can get for you?

Thanks,

					- Ted

On Thu, Oct 13, 2016 at 07:19:23PM -0400, Theodore Ts'o wrote:
> On Wed, Oct 12, 2016 at 08:39:57PM -0600, Jens Axboe wrote:
> > 
> > If I just wanted to invoke generic/299 with a setup similar to yours,
> > what should I do? I can attempt to reproduce it.
> 
> Well, if you're willing to get a Google Compute Engine account (it
> will come with a $300 free trial credit, which is plenty for this
> exercise), the simplest way to do an exact reproduction would be to
> follow the instructions here:
> 
> https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
> 
> If you use the current default gce-xfstests image, which is
> xfstests-201610131131 in the xfstests-cloud project, what I run which
> triggers it quite reliably is:
> 
> gce-xfstests -C 10 generic/299
> 
> On a run that I just did using a 4.8.0+ kernel, it ran 12 times (10
> times with the 4k configuration, and then 2 times with the 1k
> configuration) before it hung.
> 
> FSTESTIMG: xfstests-cloud/xfstests-201610131131
> FSTESTVER: e2fsprogs	v1.43.3-30-g8df85fb (Sun, 4 Sep 2016 21:32:35 -0400)
> FSTESTVER: fio		fio-2.14-27-gafd2cef (Wed, 12 Oct 2016 08:59:25 -0600)
> FSTESTVER: quota	81aca5c (Tue, 12 Jul 2016 16:15:45 +0200)
> FSTESTVER: xfsprogs	v4.5.0 (Tue, 15 Mar 2016 15:25:56 +1100)
> FSTESTVER: xfstests-bld	de1a881 (Fri, 30 Sep 2016 18:23:15 -0400)
> FSTESTVER: xfstests	linux-v3.8-1200-g218a62d (Thu, 29 Sep 2016 23:30:27 -0400)
> FSTESTVER: kernel	4.8.0-ext4-14234-gfb44543 #21 SMP Wed Oct 12 23:40:33 EDT 2016 x86_64
> 
> So on a system which was hung in this way, I ran "apt-get update ;
> apt-get install gdb", and uploaded the build tree (plus source code
> for the fio used to build the image) to /root/fio.
> 
> root@xfstests-201610131148:~# ps axm -o pid,tid,class,pri,psr,pcpu,stat,wchan:20,comm
>   PID   TID CLS PRI PSR %CPU STAT WCHAN                COMMAND
> 	....
> 19781     - -     -   -  1.6 -    -                    fio
>     - 19781 TS   19   0  1.6 Ssl  hrtimer_nanosleep    -
>     - 19800 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
>     - 19801 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
>     - 19802 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
>     - 19803 TS   19   0  0.0 Ssl  futex_wait_queue_me  -
> 	....
> root@xfstests-201610131148:~# gdb /root/fio/fio
> 	....
> (gdb) dir /root/fio
> Source directories searched: /root/fio:$cdir:$cwd
> (gdb) attach 19781
> Attaching to program: /root/fio/fio, process 19781
> 	....
> (gdb) thread apply all bt
> 
> Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
> #0  0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
> #2  0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
> #3  0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268
> #4  0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
> #5  0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65
> 
> (gdb) thread apply all bt full
> 
> Thread 5 (Thread 0x7fb3ddfed700 (LWP 19800)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
>         list = {next = 0x7fb3ddfeced0, prev = 0x7fb3ddfeced0}
>         td = 0x7fb3c7590610
>         io_u = 0x202fac0
>         ret = 0
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfed700) at pthread_create.c:309
>         __res = <optimized out>
>         pd = 0x7fb3ddfed700
>         now = <optimized out>
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500339456, -2782054375790187915, 0, 140410502103136, 140410120428088, 
>                 140410500339456, 2739195402349996661, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
>             data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         pagesize_m1 = <optimized out>
>         sp = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
> 
> Thread 4 (Thread 0x7fb3ddfe5700 (LWP 19801)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
>         list = {next = 0x7fb3ddfe4ed0, prev = 0x7fb3ddfe4ed0}
>         td = 0x7fb3c7590610
>         io_u = 0x203d280
>         ret = 0
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfe5700) at pthread_create.c:309
>         __res = <optimized out>
>         pd = 0x7fb3ddfe5700
>         now = <optimized out>
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500306688, -2782054375790187915, 0, 140410502103136, 140410120428088, 
>                 140410500306688, 2739195406644963957, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
>             data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         pagesize_m1 = <optimized out>
>         sp = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
> 
> Thread 3 (Thread 0x7fb3ddfdd700 (LWP 19802)):
> ---Type <return> to continue, or q <return> to quit---
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
>         list = {next = 0x7fb3ddfdced0, prev = 0x7fb3ddfdced0}
>         td = 0x7fb3c7590610
>         io_u = 0x203cf80
>         ret = 0
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfdd700) at pthread_create.c:309
>         __res = <optimized out>
>         pd = 0x7fb3ddfdd700
>         now = <optimized out>
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500273920, -2782054375790187915, 0, 140410502103136, 140410120428088, 
>                 140410500273920, 2739195393760062069, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
>             data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         pagesize_m1 = <optimized out>
>         sp = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
> 
> Thread 2 (Thread 0x7fb3ddfd5700 (LWP 19803)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> No locals.
> #1  0x000000000043987b in verify_async_thread (data=0x7fb3c7590610) at verify.c:1241
>         list = {next = 0x7fb3ddfd4ed0, prev = 0x7fb3ddfd4ed0}
>         td = 0x7fb3c7590610
>         io_u = 0x203ccc0
>         ret = 0
> #2  0x00007fb3dd85d0a4 in start_thread (arg=0x7fb3ddfd5700) at pthread_create.c:309
>         __res = <optimized out>
>         pd = 0x7fb3ddfd5700
>         now = <optimized out>
>         unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140410500241152, -2782054375790187915, 0, 140410502103136, 140410120428088, 
>                 140410500241152, 2739195398055029365, 2739194912765274741}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, 
>             data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>         not_first_call = <optimized out>
>         pagesize_m1 = <optimized out>
>         sp = <optimized out>
>         freesize = <optimized out>
>         __PRETTY_FUNCTION__ = "start_thread"
> #3  0x00007fb3dd38e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> No locals.
> 
> Thread 1 (Thread 0x7fb3de190740 (LWP 19781)):
> #0  0x00007fb3dd35ff2d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
> No locals.
> ---Type <return> to continue, or q <return> to quit---
> #1  0x00007fb3dd387fb4 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
>         ts = {tv_sec = 0, tv_nsec = 1000000}
> #2  0x000000000045b541 in thread_main (data=0x18fca000) at backend.c:1738
>         verify_bytes = 419209216
>         elapsed_us = {0, 0, 0}
>         td = 0x7fb3c7590610
>         sk_out = 0x2039600
>         clear_state = 33789440
> #3  0x000000000045d4d3 in run_threads (sk_out=sk_out@entry=0x0) at backend.c:2268
>         ret = <optimized out>
>         pid = <optimized out>
>         this_jobs = 6
>         fd = 0x2029720
>         map = {0x7fb3c754c000, 0x7fb3c7559ad0, 0x7fb3c75675a0, 0x7fb3c7575070, 0x7fb3c7582b40, 0x7fb3c7590610, 0x0 <repeats 11 times>, 
>           0x10000000, 0x0 <repeats 771 times>, 0x7fb3ddf83c1c <check_match+300>, 0x0, 0xd827524, 0x5, 0x7, 0x7fb3de192040, 
>           0x7fb3ddf8458e <do_lookup_x+2334>, 0x0, 0x7ffd24b8ea80, 0x7fb3dd2a9d28, 0x7fb3dd2af020, 0x7ffd24b8eb90, 0x3609d4, 
>           0x7ffd24b8eb80, 0x0, 0x0, 0x7fb3de1920c0, 0x7fb3de1939d8, 0x406851, 0x7fb3dd2b6d78, 0x401e18, 0x100000000, 0x100000375, 0x0, 
>           0x7fb3de1920c0, 0x7ffd24b8ec20, 0x7fb3de19d500, 0x7ffd24b8ec48, 0x7fb3de19d1a8, 0x1, 
>           0x7fb3ddf8473d <_dl_lookup_symbol_x+349>, 0x0, 0x7fb3de1920c0, 0x1, 0x0, 0x1, 0x7fb3de19d1a8, 0x0, 0x0, 0x0, 0x0, 0x0, 
>           0x7fb3de19d500, 0x7ffd24b8eb90, 0x7ffd24b8eb80, 0xd827524, 0x406851, 0xffffffff, 0x0, 0x7fb3dd2af020, 0x7fb3de1939d8, 
>           0x0 <repeats 11 times>, 0x698620 <memcmp@xxxxxxx>, 0x0, 0x7ffd24b8ecf8, 0x7ffd24b8ecf8, 0x3, 0x7ffd24b8f6f0, 
>           0x7fb3ddf88ecc <_dl_fixup+508>, 0x9000000000001, 0x0, 0x8ff5c28f5c28f, 0x7fb3dd2af020, 0x140, 0x7ffd24b8f6f0, 
>           0x7ffd24b91720, 0x7fb3ddf8f2e5 <_dl_runtime_resolve+53>, 0x0, 0x20262a0, 0x2bc4, 0x7ffd24b92530, 0x7ffd24b8f960, 
>           0x8ff5c28f5c200, 0x2026270, 0x7fb3de19d1a8, 0x54, 0x46144b <fio_test_cconv+3595>, 0x0, 0x7ffd24b8ece0, 
>           0x0 <repeats 16 times>, 0x2026210, 0x0, 0x2026270, 0x0, 0x1, 0x40000000400, 0x1, 0x0, 0x1, 0x100000001, 0x100000000, 0x0, 
>           0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x100000001000, 0x1000, 0x0 <repeats 14 times>, 0x1, 0x0, 0x100000000, 0x100000001, 0x0, 0x0, 
>           0x0, 0x100000001, 0x400, 0x0 <repeats 87 times>, 0x100000000, 0x100000000, 0x0, 0x89, 0x0 <repeats 20 times>, 0x6400000000, 
>           0x6400000064, 0x200000, 0x0, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1f400000000, 0x1000001f4...}
>         this_start = {tv_sec = 0, tv_usec = 0}
>         left = <optimized out>
>         td = 0x7fb3c7590610
>         i = 5
>         todo = 6
>         nr_running = 0
>         m_rate = 0
>         t_rate = 0
>         nr_started = 6
>         spent = <optimized out>
> #4  0x000000000045d80d in fio_backend (sk_out=sk_out@entry=0x0) at backend.c:2400
>         td = <optimized out>
>         i = <optimized out>
> #5  0x000000000040cbc8 in main (argc=2, argv=0x7ffd24b952e8, envp=<optimized out>) at fio.c:65
>         ret = 1
> 
> > > As I said, I can attach to the hanging fio using a gdb and give you
> > > stackdumps for all of the threads if that would be helpful.
> > 
> > That would definitely be helpful as well, especially if I cannot
> > reproduce it.
> 
> Here you go!
> 
> If I kill -9 the fio process, the tests continue to run, and there are
> no kernel complaints, so I'm pretty sure this is a fio issue.
> 
> Many thanks!!
> 
>    	  	      	     	    	 - Ted
> 
> P.S.  In answer to your question from another e-mail message on this
> thread, the block device in volved is a Google Compute Engine
> Persistent Disk device backed using an SSD.  To a very *rough*
> approximation, think about something kinda like a qemu qcow image on
> an very fast flash device in terms of performance.  I've seen
> the problem before on a 5400 RPM laptop drive, but it's ***much***
> rarer.   It reproduces much more easily faster block devices.
> 
> P.P.S.   And here's the fio job file that it was running:
> 
> ###########
> # 299 test fio activity
> # Filenames derived from jobsname and jobid like follows:
> # ..
> [global]
> ioengine=libaio
> bs=128k
> directory=/xt-vdc
> filesize=5368709120
> size=999G
> iodepth=128*1
> continue_on_error=write
> ignore_error=,ENOSPC
> error_dump=0
> create_on_open=1
> fallocate=none
> exitall=1
> 
> ## Perform direct aio, to files which may be truncated
> ## by external task
> [direct_aio]
> direct=1
> buffered=0
> numjobs=4
> rw=randwrite
> runtime=100*1
> time_based
> 
> # Perform direct aio and verify data
> # This test case should check use-after-free issues
> [aio-dio-verifier]
> numjobs=1
> verify=crc32c-intel
> verify_fatal=1
> verify_dump=1
> verify_backlog=1024
> verify_async=4
> verifysort=1
> direct=1
> bs=4k
> rw=randrw
> filename=aio-dio-verifier
> 
> # Perform buffered aio and verify data
> # This test case should check use-after-free issues
> [buffered-aio-verifier]
> numjobs=1
> verify=crc32c-intel
> verify_fatal=1
> verify_dump=1
> verify_backlog=1024
> verify_async=4
> verifysort=1
> direct=0
> buffered=1
> bs=4k
> rw=randrw
> filename=buffered-aio-verifier
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux