Good morning,

The test ran over the weekend and did not hang once. It is currently at
550 iterations without hitting the hang, so I would like to say that
this corrected the issue.

Thank you for your help!

Ryan

On Fri, Aug 26, 2016 at 4:09 PM, Ryan Stiles <ras@xxxxxxxxx> wrote:
> Thanks Jens,
>
> I have the change built. I will leave the test running over the
> weekend to see if it reproduces. It was not very common before, but I
> did hit it twice in 2 days, or once every 500 sessions or so.
>
> Ryan
>
> On Thu, Aug 25, 2016 at 9:37 PM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>> On 08/25/2016 10:16 AM, Ryan Stiles wrote:
>>>
>>> Good day,
>>>
>>> I am running fio version 2.12 compiled with the noshmem option, and
>>> occasionally the fio processes get stuck. I have seen this twice in
>>> the last 2 days:
>>>
>>> /tmp/fio-2.12-noshmem --version
>>> fio-2.12
>>>
>>> fio command:
>>>
>>> /tmp/fio-2.12-noshmem
>>> /var/tmp/fio.23959cf869a111e69bb800269eb5da06.cfg --output-format=json
>>> --size=10g --status-interval=10
>>>
>>> fio config file:
>>>
>>> [global]
>>> ioengine=libaio
>>> direct=1
>>> loops=1
>>> ramp_time=50
>>> runtime=300
>>> randrepeat=0
>>> group_reporting
>>> time_based=1
>>> filename=/dev/disk/by-uuid/069e5a48-1a50-437f-bbc0-4fa612565378
>>> filename=/dev/disk/by-uuid/0ed13d1e-3544-49fb-a757-143d186810c3
>>> filename=/dev/disk/by-uuid/fed1ec42-c168-45d4-b8f6-da54ccb4fdcd
>>> filename=/dev/disk/by-uuid/60f5d8f9-3014-40ee-bdef-9d0e87742fd9
>>> filename=/dev/disk/by-uuid/dcbf8aac-f39e-45d1-9b5a-1a62aed3fb8e
>>> filename=/dev/disk/by-uuid/3cbbd8dc-8efa-48df-80ba-f0a6375c12ee
>>> filename=/dev/disk/by-uuid/b960a7a4-c1b1-47ba-9883-fc9397fae0dc
>>> filename=/dev/disk/by-uuid/ea0f60d1-bd1d-4741-8529-cc816aad4632
>>> filename=/dev/disk/by-uuid/1f401ec1-4f47-49b1-9cf3-f38953714107
>>> filename=/dev/disk/by-uuid/0c7644d7-5f3d-405d-ae74-976f7d6ea4ff
>>> filename=/dev/disk/by-uuid/6e7dbb4c-5e01-436d-a053-972a59ba39ff
>>> filename=/dev/disk/by-uuid/4addf688-8b78-4681-88fb-a64a8649eba0
>>> filename=/dev/disk/by-uuid/2b7e8012-0e00-4a95-be3d-fb3b06b2c9f6
>>> filename=/dev/disk/by-uuid/7844d521-46aa-4a12-a69e-af0020370f3d
>>> filename=/dev/disk/by-uuid/a0f1719c-7638-4c9a-96f4-76080c5ef3a3
>>> filename=/dev/disk/by-uuid/ea44c274-9eb2-4061-af99-b45cdd87b59a
>>>
>>> [30%_write_4k_bs_bandwidth]
>>> rw=randrw
>>> bs=4k
>>> rwmixread=70
>>> randrepeat=0
>>> numjobs=8
>>> iodepth=4096
>>>
>>> When I attach to one of the processes (I am running 8, and they all
>>> show the same thing):
>>>
>>> Process 826 attached
>>> futex(0x7f110e43802c, FUTEX_WAIT, 11, NULL
>>>
>>> I did some searching and found this link, which describes a similar
>>> issue; I couldn't tell which version it involved (it is from 2014),
>>> and it says the problem was fixed for the people involved:
>>>
>>> http://www.spinics.net/lists/fio/msg03558.html
>>>
>>> This is a much more recent link with an issue:
>>>
>>> https://github.com/axboe/fio/issues/52
>>>
>>> But when I look at the code, I see the comment you put in for that
>>> issue (I am assuming):
>>>
>>> clear_state = 1;
>>>
>>> /*
>>>  * Make sure we've successfully updated the rusage stats
>>>  * before waiting on the stat mutex. Otherwise we could have
>>>  * the stat thread holding stat mutex and waiting for
>>>  * the rusage_sem, which would never get upped because
>>>  * this thread is waiting for the stat mutex.
>>>  */
>>> check_update_rusage(td);
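
The race that comment guards against can be reduced to a small
standalone program. The sketch below uses plain POSIX primitives in
place of fio's fio_mutex and rusage_sem; the names and the stripped-down
structure are illustrative assumptions, not fio's actual code.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static pthread_mutex_t stat_mutex = PTHREAD_MUTEX_INITIALIZER;
static sem_t rusage_sem;
static volatile int update_rusage;	/* stat thread's request flag */

/* Worker-side helper: service a pending rusage request, if any. */
static void check_update_rusage(void)
{
	if (update_rusage) {
		update_rusage = 0;
		/* ...refresh the per-thread rusage stats here... */
		sem_post(&rusage_sem);
	}
}

static void *stat_thread(void *arg)
{
	pthread_mutex_lock(&stat_mutex);	/* hold stat_mutex... */
	update_rusage = 1;			/* request fresh rusage */
	sem_wait(&rusage_sem);			/* ...while waiting for the worker */
	pthread_mutex_unlock(&stat_mutex);
	return NULL;
}

static void *worker_thread(void *arg)
{
	check_update_rusage();			/* may run before the flag is set */
	pthread_mutex_lock(&stat_mutex);	/* then blocks: stat thread owns it */
	pthread_mutex_unlock(&stat_mutex);
	return NULL;
}

int main(void)
{
	pthread_t s, w;

	sem_init(&rusage_sem, 0, 0);
	pthread_create(&s, NULL, stat_thread, NULL);
	pthread_create(&w, NULL, worker_thread, NULL);
	pthread_join(s, NULL);			/* hangs whenever the race is hit */
	pthread_join(w, NULL);
	printf("race not hit on this run\n");
	return 0;
}

If the worker's check_update_rusage() runs before the stat thread sets
the flag, the stat thread sits on stat_mutex waiting for rusage_sem,
which only the worker can post -- and the worker is stuck waiting for
stat_mutex. That circular wait matches the two backtraces below.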
>>> All of the threads are at this pthread_cond_wait:
>>>
>>> Thread 8 (Thread 0x7f10ede69700 (LWP 826)):
>>> #0  0x00007f110d0ce6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>> #1  0x0000000000443f49 in fio_mutex_down (mutex=0x7f110e438000) at
>>> mutex.c:213
>>> #2  0x000000000045b73b in thread_main (data=<optimized out>) at
>>> backend.c:1689
>>> #3  0x00007f110d0cadc5 in start_thread () from /lib64/libpthread.so.0
>>> #4  0x00007f110cbf3ced in clone () from /lib64/libc.so.6
>>>
>>> But one of the threads is in a slightly different pthread_cond_wait:
>>>
>>> Thread 9 (Thread 0x7f10ed668700 (LWP 785)):
>>> #0  0x00007f110d0ce6d5 in pthread_cond_wait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>> #1  0x0000000000443f49 in fio_mutex_down (mutex=0x7f110e436000) at
>>> mutex.c:213
>>> #2  0x000000000042dbfe in __show_running_run_stats () at stat.c:1762
>>> #3  0x0000000000467ba9 in helper_thread_main (data=0x7f110b4802f0) at
>>> helper_thread.c:122
>>> #4  0x00007f110d0cadc5 in start_thread () from /lib64/libpthread.so.0
>>> #5  0x00007f110cbf3ced in clone () from /lib64/libc.so.6
>>>
>>> It looks like that mutex isn't getting woken up by thread 9 for some
>>> reason.
>>>
>>> Any idea what could be causing this?
>>
>> Hmm, can you try the patch below?
>>
>> diff --git a/backend.c b/backend.c
>> index bb0200bb2b0f..31653f4f94c1 100644
>> --- a/backend.c
>> +++ b/backend.c
>> @@ -1684,9 +1684,13 @@ static void *thread_main(void *data)
>>  	 * the rusage_sem, which would never get upped because
>>  	 * this thread is waiting for the stat mutex.
>>  	 */
>> -	check_update_rusage(td);
>> +	do {
>> +		check_update_rusage(td);
>> +		if (!fio_mutex_down_trylock(stat_mutex))
>> +			break;
>> +		usleep(1000);
>> +	} while (1);
>>
>> -	fio_mutex_down(stat_mutex);
>>  	if (td_read(td) && td->io_bytes[DDIR_READ])
>>  		update_runtime(td, elapsed_us, DDIR_READ);
>>  	if (td_write(td) && td->io_bytes[DDIR_WRITE])
>>
>> --
>> Jens Axboe
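
Jens's patch breaks that cycle by keeping the worker responsive while it
contends for the stat mutex: instead of blocking in fio_mutex_down(), it
polls -- service any pending rusage request, attempt a non-blocking
trylock, and sleep a millisecond when the mutex is busy. Translated to
the toy worker above (POSIX names again; a sketch of the pattern, not
fio's code):

#include <unistd.h>

static void worker_take_stat_mutex(void)
{
	do {
		/*
		 * Service any pending rusage request first, so the stat
		 * thread can never be left waiting on rusage_sem while
		 * we wait on stat_mutex.
		 */
		check_update_rusage();
		if (pthread_mutex_trylock(&stat_mutex) == 0)
			break;		/* acquired the mutex */
		usleep(1000);		/* busy: back off 1 ms and retry */
	} while (1);
}

The 1 ms sleep bounds the cost of the polling loop while keeping the
worker's response to a pending stat request prompt, which lines up with
Ryan's result at the top of the thread: 550 iterations without a hang,
versus roughly one hang per 500 sessions before.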