RE: fio main thread got stuck over the weekend

> it's just 18.9 minutes. Total runtime is summed after the fact. The one
> I think is interesting is the first thread, that is the one that should
> be printing eta stats. And it seems to be stuck:
> 
>  >    212 Thread 0x7fa9a086e700 (LWP 6509)  0x0000003974c0b98e in
> pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> 
> can you do a bt on that? This looks like it _might_ be an issue that was
> fixed recently with stats. Are you running fio with any output or status
> options?
> 

No output options.

(gdb) thread 212
[Switching to thread 212 (Thread 0x7fa9a086e700 (LWP 6509))]#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003974c0b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000449381 in helper_thread_main (data=<value optimized out>) at backend.c:2127
#2  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#3  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 1
(gdb) info local
ts = {tv_sec = 1418658662, tv_nsec = 980074000}
tv = {tv_sec = 1418658662, tv_usec = 730074}
ret = <value optimized out>
(gdb) print helper_exit
$36 = 0
(gdb) print nr_thread
$37 = 224
(gdb) print nr_process
$38 = 0
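
Incidentally, the locals in frame 1 suggest the helper thread is waking up periodically rather than hanging: ts is exactly tv plus 250 ms (730074 us + 250 ms = 980074000 ns). Below is a minimal sketch of how such an absolute deadline is typically built for pthread_cond_timedwait(). abs_timeout() is a made-up name for illustration, not fio's code, and the 250 ms interval is only inferred from the values above.

```c
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper: convert a gettimeofday() reading plus a relative
 * timeout in milliseconds into the absolute timespec that
 * pthread_cond_timedwait() expects.
 */
static struct timespec abs_timeout(struct timeval now, long msec)
{
        struct timespec ts;

        ts.tv_sec = now.tv_sec + msec / 1000;
        ts.tv_nsec = now.tv_usec * 1000L + (msec % 1000) * 1000000L;

        /* Carry into seconds if the nanosecond field overflowed */
        if (ts.tv_nsec >= 1000000000L) {
                ts.tv_sec++;
                ts.tv_nsec -= 1000000000L;
        }
        return ts;
}
```

Feeding it the tv shown above with msec = 250 reproduces the ts shown above, so thread 212 looks like it is just sitting in its normal timed wait.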

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fa9a99b1720 (LWP 6508))]#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
(gdb) bt
#0  0x00000039748acced in nanosleep () from /lib64/libc.so.6
#1  0x00000039748e1e64 in usleep () from /lib64/libc.so.6
#2  0x000000000044d019 in do_usleep () at backend.c:1841
#3  run_threads () at backend.c:2083
#4  0x000000000044d8ab in fio_backend () at backend.c:2199
#5  0x000000397481ed1d in __libc_start_main () from /lib64/libc.so.6
#6  0x000000000040a199 in _start ()
(gdb) select-frame 3
(gdb) info local
td = <value optimized out>
i = <value optimized out>
todo = <value optimized out>
nr_running = 210
m_rate = 0
t_rate = 0
nr_started = <value optimized out>
spent = <value optimized out>
(gdb) print thread_number
$39 = 224

So run_threads() is probably in this loop, waiting for the remaining 210 threads to exit:
        while (nr_running) {
                reap_threads(&nr_running, &t_rate, &m_rate);
                do_usleep(10000);
        }
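
To illustrate why that loop can spin forever, here is a simplified sketch of a reap pass. struct worker, the state names, and reap_workers() are all hypothetical stand-ins, much simpler than fio's real reap_threads(); the only point is that the counter drops solely when a worker has actually exited.

```c
#include <stddef.h>

/* Hypothetical per-thread state, far simpler than fio's thread_data */
enum run_state { ST_RUNNING, ST_EXITED, ST_REAPED };

struct worker {
        enum run_state state;
};

/*
 * Collect workers that have exited and decrement *nr_running once for
 * each of them. Workers still in ST_RUNNING are left alone, so the
 * caller's "while (nr_running)" loop keeps polling.
 */
static void reap_workers(struct worker *w, size_t n, int *nr_running)
{
        for (size_t i = 0; i < n; i++) {
                if (w[i].state != ST_EXITED)
                        continue;
                w[i].state = ST_REAPED;
                (*nr_running)--;
        }
}
```

With 224 workers of which 14 have exited, one pass leaves nr_running at 210, and every later pass leaves it there until another worker exits on its own.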

Given:

(gdb) thread 2
[Switching to thread 2 (Thread 0x7fa92bf87700 (LWP 6733))]#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
(gdb) print td->error
No symbol "td" in current context.
(gdb) bt
#0  0x0000003657600667 in io_submit () from /lib64/libaio.so.1
#1  0x0000000000457058 in fio_libaio_commit (td=0x7fa9a0dd1860) at engines/libaio.c:255
#2  0x000000000040b395 in td_io_commit (td=0x7fa9a0dd1860) at ioengines.c:396
#3  0x000000000040bea1 in td_io_queue (td=0x7fa9a0dd1860, io_u=0x7fa8e4015400) at ioengines.c:343
#4  0x000000000044a75d in do_io (td=0x7fa9a0dd1860) at backend.c:792
#5  0x000000000044c209 in thread_main (data=0x7fa9a0dd1860) at backend.c:1504
#6  0x0000003974c079d1 in start_thread () from /lib64/libpthread.so.0
#7  0x00000039748e8b7d in clone () from /lib64/libc.so.6
(gdb) select-frame 5
(gdb) print td->error
$40 = 0
(gdb) print td->terminate
$41 = 0
(gdb) print td->done
$42 = 0
(gdb) print td->o.time_based
$43 = 1
(gdb) print o->create_only
$47 = 0

I guess the threads are not leaving this loop in thread_main():
        while (keep_running(td)) {
        ...
        }

static int keep_running(struct thread_data *td)
{
        unsigned long long limit;

        if (td->done)
                return 0;
        if (td->o.time_based)
                return 1;
...

After the runtime has elapsed, should td->done or td->terminate
get set?
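
To make the question concrete, here is a rough sketch of what I would expect, under my own assumptions. struct job, runtime_exceeded(), and io_step() are hypothetical and are not fio's implementation. A time_based job only stops if something inside the I/O loop notices the elapsed time and sets the flags, because keep_running() by itself returns 1 for such a job.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for fio's struct thread_data */
struct job {
        bool done;
        bool terminate;
        bool time_based;
        unsigned long long runtime_usec;   /* configured runtime */
        unsigned long long elapsed_usec;   /* time spent so far */
};

static bool runtime_exceeded(const struct job *j)
{
        return j->runtime_usec && j->elapsed_usec >= j->runtime_usec;
}

/*
 * Same shape as the keep_running() excerpt: a time_based job keeps
 * going until done is set. The size-limit checks elided in the excerpt
 * are reduced to "stop" here.
 */
static bool keep_running(const struct job *j)
{
        if (j->done)
                return false;
        if (j->time_based)
                return true;
        return false;
}

/*
 * One pass of the I/O loop body. The sketch assumes the runtime check
 * lives here. Returns false when the loop should break.
 */
static bool io_step(struct job *j, unsigned long long step_usec)
{
        j->elapsed_usec += step_usec;
        if (runtime_exceeded(j)) {
                j->terminate = true;
                j->done = true;
                return false;
        }
        return true;
}
```

In this sketch, if the elapsed time the job sees never crosses the limit, nothing sets done or terminate and the loop never exits, which would match the values printed above.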


The jobfile is:
[global]
direct=1
ioengine=libaio
norandommap
randrepeat=0
bs=512
iodepth=96
numjobs=1
numjobs=14
runtime=216000
time_based=1
group_reporting
thread
gtod_reduce=1
iodepth_batch=14
iodepth_batch_complete=14
iodepth_low=14
userspace_reap=1
cpus_allowed=0-13
cpus_allowed_policy=split
rw=randread

[drive_b]
filename=/dev/sdb

[drive_c]
filename=/dev/sdc

[drive_d]
filename=/dev/sdd

[drive_e]
filename=/dev/sde

[drive_f]
filename=/dev/sdf

[drive_g]
filename=/dev/sdg

[drive_h]
filename=/dev/sdh

[drive_i]
filename=/dev/sdi

[drive_j]
filename=/dev/sdj

[drive_k]
filename=/dev/sdk

[drive_l]
filename=/dev/sdl

[drive_m]
filename=/dev/sdm

[drive_n]
filename=/dev/sdn

[drive_o]
filename=/dev/sdo

[drive_p]
filename=/dev/sdp

[drive_q]
filename=/dev/sdq


