Fwd: tgt reset on simple simulator

osishkin osishkin <osishkin@xxxxxxxxx> · Thu, 13 Jun 2013 11:55:48 +0300

Hi,

My name is Aviad, and I'm a student at Tel Aviv University.
I've been working on a simple SSD simulator with tgt. The basic idea
is that data is saved in RAM.
The "disk" itself is multi-threaded, with one thread handling
requests and spreading them among the other threads (each representing
a NAND flash chip).
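
To make the layout concrete, here is a minimal sketch of how a
dispatcher could hand fixed-size pages to per-chip worker threads. It
is not the simulator's actual code: NUM_CHIPS, struct chip,
chip_for_page and the modulo striping are all assumptions made for
illustration, and the queue is a plain mutex/condition-variable list.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_CHIPS 4            /* hypothetical number of simulated chips */

struct page_req {              /* one page-sized sub-request */
    uint64_t page;
    struct page_req *next;
};

struct chip {                  /* per-chip FIFO, protected by lock */
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
    struct page_req *head, *tail;
    int stop;                  /* set to ask the worker to exit */
    unsigned served;           /* pages serviced so far */
};

/* Assumed striping policy: consecutive pages go to consecutive chips. */
static unsigned chip_for_page(uint64_t page)
{
    return (unsigned)(page % NUM_CHIPS);
}

static void chip_init(struct chip *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->nonempty, NULL);
    c->head = c->tail = NULL;
    c->stop = 0;
    c->served = 0;
}

/* Dispatcher side: append a request and wake the chip's worker. */
static void chip_submit(struct chip *c, struct page_req *r)
{
    assert(r != NULL);
    r->next = NULL;
    pthread_mutex_lock(&c->lock);
    if (c->tail)
        c->tail->next = r;
    else
        c->head = r;
    c->tail = r;
    pthread_cond_signal(&c->nonempty);
    pthread_mutex_unlock(&c->lock);
}

/* Ask the worker to exit once its queue has drained. */
static void chip_stop(struct chip *c)
{
    pthread_mutex_lock(&c->lock);
    c->stop = 1;
    pthread_cond_broadcast(&c->nonempty);
    pthread_mutex_unlock(&c->lock);
}

/* Worker side: sleep until there is work, service it outside the lock. */
static void *chip_thread(void *arg)
{
    struct chip *c = arg;

    pthread_mutex_lock(&c->lock);
    for (;;) {
        while (!c->head && !c->stop)
            pthread_cond_wait(&c->nonempty, &c->lock);
        if (!c->head)          /* stop requested and queue is empty */
            break;
        struct page_req *r = c->head;
        c->head = r->next;
        if (!c->head)
            c->tail = NULL;
        pthread_mutex_unlock(&c->lock);
        /* The real code would copy the page to/from RAM here. */
        free(r);
        pthread_mutex_lock(&c->lock);
        c->served++;
    }
    pthread_mutex_unlock(&c->lock);
    return NULL;
}
```

In this sketch an idle worker sits in pthread_cond_wait, so a
backtrace taken while no work is queued would show every chip thread
blocked there.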
I'm using tgt-1.0.1 on a multi-core machine (128 cores) with kernel
3.2. So things should be pretty parallelized.
Some minor implementation details: every scsi_cmd request is
"translated" into multiple small 4KB requests to my code, which
services them and responds. Once all parts of the scsi_cmd have been
serviced, I terminate the scsi_cmd.
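
The splitting scheme above can be sketched roughly as follows. This is
only an illustration under stated assumptions: struct parent_cmd,
count_chunks, parent_init and chunk_done are hypothetical names, not
tgt APIs and not the simulator's real code, and completion is tracked
with a C11 atomic counter so that any worker thread may finish the
last part.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CHUNK_SIZE 4096u       /* 4KB sub-request size */

/* Hypothetical bookkeeping kept alongside each scsi_cmd. */
struct parent_cmd {
    atomic_uint pending;       /* sub-requests not yet serviced */
    int completed;             /* set once, by the last part */
};

/* Number of 4KB pages touched by the byte range [offset, offset+length). */
static unsigned count_chunks(uint64_t offset, uint32_t length)
{
    if (length == 0)
        return 0;
    uint64_t first = offset / CHUNK_SIZE;
    uint64_t last = (offset + length - 1) / CHUNK_SIZE;
    return (unsigned)(last - first + 1);
}

/* Arm the counter before any sub-request is handed to a worker.
 * A command with nparts == 0 has nothing to wait for and should be
 * completed by the caller directly. */
static void parent_init(struct parent_cmd *p, unsigned nparts)
{
    atomic_init(&p->pending, nparts);
    p->completed = 0;
}

/* Called by whichever worker finishes a part.
 * Returns 1 only for the call that finishes the last part. */
static int chunk_done(struct parent_cmd *p)
{
    unsigned prev = atomic_fetch_sub(&p->pending, 1); /* old value */
    assert(prev > 0);          /* more completions than parts is a bug */
    if (prev == 1) {
        p->completed = 1;      /* real code would finish the scsi_cmd here */
        return 1;
    }
    return 0;
}
```

For example, an 8KB request starting at byte offset 6144 touches pages
1 through 3, so count_chunks(6144, 8192) gives 3 and only the third
chunk_done call reports completion.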
Things have been working well, I've been able to use dd to issue
requests, mkfs and mount a file system (ext3) on top of it.
But whenever I run file system benchmarks, which generate more
sophisticated workloads, at some point I get resets in the tgt code
and requests are denied. tgt simply hangs for some time (from several
seconds up to 60s or so, in any case longer than the ext3 flush
interval), and most requests get lost or are only processed after a
very long time.

Meanwhile, my part of the code does nothing; it just waits for
incoming requests to arrive (with no requests in its own internal
queues). So I suspect the problem is not there. When I run with
debugging printouts I get messages like the ones below, which suggest
a timeout has been exceeded (from what I've read here). But I don't
understand how this could happen, since I am not servicing any
requests in the meantime!

tgtd: iscsi_noop_out_rx_start(1607) ffffffff 5e 0
tgtd: iscsi_task_queue(1514) 88d1 88d1 40
tgtd: iscsi_task_tx_start(1860) found a task 0 4294967295 0 0
tgtd: iscsi_task_tx_start(1885) no more data
tgtd: iscsi_noop_out_rx_start(1607) ffffffff 8 0
tgtd: iscsi_task_queue(1514) 88d1 88d1 40
tgtd: iscsi_task_tx_start(1860) found a task 0 4294967295 0 0
tgtd: iscsi_task_tx_start(1885) no more data
tgtd: iscsi_noop_out_rx_start(1607) ffffffff c 0
tgtd: iscsi_task_queue(1514) 88d1 88d1 40
tgtd: iscsi_task_tx_start(1860) found a task 0 4294967295 0 0
tgtd: iscsi_task_tx_start(1885) no more data
tgtd: iscsi_task_queue(1514) 88d1 88d1 42
tgtd: abort_task_set(1008) found 14 0
tgtd: iscsi_task_tx_start(1860) found a task 0 855638016 0 0
tgtd: iscsi_task_tx_start(1885) no more data
tgtd: iscsi_task_queue(1514) 88d1 88d1 42
tgtd: abort_task_set(1008) found 0 0
tgtd: abort_cmd(984) found 33 6

When I run tgt under gdb and check the status of the threads while tgt
hangs, I get this backtrace from the tgt threads:

Thread 6 (Thread 0x7ffff5631710 (LWP 66238)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004215e7 in bs_thread_worker_fn (arg=<value optimized
out>) at bs.c:196
#2  0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#3  0x00007ffff771c69d in clone () from /lib/tls/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 5 (Thread 0x7ffff5e32710 (LWP 66237)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004215e7 in bs_thread_worker_fn (arg=<value optimized
out>) at bs.c:196
#2  0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#3  0x00007ffff771c69d in clone () from /lib/tls/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 4 (Thread 0x7ffff6633710 (LWP 66236)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004215e7 in bs_thread_worker_fn (arg=<value optimized
out>) at bs.c:196
#2  0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#3  0x00007ffff771c69d in clone () from /lib/tls/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 3 (Thread 0x7ffff6e34710 (LWP 66235)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004215e7 in bs_thread_worker_fn (arg=<value optimized
out>) at bs.c:196
#2  0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#3  0x00007ffff771c69d in clone () from /lib/tls/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7ffff7635710 (LWP 66234)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000004217b5 in bs_thread_ack_fn (arg=<value optimized out>)
at bs.c:89
#2  0x00007ffff7bc69ca in start_thread (arg=<value optimized out>) at
pthread_create.c:300
#3  0x00007ffff771c69d in clone () from /lib/tls/libc.so.6
#4  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7ffff7fb1700 (LWP 66220)):
#0  0x00007ffff771cc93 in epoll_wait () from /lib/tls/libc.so.6
#1  0x000000000040fde0 in event_loop () at tgtd.c:263
#2  0x0000000000410309 in main (argc=<value optimized out>,
argv=<value optimized out>) at tgtd.c:438

Any idea what might be going on? I'm at a loss here.
I even tried to run the file system (ext3) in sync mode, to lower the
stress on tgt. Did not help at all.
Using tgt 1.0.36 did not resolve this either. Same reset problem.

I'm a newbie to tgt, so it is possible I'm missing something. I'd
appreciate your help.

Thank you