On Mon, 2011-04-04 at 14:47 +0100, Richard Kennedy wrote:
> On Thu, 2011-03-31 at 15:49 +0100, Richard Kennedy wrote:
> > On Thu, 2011-03-31 at 15:33 +0200, Jens Axboe wrote:
> >[...]
> > > >>> Hi Jens,
> > > >>>
> > > >>> I'm seeing a problem with fio never completing when writing to 2 disks
> > > >>> simultaneously. In my test case I'm writing 2Gb to both a LVM volume & a
> > > >>> pata drive on x86_64 on a AMD X2. Could this be a related issue?
> > > >>>
> > > >>> I'm not getting anything reported in the log, lockup detection doesn't
> > > >>> report anything either. The write seems to have finished (the disk light
> > > >>> activity has stopped) and the cpu cores are both below 10% usage, but
> > > >>> fio never returns. The test does complete some times, but it seems to be
> > > >>> one 1 in 4.
> > > >>
> > > >> So when you say PATA, it's /dev/hdaX something as well?
> > > >>
> > > >>> I'm going to try tracing it and see if I can spot where it's stuck.
> > > >>
> > > >> Thanks, that would be nice.
> > > >>
> > > > The second drive is /dev/sdb1 mounted on /opt, both file systems are
> > > > ext4.
> > >
> > > So probably not related. What does the fio job look like?
> > >
> >
> > fio job file --
> > [global]
> > pre_read=1
> > ioengine=mmap
> >
> > [f1]
> > size=2g
> > rw=write
> > directory=/home/tests
> >
> > [f2]
> > size=2g
> > rw=write
> > directory=/opt/tests
> >
> > Fio gets run from a script that also collects stats but it's been
> > running without any problems up until 2.6.39-rc1.
> >
> Hi Jens
> I've upgrade to the latest fio version in the git repo 1.51 and I'm
> still seeing this problem.
>
> Fio gets stuck after it writes the 100% complete message and strace on
> the processes shows this.
>
> the controlling fio process :-
> ...
> [pid 8439] wait4(8442, 0x7fff848203ac, WNOHANG, NULL) = 0
> [pid 8439] nanosleep({0, 10000000}, NULL) = 0
> [pid 8439] wait4(8441, 0x7fff848203ac, WNOHANG, NULL) = 0
> [pid 8439] wait4(8442, 0x7fff848203ac, WNOHANG, NULL) = 0
> [pid 8439] nanosleep({0, 10000000}
>
> & the 2 workers are both stopped here, strace shows only the one line
> for each process.
>
> Process 8441 attached - interrupt to quit
> futex(0x7f9db76a802c, FUTEX_WAIT_PRIVATE, 2, NULL
>
>
> Process 8442 attached - interrupt to quit
> futex(0x7f9db76a802c, FUTEX_WAIT_PRIVATE, 2, NULL
>
> How do I find out which futex it's waiting for?
> Any ideas where I should look next ?
>
> I can run the same test successfully on 2.6.38 so is it worth trying to
> bisect this ?
>
> thanks
> Richard
>

My problem has gone away in v2.6.39-rc3. I've just finished bisecting it
down to 6de9843dab3f, and that commit was reverted in rc3, so no problem ;)

(The data corruption caused by that faulty commit was zeroing out the
shared mutexes in fio, and the worker threads were getting stuck on the
writeout_mutex.)

regards
Richard
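
For anyone chasing a similar hang: the strace output quoted above shows the
controlling process polling its workers with wait4(..., WNOHANG) followed by
a 10 ms nanosleep, which loops forever if a worker never exits. A minimal
Python sketch of that polling pattern, with a timeout added so stuck workers
get reported, might look like the following. This is not fio's code;
reap_workers and spawn are hypothetical names, and the timeout is an
assumption chosen for illustration.

```python
import os
import time


def reap_workers(pids, poll_interval=0.01, timeout=5.0):
    """Poll child pids without blocking until all exit or the timeout hits.

    Returns (exited, stuck): exited maps pid -> exit code, stuck is the set
    of pids that were still running at the deadline. Hypothetical helper,
    modelled on the wait4(WNOHANG) + nanosleep loop seen in the strace.
    """
    pending = set(pids)
    exited = {}
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for pid in list(pending):
            done, status = os.waitpid(pid, os.WNOHANG)
            if done == pid:  # (0, 0) means the child is still running
                if os.WIFEXITED(status):
                    exited[pid] = os.WEXITSTATUS(status)
                else:
                    exited[pid] = -os.WTERMSIG(status)
                pending.discard(pid)
        if pending:
            time.sleep(poll_interval)  # 10 ms, as in the trace
    return exited, pending


def spawn(seconds):
    """Fork a child that sleeps for `seconds` and exits 0 (stand-in worker)."""
    pid = os.fork()
    if pid == 0:
        time.sleep(seconds)
        os._exit(0)
    return pid
```

Any pid left in the second return value is a candidate for attaching strace
or a debugger, which is how the futex wait above was spotted.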