Re: Test generic/299 stalling forever

On 10/21/2016 04:15 PM, Theodore Ts'o wrote:
On Thu, Oct 20, 2016 at 08:22:00AM -0600, Jens Axboe wrote:
So what's happening is that generic/299 is looping in the
fallocate/truncate loop until fio exits, but since fio never exits,
it ends up looping forever.

I'm setting up the GCE now, I've had the tests running for about 24h now
on another test box and haven't been able to trigger any hangs. I'll
match your setup as closely as I can, hopefully that'll work.

Any luck reproducing the problem?

On Wed, Oct 19, 2016 at 08:06:44AM -0600, Jens Axboe wrote:

I'll take a look today. I agree, this definitely looks like a fio
bug. But it's not related to the mutex issue for the stat part: all
verifier threads are waiting to be woken up, but the main thread is done.


I was taking a closer look at this, and it does look like it's related
to the stat_mutex.  The main thread (according to gdb) seems to be
stuck in this loop in backend.c line 1738 (in thread_main):

		do {
			check_update_rusage(td);
			if (!fio_mutex_down_trylock(stat_mutex))
				break;
			usleep(1000);   <----- line 1738
		} while (1);

So it looks like it's not able to grab the stat_mutex.  But I can't
figure out how the stat_mutex could be down.  None of the stack
traces seem to show that, and I've looked at all of the places where
stat_mutex is taken, and it doesn't look like stat_mutex should ever
be down for more than, say, a second?

So as a temporary workaround, I'm considering adding a check to see if
we stay stuck in this loop for more than a thousand iterations, and if
so, print an error to stderr and then call _exit(1), or maybe just break
out two levels by jumping to line 1778 at "td_set_runstate(td, TD_FINISHING)"
and just give up on the usage statistics (since for xfstests we really
don't care about the usage stats).
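
For illustration, the bounded-retry idea described above could look
roughly like the sketch below. It is not the patch that was sent:
pthread_mutex_trylock() stands in for fio_mutex_down_trylock(), and
lock_with_retry_limit(), max_failures and delay_us are hypothetical
names chosen for this example.

```c
#define _DEFAULT_SOURCE		/* for usleep() under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to take 'lock', sleeping 'delay_us' microseconds between attempts.
 * Gives up once more than 'max_failures' attempts have failed.
 *
 * Returns 0 if the lock was acquired, -1 if we gave up.
 *
 * ASSUMPTION: a plain pthread mutex replaces fio's stat_mutex here, and
 * the caller decides what "giving up" means (_exit(1), or skipping the
 * usage statistics and moving on to the finishing state).
 */
static int lock_with_retry_limit(pthread_mutex_t *lock,
				 unsigned int max_failures,
				 useconds_t delay_us)
{
	unsigned int failures = 0;

	assert(lock != NULL);

	for (;;) {
		/* pthread_mutex_trylock() returns 0 on success. */
		if (pthread_mutex_trylock(lock) == 0)
			return 0;

		if (++failures > max_failures) {
			fprintf(stderr,
				"stat mutex still held after %u attempts, giving up\n",
				failures);
			return -1;
		}
		usleep(delay_us);
	}
}
```

With max_failures = 1000 and delay_us = 1000 this gives up after
roughly one second of sleeping, which matches the "more than, say, a
second" expectation for how long stat_mutex should ever be held.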

Very strange. Can you see who the owner of stat_mutex->lock is? That's
the pthread_mutex_t they are sleeping on.
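
As a rough sketch of where that owner information lives: on Linux with
glibc (NPTL), a pthread_mutex_t records the kernel thread ID of the
holder in the internal field __data.__owner, which is also what a
debugger would print. This is a glibc implementation detail, not a
POSIX interface, and the helper names mutex_owner_tid() and
current_tid() are hypothetical.

```c
#define _GNU_SOURCE		/* for syscall() */
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Kernel thread ID of the caller (the LWP number gdb shows). */
static pid_t current_tid(void)
{
	return (pid_t)syscall(SYS_gettid);
}

/*
 * Best-effort lookup of the thread that holds 'm'.
 *
 * ASSUMPTION: glibc/NPTL, where pthread_mutex_t exposes __data.__owner
 * (0 when unlocked). On other C libraries this returns -1 because the
 * layout is unknown. The read is unsynchronized, so treat the result
 * as a debugging hint only.
 */
static pid_t mutex_owner_tid(const pthread_mutex_t *m)
{
	assert(m != NULL);
#if defined(__GLIBC__)
	return (pid_t)m->__data.__owner;
#else
	return -1;
#endif
}
```

If the reported owner is a thread that has already exited, or one that
is itself blocked waiting for the main thread, that would point at
where the stat_mutex got left down.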

For now, I'll apply the work-around you sent. I haven't been able to
reproduce this, but knowing that it's the stat_mutex will allow me to
better make up a test case to hit it.

--
Jens Axboe
