On Dec 29, 2007 1:58 PM, dean gaudet <dean@xxxxxxxxxx> wrote:
> On Sat, 29 Dec 2007, Dan Williams wrote:
>
> > On Dec 29, 2007 9:48 AM, dean gaudet <dean@xxxxxxxxxx> wrote:
> > > hmm bummer, i'm doing another test (rsync 3.5M inodes from another box) on
> > > the same 64k chunk array and had raised the stripe_cache_size to 1024...
> > > and got a hang. this time i grabbed stripe_cache_active before bumping
> > > the size again -- it was only 905 active. as i recall the bug we were
> > > debugging a year+ ago the active was at the size when it would hang. so
> > > this is probably something new.
> >
> > I believe I am seeing the same issue and am trying to track down
> > whether XFS is doing something unexpected, i.e. I have not been able
> > to reproduce the problem with EXT3. MD tries to increase throughput
> > by letting some stripe work build up in batches. It looks like every
> > time your system has hung it has been in the 'inactive_blocked' state
> > i.e. > 3/4 of stripes active. This state should automatically
> > clear...
>
> cool, glad you can reproduce it :)
>
> i have a bit more data... i'm seeing the same problem on debian's
> 2.6.22-3-amd64 kernel, so it's not new in 2.6.24.
>

This is just brainstorming at this point, but it looks like XFS can
submit more requests in the bi_end_io path such that it can lock
itself out of the RAID array. The sequence that concerns me is:

return_io->xfs_buf_end_io->xfs_buf_io_end->xfs_buf_iodone_work->xfs_buf_iorequest->make_request-><hang>

I need to verify whether this path is actually triggering, but if we
are in an inactive_blocked condition this new request will be put on a
wait queue and we'll never get to the release_stripe() call after
return_io(). It would be interesting to see if this is new XFS behavior
in recent kernels.

--
Dan
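
P.S. In case it helps to visualize the dependency, here is a toy
userspace model of that sequence. This is purely illustrative -- the
names and the pool below are made up for the sketch and are not the
actual md/raid5 or XFS code -- but it shows how a completion path that
synchronously submits new I/O can starve itself once the stripe cache
is saturated:

/*
 * Toy userspace model of the suspected lockup -- NOT md/raid5 code;
 * every name below is made up for illustration only.
 *
 * A fixed pool of "stripes" stands in for stripe_cache_size.  A request
 * holds a stripe until its completion callback has run.  If the
 * completion callback itself submits a new request (as the
 * xfs_buf_iodone_work -> xfs_buf_iorequest path appears to do) while
 * the pool is already exhausted, the submitting thread sleeps waiting
 * for a free stripe -- but the stripe it would have released is only
 * released *after* the callback returns, so nothing ever wakes it up.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_STRIPES 4                    /* tiny "stripe cache" */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  free_stripe = PTHREAD_COND_INITIALIZER;
static int active_stripes;              /* like stripe_cache_active */

static void get_stripe(void)
{
        pthread_mutex_lock(&lock);
        while (active_stripes >= NR_STRIPES)    /* "inactive_blocked" */
                pthread_cond_wait(&free_stripe, &lock);
        active_stripes++;
        pthread_mutex_unlock(&lock);
}

static void put_stripe(void)
{
        pthread_mutex_lock(&lock);
        active_stripes--;
        pthread_cond_signal(&free_stripe);
        pthread_mutex_unlock(&lock);
}

/* stands in for the bi_end_io path resubmitting I/O */
static void end_io_resubmits(void)
{
        get_stripe();           /* sleeps forever if the cache is full */
        /* ... the new request would be processed here ... */
        put_stripe();
}

static void complete_request(void)
{
        end_io_resubmits();     /* return_io() -> ...xfs callbacks... */
        put_stripe();           /* "release_stripe()": never reached */
}

int main(void)
{
        /* saturate the cache, then complete one request */
        for (int i = 0; i < NR_STRIPES; i++)
                get_stripe();

        printf("active=%d, completing one request...\n", active_stripes);
        complete_request();     /* hangs inside end_io_resubmits() */
        printf("never printed\n");
        return 0;
}

Compiled with -pthread, it fills the pool and then hangs in
end_io_resubmits(), which is the shape of the lockup I suspect: the
stripe that would satisfy the new request is only released after the
completion callback returns.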