On Thu, 21 Feb 2013 15:58:24 -0700 Tregaron Bayly <tbayly@xxxxxxxxxxxx> wrote:

> Symptom:
> A RAID 1 array ends up with two threads (flush and raid1) stuck in D
> state forever. The array is inaccessible and the host must be restarted
> to restore access to the array.
>
> I have some scripted workloads that reproduce this within a maximum of a
> couple hours on kernels from 3.6.11 - 3.8-rc7. I cannot reproduce on
> 3.4.32. 3.5.7 ends up with three threads stuck in D state, but the
> stacks are different from this bug (as it's EOL maybe of interest in
> bisecting the problem?).

Can you post the 3 stacks from the 3.5.7 case? It might help get a more
complete understanding.

...
> Both processes end up in wait_event_lock_irq() waiting for favorable
> conditions in the struct r1conf to proceed. These conditions obviously
> seem to never arrive. I placed printk statements in freeze_array() and
> wait_barrier() directly before calling their respective
> wait_event_lock_irq() and this is an example output:
>
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Attempting to freeze array: barrier (1), nr_waiting (1), nr_pending (5), nr_queued (3)
> Feb 20 17:47:35 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (2), nr_pending (5), nr_queued (3)
> Feb 20 17:47:38 sanclient kernel: [4946b55d-bb0a-7fce-54c8-ac90615dabc1] Awaiting barrier: barrier (1), nr_waiting (3), nr_pending (5), nr_queued (3)

This is very useful, thanks. Clearly there is one 'pending' request that
isn't being counted, but also isn't being allowed to complete.
Maybe it is in pending_bio_list, and so counted in conf->pending_count.

Could you print out that value as well and try to trigger the bug again?
If conf->pending_count is non-zero, then it seems very likely that we have
found the problem.

Fixing it isn't quite so easy. 'nr_pending' counts requests from the
filesystem that are still pending. 'pending_count' counts requests sent
down to the underlying device that are still pending. There isn't a
1-to-1 correspondence, so we cannot just subtract one from the other.
It will require more thought than that.

Thanks for the thorough report,
NeilBrown
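
For reference, the counter arithmetic behind "one 'pending' request that
isn't being counted" can be sketched in userspace C. This is only an
illustration, not kernel code. It assumes that freeze_array() waits for
nr_pending == nr_queued + 1 (the wait condition itself is not quoted in
this thread, so treat that as an assumption). The struct and the function
names below are hypothetical. The last helper formats a debug line in the
style of the reporter's printk output with pending_count appended, which
is roughly what the extra instrumentation would need to print.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace mirror of the counters named in this thread.
 * This is NOT the kernel's struct r1conf. */
struct r1conf_counters {
	int barrier;
	int nr_waiting;
	int nr_pending;    /* requests from the filesystem still pending */
	int nr_queued;
	int pending_count; /* requests sent down to the device still pending */
};

/* ASSUMPTION: freeze_array() may proceed once every pending request
 * other than the caller's own has been queued, that is, when
 * nr_pending == nr_queued + 1. */
int freeze_can_proceed(const struct r1conf_counters *c)
{
	return c->nr_pending == c->nr_queued + 1;
}

/* How many pending requests are unaccounted for under that assumption. */
int unaccounted_requests(const struct r1conf_counters *c)
{
	return c->nr_pending - (c->nr_queued + 1);
}

/* Format a debug line like the reporter's, plus pending_count.
 * Returns the snprintf() result (length that was or would be written). */
int format_freeze_line(char *buf, size_t len,
		       const struct r1conf_counters *c)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len,
			"Attempting to freeze array: barrier (%d), "
			"nr_waiting (%d), nr_pending (%d), nr_queued (%d), "
			"pending_count (%d)",
			c->barrier, c->nr_waiting, c->nr_pending,
			c->nr_queued, c->pending_count);
}
```

With the values from the first log line (nr_pending 5, nr_queued 3),
freeze_can_proceed() is false and unaccounted_requests() is 1. The log
does not include pending_count, so any value plugged in for it here is a
placeholder until the extra printk output is available.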