Forgot to include the mailing list on this.

> Hi Mike,
> thanks for the updates.
>
> I'm not entirely clear what is happening (in fact, due to a cold that I am
> still fighting off, nothing is entirely clear at the moment), but it looks
> very likely that the problem is due to an interplay between barrier handling
> and the multi-level structure of your array (a raid0 being a member of a
> raid5).
>
> When a barrier request is processed, both arrays will schedule 'work' to be
> done by the 'event' thread, and I'm guessing that you can get into a
> situation where one work item is waiting for the other, but the other is
> behind the one on the single queue (I wonder if that makes sense...).
>
> Anyway, this patch might make a difference. It reduces the number of work
> items scheduled, in a way that could conceivably fix the problem.
>
> If you can test this, please report the results. I cannot easily reproduce
> the problem, so there is limited testing that I can do.
>
> Thanks,
> NeilBrown
>
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f20d13e..7f2785c 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -294,6 +294,23 @@ EXPORT_SYMBOL(mddev_congested);
>
>  #define POST_REQUEST_BARRIER ((void*)1)
>
> +static void md_barrier_done(mddev_t *mddev)
> +{
> +	struct bio *bio = mddev->barrier;
> +
> +	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
> +		bio_endio(bio, -EOPNOTSUPP);
> +	else if (bio->bi_size == 0)
> +		bio_endio(bio, 0);
> +	else {
> +		/* other options need to be handled from process context */
> +		schedule_work(&mddev->barrier_work);
> +		return;
> +	}
> +	mddev->barrier = NULL;
> +	wake_up(&mddev->sb_wait);
> +}
> +
>  static void md_end_barrier(struct bio *bio, int err)
>  {
>  	mdk_rdev_t *rdev = bio->bi_private;
> @@ -310,7 +327,7 @@ static void md_end_barrier(struct bio *bio, int err)
>  			wake_up(&mddev->sb_wait);
>  		} else
>  			/* The pre-request barrier has finished */
> -			schedule_work(&mddev->barrier_work);
> +			md_barrier_done(mddev);
>  	}
>  	bio_put(bio);
>  }
> @@ -350,18 +367,12 @@ static void md_submit_barrier(struct work_struct *ws)
>
>  	atomic_set(&mddev->flush_pending, 1);
>
> -	if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags))
> -		bio_endio(bio, -EOPNOTSUPP);
> -	else if (bio->bi_size == 0)
> -		/* an empty barrier - all done */
> -		bio_endio(bio, 0);
> -	else {
> -		bio->bi_rw &= ~REQ_HARDBARRIER;
> -		if (mddev->pers->make_request(mddev, bio))
> -			generic_make_request(bio);
> -		mddev->barrier = POST_REQUEST_BARRIER;
> -		submit_barriers(mddev);
> -	}
> +	bio->bi_rw &= ~REQ_HARDBARRIER;
> +	if (mddev->pers->make_request(mddev, bio))
> +		generic_make_request(bio);
> +	mddev->barrier = POST_REQUEST_BARRIER;
> +	submit_barriers(mddev);
> +
>  	if (atomic_dec_and_test(&mddev->flush_pending)) {
>  		mddev->barrier = NULL;
>  		wake_up(&mddev->sb_wait);
> @@ -383,7 +394,7 @@ void md_barrier_request(mddev_t *mddev, struct bio *bio)
>  	submit_barriers(mddev);
>
>  	if (atomic_dec_and_test(&mddev->flush_pending))
> -		schedule_work(&mddev->barrier_work);
> +		md_barrier_done(mddev);
>  }
>  EXPORT_SYMBOL(md_barrier_request);
>


Neil, thanks for the patch. I experienced the lockup for the 5th time an hour
ago (about 3 hours after the last hard reboot), so I thought it would be a
good time to try your patch. Unfortunately I'm getting an error:

patching file drivers/md/md.c
Hunk #1 succeeded at 291 with fuzz 1 (offset -3 lines).
Hunk #2 FAILED at 324.
Hunk #3 FAILED at 364.
Hunk #4 FAILED at 391.
3 out of 4 hunks FAILED -- saving rejects to file drivers/md/md.c.rej

"uname -r" gives "2.6.35-gentoo-r4", so I suspect that's why; I guess the
standard gentoo patchset does something with that file. I'm skimming through
md.c to see if I can understand it well enough to apply the patch
functionality manually (my current reading of the deadlock is sketched
below). I've also uploaded my 2.6.35-gentoo-r4 md.c to
www.hartmanipulation.com/raid/ with the other files, in case you or someone
else wants to take a look at it.

Mike
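P.S. To check that I understand the failure mode Neil describes: on a
pre-2.6.36 kernel like mine, schedule_work() puts everything on the single
per-CPU 'events' thread, which runs work items one at a time. If the raid5
barrier work blocks waiting for something that only the raid0 barrier work
can provide, and the raid0 item is queued behind it on that same thread,
neither can ever finish. A minimal hypothetical module (my own illustration,
not the md code) that hangs the same way:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/completion.h>

static DECLARE_COMPLETION(inner_done);

/* Stand-in for the lower-level (raid0) barrier work: it would let the
 * upper level proceed, but it never runs -- outer_fn is occupying the
 * one keventd thread that both items were queued on. */
static void inner_fn(struct work_struct *ws)
{
	complete(&inner_done);
}

/* Stand-in for the upper-level (raid5) barrier work: it blocks the
 * shared worker thread waiting on work queued behind it. Deadlock. */
static void outer_fn(struct work_struct *ws)
{
	wait_for_completion(&inner_done);
}

static DECLARE_WORK(outer_work, outer_fn);
static DECLARE_WORK(inner_work, inner_fn);

static int __init wq_deadlock_init(void)
{
	schedule_work(&outer_work);	/* upper array's barrier work */
	schedule_work(&inner_work);	/* lower array's barrier work */
	return 0;
}

static void __exit wq_deadlock_exit(void)
{
}

module_init(wq_deadlock_init);
module_exit(wq_deadlock_exit);
MODULE_LICENSE("GPL");

If that picture is right, I can see why completing the trivial cases
directly in md_barrier_done(), rather than always going through
schedule_work(), would reduce the chances of the two items colliding.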