Re: Raid10 device hangs during resync and heavy I/O.

On 23/07/10 13:19 +1000, Neil Brown wrote:
> On Thu, 22 Jul 2010 14:49:33 -0400
> Justin Bronder <jsbronder@xxxxxxxxxx> wrote:
> 
> > On 16/07/10 14:46 -0400, Justin Bronder wrote:
> > 
> > I've done some more research that may help. All of
> > the following was done with 2.6.34.1.
> > 
> > Still produces the hang:
> >     - Using cp (may take a bit longer).
> >     - Using jfs as the filesystem.
> >     - Dropping RESYNC_DEPTH to 32.
> >     - Using the offset layout.
> > 
> > Does not produce the hang:
> >     - Using the near layout.
> >     - Using dd on the partition directly instead of on a
> >       filesystem via something like:
> >       dd if=/dev/${MD_DEV}p1 of=/dev/${MD_DEV}p1 seek=4001 bs=1M
> > 
> > 
> > As the barrier code is very similar, I repeated a number of
> > these tests using raid1 instead of raid10.  In every case, I was
> > unable to cause the system to hang.  I focused on the barriers
> > due to the tracebacks in the previous email.  For the heck of it,
> > I added some tracing (patch below) where the reason for the hang
> > is fairly obvious.  Of course, how it happened isn't.
> > 
> > The last bit of the trace before the hang.
> 
> Thanks for doing this!
> 
> See below...

<previous trace cut>

> 
> 
> So the 'dd' process successfully waited for the barrier to be gone at
> 189.021179, and thus set pending to '1'.  It then submitted the IO request.
> We should then see swapper (or possibly some other thread) calling
> allow_barrier when the request completes.  But we don't.
> A request could possibly take many milliseconds to complete, but it shouldn't
> take seconds and certainly not minutes.
> 
> It might be helpful if you could run this again, and in make_request(), after
> the call to "wait_barrier()" print out:
>   bio->bi_sector, bio->bi_size, bio->bi_rw
> 
> I'm guessing that the last request that doesn't seem to complete will be
> different from the others in some important way.

Nothing stood out to me, but here's the tail end of a couple of different
traces.

           <...>-5047  [002]   207.023784: wait_barrier: in:  dd - w:0 p:11 b:0
           <...>-5047  [002]   207.023784: wait_barrier: out: dd - w:0 p:12 b:0
           <...>-5047  [002]   207.023785: make_request: dd - sector:7472001 sz:40960 rw:0
           <...>-4958  [002]   207.023872: raise_barrier: mid: md99_resync - w:0 p:12 b:1
           <...>-5047  [002]   207.024689: allow_barrier:     dd - w:0 p:11 b:1
           <...>-5047  [002]   207.024695: allow_barrier:     dd - w:0 p:10 b:1
           <...>-5047  [002]   207.024697: allow_barrier:     dd - w:0 p:9 b:1
           <...>-5047  [002]   207.024710: allow_barrier:     dd - w:0 p:8 b:1
           <...>-5047  [002]   207.024713: allow_barrier:     dd - w:0 p:7 b:1
           <...>-5047  [002]   207.026679: wait_barrier: in:  dd - w:0 p:7 b:1
          <idle>-0     [003]   207.043049: allow_barrier:     swapper - w:1 p:6 b:1
          <idle>-0     [003]   207.043058: allow_barrier:     swapper - w:1 p:5 b:1
          <idle>-0     [003]   207.043063: allow_barrier:     swapper - w:1 p:4 b:1
          <idle>-0     [003]   207.043070: allow_barrier:     swapper - w:1 p:3 b:1
          <idle>-0     [003]   207.043074: allow_barrier:     swapper - w:1 p:2 b:1
          <idle>-0     [003]   207.043079: allow_barrier:     swapper - w:1 p:1 b:1
          <idle>-0     [003]   207.043084: allow_barrier:     swapper - w:1 p:0 b:1
           <...>-4958  [003]   207.043108: raise_barrier: out: md99_resync - w:1 p:0 b:1
           <...>-4958  [003]   207.043150: raise_barrier: in:  md99_resync - w:1 p:0 b:1
           <...>-4957  [003]   207.051206: lower_barrier:     md99_raid10 - w:1 p:0 b:0
           <...>-5047  [002]   207.051215: wait_barrier: out: dd - w:0 p:1 b:0
           <...>-5047  [002]   207.051216: make_request: dd - sector:7472081 sz:20480 rw:0
           <...>-4958  [003]   207.051218: raise_barrier: mid: md99_resync - w:0 p:1 b:1
           <...>-5047  [002]   207.051227: wait_barrier: in:  dd - w:0 p:1 b:1
          <idle>-0     [002]   207.058929: allow_barrier:     swapper - w:1 p:0 b:1
           <...>-4958  [003]   207.058938: raise_barrier: out: md99_resync - w:1 p:0 b:1
           <...>-4958  [003]   207.059044: raise_barrier: in:  md99_resync - w:1 p:0 b:1
           <...>-4957  [003]   207.067171: lower_barrier:     md99_raid10 - w:1 p:0 b:0
           <...>-5047  [002]   207.067179: wait_barrier: out: dd - w:0 p:1 b:0
           <...>-5047  [002]   207.067180: make_request: dd - sector:7472121 sz:3584 rw:0
           <...>-4958  [003]   207.067182: raise_barrier: mid: md99_resync - w:0 p:1 b:1
           <...>-5047  [002]   207.067184: wait_barrier: in:  dd - w:0 p:1 b:1
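
A small sketch for comparing the requests in a trace like the one above.
It assumes the "make_request:" line format printed by the ad-hoc tracing
(the patch itself is not shown here), that sz is bio->bi_size in bytes, and
that sectors are 512 bytes. parse_requests and end_sector are hypothetical
helper names, not part of any existing tool.

```python
import re

# Matches make_request lines as they appear in the trace, e.g.
#   <...>-5047  [002]   207.023785: make_request: dd - sector:7472001 sz:40960 rw:0
MAKE_REQUEST = re.compile(
    r"\[(?P<cpu>\d+)\]\s+(?P<time>\d+\.\d+): make_request: "
    r"(?P<task>\S+) - sector:(?P<sector>\d+) sz:(?P<size>\d+) rw:(?P<rw>\d+)"
)

SECTOR_BYTES = 512  # assumption: sz is in bytes, sectors are 512 bytes


def parse_requests(lines):
    """Return one dict per make_request line; all other lines are skipped."""
    requests = []
    for line in lines:
        m = MAKE_REQUEST.search(line)
        if m is None:
            continue
        requests.append({
            "time": float(m["time"]),
            "task": m["task"],
            "sector": int(m["sector"]),
            "size": int(m["size"]),
            "rw": int(m["rw"]),
        })
    return requests


def end_sector(req):
    """First sector past the end of the request, rounding partial sectors up."""
    sectors = -(-req["size"] // SECTOR_BYTES)  # ceiling division
    return req["sector"] + sectors


if __name__ == "__main__":
    import sys

    # Usage: python parse_trace.py < trace.txt
    for req in parse_requests(sys.stdin):
        print(req["time"], req["task"], req["sector"], req["size"],
              req["rw"], end_sector(req))
```

Printing the end sector next to each request makes it easier to check
whether consecutive requests from the same task are contiguous and where
the last one before the hang stops.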



          <idle>-0     [000]   463.231730: allow_barrier:     swapper - w:2 p:4 b:1
          <idle>-0     [000]   463.231739: allow_barrier:     swapper - w:2 p:3 b:1
          <idle>-0     [000]   463.231746: allow_barrier:     swapper - w:2 p:2 b:1
          <idle>-0     [000]   463.231765: allow_barrier:     swapper - w:2 p:1 b:1
          <idle>-0     [000]   463.231774: allow_barrier:     swapper - w:2 p:0 b:1
           <...>-5004  [000]   463.231792: raise_barrier: out: md99_resync - w:2 p:0 b:1
           <...>-5004  [000]   463.232005: raise_barrier: in:  md99_resync - w:2 p:0 b:1
           <...>-5003  [001]   463.232453: lower_barrier:     md99_raid10 - w:2 p:0 b:0
           <...>-5009  [000]   463.232463: wait_barrier: out: flush-9:99 - w:1 p:1 b:0
           <...>-5009  [000]   463.232464: make_request: flush-9:99 - sector:13931137 sz:61440 rw:1
           <...>-5105  [001]   463.232466: wait_barrier: out: dd - w:0 p:2 b:0
           <...>-5105  [001]   463.232467: make_request: dd - sector:7204393 sz:40960 rw:0
           <...>-5009  [000]   463.232476: wait_barrier: in:  flush-9:99 - w:0 p:2 b:0
           <...>-5009  [000]   463.232477: wait_barrier: out: flush-9:99 - w:0 p:3 b:0
           <...>-5009  [000]   463.232477: make_request: flush-9:99 - sector:13931257 sz:3584 rw:1
           <...>-5009  [000]   463.232481: wait_barrier: in:  flush-9:99 - w:0 p:3 b:0
           <...>-5009  [000]   463.232482: wait_barrier: out: flush-9:99 - w:0 p:4 b:0
           <...>-5009  [000]   463.232483: make_request: flush-9:99 - sector:13931264 sz:512 rw:1
           <...>-5105  [001]   463.232492: wait_barrier: in:  dd - w:0 p:4 b:0
           <...>-5105  [001]   463.232493: wait_barrier: out: dd - w:0 p:5 b:0
           <...>-5105  [001]   463.232494: make_request: dd - sector:7204473 sz:3584 rw:0
           <...>-5004  [000]   463.232495: raise_barrier: mid: md99_resync - w:0 p:5 b:1
           <...>-5105  [001]   463.232496: wait_barrier: in:  dd - w:0 p:5 b:1
           <...>-5009  [000]   463.232522: wait_barrier: in:  flush-9:99 - w:1 p:5 b:1
          <idle>-0     [000]   463.232726: allow_barrier:     swapper - w:2 p:4 b:1
          <idle>-0     [001]   463.240520: allow_barrier:     swapper - w:2 p:3 b:1
          <idle>-0     [000]   463.240946: allow_barrier:     swapper - w:2 p:2 b:1
          <idle>-0     [000]   463.240955: allow_barrier:     swapper - w:2 p:1 b:1
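
The trace above stops with p:1 and b:1: the barrier is raised and one
request has not been released by allow_barrier. A toy model of that
handshake, as a sketch only. It assumes p is the pending-request count and
b the barrier count, and it deliberately ignores the waiting counter (w),
RESYNC_DEPTH, and all locking and sleeping. BarrierModel is a hypothetical
name and is not the kernel code.

```python
class BarrierModel:
    """Single-threaded toy model of the p (pending) and b (barrier) counters."""

    def __init__(self):
        self.pending = 0   # assumed to correspond to "p" in the trace
        self.barrier = 0   # assumed to correspond to "b" in the trace

    def wait_barrier(self):
        """Normal I/O entry. Returns False where the real caller would sleep."""
        if self.barrier:
            return False
        self.pending += 1
        return True

    def allow_barrier(self):
        """Called when a normal request completes."""
        self.pending -= 1

    def raise_barrier(self):
        """Resync raises the barrier ("mid" in the trace)."""
        self.barrier += 1

    def resync_may_proceed(self):
        """Resync only gets past raise_barrier ("out") once pending drains."""
        return self.barrier > 0 and self.pending == 0

    def lower_barrier(self):
        """Resync I/O finished; normal I/O may enter again."""
        self.barrier -= 1


def is_stuck(model):
    """True when neither side can make progress without a completion."""
    return model.barrier > 0 and model.pending > 0
```

In this model, a request that enters through wait_barrier and never
reaches allow_barrier leaves is_stuck() true forever: resync cannot
proceed, and every later wait_barrier call is refused because the barrier
is still up.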

Thanks,

-- 
Justin Bronder
