Re: [BUG] MD/RAID1 hung forever on freeze_array

Thanks, Neil, for the valuable input; please see my comments inline.

On Thu, Dec 8, 2016 at 4:17 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Thu, Dec 08 2016, Jinpu Wang wrote:
>
>> On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang
>> <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
>>> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@xxxxxxx> wrote:
>>>> On 2016/11/28 5:02 PM, Jinpu Wang wrote:
>>>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@xxxxxxx> wrote:
>>>>>> On 2016/11/28 4:24 PM, Jinpu Wang wrote:
>>>>>>> snip
>>>>>>>>>>
>>>>>>>>>> every time nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>>>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>>>
>>>>>>>>>> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
>>>>>>>>>> nr_queued to prevent freeze_array hang"). It seems to have fixed a similar bug.
>>>>>>>>>>
>>>>>>>>>> Could you give your suggestion?
>>>>>>>>>>
>>>>>>>>> Sorry, forgot to mention kernel version is 4.4.28
>
>>
>> I continued debugging the bug:
>>
>> 20161207
>
>>   nr_pending = 948,
>>   nr_waiting = 9,
>>   nr_queued = 946, // again we need one more to finish the wait_event.
>>   barrier = 0,
>>   array_frozen = 1,
>
>> on conf->bio_end_io_list we have 91 entries.
>
>> on conf->retry_list we have 855 entries.
>
> This is useful.  It confirms that nr_queued is correct, and that
> nr_pending is consistently 1 higher than expected.
> This suggests that a request has been counted in nr_pending, but hasn't
> yet been submitted, or has been taken off one of the queues but has not
> yet been processed.
>
> I notice that in your first email the Blocked tasks listed included
> raid1d which is blocked in freeze_array() and a few others in
> make_request() blocked on wait_barrier().
> In that case nr_waiting was 100, so there should have been 100 threads
> blocked in wait_barrier().  Is that correct?  I assume you thought it
> was pointless to list them all, which seems reasonable.
This is correct. In my test I initially set numjobs to 100 in fio;
later I reduced it to 10 and could still trigger the bug.
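
For completeness, the fio job is essentially the following (a minimal
sketch from memory; device name, block size and runtime here are
illustrative, not copied verbatim from the test box):

[global]
ioengine=libaio
direct=1
rw=randrw
bs=4k
iodepth=32
time_based
runtime=600

[md1]
filename=/dev/md1
numjobs=10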


>
> I asked because I wonder if there might have been one thread in
> make_request() which was blocked on something else.  There are a couple
> of places where make_request() will wait after having successfully called
> wait_barrier().  If that happened, it would cause exactly the symptoms
> you report.  Could you check all blocked threads carefully please?

Every time it's the same hung task: raid1d is blocked in freeze_array:
PID: 5002   TASK: ffff8800ad430d00  CPU: 0   COMMAND: "md1_raid1"
 #0 [ffff8800ad44bbf8] __schedule at ffffffff81811483
 #1 [ffff8800ad44bc50] schedule at ffffffff81811c60
 #2 [ffff8800ad44bc68] freeze_array at ffffffffa085a1e1 [raid1]
 #3 [ffff8800ad44bcc0] handle_read_error at ffffffffa085bfe9 [raid1]
 #4 [ffff8800ad44bd68] raid1d at ffffffffa085d29b [raid1]
 #5 [ffff8800ad44be60] md_thread at ffffffffa0549040 [md_mod]
and 9 fio threads are blocked in wait_barrier:
PID: 5172   TASK: ffff88022bf4a700  CPU: 3   COMMAND: "fio"
 #0 [ffff88022186b6d8] __schedule at ffffffff81811483
 #1 [ffff88022186b730] schedule at ffffffff81811c60
 #2 [ffff88022186b748] wait_barrier at ffffffffa085a0b6 [raid1]
 #3 [ffff88022186b7b0] make_request at ffffffffa085c3cd [raid1]
 #4 [ffff88022186b890] md_make_request at ffffffffa0549903 [md_mod]
The remaining fio thread is sleeping in futex_wait; I think it's unrelated:
PID: 5171   TASK: ffff88022bf49a00  CPU: 2   COMMAND: "fio"
 #0 [ffff880221f97c08] __schedule at ffffffff81811483
 #1 [ffff880221f97c60] schedule at ffffffff81811c60
 #2 [ffff880221f97c78] futex_wait_queue_me at ffffffff810c8a42
 #3 [ffff880221f97cb0] futex_wait at ffffffff810c9c72
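
For reference, the condition raid1d is stuck on (freeze_array() in
4.4-era drivers/md/raid1.c, quoted roughly and trimmed of comments):

static void freeze_array(struct r1conf *conf, int extra)
{
	spin_lock_irq(&conf->resync_lock);
	conf->array_frozen = 1;
	/* handle_read_error() passes extra == 1: the failed request
	 * itself is counted in nr_pending but sits on no queue, so
	 * the wait is for nr_pending == nr_queued + 1. */
	wait_event_lock_irq_cmd(conf->wait_barrier,
				conf->nr_pending == conf->nr_queued+extra,
				conf->resync_lock,
				flush_pending_writes(conf));
	spin_unlock_irq(&conf->resync_lock);
}

With nr_pending == 948 and nr_queued == 946 that condition can never
become true, which matches the hang.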



>
>
> There are other ways that nr_pending and nr_queued can get out of sync,
> though I think they would result in nr_pending being less than
> nr_queued, not more.
>
> If the presence of a bad block in the bad block log causes a request to
> be split into two r1bios, and if both of those end up on one of the
> queues, then they would be added to nr_queued twice, but to nr_pending
> only once.  We should fix that.
I checked; there are some bad blocks that md detected, though I'm not
sure if it helps. Each line is <first-bad-sector> <length-in-sectors>:
root@ibnbd-clt1:~/jack# cat /sys/block/md1/md/dev-ibnbd0/bad_blocks
180630 512
181142 7
982011 8
1013386 255
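
For context on the double-count scenario Neil describes above:
wait_barrier() increments nr_pending once per incoming bio, while
every r1bio that lands on the retry list goes through
reschedule_retry(), which in 4.4 is roughly:

static void reschedule_retry(struct r1bio *r1_bio)
{
	unsigned long flags;
	struct mddev *mddev = r1_bio->mddev;
	struct r1conf *conf = mddev->private;

	spin_lock_irqsave(&conf->device_lock, flags);
	list_add(&r1_bio->retry_list, &conf->retry_list);
	conf->nr_queued++;	/* counted once per r1bio */
	spin_unlock_irqrestore(&conf->device_lock, flags);

	wake_up(&conf->wait_barrier);
	md_wakeup_thread(mddev->thread);
}

so a bio split into two r1bios over a bad block would be counted twice
in nr_queued but only once in nr_pending.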

I also checked each md_rdev:
crash> struct md_rdev 0xffff880230630000
struct md_rdev {
  same_set = {
    next = 0xffff88022db03818,
    prev = 0xffff880230653e00
  },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 125803676,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56080,
  sb_page = 0xffffea0008d3edc0,
  bb_page = 0xffffea0008b17400,
  sb_loaded = 1,
  sb_events = 72,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88022c567510 "dev-loop1",
    entry = {
      next = 0xffff880230630080,
      prev = 0xffff880230630080
    },
    parent = 0xffff88022db03850,
    kset = 0x0,
    ktype = 0xffffffffa055f200 <rdev_ktype>,
    sd = 0xffff88022c4ac708,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff8802306300c8,
      prev = 0xffff8802306300c8
    }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 1
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff88022c4ac780,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff88022740d000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 60
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 0,
    size = 0
  }
}
crash> struct md_rdev 0xffff880230653e00
struct md_rdev {
  same_set = {
    next = 0xffff880230630000,
    prev = 0xffff88022db03818
  },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 9186098,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56700,
  sb_page = 0xffffea0007758f40,
  bb_page = 0xffffea000887b480,
  sb_loaded = 1,
  sb_events = 42,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff880233c825b0 "dev-ibnbd0",
    entry = {
      next = 0xffff880230653e80,
      prev = 0xffff880230653e80
    },
    parent = 0xffff88022db03850,
    kset = 0x0,
    ktype = 0xffffffffa055f200 <rdev_ktype>,
    sd = 0xffff8800b9fcce88,
    kref = {
      refcount = {
        counter = 1
      }
    },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 581,
  blocked_wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff880230653ec8,
      prev = 0xffff880230653ec8
    }
  },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  {
    recovery_offset = 18446744073709551615,
    journal_tail = 18446744073709551615
  },
  nr_pending = {
    counter = 856
  },
  read_errors = {
    counter = 0
  },
  last_read_error = {
    tv_sec = 0,
    tv_nsec = 0
  },
  corrected_errors = {
    counter = 0
  },
  del_work = {
    data = {
      counter = 0
    },
    entry = {
      next = 0x0,
      prev = 0x0
    },
    func = 0x0
  },
  sysfs_state = 0xffff8800b9fccf78,
  badblocks = {
    count = 4,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff880227211000,
    changed = 0,
    lock = {
      seqcount = {
        sequence = 1624
      },
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      }
    },
    sector = 80,
    size = 8
  }
}
>
>
>>
>> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>>
>> ffff8800b9791ff8
>> struct r1bio {
>>   remaining = {
>>     counter = 0
>>   },
>>   behind_remaining = {
>>     counter = 0
>>   },
>>   sector = 18446612141670676480, // corrupted?
>>   start_next_window = 18446612141565972992, //ditto
>
> I don't think this is corruption.
>
>> crash> struct r1conf 0xffff8800b9792000
>> struct r1conf {
>   ....
>>   retry_list = {
>>     next = 0xffff8800afe690c0,
>>     prev = 0xffff8800b96acac0
>>   },
>
> The pointer you started at was at the end of the list.
> So this r1bio structure you are seeing is not an r1bio at all but the
> memory out of the middle of the r1conf, being interpreted as an r1bio.
> You can confirm this by noticing that retry_list in the r1bio:
>
>>   retry_list = {
>>     next = 0xffff8800afe690c0,
>>     prev = 0xffff8800b96acac0
>>   },
>
> is exactly the same as the retry_list in the r1conf.
>
> NeilBrown

Oh, thanks for the explanation. I had noticed this coincidence and was
curious why. I still have the hung-task machine alive, so I can check
whatever you think necessary.
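
In kernel terms the coincidence looks like this (illustration only,
not code from raid1.c):

struct r1bio *r1_bio;

/* entries are recovered with list_entry()/container_of(): */
list_for_each_entry(r1_bio, &conf->retry_list, retry_list) {
	/* every r1_bio visited here is a real r1bio */
}

/* Applying list_entry() to the head itself, i.e.
 * list_entry(&conf->retry_list, struct r1bio, retry_list),
 * yields a pointer into the middle of the r1conf, so fields like
 * ->sector are just r1conf bytes reinterpreted -- exactly the
 * "corrupted" values I saw. */

That is what crash's "list -H" did when I gave it the address of the
last entry instead of the address of the list head embedded in r1conf.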

Thanks again for debugging this with me.

-- 
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:       +49 30 577 008  042
Fax:      +49 30 577 008 299
Email:    jinpu.wang@xxxxxxxxxxxxxxxx
URL:      https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss