Thanks Neil for the valuable input, please see my comments inline.

On Thu, Dec 8, 2016 at 4:17 AM, NeilBrown <neilb@xxxxxxxx> wrote:
> On Thu, Dec 08 2016, Jinpu Wang wrote:
>
>> On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang
>> <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
>>> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@xxxxxxx> wrote:
>>>> On 2016/11/28 5:02 PM, Jinpu Wang wrote:
>>>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@xxxxxxx> wrote:
>>>>>> On 2016/11/28 4:24 PM, Jinpu Wang wrote:
>>>>>>> snip
>>>>>>>>>>
>>>>>>>>>> every time nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>>>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>>>
>>>>>>>>>> I've noticed commit ccfc7bf1f09d61 ("raid1: include bio_end_io_list in
>>>>>>>>>> nr_queued to prevent freeze_array hang"). It seems to have fixed a
>>>>>>>>>> similar bug.
>>>>>>>>>>
>>>>>>>>>> Could you give your suggestion?
>>>>>>>>>>
>>>>>>>>> Sorry, forgot to mention the kernel version is 4.4.28
>
>> I continued debugging the bug:
>>
>> 20161207
>>
>> nr_pending = 948,
>> nr_waiting = 9,
>> nr_queued = 946,   // again we need one more to finish the wait_event
>> barrier = 0,
>> array_frozen = 1,
>>
>> on conf->bio_end_io_list we have 91 entries.
>>
>> on conf->retry_list we have 855
>
> This is useful.  It confirms that nr_queued is correct, and that
> nr_pending is consistently 1 higher than expected.
> This suggests that a request has been counted in nr_pending, but hasn't
> yet been submitted, or has been taken off one of the queues but has not
> yet been processed.
>
> I notice that in your first email the blocked tasks listed included
> raid1d, which is blocked in freeze_array(), and a few others in
> make_request() blocked on wait_barrier().
> In that case nr_waiting was 100, so there should have been 100 threads
> blocked in wait_barrier().  Is that correct?  I assume you thought it
> was pointless to list them all, which seems reasonable.

This is correct. In my tests I initially set numjobs to 100 in fio; later I
reduced that to 10 and I can still trigger the bug.

> I asked because I wonder if there might have been one thread in
> make_request() which was blocked on something else.  There are a couple
> of places where make_request() will wait after having successfully called
> wait_barrier().  If that happened, it would cause exactly the symptoms
> you report.  Could you check all blocked threads carefully please?

Every time it is the same hung task: raid1d is blocked in freeze_array:

PID: 5002   TASK: ffff8800ad430d00   CPU: 0   COMMAND: "md1_raid1"
 #0 [ffff8800ad44bbf8] __schedule at ffffffff81811483
 #1 [ffff8800ad44bc50] schedule at ffffffff81811c60
 #2 [ffff8800ad44bc68] freeze_array at ffffffffa085a1e1 [raid1]
 #3 [ffff8800ad44bcc0] handle_read_error at ffffffffa085bfe9 [raid1]
 #4 [ffff8800ad44bd68] raid1d at ffffffffa085d29b [raid1]
 #5 [ffff8800ad44be60] md_thread at ffffffffa0549040 [md_mod]

and 9 fio threads are blocked in wait_barrier:

PID: 5172   TASK: ffff88022bf4a700   CPU: 3   COMMAND: "fio"
 #0 [ffff88022186b6d8] __schedule at ffffffff81811483
 #1 [ffff88022186b730] schedule at ffffffff81811c60
 #2 [ffff88022186b748] wait_barrier at ffffffffa085a0b6 [raid1]
 #3 [ffff88022186b7b0] make_request at ffffffffa085c3cd [raid1]
 #4 [ffff88022186b890] md_make_request at ffffffffa0549903 [md_mod]
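Just to restate what I think raid1d is stuck on: handle_read_error() calls
freeze_array(), which waits until nr_pending catches up with nr_queued plus the
one r1bio it is holding itself, while array_frozen keeps everybody else parked
in wait_barrier(). Below is a tiny userspace model of that accounting (toy
struct and field names, not the kernel code) showing why a single request that
was counted in nr_pending but never completed nor queued leaves the condition
false forever:

/*
 * Toy model of the raid1 accounting that freeze_array() waits on.
 * NOT kernel code -- names only loosely mirror struct r1conf.
 */
#include <stdio.h>
#include <stdbool.h>

struct toy_conf {
	int nr_pending;   /* requests that got past wait_barrier() */
	int nr_queued;    /* requests parked on retry_list/bio_end_io_list */
	bool array_frozen;
};

/* The condition the frozen array keeps re-checking. */
static bool quiesced(const struct toy_conf *conf, int extra)
{
	return conf->nr_pending == conf->nr_queued + extra;
}

int main(void)
{
	/* Numbers from the crash dump: 948 pending, 946 queued. */
	struct toy_conf conf = { .nr_pending = 948, .nr_queued = 946,
				 .array_frozen = true };
	int extra = 1;   /* the read-error r1bio held by raid1d itself */

	/*
	 * Normally every pending request either completes (nr_pending--)
	 * or is handed to raid1d (nr_queued++), so the two sides converge.
	 * One request that does neither keeps the condition false forever.
	 */
	printf("quiesced: %s (nr_pending=%d, nr_queued+extra=%d)\n",
	       quiesced(&conf, extra) ? "yes" : "no",
	       conf.nr_pending, conf.nr_queued + extra);

	/* Queueing (or completing) the missing request would unblock raid1d. */
	conf.nr_queued++;
	printf("after accounting the missing request: %s\n",
	       quiesced(&conf, extra) ? "yes" : "no");
	return 0;
}

If that picture is right, the question is which path took the extra nr_pending
reference without ever queueing or completing the request, which matches your
suggestion above.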
The other fio thread is sleeping in futex_wait(); I think it's unrelated:

PID: 5171   TASK: ffff88022bf49a00   CPU: 2   COMMAND: "fio"
 #0 [ffff880221f97c08] __schedule at ffffffff81811483
 #1 [ffff880221f97c60] schedule at ffffffff81811c60
 #2 [ffff880221f97c78] futex_wait_queue_me at ffffffff810c8a42
 #3 [ffff880221f97cb0] futex_wait at ffffffff810c9c72

>
> There are other ways that nr_pending and nr_queued can get out of sync,
> though I think they would result in nr_pending being less than
> nr_queued, not more.
>
> If the presence of a bad block in the bad block log causes a request to
> be split into two r1bios, and if both of those end up on one of the
> queues, then they would be added to nr_queued twice, but to nr_pending
> only once.  We should fix that.

I checked, and there are some bad blocks md has detected; not sure if it helps:

root@ibnbd-clt1:~/jack# cat /sys/block/md1/md/dev-ibnbd0/bad_blocks
180630 512
181142 7
982011 8
1013386 255

I also checked each md_rdev:

crash> struct md_rdev 0xffff880230630000
struct md_rdev {
  same_set = { next = 0xffff88022db03818, prev = 0xffff880230653e00 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 125803676,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56080,
  sb_page = 0xffffea0008d3edc0,
  bb_page = 0xffffea0008b17400,
  sb_loaded = 1,
  sb_events = 72,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff88022c567510 "dev-loop1",
    entry = { next = 0xffff880230630080, prev = 0xffff880230630080 },
    parent = 0xffff88022db03850,
    kset = 0x0,
    ktype = 0xffffffffa055f200 <rdev_ktype>,
    sd = 0xffff88022c4ac708,
    kref = { refcount = { counter = 1 } },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 2,
  blocked_wait = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff8802306300c8, prev = 0xffff8802306300c8 }
  },
  desc_nr = 1,
  raid_disk = 1,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 1 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = {
    data = { counter = 0 },
    entry = { next = 0x0, prev = 0x0 },
    func = 0x0
  },
  sysfs_state = 0xffff88022c4ac780,
  badblocks = {
    count = 0,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff88022740d000,
    changed = 0,
    lock = {
      seqcount = { sequence = 60 },
      lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }
    },
    sector = 0,
    size = 0
  }
}

crash> struct md_rdev 0xffff880230653e00
struct md_rdev {
  same_set = { next = 0xffff880230630000, prev = 0xffff88022db03818 },
  sectors = 2095104,
  mddev = 0xffff88022db03800,
  last_events = 9186098,
  meta_bdev = 0x0,
  bdev = 0xffff880234a56700,
  sb_page = 0xffffea0007758f40,
  bb_page = 0xffffea000887b480,
  sb_loaded = 1,
  sb_events = 42,
  data_offset = 2048,
  new_data_offset = 2048,
  sb_start = 8,
  sb_size = 512,
  preferred_minor = 65535,
  kobj = {
    name = 0xffff880233c825b0 "dev-ibnbd0",
    entry = { next = 0xffff880230653e80, prev = 0xffff880230653e80 },
    parent = 0xffff88022db03850,
    kset = 0x0,
    ktype = 0xffffffffa055f200 <rdev_ktype>,
    sd = 0xffff8800b9fcce88,
    kref = { refcount = { counter = 1 } },
    state_initialized = 1,
    state_in_sysfs = 1,
    state_add_uevent_sent = 0,
    state_remove_uevent_sent = 0,
    uevent_suppress = 0
  },
  flags = 581,
  blocked_wait = {
    lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } },
    task_list = { next = 0xffff880230653ec8, prev = 0xffff880230653ec8 }
  },
  desc_nr = 0,
  raid_disk = 0,
  new_raid_disk = 0,
  saved_raid_disk = -1,
  { recovery_offset = 18446744073709551615, journal_tail = 18446744073709551615 },
  nr_pending = { counter = 856 },
  read_errors = { counter = 0 },
  last_read_error = { tv_sec = 0, tv_nsec = 0 },
  corrected_errors = { counter = 0 },
  del_work = {
    data = { counter = 0 },
    entry = { next = 0x0, prev = 0x0 },
    func = 0x0
  },
  sysfs_state = 0xffff8800b9fccf78,
  badblocks = {
    count = 4,
    unacked_exist = 0,
    shift = 0,
    page = 0xffff880227211000,
    changed = 0,
    lock = {
      seqcount = { sequence = 1624 },
      lock = { { rlock = { raw_lock = { val = { counter = 0 } } } } }
    },
    sector = 80,
    size = 8
  }
}

>
>>
>> list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
>>
>> ffff8800b9791ff8
>> struct r1bio {
>>   remaining = {
>>     counter = 0
>>   },
>>   behind_remaining = {
>>     counter = 0
>>   },
>>   sector = 18446612141670676480,            // corrupted?
>>   start_next_window = 18446612141565972992, // ditto
>
> I don't think this is corruption.
>
>> crash> struct r1conf 0xffff8800b9792000
>> struct r1conf {
> ....
>>   retry_list = {
>>     next = 0xffff8800afe690c0,
>>     prev = 0xffff8800b96acac0
>>   },
>
> The pointer you started at was at the end of the list.
> So this r1bio structure you are seeing is not an r1bio at all but the
> memory out of the middle of the r1conf, being interpreted as an r1bio.
> You can confirm this by noticing that retry_list in the r1bio:
>
>>   retry_list = {
>>     next = 0xffff8800afe690c0,
>>     prev = 0xffff8800b96acac0
>>   },
>
> is exactly the same as the retry_list in the r1conf.
>
> NeilBrown

Oh, thanks for the explanation. I noticed this coincidence and was curious why.
I still have the hung machine alive, so I can check whatever you think is
necessary. Thanks again for debugging this with me.
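P.S. To make sure I understood the list pitfall above: the head of
conf->retry_list is embedded in the r1conf itself, so when the walk comes all
the way around the circle, the last pointer printed is that head, and
interpreting the surrounding memory as an r1bio just shows r1conf fields at
r1bio offsets. A tiny userspace demo of the same mistake (invented struct
names, not the real kernel layouts):

/*
 * Toy demo: treating the head of a circular list as if it were an
 * entry yields "garbage" from whatever structure embeds the head.
 * Names are invented and do not match the kernel structs.
 */
#include <stdio.h>
#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_r1bio {
	unsigned long sector;
	struct list_node retry_list;
};

struct toy_r1conf {
	unsigned long some_field;      /* whatever happens to sit before the head */
	struct list_node retry_list;   /* the list HEAD lives inside the conf */
};

int main(void)
{
	struct toy_r1conf conf = { .some_field = 0xdeadbeef };
	struct toy_r1bio bio1 = { .sector = 1000 }, bio2 = { .sector = 2000 };

	/* head <-> bio1 <-> bio2 <-> back to head (circular list) */
	conf.retry_list.next = &bio1.retry_list;
	bio1.retry_list.prev = &conf.retry_list;
	bio1.retry_list.next = &bio2.retry_list;
	bio2.retry_list.prev = &bio1.retry_list;
	bio2.retry_list.next = &conf.retry_list;
	conf.retry_list.prev = &bio2.retry_list;

	/* Correct walk: stop when we are back at the head. */
	for (struct list_node *p = conf.retry_list.next;
	     p != &conf.retry_list; p = p->next)
		printf("real entry, sector = %lu\n",
		       container_of(p, struct toy_r1bio, retry_list)->sector);

	/* Going one step too far and reading the head as an entry shows
	 * conf memory instead -- a huge, "corrupted"-looking sector. */
	struct toy_r1bio *bogus =
		container_of(&conf.retry_list, struct toy_r1bio, retry_list);
	printf("head read as entry, bogus sector = %#lx\n", bogus->sector);
	return 0;
}

So the huge sector/start_next_window values are just r1conf memory read at
r1bio offsets, exactly as you said.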
--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel:   +49 30 577 008 042
Fax:   +49 30 577 008 299
Email: jinpu.wang@xxxxxxxxxxxxxxxx
URL:   https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss