On Tue, Nov 29, 2016 at 12:15 PM, Jinpu Wang <jinpu.wang@xxxxxxxxxxxxxxxx> wrote:
> On Mon, Nov 28, 2016 at 10:10 AM, Coly Li <colyli@xxxxxxx> wrote:
>> On 2016/11/28 5:02 PM, Jinpu Wang wrote:
>>> On Mon, Nov 28, 2016 at 9:54 AM, Coly Li <colyli@xxxxxxx> wrote:
>>>> On 2016/11/28 4:24 PM, Jinpu Wang wrote:
>>>>> snip
>>>>>>>>
>>>>>>>> every time nr_pending is 1 bigger than (nr_queued + 1), so it seems we
>>>>>>>> forgot to increase nr_queued somewhere?
>>>>>>>>
>>>>>>>> I've noticed (commit ccfc7bf1f09d61) raid1: include bio_end_io_list in
>>>>>>>> nr_queued to prevent freeze_array hang. It seems to have fixed a similar bug.
>>>>>>>>
>>>>>>>> Could you give your suggestion?
>>>>>>>>
>>>>>>> Sorry, forgot to mention kernel version is 4.4.28
>>>>>>
>>>>>> This commit is Cced to stable@xxxxxxxxxxxxxxx for v4.3+, do you use a
>>>>>> stable kernel or a distribution with 4.4.28 kernel?
>>>>>>
>>>>>> Coly
>>>>>>
>>>>>>
>>>>> Hi Coly,
>>>>>
>>>>> I'm using Debian8 with 4.4.28 kernel.
>>>>
>>>> Hi Jinpu,
>>>>
>>>> Is it possible for you to run an upstream kernel or vanilla kernel to
>>>> test whether the issue can still be reproduced? Then we can know
>>>> whether it is an upstream bug or a distro issue.
>>>>
>>>> Thanks.
>>>>
>>>> Coly
>>>
>>> Hi Coly,
>>>
>>> I ran kernel 4.4.34 (downloaded from kernel.org), and I can reproduce
>>> the same bug.
>>>
>>> I can also try latest 4.8 or 4.9 rc kernel, if you think it's necessary?
>>>
>> Yes, please. If it can be reproduced on upstream kernel by a set of
>> scripts, it will be very helpful to debug and fix this issue.
>>
>> Thanks in advance.
>>
>> Coly
>
> Hi Coly,
>
> I tried with kernel 4.9-rc7, and I can't reproduce it with my testcase anymore.
>
> It's hard to say whether the bug is fixed or just harder to reproduce,
> because the code changed a lot.
>
> --
> Jinpu Wang
> Linux Kernel Developer
>
> ProfitBricks GmbH
> Greifswalder Str. 207
> D - 10405 Berlin
>
> Tel: +49 30 577 008 042
> Fax: +49 30 577 008 299
> Email: jinpu.wang@xxxxxxxxxxxxxxxx
> URL: https://www.profitbricks.de
>
> Sitz der Gesellschaft: Berlin
> Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
> Geschäftsführer: Achim Weiss

Hi,

I continued debugging the bug:

20161207
crash> struct r1conf 0xffff8800b9792000
struct r1conf {
  mddev = 0xffff88022db03800,
  mirrors = 0xffff880227729200,
  raid_disks = 2,
  next_resync = 18446744073709527039,
  start_next_window = 18446744073709551615,
  current_window_requests = 0,
  next_window_requests = 0,
  device_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  retry_list = {
    next = 0xffff8800afe690c0,
    prev = 0xffff8800b96acac0
  },
  bio_end_io_list = {
    next = 0xffff8800b96ac2c0,
    prev = 0xffff88003735f140
  },
  pending_bio_list = {
    head = 0x0,
    tail = 0x0
  },
  pending_count = 0,
  wait_barrier = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffff880221bc7770,
      prev = 0xffff8800ad44bc88
    }
  },
  resync_lock = {
    {
      rlock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      }
    }
  },
  nr_pending = 948,
  nr_waiting = 9,
  nr_queued = 946, // again, we need one more for wait_event to finish.
  barrier = 0,
  array_frozen = 1,
  fullsync = 0,
  recovery_disabled = 2,
  poolinfo = 0xffff88022c567580,
  r1bio_pool = 0xffff88022fdccea0,
  r1buf_pool = 0x0,
  tmppage = 0xffffea0002bf1600,
  thread = 0x0,
  cluster_sync_low = 0,
  cluster_sync_high = 0
}
crash> exit

On conf->bio_end_io_list we have 91 entries.

crash> list -H 0xffff88003735f140
ffff8800b9792048
ffff8800b96ac2c0
ffff8800b96ac1c0
snip
ffff88022243dfc0
ffff88022243dc40
crash>

On conf->retry_list we have 855 entries.

crash> list -H 0xffff8800b96acac0
ffff8800b9792038
ffff8800afe690c0
snip

list -H 0xffff8800b96acac0 r1bio.retry_list -s r1bio
ffff8800b9791ff8
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 18446612141670676480, // corrupted?
  start_next_window = 18446612141565972992, // ditto
  sectors = 2,
  state = 18446744073709527039, // ditto
  mddev = 0xffffffffffffffff,
  master_bio = 0x0,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800afe690c0,
    prev = 0xffff8800b96acac0
  },
  behind_bvecs = 0xffff8800b96ac2c0,
  behind_page_count = 926282048,
  bios = 0xffff8800b9792058
}
ffff8800afe69080
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 1566,
  start_next_window = 0,
  sectors = 128,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800371c1f00,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800ad41a540,
    prev = 0xffff8800b9792038
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800afe690e0
}

Check conf->bio_end_io_list:

list -H 0xffff88003735f140 r1bio.retry_list -s r1bio
ffff8800b9792008
struct r1bio {
  remaining = {
    counter = 661819904
  },
  behind_remaining = {
    counter = -30718
  },
  sector = 2,
  start_next_window = 18446744073709527039, // corrupted?
  sectors = -1, // corrupted?
  state = 0,
  mddev = 0x0, // corrupted?
  master_bio = 0xffff8800afe690c0,
  read_disk = -1184183616, // ?
  retry_list = {
    next = 0xffff8800b96ac2c0,
    prev = 0xffff88003735f140
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b9792068
}
ffff8800b96ac280
struct r1bio {
  remaining = {
    counter = 0
  },
  behind_remaining = {
    counter = 0
  },
  sector = 980009,
  start_next_window = 0,
  sectors = 16,
  state = 257,
  mddev = 0xffff88022db03800,
  master_bio = 0xffff8800370b0600,
  read_disk = 0,
  retry_list = {
    next = 0xffff8800b96ac1c0,
    prev = 0xffff8800b9792048
  },
  behind_bvecs = 0x0,
  behind_page_count = 0,
  bios = 0xffff8800b96ac2e0
}

I still have no clue what it could be. Does anyone have an idea?

--
Jinpu Wang
Linux Kernel Developer

ProfitBricks GmbH
Greifswalder Str. 207
D - 10405 Berlin

Tel: +49 30 577 008 042
Fax: +49 30 577 008 299
Email: jinpu.wang@xxxxxxxxxxxxxxxx
URL: https://www.profitbricks.de

Sitz der Gesellschaft: Berlin
Registergericht: Amtsgericht Charlottenburg, HRB 125506 B
Geschäftsführer: Achim Weiss
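
A small sketch for cross-checking the numbers in the dump above. It uses only addresses and counters copied from this thread. Two things are assumptions rather than facts from the source: the wait condition is taken to be nr_pending == nr_queued + 1 (as described earlier in the thread), and the offset of r1bio.retry_list is inferred from one healthy entry in the dump instead of being read from the kernel headers. The helper names are made up for illustration.

```python
# All constants are copied from the crash output quoted in this thread.
CONF = 0xffff8800b9792000            # address of struct r1conf
NR_PENDING, NR_QUEUED = 948, 946     # conf->nr_pending, conf->nr_queued
RETRY_ENTRIES, END_IO_ENTRIES = 855, 91

# Assumption: offset of r1bio.retry_list, inferred from the healthy entry
# whose list node 0xffff8800afe690c0 belongs to r1bio 0xffff8800afe69080.
RETRY_LIST_OFFSET = 0xffff8800afe690c0 - 0xffff8800afe69080


def missing_for_freeze(nr_pending, nr_queued, extra=1):
    """Shortfall against the assumed condition nr_pending == nr_queued + extra."""
    return nr_pending - (nr_queued + extra)


def node_of(r1bio_addr):
    """Address of the list node embedded in an r1bio at r1bio_addr."""
    return r1bio_addr + RETRY_LIST_OFFSET


def offset_in_conf(node_addr):
    """Offset of a list node relative to the start of struct r1conf."""
    return node_addr - CONF


if __name__ == "__main__":
    print("entries on both lists:", RETRY_ENTRIES + END_IO_ENTRIES)
    print("shortfall:", missing_for_freeze(NR_PENDING, NR_QUEUED))
    # The two r1bio dumps that were marked "corrupted?" in the mail.
    for suspect in (0xffff8800b9791ff8, 0xffff8800b9792008):
        node = node_of(suspect)
        print(hex(suspect), "-> node", hex(node),
              "= conf +", hex(offset_in_conf(node)))
```

If the inferred offset is right, the list nodes of both suspicious r1bio dumps land a few dozen bytes into the r1conf itself, at the same addresses that start the two list outputs (ffff8800b9792038 and ffff8800b9792048). That would suggest those two dumps are the list heads in r1conf printed through the r1bio layout, not damaged r1bio memory. The two list lengths also add up to nr_queued, so the single missing request may not be sitting on either list.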