RE: BUG in "RAID5: batch adjacent full stripe write" - commit 1b8035b0a84a69b3148bbfa1df1350501d0b0869

Sorry, the call trace in the previous email is incorrect. It should have been this:

<0>[ 1483.429759] invalid opcode: 0000 [#1] SMP
<4>[ 1483.429761] CPU 1
<4>[ 1483.429762] Modules linked in: raid456(F) async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid0(F) raid1(F) joydev ip6t_REJECT ip6t_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_tcpudp xt_pkttype ipt_LOG xt_limit xt_state iptable_raw ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip6table_filter ip6_tables scst_vdisk(PFN) iscsi_scst(FN) scst(PFN) scst_mods(FN) dlm(FN) configfs sctp(F) mptctl mptbase ipmi_si ipmi_devintf ipmi_msghandler iptable_filter ip_tables x_tables af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave pcc_cpufreq mperf binfmt_misc iomemory_vsl(PFN) dm_mod tg3 ptp sr_mod hpwdt cdrom pps_core hpilo sg ipv6 ipv6_lib serio_raw button container acpi_power_meter rtc_cmos ttm drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect i2c_core syscopyarea btrfs zlib_deflate crc32c libcrc32c sd_mod crc_t10dif usbhid hid uhci_hcd ahci libahci libata ehci_hcd processor thermal_sys hwmon scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_dh scsi_mod asix usbnet usbcore usb_common mii [last unloaded: loop]
<4>[ 1483.429819] Supported: No, Proprietary and Unsupported modules are loaded
<4>[ 1483.429821]
<4>[ 1483.429823] Pid: 22628, comm: raid5auxd0_1 Tainted: PF          N  3.0.101-0.15.1.6651.0.PTF-default #1 HP ProLiant ML310e Gen8
<4>[ 1483.429828] RIP: 0010:[<ffffffffa0921c51>]  [<ffffffffa0921c51>] ops_run_biodrain+0x261/0x2a0 [raid456]
<4>[ 1483.429834] RSP: 0018:ffff88014f6ffb90  EFLAGS: 00010086
<4>[ 1483.429836] RAX: 00000000000000c0 RBX: 0000000000000000 RCX: ffff8801482bf188
<4>[ 1483.429838] RDX: 0000000000000025 RSI: 00000000000001e0 RDI: ffff8801482bf130
<4>[ 1483.429840] RBP: ffff8801482bf130 R08: 0000000000000000 R09: 0000000000000002
<4>[ 1483.429842] R10: ffff8803efc55000 R11: 0000000000000002 R12: ffff8801482bf0c8
<4>[ 1483.429844] R13: ffff8803efc556a0 R14: ffff8801482bf0c8 R15: ffff8801482bf110
<4>[ 1483.429846] FS:  0000000000000000(0000) GS:ffff88040ae20000(0000) knlGS:0000000000000000
<4>[ 1483.429848] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 1483.429850] CR2: 00007f66e5c0d000 CR3: 00000003bf3f5000 CR4: 00000000001407e0
<4>[ 1483.429852] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 1483.429854] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 1483.429857] Process raid5auxd0_1 (pid: 22628, threadinfo ffff88014f6fe000, task ffff88034a1fc4c0)
<0>[ 1483.429859] Stack:
<4>[ 1483.429860]  0000000000000000 00000000000001e0 0000000000000210 00000000000000c0
<4>[ 1483.429863]  ffffffffa0921240 ffff8801483bf980 ffff8801482bf0c8 ffff8801483bfb68
<4>[ 1483.429866]  000000014090de00 0000000000000000 ffff8801482bf0c8 ffff8801482bf0c8
<0>[ 1483.429869] Call Trace:
<4>[ 1483.429888]  [<ffffffffa0922f80>] __raid_run_ops+0x330/0x470 [raid456]
<4>[ 1483.429894]  [<ffffffffa0927364>] handle_stripe+0x654/0xda0 [raid456]
<4>[ 1483.429902]  [<ffffffffa0927b40>] handle_active_stripes+0x90/0x150 [raid456]
<4>[ 1483.429909]  [<ffffffffa0927c88>] raid5_do_work+0x88/0x120 [raid456]
<4>[ 1483.429916]  [<ffffffff81082812>] kthread_worker_fn+0xd2/0x190
<4>[ 1483.429920]  [<ffffffff81082966>] kthread+0x96/0xa0
<4>[ 1483.429924]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10
<0>[ 1483.429927] Code: 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 be 97 05 00 00 48 c7 c7 08 c8 92 a0 4c 89 04 24 e8 08 eb 73 e0 4c 8b 04 24 e9 ba fe ff ff <0f> 0b eb fe 48 8b 4f 38 48 c7 c2 a0 c5 92 a0 48 c7 c6 87 d8 92
<1>[ 1483.429945] RIP  [<ffffffffa0921c51>] ops_run_biodrain+0x261/0x2a0 [raid456]
<4>[ 1483.429949]  RSP <ffff88014f6ffb90>

Thanks,
Sushma

-----Original Message-----
From: Sushma Gurram
Sent: Tuesday, January 20, 2015 3:35 PM
To: linux-raid@xxxxxxxxxxxxxxx
Subject: BUG in "RAID5: batch adjacent full stripe write" - commit 1b8035b0a84a69b3148bbfa1df1350501d0b0869

Hi Neil/Shaohua,

The following commit sometimes triggers a crash due to an assert in ops_run_biodrain():
http://git.neil.brown.name/?p=md.git;a=commit;h=1b8035b0a84a69b3148bbfa1df1350501d0b0869

The assert which causes the crash is:
BUG_ON(dev->written);

Here's the call trace for the crash.
<0>[  864.952605] Call Trace:
<4>[  864.952616]  [<ffffffff81263919>] flex_array_free+0x9/0x20
<4>[  864.952622]  [<ffffffffa09c5de6>] raid5_free_percpu+0x56/0xc0 [raid456]
<4>[  864.952644]  [<ffffffffa09c5eba>] free_conf+0x6a/0x90 [raid456]
<4>[  864.952651]  [<ffffffffa09c75ff>] stop+0x3f/0x70 [raid456]
<4>[  864.952661]  [<ffffffff8136746f>] md_stop+0x2f/0x80
<4>[  864.952667]  [<ffffffff8136efa6>] do_md_stop+0x86/0x470
<4>[  864.952672]  [<ffffffff81371f26>] md_ioctl+0xba6/0xc90
<4>[  864.952678]  [<ffffffff8122f390>] blkdev_ioctl+0x2a0/0x710
<4>[  864.952686]  [<ffffffff8118dbf5>] block_ioctl+0x35/0x40
<4>[  864.952694]  [<ffffffff8116c01b>] do_vfs_ioctl+0x8b/0x3b0
<4>[  864.952698]  [<ffffffff8116c3e1>] sys_ioctl+0xa1/0xb0
<4>[  864.952705]  [<ffffffff81468d92>] system_call_fastpath+0x16/0x1b
<4>[  864.952712]  [<00007fbef8190fe7>] 0x7fbef8190fe6

Reproduction steps:
1. Create a 3-drive RAID5 with chunk=32K
2. Run 4K sequential writes using fio
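
For reference, a command sequence along these lines should reproduce it (the device names and the exact fio options below are only illustrative, not necessarily the ones used here):

  # 3-drive RAID5 with a 32K chunk (device names are placeholders)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=32 /dev/sdb /dev/sdc /dev/sdd

  # 4K sequential writes
  fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --size=1G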

Root cause:
1. Typical code flow when an IO is submitted:
make_request() -> add_stripe_bio() -> stripe_add_to_batch_list()

This puts the stripe on either the handle_list or the batch_list.
All non-batched stripes are added to the handle_list.
If adjacent stripes can be batched together, only the first stripe in the batch is added to the handle_list; the remaining stripes are added to the batch_list, with batch_head pointing to the first stripe in the batch.
As a result, there will be fewer stripes processed by handle_active_stripes().
Note that only full stripe writes are batched.
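
As an illustration of the data-structure effect described above (this is only a sketch, not the actual stripe_add_to_batch_list() code, and the helper name is made up):

        /* Illustrative sketch only -- not the real raid5.c code.
         * is_adjacent_full_stripe_write() is a hypothetical check;
         * batch_head, batch_list and conf->handle_list are the real
         * fields/lists referred to above.
         */
        if (is_adjacent_full_stripe_write(head, sh)) {
                head->batch_head = head;        /* batch_head = first stripe in the batch */
                sh->batch_head = head;
                list_add_tail(&sh->batch_list, &head->batch_list);
                /* only the head stripe goes on conf->handle_list */
        } else {
                sh->batch_head = NULL;
                list_add_tail(&sh->lru, &conf->handle_list);    /* normal, non-batched path */
        }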

2. Stripe processing path:
raid5d()/raid5_do_work() -> handle_active_stripes() -> handle_stripe()

In handle_active_stripes(), stripes are pulled from the handle_list and added to a local array (struct stripe_head *batch[MAX_STRIPE_BATCH]) before handle_stripe() is called on each of them.
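
For reference, the loop that does this looks roughly like the following (paraphrased from raid5.c, so the exact form may differ slightly):

        while (batch_size < MAX_STRIPE_BATCH &&
                        (sh = __get_priority_stripe(conf, group)) != NULL)
                batch[batch_size++] = sh;
        ...
        for (i = 0; i < batch_size; i++)
                handle_stripe(batch[i]);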

Ideally, non-head stripes on the batch_list would never be processed by handle_stripe(). However, in the following scenario, handle_stripe() does operate on a non-head batched stripe.

Let’s assume there are two adjacent stripes s1 and s2 on a 3 drive RAID5.
Following is the sequence in which IO happens:
1. make_request() for s1:Disk1 => add s1 to handle_list, batch_head=NULL
2. make_request() for s2:Disk1 => add s2 to handle_list, batch_head=NULL
3. make_request() for s1:Disk2 => add s1 to handle_list, batch_head=NULL
4. make_request() for s2:Disk2 => s1 and s2 can now be batched (both are full stripe writes), so s2 is added to the batch_list. Set batch_head=s1 for both stripes. Add only the head stripe, i.e. s1, to the handle_list.

If both stripes are processed (in handle_active_stripes()) only after Step 4, there is no problem.

However, if handle_active_stripes() runs after Step 2, it removes stripe s2 from the handle_list and adds it to the local array, and later calls handle_stripe() for all stripes in that array.
In handle_stripe(), the STRIPE_BATCH_READY flag is cleared, which prevents future IOs from being batched with the active stripe.
However, note that there is a tiny window between handle_active_stripes() pulling the stripe off the handle_list and handle_stripe() clearing STRIPE_BATCH_READY.
It is in this window that stripe s2 is added to the batch_list (with batch_head=s1) while, at the same time, it goes through the stripe processing state machine.
During stripe s1's processing, stripe s2 is then incorrectly processed again (because s2 is on s1's batch_list), which triggers the BUG_ON() in ops_run_biodrain().
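
To make the failure path concrete, here is a paraphrased sketch (not verbatim) of the per-device drain in ops_run_biodrain(): it drains the head stripe's towrite into written and then, because head_sh->batch_head is set, repeats the same drain for every stripe on the head's batch_list, including s2, whose dev->written was already set when s2 was processed on its own:

        sh = head_sh;                           /* start with the batch head (s1) */
again:
        dev = &sh->dev[i];
        spin_lock_irq(&sh->stripe_lock);
        chosen = dev->towrite;
        dev->towrite = NULL;
        BUG_ON(dev->written);                   /* fires for s2: written is already set */
        dev->written = chosen;
        spin_unlock_irq(&sh->stripe_lock);
        ...
        if (head_sh->batch_head) {              /* walk the rest of the batch */
                sh = list_first_entry(&sh->batch_list, struct stripe_head,
                                      batch_list);
                if (sh != head_sh)
                        goto again;
        }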

From the code, it appears that the following check in clear_batch_ready(), called from handle_stripe(), should handle the above case:
static int clear_batch_ready(struct stripe_head *sh)
{
        ...
        /*
         * this stripe could be added to a batch list before we check
         * BATCH_READY, skips it
         */
        if (sh->batch_head != sh) {
                spin_unlock(&sh->stripe_lock);
                return 1;
        }
        ...
}

However, by the time clear_batch_ready() is called on stripe s2, the STRIPE_BATCH_READY bit has already been cleared by stripe s1's invocation of clear_batch_ready(), via the following code:
static int clear_batch_ready(struct stripe_head *sh)
{
        ...
        list_for_each_entry(tmp, &sh->batch_list, batch_list)
                clear_bit(STRIPE_BATCH_READY, &tmp->state);
        ...
}

Hence, when clear_batch_ready() is called for stripe s2, it returns from the first "if" check, which tests STRIPE_BATCH_READY, before ever reaching the (sh->batch_head != sh) check. As a result, handle_stripe() incorrectly continues with stripe s2.
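
For completeness, the first "if" check mentioned above is (paraphrased) the test-and-clear of STRIPE_BATCH_READY at the top of clear_batch_ready(); because s1's earlier call already cleared the bit on s2, this early return is taken and the (sh->batch_head != sh) check below it is never reached:

static int clear_batch_ready(struct stripe_head *sh)
{
        /* paraphrased: the bit is already clear for s2, so we return 0 here
         * and handle_stripe() keeps going, never reaching the
         * (sh->batch_head != sh) check further down */
        if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
                return 0;
        ...
}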

Potential fix:
I moved the clear_batch_ready() call from handle_stripe() to handle_active_stripes(), before the stripe is added to the local array, as follows:

static int handle_active_stripes(struct r5conf *conf, int group,
                                 struct r5worker *worker,
                                 struct list_head *temp_inactive_list)
{
        ...
        while (batch_size < MAX_STRIPE_BATCH &&
                        (sh = __get_priority_stripe(conf, group)) != NULL) {
                if (!clear_batch_ready(sh))
                        batch[batch_size++] = sh;
        }
        ...
}

This ensures that STRIPE_BATCH_READY is cleared before the stripe is added to the batch[] array in handle_active_stripes().
This prevents new IO requests from being batched with a stripe that handle_active_stripes() is already processing.

Please let me know if the above fix seems correct.

Thanks,
Sushma

