RE: BUG in "RAID5: batch adjacent full stripe write" - commit 1b8035b0a84a69b3148bbfa1df1350501d0b0869

Apologies for resending multiple times; I'm not sure why the first email didn't reach the mailing list while the second one did. Resending with the full context so everything stays in one message.

Hi Neil/Shaohua,

With the following commit, the kernel sometimes crashes on an assert in ops_run_biodrain():
http://git.neil.brown.name/?p=md.git;a=commit;h=1b8035b0a84a69b3148bbfa1df1350501d0b0869

The assert which causes the crash is:
BUG_ON(dev->written);

Here's the call trace for the crash:
<4>[53646.100012]  [<ffffffffa09e1f80>] __raid_run_ops+0x330/0x470 [raid456]
<4>[53646.100024]  [<ffffffffa09e6334>] handle_stripe+0x654/0xda0 [raid456]
<4>[53646.100037]  [<ffffffffa09e6b10>] handle_active_stripes+0x90/0x150 [raid456]
<4>[53646.100049]  [<ffffffffa09e6c58>] raid5_do_work+0x88/0x120 [raid456]
<4>[53646.100061]  [<ffffffff81082812>] kthread_worker_fn+0xd2/0x190
<4>[53646.100069]  [<ffffffff81082966>] kthread+0x96/0xa0
<4>[53646.100076]  [<ffffffff81469ee4>] kernel_thread_helper+0x4/0x10
<0>[53646.100082] Code: 89 c0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 be 97 05 00 00 48 c7 c7 08 b8 9e a0 4c 89 04 24 e8 08 fb 67 e0 4c 8b 04 24 e9 ba fe ff ff <0f> 0b eb fe 48 8b 4f 38 48 c7 c2 a0 b5 9e a0 48 c7 c6 57 c8 9e
<1>[53646.100118] RIP  [<ffffffffa09e0c51>] ops_run_biodrain+0x261/0x2a0 [raid456]
<4>[53646.100125]  RSP <ffff8803dd39fb90>

Reproduction steps:
1. Create a 3-drive RAID5 with chunk=32K.

2. Run 4K sequential writes using fio (example commands below).
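
For reference, a minimal reproduction along these lines (the device names and fio parameters here are illustrative, not the exact ones from my setup):

        mdadm --create /dev/md0 --level=5 --raid-devices=3 --chunk=32 \
              /dev/sdb /dev/sdc /dev/sdd
        fio --name=seqwrite --filename=/dev/md0 --rw=write --bs=4k \
            --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4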

Root cause:
1. Typical code flow when an IO is submitted:
make_request() -> add_stripe_bio() -> stripe_add_to_batch_list()

This puts the stripe on the handle_list OR the batch_list.
All non-batched stripes are added to the handle_list.
If adjacent stripes can be batched together, the first stripe in the batch is added to the handle_list, and the remaining stripes are added to the batch_list with their batch_head pointing at that first stripe.
As a result, handle_active_stripes() has fewer stripes to process.
Note that only full stripe writes are batched (see the sketch below).
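
This is my own simplified sketch of the batching decision, not the actual stripe_add_to_batch_list() code; full_stripe_write() is a hypothetical stand-in for the real full-stripe-write test, and all locking is omitted:

static void sketch_add_to_batch_list(struct stripe_head *sh,
                                     struct stripe_head *prev)
{
        /*
         * Simplified sketch: if the adjacent previous stripe is also a
         * full stripe write and is still STRIPE_BATCH_READY, link this
         * stripe onto that batch instead of putting it on handle_list.
         */
        if (prev && full_stripe_write(prev) &&          /* hypothetical helper */
            test_bit(STRIPE_BATCH_READY, &prev->state)) {
                struct stripe_head *head = prev->batch_head ?: prev;

                sh->batch_head = head;                  /* first stripe in the batch */
                list_add_tail(&sh->batch_list, &head->batch_list);
                /* only the batch head goes onto handle_list */
        }
}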

2. Stripe processing path:
raid5d()/raid5_do_work() -> handle_active_stripes() -> handle_stripe()

In handle_active_stripes(), stripes are pulled from the handle_list and added to a local array (struct stripe_head *batch[MAX_STRIPE_BATCH]) before handle_stripe() is called on each of them.

Ideally, we would not expect non-head stripes on the batch_list to be processed by handle_stripe(). However, in the following scenario handle_stripe() does operate on a non-head batched stripe.
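
For reference, the loop that fills the local array currently looks like this (quoting from memory of handle_active_stripes() in drivers/md/raid5.c, so treat as approximate; the surrounding device_lock handling is omitted):

        while (batch_size < MAX_STRIPE_BATCH &&
                        (sh = __get_priority_stripe(conf, group)) != NULL)
                batch[batch_size++] = sh;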

Let's assume there are two adjacent stripes s1 and s2 on a 3-drive RAID5.
The IO arrives in the following sequence:
1. make_request() for s1:Disk1 => add s1 to handle_list, batch_head=NULL
2. make_request() for s2:Disk1 => add s2 to handle_list, batch_head=NULL
3. make_request() for s1:Disk2 => add s1 to handle_list, batch_head=NULL
4. make_request() for s2:Disk2 => s1 and s2 can now be batched (both are full stripe writes), so they are added to the batch_list with batch_head=s1 set for both stripes; only the head stripe, s1, is added to the handle_list

If both stripes are processed (in handle_active_stripes()) only after Step 4, there is no issue.

However, if handle_active_stripes() is called after Step 2, it removes stripe s2 from the handle_list and adds it to the local array; it later calls handle_stripe() for all stripes in that array.
In handle_stripe(), the STRIPE_BATCH_READY flag is cleared, which prevents future IOs from being batched with the active stripe.
However, note that there is a tiny window between handle_active_stripes() pulling the stripe off the handle_list and handle_stripe() clearing STRIPE_BATCH_READY.
It's in this window that stripe s2 is added to the batch_list (with batch_head=s1) while, at the same time, it is going through the stripe processing state machine.
When stripe s1 is later processed, stripe s2 is incorrectly processed a second time in ops_run_biodrain() (because s2 is on s1's batch_list), which triggers the kernel assert.
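
To make the race explicit, here is one interleaving that I believe can trigger the BUG_ON (raid5 workers can run handle_stripe() on different stripes concurrently; this timeline is my reconstruction, not a captured trace):

T1. Worker A: handle_active_stripes() pulls s2 off the handle_list into its local batch[] (after Step 2 above).
T2. make_request(): s1:Disk2 and s2:Disk2 arrive (Steps 3-4); s2 is batched under s1 (s2->batch_head = s1) and s1 goes onto the handle_list.
T3. Worker B: handle_stripe(s1) -> clear_batch_ready(s1) walks s1's batch_list and clears STRIPE_BATCH_READY on s2 as well.
T4. Worker A: handle_stripe(s2) -> clear_batch_ready(s2) sees the bit already clear and lets s2 proceed; ops_run_biodrain(s2) sets dev->written.
T5. Worker B: ops_run_biodrain(s1) walks s1's batch_list, reaches s2, and hits BUG_ON(dev->written).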

From the code, it appears that the following check in clear_batch_ready(), called from handle_stripe(), should handle the above case:
static int clear_batch_ready(struct stripe_head *sh)
{
        ...
        /*
         * this stripe could be added to a batch list before we check
         * BATCH_READY, skips it
         */
        if (sh->batch_head != sh) {
                spin_unlock(&sh->stripe_lock);
                return 1;
        }
        ...
}

However, by the time clear_batch_ready() is called on stripe s2, the STRIPE_BATCH_READY bit has already been cleared by stripe s1's invocation of clear_batch_ready(), via the following code:
static int clear_batch_ready(struct stripe_head *sh)
{
        ...
        list_for_each_entry(tmp, &sh->batch_list, batch_list)
                clear_bit(STRIPE_BATCH_READY, &tmp->state);
        ...
}

Hence, when clear_batch_ready() is called for stripe s2, it returns at the first "if" check, which tests STRIPE_BATCH_READY, before ever reaching the (sh->batch_head != sh) check. As a result, handle_stripe() incorrectly continues processing stripe s2.
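
For reference, the opening of clear_batch_ready() that I'm referring to looks like this (again quoting from memory, so treat as approximate):

static int clear_batch_ready(struct stripe_head *sh)
{
        /*
         * If STRIPE_BATCH_READY was already cleared (e.g. by the batch
         * head's walk over its batch_list), report "not batched" so
         * that handle_stripe() proceeds -- without ever reaching the
         * (sh->batch_head != sh) check further down.
         */
        if (!test_and_clear_bit(STRIPE_BATCH_READY, &sh->state))
                return 0;
        ...
}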

Potential fix:
I moved the clear_batch_ready() call from handle_stripe() to handle_active_stripes(), so it runs before a stripe is added to the local array:

static int handle_active_stripes(struct r5conf *conf, int group,
                                 struct r5worker *worker,
                                 struct list_head *temp_inactive_list)
{
        ...
        while (batch_size < MAX_STRIPE_BATCH &&
                        (sh = __get_priority_stripe(conf, group)) != NULL) {
                if (!clear_batch_ready(sh))
                        batch[batch_size++] = sh;
        }
        ...
}

This ensures that STRIPE_BATCH_READY is cleared before the stripe is added to the batch[] array in handle_active_stripes().
Since the flag is cleared before the stripe can be picked up for processing, new IO requests can no longer be batched with a stripe that handle_active_stripes() is already working on, which closes the window described above.

Please let me know if the above fix seems correct.

Thanks,
Sushma

