> On Aug 21, 2017, at 10:46 AM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>
> <snip>
>>> The new dump info does reveal some infos. Not sure if it's the issue you found,
>>> but I did find a race condition. Please try below patch and report back:
>>>
>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>>> index ed5cd705b985..35637fe34820 100644
>>> --- a/drivers/md/raid5.c
>>> +++ b/drivers/md/raid5.c
>>> @@ -806,13 +806,19 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
>>>  }
>>>
>>>  /*
>>> + * We must assign batch_head of this stripe within the
>>> + * batch_lock, otherwise clear_batch_ready of batch head
>>> + * stripe could clear BATCH_READY bit of this stripe and this
>>> + * stripe->batch_head doesn't get assigned, which could
>>> + * confuse clear_batch_ready for this stripe
>>> + */
>>> + sh->batch_head = head->batch_head;
>>> + /*
>>>  * at this point, head's BATCH_READY could be cleared, but we
>>>  * can still add the stripe to batch list
>>>  */
>>>  list_add(&sh->batch_list, &head->batch_list);
>>>  spin_unlock(&head->batch_head->batch_lock);
>>> -
>>> - sh->batch_head = head->batch_head;
>>>  } else {
>>>  head->batch_head = head;
>>>  sh->batch_head = head->batch_head;
>>
>> Awesome! I will apply your patch today on two of my Lustre servers and
>> report back if I see another occurrence, or after some time if it
>> doesn’t show up. We’ll need to wait for at least a couple weeks to be
>> sure this does actually fix the issue I’m seeing.
>
> Thanks!
> Cc: masterprenium too, who reported the issue before. please check if the patch fix the issue.

Shaohua, we have now been running with your patch for 15 days without
any issue on two Lustre servers that were never idle with mixed workload
and checks running from time to time. Looking at the previous failures,
it is very likely that this patch does actually fix our issue! I’ll
update if needed. Thanks again.
Best,
Stephane