Re: [PATCH 0/4] RFC: attempt to remove md deadlocks with metadata without

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 10 2017, Xiao Ni wrote:

> On 10/09/2017 01:52 PM, NeilBrown wrote:
>> On Mon, Oct 09 2017, Xiao Ni wrote:
>>
>>> On 10/09/2017 12:57 PM, NeilBrown wrote:
>>>> It would if you had applied
>>>>      [PATCH 3/4] md: use mddev_suspend/resume instead of ->quiesce()
>>>>
>>>> Did you apply all 4 patches?
>>> Sorry, it's my mistake. I insmod the wrong module. I'll apply the four
>>> patches
>>> and do test again.
>>>> Thanks.  I looks suspend_lo_store() is calling raid5_quiesce() directly
>>>> as you say - so a patch is missing.
>>> Yes, thanks for pointing about this.
>
> Hi Neil
>
> I applied the four patches and one patch "md: fix deadlock error in 
> recent patch."
> There is a new stuck. It's stuck at suspend_hi_store this time. I add 
> the calltrace
> as an attachment.
>
> I added some printk to print some information.
>
> [12695.993329] mddev suspend : 1
> [12695.996270] mddev ro : 0
> [12695.998790] mddev insync : 0
> [12696.001641] mddev active io: 1

You didn't tell me where (in the code) you printed this information.
That makes it hard to interpret.

If mddev->active_io is 1, then some thread must be in this range
of code

	atomic_inc(&mddev->active_io);
	rcu_read_unlock();

	if (!mddev->pers->make_request(mddev, bio)) {
		atomic_dec(&mddev->active_io);
		wake_up(&mddev->sb_wait);
		goto check_suspended;
	}

	if (atomic_dec_and_test(&mddev->active_io) && mddev->suspended)
		wake_up(&mddev->sb_wait);

If that thread is blocked (which appears to be the case) it must be in
->make_request() because nothing else there blocks.
None of the threads you showed are in that code.
But you didn't report all the threads - only those which hard printed
warnings.

  echo t > /proc/sysrq-trigger

will produce the stack traces of *all* threads.  That would be more
useful.

>
> Can it be:
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index b6b7a28..55e9280 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -7777,7 +7777,7 @@ void md_check_recovery(struct mddev *mddev)
>          if (mddev->ro && !test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))
>                  return;
>          if ( ! (
> -               (mddev->flags & ~ (1<<MD_CHANGE_PENDING)) ||
> +               (mddev->flags & (mddev->external == 1 &&  ~ 
> (1<<MD_CHANGE_PENDING))) ||

Please read that code again and see how it doesn't make any sense at
all.

Thanks,
NeilBrown

Attachment: signature.asc
Description: PGP signature


[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux