> On May 10, 2017, at 3:51 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> 
> On Wed, May 10, 2017 at 09:45:05PM +0000, Song Liu wrote:
>> 
>>> On May 10, 2017, at 10:01 AM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>>> 
>>> On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
>>>> For raid456 with writeback cache, when the journal device fails during
>>>> normal operation, it is still possible to persist all data, as all
>>>> pending data is still in the stripe cache. However, it is necessary to
>>>> handle journal failure gracefully.
>>>> 
>>>> During journal failure, this patch makes the following changes to land
>>>> data in cache to raid disks gracefully:
>>>> 
>>>> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>>>>    to make progress;
>>>> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
>>>> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe gets
>>>>    stuck in loprio_list.
>>> 
>>> Applied the first patch. For this patch, I don't have a clear picture of
>>> what you are trying to do. Please describe the steps we are going to take
>>> after journal failure.
>> 
>> I will add more description in the next version.
>> 
>>> 
>>>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>>>> ---
>>>> drivers/md/raid5-cache.c | 13 ++++++++++---
>>>> drivers/md/raid5-log.h   |  3 ++-
>>>> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>>>> 3 files changed, 35 insertions(+), 10 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>>>> index dc1dba6..e6032f6 100644
>>>> --- a/drivers/md/raid5-cache.c
>>>> +++ b/drivers/md/raid5-cache.c
>>>> @@ -24,6 +24,7 @@
>>>>  #include "md.h"
>>>>  #include "raid5.h"
>>>>  #include "bitmap.h"
>>>> +#include "raid5-log.h"
>>>> 
>>>>  /*
>>>>   * metadata/data stored in disk with 4k size unit (a block) regardless
>>>> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>>>  		return;
>>>>  	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>>>>  		mdname(mddev));
>>>> +	md_update_sb(mddev, 1);
>>> 
>>> Why this? And md_update_sb must be called with mddev->reconfig_mutex locked.
>> 
>> This is to avoid skipping in handle_stripe():
>> 
>> 	if (s.handle_bad_blocks ||
>> 	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>> 		set_bit(STRIPE_HANDLE, &sh->state);
>> 		goto finish;
>> 	}
>> 
>> I haven't got a better idea than calling md_update_sb() somewhere. It is also
>> tricky to lock mddev->reconfig_mutex here, due to potential deadlock with
>> mddev->open_mutex.
>> 
>> Do you have suggestions on this?
> 
> This doesn't sound necessary then. The mddev->thread will call md_update_sb and
> clear the MD_SB_CHANGE_PENDING bit. After that, the stripes will be handled.
> This is running in a workqueue, so I'm wondering what kind of deadlock issue
> there is.

mddev->thread calls md_check_recovery() in raid5d(), and md_check_recovery()
calls md_update_sb(). However, we need to call md_update_sb() before
mddev_suspend() here. Otherwise, mddev_suspend() will increase mddev->suspended
and then md_check_recovery() will exit before calling md_update_sb().
The deadlock between reconfig_mutex and open_mutex is as follows:

[  211.753343] ======================================================
[  211.753823] [ INFO: possible circular locking dependency detected ]
[  211.754310] 4.11.0+ #88 Not tainted
[  211.754582] -------------------------------------------------------
[  211.755079] mdadm/1582 is trying to acquire lock:
[  211.755444]  ((&log->disable_writeback_work)){+.+...}, at: [<ffffffff8107d842>] flush_work+0x12/0x290
[  211.756160] 
[  211.756160] but task is already holding lock:
[  211.756609]  (&mddev->reconfig_mutex){+.+.+.}, at: [<ffffffffa0052654>] rdev_attr_store+0x64/0xd0 [md_mod]
[  211.757354] 
[  211.757354] which lock already depends on the new lock.
[  211.757354] 
[  211.757984] 
[  211.757984] the existing dependency chain (in reverse order) is:
[  211.758560] 
[  211.758560] -> #1 (&mddev->reconfig_mutex){+.+.+.}:
[  211.759062]        lock_acquire+0xc2/0x230
[  211.759391]        __mutex_lock+0x7b/0x9d0
[  211.759716]        mutex_lock_interruptible_nested+0x1b/0x20
[  211.760158]        r5c_disable_writeback_async+0x6b/0xa0 [raid456]
[  211.760634]        process_one_work+0x1d6/0x650
[  211.760986]        worker_thread+0x4d/0x380
[  211.761311]        kthread+0x117/0x150
[  211.761602]        ret_from_fork+0x2e/0x40
[  211.761924] 
[  211.761924] -> #0 ((&log->disable_writeback_work)){+.+...}:
[  211.762470]        __lock_acquire+0x1458/0x1770
[  211.762825]        lock_acquire+0xc2/0x230
[  211.763149]        flush_work+0x4a/0x290
[  211.763457]        r5l_exit_log+0x31/0x80 [raid456]
[  211.763840]        raid5_remove_disk+0x8a/0x240 [raid456]
[  211.764265]        remove_and_add_spares+0x175/0x390 [md_mod]
[  211.764714]        state_store+0xa3/0x4f0 [md_mod]
[  211.765090]        rdev_attr_store+0x89/0xd0 [md_mod]
[  211.765486]        sysfs_kf_write+0x4f/0x70
[  211.765815]        kernfs_fop_write+0x147/0x1d0
[  211.766170]        __vfs_write+0x28/0x120
[  211.766485]        vfs_write+0xd3/0x1d0
[  211.766789]        SyS_write+0x52/0xa0
[  211.767083]        do_syscall_64+0x6a/0x210
[  211.767408]        return_from_SYSCALL_64+0x0/0x7a
[  211.767782] 
[  211.767782] other info that might help us debug this:
[  211.767782] 
[  211.768399]  Possible unsafe locking scenario:
[  211.768399] 
[  211.768861]        CPU0                    CPU1
[  211.769214]        ----                    ----
[  211.769567]   lock(&mddev->reconfig_mutex);
[  211.769897]                                lock((&log->disable_writeback_work));
[  211.770471]                                lock(&mddev->reconfig_mutex);
[  211.770993]   lock((&log->disable_writeback_work));
> 
>> 
>>>> 	mddev_suspend(mddev);
>>>> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>>>> 	mddev_resume(mddev);
>>>> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>>>  void r5l_quiesce(struct r5l_log *log, int state)
>>>>  {
>>>>  	struct mddev *mddev;
>>>> +	struct r5conf *conf;
>>>> +
>>>>  	if (!log || state == 2)
>>>>  		return;
>>>>  	if (state == 0)
>>>> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>>>>  	else if (state == 1) {
>>>>  		/* make sure r5l_write_super_and_discard_space exits */
>>>>  		mddev = log->rdev->mddev;
>>>> +		conf = mddev->private;
>>>>  		wake_up(&mddev->sb_wait);
>>>>  		kthread_park(log->reclaim_thread->tsk);
>>>>  		r5l_wake_reclaim(log, MaxSector);
>>>> -		r5l_do_reclaim(log);
>>>> +		if (!r5l_log_disk_error(conf))
>>>> +			r5l_do_reclaim(log);
>>> 
>>> I think r5c_disable_writeback_async() will call into this, so we flush all
>>> stripe cache out to raid disks, why skip the reclaim?
>>> 
>> 
>> r5l_do_reclaim() reclaims log space in 2 steps:
>> 1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread;
>> 2. update log_tail in the journal device, and issue discard to the journal
>>    device.
>> 
>> When we are handling log failures in r5c_disable_writeback_async(), we are
>> flushing the cache, so it is not necessary to wake up mddev->thread. Also,
>> with a log device error, it is not necessary to update log_tail or issue
>> discard.
>> 
>> Therefore, r5l_do_reclaim() is not necessary on log disk errors.
> 
> Ok, there is no side effect, right? Let's call it anyway.

Let me double check for potential side effects. If it is safe, I will remove
the check.
Thanks,
Song
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html