Re: [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode

Song Liu <songliubraving@xxxxxx> · Wed, 10 May 2017 21:45:05 +0000

> On May 10, 2017, at 10:01 AM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> 
> On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
>> For the raid456 with writeback cache, when journal device failed during
>> normal operation, it is still possible to persist all data, as all
>> pending data is still in stripe cache. However, it is necessary to handle
>> journal failure gracefully.
>> 
>> During journal failures, this patch makes the follow changes to land data
>> in cache to raid disks gracefully:
>> 
>> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>>   to make progress;
>> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
>> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
>>   in loprio_list
> 
> Applied the first patch. For this patch, I don't have a clear picture about
> what you are trying to do. Please describe the steps we are doing to do after
> journal failure.

I will add more description to the next version. 

> 
>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>> ---
>> drivers/md/raid5-cache.c | 13 ++++++++++---
>> drivers/md/raid5-log.h   |  3 ++-
>> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>> 3 files changed, 35 insertions(+), 10 deletions(-)
>> 
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index dc1dba6..e6032f6 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -24,6 +24,7 @@
>> #include "md.h"
>> #include "raid5.h"
>> #include "bitmap.h"
>> +#include "raid5-log.h"
>> 
>> /*
>>  * metadata/data stored in disk with 4k size unit (a block) regardless
>> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>> 		return;
>> 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>> 		mdname(mddev));
>> +	md_update_sb(mddev, 1);
> 
> Why this? And md_update_sb must be called within mddev->reconfig_mutex locked.

This is to avoid skipping in handle_stripe():

        if (s.handle_bad_blocks ||
            test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
                set_bit(STRIPE_HANDLE, &sh->state);
                goto finish;
        }

I haven't got a better idea than calling md_update_sb() somewhere. It is also tricky
to lock mddev->reconfigured_mutex here, due to potential deadlocking with 
mddev->open_mutex. 

Do you have suggestions on this?

>> 	mddev_suspend(mddev);
>> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> 	mddev_resume(mddev);
>> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> void r5l_quiesce(struct r5l_log *log, int state)
>> {
>> 	struct mddev *mddev;
>> +	struct r5conf *conf;
>> +
>> 	if (!log || state == 2)
>> 		return;
>> 	if (state == 0)
>> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>> 	else if (state == 1) {
>> 		/* make sure r5l_write_super_and_discard_space exits */
>> 		mddev = log->rdev->mddev;
>> +		conf = mddev->private;
>> 		wake_up(&mddev->sb_wait);
>> 		kthread_park(log->reclaim_thread->tsk);
>> 		r5l_wake_reclaim(log, MaxSector);
>> -		r5l_do_reclaim(log);
>> +		if (!r5l_log_disk_error(conf))
>> +			r5l_do_reclaim(log);
> 
> I think r5c_disable_writeback_async() will call into this, so we flush all
> stripe cache out to raid disks, why skip the reclaim?
> 

r5l_do_reclaim() reclaims log space with 2 steps:
1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread. 
2. update log_tail in the journal device, and issue discard to journal device. 

When we are handling log failures in r5c_disable_writeback_async(), we are 
flushing the cache, so it is not necessary to wake up mddev->thread. Also, 
with log device error, it is not necessary update log_tail or issue discard. 

Therefore, r5l_do_reclaim is not necessary in log disk errors. 

Thanks,
Song

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html