> On May 10, 2017, at 3:51 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> 
> On Wed, May 10, 2017 at 09:45:05PM +0000, Song Liu wrote:
>> 
>>> On May 10, 2017, at 10:01 AM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>>> 
>>> On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
>>>> For raid456 with writeback cache, when the journal device fails during
>>>> normal operation, it is still possible to persist all data, as all
>>>> pending data is still in the stripe cache. However, it is necessary to
>>>> handle journal failure gracefully.
>>>> 
>>>> During journal failure, this patch makes the following changes to land
>>>> data in cache to raid disks gracefully:
>>>> 
>>>> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>>>>    to make progress;
>>>> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
>>>> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe gets
>>>>    stuck in loprio_list.
>>> 
>>> Applied the first patch. For this patch, I don't have a clear picture of
>>> what you are trying to do. Please describe the steps we are going to take
>>> after journal failure.
>> 
>> I will add more description in the next version.
>> 
>>> 
>>>> Signed-off-by: Song Liu <songliubraving@xxxxxx>
>>>> ---
>>>> drivers/md/raid5-cache.c | 13 ++++++++++---
>>>> drivers/md/raid5-log.h   |  3 ++-
>>>> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>>>> 3 files changed, 35 insertions(+), 10 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>>>> index dc1dba6..e6032f6 100644
>>>> --- a/drivers/md/raid5-cache.c
>>>> +++ b/drivers/md/raid5-cache.c
>>>> @@ -24,6 +24,7 @@
>>>>  #include "md.h"
>>>>  #include "raid5.h"
>>>>  #include "bitmap.h"
>>>> +#include "raid5-log.h"
>>>> 
>>>>  /*
>>>>   * metadata/data stored in disk with 4k size unit (a block) regardless
>>>> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>>>  		return;
>>>>  	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>>>>  		mdname(mddev));
>>>> +	md_update_sb(mddev, 1);
>>> 
>>> Why this? And md_update_sb must be called with mddev->reconfig_mutex locked.
>> 
>> This is to avoid skipping in handle_stripe():
>> 
>> 	if (s.handle_bad_blocks ||
>> 	    test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>> 		set_bit(STRIPE_HANDLE, &sh->state);
>> 		goto finish;
>> 	}
>> 
>> I haven't got a better idea than calling md_update_sb() somewhere. It is also
>> tricky to lock mddev->reconfig_mutex here, due to potential deadlock with
>> mddev->open_mutex.
>> 
>> Do you have suggestions on this?
> 
> This doesn't sound necessary then. The mddev->thread will call md_update_sb and
> clear the MD_SB_CHANGE_PENDING bit. After that, the stripes will be handled.
> This is running in a workqueue, so I'm wondering what kind of deadlock issue
> there is.

mddev->thread calls md_check_recovery() in raid5d(), and md_check_recovery()
calls md_update_sb(). However, we need to call md_update_sb() before
mddev_suspend() here. Otherwise, mddev_suspend() will increase mddev->suspended
and then md_check_recovery() will exit before calling md_update_sb().
The deadlock between reconfig_mutex and open_mutex is as follows:

[  211.753343] ======================================================
[  211.753823] [ INFO: possible circular locking dependency detected ]
[  211.754310] 4.11.0+ #88 Not tainted
[  211.754582] -------------------------------------------------------
[  211.755079] mdadm/1582 is trying to acquire lock:
[  211.755444]  ((&log->disable_writeback_work)){+.+...}, at: [<ffffffff8107d842>] flush_work+0x12/0x290
[  211.756160] 
[  211.756160] but task is already holding lock:
[  211.756609]  (&mddev->reconfig_mutex){+.+.+.}, at: [<ffffffffa0052654>] rdev_attr_store+0x64/0xd0 [md_mod]
[  211.757354] 
[  211.757354] which lock already depends on the new lock.
[  211.757354] 
[  211.757984] 
[  211.757984] the existing dependency chain (in reverse order) is:
[  211.758560] 
[  211.758560] -> #1 (&mddev->reconfig_mutex){+.+.+.}:
[  211.759062]        lock_acquire+0xc2/0x230
[  211.759391]        __mutex_lock+0x7b/0x9d0
[  211.759716]        mutex_lock_interruptible_nested+0x1b/0x20
[  211.760158]        r5c_disable_writeback_async+0x6b/0xa0 [raid456]
[  211.760634]        process_one_work+0x1d6/0x650
[  211.760986]        worker_thread+0x4d/0x380
[  211.761311]        kthread+0x117/0x150
[  211.761602]        ret_from_fork+0x2e/0x40
[  211.761924] 
[  211.761924] -> #0 ((&log->disable_writeback_work)){+.+...}:
[  211.762470]        __lock_acquire+0x1458/0x1770
[  211.762825]        lock_acquire+0xc2/0x230
[  211.763149]        flush_work+0x4a/0x290
[  211.763457]        r5l_exit_log+0x31/0x80 [raid456]
[  211.763840]        raid5_remove_disk+0x8a/0x240 [raid456]
[  211.764265]        remove_and_add_spares+0x175/0x390 [md_mod]
[  211.764714]        state_store+0xa3/0x4f0 [md_mod]
[  211.765090]        rdev_attr_store+0x89/0xd0 [md_mod]
[  211.765486]        sysfs_kf_write+0x4f/0x70
[  211.765815]        kernfs_fop_write+0x147/0x1d0
[  211.766170]        __vfs_write+0x28/0x120
[  211.766485]        vfs_write+0xd3/0x1d0
[  211.766789]        SyS_write+0x52/0xa0
[  211.767083]        do_syscall_64+0x6a/0x210
[  211.767408]        return_from_SYSCALL_64+0x0/0x7a
[  211.767782] 
[  211.767782] other info that might help us debug this:
[  211.767782] 
[  211.768399]  Possible unsafe locking scenario:
[  211.768399] 
[  211.768861]        CPU0                    CPU1
[  211.769214]        ----                    ----
[  211.769567]   lock(&mddev->reconfig_mutex);
[  211.769897]                                lock((&log->disable_writeback_work));
[  211.770471]                                lock(&mddev->reconfig_mutex);
[  211.770993]   lock((&log->disable_writeback_work));
> 
>> 
>>>> 	mddev_suspend(mddev);
>>>> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>>>> 	mddev_resume(mddev);
>>>> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>>>  void r5l_quiesce(struct r5l_log *log, int state)
>>>>  {
>>>>  	struct mddev *mddev;
>>>> +	struct r5conf *conf;
>>>> +
>>>>  	if (!log || state == 2)
>>>>  		return;
>>>>  	if (state == 0)
>>>> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>>>>  	else if (state == 1) {
>>>>  		/* make sure r5l_write_super_and_discard_space exits */
>>>>  		mddev = log->rdev->mddev;
>>>> +		conf = mddev->private;
>>>>  		wake_up(&mddev->sb_wait);
>>>>  		kthread_park(log->reclaim_thread->tsk);
>>>>  		r5l_wake_reclaim(log, MaxSector);
>>>> -		r5l_do_reclaim(log);
>>>> +		if (!r5l_log_disk_error(conf))
>>>> +			r5l_do_reclaim(log);
>>> 
>>> I think r5c_disable_writeback_async() will call into this, so we flush all
>>> stripe cache out to raid disks, why skip the reclaim?
>>> 
>> 
>> r5l_do_reclaim() reclaims log space in 2 steps:
>> 1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread;
>> 2. update log_tail in the journal device, and issue discard to the journal
>>    device.
>> 
>> When we are handling log failures in r5c_disable_writeback_async(), we are
>> flushing the cache, so it is not necessary to wake up mddev->thread. Also,
>> with a log device error, it is not necessary to update log_tail or issue
>> discard.
>> 
>> Therefore, r5l_do_reclaim() is not necessary on log disk errors.
> 
> Ok, there is no side effect, right? Let's call it anyway.

Let me double check for potential side effects. If it is safe, I will remove
the check.
Thanks,
Song
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html