> On Sep 27, 2016, at 5:34 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>
>>
>> +        if (!conf->log)
>> +                return;
>> +        spin_lock(&conf->device_lock);
>> +        if (r5c_total_cached_stripes(conf) > conf->max_nr_stripes * 3 / 4 ||
>> +            atomic_read(&conf->empty_inactive_list_nr) > 0)
>> +                r5c_flush_cache(conf, R5C_RECLAIM_STRIPE_GROUP);
>
> I still worry about the max_nr_stripes usage. It can be changed at runtime.
> If there are not enough stripes, should we just allocate more stripes or
> reclaim the stripe cache? If the memory system tries to shrink stripes (eg,
> decrease max_nr_stripes), will it cause a deadlock for r5cache?
>

The write cache will not deadlock because of stripe cache usage: a cached
stripe keeps holding its stripe_head until the data is flushed to the raid
disks.

>> +        else if (r5c_total_cached_stripes(conf) >
>> +                 conf->max_nr_stripes * 1 / 2)
>> +                r5c_flush_cache(conf, 1);
>
> This one is a defensive reclaim. It should always reclaim stripes with full
> data. If there are not enough such stripes, do nothing. Flushing 1 stripe
> would always be wrong unless we are in a critical stripe space shortage; as
> reclaim involves a disk cache flush and is slow, we should do as much
> aggregation as possible.

I will remove the defensive reclaim.

>
>>  *
>> @@ -198,10 +260,9 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>>  {
>>          struct r5conf *conf = sh->raid_conf;
>>
>> -        if (!conf->log)
>> +        if (!conf->log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>>                  return;
>>
>> -        WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>>          set_bit(STRIPE_R5C_FROZEN, &sh->state);
>
> This is confusing. The WARN_ON suggests the STRIPE_R5C_FROZEN isn't set for
> sh, but the change suggests it's possible the bit is set. Which one is
> correct?
>

Will fix this.

>>
>> @@ -518,6 +583,14 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>          atomic_inc(&io->pending_stripe);
>>          sh->log_io = io;
>>
>> +        if (sh->log_start == MaxSector) {
>> +                BUG_ON(!list_empty(&sh->r5c));
>> +                sh->log_start = io->log_start;
>> +                spin_lock_irqsave(&log->stripe_in_cache_lock, flags);
>> +                list_add_tail(&sh->r5c,
>> +                              &log->stripe_in_cache);
>> +                spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
>> +        }
>
> what if it's in writethrough mode?

This does not affect write-through mode, but let me add an explicit check just
in case.

>
>> +        last_checkpoint = (list_first_entry(&log->stripe_in_cache,
>> +                                            struct stripe_head, r5c))->log_start;
>> +        spin_unlock(&log->stripe_in_cache_lock);
>> +        if (sh->log_start != last_checkpoint) {
>> +                spin_lock(&log->no_space_stripes_lock);
>> +                list_add_tail(&sh->log_list, &log->no_space_stripes);
>> +                spin_unlock(&log->no_space_stripes_lock);
>> +                mutex_unlock(&log->io_mutex);
>> +                return -ENOSPC;
>
> So if a stripe is in cache, we try to reclaim it. We should have some
> mechanism to guarantee there is enough space for reclaim (eg for parity).
> Otherwise there could be a deadlock, because the space allocation in the
> reclaim path is there to free space. Could you please explain how this is
> done?
>

I redefined this part in the newer version, which should be clearer.

>> +        } else if (!r5l_has_free_space(log, reserve)) {
>> +                WARN(1, "%s: run out of journal space\n", __func__);
>> +                BUG();
>
> that's scary, why does it happen?
>

I rewrote some of this code in the newer version. If something similar still
happens, it is a bug in reclaim.

>>
>> +                if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
>> +                    sh->log_start != last_checkpoint)
>> +                        continue;
>
> what's this check for?

With this check, the code only flushes stripes whose data sits at
last_checkpoint in the journal. This is no longer needed in the newer version;
I will remove it.
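To illustrate what that gate was doing before it goes away, here is a toy
user-space model, not kernel code; the struct, flush_pass() and the numbers
are all invented for this sketch. The idea is that when journal space is
critical, only the stripe whose data sits at the last checkpoint gets flushed,
because flushing that stripe is what allows the checkpoint, and therefore the
reclaimable space, to move forward:

/*
 * Toy user-space model of the R5C_LOG_CRITICAL gate discussed above.
 * Not kernel code; all names here are made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_stripe {
        unsigned long log_start;   /* journal position of this stripe's data */
        bool          needs_flush;
};

static void flush_pass(struct toy_stripe *stripes, int n,
                       unsigned long last_checkpoint, bool log_critical)
{
        for (int i = 0; i < n; i++) {
                if (!stripes[i].needs_flush)
                        continue;
                /* mirrors: if (critical && sh->log_start != last_checkpoint) continue; */
                if (log_critical && stripes[i].log_start != last_checkpoint)
                        continue;
                printf("flushing stripe at log_start %lu\n", stripes[i].log_start);
                stripes[i].needs_flush = false;
        }
}

int main(void)
{
        struct toy_stripe stripes[] = {
                { .log_start = 100, .needs_flush = true },
                { .log_start = 108, .needs_flush = true },
                { .log_start = 116, .needs_flush = true },
        };

        /* Space is critical: only the stripe at the checkpoint (100) is flushed. */
        flush_pass(stripes, 3, 100, true);
        /* Space recovered: the remaining stripes are flushed. */
        flush_pass(stripes, 3, 108, false);
        return 0;
}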
>
>>          list_del_init(&sh->log_list);
>> +
>>  static sector_t r5l_reclaimable_space(struct r5l_log *log)
>>  {
>> +        struct r5conf *conf = log->rdev->mddev->private;
>> +
>>          return r5l_ring_distance(log, log->last_checkpoint,
>> -                                 log->next_checkpoint);
>> +                                 r5c_calculate_last_cp(conf));
>
> will this work for writethrough?

r5c_calculate_last_cp() returns next_checkpoint in write-through mode.

>
>>  }
>>
>>
>>          if (reclaimable == 0)
>>                  return;
>> -
>>          /*
>>           * write_super will flush cache of each raid disk. We must write super
>>           * here, because the log area might be reused soon and we don't want to
>> @@ -877,10 +995,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>>
>>          mutex_lock(&log->io_mutex);
>>          log->last_checkpoint = next_checkpoint;
>> -        log->last_cp_seq = next_cp_seq;
>
> why not update last_cp_seq?

Currently we don't need last_cp_seq anywhere in the code. To keep track of it,
we would have to add the field to stripe_head, and that does not seem
necessary.

>>          mutex_unlock(&log->io_mutex);
>> -
>> -        r5l_run_no_space_stripes(log);
>
> I don't understand why you moved r5l_run_no_space_stripes to r5c_flush_cache.
> It's natural to run this after some space is reclaimed.

I will move it back.

>>  }
>>
>>  static void r5l_reclaim_thread(struct md_thread *thread)
>> @@ -891,7 +1006,9 @@ static void r5l_reclaim_thread(struct md_thread *thread)
>>
>>          if (!log)
>>                  return;
>> +        r5c_do_reclaim(conf);
>>          r5l_do_reclaim(log);
>> +        md_wakeup_thread(mddev->thread);
>
> this wakeup is a bit strange. After we reclaim some space, we will rerun
> pending stripes, which will wake up mddev->thread. Do we miss some wakeups in
> other reclaim places?

This one is not really necessary. I will remove it.

>>  }
>>
>>  void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>
>>  }
>>
>> -static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
>> +/*
>> + * r5c_flush_cache will move stripe from cached list to handle_list or
>> + * r5c_priority_list
>
> What's the r5c_priority_list for? If you want to make sure reclaim makes
> progress, I think it's the wrong way. If there is no space, handling other
> normal stripes just means moving them to the no_space list and doing nothing
> else. Then the reclaim stripes will get their turn to run. There is no extra
> list required.

I will try to remove it.

>
>>
>> +        BUG_ON(test_bit(STRIPE_R5C_PRIORITY, &sh->state) &&
>> +               !test_bit(STRIPE_HANDLE, &sh->state));
>> +
>> +        if (test_bit(STRIPE_R5C_PRIORITY, &sh->state))
>> +                return 0;
>> +        if (test_bit(STRIPE_HANDLE, &sh->state) && !priority)
>> +                return 0;
>> +
>>          r5c_freeze_stripe_for_reclaim(sh);
>> -        atomic_inc(&conf->active_stripes);
>> +        if (!test_and_set_bit(STRIPE_HANDLE, &sh->state)) {
>> +                atomic_inc(&conf->active_stripes);
>> +        }
>
> shouldn't the stripe always be accounted as active before it's reclaimed? Do
> we decrease the count before the stripe is reclaimed? sounds like a bug.
>

We decrease the active count when the stripe is on the r5c cached full-stripe
or cached partial-stripe list. These two lists are variations of inactive_list.

>>
>> +
>> +                mutex_unlock(&log->io_mutex);
>> +                return -ENOSPC;
>>          }
>> +        pr_debug("%s: write sh %lu to free log space\n", __func__, sh->sector);
>> +        }
>> +        if (!r5l_has_free_space(log, reserve)) {
>> +                pr_err("%s: cannot reserve space %d\n", __func__, reserve);
>> +                BUG();
>
> same here. we should put the stripe into no_space list. If we can't allocate
> space eventually, it indicates reclaim has a bug.

BUG() here already indicates a bug in reclaim. I will test this thoroughly
with the new version.
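For anyone skimming the thread, the park-and-retry pattern behind the no_space
list (quoted further up in this mail) can be sketched as a toy user-space
program. None of this is the actual md code; every name in it (toy_log,
toy_reserve, toy_run_no_space) is invented for the illustration:

/*
 * Toy user-space model of "park on a no_space list and re-run after reclaim".
 * Not kernel code; all names are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_PARKED 16

struct toy_log {
        unsigned long free_space;          /* journal space still available */
        int           parked[MAX_PARKED];  /* stripe ids waiting for space  */
        int           nr_parked;
};

/* Try to reserve space for one stripe; park it instead of failing hard. */
static bool toy_reserve(struct toy_log *log, int stripe_id, unsigned long need)
{
        if (log->free_space >= need) {
                log->free_space -= need;
                printf("stripe %d: reserved %lu\n", stripe_id, need);
                return true;
        }
        if (log->nr_parked < MAX_PARKED)
                log->parked[log->nr_parked++] = stripe_id;  /* "no_space list" */
        printf("stripe %d: parked until reclaim frees space\n", stripe_id);
        return false;
}

/* Called after reclaim frees space: retry everything that was parked. */
static void toy_run_no_space(struct toy_log *log, unsigned long reclaimed,
                             unsigned long need)
{
        int n = log->nr_parked;

        log->free_space += reclaimed;
        log->nr_parked = 0;
        for (int i = 0; i < n; i++)
                toy_reserve(log, log->parked[i], need);
}

int main(void)
{
        struct toy_log log = { .free_space = 8 };

        toy_reserve(&log, 1, 8);      /* succeeds, uses all space            */
        toy_reserve(&log, 2, 8);      /* parked, no space left               */
        toy_run_no_space(&log, 8, 8); /* reclaim frees 8, stripe 2 retried   */
        return 0;
}

Here toy_run_no_space() plays roughly the role r5l_run_no_space_stripes() plays
in the driver: once reclaim has returned some space, everything that was parked
gets another chance to reserve.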
>>
>> +        if (before_jiffies > 20)
>> +                pr_debug("%s: wait for sh takes %lu jiffies\n", __func__, before_jiffies);
>
> please remove the debug code.
>
> Thanks,
> Shaohua