Thanks Shaohua! I am working on adding more information in the comments and
commit notes.

>>
>> +/*
>> + * Freeze the stripe, thus send the stripe into reclaim path.
>> + *
>> + * This function should only be called from raid5d that handling this stripe,
>> + * or when holds conf->device_lock
>> + */
>
> Do you mean if this called in raid5d, the lock isn't required? Please note we
> could have several threads (like raid5d) handle stripes. Is there any problem
> here?

This requirement is not necessary anymore. I will update the comment here.

>
>> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +
>> +	if (!conf->log)
>> +		return;

>> +static void r5c_handle_parity_cached(struct stripe_head *sh)
>> +{
>> +	int i;
>> +
>> +	for (i = sh->disks; i--; )
>> +		if (test_bit(R5_InCache, &sh->dev[i].flags))
>> +			set_bit(R5_Wantwrite, &sh->dev[i].flags);
>> +	set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> +}
>> +
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> +{
>
> I'm hoping this one has a
>	if (not in writeback mode)
>		return
> so it's clearly this is only for writeback mode.
>
>> +	if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5c_handle_parity_cached(sh);
>> +	else
>> +		r5c_handle_data_cached(sh);
>> +}
>> +
>>
>>  	return 0;
>>  }
>>
>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
>>  /*
>>   * running in raid5d, where reclaim could wait for raid5d too (when it flushes
>>   * data from log to raid disks), so we shouldn't wait for reclaim here
>> @@ -456,11 +541,17 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>>  		return -EAGAIN;
>>  	}
>>
>> +	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>
> is this set even for write through?

In the current implementation, the code still sets STRIPE_R5C_FROZEN in
write-through mode (in r5c_handle_stripe_dirtying), so the reclaim path
behaves the same in write-through mode as the original journal code did.
>>
>> +
>> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		return -EAGAIN;
>> +
>> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
>> +	    conf->quiesce != 0 || conf->mddev->degraded != 0) {
>> +		/* write through mode */
>> +		r5c_freeze_stripe_for_reclaim(sh);
>> +		return -EAGAIN;
>
> will this change behavior of R5C_STATE_WRITE_THROUGH?
> r5c_freeze_stripe_for_reclaim does change something not related to writethrough
> mode.

The PREREAD_ACTIVE part of the freeze is not necessary for write-through; I
will fix that. Other than that, the reclaim path works the same way in
write-through mode as the original journal code. I will double-check and make
it clear.

> The quiesce check sounds not necessary. The patch flush all caches in quiesce,
> so no IO is running in quiesce state.

Great catch. The other checks already cover all cases in quiesce.

>>
>> @@ -901,6 +918,13 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>>
>>  	might_sleep();
>>
>> +	if (s->to_cache) {
>> +		if (r5c_cache_data(conf->log, sh, s) == 0)
>> +			return;
>
> At this stage we fallback to no cache. But I don't see R5_Wantcache is cleared,
> is it a problem?

We did a similar check for the journal part. I think it will be easier if we
add a check in r5l_init_log() and never create the journal/cache when the
array is too big?

>
>> @@ -1543,9 +1570,18 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
>>  static void ops_complete_prexor(void *stripe_head_ref)
>>  {
>>  	struct stripe_head *sh = stripe_head_ref;
>> +	int i;
>>
>>  	pr_debug("%s: stripe %llu\n", __func__,
>>  		(unsigned long long)sh->sector);
>> +
>> +	for (i = sh->disks; i--; )
>> +		if (sh->dev[i].page != sh->dev[i].orig_page) {
>> +			struct page *p = sh->dev[i].page;
>> +
>> +			sh->dev[i].page = sh->dev[i].orig_page;
>> +			put_page(p);
>
> What if array hasn't log but skipcopy is enabled?

In the case of skip copy, page and orig_page are the same during
ops_complete_prexor: page == orig_page == old data.
bio_page is the new data. After prexor is done, we run biodrain, where page is
set to bio_page to skip the copy.

>>
>> @@ -4416,7 +4538,7 @@ static void handle_stripe(struct stripe_head *sh)
>>  			struct r5dev *dev = &sh->dev[i];
>>  			if (test_bit(R5_LOCKED, &dev->flags) &&
>>  			    (i == sh->pd_idx || i == sh->qd_idx ||
>> -			     dev->written)) {
>> +			     dev->written || test_bit(R5_InCache, &dev->flags))) {
>>  				pr_debug("Writing block %d\n", i);
>>  				set_bit(R5_Wantwrite, &dev->flags);
>>  				if (prexor)
>> @@ -4456,6 +4578,12 @@ static void handle_stripe(struct stripe_head *sh)
>>  	     test_bit(R5_Discard, &qdev->flags)))))))
>>  		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>>
>> +	if (s.just_cached)
>> +		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
>> +
>> +	if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5l_stripe_write_finished(sh);
>
> what's this for?

r5l_stripe_write_finished() removes sh->log_io. It is actually not necessary
to check for R5C_FROZEN here; I will fix that.

>
>> +
>>  	/* Now we might consider reading some blocks, either to check/generate
>>  	 * parity, or to satisfy requests
>>  	 * or to load a block that is being partially written.
>> @@ -4467,13 +4595,17 @@ static void handle_stripe(struct stripe_head *sh)
>>  	    || s.expanding)
>>  		handle_stripe_fill(sh, &s, disks);
>>
>> -	/* Now to consider new write requests and what else, if anything
>> -	 * should be read. We do not handle new writes when:
>> +	r5c_handle_stripe_written(conf, sh);
>
> please explain why this is required?

I added the following comment:

	/*
	 * If the stripe finishes the full journal write cycle (write to
	 * journal and raid disk), this is the clean up procedure so it is
	 * ready for the next operation.
	 */

>>
>>  	/* maybe we need to check and possibly fix the parity for this stripe
>> @@ -5192,7 +5324,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>>  	 * later we might have to read it again in order to reconstruct
>>  	 * data on failed drives.
>>  	 */
>> -	if (rw == READ && mddev->degraded == 0 &&
>> +	if (rw == READ && mddev->degraded == 0 && conf->log == NULL &&
>>  	    mddev->reshape_position == MaxSector) {
>>  		bi = chunk_aligned_read(mddev, bi);
>>  		if (!bi)
>
> This should be only for R5C_STATE_WRITE_BACK.

Will fix this.

>>
>> @@ -7662,8 +7800,11 @@ static void raid5_quiesce(struct mddev *mddev, int state)
>>  		/* '2' tells resync/reshape to pause so that all
>>  		 * active stripes can drain
>>  		 */
>> +		r5c_flush_cache(conf, 0);
>>  		conf->quiesce = 2;
>>  		wait_event_cmd(conf->wait_for_quiescent,
>> +			       atomic_read(&conf->r5c_cached_partial_stripes) == 0 &&
>> +			       atomic_read(&conf->r5c_cached_full_stripes) == 0 &&
>
> I don't see a wakeup for this after the check condition is met.

The following in release_inactive_stripe_list() will wake this up:

	if (atomic_read(&conf->active_stripes) == 0)
		wake_up(&conf->wait_for_quiescent);

This is OK because cached stripes (full or partial) have to become active
before they can become inactive.

Thanks,
Song
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html