> On Oct 13, 2016, at 11:13 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Thu, Oct 13 2016, Song Liu wrote:
>
>>
>> /*
>>  * raid5 cache state machine
>>  *
>>  * The RAID cache works in two major states for each stripe: write state
>>  * and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
>>  * and STRIPE_R5C_WRITTEN
>
> Hi,
> It would really help at this point to understand exactly what it is
> that is "FROZEN".  I guess the contents of the stripe_head are frozen?
> so that nothing changes between writing to the journal and writing to
> the RAID?  Making that explicit would help.

I guess I (wrongly) borrowed the "frozen" name from Shaohua's earlier
patch, where frozen means "no more writes until flush done". In this
implementation, STRIPE_R5C_FROZEN differentiates the write state from
the reclaim state (reclaim may not be a good name here, though).

In write state, data writes are handled as:

    write to journal, return IO
    write to journal, return IO
    etc.

In reclaim state, the state machine works as:

    calculate parity
    write parity (maybe also some data) to journal
    write data and parity to raid disks (maybe)
    return IO

The reclaim state is very similar to the current write-journal (or
write-through cache) behavior, while the write state only applies to the
write-back cache.

This particular patch does NOT add the detailed logic of the write-back
cache, so the stripe enters reclaim state (sets STRIPE_R5C_FROZEN)
immediately on every write (in handle_stripe_dirtying).

> Only... the stripe_head gets "frozen" before that.  It gets frozen
> before the parity is calculated, and not released until the data and
> parity is written to the RAID.  This new "FROZEN" state is a subset of
> that - yes?

Yes, the stripe gets frozen before parity is calculated, and released
after data and parity are written to the RAID. The timeline is like:

    write data to journal, return IO
    write data to journal, return IO
    ...
    frozen
    calculate parity
    write parity to journal
    write data and parity to RAID
    release (or un-freeze, or melt)

> Given that STRIPE_R5C_FROZEN is set when the stripe is in "reclaim
> state", I wonder why you didn't call it "STRIPE_R5C_RECLAIM"
> .... though I'm not sure "reclaim" is a good word.  Reclaim is
> something that happens to the journal after the data has been written
> to the RAID.

Reclaim might be a good name here. It is the flag indicating that the
state machine is working towards "write all data and parity to RAID".

>>
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 7557791b..9e05850 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -40,6 +40,47 @@
>>   */
>>  #define R5L_POOL_SIZE 4
>>
>> +enum r5c_state {
>> +	R5C_STATE_NO_CACHE = 0,
>> +	R5C_STATE_WRITE_THROUGH = 1,
>> +	R5C_STATE_WRITE_BACK = 2,
>> +	R5C_STATE_CACHE_BROKEN = 3,
>
> How is CACHE_BROKEN different from NO_CACHE??
> Maybe a later patch will tell me, but I'd rather know now :-)

Yeah, a later patch (the sysfs entry r5c_state) has more details. In the
current write-journal implementation, we force the array read-only when
the journal device is broken. The array stays read-only even after
re-assembly, because it still has the JOURNAL bit in its feature map.
That is CACHE_BROKEN.

> As this state is stored in the r5l_log, and as we don't allocate that
> when there is no cache, why have a NO_CACHE state?

We only store "write through" or "write back" in r5l_log. However, the
r5c_state sysfs entry has two functions:
1. switching between "write through" and "write back";
2. taking the array from CACHE_BROKEN to NO_CACHE (removing the JOURNAL
   bit from the feature map).

>>
>> +/*
>> + * Freeze the stripe, thus send the stripe into reclaim path.
>> + *
>> + * In current implementation, STRIPE_R5C_FROZEN is also set in write through
>> + * mode (in r5c_handle_stripe_dirtying). This does not change the behavior
>> + * of write through mode.
>> + */
>> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>
> "freeze_stripe_for_reclaim" seems an odd name.  How does "reclaim" fit?
> You are freezing the stripe so it can be written to the journal, and
> then eventually to the RAID, are you not?
> Or are you freezing it so the space used in the journal cannot be
> reclaimed?
>
> Maybe I misunderstand the "FROZEN" concept still.

It is more like the stripe is "frozen" so we can kick off the reclaim
process.

>
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +	struct r5l_log *log = conf->log;
>> +
>> +	if (!log)
>> +		return;
>> +	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +	set_bit(STRIPE_R5C_FROZEN, &sh->state);
>> +}
>> +
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
>
> A comment just before this function saying when it is called would help.
> "Call when a stripe_head is safe in the journal, but has not yet been
> written to the RAID" ??

Yes, exactly: "Call when a stripe_head is safe in the journal, but has
not yet been written to the RAID".

>> +{
>> +	struct r5l_log *log = sh->raid_conf->log;
>> +
>> +	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>> +		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> +	} else
>> +		BUG(); /* write back logic in next patch */
>> +}
>> +
>>  static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>  {
>>  	struct stripe_head *sh, *next;
>>
>>  	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>>  		list_del_init(&sh->log_list);
>> +
>> +		r5c_finish_cache_stripe(sh);
>> +
>>  		set_bit(STRIPE_HANDLE, &sh->state);
>>  		raid5_release_stripe(sh);
>>  	}
>> @@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>  		r5l_append_payload_page(log, sh->dev[i].page);
>>  	}
>>
>> -	if (sh->qd_idx >= 0) {
>> +	if (parity_pages == 2) {
>>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>>  					sh->dev[sh->qd_idx].log_checksum, true);
>>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>>  		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
>> -	} else {
>> +	} else if (parity_pages == 1) {
>>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>>  					0, false);
>>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>> -	}
>> +	} else
>> +		BUG_ON(parity_pages != 0);
>
> Why this change?  I don't think it is wrong, but I don't see why it
> belongs here.  How does it help?

In the journal-only implementation, parity_pages can only be 1 or 2.
With the write-back cache, parity_pages can also be 0 (data only), so
that is a new case to handle here. The BUG_ON part is not really
necessary.

>
>> +
>> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		return -EAGAIN;
>> +
>> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
>> +	    conf->mddev->degraded != 0) {
>
> Why do you disable write-back when the array is degraded?

I don't think this is really necessary. Our original thought was to
simplify the case: since the array is already much slower in degraded
mode, a write-back cache cannot make it fast anyway. I think we can
enable write-back in degraded mode later on.

>>
>> +	/*
>> +	 * Now to consider new write requests, cache write back and what else,
>> +	 * if anything should be read.  We do not handle new writes when:
>>  	 * 1/ A 'write' operation (copy+xor) is already in flight.
>>  	 * 2/ A 'check' operation is in flight, as it may clobber the parity
>>  	 *    block.
>> +	 * 3/ A r5c cache log write is in flight.
>>  	 */
>> -	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
>> +	if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
>> +	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
>>  		handle_stripe_dirtying(conf, sh, &s, disks);
>
> OK, I'm definitely confused now.
> If the stripe is FROZEN, surely you don't want to call
> handle_stripe_dirtying().
handle_stripe_dirtying() is the beginning of the original write path:
shall we do rcw or rmw? Once frozen, we need to calculate parity and
write it to the RAID, so we need handle_stripe_dirtying() here.

Thanks,
Song

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html