On 2018/1/22 16:03, guoweichao wrote: > > > On 2018/1/22 14:19, Chao Yu wrote: >> On 2018/1/22 10:40, guoweichao wrote: >>> Hi Chao, >>> >>> On 2018/1/21 10:34, Chao Yu wrote: >>>> Hi Weichao, >>>> >>>> On 2018/1/20 23:50, guoweichao wrote: >>>>> Hi Chao, >>>>> >>>>> Yes, it is exactly what I mean. >>>>> It seems that F2FS has no responsibility to cover hardware problems. >>>>> However, file systems usually choose redundancy for super block fault tolerance. >>>>> So I think we actually have considered some external errors when designing a file system. >>>>> Our dual checkpoint mechanism is mainly designed to keep at least one stable >>>>> CP pack while creating a new one in case of SPO. And it has fault tolerant effects. >>>>> As CP pack is also very critical to F2FS, why not make checkpoint more robustness >>>> >>>> I think you're trying to change basic design of A/B upgrade system to A/B/C one, >>>> which can keep always two checkpoint valid. There will be no simple modification >>>> to implement that one, in where we should cover not only prefree case but also >>>> SSR case. >>>> >>> Nope, I do not intent to add another checkpoint. My main idea is reusing the invalid blocks >>> after two checkpoints are recorded them as invalid. We could mark or record the blocks that >>> are free to one checkpoint and valid to the other one as PARTIAL_FREE. And when set prefree >>> segments as free or use invalid blocks in SSR mode, just igore PARTIAL_FREE blocks. If there >>> is no other blocks can be used, just write a new checkpoint. >> >> Actually, your implementation looks like a variant of A/B/C upgrade system >> (or A/B upgrade system), as checkpoint C (or inflight A/B) uses main area >> separated from checkpoint A/B, and its inflight dirty metadata will persist >> into checkpoint A or B since we haven't actually allocated checkpoint C >> area during mkfs. >> >> Anyway, I don't think we have a strong reason to implement this one. >> >>> >>>> IMO, the biggest problem there is available space, since in checkpoint C, we can >>>> only use invalid blocks in both checkpoint A and B, so in some cases there will >>>> almost be no valid space we can use during allocation, result in frequently >>>> checkpoint. >>>> >>>> IMO, what we can do is trying to keep last valid checkpoint being integrity as >>>> possible as we can. One way is that we can add mirror or parity for the >>>> checkpoint which can help to do recovery once checkpoint is corrupted. At >>>> least, I hope that with it in debug version we can help hardware staff to fix >>>> their issue instead of wasting much time to troubleshoot filesystem issue. >>>> >>> Yes, I agree. Adding a backup for the latest checkpoint in debug version is OK. Sometimes, >> >> Parity will be better due to light overhead? > Yes, backup may also cause disk layout changes. How about adding a parity block for the blocks Right, > in a checkpoint similar to RAID-5 or other erasure code methods? Yes, I think we can. Thanks, > > Thanks, > >> >> Thanks, >> >>> hardware issues are hard to uncover. We should try to separate them from F2FS. >>> >>> Thanks, >>> >>>> Thanks, >>>> >>>>> with simple modification in current design and little overhead except for FG_GC. >>>>> Of cause, keep unchanged is OK. I just want to discuss this proposal. :) >>>>> >>>>> Thanks, >>>>> *From:*Chao Yu >>>>> *To:*guoweichao,jaegeuk@xxxxxxxxxx, >>>>> *Cc:*linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx,linux-fsdevel@xxxxxxxxxxxxxxx,heyunlei, >>>>> *Date:*2018-01-20 15:43:23 >>>>> *Subject:*Re: [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other >>>>> >>>>> Hi Weichao, >>>>> >>>>> On 2018/1/20 7:29, Weichao Guo wrote: >>>>>> Currently, we set prefree segments as free ones after writing >>>>>> a checkpoint, then believe the segments could be used safely. >>>>>> However, if a sudden power-off coming and the newer checkpoint >>>>>> corrupted due to hardware issues at the same time, we will try >>>>>> to use the old checkpoint and may face an inconsistent file >>>>>> system status. >>>>> >>>>> IIUC, you mean: >>>>> >>>>> 1. write nodes into segment x; >>>>> 2. write checkpoint A; >>>>> 3. remove nodes in segment x; >>>>> 4. write checkpoint B; >>>>> 5. issue discard or write datas into segment x; >>>>> 6. sudden power-cut >>>>> >>>>> But after reboot, we found checkpoint B is corrupted due to hardware, and >>>>> then start to use checkpoint A, but nodes in segment x recorded as valid >>>>> data in checkpoint A has been overcovered in step 5), so we will encounter >>>>> inconsistent meta data, right? >>>>> >>>>> Thanks, >>>>> >>>>>> >>>>>> How about add an PRE2 status for prefree segments, and make >>>>>> sure the segments could be used safely to both checkpoints? >>>>>> Or any better solutions? Or this is not a problem? >>>>>> >>>>>> Look forward to your comments! >>>>>> >>>>>> Signed-off-by: Weichao Guo <guoweichao@xxxxxxxxxx> >>>>>> --- >>>>>> fs/f2fs/gc.c | 11 +++++++++-- >>>>>> fs/f2fs/segment.c | 21 ++++++++++++++++++--- >>>>>> fs/f2fs/segment.h | 6 ++++++ >>>>>> 3 files changed, 33 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c >>>>>> index 33e7969..153e3ea 100644 >>>>>> --- a/fs/f2fs/gc.c >>>>>> +++ b/fs/f2fs/gc.c >>>>>> @@ -1030,7 +1030,12 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync, >>>>>> * threshold, we can make them free by checkpoint. Then, we >>>>>> * secure free segments which doesn't need fggc any more. >>>>>> */ >>>>>> - if (prefree_segments(sbi)) { >>>>>> + if (prefree_segments(sbi) || prefree2_segments(sbi)) { >>>>>> + ret = write_checkpoint(sbi, &cpc); >>>>>> + if (ret) >>>>>> + goto stop; >>>>>> + } >>>>>> + if (has_not_enough_free_secs(sbi, 0, 0) && prefree2_segments(sbi)) { >>>>>> ret = write_checkpoint(sbi, &cpc); >>>>>> if (ret) >>>>>> goto stop; >>>>>> @@ -1063,8 +1068,10 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync, >>>>>> goto gc_more; >>>>>> } >>>>>> >>>>>> - if (gc_type == FG_GC) >>>>>> + if (gc_type == FG_GC) { >>>>>> + ret = write_checkpoint(sbi, &cpc); >>>>>> ret = write_checkpoint(sbi, &cpc); >>>>>> + } >>>>>> } >>>>>> stop: >>>>>> SIT_I(sbi)->last_victim[ALLOC_NEXT] = 0; >>>>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c >>>>>> index 2e8e054d..9dec445 100644 >>>>>> --- a/fs/f2fs/segment.c >>>>>> +++ b/fs/f2fs/segment.c >>>>>> @@ -1606,7 +1606,7 @@ static void set_prefree_as_free_segments(struct f2fs_sb_info *sbi) >>>>>> unsigned int segno; >>>>>> >>>>>> mutex_lock(&dirty_i->seglist_lock); >>>>>> - for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi)) >>>>>> + for_each_set_bit(segno, dirty_i->dirty_segmap[PRE2], MAIN_SEGS(sbi)) >>>>>> __set_test_and_free(sbi, segno); >>>>>> mutex_unlock(&dirty_i->seglist_lock); >>>>>> } >>>>>> @@ -1617,13 +1617,17 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc) >>>>>> struct list_head *head = &dcc->entry_list; >>>>>> struct discard_entry *entry, *this; >>>>>> struct dirty_seglist_info *dirty_i = DIRTY_I(sbi); >>>>>> - unsigned long *prefree_map = dirty_i->dirty_segmap[PRE]; >>>>>> + unsigned long *prefree_map; >>>>>> unsigned int start = 0, end = -1; >>>>>> unsigned int secno, start_segno; >>>>>> bool force = (cpc->reason & CP_DISCARD); >>>>>> + int phase = 0; >>>>>> + enum dirty_type dirty_type = PRE2; >>>>>> >>>>>> mutex_lock(&dirty_i->seglist_lock); >>>>>> >>>>>> +next_step: >>>>>> + prefree_map = dirty_i->dirty_segmap[dirty_type]; >>>>>> while (1) { >>>>>> int i; >>>>>> start = find_next_bit(prefree_map, MAIN_SEGS(sbi), end + 1); >>>>>> @@ -1635,7 +1639,7 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc) >>>>>> for (i = start; i < end; i++) >>>>>> clear_bit(i, prefree_map); >>>>>> >>>>>> - dirty_i->nr_dirty[PRE] -= end - start; >>>>>> + dirty_i->nr_dirty[dirty_type] -= end - start; >>>>>> >>>>>> if (!test_opt(sbi, DISCARD)) >>>>>> continue; >>>>>> @@ -1663,6 +1667,16 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc) >>>>>> else >>>>>> end = start - 1; >>>>>> } >>>>>> + if (phase == 0) { >>>>>> + /* status change: PRE -> PRE2 */ >>>>>> + for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi)) >>>>>> + if (!test_and_set_bit(segno, prefree_map)) >>>>>> + dirty_i->nr_dirty[PRE2]++; >>>>>> + >>>>>> + phase = 1; >>>>>> + dirty_type = PRE; >>>>>> + goto next_step; >>>>>> + } >>>>>> mutex_unlock(&dirty_i->seglist_lock); >>>>>> >>>>>> /* send small discards */ >>>>>> @@ -2245,6 +2259,7 @@ static void change_curseg(struct f2fs_sb_info *sbi, int type) >>>>>> >>>>>> mutex_lock(&dirty_i->seglist_lock); >>>>>> __remove_dirty_segment(sbi, new_segno, PRE); >>>>>> + __remove_dirty_segment(sbi, new_segno, PRE2); >>>>>> __remove_dirty_segment(sbi, new_segno, DIRTY); >>>>>> mutex_unlock(&dirty_i->seglist_lock); >>>>>> >>>>>> diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h >>>>>> index 71a2aaa..f702726 100644 >>>>>> --- a/fs/f2fs/segment.h >>>>>> +++ b/fs/f2fs/segment.h >>>>>> @@ -263,6 +263,7 @@ enum dirty_type { >>>>>> DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */ >>>>>> DIRTY, /* to count # of dirty segments */ >>>>>> PRE, /* to count # of entirely obsolete segments */ >>>>>> + PRE2, /* to count # of the segments free to one checkpoint but obsolete to the other checkpoint */ >>>>>> NR_DIRTY_TYPE >>>>>> }; >>>>>> >>>>>> @@ -477,6 +478,11 @@ static inline unsigned int prefree_segments(struct f2fs_sb_info *sbi) >>>>>> return DIRTY_I(sbi)->nr_dirty[PRE]; >>>>>> } >>>>>> >>>>>> +static inline unsigned int prefree2_segments(struct f2fs_sb_info *sbi) >>>>>> +{ >>>>>> + return DIRTY_I(sbi)->nr_dirty[PRE2]; >>>>>> +} >>>>>> + >>>>>> static inline unsigned int dirty_segments(struct f2fs_sb_info *sbi) >>>>>> { >>>>>> return DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_DATA] + >>>>>> >>>> >>>> . >>>> >>> >>> >>> . >>> >> >> >> . >> > > > . >