Re: [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other

Chao Yu <yuchao0@xxxxxxxxxx> · Mon, 22 Jan 2018 14:19:46 +0800

On 2018/1/22 10:40, guoweichao wrote:
> Hi Chao,
> 
> On 2018/1/21 10:34, Chao Yu wrote:
>> Hi Weichao,
>>
>> On 2018/1/20 23:50, guoweichao wrote:
>>> Hi Chao,
>>>
>>> Yes, it is exactly what I mean.
>>> It seems that F2FS has no responsibility to cover hardware problems.
>>> However, file systems usually choose redundancy for super block fault tolerance.
>>> So I think we actually have considered some external errors when designing a file system.
>>> Our dual checkpoint mechanism is mainly designed to keep at least one stable
>>> CP pack while creating a new one in case of SPO. And it has fault tolerant effects.
>>> As CP pack is also very critical to F2FS, why not make checkpoint more robustness
>>
>> I think you're trying to change basic design of A/B upgrade system to A/B/C one,
>> which can keep always two checkpoint valid. There will be no simple modification
>> to implement that one, in where we should cover not only prefree case but also
>> SSR case.
>>
> Nope, I do not intent to add another checkpoint. My main idea is reusing the invalid blocks
> after two checkpoints are recorded them as invalid. We could mark or record the blocks that
> are free to one checkpoint and valid to the other one as PARTIAL_FREE. And when set prefree
> segments as free or use invalid blocks in SSR mode, just igore PARTIAL_FREE blocks. If there
> is no other blocks can be used, just write a new checkpoint.

Actually, your implementation looks like a variant of A/B/C upgrade system
(or A/B upgrade system), as checkpoint C (or inflight A/B) uses main area
separated from checkpoint A/B, and its inflight dirty metadata will persist
into checkpoint A or B since we haven't actually allocated checkpoint C
area during mkfs.

Anyway, I don't think we have a strong reason to implement this one.

> 
>> IMO, the biggest problem there is available space, since in checkpoint C, we can
>> only use invalid blocks in both checkpoint A and B, so in some cases there will
>> almost be no valid space we can use during allocation, result in frequently
>> checkpoint.
>>
>> IMO, what we can do is trying to keep last valid checkpoint being integrity as
>> possible as we can. One way is that we can add mirror or parity for the
>> checkpoint which can help to do recovery once checkpoint is corrupted. At
>> least, I hope that with it in debug version we can help hardware staff to fix
>> their issue instead of wasting much time to troubleshoot filesystem issue.
>>
> Yes, I agree. Adding a backup for the latest checkpoint in debug version is OK. Sometimes,

Parity will be better due to light overhead?

Thanks,

> hardware issues are hard to uncover. We should try to separate them from F2FS.
> 
> Thanks,
> 
>> Thanks,
>>
>>> with simple modification in current design and little overhead except for FG_GC.
>>> Of cause, keep unchanged is OK. I just want to discuss this proposal. :)
>>>
>>> Thanks,
>>> *From:*Chao Yu
>>> *To:*guoweichao,jaegeuk@xxxxxxxxxx,
>>> *Cc:*linux-f2fs-devel@xxxxxxxxxxxxxxxxxxxxx,linux-fsdevel@xxxxxxxxxxxxxxx,heyunlei,
>>> *Date:*2018-01-20 15:43:23
>>> *Subject:*Re: [PATCH RFC] f2fs: add PRE2 to mark segments free to one checkpoint but obsolete to the other
>>>
>>> Hi Weichao,
>>>
>>> On 2018/1/20 7:29, Weichao Guo wrote:
>>>> Currently, we set prefree segments as free ones after writing
>>>> a checkpoint, then believe the segments could be used safely.
>>>> However, if a sudden power-off coming and the newer checkpoint
>>>> corrupted due to hardware issues at the same time, we will try
>>>> to use the old checkpoint and may face an inconsistent file
>>>> system status.
>>>
>>> IIUC, you mean:
>>>
>>> 1. write nodes into segment x;
>>> 2. write checkpoint A;
>>> 3. remove nodes in segment x;
>>> 4. write checkpoint B;
>>> 5. issue discard or write datas into segment x;
>>> 6. sudden power-cut
>>>
>>> But after reboot, we found checkpoint B is corrupted due to hardware, and
>>> then start to use checkpoint A, but nodes in segment x recorded as valid
>>> data in checkpoint A has been overcovered in step 5), so we will encounter
>>> inconsistent meta data, right?
>>>
>>> Thanks,
>>>
>>>>
>>>> How about add an PRE2 status for prefree segments, and make
>>>> sure the segments could be used safely to both checkpoints?
>>>> Or any better solutions? Or this is not a problem?
>>>>
>>>> Look forward to your comments!
>>>>
>>>> Signed-off-by: Weichao Guo <guoweichao@xxxxxxxxxx>
>>>> ---
>>>>  fs/f2fs/gc.c      | 11 +++++++++--
>>>>  fs/f2fs/segment.c | 21 ++++++++++++++++++---
>>>>  fs/f2fs/segment.h |  6 ++++++
>>>>  3 files changed, 33 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
>>>> index 33e7969..153e3ea 100644
>>>> --- a/fs/f2fs/gc.c
>>>> +++ b/fs/f2fs/gc.c
>>>> @@ -1030,7 +1030,12 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>>>    * threshold, we can make them free by checkpoint. Then, we
>>>>    * secure free segments which doesn't need fggc any more.
>>>>    */
>>>> - if (prefree_segments(sbi)) {
>>>> + if (prefree_segments(sbi) || prefree2_segments(sbi)) {
>>>> + ret = write_checkpoint(sbi, &cpc);
>>>> + if (ret)
>>>> + goto stop;
>>>> + }
>>>> + if (has_not_enough_free_secs(sbi, 0, 0) && prefree2_segments(sbi)) {
>>>>   ret = write_checkpoint(sbi, &cpc);
>>>>   if (ret)
>>>>   goto stop;
>>>> @@ -1063,8 +1068,10 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
>>>>   goto gc_more;
>>>>   }
>>>>  
>>>> - if (gc_type == FG_GC)
>>>> + if (gc_type == FG_GC) {
>>>> + ret = write_checkpoint(sbi, &cpc);
>>>>   ret = write_checkpoint(sbi, &cpc);
>>>> + }
>>>>   }
>>>>  stop:
>>>>   SIT_I(sbi)->last_victim[ALLOC_NEXT] = 0;
>>>> diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
>>>> index 2e8e054d..9dec445 100644
>>>> --- a/fs/f2fs/segment.c
>>>> +++ b/fs/f2fs/segment.c
>>>> @@ -1606,7 +1606,7 @@ static void set_prefree_as_free_segments(struct f2fs_sb_info *sbi)
>>>>   unsigned int segno;
>>>>  
>>>>   mutex_lock(&dirty_i->seglist_lock);
>>>> - for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi))
>>>> + for_each_set_bit(segno, dirty_i->dirty_segmap[PRE2], MAIN_SEGS(sbi))
>>>>   __set_test_and_free(sbi, segno);
>>>>   mutex_unlock(&dirty_i->seglist_lock);
>>>>  }
>>>> @@ -1617,13 +1617,17 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc)
>>>>   struct list_head *head = &dcc->entry_list;
>>>>   struct discard_entry *entry, *this;
>>>>   struct dirty_seglist_info *dirty_i = DIRTY_I(sbi);
>>>> - unsigned long *prefree_map = dirty_i->dirty_segmap[PRE];
>>>> + unsigned long *prefree_map;
>>>>   unsigned int start = 0, end = -1;
>>>>   unsigned int secno, start_segno;
>>>>   bool force = (cpc->reason & CP_DISCARD);
>>>> + int phase = 0;
>>>> + enum dirty_type dirty_type = PRE2;
>>>>  
>>>>   mutex_lock(&dirty_i->seglist_lock);
>>>>  
>>>> +next_step:
>>>> + prefree_map = dirty_i->dirty_segmap[dirty_type];
>>>>   while (1) {
>>>>   int i;
>>>>   start = find_next_bit(prefree_map, MAIN_SEGS(sbi), end + 1);
>>>> @@ -1635,7 +1639,7 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc)
>>>>   for (i = start; i < end; i++)
>>>>   clear_bit(i, prefree_map);
>>>>  
>>>> - dirty_i->nr_dirty[PRE] -= end - start;
>>>> + dirty_i->nr_dirty[dirty_type] -= end - start;
>>>>  
>>>>   if (!test_opt(sbi, DISCARD))
>>>>   continue;
>>>> @@ -1663,6 +1667,16 @@ void clear_prefree_segments(struct f2fs_sb_info *sbi, struct cp_control *cpc)
>>>>   else
>>>>   end = start - 1;
>>>>   }
>>>> + if (phase == 0) {
>>>> + /* status change: PRE -> PRE2 */
>>>> + for_each_set_bit(segno, dirty_i->dirty_segmap[PRE], MAIN_SEGS(sbi))
>>>> + if (!test_and_set_bit(segno, prefree_map))
>>>> + dirty_i->nr_dirty[PRE2]++;
>>>> +
>>>> + phase = 1;
>>>> + dirty_type = PRE;
>>>> + goto next_step;
>>>> + }
>>>>   mutex_unlock(&dirty_i->seglist_lock);
>>>>  
>>>>   /* send small discards */
>>>> @@ -2245,6 +2259,7 @@ static void change_curseg(struct f2fs_sb_info *sbi, int type)
>>>>  
>>>>   mutex_lock(&dirty_i->seglist_lock);
>>>>   __remove_dirty_segment(sbi, new_segno, PRE);
>>>> + __remove_dirty_segment(sbi, new_segno, PRE2);
>>>>   __remove_dirty_segment(sbi, new_segno, DIRTY);
>>>>   mutex_unlock(&dirty_i->seglist_lock);
>>>>  
>>>> diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
>>>> index 71a2aaa..f702726 100644
>>>> --- a/fs/f2fs/segment.h
>>>> +++ b/fs/f2fs/segment.h
>>>> @@ -263,6 +263,7 @@ enum dirty_type {
>>>>   DIRTY_COLD_NODE, /* dirty segments assigned as cold node logs */
>>>>   DIRTY, /* to count # of dirty segments */
>>>>   PRE, /* to count # of entirely obsolete segments */
>>>> + PRE2, /* to count # of the segments free to one checkpoint but obsolete to the other checkpoint */
>>>>   NR_DIRTY_TYPE
>>>>  };
>>>>  
>>>> @@ -477,6 +478,11 @@ static inline unsigned int prefree_segments(struct f2fs_sb_info *sbi)
>>>>   return DIRTY_I(sbi)->nr_dirty[PRE];
>>>>  }
>>>>  
>>>> +static inline unsigned int prefree2_segments(struct f2fs_sb_info *sbi)
>>>> +{
>>>> + return DIRTY_I(sbi)->nr_dirty[PRE2];
>>>> +}
>>>> +
>>>>  static inline unsigned int dirty_segments(struct f2fs_sb_info *sbi)
>>>>  {
>>>>   return DIRTY_I(sbi)->nr_dirty[DIRTY_HOT_DATA] +
>>>>
>>
>> .
>>
> 
> 
> .
>