On 2022/11/8 18:33, Ryusuke Konishi wrote:
> On Tue, Nov 8, 2022 at 3:49 PM Liu Shixin wrote:
>>
>>
>> On 2022/11/8 12:41, Ryusuke Konishi wrote:
>>> Hi Liu Shixin,
>>>
>>> On Tue, Nov 8, 2022 at 10:41 AM Liu Shixin wrote:
>>>> Syzbot reported a NULL pointer dereference:
>>>>
>>>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000168
>>>> Mem abort info:
>>>>   ESR = 0x0000000096000004
>>>>   EC = 0x25: DABT (current EL), IL = 32 bits
>>>>   SET = 0, FnV = 0
>>>>   EA = 0, S1PTW = 0
>>>>   FSC = 0x04: level 0 translation fault
>>>> Data abort info:
>>>>   ISV = 0, ISS = 0x00000004
>>>>   CM = 0, WnR = 0
>>>> user pgtable: 4k pages, 48-bit VAs, pgdp=0000000108bcf000
>>>> [0000000000000168] pgd=0000000000000000, p4d=0000000000000000
>>>> Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
>>>> Modules linked in:
>>>> CPU: 1 PID: 3032 Comm: segctord Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0
>>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022
>>>> pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> pc : _compound_head include/linux/page-flags.h:253 [inline]
>>>> pc : lock_page+0x28/0x1e0 include/linux/pagemap.h:958
>>>> lr : lock_page+0x28/0x1e0 include/linux/pagemap.h:956
>>>> sp : ffff80001290bc00
>>>> x29: ffff80001290bc00 x28: ffff80001290bde0 x27: 000000000000001b
>>>> x26: fffffc000330d7c0 x25: ffff0000caa56d68 x24: ffff0000ca9fb1c0
>>>> x23: 0000000000000080 x22: ffff0000ca9fb130 x21: 0000000000000160
>>>> x20: ffff0000c91e10b8 x19: 0000000000000160 x18: 00000000000000c0
>>>> x17: ffff80000dd0b198 x16: ffff80000db49158 x15: ffff0000c3e63500
>>>> x14: 0000000000000000 x13: 00000000ffffffff x12: ffff0000c3e63500
>>>> x11: ff808000095d1a0c x10: 0000000000000000 x9 : 0000000000000000
>>>> x8 : 0000000000000000 x7 : ffff80000856806c x6 : 0000000000000000
>>>> x5 : 0000000000000080 x4 : 0000000000000000 x3 : 0000000000000000
>>>> x2 : 0000000000000000 x1 : ffff80000cb431b1 x0 : 0000000000000000
>>>> Call trace:
>>>>  lock_page+0x28/0x1e0 include/linux/pagemap.h:956
>>>>  nilfs_segctor_prepare_write+0x6c/0x21c fs/nilfs2/segment.c:1658
>>>>  nilfs_segctor_do_construct+0x9f4/0xee8 fs/nilfs2/segment.c:2068
>>>>  nilfs_segctor_construct+0xa0/0x380 fs/nilfs2/segment.c:2375
>>>>  nilfs_segctor_thread_construct fs/nilfs2/segment.c:2483 [inline]
>>>>  nilfs_segctor_thread+0x180/0x660 fs/nilfs2/segment.c:2566
>>>>  kthread+0x12c/0x158 kernel/kthread.c:376
>>>>  ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
>>>>
>>>> If didn't call nilfs_sufile_alloc() in nilfs_segctor_begin_construction(),
>>>> nilfs_sufile_header's sh_last_alloc is not updated. In such case, we will
>>>> add a bh in two segbuf->sb_segsum_buffers. And finally cause list error.
>>>>
>>>> Reported-by: syzbot+77e4f005cb899d4268d1@xxxxxxxxxxxxxxxxxxxxxxxxx
>>>> Fixes: 9ff05123e3bf ("nilfs2: segment constructor")
>>>> Signed-off-by: Liu Shixin <liushixin2@xxxxxxxxxx>
>>>> ---
>>>>  fs/nilfs2/segment.c | 1 +
>>>>  fs/nilfs2/sufile.c  | 2 +-
>>>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> index b4cebad21b48..7be632c15f91 100644
>>>> --- a/fs/nilfs2/segment.c
>>>> +++ b/fs/nilfs2/segment.c
>>>> @@ -1371,6 +1371,7 @@ static int nilfs_segctor_extend_segments(struct nilfs_sc_info *sci,
>>>>                  sci->sc_segbuf_nblocks += segbuf->sb_rest_blocks;
>>>>
>>>>                  /* allocate the next next full segment */
>>>> +                nextnextnum = segbuf->sb_segnum;
>>>>                  err = nilfs_sufile_alloc(sufile, &nextnextnum);
>>>>                  if (unlikely(err))
>>>>                          goto failed_segbuf;
>>>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>>>> index 77ff8e95421f..853a8212114f 100644
>>>> --- a/fs/nilfs2/sufile.c
>>>> +++ b/fs/nilfs2/sufile.c
>>>> @@ -317,7 +317,7 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
>>>>                  goto out_sem;
>>>>          kaddr = kmap_atomic(header_bh->b_page);
>>>>          header = kaddr + bh_offset(header_bh);
>>>> -        last_alloc = le64_to_cpu(header->sh_last_alloc);
>>>> +        last_alloc = max(le64_to_cpu(header->sh_last_alloc), *segnump);
>>>>          kunmap_atomic(kaddr);
>>>>
>>>>          nsegments = nilfs_sufile_get_nsegments(sufile);
>>>> --
>>>> 2.25.1
>>> Thank you for your help. I have a few questions, so I'll ask them below.
>>>
>>>> If didn't call nilfs_sufile_alloc() in nilfs_segctor_begin_construction(),
>>>> nilfs_sufile_header's sh_last_alloc is not updated. In such case, we will
>>>> add a bh in two segbuf->sb_segsum_buffers.
>>> If nilfs_sufile_alloc() succeeds to allocate a segment, sh_last_alloc
>>> is updated.
>>> all segment allocation must be done through nilfs_sufile_alloc().
>>> And, the allocated segment is marked dirty on the sufile not to be
>>> reallocated until it's freed.
>>>
>>> So, why is it happening that the same segment is allocated twice in a log ?
>>> Is it hard to fix the problem by correcting the calling sequence of
>>> nilfs_sufile_alloc()/free()/etc without touching nilfs_sufile_alloc()
>>> ?
>> The problem happened when we call nilfs_segctor_begin_construction() and satisfied
>> condition nilfs->ns_segnum != nilfs->ns_nextnum. In such scenario, nilfs_sufile_alloc()
>> will be skipped, but we call nilfs_segbuf_map() and nilfs_segbuf_set_next_segnum()
>> all the time, so last_alloc is not updated.
>> Then in nilfs_segctor_extend_segments(), we set sb_segnum by prev->sb_nextnum directly,
>> and calculate next sb_segnum by nilfs_sufile_alloc(), since last_alloc is not updated,
>> we will get sb_segnum again.
>>
>> By the way, I still don't understand why skip nilfs_sufile_alloc() in some cases and why
>> nilfs->ns_segnum != nilfs->ns_nextnum. Do you have any ideas?
>>> I haven't looked closely at this patch yet, but I'm concerned about
>>> the impact on other places as well.
>>> nilfs_sufile_alloc() is also used in
>>> nilfs_segctor_begin_construction() and
>>> nilfs_prepare_segment_for_recovery(). Are there any side effects?
>>>
>>> This patch turns an output-only argument into both input and output,
>>> and that input value is always used in the calculation of
>>> "last_alloc".
>>> So, this change requires all callers to pass a meaningful initial
>>> value (at least a valid value) to *segnump.
>>>
>>> Another question, will this work near the end of the segments ?
>>> Since segments are used cyclically, wouldn't comparison with the max
>>> function break down there?
>>> (I mean it seems that sh_last_alloc may be chosen unintentionally at the end.)
>> Thanks for the heads-up,I need to look at it again. This patch can only prevent this problem,
>> and seems to need improvement. Maybe there is a more reasonable solution.
>>
>> Thanks,
> I will explain step by step.
>
>> By the way, I still don't understand why skip nilfs_sufile_alloc() in some cases and why
>> nilfs->ns_segnum != nilfs->ns_nextnum. Do you have any ideas?
> nilfs->ns_segnum holds the segment number where nilfs2 currently
> writes logs to, and nilfs->ns_nextnum holds the next segment number
> where nilfs2 will select to write logs when the current segment gets
> full.
> Here, a segment is a fixed sized container of logs (*), in which
> multiple logs can be appended and written in.
> (*) Exceptionally, segment #0 is shorter than others to exclude the
> superblock area.
>
> The reason "nextnum" is needed is that mount-time recovery processing
> needs to know which segment the logs follow.
>
> Both segments (nilfs->ns_segnum and nilfs->ns_nextnum) must be already
> allocated on the sufile.
> Here, "allocated" means the corresponding segment usage entry is
> marked "dirty" on the sufile.
> In nilfs_sufile_alloc() this is done with the following line:
>
>     nilfs_segment_usage_set_dirty(su);
>
> On the other hand, the test whether the segment is allocatable or not
> is done at the following part:
>
>     if (!nilfs_segment_usage_clean(su))
>             continue;
>     /* found a clean segment */
>
> Coming back to your last question, if nilfs->ns_segnum !=
> nilfs->ns_nextnum, it indicates that nilfs2 is still writing a log to
> the segment of ns_segnum.
> Both should be in the allocated state, so there is no need to call
> nilfs_sufile_alloc() in this case.
>
> If nilfs->ns_segnum == nilfs->ns_nextnum in
> nilfs_segctor_begin_construction(), then nilfs_shift_to_next_segment()
> sets ns_segnum to ns_nextnum because the current segment (ns_segnum)
> is full and no more logs can be written.
> In this case, a new segment must be allocated for new ns_nextnum,
> therefore nilfs_sufile_alloc() will be called.
>
>> The problem happened when we call nilfs_segctor_begin_construction() and satisfied
>> condition nilfs->ns_segnum != nilfs->ns_nextnum. In such scenario, nilfs_sufile_alloc()
>> will be skipped, but we call nilfs_segbuf_map() and nilfs_segbuf_set_next_segnum()
>> all the time, so last_alloc is not updated.
> As I explained above, it is expected behavior that
> nilfs_sufile_alloc() is not called and last_alloc is not updated if
> nilfs->ns_segnum != nilfs->ns_nextnum.
>
> nilfs_segbuf_map() is always called there to set the geometry
> information in the segbuf. This is expected behavior.
> And, nilfs_segbuf_set_next_segnum() is called later to set the
> determined ns_nextnum to the segbuf, so this is expected behavior as
> well.
>
>> Then in nilfs_segctor_extend_segments(), we set sb_segnum by prev->sb_nextnum directly,
>> and calculate next sb_segnum by nilfs_sufile_alloc(), since last_alloc is not updated,
>> we will get sb_segnum again.
> nilfs_segctor_extend_segments() pre-allocates one or more segments
> large enough to store updated blocks of metadata files that need to be
> written in a series of logs at once, and sets up a chain of segbufs.
> (Sorry for the missing function comment).
>
> sb_segnum is set by prev->sb_nextnum to form a chain of buffers for
> segments. This is expected behavior.
> And, the sb_nextnum (= next sb_segnum) will be given by
> nilfs_sufile_alloc(). This is also expected.
> It looks like the problem is that nilfs_sufile_alloc() here allocates
> the same segnum again.
Thanks to your series of explanations, I learned a lot. So I can assume that
nilfs_sufile_alloc() is skipped here because the segment has already been allocated,
and the next allocation will naturally skip the previous sb_segnum, so this problem
should not normally happen. (I wrote down my understanding as rough pseudo-code at
the end of this mail.)
>
> Because sb_segnum is set by prev->sb_nextnum which is allocated by the
> previous nilfs_sufile_alloc() call,
> this usually does not happen.
>
> A possible anomaly is if the segment pointed by the first nextnum (or
> segnum) was not marked dirty on sufile.
> This may happen if the sufile is corrupted on the test disk image that
> syzbot provided (mounted).
>
> Can you confirm if this is actually happening?
This problem was discovered through syzbot and can be reproduced. I think the root
cause is data corruption rather than a logic error, so we should add some protection
instead of modifying the logic, right? (A rough sketch of the kind of check I have in
mind is at the end of this mail.)
>
> Regards,
> Ryusuke Konishi
> .
>
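
P.S. To check my understanding, here is how I now read the allocation decision in
nilfs_segctor_begin_construction(). This is only my rough pseudo-code paraphrase of
your explanation, not the actual fs/nilfs2/segment.c code, and the local variables
(nextnum, err) are just placeholders:

        if (nilfs->ns_segnum == nilfs->ns_nextnum) {
                /* current segment is full: move on to ns_nextnum ... */
                nilfs_shift_to_next_segment(nilfs);     /* ns_segnum = ns_nextnum */
                /*
                 * ... and allocate a fresh segment to become the new
                 * ns_nextnum; nilfs_sufile_alloc() is what advances
                 * sh_last_alloc and marks the segment usage dirty.
                 */
                err = nilfs_sufile_alloc(nilfs->ns_sufile, &nextnum);
        } else {
                /*
                 * Still writing logs into ns_segnum.  Both ns_segnum and
                 * ns_nextnum are already allocated (dirty) on the sufile,
                 * so no nilfs_sufile_alloc() call is needed and
                 * sh_last_alloc stays unchanged.
                 */
                nextnum = nilfs->ns_nextnum;
        }
        nilfs_segbuf_set_next_segnum(segbuf, nextnum, nilfs);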
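
And here is an untested, rough sketch of the kind of protection I have in mind,
placed next to the nilfs_sufile_alloc() call in nilfs_segctor_extend_segments().
It only catches the immediate double allocation seen in this report; the -EIO
choice and the message are just illustrative, assuming nilfs_err() and
sci->sc_super can be used at this point:

                /* allocate the next next full segment */
                err = nilfs_sufile_alloc(sufile, &nextnextnum);
                if (unlikely(err))
                        goto failed_segbuf;
                /*
                 * Defensive check: a corrupted sufile can hand out a segment
                 * that is still in use, so the same segment number comes back
                 * and the same buffer heads end up linked into two
                 * sb_segsum_buffers lists.  Refuse to continue instead.
                 */
                if (unlikely(nextnextnum == segbuf->sb_segnum)) {
                        nilfs_err(sci->sc_super,
                                  "segment %llu allocated twice; sufile may be corrupted",
                                  (unsigned long long)nextnextnum);
                        err = -EIO;
                        goto failed_segbuf;
                }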