On Fri, 2023-08-11 at 12:15 -0300, Paulo Alcantara wrote: > Jeff Layton <jlayton@xxxxxxxxxx> writes: > > > On Wed, 2021-06-23 at 19:34 -0500, Steve French wrote: > > > updated patch attached with Aurelien's suggestion. > > > > > > On Wed, Jun 23, 2021 at 7:17 AM Paulo Alcantara <pc@xxxxxx> wrote: > > > > > > > > Agreed. > > > > > > > > On June 23, 2021 8:48:24 AM GMT-03:00, "Aurélien Aptel" <aaptel@xxxxxxxx> wrote: > > > > > Steve French <smfrench@xxxxxxxxx> writes: > > > > > > We weren't checking if tcon is null before setting dfs path, > > > > > > although we check for null tcon in an earlier assignment statement. > > > > > > > > > > If tcon is NULL there is no point in continuing in that function, we > > > > > should have exited earlier. > > > > > > > > > > If tcon is NULL it means mount_get_conns() failed so presumably rc will > > > > > be != 0 and we would goto error. > > > > > > > > > > I don't think this is needed. We could change the existing check after > > > > > the loop to this you really want to be safe: > > > > > > > > > > if (rc || !tcon) > > > > > goto error; > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > I know this patch is ancient and the mainline code has marched on, but > > it seems really suspicious to me. > > Yes, it is. > > > With this, we have cifs_mount returning 0, even though the superblock > > hasn't been properly initialized. Is that expected? Shouldn't it return > > an error in that case? > > No, that isn't expected. And yes, if @tcon would ever be NULL at that > point, we should be returning an error instead. Otherwise we'd end up > dereferencing a NULL @tcon while trying to get an inode for the root > dentry later. > > However, by quickly looking at the old code -- on top of 162004a2f7ef -- > I don't see how we'd end up having a NULL @tcon with rc == 0 as > mount_get_conns() would return -errno if it couldn't get a tcon. Please > correct me if I'm missing something. Whether it is possibile or not, > the NULL @tcon check is certainly missing a 'rc = -ENOENT' or some other > error before bailing out as you've pointed out. Thanks for the confirmation. There were some oopses on some RHEL8 (5.14 based kernels). The stack looked something like this: PID: 2415716 TASK: ffff937139090000 CPU: 3 COMMAND: "ls" #0 [ffff9ef946b23728] machine_kexec at ffffffffac867cfe #1 [ffff9ef946b23780] __crash_kexec at ffffffffac9ad94d #2 [ffff9ef946b23848] crash_kexec at ffffffffac9ae881 #3 [ffff9ef946b23860] oops_end at ffffffffac8274f1 #4 [ffff9ef946b23880] no_context at ffffffffac879a03 #5 [ffff9ef946b238d8] __bad_area_nosemaphore at ffffffffac879d64 #6 [ffff9ef946b23920] do_page_fault at ffffffffac87a617 #7 [ffff9ef946b23950] page_fault at ffffffffad20111e [exception RIP: cifs_mount+1126] RIP: ffffffffc08e8826 RSP: ffff9ef946b23a00 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff936c8221ea00 RCX: ffff936f8018b320 RDX: 0000000000000001 RSI: ffff936f8018b420 RDI: ffff936c8221ea00 RBP: ffff9ef946b23a90 R8: 5346445756535c5c R9: 6765642e50313031 R10: 65622e666f6f7267 R11: 0063696c6275505c R12: ffff9370b192fc00 R13: ffff936f8018b420 R14: 00000000003097ad R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffff9ef946b23a98] cifs_smb3_do_mount at ffffffffc08d11f2 [cifs] #9 [ffff9ef946b23aa0] cifs_smb3_do_mount at ffffffffc08d11f2 [cifs] #10 [ffff9ef946b23b08] smb3_get_tree at ffffffffc0930ae0 [cifs] #11 [ffff9ef946b23b30] vfs_get_tree at ffffffffacb52365 #12 [ffff9ef946b23b50] fc_mount at ffffffffacb7485e #13 [ffff9ef946b23b60] vfs_kern_mount at ffffffffacb748ec #14 [ffff9ef946b23b80] cifs_dfs_do_automount at ffffffffc093552e [cifs] #15 [ffff9ef946b23bc0] cifs_dfs_d_automount at ffffffffc0935880 [cifs] #16 [ffff9ef946b23bd0] follow_managed at ffffffffacb5bdaf #17 [ffff9ef946b23c10] lookup_fast at ffffffffacb5c7e5 #18 [ffff9ef946b23c68] walk_component at ffffffffacb5d258 #19 [ffff9ef946b23cc8] path_lookupat at ffffffffacb5e215 #20 [ffff9ef946b23d28] filename_lookup at ffffffffacb62710 #21 [ffff9ef946b23e40] vfs_statx at ffffffffacb55874 #22 [ffff9ef946b23e98] __do_sys_statx at ffffffffacb5692b #23 [ffff9ef946b23f38] do_syscall_64 at ffffffffac8043ab #24 [ffff9ef946b23f50] entry_SYSCALL_64_after_hwframe at ffffffffad2000a9 RIP: 00007ff6a2637edf RSP: 00007ffe040017d0 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 00007ffe04001910 RCX: 00007ff6a2637edf RDX: 0000000000000100 RSI: 00007ffe04001910 RDI: 00000000ffffff9c RBP: 0000000000000100 R8: 00007ffe040017f0 R9: 00000000ffffff9c R10: 0000000000000002 R11: 0000000000000246 R12: 00007ffe040017f0 R13: 0000000000000000 R14: 0000000000000003 R15: 000055ce1a3ae1b8 ORIG_RAX: 000000000000014c CS: 0033 SS: 002b Analysis of the vmcore by Roberto showed that we had ended up past that point with tcon==NULL and rc==0. Steve's patch would have fixed the panic there, but I think the host would have ended up with a successful mount, but with a broken superblock. The current code seems a bit less fragile, and I didn't see any similar brokenness there, but I didn't look too hard either. In any case, we'll plan to fix this up with a one-off in RHEL/Centos. Thanks again for the sanity check! -- Jeff Layton <jlayton@xxxxxxxxxx>