Re: multiuser mount option regression

Satadru Pramanik <satadru@xxxxxxxxx> · Thu, 17 Mar 2022 10:09:55 -0400

Suspending, rebooting the server, and resuming reproduced the issue.

On Thu, Mar 17, 2022 at 9:15 AM ronnie sahlberg
<ronniesahlberg@xxxxxxxxx> wrote:
>
> On Thu, Mar 17, 2022 at 11:04 PM Satadru Pramanik <satadru@xxxxxxxxx> wrote:
> >
> > I have one of those set already:
> >
> > grep "UBSAN\|KASAN" /boot/config-5.17.0-051700rc8-generic
> > CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
> > CONFIG_UBSAN=y
> > # CONFIG_UBSAN_TRAP is not set
> > CONFIG_CC_HAS_UBSAN_BOUNDS=y
> > CONFIG_UBSAN_BOUNDS=y
> > CONFIG_UBSAN_ONLY_BOUNDS=y
> > CONFIG_UBSAN_SHIFT=y
> > # CONFIG_UBSAN_DIV_ZERO is not set
> > CONFIG_UBSAN_BOOL=y
> > CONFIG_UBSAN_ENUM=y
> > # CONFIG_UBSAN_ALIGNMENT is not set
> > CONFIG_UBSAN_SANITIZE_ALL=y
> > # CONFIG_TEST_UBSAN is not set
> > CONFIG_HAVE_ARCH_KASAN=y
> > CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
> > CONFIG_CC_HAS_KASAN_GENERIC=y
> > # CONFIG_KASAN is not set
> >
> >
> > What is weird is that this issue is only arising after suspending for
> > several hours now.  I can shut my screen, and come back in a few
> > minutes and reopen the screen and watch the machine come out of
> > suspend now, and the mount still work.
>
> Can you try to suspend, then reboot the server, then resume and see if
> that makes any difference?
> And if it reproduces the issue without having to wait several hours?
> I want to rule out if "some setting might make tcp sessions take
> several hours to time out".
>
>
> >
> > I come back after several hours after closing the screen, and that's
> > when the mount is broken.
> >
> > I think that shows up in the following dmesg. The first several lines
> > are from waking up this morning after the machine has been asleep for
> > many hours. Then I manually unmount an remount the cifs mount, and
> > then sleep/resume work.
> > Is there a deeper sleep state after several hours which is able to trigger this?
> >
> > [ 3130.975013] CIFS: VFS: cifs_tree_connect: could not find superblock: -22
> > [ 3132.991105] CIFS: VFS: cifs_tree_connect: could not find superblock: -22
> > [ 3135.007102] CIFS: VFS: cifs_tree_connect: could not find superblock: -22
> > [ 3136.296846] CIFS: VFS: cifs_tree_connect: could not find superblock: -22
> > [ 3139.363995] CIFS: Attempting to mount \\cheekon\localnet
> > [ 3397.489607] PM: suspend entry (deep)
> > [ 3400.856311] Filesystems sync: 3.366 seconds
> > [ 3400.856559] Freezing user space processes ... (elapsed 0.002 seconds) done.
> > [ 3400.858844] OOM killer disabled.
> > [ 3400.858845] Freezing remaining freezable tasks ... (elapsed 0.325
> > seconds) done.
> > [ 3401.184895] printk: Suspending console(s) (use no_console_suspend to debug)
> > [ 3401.707091] ACPI: EC: interrupt blocked
> > [ 3402.658748] ACPI: PM: Preparing to enter system sleep state S3
> > [ 3402.659592] ACPI: EC: event blocked
> > [ 3402.659593] ACPI: EC: EC stopped
> > [ 3402.659594] ACPI: PM: Saving platform NVS memory
> > [ 3402.659622] Disabling non-boot CPUs ...
> > [ 3402.661485] smpboot: CPU 1 is now offline
> > [ 3402.671576] smpboot: CPU 2 is now offline
> > [ 3402.689819] smpboot: CPU 3 is now offline
> > [ 3402.700758] smpboot: CPU 4 is now offline
> > [ 3402.703343] smpboot: CPU 5 is now offline
> > [ 3402.705884] smpboot: CPU 6 is now offline
> > [ 3402.708564] smpboot: CPU 7 is now offline
> > [ 3402.710867] ACPI: PM: Low-level resume complete
> > [ 3402.710901] ACPI: EC: EC started
> > [ 3402.710903] ACPI: PM: Restoring platform NVS memory
> > [ 3402.711337] Enabling non-boot CPUs ...
> > [ 3402.711422] x86: Booting SMP configuration:
> > [ 3402.711424] smpboot: Booting Node 0 Processor 1 APIC 0x2
> > [ 3403.034812] CPU1 is up
> > [ 3403.046711] smpboot: Booting Node 0 Processor 2 APIC 0x4
> > [ 3403.402877] CPU2 is up
> > [ 3403.414254] smpboot: Booting Node 0 Processor 3 APIC 0x6
> > [ 3403.770377] CPU3 is up
> > [ 3403.780699] smpboot: Booting Node 0 Processor 4 APIC 0x1
> > [ 3403.795814] CPU4 is up
> > [ 3403.795848] smpboot: Booting Node 0 Processor 5 APIC 0x3
> > [ 3404.207469] CPU5 is up
> > [ 3404.216103] smpboot: Booting Node 0 Processor 6 APIC 0x5
> > [ 3404.735629] CPU6 is up
> > [ 3404.744063] smpboot: Booting Node 0 Processor 7 APIC 0x7
> > [ 3405.351922] CPU7 is up
> > [ 3405.358923] ACPI: PM: Waking up from system sleep state S3
> > [ 3405.360630] ACPI: EC: interrupt unblocked
> > [ 3405.360798] pcieport 0000:00:1c.0: Enabling MPC IRBNCE
> > [ 3405.360802] pcieport 0000:00:1c.0: Intel PCH root port ACS workaround enabled
> > [ 3405.361022] pcieport 0000:00:1c.2: Enabling MPC IRBNCE
> > [ 3405.361025] pcieport 0000:00:1c.2: Intel PCH root port ACS workaround enabled
> > [ 3405.380367] pcieport 0000:00:1c.3: Enabling MPC IRBNCE
> > [ 3405.380371] pcieport 0000:00:1c.3: Intel PCH root port ACS workaround enabled
> > [ 3405.440455] pcieport 0000:00:1c.4: Enabling MPC IRBNCE
> > [ 3405.440459] pcieport 0000:00:1c.4: Intel PCH root port ACS workaround enabled
> > [ 3405.464845] ACPI: EC: event unblocked
> > [ 3405.466699] pcieport 0000:00:1c.0: pciehp: Slot(0): Card present
> > [ 3405.742733] nouveau 0000:01:00.0: devinit: 0x000066e5[0]: script
> > needs OR link
> > [ 3406.016118] nvme nvme0: 8/0/0 default/read/poll queues
> > [ 3406.117380] OOM killer enabled.
> > [ 3406.117382] Restarting tasks ... done.
> > [ 3406.147389] DMAR: DRHD: handling fault status reg 2
> > [ 3406.147408] DMAR: [DMA Write NO_PASID] Request device [00:02.0]
> > fault addr 0xfed69000 [fault reason 0x05] PTE Write access is not set
> > [ 3406.202211] DMAR: DRHD: handling fault status reg 2
> > [ 3406.202219] DMAR: [DMA Write NO_PASID] Request device [00:02.0]
> > fault addr 0x3dffc000 [fault reason 0x05] PTE Write access is not set
> > [ 3407.536312] pcieport 0000:00:1c.0: pciehp: Slot(0): No link
> > [ 3408.144356] PM: suspend exit
> > [ 3412.177126] IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready
> >
> > On Thu, Mar 17, 2022 at 8:50 AM Shyam Prasad N <nspmangalore@xxxxxxxxx> wrote:
> > >
> > > On Thu, Mar 17, 2022 at 11:53 AM Satadru Pramanik <satadru@xxxxxxxxx> wrote:
> > > >
> > > > I am testing Ronnie's patch from earlier today now, and will see if
> > > > the mount has broken later this morning, as I'm putting the machine to
> > > > sleep now.
> > > >
> > > > I tried just reverting 73f9bfbe3d818bb52266d5c9f3ba57d97842ffe7 in
> > > > 5.17-rc8, but it broke cifs mounting entirely with the error I
> > > > mentioned earlier today:
> > > >
> > > > [  242.560881] INFO: task mount.smb3:3219 blocked for more than 120 seconds.
> > > > [  242.560901]       Tainted: P           OE
> > > > 5.17.0-051700rc8-generic #202203132130
> > > > [  242.560904] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > > > disables this message.
> > > > [  242.560907] task:mount.smb3      state:D stack:    0 pid: 3219
> > > > ppid:     1 flags:0x00004006
> > > > [  242.560914] Call Trace:
> > > > [  242.560918]  <TASK>
> > > > [  242.560927]  __schedule+0x240/0x5a0
> > > > [  242.560939]  schedule+0x55/0xd0
> > > > [  242.560941]  schedule_preempt_disabled+0x15/0x20
> > > > [  242.560944]  __mutex_lock.constprop.0+0x2e0/0x4b0
> > > > [  242.560949]  __mutex_lock_slowpath+0x13/0x20
> > > > [  242.560953]  mutex_lock+0x34/0x40
> > > > [  242.560958]  cifs_get_smb_ses+0x367/0xab0 [cifs]
> > > > [  242.561108]  ? __queue_delayed_work+0x5c/0x90
> > > > [  242.561120]  mount_get_conns+0x63/0x430 [cifs]
> > > > [  242.561182]  cifs_mount+0x86/0x420 [cifs]
> > > > [  242.561222]  cifs_smb3_do_mount+0x10d/0x320 [cifs]
> > > > [  242.561252]  ? cifs_smb3_do_mount+0x10d/0x320 [cifs]
> > > > [  242.561283]  ? vfs_parse_fs_string+0x7f/0xb0
> > > > [  242.561290]  smb3_get_tree+0x3e/0x70 [cifs]
> > > > [  242.561337]  vfs_get_tree+0x27/0xc0
> > > > [  242.561343]  do_new_mount+0x14b/0x1a0
> > > > [  242.561348]  path_mount+0x1d4/0x530
> > > > [  242.561350]  ? putname+0x55/0x60
> > > > [  242.561357]  __x64_sys_mount+0x108/0x140
> > > > [  242.561360]  do_syscall_64+0x59/0xc0
> > > > [  242.561368]  ? do_syscall_64+0x69/0xc0
> > > > [  242.561372]  ? handle_mm_fault+0xba/0x290
> > > > [  242.561376]  ? do_user_addr_fault+0x1dd/0x670
> > > > [  242.561382]  ? syscall_exit_to_user_mode+0x27/0x50
> > > > [  242.561385]  ? exit_to_user_mode_prepare+0x37/0xb0
> > > > [  242.561392]  ? irqentry_exit_to_user_mode+0x9/0x20
> > > > [  242.561394]  ? irqentry_exit+0x33/0x40
> > > > [  242.561397]  ? exc_page_fault+0x89/0x180
> > > > [  242.561399]  ? asm_exc_page_fault+0x8/0x30
> > > > [  242.561405]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > > [  242.561409] RIP: 0033:0x7f42af11ceae
> > > > [  242.561414] RSP: 002b:00007fff6af66c48 EFLAGS: 00000206 ORIG_RAX:
> > > > 00000000000000a5
> > > > [  242.561418] RAX: ffffffffffffffda RBX: 000055dcbe40beb0 RCX: 00007f42af11ceae
> > > > [  242.561420] RDX: 000055dcbe1a447e RSI: 000055dcbe1a44da RDI: 00007fff6af67ea6
> > > > [  242.561421] RBP: 0000000000000000 R08: 000055dcbe40beb0 R09: 000055dcbe40cf40
> > > > [  242.561423] R10: 0000000000000000 R11: 0000000000000206 R12: 00007fff6af67e9b
> > > > [  242.561424] R13: 00007f42af237000 R14: 00007f42af23990f R15: 000055dcbe40cf40
> > > > [  242.561427]  </TASK>
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Mar 17, 2022 at 2:03 AM Steve French <smfrench@xxxxxxxxx> wrote:
> > > > >
> > > > > I narrowed the regression for multiuser mounts down (which Ronnie had
> > > > > mentioned) to this patch (one of the first applied to 5.17 merge
> > > > > window for cifs.ko).   I am curious whether this is also related to
> > > > > the hard to reproduce reconnect issue that the regression tracker was
> > > > > monitoring
> > > > >
> > > > > commit 73f9bfbe3d818bb52266d5c9f3ba57d97842ffe7 (HEAD -> tmp)
> > > > > Author: Shyam Prasad N <sprasad@xxxxxxxxxxxxx>
> > > > > Date:   Mon Jul 19 17:37:52 2021 +0000
> > > > >
> > > > >     cifs: maintain a state machine for tcp/smb/tcon sessions
> > > > >
> > > > >     If functions like cifs_negotiate_protocol, cifs_setup_session,
> > > > >     cifs_tree_connect are called in parallel on different channels,
> > > > >     each of these will be execute the requests. This maybe unnecessary
> > > > >     in some cases, and only the first caller may need to do the work.
> > > > >
> > > > >     This is achieved by having more states for the tcp/smb/tcon session
> > > > >     status fields. And tracking the state of reconnection based on the
> > > > >     state machine.
> > > > >
> > > > >     For example:
> > > > >     for tcp connections:
> > > > >     CifsNew/CifsNeedReconnect ->
> > > > >       CifsNeedNegotiate ->
> > > > >         CifsInNegotiate ->
> > > > >           CifsNeedSessSetup ->
> > > > >             CifsInSessSetup ->
> > > > >               CifsGood
> > > > >
> > > > >     for smb sessions:
> > > > >     CifsNew/CifsNeedReconnect ->
> > > > >       CifsGood
> > > > >
> > > > >      CifsNew/CifsNeedReconnect ->
> > > > >       CifsInFilesInvalidate ->
> > > > >         CifsNeedTcon ->
> > > > >           CifsInTcon ->
> > > > >             CifsGood
> > > > >
> > > > >     If any channel reconnect sees that it's in the middle of
> > > > >     transition to CifsGood, then they can skip the function.
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > >
> > > > > Steve
> > >
> > > Hi Satadru,
> > >
> > > If you are able to build the kernel yourself, can you try building
> > > with a couple of these extra config options?
> > > CONFIG_KASAN=y
> > > CONFIG_UBSAN=y
> > >
> > > I tried these two tests a few times on my VM:
> > > 1. suspend/resume of the VM while share is mounted
> > > 2. restart smbd when share is mounted.
> > >
> > > In both cases, I'm unable to repro this.
> > > The above config options may slow down your system, since it enables
> > > memory sanitization. You may use this customized kernel just for the
> > > repro.
> > >
> > > --
> > > Regards,
> > > Shyam