Re: [PATCH v2] cifs: ignore cached share root handle closing errors

Pavel Shilovsky <piastryyy@xxxxxxxxx> · Fri, 3 Apr 2020 11:04:29 -0700

ср, 1 апр. 2020 г. в 05:50, Aurelien Aptel <aaptel@xxxxxxxx>:
>
> Fix tcon use-after-free and NULL ptr deref.
>
> Customer system crashes with the following kernel log:
>
> [462233.169868] CIFS VFS: Cancelling wait for mid 4894753 cmd: 14       => a QUERY DIR
> [462233.228045] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.305922] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.306205] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.347060] CIFS VFS: cifs_put_smb_ses: Session Logoff failure rc=-4
> [462233.347107] CIFS VFS: Close unmatched open
> [462233.347113] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> ...
>     [exception RIP: cifs_put_tcon+0xa0] (this is doing tcon->ses->server)
>  #6 [...] smb2_cancelled_close_fid at ... [cifs]
>  #7 [...] process_one_work at ...
>  #8 [...] worker_thread at ...
>  #9 [...] kthread at ...
>
> The most likely explanation we have is:
>
> * When we put the last reference of a tcon (refcount=0), we close the
>   cached share root handle.
> * If closing a handle is interupted, SMB2_close() will
>   queue a SMB2_close() in a work thread.
> * The queued object keeps a tcon ref so we bump the tcon
>   refcount, jumping from 0 to 1.
> * We reach the end of cifs_put_tcon(), we free the tcon object despite
>   it now having a refcount of 1.
> * The queued work now runs, but the tcon, ses & server was freed in
>   the meantime resulting in a crash.
>
> THREAD 1
> ========
> cifs_put_tcon                 => tcon refcount reach 0
>   SMB2_tdis
>    close_shroot_lease
>     close_shroot_lease_locked => if cached root has lease && refcount reach 0
>      smb2_close_cached_fid    => if cached root valid
>       SMB2_close              => retry close in a worker thread if interrupted
>        smb2_handle_cancelled_close
>         __smb2_handle_cancelled_close    => !! tcon refcount bump 0 => 1 !!
>          INIT_WORK(&cancelled->work, smb2_cancelled_close_fid);
>          queue_work(cifsiod_wq, &cancelled->work) => queue work
>  tconInfoFree(tcon);    ==> freed!
>  cifs_put_smb_ses(ses); ==> freed!
>
> THREAD 2 (workqueue)
> ========
> smb2_cancelled_close_fid
>   SMB2_close(0, cancelled->tcon, ...); => use-after-free of tcon
>   cifs_put_tcon(cancelled->tcon);      => tcon refcount reach 0 second time
>   *CRASH*
>
> Fixes: d9191319358d ("CIFS: Close cached root handle only if it has a lease")
> Signed-off-by: Aurelien Aptel <aaptel@xxxxxxxx>
> ---
>  fs/cifs/smb2misc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/cifs/smb2misc.c b/fs/cifs/smb2misc.c
> index 0511aaf451d4..965276aeffcf 100644
> --- a/fs/cifs/smb2misc.c
> +++ b/fs/cifs/smb2misc.c
> @@ -766,6 +766,12 @@ smb2_handle_cancelled_close(struct cifs_tcon *tcon, __u64 persistent_fid,
>
>         cifs_dbg(FYI, "%s: tc_count=%d\n", __func__, tcon->tc_count);
>         spin_lock(&cifs_tcp_ses_lock);
> +       if (tcon->tc_count <= 0) {
> +               spin_unlock(&cifs_tcp_ses_lock);
> +               cifs_dbg(VFS, "tcon is closing, skipping async close retry of fid %llu %llu\n",

Thanks for a good catch! Some suggestions below:

- Why VFS? Share is being disconnected anyway, so all associated
handles should be closed by the server eventually. I think FYI is
sufficient here.
- cifs_dbg_server may be better than cifs_dbg and please include tcon
ID here otherwise FIDs may be meaningless if there are several shares
mounted on a host.
- May be WARN_ONCE if tc_count is negative since it is a possible bug?

Reviewed-by: Pavel Shilovsky <pshilov@xxxxxxxxxxxxx>

--
Best regards,
Pavel Shilovsky