> On Feb 25, 2021, at 1:58 PM, dai.ngo@xxxxxxxxxx wrote: > > > On 2/24/21 6:26 PM, dai.ngo@xxxxxxxxxx wrote: >> Hi Olga and Bruce, >> >> On 2/24/21 2:35 PM, Olga Kornievskaia wrote: >>> On Mon, Feb 22, 2021 at 5:09 PM <dai.ngo@xxxxxxxxxx> wrote: >>>> >>>> On 2/22/21 2:01 PM, dai.ngo@xxxxxxxxxx wrote: >>>>> On 2/22/21 1:46 PM, dai.ngo@xxxxxxxxxx wrote: >>>>>> On 2/22/21 10:34 AM, dai.ngo@xxxxxxxxxx wrote: >>>>>>> On 2/20/21 8:16 PM, dai.ngo@xxxxxxxxxx wrote: >>>>>>>> On 2/20/21 6:08 AM, Olga Kornievskaia wrote: >>>>>>>>> On Fri, Feb 19, 2021 at 10:21 PM J. Bruce Fields >>>>>>>>> <bfields@xxxxxxxxxxxx> wrote: >>>>>>>>>> On Fri, Feb 19, 2021 at 05:31:58PM -0800, dai.ngo@xxxxxxxxxx wrote: >>>>>>>>>>> If this is the cause why we don't drop the mount after the copy >>>>>>>>>>> then I can restore the patch and look into this problem. >>>>>>>>>>> Unfortunately, >>>>>>>>>>> all my test machines are down for maintenance until Sunday/Monday. >>>>>>>>>> I think we can take some time to figure out what's actually going on >>>>>>>>>> here before reverting anything. >>>>>>>>> Yes I agree. We need to fix the use-after-free and also make sure >>>>>>>>> that >>>>>>>>> reference will go away. >>>>>>> I reverted the patch, verified the warning message is back: >>>>>>> >>>>>>> Feb 22 10:07:45 nfsvmf24 kernel: ------------[ cut here ]------------ >>>>>>> Feb 22 10:07:45 nfsvmf24 kernel: refcount_t: underflow; use-after-free. >>>>>>> >>>>>>> then did a inter-server copy and waited for more than 20 mins and >>>>>>> the destination server still maintains the session with the source >>>>>>> server. It must be some other references that prevent the mount >>>>>>> to go away. >>>>>> This change fixed the unmount after inter-server copy problem: >>>>>> >>>>>> diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c >>>>>> index 8d6d2678abad..87687cd18938 100644 >>>>>> --- a/fs/nfsd/nfs4proc.c >>>>>> +++ b/fs/nfsd/nfs4proc.c >>>>>> @@ -1304,7 +1304,7 @@ nfsd4_cleanup_inter_ssc(struct vfsmount >>>>>> *ss_mnt, struct nfsd_file *src, >>>>>> struct nfsd_file *dst) >>>>>> { >>>>>> nfs42_ssc_close(src->nf_file); >>>>>> - /* 'src' is freed by nfsd4_do_async_copy */ >>>>>> + nfsd_file_put(src); >>>>>> nfsd_file_put(dst); >>>>>> mntput(ss_mnt); >>>>>> } >>>> This change is not need. It's left over from my testing to >>>> reproduce the warning messages. Only the change in >>>> nfsd4_do_async_copy is needed for the unmount problem. >>>> >>>> -Dai >>>> >>>>>> @@ -1472,14 +1472,12 @@ static int nfsd4_do_async_copy(void *data) >>>>>> copy->nf_src = kzalloc(sizeof(struct nfsd_file), >>>>>> GFP_KERNEL); >>>>>> if (!copy->nf_src) { >>>>>> copy->nfserr = nfserr_serverfault; >>>>>> - nfsd4_interssc_disconnect(copy->ss_mnt); >>>>>> goto do_callback; >>>>>> } >>>>>> copy->nf_src->nf_file = nfs42_ssc_open(copy->ss_mnt, >>>>>> ©->c_fh, >>>>>> ©->stateid); >>>>>> if (IS_ERR(copy->nf_src->nf_file)) { >>>>>> copy->nfserr = nfserr_offload_denied; >>>>>> - nfsd4_interssc_disconnect(copy->ss_mnt); >>>>>> goto do_callback; >>>>>> } >>>>>> } >>>>>> @@ -1498,6 +1496,7 @@ static int nfsd4_do_async_copy(void *data) >>>>>> &nfsd4_cb_offload_ops, >>>>>> NFSPROC4_CLNT_CB_OFFLOAD); >>>>>> nfsd4_run_cb(&cb_copy->cp_cb); >>>>>> out: >>>>>> + nfsd4_interssc_disconnect(copy->ss_mnt); >>>>>> if (!copy->cp_intra) >>>>>> kfree(copy->nf_src); >>>>>> cleanup_async_copy(copy); >>>>>> >>>>>> But there is something new. I tried inter-server copy twice. >>>>>> First time I can verify from tshark capture that a session was >>>>>> created and destroy, along with all the NFS ops. On 2nd try, >>>>>> I can >>> Hi Dai/Bruce, >>> >>> While I believe the fix works (as in the mount goes away), I'm not >>> comfortable with this fix as I believe we will be leaking resources. >>> Server calls nfs42_ssc_open() which creates a legit file pointer (yes >>> it takes a reference on superblock but it also allocates memory for >>> "file" structure). Normally a file structure requires doing an fput() >>> which if I look thru the code does a bunch of things before in the end >>> calling mntput(). While we free the copy->nf_src in >>> nfs4_do_asyn_copy(), the copy->nf_src->nf_file was allocated >>> separately and would have been freed from calling fput() on it. >>> >>> So I guess it's not correct to do kfree(copy->nf_src) in teh >>> nfs4_do_async_copy() I think that's probably where use-after-free >>> comes in. We need to keep it until the cleanup. I'm thinking perhaps >>> call fput(copy->nf_src->nf_file) first and then free it? Just an idea >>> but I haven't tested it. >> >> I think the unmount can be treated separately and the fix seems >> to be valid for this issue. I think kfree(copy->nf_src) is called >> after nfsd4_cleanup_inter_ssc so it's not the reason for the >> 'use-after-free' warning. A quick look at nfsd4_do_async_copy >> I see nf_src was kzalloc'ed but no ref count added to it. >> >> I will review the whole cleanup part and report back. > > I think this would fix the resource leak problem: > > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c > index 57b3821d975a..742fc128fdc8 100644 > --- a/fs/nfs/nfs4file.c > +++ b/fs/nfs/nfs4file.c > @@ -405,6 +405,11 @@ static void __nfs42_ssc_close(struct file *filep) > struct nfs_open_context *ctx = nfs_file_open_context(filep); > ctx->state->flags = 0; > + if (!filep) > + return; > + get_file(filep); > + filp_close(filep, NULL); > + fput(filep); > } > static const struct nfs4_ssc_client_ops nfs4_ssc_clnt_ops_tbl = { > > -Dai I'm not clear on the final outcome. Is a code change needed? If so, can someone post a patch (with a full description and Signed-off-by) or point me to one that's been posted already? Thanks! -- Chuck Lever