Hi Olga,

we have installed kernel 4.7.0 on one of the nodes and don't see missing
closes from that node.

Nevertheless, I don't think that the commit you have mentioned is fixing
that, as it fixes OPEN_DOWNGRADE, but we have a sequence of
OPEN->CLOSE->OPEN. The OPEN_DOWNGRADE is not expected - the file is already
closed when the second open is sent, and both requests are using the same
session slot.

Have you seen a similar issue on a vanilla or RHEL kernel?

Thanks a lot,
   Tigran.

----- Original Message -----
> From: "Olga Kornievskaia" <aglo@xxxxxxxxx>
> To: "Mkrtchyan, Tigran" <tigran.mkrtchyan@xxxxxxx>
> Cc: "Andy Adamson" <William.Adamson@xxxxxxxxxx>, "Linux NFS Mailing List" <linux-nfs@xxxxxxxxxxxxxxx>, "Trond Myklebust"
> <trond.myklebust@xxxxxxxxxxxxxxx>, "Steve Dickson" <steved@xxxxxxxxxx>
> Sent: Thursday, July 14, 2016 4:52:59 PM
> Subject: Re: Lost CLOSE with NFSv4.1 on RHEL7 ( and bejond?)

> Hi Tigran,
>
> On Wed, Jul 13, 2016 at 12:49 PM, Mkrtchyan, Tigran
> <tigran.mkrtchyan@xxxxxxx> wrote:
>>
>>
>> Hi Andy,
>>
>> I will try to get an upstream kernel on one of the nodes. It will take
>> some time as we need to add a new host into the cluster and get
>> some traffic to go through it.
>>
>> In the meantime, with RHEL7 it is easily reproduced - about 10
>> such cases per day. Is there any tool that will help us to see where
>> it happens? Some tracepoints? A call trace from vfs close to NFS close?
>
> There are NFS tracepoints but I don't think there are VFS
> tracepoints. Unfortunately, there was a bug in the OPEN tracepoints
> that caused a kernel crash. I had a bugzilla out for RHEL7.2. It says
> it's fixed in the later kernel (.381) but it's currently not back
> ported to RHEL7.2z but hopefully will be soon (just chatted with Steve
> about getting the fix into zstream). I made no progress in figuring
> out what could be causing the lack of CLOSE and it was hard for me to
> reproduce.
>
> Just recently Trond fixed a problem where a CLOSE that was supposed to
> be sent as an OPEN_DOWNGRADE wasn't sent (commit 0979bc2a59). I
> wonder if that can be fixing this problem....
>
>> There is one comment in the kernel code, which sounds similar:
>> (http://git.linux-nfs.org/?p=trondmy/linux-nfs.git;a=blob;f=fs/nfs/nfs4proc.c;h=519368b987622ea23bea210929bebfd0c327e14e;hb=refs/heads/linux-next#l2955)
>>
>> nfs4proc.c: 2954
>> ====
>>
>> /*
>>  * It is possible for data to be read/written from a mem-mapped file
>>  * after the sys_close call (which hits the vfs layer as a flush).
>>  * This means that we can't safely call nfsv4 close on a file until
>>  * the inode is cleared. This in turn means that we are not good
>>  * NFSv4 citizens - we do not indicate to the server to update the file's
>>  * share state even when we are done with one of the three share
>>  * stateid's in the inode.
>>  *
>>  * NOTE: Caller must be holding the sp->so_owner semaphore!
>>  */
>> int nfs4_do_close(struct nfs4_state *state, gfp_t gfp_mask, int wait)
>>
>
> I'm not sure if the comment means to say that there is a possibility
> that NFS won't send a CLOSE (or at least I hope not). I thought that
> we keep a reference count on the inode and send the CLOSE when
> it goes down to 0. Basically the last WRITE will trigger the nfs close,
> not the vfs_close.
>
>
>> ====
>>
>>
>> Tigran.
>>
>>
>> ----- Original Message -----
>>> From: "Andy Adamson" <William.Adamson@xxxxxxxxxx>
>>> To: "Mkrtchyan, Tigran" <tigran.mkrtchyan@xxxxxxx>
>>> Cc: "Linux NFS Mailing List" <linux-nfs@xxxxxxxxxxxxxxx>, "Andy Adamson"
>>> <William.Adamson@xxxxxxxxxx>, "Trond Myklebust"
>>> <trond.myklebust@xxxxxxxxxxxxxxx>, "Steve Dickson" <steved@xxxxxxxxxx>
>>> Sent: Tuesday, July 12, 2016 7:16:19 PM
>>> Subject: Re: Lost CLOSE with NFSv4.1 on RHEL7 ( and bejond?)
>>
>>> Hi Tigran
>>>
>>> Can you test with an upstream kernel?
>>> Olga has seen issues around no CLOSE being
>>> sent - it is really hard to reproduce...
>>>
>>> -->Andy
>>>
>>>
>>>> On Jul 7, 2016, at 6:49 AM, Mkrtchyan, Tigran <tigran.mkrtchyan@xxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> Dear NFS folks,
>>>>
>>>> we observe orphan open-states on our deployment with nfsv4.1.
>>>> Our setup - two client nodes, running RHEL-7.2 with kernel
>>>> 3.10.0-327.22.2.el7.x86_64. Both nodes are running ownCloud (like
>>>> a dropbox) with nfsv4.1 mounts to dCache storage. Some clients are
>>>> connected to node1, others to node2.
>>>>
>>>> From time to time we see some 'active' transfers on our DS
>>>> which do nothing. There is a corresponding state on the MDS.
>>>>
>>>> I have traced one such case:
>>>>
>>>> - node1 uploads the file.
>>>> - node2 reads the file a couple of times, OPEN+LAYOUTGET+CLOSE
>>>> - node2 sends OPEN+LAYOUTGET
>>>> - there is no open file on node2 which points to it.
>>>> - CLOSE is never sent to the server.
>>>> - node1 eventually removes the file
>>>>
>>>> We have many other cases where the file is not removed, but this one I was
>>>> able to trace. The link to capture files:
>>>>
>>>> https://desycloud.desy.de/index.php/s/YldowcRzTGJeLbN
>>>>
>>>> We had ~ 10^6 transfers in the last 2 days and 29 files in such a state (~0.0029%).
>>>>
>>> > Tigran.
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
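
For anyone who wants to look for such orphaned opens in a decoded trace, the OPEN/CLOSE pairing described in the original report can be sketched roughly as below. This is only an illustration under stated assumptions: the (op, stateid) tuple layout, the stateid labels "s1"/"s2" and the function name find_unclosed_opens are made up for the example and do not correspond to any kernel, dCache or capture-tool interface. It also assumes that a CLOSE releases the whole open state for a stateid, no matter how many OPENs referred to it.

```python
def find_unclosed_opens(events):
    """Return stateids that were opened but never closed.

    `events` is an iterable of (op, stateid) tuples in wire order.
    This tuple layout is a hypothetical simplification of a decoded trace.
    """
    open_states = {}  # stateid -> index of the OPEN that established it
    for index, (op, stateid) in enumerate(events):
        if op == "OPEN":
            # Remember where the open state was first established.
            open_states.setdefault(stateid, index)
        elif op == "CLOSE":
            # CLOSE releases the open state for this stateid.
            open_states.pop(stateid, None)
        # Other operations (e.g. LAYOUTGET) do not change open state here.
    # Report leftovers in the order their OPENs appeared.
    return sorted(open_states, key=open_states.get)


# Shape of the traced case: one complete read, then an OPEN with no CLOSE.
trace = [
    ("OPEN", "s1"), ("LAYOUTGET", "s1"), ("CLOSE", "s1"),
    ("OPEN", "s2"), ("LAYOUTGET", "s2"),
]
print(find_unclosed_opens(trace))
```

In a real capture the key would have to be derived from the stateid returned in the OPEN reply, and a server-side state dump would still be needed to confirm that the state is really orphaned.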