Re: gnfs socket errors during client mount, unmount and patch to illustrate



We are in the final stretch of a release and are slammed.

I am working with two talented people on our QE team, and we are really
hoping to move to Ganesha in our solutions. We see decreased interest
in fixing NFS issues, which puts more burden and risk on us.

I don't have the data you require; we only recently set up a test case.
While we sell supercomputers, our internal test systems... aren't :)

But we were able to set up a test case on a 300-node cluster. While I
don't have the data handy, I hope to repeat the run in a week or two.

The setup is something like this:
* 3 leader nodes (gluster/ganesha servers)
    (supercomputers have up to 24 leaders/gluster servers;
    volumes are distributed/replicated using a combo of sharded and
    non-sharded)
* All 300 compute nodes get their roots from TMPFS in this case -- it is
  normally NFS, but for the test case we isolate NFS to just the test.
  We didn't want root on NFS while pynamic was also on NFS.
* The OS image they boot has MPI installed
* We use the simple management network (no high speed/IB/etc) and the
  head node as the launch point
* We compile pynamic with certain options to simulate a load we want.
  It sits on an NFS export of some sort, depending on the test
* Then we use MPI to run it on the 300 computes
* Pynamic will report how long various stages of library loads take and
  report back timings for each node.
* We can run the test pure-TMPFS and not NFS (pynamic in TMPFS)
* We can switch to kernel NFS (pynamic NFS-hosted from the head node)
* We can switch between gluster NFS and Ganesha (on the "leaders" aka
  "gluster servers")

The pynamic installation location can be a typical NFS expanded tree,
which produces tons of NFS metadata traffic, or a SquashFS image
on NFS, where the NFS clients and server just manage a single file.

This allows us to simulate what a customer operating system may have to
go through when loading libraries for a giant job, and where the
slowdowns may be.

Our hope is to find a way -- I'm not sure how -- to boil this down to a
test case that can be shared by others to try out. It's a bit of a pain
to do that.

Hoping to write back more in the coming weeks. I may start here at
Gluster just since I'm familiar with you all, and once you take a look,
I can try to move the discussion to the Ganesha community.

On Tue, Feb 22, 2022 at 06:44:46PM +0000, Strahil Nikolov wrote:
> Hey Erik,
> Can you provide a short comparison of the difference between Ganesha and
> Gluster-NFS for the same workload (if possible a job execution with the same
> data) ?
> Best Regards,
> Strahil Nikolov
>     On Tue, Feb 22, 2022 at 20:35, Erik Jacobson
>     <erik.jacobson@xxxxxxx> wrote:
>     We have hacked around these errors, produced in glusterfs79 and
>     glusterfs93 when an NFS client mounts or unmounts. On one of the
>     installed supercomputers, one of the gluster server nfs.log files was
>     over 1GB:
>         [2022-02-21 22:39:32.803070 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>         [2022-02-21 22:39:32.806102 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>         [2022-02-21 22:39:32.863435 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>         [2022-02-21 22:39:32.864202 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>         [2022-02-21 22:39:32.934893 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>         [2022-02-21 22:39:48.744882 +0000] W [socket.c:767:__socket_rwv]
>     0-socket.nfs-server: readv on failed (No data available)
>     We hacked around this with the following patch, which is not a patch for
>     inclusion but illustrates the issue. Since we are not experts in the
>     gluster code, we isolated it to the nfs-server use of socket.c. We
>     understand that is likely not appropriate convention for a released patch.
>     diff -Narup glusterfs-9.3-orig/rpc/rpc-transport/socket/src/socket.c
>     glusterfs-9.3/rpc/rpc-transport/socket/src/socket.c
>     --- glusterfs-9.3-orig/rpc/rpc-transport/socket/src/socket.c    2021-06-29
>     00:27:44.382408295 -0500
>     +++ glusterfs-9.3/rpc/rpc-transport/socket/src/socket.c    2022-02-21
>     20:23:41.101667807 -0600
>     @@ -733,6 +733,15 @@ __socket_rwv(rpc_transport_t *this, stru
>             } else {
>                 ret = __socket_cached_read(this, opvector, opcount);
>                 if (ret == 0) {
>     +                if (strcmp(this->name, "nfs-server") == 0) {
>     +                    /* nfs mount, unmount can produce ENODATA */
>     +                    gf_log(this->name, GF_LOG_DEBUG,
>     +                           "HPE - EOF from peer %s, since NFS, return ENOTCONN",
>     +                           this->peerinfo.identifier);
>     +                    opcount = -1;
>     +                    errno = ENOTCONN;
>     +                    break;
>     +                }
>                     gf_log(this->name, GF_LOG_DEBUG,
>                            "EOF on socket %d (errno:%d:%s); returning ENODATA",
>                            sock, errno, strerror(errno));
>     * We understand you want us to move to Ganesha NFS. I mentioned in my
>     other notes that we are unable to move due to problems with Ganesha when
>     serving the NFS root in sles15sp3 (it gets stuck in nscd when nscd
>     opens the passwd and group files). While sles15sp4 fixes that, Ganesha
>     seems to be 25-35% slower than Gluster NFS, which could require already
>     installed systems to add more gluster/ganesha servers just to service
>     the software update. We hope we can provide test cases to the Ganesha
>     community and see if we can help speed it up for our workloads. In our
>     next release, we have tooling in place to support Ganesha as a tech
>     preview, off by default. So it will be there for experimenting with and
>     comparing the two on our installations.

Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC

Gluster-devel mailing list
