Thanks for your response on that, Jeff. Pretty sure this is nothing to do with Ceph or Ganesha, sorry for wasting your time. What I'm seeing is related to writeback on the client. I can mitigate the behaviour a bit by playing around with the vm.dirty* parameters.
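In case it's useful to anyone else, the sysctls I've been playing with are along these lines (the values are just what I happen to be testing, not a recommendation):

    # Shrink how much dirty page cache can accumulate before writeback kicks in,
    # so a single large sequential write can't build up gigabytes of dirty pages.
    sysctl -w vm.dirty_background_bytes=67108864   # start background writeback at ~64 MB
    sysctl -w vm.dirty_bytes=268435456             # throttle writers once ~256 MB is dirty
    # Note: setting the *_bytes variants zeroes the corresponding *_ratio sysctls.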
On Tue, Apr 16, 2019 at 7:07 PM Jeff Layton <jlayton@xxxxxxxxxxxxxxx> wrote:
On Tue, Apr 16, 2019 at 10:36 AM David C <dcsysengineer@xxxxxxxxx> wrote:
>
> Hi All
>
> I have a single export of my CephFS using the ceph_fsal [1]. A CentOS 7 machine mounts a sub-directory of the export [2] and uses it as the home directory of a user (e.g. everything under ~ is on the server).
>
> This works fine until I start a long sequential write into the home directory such as:
>
> dd if=/dev/zero of=~/deleteme bs=1M count=8096
>
> This saturates the 1GbE link on the client, which is great, but during the transfer apps that are accessing files in home start to lock up. Google Chrome, for example, which puts its config in ~/.config/google-chrome/, locks up during the transfer (e.g. I can't move between tabs); as soon as the transfer finishes, Chrome goes back to normal. Essentially the desktop environment reacts as I'd expect if the server had gone away. I'm using the MATE DE.
>
> However, if I mount a separate directory from the same export on the machine [3] and do the same write into that directory, my desktop experience isn't affected.
>
> I hope that makes some sense; it's a bit of a weird one to describe. This feels like a locking issue to me, although I can't explain why a single write into the root of a mount would affect access to other files under that same mount.
>
It's not a single write. You're doing 8G worth of 1M I/Os. The server
then has to do all of those to the OSD backing store.
> [1] CephFS export:
>
> EXPORT
> {
>     Export_ID = 100;
>     Protocols = 4;
>     Transports = TCP;
>     Path = /;
>     Pseudo = /ceph/;
>     Access_Type = RW;
>     Attr_Expiration_Time = 0;
>     Disable_ACL = FALSE;
>     Manage_Gids = TRUE;
>     Filesystem_Id = 100.1;
>     FSAL {
>         Name = CEPH;
>     }
> }
>
> [2] Home directory mount:
>
> 10.10.10.226:/ceph/homes/username on /homes/username type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> [3] Test directory mount:
>
> 10.10.10.226:/ceph/testing on /tmp/testing type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> Versions:
>
> Luminous 12.2.10
> nfs-ganesha-2.7.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
>
> Ceph.conf on nfs-ganesha server:
>
> [client]
> mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
> client_oc_size = 8388608000
> client_acl_type=posix_acl
> client_quota = true
> client_quota_df = true
>
No magic bullets here, I'm afraid.
Sounds like ganesha is probably just too swamped with write requests
to do much else, but you'll probably want to do the legwork starting
with the hanging application, and figure out what it's doing that
takes so long. Is it some syscall? Which one?
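A rough way to check, assuming you can grab the PID of the hung process (the <pid> below is just a placeholder):

    # which syscall is the process currently blocked in?
    cat /proc/<pid>/syscall
    cat /proc/<pid>/stack          # kernel stack of the blocked task, needs root
    # or watch its syscalls live, with per-call timing
    strace -f -T -p <pid>

That should show whether it's sitting in fsync, open, stat, or something else.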
From there you can start looking at statistics in the NFS client to
see what's going on there. Are certain RPCs taking longer than they
should? Which ones?
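Something along these lines on the client would show per-operation RPC counts and round-trip times (nfsiostat ships with nfs-utils; the mount path is the one from [2]):

    # raw per-mount, per-operation RPC counters and latencies
    cat /proc/self/mountstats
    # per-op summary for the home mount, sampled every 5 seconds
    nfsiostat 5 /homes/username
    # overall client-side RPC call breakdown
    nfsstat -c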
Once you know what's going on with the client, you can better tell
what's going on with the server.
--
Jeff Layton <jlayton@xxxxxxxxxxxxxxx>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com