On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Cephers,
>
> We're using the ceph file system with the fuse client, and lately some of
> our processes are getting stuck, seemingly waiting for fuse operations. At
> the same time, the cluster is healthy: no slow requests, all OSDs up and
> running, and both the MDS and the fuse client think that there are no
> pending operations. The situation is semi-reproducible: when I run
> various cluster jobs, some get stuck after a few hours of correct operation.
> The cluster is on ceph 10.2.5 and 10.2.6 and the fuse clients are 10.2.6, but I
> have tried 10.2.5 and 10.2.3, all of which have the same issue. This is on
> CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>
> Here are some details:
>
> The node with the stuck processes:
>
> [root@worker1070 ~]# ps -auxwww | grep 30519
> apataki  30519 39.8  0.9 8728064 5257588 ?  Dl  12:11  60:50 ./Arepo
> param.txt 2 6
> [root@worker1070 ~]# cat /proc/30519/stack
> [<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [root@worker1070 ~]# ps -auxwww | grep 30533
> apataki  30533 39.8  0.9 8795316 5261308 ?  Sl  12:11  60:55 ./Arepo
> param.txt 2 6
> [root@worker1070 ~]# cat /proc/30533/stack
> [<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
> [<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
> [<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
> [<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
> [<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
> [<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Presumably the second process is waiting on the first, which is holding some lock ...
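
A quick way to repeat the check above across a whole job, rather than one
PID at a time, is to dump the kernel stack of every process in the job along
with its scheduler state. This is only a generic editor's sketch; it assumes
the job's processes can be found by the "Arepo" name shown in the ps output,
and it needs root to read /proc/<pid>/stack:

    # Sketch: dump kernel stacks for all Arepo processes and show their
    # scheduler state (D = uninterruptible sleep), to spot threads blocked
    # inside the fuse module.
    for pid in $(pgrep -f Arepo); do
        echo "=== PID $pid, state $(ps -o stat= -p "$pid") ==="
        cat "/proc/$pid/stack"
    done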
>
> The fuse client on the node:
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
> {
>     "metadata": {
>         "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
>         "ceph_version": "ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
>         "entity_id": "admin",
>         "hostname": "worker1070",
>         "mount_point": "\/mnt\/ceph",
>         "root": "\/"
>     },
>     "dentry_count": 40,
>     "dentry_pinned_count": 23,
>     "inode_count": 123,
>     "mds_epoch": 19041,
>     "osd_epoch": 462327,
>     "osd_epoch_barrier": 462326
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
> {
>     "id": 3616543,
>     "sessions": [
>         {
>             "mds": 0,
>             "addr": "10.128.128.110:6800\/909443124",
>             "seq": 338,
>             "cap_gen": 0,
>             "cap_ttl": "2017-03-13 14:47:37.575229",
>             "last_cap_renew_request": "2017-03-13 14:46:37.575229",
>             "cap_renew_seq": 12694,
>             "num_caps": 713,
>             "state": "open"
>         }
>     ],
>     "mdsmap_epoch": 19041
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests
> {}
>
>
> The overall cluster health and the MDS:
>
> [root@cephosd000 ~]# ceph -s
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             noscrub,nodeep-scrub,require_jewel_osds flag(s) set
>      monmap e17: 3 mons at
>             {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>             election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
>       fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
>      osdmap e462328: 624 osds: 624 up, 624 in
>             flags noscrub,nodeep-scrub,require_jewel_osds
>       pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
>             2154 TB used, 1791 TB / 3946 TB avail
>                42496 active+clean
>   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr
>
> [root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
> {
>     "ops": [],
>     "num_ops": 0
> }
>
>
> The odd thing is that if in this state I restart the MDS, the client process
> wakes up and proceeds with its work without any errors. As if a request was
> lost and somehow retransmitted/restarted when the MDS got restarted and the
> fuse layer reconnected to it.

Interesting.  A couple of ideas for more debugging:

* Next time you go through this process of restarting the MDS while there
  is a stuck client, first increase the client's logging
  (ceph daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client 20).
  Then we should get a clear sense of exactly what's happening on the MDS
  restart that's enabling the client to proceed.

* When inspecting the client's "mds_sessions" output, also check the
  "session ls" output on the MDS side to make sure the MDS and client both
  agree that it has an open session.

John

>
> When I try to attach a gdb session to either of the client processes, gdb
> just hangs.  However, right after the MDS restart gdb attaches to the
> process successfully, and shows that the hang happened on the closing
> of a file.  In fact, it looks like both processes were trying to write to
> the same file, opened with fopen("filename", "a"), and close it:
>
> (gdb) where
> #0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
> #1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
> #2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
> #3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
> #4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
> ...
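
For reference, John's two suggestions above translate into roughly the
following commands. This is only a sketch; the admin socket paths and the
MDS daemon name (cephosd000) are taken from the output quoted earlier in
this thread and will differ on other setups.

    # 1) On the stuck client, raise client-side debug logging before the
    #    MDS restart, then watch the ceph-fuse client log during the restart:
    ceph daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 20

    # 2) Compare the client's view of its MDS session ...
    ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions

    #    ... with the MDS's view, to check both sides agree the session is open:
    ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok session ls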
>
> It seems like the fuse client wasn't handling this case well, where two
> processes write to the same file and then close it? This is just
> speculation. Any ideas on how to proceed? Is there perhaps a known issue
> related to this?
>
> Thanks,
>
> Andras
> apataki@xxxxxxxxxxxxxxxxxxxxx
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
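
As a closing note on the speculation near the end of Andras's message (two
processes appending to the same file and closing it): a crude way to try to
provoke that access pattern on a ceph-fuse mount might look like the sketch
below. It is untested, the path under /mnt/ceph is made up, and shell-level
appends only approximate the application's fopen("...", "a")/fclose()
pattern, but it does exercise two writers concurrently appending to and
closing the same file.

    #!/bin/bash
    # Untested reproduction sketch: two processes repeatedly append a line to
    # the same file on the ceph-fuse mount, opening and closing it each time.
    TARGET=/mnt/ceph/tmp/append-test.log    # hypothetical test file

    writer() {
        for i in $(seq 1 10000); do
            # each redirection opens the file with O_APPEND, writes, closes it
            echo "writer $1 iteration $i" >> "$TARGET"
        done
    }

    writer A &
    writer B &
    wait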