Re: CephFS fuse client users stuck

Henrik Korkuc <lists@xxxxxxxxx> · Tue, 14 Mar 2017 11:51:47 +0200

On 17-03-14 00:08, John Spray wrote:
On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.  At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run a
various cluster jobs, some get stuck after a few hours of correct operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6, but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ?     Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
[<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
[<ffffffff811de55d>] vfs_write+0xbd/0x1e0
[<ffffffff811defff>] SyS_write+0x7f/0xe0
[<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ?     Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
[<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
[<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
[<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
[<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
[<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
[<ffffffff811de55d>] vfs_write+0xbd/0x1e0
[<ffffffff811defff>] SyS_write+0x7f/0xe0
[<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Presumably the second process is waiting on the first holding some lock ...

The fuse client on the node:

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
{
     "metadata": {
         "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
         "ceph_version": "ceph version 10.2.6
(656b5b63ed7c43bd014bcafd81b001959d5f089f)",
         "entity_id": "admin",
         "hostname": "worker1070",
         "mount_point": "\/mnt\/ceph",
         "root": "\/"
     },
     "dentry_count": 40,
     "dentry_pinned_count": 23,
     "inode_count": 123,
     "mds_epoch": 19041,
     "osd_epoch": 462327,
     "osd_epoch_barrier": 462326
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_sessions
{
     "id": 3616543,
     "sessions": [
         {
             "mds": 0,
             "addr": "10.128.128.110:6800\/909443124",
             "seq": 338,
             "cap_gen": 0,
             "cap_ttl": "2017-03-13 14:47:37.575229",
             "last_cap_renew_request": "2017-03-13 14:46:37.575229",
             "cap_renew_seq": 12694,
             "num_caps": 713,
             "state": "open"
         }
     ],
     "mdsmap_epoch": 19041
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_requests
{}

The overall cluster health and the MDS:

[root@cephosd000 ~]# ceph -s
     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
      health HEALTH_WARN
             noscrub,nodeep-scrub,require_jewel_osds flag(s) set
      monmap e17: 3 mons at
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
             election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
       fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
      osdmap e462328: 624 osds: 624 up, 624 in
             flags noscrub,nodeep-scrub,require_jewel_osds
       pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
             2154 TB used, 1791 TB / 3946 TB avail
                42496 active+clean
   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

[root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
{
     "ops": [],
     "num_ops": 0
}

The odd thing is that if in this state I restart the MDS, the client process
wakes up and proceeds with its work without any errors.  As if a request was
lost and somehow retransmitted/restarted when the MDS got restarted and the
fuse layer reconnected to it.
Interesting.  A couple of ideas for more debugging:

* Next time you go through this process of restarting the MDS while
there is a stuck client, first increase the client's logging (ceph
daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client
20").  Then we should get a clear sense of exactly what's happening on
the MDS restart that's enabling the client to proceed.
* When inspecting the client's "mds_sessions" output, also check the
"session ls" output on the MDS side to make sure the MDS and client
both agree that it has an open session.

John

please check if kicking ceph-fuse "ceph --admin-daemon 
/var/run/ceph/ceph-client.something_something.asok kick_stale_sessions" 
works for you.

I experienced similar problems, but didn't go down fully on it. You can 
check out:
http://tracker.ceph.com/issues/17275 - should be resolved
http://tracker.ceph.com/issues/17413
http://tracker.ceph.com/issues/17660
http://tracker.ceph.com/issues/18757 - if it's this problem, then you 
should see that MDS is closing stale connection in logs. I have a PR 
there but it needs some more discussions and possibly work
When I try to attach a gdb session to either of the client processes, gdb
just hangs.  However, right after the MDS restart gdb attaches to the
process successfully, and shows that the getting stuck happened on closing
of a file.  In fact, it looks like both processes were trying to write to
the same file opened with fopen("filename", "a") and close it:

(gdb) where
#0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
#1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
#3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
#4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
...

It seems like the fuse client wasn't handling this case well, when two
processes write to the same file and close it perhaps?  This is just a
speculation.  Any ideas on how to proceed?  Is there perhaps a known issue
related to this?

Thanks,

Andras
apataki@xxxxxxxxxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com