Re: CephFS fuse client users stuck

On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> I've checked the MDS session list, and the fuse client does appear on that
> with 'state' as 'open'.  So both the fuse client and the MDS agree on an
> open connection.
>
> Attached is the log of the ceph fuse client at debug level 20.  The MDS got
> restarted at 9:44:20, went through its startup, and was in an
> 'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
> 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
> 10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes a ll_write on inode 100024ebea8 after the MDS restart).
Hard to say whether the MDS failed to send the client the proper
messages, or whether the client somehow missed them.
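
A quick way to cross-check the cap/session state from both ends is via the
admin sockets (a sketch; the asok paths and MDS name below are taken from
the outputs earlier in this thread, so adjust them to your deployment):

```shell
# On the stuck client: confirm the session is open and see the cap counts
ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions

# On the MDS host: confirm the same session from the server's side
ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok session ls

# Dump the MDS cache to a file and grep for the inode the client thread
# is blocked on (e.g. 100024ebea8) to see what caps the MDS has issued
ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok dump cache /tmp/mdscache
grep 100024ebea8 /tmp/mdscache
```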

It would be useful to have equally verbose logs from the MDS side from
earlier on, at the point that the client started trying to do the
write.  Could you check whether your MDS and client can tolerate running
at "debug mds = 20" and "debug client = 20" respectively for a while?
Then, when a client gets stuck, do the MDS restart, work back through the
client log to find which inode it was stuck on, and locate the log areas
on the MDS side relating to that inode number.
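
To raise the debug levels on the live daemons, something like this should
work (a sketch using the asok paths from this thread; "ceph tell ...
injectargs" would do the same job remotely):

```shell
# On the MDS host: verbose MDS logging
ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok config set debug_mds 20

# On the client host: verbose logging on the running ceph-fuse
ceph daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 20

# ...reproduce the hang, restart the MDS, then correlate the two logs by
# the inode number the stuck client thread was waiting on.
```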

BTW, I see this "FsUsage" stuff in your log, which I don't recognise
from mainline Ceph; are you running something modified?

John


> As per another suggestion, I've also tried kick_stale_sessions on the fuse
> client, which didn't help (I guess since it doesn't think the session is
> stale).
> Let me know if there is anything else I can do to help.
>
> Andras
>
>
>
> On 03/13/2017 06:08 PM, John Spray wrote:
>>
>> On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
>> <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Dear Cephers,
>>>
>>> We're using the ceph file system with the fuse client, and lately some of
>>> our processes are getting stuck seemingly waiting for fuse operations.
>>> At
>>> the same time, the cluster is healthy, no slow requests, all OSDs up and
>>> running, and both the MDS and the fuse client think that there are no
>>> pending operations.  The situation is semi-reproducible: when I run
>>> various cluster jobs, some get stuck after a few hours of correct
>>> operation.
>>> The cluster is on ceph 10.2.5 and 10.2.6; the fuse clients are 10.2.6,
>>> but I have also tried 10.2.5 and 10.2.3, all of which show the same
>>> issue.  This is on CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>>>
>>> Here are some details:
>>>
>>> The node with the stuck processes:
>>>
>>> [root@worker1070 ~]# ps -auxwww | grep 30519
>>> apataki   30519 39.8  0.9 8728064 5257588 ?     Dl   12:11  60:50 ./Arepo
>>> param.txt 2 6
>>> [root@worker1070 ~]# cat /proc/30519/stack
>>> [<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
>>> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
>>> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
>>> [<ffffffff811defff>] SyS_write+0x7f/0xe0
>>> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> [root@worker1070 ~]# ps -auxwww | grep 30533
>>> apataki   30533 39.8  0.9 8795316 5261308 ?     Sl   12:11  60:55 ./Arepo
>>> param.txt 2 6
>>> [root@worker1070 ~]# cat /proc/30533/stack
>>> [<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
>>> [<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
>>> [<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
>>> [<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
>>> [<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
>>> [<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
>>> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
>>> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
>>> [<ffffffff811defff>] SyS_write+0x7f/0xe0
>>> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> Presumably the second process is waiting on the first, which holds some
>>> lock ...
>>>
>>> The fuse client on the node:
>>>
>>> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
>>> status
>>> {
>>>      "metadata": {
>>>          "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
>>>          "ceph_version": "ceph version 10.2.6
>>> (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
>>>          "entity_id": "admin",
>>>          "hostname": "worker1070",
>>>          "mount_point": "\/mnt\/ceph",
>>>          "root": "\/"
>>>      },
>>>      "dentry_count": 40,
>>>      "dentry_pinned_count": 23,
>>>      "inode_count": 123,
>>>      "mds_epoch": 19041,
>>>      "osd_epoch": 462327,
>>>      "osd_epoch_barrier": 462326
>>> }
>>>
>>> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
>>> mds_sessions
>>> {
>>>      "id": 3616543,
>>>      "sessions": [
>>>          {
>>>              "mds": 0,
>>>              "addr": "10.128.128.110:6800\/909443124",
>>>              "seq": 338,
>>>              "cap_gen": 0,
>>>              "cap_ttl": "2017-03-13 14:47:37.575229",
>>>              "last_cap_renew_request": "2017-03-13 14:46:37.575229",
>>>              "cap_renew_seq": 12694,
>>>              "num_caps": 713,
>>>              "state": "open"
>>>          }
>>>      ],
>>>      "mdsmap_epoch": 19041
>>> }
>>>
>>> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
>>> mds_requests
>>> {}
>>>
>>>
>>> The overall cluster health and the MDS:
>>>
>>> [root@cephosd000 ~]# ceph -s
>>>      cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>>>       health HEALTH_WARN
>>>              noscrub,nodeep-scrub,require_jewel_osds flag(s) set
>>>       monmap e17: 3 mons at
>>>
>>> {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>>>              election epoch 29148, quorum 0,1,2
>>> hyperv029,hyperv030,hyperv031
>>>        fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
>>>       osdmap e462328: 624 osds: 624 up, 624 in
>>>              flags noscrub,nodeep-scrub,require_jewel_osds
>>>        pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
>>>              2154 TB used, 1791 TB / 3946 TB avail
>>>                 42496 active+clean
>>>    client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr
>>>
>>> [root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok
>>> ops
>>> {
>>>      "ops": [],
>>>      "num_ops": 0
>>> }
>>>
>>>
>>> The odd thing is that if in this state I restart the MDS, the client
>>> process
>>> wakes up and proceeds with its work without any errors.  As if a request
>>> was
>>> lost and somehow retransmitted/restarted when the MDS got restarted and
>>> the
>>> fuse layer reconnected to it.
>>
>> Interesting.  A couple of ideas for more debugging:
>>
>> * Next time you go through this process of restarting the MDS while
>> there is a stuck client, first increase the client's logging ("ceph
>> daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client
>> 20").  Then we should get a clear sense of exactly what's happening on
>> the MDS restart that's enabling the client to proceed.
>> * When inspecting the client's "mds_sessions" output, also check the
>> "session ls" output on the MDS side to make sure the MDS and client
>> both agree that it has an open session.
>>
>> John
>>
>>> When I try to attach a gdb session to either of the client processes, gdb
>>> just hangs.  However, right after the MDS restart gdb attaches to the
>>> process successfully, and shows that the hang happened while closing a
>>> file.  In fact, it looks like both processes were trying to write to
>>> the same file opened with fopen("filename", "a") and then close it:
>>>
>>> (gdb) where
>>> #0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
>>> #1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
>>> #2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
>>> #3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
>>> #4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
>>> ...
>>>
>>> It seems like the fuse client perhaps isn't handling the case of two
>>> processes writing to the same file and closing it well?  This is just
>>> speculation.  Any ideas on how to proceed?  Is there perhaps a known
>>> issue related to this?
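
For what it's worth, the access pattern described above is easy to exercise
outside the application; a minimal sketch (the path below is illustrative;
point it at a file under the /mnt/ceph mount to probe for the hang):

```shell
#!/bin/sh
# Two concurrent writers appending to the same file, each doing an
# open/append/close cycle per line, mirroring fopen(..., "a")/fclose().
F=/tmp/append-test.txt    # replace with a path under /mnt/ceph
rm -f "$F"
writer() {
  i=1
  while [ "$i" -le 100 ]; do
    echo "writer $1 line $i" >> "$F"   # each >> is an O_APPEND open+write+close
    i=$((i + 1))
  done
}
writer A & writer B &
wait
wc -l < "$F"    # 200 lines expected if neither writer stalls
```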
>>>
>>> Thanks,
>>>
>>> Andras
>>> apataki@xxxxxxxxxxxxxxxxxxxxx
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>