On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> It took a while, but I believe I now have a reproducible test case for the capabilities-being-lost issue in CephFS that I wrote about a couple of weeks ago. The quick summary of the problem is that processes using CephFS often hang, either for a while or sometimes indefinitely. The fuse client believes it is waiting for some caps which it does not get from the MDS.
>
> A quick summary of the setup: the whole system is on Jewel 10.2.6. We are using the fuse client on all nodes for the file system.
>
> The full test program is uploaded together with the verbose logs of all clients and the MDS (at log level 20) to
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
> It essentially runs in a loop opening a file for read/write, reading from it and closing it. The read/write open is key: if the file is opened read-only, the problem doesn't happen:
>
> while (time(NULL) - t0 < INTERVAL) {
>
>     int fd = open(FILENAME, O_RDWR);
>     if (fd < 0) {
>         printf("Could not open file '%s' for read/write", FILENAME);
>         exit(1);
>     }
>
>     read(fd, buffer, 100);
>
>     close(fd);
>     count++;
> }
>
> With INTERVAL = 10 seconds, on one machine with a single process I get something like 30k opens per 10 seconds - an excellent result. When I run 3 processes on 3 different nodes, I get a few (perhaps 4 or 5) opens per 10 seconds per process. The particular case I collected logs for looks like this:
>
> apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109 ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110 ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108 ~/prg/ceph/test/timed_openrw_read < /dev/null &
> Fri Mar 31 07:33:04 EDT 2017
> [1] 53833
> [2] 53834
> [3] 53835
> apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
> Number of repeats: 3
> Number of repeats: 5
>
> [1]   Done    ssh worker1109 ~/prg/ceph/test/timed_openrw_read < /dev/null
> [2]-  Done    ssh worker1110 ~/prg/ceph/test/timed_openrw_read < /dev/null
> [3]+  Done    ssh worker1108 ~/prg/ceph/test/timed_openrw_read < /dev/null
> apataki@worker1108:~/prg/ceph/test$
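A self-contained sketch of that reproducer might look like the following; only the loop itself comes from the message above, so the surrounding main(), the buffer, the final count report, and the FILENAME value (taken from the test file mentioned in the log excerpt below) are filled-in assumptions rather than the actual uploaded test program:

/* Minimal sketch of the described reproducer: repeatedly open the same
   CephFS file for read/write, read a little, and close it. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Assumed values; the real program and paths are in the uploaded archive. */
#define FILENAME "/mnt/ceph/users/apataki/test/TEST-READ.txt"
#define INTERVAL 10   /* seconds */

int main(void)
{
    char buffer[100];
    long count = 0;
    time_t t0 = time(NULL);

    while (time(NULL) - t0 < INTERVAL) {
        /* The read/write open is what matters; O_RDONLY avoids the problem. */
        int fd = open(FILENAME, O_RDWR);
        if (fd < 0) {
            printf("Could not open file '%s' for read/write\n", FILENAME);
            exit(1);
        }
        if (read(fd, buffer, sizeof(buffer)) < 0)
            perror("read");
        close(fd);
        count++;
    }

    printf("Number of repeats: %ld\n", count);
    return 0;
}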
> Looking at the client, it looks like there are 5 second pauses waiting for capabilities. It seems that it doesn't get a response, and perhaps at some frequency it polls back to find out what happened to its capabilities. Here is an example of such a 5 second gap from the client logs:
>
> 2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps need Fr want Fc
> 2017-03-31 07:33:05.464058 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:06.464180 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:07.464307 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:08.464421 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:09.464554 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:10.464680 2aaaafd47700 10 client.3849178 check_caps on 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0} open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478 caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0 dirty_or_tx 0] parents=0x55555f5d5340 0x55555f73ea00) wanted pAsxXsxFsxcrwb used - issued pAsLsXsFr revoking Fsc is_delayed=1
>
> The inode in question is 100024a7b1d, which corresponds to the test file /mnt/ceph/users/apataki/test/TEST-READ.txt
> When I grep for this inode in the MDS logs, it also has a gap over the 7:33:05 to 7:33:10 window. I've learned a bit about the caps by trying to reproduce the problem, but I'm afraid I don't understand enough of the MDS logic to see what the problem is there.
>
> The full logs are too large for the mailing list, so I've put them here:
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
>
> Some help/advice with this would be very much appreciated. Thanks in advance!

Thanks for the reproducer -- I've tried this against master ceph with several clients and I'm seeing hundreds of repeats per 10 second interval, although my clients are all on the same node. I got a similar result when setting the client to inject 10ms delays on messages to simulate being remote. I ran it in a loop with three clients several tens of times without seeing a hang.

I'm compiling jewel now to see if it behaves differently from master (with the optimistic notion that we might have already fixed the underlying cause...)

John

> Andras
>
> On 03/14/2017 12:55 PM, John Spray wrote:
> On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> I've checked the MDS session list, and the fuse client does appear on it with 'state' as 'open'. So both the fuse client and the MDS agree on an open connection.
>
> Attached is the log of the ceph fuse client at debug level 20. The MDS got restarted at 9:44:20, went through its startup, and was in an 'active' state in ceph -s by 9:45:20. As for the IP addresses in the logs, 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs, and 10.128.129.63 is the IP of the client the log is from.
>
> So it looks like the client is getting stuck waiting for some capabilities (the 7fff9c3f7700 thread in that log, which eventually completes an ll_write on inode 100024ebea8 after the MDS restart). Hard to say whether the MDS failed to send it the proper messages, or if the client somehow missed them.
>
> It would be useful to have equally verbose logs from the MDS side from earlier on, at the point that the client started trying to do the write.
> I wonder if you could see whether your MDS and client can handle being run at "debug mds = 20" and "debug client = 20" respectively for a while; then, when a client gets stuck, do the MDS restart, follow back in the client log to work out which inode it was stuck on, and then find the log areas on the MDS side relating to that inode number.
>
> BTW I see this "FsUsage" stuff in your log, which I don't recognise from mainline Ceph; are you running something modified?
>
> John
>
> As per another suggestion, I've also tried kick_stale_sessions on the fuse client, which didn't help (I guess because it doesn't think the session is stale).
> Let me know if there is anything else I can do to help.
>
> Andras
>
> On 03/13/2017 06:08 PM, John Spray wrote:
> On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Cephers,
>
> We're using the ceph file system with the fuse client, and lately some of our processes are getting stuck, seemingly waiting for fuse operations. At the same time, the cluster is healthy: no slow requests, all OSDs up and running, and both the MDS and the fuse client think that there are no pending operations. The situation is semi-reproducible: when I run various cluster jobs, some get stuck after a few hours of correct operation. The cluster is on ceph 10.2.5 and 10.2.6, and the fuse clients are 10.2.6, but I have tried 10.2.5 and 10.2.3, all of which have the same issue. This is on CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>
> Here are some details:
>
> The node with the stuck processes:
>
> [root@worker1070 ~]# ps -auxwww | grep 30519
> apataki  30519 39.8  0.9 8728064 5257588 ?  Dl  12:11  60:50 ./Arepo param.txt 2 6
> [root@worker1070 ~]# cat /proc/30519/stack
> [<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [root@worker1070 ~]# ps -auxwww | grep 30533
> apataki  30533 39.8  0.9 8795316 5261308 ?  Sl  12:11  60:55 ./Arepo param.txt 2 6
> [root@worker1070 ~]# cat /proc/30533/stack
> [<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
> [<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
> [<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
> [<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
> [<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
> [<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Presumably the second process is waiting on the first, which holds some lock ...
>
> The fuse client on the node:
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
> {
>     "metadata": {
>         "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
>         "ceph_version": "ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
>         "entity_id": "admin",
>         "hostname": "worker1070",
>         "mount_point": "\/mnt\/ceph",
>         "root": "\/"
>     },
>     "dentry_count": 40,
>     "dentry_pinned_count": 23,
>     "inode_count": 123,
>     "mds_epoch": 19041,
>     "osd_epoch": 462327,
>     "osd_epoch_barrier": 462326
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
> {
>     "id": 3616543,
>     "sessions": [
>         {
>             "mds": 0,
>             "addr": "10.128.128.110:6800\/909443124",
>             "seq": 338,
>             "cap_gen": 0,
>             "cap_ttl": "2017-03-13 14:47:37.575229",
>             "last_cap_renew_request": "2017-03-13 14:46:37.575229",
>             "cap_renew_seq": 12694,
>             "num_caps": 713,
>             "state": "open"
>         }
>     ],
>     "mdsmap_epoch": 19041
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests
> {}
>
> The overall cluster health and the MDS:
>
> [root@cephosd000 ~]# ceph -s
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             noscrub,nodeep-scrub,require_jewel_osds flag(s) set
>      monmap e17: 3 mons at {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>             election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
>       fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
>      osdmap e462328: 624 osds: 624 up, 624 in
>             flags noscrub,nodeep-scrub,require_jewel_osds
>       pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
>             2154 TB used, 1791 TB / 3946 TB avail
>                42496 active+clean
>   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr
>
> [root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
> {
>     "ops": [],
>     "num_ops": 0
> }
>
> The odd thing is that if I restart the MDS in this state, the client process wakes up and proceeds with its work without any errors, as if a request had been lost and was somehow retransmitted/restarted when the MDS got restarted and the fuse layer reconnected to it.
>
> Interesting. A couple of ideas for more debugging:
>
> * Next time you go through this process of restarting the MDS while there is a stuck client, first increase the client's logging ("ceph daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client 20"). Then we should get a clear sense of exactly what's happening on the MDS restart that's enabling the client to proceed.
> * When inspecting the client's "mds_sessions" output, also check the "session ls" output on the MDS side to make sure the MDS and the client both agree that there is an open session.
>
> John
>
> When I try to attach a gdb session to either of the client processes, gdb just hangs. However, right after the MDS restart gdb attaches to the process successfully, and shows that the hang happened on closing a file. In fact, it looks like both processes were trying to write to the same file opened with fopen("filename", "a") and close it:
>
> (gdb) where
> #0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
> #1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
> #2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
> #3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
> #4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
> ...
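One note on reading that backtrace: with a stream opened via fopen(), any buffered output is flushed when fclose() is called, which is why the blocking write() at the top of the stack is issued from inside fclose(). The pattern being described reduces to roughly the following sketch, run by two processes at once; the path and the record written are made-up placeholders, not taken from the actual application:

/* Sketch of the concurrent append-and-close pattern described above. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Both processes open the same CephFS file in append mode.
       The path here is a placeholder. */
    FILE *f = fopen("/mnt/ceph/some/shared/output.txt", "a");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }

    /* The data is buffered in userspace here... */
    fprintf(f, "some appended record\n");

    /* ...and flushed by fclose(), so the blocking write() in the gdb
       backtrace is issued from inside the fclose() call. */
    if (fclose(f) != 0)
        perror("fclose");

    return 0;
}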
> It seems like the fuse client wasn't handling this case well: two processes writing to the same file and then closing it, perhaps? This is just speculation. Any ideas on how to proceed? Is there perhaps a known issue related to this?
>
> Thanks,
>
> Andras
> apataki@xxxxxxxxxxxxxxxxxxxxx
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com