On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> It took a while, but I believe I now have a reproducible test case for the capabilities-being-lost issue in CephFS that I wrote about a couple of weeks ago. The quick summary of the problem is that processes using CephFS often hang, either for a while or sometimes indefinitely. The fuse client believes it is waiting for some caps which it does not get from the MDS.
>
> A quick summary of the setup: the whole system is on Jewel 10.2.6. We are using the fuse client on all nodes for the file system.
>
> The full test program is uploaded together with the verbose logs of all clients and the MDS (at log level 20) to
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
> It essentially runs in a loop opening a file for read/write, reading from it and closing it. The read/write open is key: if the file is opened read-only, the problem doesn't happen:
>
> while (time(NULL) - t0 < INTERVAL) {
>
>     int fd = open(FILENAME, O_RDWR);
>     if (fd < 0) {
>         printf("Could not open file '%s' for read/write", FILENAME);
>         exit(1);
>     }
>
>     read(fd, buffer, 100);
>
>     close(fd);
>     count++;
> }
>
> With INTERVAL = 10 seconds, on one machine with a single process I get something like 30k opens per 10 seconds - an excellent result. When I run 3 processes on 3 different nodes, I get a few (perhaps 4 or 5) opens per 10 seconds per process. The particular case I collected logs for looks like this:
>
> apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109 ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110 ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108 ~/prg/ceph/test/timed_openrw_read < /dev/null &
> Fri Mar 31 07:33:04 EDT 2017
> [1] 53833
> [2] 53834
> [3] 53835
> apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
> Number of repeats: 3
> Number of repeats: 5
>
> [1]   Done    ssh worker1109 ~/prg/ceph/test/timed_openrw_read < /dev/null
> [2]-  Done    ssh worker1110 ~/prg/ceph/test/timed_openrw_read < /dev/null
> [3]+  Done    ssh worker1108 ~/prg/ceph/test/timed_openrw_read < /dev/null
> apataki@worker1108:~/prg/ceph/test$
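A self-contained sketch of that reproducer might look like the following; only the loop itself comes from the message above, so the surrounding main(), the buffer, the final count report, and the FILENAME value (taken from the test file mentioned in the log excerpt below) are filled-in assumptions rather than the actual uploaded test program:

/* Minimal sketch of the described reproducer: repeatedly open the same
   CephFS file for read/write, read a little, and close it. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Assumed values; the real program and paths are in the uploaded archive. */
#define FILENAME "/mnt/ceph/users/apataki/test/TEST-READ.txt"
#define INTERVAL 10   /* seconds */

int main(void)
{
    char buffer[100];
    long count = 0;
    time_t t0 = time(NULL);

    while (time(NULL) - t0 < INTERVAL) {
        /* The read/write open is what matters; O_RDONLY avoids the problem. */
        int fd = open(FILENAME, O_RDWR);
        if (fd < 0) {
            printf("Could not open file '%s' for read/write\n", FILENAME);
            exit(1);
        }
        if (read(fd, buffer, sizeof(buffer)) < 0)
            perror("read");
        close(fd);
        count++;
    }

    printf("Number of repeats: %ld\n", count);
    return 0;
}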
> Looking at the client, it looks like there are 5 second pauses waiting for capabilities. It seems that it doesn't get a response, and perhaps at some frequency it polls back to find out what happened to its capabilities. Here is an example of such a 5 second gap from the client logs:
>
> 2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps need Fr want Fc
> 2017-03-31 07:33:05.464058 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:06.464180 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:07.464307 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:08.464421 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:09.464554 2aaaafd47700 20 client.3849178 trim_cache size 4 max 16384
> 2017-03-31 07:33:10.464680 2aaaafd47700 10 client.3849178 check_caps on 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0} open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478 caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0 dirty_or_tx 0] parents=0x55555f5d5340 0x55555f73ea00) wanted pAsxXsxFsxcrwb used - issued pAsLsXsFr revoking Fsc is_delayed=1
>
> The inode in question is 100024a7b1d, which corresponds to the test file /mnt/ceph/users/apataki/test/TEST-READ.txt
> When I grep for this inode in the MDS logs, it also has a gap over the 7:33:05 to 7:33:10 window. I've learned a bit about the caps by trying to reproduce the problem, but I'm afraid I don't understand enough of the MDS logic to see what the problem is there.
>
> The full logs are too large for the mailing list, so I've put them here:
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
>
> Some help/advice with this would be very much appreciated. Thanks in advance!

Thanks for the reproducer -- I've tried this against master ceph with several clients and I'm seeing hundreds of repeats per 10 second interval, although my clients are all on the same node. I got a similar result when setting the client to inject 10ms delays on messages to simulate being remote. I ran it in a loop with three clients several tens of times without seeing a hang.

I'm compiling jewel now to see if it behaves differently from master (with the optimistic notion that we might have already fixed the underlying cause...)

John

> Andras
>
> On 03/14/2017 12:55 PM, John Spray wrote:
> On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi John,
>
> I've checked the MDS session list, and the fuse client does appear on it with 'state' as 'open'. So both the fuse client and the MDS agree on an open connection.
>
> Attached is the log of the ceph fuse client at debug level 20. The MDS got restarted at 9:44:20, went through its startup, and was in an 'active' state in ceph -s by 9:45:20. As for the IP addresses in the logs, 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs, and 10.128.129.63 is the IP of the client the log is from.
>
> So it looks like the client is getting stuck waiting for some capabilities (the 7fff9c3f7700 thread in that log, which eventually completes an ll_write on inode 100024ebea8 after the MDS restart). Hard to say whether the MDS failed to send it the proper messages, or if the client somehow missed them.
>
> It would be useful to have equally verbose logs from the MDS side from earlier on, at the point that the client started trying to do the write.
> I wonder if you could see whether your MDS and client can handle being run at "debug mds = 20" and "debug client = 20" respectively for a while; then, when a client gets stuck, do the MDS restart, follow back in the client log to work out which inode it was stuck on, and then find the log areas on the MDS side relating to that inode number.
>
> BTW I see this "FsUsage" stuff in your log, which I don't recognise from mainline Ceph; are you running something modified?
>
> John
>
> As per another suggestion, I've also tried kick_stale_sessions on the fuse client, which didn't help (I guess because it doesn't think the session is stale).
> Let me know if there is anything else I can do to help.
>
> Andras
>
> On 03/13/2017 06:08 PM, John Spray wrote:
> On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Cephers,
>
> We're using the ceph file system with the fuse client, and lately some of our processes are getting stuck, seemingly waiting for fuse operations. At the same time, the cluster is healthy: no slow requests, all OSDs up and running, and both the MDS and the fuse client think that there are no pending operations. The situation is semi-reproducible: when I run various cluster jobs, some get stuck after a few hours of correct operation. The cluster is on ceph 10.2.5 and 10.2.6, and the fuse clients are 10.2.6, but I have tried 10.2.5 and 10.2.3, all of which have the same issue. This is on CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>
> Here are some details:
>
> The node with the stuck processes:
>
> [root@worker1070 ~]# ps -auxwww | grep 30519
> apataki  30519 39.8  0.9 8728064 5257588 ?  Dl  12:11  60:50 ./Arepo param.txt 2 6
> [root@worker1070 ~]# cat /proc/30519/stack
> [<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> [root@worker1070 ~]# ps -auxwww | grep 30533
> apataki  30533 39.8  0.9 8795316 5261308 ?  Sl  12:11  60:55 ./Arepo param.txt 2 6
> [root@worker1070 ~]# cat /proc/30533/stack
> [<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
> [<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
> [<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
> [<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
> [<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
> [<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
> [<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
> [<ffffffff811de55d>] vfs_write+0xbd/0x1e0
> [<ffffffff811defff>] SyS_write+0x7f/0xe0
> [<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> Presumably the second process is waiting on the first, which holds some lock ...
>
> The fuse client on the node:
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
> {
>     "metadata": {
>         "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
>         "ceph_version": "ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
>         "entity_id": "admin",
>         "hostname": "worker1070",
>         "mount_point": "\/mnt\/ceph",
>         "root": "\/"
>     },
>     "dentry_count": 40,
>     "dentry_pinned_count": 23,
>     "inode_count": 123,
>     "mds_epoch": 19041,
>     "osd_epoch": 462327,
>     "osd_epoch_barrier": 462326
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions
> {
>     "id": 3616543,
>     "sessions": [
>         {
>             "mds": 0,
>             "addr": "10.128.128.110:6800\/909443124",
>             "seq": 338,
>             "cap_gen": 0,
>             "cap_ttl": "2017-03-13 14:47:37.575229",
>             "last_cap_renew_request": "2017-03-13 14:46:37.575229",
>             "cap_renew_seq": 12694,
>             "num_caps": 713,
>             "state": "open"
>         }
>     ],
>     "mdsmap_epoch": 19041
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests
> {}
>
> The overall cluster health and the MDS:
>
> [root@cephosd000 ~]# ceph -s
>     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>      health HEALTH_WARN
>             noscrub,nodeep-scrub,require_jewel_osds flag(s) set
>      monmap e17: 3 mons at {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
>             election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
>       fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
>      osdmap e462328: 624 osds: 624 up, 624 in
>             flags noscrub,nodeep-scrub,require_jewel_osds
>       pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
>             2154 TB used, 1791 TB / 3946 TB avail
>                42496 active+clean
>   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr
>
> [root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
> {
>     "ops": [],
>     "num_ops": 0
> }
>
> The odd thing is that if I restart the MDS in this state, the client process wakes up and proceeds with its work without any errors, as if a request had been lost and was somehow retransmitted/restarted when the MDS got restarted and the fuse layer reconnected to it.
>
> Interesting. A couple of ideas for more debugging:
>
> * Next time you go through this process of restarting the MDS while there is a stuck client, first increase the client's logging ("ceph daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client 20"). Then we should get a clear sense of exactly what's happening on the MDS restart that's enabling the client to proceed.
> * When inspecting the client's "mds_sessions" output, also check the "session ls" output on the MDS side to make sure the MDS and the client both agree that there is an open session.
>
> John
>
> When I try to attach a gdb session to either of the client processes, gdb just hangs. However, right after the MDS restart gdb attaches to the process successfully, and shows that the hang happened on closing a file. In fact, it looks like both processes were trying to write to the same file opened with fopen("filename", "a") and close it:
>
> (gdb) where
> #0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
> #1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
> #2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
> #3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
> #4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
> ...
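One note on reading that backtrace: with a stream opened via fopen(), any buffered output is flushed when fclose() is called, which is why the blocking write() at the top of the stack is issued from inside fclose(). The pattern being described reduces to roughly the following sketch, run by two processes at once; the path and the record written are made-up placeholders, not taken from the actual application:

/* Sketch of the concurrent append-and-close pattern described above. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Both processes open the same CephFS file in append mode.
       The path here is a placeholder. */
    FILE *f = fopen("/mnt/ceph/some/shared/output.txt", "a");
    if (f == NULL) {
        perror("fopen");
        exit(1);
    }

    /* The data is buffered in userspace here... */
    fprintf(f, "some appended record\n");

    /* ...and flushed by fclose(), so the blocking write() in the gdb
       backtrace is issued from inside the fclose() call. */
    if (fclose(f) != 0)
        perror("fclose");

    return 0;
}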
> It seems like the fuse client wasn't handling this case well: two processes writing to the same file and then closing it, perhaps? This is just speculation. Any ideas on how to proceed? Is there perhaps a known issue related to this?
>
> Thanks,
>
> Andras
> apataki@xxxxxxxxxxxxxxxxxxxxx
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com