Re: CephFS fuse client users stuck

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Fri, 31 Mar 2017 08:27:07 -0400

    Hi John,

    It took a while but I believe now I have a reproducible test case
    for the capabilities being lost issue in CephFS I wrote about a
    couple of weeks ago.  The quick summary of problem is that often
    processes hang using CephFS either for a while or sometimes
    indefinitely.  The fuse client believes it is waiting for some caps
    which it does not get from the MDS.

    A quick summary of the setup: the whole system is on Jewel 10.2.6. 
    We are using the fuse client on all nodes for the file system.

    The full test program is uploaded together with the verbose logs of
    all clients and the MDS (at log level 20) to
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/

    It essentially runs in a loop opening a file for read/write, reading
    from it and closing it.  The read/write open is key, if open the
    file read-only, the problem doesn't happen:

        while (time(NULL) - t0 < INTERVAL) {

              int fd = open(FILENAME, O_RDWR);

              if (fd < 0) {

                  printf("Could not open file '%s' for read/write",
        FILENAME);

                  exit(1);

              }

              read(fd, buffer, 100);

              close(fd);

              count++;

          }

    With INTERVAL = 10 seconds, on one machine with a single process I
    get something like 30k opens per 10 seconds - an excellent result. 
    When I run 3 processes on 3 different nodes, I get a few (perhaps 4
    or 5) opens per 10 seconds per process.  The particular case I
    collected logs for looks like this:

    apataki@worker1108:~/prg/ceph/test$ date; ssh
        worker1109 ~/prg/ceph/test/timed_openrw_read < /dev/null
        & ssh worker1110 ~/prg/ceph/test/timed_openrw_read <
        /dev/null & ssh worker1108 ~/prg/ceph/test/timed_openrw_read
        < /dev/null &

      Fri Mar 31 07:33:04 EDT 2017

      [1] 53833

      [2] 53834

      [3] 53835

      apataki@worker1108:~/prg/ceph/test$ Number of
          repeats: 4

        Number of repeats: 3

        Number of repeats: 5

      [1]   Done                    ssh worker1109
        ~/prg/ceph/test/timed_openrw_read < /dev/null

      [2]-  Done                    ssh worker1110
        ~/prg/ceph/test/timed_openrw_read < /dev/null

      [3]+  Done                    ssh worker1108
        ~/prg/ceph/test/timed_openrw_read < /dev/null

      apataki@worker1108:~/prg/ceph/test$ 

    Looking at the client, it looks like there are 5 second pauses
    waiting for capabilities.  It seems that it doesn't get a response,
    and perhaps at some frequency it polls back to find out what
    happened to its capabilities.  Here is an example of such 5 second
    gaps from the client logs:

    2017-03-31 07:33:05.010111 2aaab1a57700 10
        client.3849178 waiting for caps need Fr want Fc

      2017-03-31 07:33:05.464058 2aaaafd47700 20 client.3849178
        trim_cache size 4 max 16384

      2017-03-31 07:33:06.464180 2aaaafd47700 20 client.3849178
        trim_cache size 4 max 16384

      2017-03-31 07:33:07.464307 2aaaafd47700 20 client.3849178
        trim_cache size 4 max 16384

      2017-03-31 07:33:08.464421 2aaaafd47700 20 client.3849178
        trim_cache size 4 max 16384

      2017-03-31 07:33:09.464554 2aaaafd47700 20 client.3849178
        trim_cache size 4 max 16384

      2017-03-31 07:33:10.464680 2aaaafd47700 10 client.3849178
        check_caps on 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894
        cap_refs={1024=0,2048=0} open={3=1} mode=100664 size=23/4194304
        mtime=2017-03-30 11:56:38.588478 caps=pAsLsXsFr(0=pAsLsXsFr)
        objectset[100024a7b1d ts 0/0 objects 0 dirty_or_tx 0]
        parents=0x55555f5d5340 0x55555f73ea00) wanted pAsxXsxFsxcrwb
        used - issued pAsLsXsFr revoking Fsc is_delayed=1

    The inode in question is 100024a7b1d which corresponds to the test
    file /mnt/ceph/users/apataki/test/TEST-READ.txt

    When I grep for this inode in the MDS logs, it also has a gap in the
    7:33:05 to 7:33:10 time.  I've learned some about the caps by trying
    to reproduce the problem, but I'm afraid I don't understand enough
    of the MDS logic to see what the problem is there.

    The full logs are too large for the mailing list, so I've put them
    here:
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/

    Some help/advice with this would very much be appreciated.  Thanks
    in advance!

    Andras

    On 03/14/2017 12:55 PM, John Spray
      wrote:

      On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:

        Hi John,

I've checked the MDS session list, and the fuse client does appear on that
with 'state' as 'open'.  So both the fuse client and the MDS agree on an
open connection.

Attached is the log of the ceph fuse client at debug level 20.  The MDS got
restarted at 9:44:20, and it went through its startup, and was in an
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
10.128.129.63 is the IP of the client the log is from.

      So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes a ll_write on inode 100024ebea8 after the MDS restart).
Hard to say whether the MDS failed to send it the proper messages, or
if the client somehow missed it.

It would be useful to have equally verbose logs from the MDS side from
earlier on, at the point that the client started trying to do the
write.  I wonder if you could see if your MDS+client can handle both
being run at "debug mds = 20", "debug client = 20" respectively for a
while, then when a client gets stuck, do the MDS restart, and follow
back in the client log to work out which inode it was stuck on, then
find log areas on the MDS side relating to that inode number.

BTW I see this "FsUsage" stuff in your log, which I don't recognise
from mainline Ceph, are you running something modified?

John

        As per another suggestion, I've also tried kick_stale_sessions on the fuse
client, which didn't help (I guess since it doesn't think the session is
stale).
Let me know if there is anything else I can do to help.

Andras

On 03/13/2017 06:08 PM, John Spray wrote:

          On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:

            Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.
At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run a
various cluster jobs, some get stuck after a few hours of correct
operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6,
but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is
on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ?     Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[<ffffffffa0a1d7bb>] fuse_file_aio_write+0xbb/0x340 [fuse]
[<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
[<ffffffff811de55d>] vfs_write+0xbd/0x1e0
[<ffffffff811defff>] SyS_write+0x7f/0xe0
[<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ?     Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[<ffffffffa0a12241>] wait_answer_interruptible+0x91/0xe0 [fuse]
[<ffffffffa0a12653>] __fuse_request_send+0x253/0x2c0 [fuse]
[<ffffffffa0a126d2>] fuse_request_send+0x12/0x20 [fuse]
[<ffffffffa0a1b966>] fuse_send_write+0xd6/0x110 [fuse]
[<ffffffffa0a1d45d>] fuse_perform_write+0x2ed/0x590 [fuse]
[<ffffffffa0a1d9a1>] fuse_file_aio_write+0x2a1/0x340 [fuse]
[<ffffffff811ddd3d>] do_sync_write+0x8d/0xd0
[<ffffffff811de55d>] vfs_write+0xbd/0x1e0
[<ffffffff811defff>] SyS_write+0x7f/0xe0
[<ffffffff816458c9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Presumably the second process is waiting on the first holding some lock
...

The fuse client on the node:

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
status
{
     "metadata": {
         "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
         "ceph_version": "ceph version 10.2.6
(656b5b63ed7c43bd014bcafd81b001959d5f089f)",
         "entity_id": "admin",
         "hostname": "worker1070",
         "mount_point": "\/mnt\/ceph",
         "root": "\/"
     },
     "dentry_count": 40,
     "dentry_pinned_count": 23,
     "inode_count": 123,
     "mds_epoch": 19041,
     "osd_epoch": 462327,
     "osd_epoch_barrier": 462326
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_sessions
{
     "id": 3616543,
     "sessions": [
         {
             "mds": 0,
             "addr": "10.128.128.110:6800\/909443124",
             "seq": 338,
             "cap_gen": 0,
             "cap_ttl": "2017-03-13 14:47:37.575229",
             "last_cap_renew_request": "2017-03-13 14:46:37.575229",
             "cap_renew_seq": 12694,
             "num_caps": 713,
             "state": "open"
         }
     ],
     "mdsmap_epoch": 19041
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_requests
{}

The overall cluster health and the MDS:

[root@cephosd000 ~]# ceph -s
     cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
      health HEALTH_WARN
             noscrub,nodeep-scrub,require_jewel_osds flag(s) set
      monmap e17: 3 mons at

{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
             election epoch 29148, quorum 0,1,2
hyperv029,hyperv030,hyperv031
       fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
      osdmap e462328: 624 osds: 624 up, 624 in
             flags noscrub,nodeep-scrub,require_jewel_osds
       pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
             2154 TB used, 1791 TB / 3946 TB avail
                42496 active+clean
   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

[root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok
ops
{
     "ops": [],
     "num_ops": 0
}

The odd thing is that if in this state I restart the MDS, the client
process
wakes up and proceeds with its work without any errors.  As if a request
was
lost and somehow retransmitted/restarted when the MDS got restarted and
the
fuse layer reconnected to it.

          Interesting.  A couple of ideas for more debugging:

* Next time you go through this process of restarting the MDS while
there is a stuck client, first increase the client's logging (ceph
daemon <path to /var/run/ceph/ceph-<id>.asok> config set debug_client
20").  Then we should get a clear sense of exactly what's happening on
the MDS restart that's enabling the client to proceed.
* When inspecting the client's "mds_sessions" output, also check the
"session ls" output on the MDS side to make sure the MDS and client
both agree that it has an open session.

John

            When I try to attach a gdb session to either of the client processes, gdb
just hangs.  However, right after the MDS restart gdb attaches to the
process successfully, and shows that the getting stuck happened on
closing
of a file.  In fact, it looks like both processes were trying to write to
the same file opened with fopen("filename", "a") and close it:

(gdb) where
#0  0x00002aaaadc53abd in write () from /lib64/libc.so.6
#1  0x00002aaaadbe2383 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00002aaaadbe37ec in __GI__IO_do_write () from /lib64/libc.so.6
#3  0x00002aaaadbe30e0 in __GI__IO_file_close_it () from /lib64/libc.so.6
#4  0x00002aaaadbd7020 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
...

It seems like the fuse client wasn't handling this case well, when two
processes write to the same file and close it perhaps?  This is just a
speculation.  Any ideas on how to proceed?  Is there perhaps a known
issue
related to this?

Thanks,

Andras
apataki@xxxxxxxxxxxxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com