Revisiting: Many clients (X) failing to respond to cache pressure

Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> · Tue, 13 Dec 2016 17:03:53 +1100



    Hi Ceph(FS)ers...

    
    I am currently running in production the following environment:

    - ceph/cephfs in 10.2.2. 

      - All infrastructure is in the same version (rados cluster, mons,
      mds and cephfs clients). 

      - We mount cephfs using ceph-fuse.

    Since yesterday our cluster has been in warning state with the
    message "mds0: Many clients (X) failing to respond to cache
    pressure". X has been changing with time, from ~130 to ~70. I
    am able to correlate the appearance of this message with bursts of
    jobs in our cluster.

    
    This subject has been discussed on the mailing list many times,
    and normally the recipe is to look for something wrong in the
    clients. So, I have tried to look at the clients first:

    
    1) I looped through all my clients and ran
      'ceph --admin-daemon /var/run/ceph/ceph-client.mount_user.asok
      status' on each of them to get the inode_count reported by each
      client. Summing over the collected output (all.txt):

      $ cat all.txt | grep inode_count | awk '{print $2}' |
        sed 's/,//g' | awk '{s+=$1} END {print s}'

        2407659

        
      2) I then compared this with the number of inodes the mds had in
      its cache (obtained by a perf dump):

        "inode_max": 2000000 and "inodes": 2413826

      
      3) I tried to find out how many clients held at least 16384
      inodes (the default) and got

      $ for i in `cat all.txt | grep inode_count | awk '{print $2}' |
        sed 's/,//g'`; do if [ $i -ge 16384 ]; then echo $i; fi; done |
        wc -l

        27

      
      4) My conclusion is that the bulk of the inodes is held by a
      couple of machines. However, while the majority of those are
      running user jobs, others are not doing anything at all. For
      example, an idle machine (no users logged in, no jobs running, and
      updatedb does not scan the cephfs filesystem) reported more than
      300000 inodes. To release those inodes, I had to umount and
      remount cephfs on that machine.

    
    5) Based on these observations, I suspect that there are still
      some problems in the ceph-fuse client with releasing these inodes
      (or that it happens at a very slow rate).
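
    Steps 1) and 3) can also be done in a single pass. Below is a minimal
    Python sketch, assuming all.txt holds the concatenated 'status' output
    of every client and that each client contributes one line of the form
    "inode_count": N, (the same assumption the grep/awk pipelines make).
    The function names and the sample text are hypothetical, not taken
    from the cluster.

```python
import re

# Matches lines such as:   "inode_count": 12345,
INODE_COUNT_RE = re.compile(r'"inode_count"\s*:\s*(\d+)')


def inode_counts(status_text):
    """Return the inode_count value of every client found in the text."""
    return [int(n) for n in INODE_COUNT_RE.findall(status_text)]


def summarize(counts, threshold=16384):
    """Total inodes held, and how many clients hold at least `threshold`."""
    heavy = [c for c in counts if c >= threshold]
    return {
        "clients": len(counts),
        "total": sum(counts),
        "at_or_above": len(heavy),
    }


if __name__ == "__main__":
    # Made-up sample standing in for all.txt; not real cluster output.
    sample = '''
        "inode_count": 300123,
        "inode_count": 512,
        "inode_count": 16384,
    '''
    print(summarize(inode_counts(sample)))
```

    Sorting the counts in descending order would additionally show which
    clients hold most of the inodes, which is the observation made in 4).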

    
    However, I also do not completely understand what is happening on
    the server side:

    
      6) The current memory usage of my mds is the following:

        PID   USER  PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+    COMMAND
        17831 ceph  20   0  13.667g  0.012t  10048  S  37.5  40.2  1068:47  ceph-mds

      
      The mds cache size is set to 2000000. Running 'ceph daemon
      mds.<id> perf dump', I get "inode_max": 2000000 and
      "inodes": 2413826. Assuming 4k per inode, one gets ~10G. So
      why is it taking much more than that?

      
      7) I have been running cephfs for more than a year and, looking
      at ganglia, the mds memory never decreases but always increases
      (even when we umount almost all the clients). Why does that
      happen?

      
      8) I am running 2 mds, in active / standby-replay mode. The memory
      of the standby-replay is much lower:

        PID   USER  PR  NI  VIRT     RES     SHR   S  %CPU  %MEM  TIME+     COMMAND
        716   ceph  20   0  6149424  5.115g  8524  S   1.2  43.6  53:19.74  ceph-mds

      If I trigger a restart on my active mds, the standby-replay
        will start acting as active, but will continue with the same
        amount of memory. Why can the second mds become active and do
        the same job, while the first one was using much more memory?
    
    
      9) Finally, I am sending an extract of 'ceph daemon
        mds.<id> perf dump' from my active and standby mdses. What
        exactly is the meaning of inodes_pin_tail, inodes_expired and
        inodes_with_caps? Is the standby mds supposed to show the same
        numbers? They don't...
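
      The arithmetic behind question 6) can be written out explicitly.
      This is only a sketch: the 4k-per-inode figure is the assumption
      stated in 6), top's "0.012t" is read as TiB and is rounded to
      three decimals, and the function names are hypothetical.

```python
GIB = 1024 ** 3


def cache_estimate_gib(inodes, bytes_per_inode=4096):
    """Estimated cache footprint in GiB for a given bytes-per-inode guess."""
    return inodes * bytes_per_inode / GIB


def overshoot_percent(inodes, inode_max):
    """How far the cache is above its configured limit, in percent."""
    return 100.0 * (inodes - inode_max) / inode_max


def top_tib_to_gib(tib):
    """Convert a RES value that top prints with a 't' suffix to GiB."""
    return tib * 1024


if __name__ == "__main__":
    inodes, inode_max = 2413826, 2000000      # from the perf dump
    res_gib = top_tib_to_gib(0.012)           # RES column of the active mds
    est_gib = cache_estimate_gib(inodes)
    print("estimated cache: %.2f GiB" % est_gib)
    print("resident memory: %.2f GiB (rounded by top)" % res_gib)
    print("bytes per inode implied by RES: %.0f" % (res_gib * GIB / inodes))
    print("inodes above inode_max: %.1f%%" % overshoot_percent(inodes, inode_max))
```

      Dividing RES by the inode count only gives an upper bound on the
      per-inode cost, since the process also holds memory that is not
      inode cache.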

      
    Thanks in advance for your answers /  suggestions
    Cheers
    Goncalo
    

      active:

          "mds": {
              "request": 93941296,
              "reply": 93940671,
              "reply_latency": {
                  "avgcount": 93940671,
                  "sum": 188398.004552299
              },
              "forward": 0,
              "dir_fetch": 309878,
              "dir_commit": 1736194,
              "dir_split": 0,
              "inode_max": 2000000,
              "inodes": 2413826,
              "inodes_top": 201,
              "inodes_bottom": 568,
              "inodes_pin_tail": 2413057,
              "inodes_pinned": 2413303,
              "inodes_expired": 19693168,
              "inodes_with_caps": 2409737,
              "caps": 2440565,
              "subtrees": 2,
              "traverse": 113291068,
              "traverse_hit": 57822611,
              "traverse_forward": 0,
              "traverse_discover": 0,
              "traverse_dir_fetch": 154708,
              "traverse_remote_ino": 1085,
              "traverse_lock": 66063,
              "load_cent": 9394314733,
              "q": 22,
              "exported": 0,
              "exported_inodes": 0,
              "imported": 0,
              "imported_inodes": 0
          },

      
      standby-replay:

          "mds": {
              "request": 0,
              "reply": 0,
              "reply_latency": {
                  "avgcount": 0,
                  "sum": 0.000000000
              },
              "forward": 0,
              "dir_fetch": 0,
              "dir_commit": 0,
              "dir_split": 0,
              "inode_max": 2000000,
              "inodes": 2000058,
              "inodes_top": 0,
              "inodes_bottom": 1993207,
              "inodes_pin_tail": 6851,
              "inodes_pinned": 124135,
              "inodes_expired": 10651484,
              "inodes_with_caps": 0,
              "caps": 0,
              "subtrees": 2,
              "traverse": 0,
              "traverse_hit": 0,
              "traverse_forward": 0,
              "traverse_discover": 0,
              "traverse_dir_fetch": 0,
              "traverse_remote_ino": 0,
              "traverse_lock": 0,
              "load_cent": 0,
              "q": 0,
              "exported": 0,
              "exported_inodes": 0,
              "imported": 0,
              "imported_inodes": 0
          },
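
      The counters that differ most between the two extracts can be put
      side by side. This is a small sketch using only numbers copied from
      the dumps; the function name is hypothetical, and reading "pinned"
      as "cannot currently be trimmed from the cache" is an assumption,
      not something the dumps themselves state.

```python
def cache_ratios(mds):
    """Summarize how much of an mds cache is pinned or has client caps."""
    inodes = mds["inodes"]
    return {
        "over_limit": inodes - mds["inode_max"],
        "pinned_fraction": mds["inodes_pinned"] / inodes,
        "with_caps_fraction": mds["inodes_with_caps"] / inodes,
    }


# Values copied from the "mds" sections of the two perf dumps.
ACTIVE = {
    "inode_max": 2000000,
    "inodes": 2413826,
    "inodes_pinned": 2413303,
    "inodes_with_caps": 2409737,
}
STANDBY = {
    "inode_max": 2000000,
    "inodes": 2000058,
    "inodes_pinned": 124135,
    "inodes_with_caps": 0,
}

if __name__ == "__main__":
    for name, mds in (("active", ACTIVE), ("standby-replay", STANDBY)):
        r = cache_ratios(mds)
        print("%-15s over limit: %7d  pinned: %.4f  with caps: %.4f" % (
            name, r["over_limit"], r["pinned_fraction"], r["with_caps_fraction"]))
```

      On the active mds nearly every cached inode is pinned and has
      client caps, while the standby-replay holds no caps at all and sits
      almost exactly at inode_max.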
    

    -- 
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
  
