After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous (12.2.2).  The upgrade went smoothly for the most part, except we seem to be hitting an issue with CephFS.  After about a day or two of use, the MDS starts complaining about clients failing to respond to cache pressure:

[root@cephmon00 ~]# ceph -s
  cluster:
    id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
            1 MDSs have many clients failing to respond to cache pressure
            noout flag(s) set
            1 osds down

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active), standbys: cephmon01, cephmon02
    mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
    osd: 2208 osds: 2207 up, 2208 in
         flags noout

  data:
    pools:   6 pools, 42496 pgs
    objects: 919M objects, 3062 TB
    usage:   9203 TB used, 4618 TB / 13822 TB avail
    pgs:     42470 active+clean
             22    active+clean+scrubbing+deep
             4     active+clean+scrubbing

  io:
    client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure; noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
    mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressureclient_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
    osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down

We are using the 12.2.2 fuse client exclusively, on roughly 350 nodes (about 100 of which are flagged as failing to respond to cache pressure in the health detail above).  When this happens, the clients also become pretty sluggish (listing directories, etc.).  After bouncing the MDS, everything returns to normal following the failover, at least for a while.  Please ignore the message about 1 OSD being down; that corresponds to a failed drive, and all of its data has since been re-replicated.
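
If it would help with diagnosis, we can grab per-client session details from the MDS admin socket the next time the warning fires.  A rough sketch of what we would run on the active MDS host (daemon name as in the output above; we have not captured this yet, so exact fields may differ):

    ceph daemon mds.cephmon00 session ls      # per-client sessions, including num_caps held by each ceph-fuse client
    ceph daemon mds.cephmon00 perf dump mds   # overall MDS cache counters (inodes, caps, etc.)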

We were also using the 12.2.2 fuse client with the Jewel back end before the upgrade and never saw this issue.

We are running with a larger MDS cache than usual: mds_cache_size is set to 4 million.  All other MDS settings are at their defaults.
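
For completeness, the only non-default MDS setting in our ceph.conf is essentially this (section and option spelling as we have it):

    [mds]
    mds_cache_size = 4000000

As far as we can tell we have not set the memory-based mds_cache_memory_limit that Luminous introduced, so that one should still be at its default.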

Is this a known issue?  If not, any hints on how to further diagnose the problem?

Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
