After Luminous upgrade: ceph-fuse clients failing to respond to cache pressure

Dear Cephers,

We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous (12.2.2).  The upgrade went smoothly for the most part, except we seem to be hitting an issue with CephFS.  After about a day or two of use, the MDS starts complaining about clients failing to respond to cache pressure:

[root@cephmon00 ~]# ceph -s
  cluster:
    id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
    health: HEALTH_WARN
            1 MDSs have many clients failing to respond to cache pressure
            noout flag(s) set
            1 osds down

  services:
    mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
    mgr: cephmon00(active), standbys: cephmon01, cephmon02
    mds: cephfs-1/1/1 up  {0=cephmon00=up:active}, 2 up:standby
    osd: 2208 osds: 2207 up, 2208 in
         flags noout

  data:
    pools:   6 pools, 42496 pgs
    objects: 919M objects, 3062 TB
    usage:   9203 TB used, 4618 TB / 13822 TB avail
    pgs:     42470 active+clean
             22    active+clean+scrubbing+deep
             4     active+clean+scrubbing

  io:
    client:   56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr

[root@cephmon00 ~]# ceph health detail
HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure; noout flag(s) set; 1 osds down
MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
    mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressureclient_count: 103
OSDMAP_FLAGS noout flag(s) set
OSD_DOWN 1 osds down
    osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down

We are using the 12.2.2 fuse client exclusively, on roughly 350 nodes (about 100 of which are flagged as failing to respond to cache pressure in the health detail above).  When this happens, the clients also become pretty sluggish (listing directories, etc.).  After bouncing the MDS, everything returns to normal following the failover, at least for a while.  Please ignore the message about 1 OSD being down; that corresponds to a failed drive, and all of its data has since been re-replicated.
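
If it would help with diagnosis, we can grab per-client session details from the MDS admin socket the next time the warning fires.  A rough sketch of what we would run on the active MDS host (daemon name as in the output above; we have not captured this yet, so exact fields may differ):

    ceph daemon mds.cephmon00 session ls      # per-client sessions, including num_caps held by each ceph-fuse client
    ceph daemon mds.cephmon00 perf dump mds   # overall MDS cache counters (inodes, caps, etc.)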

We were also using the 12.2.2 fuse client with the Jewel back end before the upgrade and never saw this issue.

We are running with a larger MDS cache than usual: mds_cache_size is set to 4 million.  All other MDS settings are at their defaults.
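
For completeness, the only non-default MDS setting in our ceph.conf is essentially this (section and option spelling as we have it):

    [mds]
    mds_cache_size = 4000000

As far as we can tell we have not set the memory-based mds_cache_memory_limit that Luminous introduced, so that one should still be at its default.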

Is this a known issue?  If not, any hints on how to further diagnose the problem?

Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
