On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Cephers,
>
> We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
> (12.2.2). The upgrade went smoothly for the most part, except we seem to be
> hitting an issue with cephfs. After about a day or two of use, the MDS
> starts complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed. Specifically, ceph-fuse checks the kernel version against
3.18.0 to decide which invalidation method to use, and if your OS has
backported new behaviour to a low-version-numbered kernel, that can
confuse it.

John

> [root@cephmon00 ~]# ceph -s
>   cluster:
>     id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
>     health: HEALTH_WARN
>             1 MDSs have many clients failing to respond to cache pressure
>             noout flag(s) set
>             1 osds down
>
>   services:
>     mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>     mgr: cephmon00(active), standbys: cephmon01, cephmon02
>     mds: cephfs-1/1/1 up {0=cephmon00=up:active}, 2 up:standby
>     osd: 2208 osds: 2207 up, 2208 in
>          flags noout
>
>   data:
>     pools:   6 pools, 42496 pgs
>     objects: 919M objects, 3062 TB
>     usage:   9203 TB used, 4618 TB / 13822 TB avail
>     pgs:     42470 active+clean
>              22    active+clean+scrubbing+deep
>              4     active+clean+scrubbing
>
>   io:
>     client: 56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>
> [root@cephmon00 ~]# ceph health detail
> HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
> noout flag(s) set; 1 osds down
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to
> cache pressure
>     mdscephmon00(mds.0): Many clients (103) failing to respond to cache
>     pressure client_count: 103
> OSDMAP_FLAGS noout flag(s) set
> OSD_DOWN 1 osds down
>     osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>
>
> We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
> (of which it seems about 100 are not responding to cache pressure in this
> log). When this happens, clients also appear pretty sluggish (listing
> directories, etc.). After bouncing the MDS, everything returns to normal
> for a while after the failover. Ignore the message about 1 OSD being down;
> it corresponds to a failed drive, and all its data has since been
> re-replicated.
>
> We were also using the 12.2.2 fuse client with the Jewel back end before
> the upgrade and did not see this issue.
>
> We are running with a larger MDS cache than usual: mds_cache_size is set
> to 4 million. All other MDS configs are the defaults.
>
> Is this a known issue? If not, any hints on how to further diagnose the
> problem?
>
> Andras
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
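
[Editor's note: the kernel-version check John describes above is, in essence, a string comparison against a 3.18.0 threshold. The sketch below is a minimal, hypothetical illustration of that kind of check and is not the actual ceph-fuse code; the function names and the meaning of the two branches are assumptions made purely for illustration.]

```cpp
// Illustrative only: a version-threshold check of the kind described above.
// NOT the actual ceph-fuse logic; branch meanings are assumed for the example.
#include <sys/utsname.h>
#include <cstdio>
#include <cstdlib>

// Parse "major.minor.patch[-extra]" from utsname.release into three ints.
static bool parse_kernel_release(const char *release, int v[3]) {
  return std::sscanf(release, "%d.%d.%d", &v[0], &v[1], &v[2]) == 3;
}

// Compare two {major, minor, patch} triples: negative, zero, or positive.
static int cmp_version(const int a[3], const int b[3]) {
  for (int i = 0; i < 3; ++i) {
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

int main() {
  struct utsname u;
  if (uname(&u) != 0) {
    std::perror("uname");
    return EXIT_FAILURE;
  }

  int running[3] = {0, 0, 0};
  const int threshold[3] = {3, 18, 0};  // the 3.18.0 boundary mentioned above

  if (!parse_kernel_release(u.release, running)) {
    std::fprintf(stderr, "could not parse kernel release '%s'\n", u.release);
    return EXIT_FAILURE;
  }

  // A check like this only sees the reported version string. A distribution
  // that backports newer cache-invalidation behaviour into a kernel that
  // still reports, say, 3.10.x will land in the "old kernel" branch anyway,
  // which is exactly the kind of confusion described in the reply above.
  if (cmp_version(running, threshold) >= 0)
    std::printf("kernel %s >= 3.18.0: newer invalidation path assumed\n", u.release);
  else
    std::printf("kernel %s <  3.18.0: fallback invalidation path assumed\n", u.release);

  return 0;
}
```

Compiled and run on a client host, a snippet like this only tells you which side of the 3.18.0 boundary the kernel's version string falls on; it cannot see what behaviour the distro has actually backported, which is why reporting the exact OS, kernel and fuse versions (as asked above) matters.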