On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
> Dear Cephers,
>
> We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
> (12.2.2). The upgrade went smoothly for the most part, except we seem to be
> hitting an issue with cephfs. After about a day or two of use, the MDS
> starts complaining about clients failing to respond to cache pressure:

What's the OS, kernel version and fuse version on the hosts where the
clients are running?

There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed. Specifically, ceph-fuse checks the kernel version against
3.18.0 to decide which invalidation method to use, and if your OS has
backported new behaviour to a low-version-numbered kernel, that can
confuse it.

John

> [root@cephmon00 ~]# ceph -s
>   cluster:
>     id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
>     health: HEALTH_WARN
>             1 MDSs have many clients failing to respond to cache pressure
>             noout flag(s) set
>             1 osds down
>
>   services:
>     mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>     mgr: cephmon00(active), standbys: cephmon01, cephmon02
>     mds: cephfs-1/1/1 up {0=cephmon00=up:active}, 2 up:standby
>     osd: 2208 osds: 2207 up, 2208 in
>          flags noout
>
>   data:
>     pools:   6 pools, 42496 pgs
>     objects: 919M objects, 3062 TB
>     usage:   9203 TB used, 4618 TB / 13822 TB avail
>     pgs:     42470 active+clean
>              22    active+clean+scrubbing+deep
>              4     active+clean+scrubbing
>
>   io:
>     client: 56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>
> [root@cephmon00 ~]# ceph health detail
> HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
> noout flag(s) set; 1 osds down
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to
> cache pressure
>     mdscephmon00(mds.0): Many clients (103) failing to respond to cache
>     pressure client_count: 103
> OSDMAP_FLAGS noout flag(s) set
> OSD_DOWN 1 osds down
>     osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>
>
> We are using exclusively the 12.2.2 fuse client on about 350 nodes or so
> (of which it seems about 100 are not responding to cache pressure in this
> log). When this happens, clients also appear pretty sluggish (listing
> directories, etc.). After bouncing the MDS, everything returns to normal
> for a while after the failover. Ignore the message about 1 OSD being down;
> it corresponds to a failed drive, and all its data has since been
> re-replicated.
>
> We were also using the 12.2.2 fuse client with the Jewel back end before
> the upgrade and did not see this issue.
>
> We are running with a larger MDS cache than usual: mds_cache_size is set
> to 4 million. All other MDS configs are the defaults.
>
> Is this a known issue? If not, any hints on how to further diagnose the
> problem?
>
> Andras
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
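
[Editor's note: the kernel-version check John describes above is, in essence, a string comparison against a 3.18.0 threshold. The sketch below is a minimal, hypothetical illustration of that kind of check and is not the actual ceph-fuse code; the function names and the meaning of the two branches are assumptions made purely for illustration.]

```cpp
// Illustrative only: a version-threshold check of the kind described above.
// NOT the actual ceph-fuse logic; branch meanings are assumed for the example.
#include <sys/utsname.h>
#include <cstdio>
#include <cstdlib>

// Parse "major.minor.patch[-extra]" from utsname.release into three ints.
static bool parse_kernel_release(const char *release, int v[3]) {
  return std::sscanf(release, "%d.%d.%d", &v[0], &v[1], &v[2]) == 3;
}

// Compare two {major, minor, patch} triples: negative, zero, or positive.
static int cmp_version(const int a[3], const int b[3]) {
  for (int i = 0; i < 3; ++i) {
    if (a[i] != b[i])
      return a[i] < b[i] ? -1 : 1;
  }
  return 0;
}

int main() {
  struct utsname u;
  if (uname(&u) != 0) {
    std::perror("uname");
    return EXIT_FAILURE;
  }

  int running[3] = {0, 0, 0};
  const int threshold[3] = {3, 18, 0};  // the 3.18.0 boundary mentioned above

  if (!parse_kernel_release(u.release, running)) {
    std::fprintf(stderr, "could not parse kernel release '%s'\n", u.release);
    return EXIT_FAILURE;
  }

  // A check like this only sees the reported version string. A distribution
  // that backports newer cache-invalidation behaviour into a kernel that
  // still reports, say, 3.10.x will land in the "old kernel" branch anyway,
  // which is exactly the kind of confusion described in the reply above.
  if (cmp_version(running, threshold) >= 0)
    std::printf("kernel %s >= 3.18.0: newer invalidation path assumed\n", u.release);
  else
    std::printf("kernel %s <  3.18.0: fallback invalidation path assumed\n", u.release);

  return 0;
}
```

Compiled and run on a client host, a snippet like this only tells you which side of the 3.18.0 boundary the kernel's version string falls on; it cannot see what behaviour the distro has actually backported, which is why reporting the exact OS, kernel and fuse versions (as asked above) matters.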