Hi John,

Some other symptoms of the problem: when the MDS has been running for a few days, it starts looking really busy, and at that point listing directories becomes really slow. An "ls -l" on a directory with about 250 entries takes about 2.5 seconds. All the metadata is on OSDs with NVMe backing stores. Interestingly enough, the memory usage seems pretty low compared to the allowed cache limit:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1604408 ceph      20   0 3710304 2.387g  18360 S 100.0  0.9 757:06.92 /usr/bin/ceph-mds -f --cluster ceph --id cephmon00 --setuser ceph --setgroup ceph

Once I bounce it (fail it over), the CPU usage goes down to the 10-25% range, and the same "ls -l" after the bounce takes about 0.5 seconds. I remounted the filesystem before each test to make sure nothing was cached on the client side.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 111100 ceph      20   0 6537052 5.864g  18500 S  17.6  2.3   9:23.55 /usr/bin/ceph-mds -f --cluster ceph --id cephmon02 --setuser ceph --setgroup ceph

I also have a crawler that walks the file system periodically. Normally a full crawl takes about 24 hours, but with the MDS slowing down it has now been running for more than 2 days and isn't close to finishing.

The MDS-related settings we are running with are:

    mds_cache_memory_limit = 17179869184
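For completeness, this is roughly the procedure behind the timings above; the mount point and directory below are placeholders for our setup, and failing rank 0 assumes the usual single active MDS:

    # remount first so nothing is cached on the client side
    umount /mnt/cephfs && ceph-fuse /mnt/cephfs
    # time the listing (~2.5 s with the busy MDS, ~0.5 s right after a failover)
    time ls -l /mnt/cephfs/some/dir
    # "bouncing" the MDS = failing the active rank so a standby takes over
    ceph mds fail 0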
Andras

On 01/17/2018 01:11 PM, John Spray wrote:
> On Wed, Jan 17, 2018 at 3:36 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>> Hi John,
>>
>> All our hosts are CentOS 7 hosts; the majority are 7.4 with kernel 3.10.0-693.5.2.el7.x86_64 and fuse 2.9.2-8.el7. We have some hosts with slight variations in kernel versions; the oldest are a handful of CentOS 7.3 hosts with kernel 3.10.0-514.21.1.el7.x86_64 and fuse 2.9.2-7.el7. I know Red Hat has been backporting lots of stuff, so perhaps these kernels fall into the category you are describing?
>
> Quite possibly -- this issue was originally noticed on RHEL, so maybe the relevant bits made it back to CentOS recently. However, it looks like the fixes for that issue [1,2] are already in 12.2.2, so maybe this is something completely unrelated :-/
>
> The ceph-fuse executable does create an admin command socket in /var/run/ceph (named something like ceph-client...) that you can drive with "ceph daemon <socket> dump_cache", but the output is extremely verbose and low level and may not be informative.
>
> John
>
> 1. http://tracker.ceph.com/issues/21423
> 2. http://tracker.ceph.com/issues/22269
>
>> When the cache pressure problem happens, is there a way to know exactly which hosts are involved, and what items are in their caches, easily?
>>
>> Andras
>>
>> On 01/17/2018 06:09 AM, John Spray wrote:
>>> On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
>>>> Dear Cephers,
>>>>
>>>> We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous (12.2.2). The upgrade went smoothly for the most part, except we seem to be hitting an issue with cephfs. After about a day or two of use, the MDS starts complaining about clients failing to respond to cache pressure:
>>>
>>> What's the OS, kernel version and fuse version on the hosts where the clients are running?
>>>
>>> There have been some issues with ceph-fuse losing the ability to properly invalidate cached items when certain updated OS packages were installed. Specifically, ceph-fuse checks the kernel version against 3.18.0 to decide which invalidation method to use, and if your OS has backported new behaviour to a low-version-numbered kernel, that can confuse it.
>>>
>>> John
>>>
>>>> [root@cephmon00 ~]# ceph -s
>>>>   cluster:
>>>>     id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
>>>>     health: HEALTH_WARN
>>>>             1 MDSs have many clients failing to respond to cache pressure
>>>>             noout flag(s) set
>>>>             1 osds down
>>>>
>>>>   services:
>>>>     mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>>>>     mgr: cephmon00(active), standbys: cephmon01, cephmon02
>>>>     mds: cephfs-1/1/1 up {0=cephmon00=up:active}, 2 up:standby
>>>>     osd: 2208 osds: 2207 up, 2208 in
>>>>          flags noout
>>>>
>>>>   data:
>>>>     pools:   6 pools, 42496 pgs
>>>>     objects: 919M objects, 3062 TB
>>>>     usage:   9203 TB used, 4618 TB / 13822 TB avail
>>>>     pgs:     42470 active+clean
>>>>              22    active+clean+scrubbing+deep
>>>>              4     active+clean+scrubbing
>>>>
>>>>   io:
>>>>     client:  56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>>>>
>>>> [root@cephmon00 ~]# ceph health detail
>>>> HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure; noout flag(s) set; 1 osds down
>>>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache pressure
>>>>     mdscephmon00(mds.0): Many clients (103) failing to respond to cache pressure
>>>>     client_count: 103
>>>> OSDMAP_FLAGS noout flag(s) set
>>>> OSD_DOWN 1 osds down
>>>>     osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>>>>
>>>> We are using the 12.2.2 fuse client exclusively on about 350 nodes or so (out of which it seems 100 are not responding to cache pressure in this log). When this happens, clients appear pretty sluggish as well (listing directories, etc.). After bouncing the MDS, everything returns to normal after the failover, for a while.
>>>>
>>>> Ignore the message about 1 OSD down; that corresponds to a failed drive, and all data has been re-replicated since. We were also using the 12.2.2 fuse client with the Jewel back end before the upgrade, and have not seen this issue. We are running with a larger MDS cache than usual; we have mds_cache_size set to 4 million. All other MDS configs are the defaults.
>>>>
>>>> Is this a known issue? If not, any hints on how to further diagnose the problem?
>>>>
>>>> Andras
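PS: Partially answering my own question above about identifying the clients involved: the MDS admin socket can list the client sessions (including client metadata such as the hostname, and the number of caps each client holds), and the client-side socket John mentions can dump a client's cache. A rough sketch; the daemon id below is from our cluster, and the exact socket name under /var/run/ceph varies:

    # on the host running the active MDS: one entry per client session
    ceph daemon mds.cephmon00 session ls
    # on a client node: find the ceph-fuse admin socket and dump its cache (very verbose)
    ls /var/run/ceph/
    ceph daemon /var/run/ceph/ceph-client.admin.asok dump_cache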
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com