Hello all,

I've got a Ceph (12.2.8) cluster with 27 servers, 500 OSDs, and 1000 CephFS mounts (kernel client). We're currently only using 1 active MDS.

Performance is great about 80% of the time. MDS responses (per ceph daemonperf mds.$(hostname -s)) show 2k-9k requests per second, with a latency under 100. It's the other 20-ish percent I'm worried about. I'll check on it and it will go 5-15 seconds with "0" requests and "0" latency, then give me 2 seconds of reasonable response times, and then go back to nothing. Clients are actually seeing blocked requests during these periods.

The strange bit is that when I *reduce* the mds_cache_size, requests and latencies go back to normal for a while. When it happens again, I'll increase it back to where it was. It feels like the MDS decides that some of these inodes can't be dropped from the cache unless the cache size changes. Maybe something is wrong with the LRU?

I feel like I've got a reasonable cache size for my workload: 30 GB on the small end, 55 GB on the large. There's no real reason for a swing that large, other than hoping the larger size delays the next occurrence for longer. I also suspect there's some magic tunable that changes how inodes get stuck in the LRU, perhaps mds_cache_mid. Does anyone know what this tunable actually does? The documentation is a little sparse.

I can grab logs from the MDS if needed; just let me know which settings you'd like to see.

--
Adam
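P.S. In case concrete commands help, this is roughly what I mean by changing the cache size at runtime. The daemon name (mds.ceph-mds01) and the byte values are just placeholders; I'm showing mds_cache_memory_limit (the byte-based limit in luminous) since I'm thinking in GB, but the same injectargs approach works for the older mds_cache_size inode count:

    # placeholder daemon name; substitute your active MDS
    MDS=mds.ceph-mds01

    # check what the cache currently thinks it is using, and the current knobs
    ceph daemon $MDS cache status
    ceph daemon $MDS config get mds_cache_memory_limit
    ceph daemon $MDS config get mds_cache_mid

    # shrink the limit (~30 GiB), then raise it again once things recover (~55 GiB)
    ceph tell $MDS injectargs '--mds_cache_memory_limit=32212254720'
    ceph tell $MDS injectargs '--mds_cache_memory_limit=59055800320'

And for logs, something along these lines is what I'd plan to capture during one of the stalls (again, placeholder daemon name; the debug level is just a guess at what's useful, so tell me if you want something else):

    # temporarily raise MDS debug logging on the active MDS
    ceph tell mds.ceph-mds01 injectargs '--debug_mds=10'

    # while a stall is happening, grab in-flight ops and perf counters
    ceph daemon mds.ceph-mds01 dump_ops_in_flight > ops_in_flight.json
    ceph daemon mds.ceph-mds01 perf dump > perf_dump.json

    # drop logging back to the default, then collect /var/log/ceph/ceph-mds.*.log
    ceph tell mds.ceph-mds01 injectargs '--debug_mds=1/5'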