Hi Chris.

> Interestingly, when duration gets long and performance gets bad ...

This observation is likely due to the MDS and client caches. My experience with ceph's cache implementations is that, well, they seem not that great. I think what we both observe is that with an empty cache everything works fine. As soon as the cache starts saturating, memory needs to be freed, and this seems to result in heavy fragmentation, making certain operations (for example, alloc and free) slower and slower. There was a ceph-users thread discussing performance as a function of cache size, and the finding was that performance increases as the cache size is reduced. Since then I use the following MDS settings (a sketch for applying them at runtime follows in the P.S.):

client_cache_size = 8192
mds_cache_memory_limit = 17179869184
mds_cache_reservation = 0.500000
mds_max_caps_per_client = 65536
mds_min_caps_per_client = 4096
mds_recall_max_caps = 32768

I'm thinking about increasing the mid-point (mds_cache_reservation) even further, so that a large amount of cache memory stays free to absorb load bursts and is released when the burst is over.

About your graphs: IO peaks after an MDS failover are expected. Clients continue to issue IO requests, which land in the system's buffers; once the MDS is back up, these buffered ops get applied, leading to a temporary increase in load. The only exceptional rise is the last one, which might have coincided with someone starting to do a lot of IO.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
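
P.S. In case it is useful, here is a minimal sketch of applying the settings above at runtime through the centralized config database, assuming a Mimic-or-later cluster (on older releases the values go into ceph.conf and the daemons need a restart). The option names and values are exactly the ones from my list; the syntax is the standard ceph CLI:

    # MDS-side cache and caps settings
    ceph config set mds mds_cache_memory_limit 17179869184
    ceph config set mds mds_cache_reservation 0.500000
    ceph config set mds mds_max_caps_per_client 65536
    ceph config set mds mds_min_caps_per_client 4096
    ceph config set mds mds_recall_max_caps 32768
    # client-side cache size
    ceph config set client client_cache_size 8192

A running MDS should pick these up without a restart; "ceph config show mds.<name>" (with your daemon's id in place of <name>) shows the values a daemon is actually using.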