Hello fellow cephers,

I have been struggling with the stability of my Jewel cluster and, from what I can see, I am not the only one.

My setup is: 3 osd+mon servers, 30 osds, half a dozen client hosts for rbd access, and a 40 Gbit/s InfiniBand link. All ceph servers are running Ubuntu 16.04; the clients are on Ubuntu 14.04.

Problems that I've experienced since upgrading to Jewel: slow/blocked requests, ceph-osd crashes, and memory leaks in ceph-mon and ceph-osd that exhaust memory and get the ceph-osd/ceph-mon processes killed. (Slow/blocked requests have been a problem for me for years.)

What I tried initially was rebooting my osd/mon servers every night. This solved ALL of my problems. At least, I've not had any slow/blocked requests for over a week now. Before that, there were somewhere between 50 and 2K slow requests per day. Obviously, rebooting servers on a daily basis is not ideal, to say the least.

For the last 3 days I have been running the 4.9.8 kernel from the Ubuntu builds, together with a cron script that clears the page cache twice a day on each osd/mon server. I am not rebooting the servers. This has also solved my slow/blocked requests: I've not seen a single one in 3 days.

However, I do see one of my ceph-mon processes leaking memory and consuming about 3-4 GB of additional RAM per day. I let it grow past 8 GB and then restarted the ceph-mon, which seems to have stopped the leak for now: after 24 hours or so, the process is using less than 1 GB of RAM on that server.

I've not made any changes to the ceph clients. They are still running the same as before.

I thought I'd share this, as other people might be experiencing similar troubles with 10.2.5 or other minor versions. Also, if anyone has an idea how to improve things, please share.

Cheers

Andrei
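
For anyone wanting to try the page-cache workaround, a cache-clearing job of the kind mentioned above could be sketched roughly as follows. This is only an illustration, not the exact script running on this cluster: the function name, the default mode ("1", which drops the page cache only) and the schedule shown afterwards are all assumptions. It relies on the standard Linux /proc/sys/vm/drop_caches interface and has to run as root.

```python
#!/usr/bin/env python3
"""Drop the Linux page cache; meant to be run as root, e.g. from cron."""
import os

# Standard Linux sysctl file for dropping caches.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES, mode="1"):
    """Flush dirty pages, then ask the kernel to drop clean cached data.

    mode "1" = page cache only, "2" = dentries and inodes, "3" = both.
    The path argument exists only so the function can be tried out
    against an ordinary file instead of the real sysctl.
    """
    if mode not in ("1", "2", "3"):
        raise ValueError("mode must be '1', '2' or '3'")
    # Only clean pages can be dropped, so write dirty pages out first.
    os.sync()
    with open(path, "w") as f:
        f.write(mode + "\n")
```

To use it, call drop_page_cache() at the end of the script and schedule the script twice a day, for example with an /etc/cron.d entry such as "0 */12 * * * root /usr/local/bin/drop_page_cache.py" (the path is hypothetical).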