On Tue, 2015-12-01 at 13:51 -0600, Ryan Tokarek wrote:
> > On Nov 30, 2015, at 6:52 PM, Laurent GUERBY <laurent@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > We lost a disk today in our ceph cluster, so we added a new machine with
> > four disks to replace the capacity, and we activated the straw1 tunable too
> > (we also tried straw2 but quickly backed out that change).
> >
> > During recovery, OSDs started crashing on all of our machines;
> > the issue is OSD RAM usage that goes very high, e.g.:
> >
> > 24078 root 20 0 27.784g 0.026t 10888 S 5.9 84.9 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> > /dev/sda1 2.7T 2.2T 514G 82% /var/lib/ceph/osd/ceph-41
> >
> > That's about 8 GB of resident RAM per TB of disk, way above
> > what we provisioned (~2-4 GB RAM/TB).
>
> We had something vaguely similar (not nearly that dramatic, though!) happen to us. During a recovery (actually, I think it was rebalancing after upgrading from an earlier version of ceph), our OSDs took so much memory that they kept getting killed by the oom_killer, and we couldn't keep the cluster up long enough to get back to healthy.
>
> A solution for us was to enable zswap; previously we had been running with no swap at all.
>
> If you are running a kernel newer than 3.11 (you might want more recent than that, as I believe there were major fixes after 3.17), then enabling zswap allows the kernel to compress pages in memory before needing to touch disk. The default maximum pool size for this is 20% of memory. There is extra CPU time to compress/decompress, but it's much faster than going to disk, and the OSD data appears to be quite compressible. For us, nothing actually made it to the disk, but a swap file must be enabled for zswap to do its work.
>
> https://www.kernel.org/doc/Documentation/vm/zswap.txt
> http://askubuntu.com/questions/471912/zram-vs-zswap-vs-zcache-ultimate-guide-when-to-use-which-one
>
> Add "zswap.enabled=1" to your kernel boot parameters and reboot.
>
> If you have no swap file/partition/disk/whatever, then you need one for zswap to actually do anything. Here is an example, but use whatever sizes, locations, and process you prefer:
>
> dd if=/dev/zero of=/var/swap bs=1M count=8192
> chmod 600 /var/swap
> mkswap /var/swap
> swapon /var/swap
>
> Consider adding it to /etc/fstab:
> /var/swap swap swap defaults 0 0
>
> This got us through the rebalancing. The OSDs eventually returned to normal, but we've just left zswap enabled with no apparent problems. I don't know whether it will be enough for your situation, but it might help.
>
> Ryan

Hi Ryan,

Thanks for your suggestion! We also managed to recover the cluster, after about 15 hours of trying.

We added a 64 GB swap file to each host (placed on the OSD disks...), enabled the noout, nobackfill, norebalance, norecover, noscrub, nodeep-scrub and notieragent flags, stopped all ceph clients, manually kept the OSDs that restarted too often stopped for a few hours (that action seemed to help the most to stabilize things), periodically sent "ceph tell osd.N heap release", and manually restarted any suspicious OSD (slow requests, an OSD stuck "currently waiting for rw locks", or indecent RAM use).

On the "waiting for rw locks" issue, backporting http://tracker.ceph.com/issues/13821 to 0.94.6 might help.

Loic, is there a test for a cluster where the OSDs get near the max RAM of the host (e.g. lots of small objects per PG and a small amount of memory on the node), then one third of the OSDs are killed, and the test checks that the cluster recovers on the two remaining thirds without getting OOM-killed?
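In case it helps anyone searching the archives, here is roughly what the zswap part looks like. This is only a sketch assuming a GRUB-based Debian/Ubuntu-style host; the persistent step differs on other distributions:

# Runtime toggle (takes effect immediately, not persistent across reboots):
echo 1 > /sys/module/zswap/parameters/enabled
# Default max pool size is 20% of RAM; it can be checked (or changed) here:
cat /sys/module/zswap/parameters/max_pool_percent

# To make it persistent, add zswap.enabled=1 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then regenerate the config and reboot:
update-grub
reboot

And the flag/heap-release part of what we did boils down to something like the following. This is a sketch, not a script we ran as-is; the sleep interval is arbitrary:

# Quiesce the cluster: no marking out, no data movement, no scrubbing.
for flag in noout nobackfill norebalance norecover noscrub nodeep-scrub notieragent; do
    ceph osd set $flag
done

# Periodically ask every OSD to return freed heap pages to the OS,
# while watching per-process RSS with top on each host.
while true; do
    for id in $(ceph osd ls); do
        ceph tell osd.$id heap release
    done
    sleep 300
done

Once things are stable again, the same flags are removed with "ceph osd unset".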
Next step would be to periodically stop, wait, and restart a given number of OSDs and see if things stabilize RAM-wise (a rough sketch follows below).

Sincerely,

Laurent
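P.S. For completeness, a rough sketch of the stop/wait/restart loop mentioned above, assuming hammer-era sysvinit scripts on the OSD host; the OSD ids and wait times are placeholders, and the commands would need adjusting for upstart/systemd:

for id in 41 42 43 44; do            # placeholder: OSDs local to this host
    /etc/init.d/ceph stop osd.$id
    sleep 600                        # give the rest of the cluster time to settle
    /etc/init.d/ceph start osd.$id
    sleep 600                        # let the restarted OSD's RAM usage stabilize
done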