Re: High 0.94.5 OSD memory use at 8GB RAM/TB raw disk during recovery

On Tue, 2015-12-01 at 13:51 -0600, Ryan Tokarek wrote:
> > On Nov 30, 2015, at 6:52 PM, Laurent GUERBY <laurent@xxxxxxxxxx> wrote:
> > 
> > Hi,
> > 
> > We lost a disk today in our ceph cluster so we added a new machine with
> > 4 disks to replace the capacity, and we activated the straw1 tunable too
> > (we also tried straw2 but quickly backed out that change).
> > 
> > During recovery, OSDs started crashing on all of our machines;
> > the issue is that OSD RAM usage goes very high, e.g.:
> > 
> > 24078 root      20   0 27.784g 0.026t  10888 S   5.9 84.9
> > 16:23.63 /usr/bin/ceph-osd --cluster=ceph -i 41 -f
> > /dev/sda1       2.7T  2.2T  514G  82% /var/lib/ceph/osd/ceph-41
> > 
> > That's about 8 GB of resident RAM per TB of disk, way above
> > the ~2-4 GB RAM/TB we provisioned.
> 
> Something vaguely similar (though not nearly that dramatic!) happened to us. During a recovery (actually, I think it was rebalancing after upgrading from an earlier version of ceph), our OSDs took so much memory that they kept getting killed by the oom_killer, and we couldn't keep the cluster up long enough to get back to healthy.
> 
> A solution for us was to enable zswap; previously we had been running with no swap at all. 
> 
> If you are running a kernel newer than 3.11 (you might want more recent than that, as I believe there were major fixes after 3.17), then enabling zswap allows the kernel to compress pages in memory before needing to touch disk. The default maximum pool size for this is 20% of memory. There is extra CPU time to compress/decompress, but it's much faster than going to disk, and the OSD data appears to be quite compressible. For us, nothing actually made it to disk, but a swap file must be enabled for zswap to do its work.
> 
> https://www.kernel.org/doc/Documentation/vm/zswap.txt
> http://askubuntu.com/questions/471912/zram-vs-zswap-vs-zcache-ultimate-guide-when-to-use-which-one
> 
> Add "zswap.enabled=1" to your kernel bool parameters and reboot. 
> 
> If you have no swap file/partition/disk/whatever, then you need one for zswap to actually do anything. Here is an example, but use whatever sizes, locations, and process you prefer:
> 
> dd if=/dev/zero of=/var/swap bs=1M count=8192
> chmod 600 /var/swap
> mkswap /var/swap
> swapon /var/swap
> 
> Consider adding it to /etc/fstab:
> /var/swap	swap	swap	defaults 0 0 
> 
> This got us through the rebalancing. The OSDs eventually returned to normal, but we've just left zswap enabled with no apparent problems. I don't know that it will be enough for your situation, but it might help. 
> 
> Ryan

Hi Ryan,

Thanks for your suggestion!

We also managed to recover the cluster after about 15 hours of trying. 

We added a 64G swap file to each host (carved out of the OSD disks...), set the
noout, nobackfill, norebalance, norecover, noscrub, nodeep-scrub and notieragent
flags, stopped all ceph clients, and kept the OSDs that restarted too often
manually stopped for a few hours (that action seemed to help the most in
stabilizing things), while periodically sending "ceph tell osd.N heap release"
and manually restarting any suspicious OSD (slow requests, endless
"currently waiting for rw locks", or indecent RAM use).

On the "waiting for rw locks" may be backporting to 0.94.6
http://tracker.ceph.com/issues/13821
would help.

Loic, is there a test for a cluster where the OSDs are pushed near the host's
maximum RAM (e.g. lots of small objects per PG, a small amount of memory on the
node), then one third of the OSDs are killed, checking that recovery completes
on the two surviving thirds without hitting the OOM killer? A next step would
be to periodically stop, wait, and restart a given number of OSDs and see
whether RAM use stabilizes.

Sincerely,

Laurent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


