RE: [ceph-users] Ceph killed by OS because of OOM under high load

Hi Greg,
	Yes, thanks for your advice. We did turn down osd_client_message_size_cap to 100MB per OSD, and both the journal queue and the filestore queue are capped at 100MB as well (sketched in ceph.conf form below the listing).
	That's 300MB per OSD in total, but from top we see:
		16527     1  14:49.01     0  7.1  20   0 S 1147m    0 532m     0 ceph-osd
		17505     1  10:54.32     0  6.5  20   0 S 1111m    0 482m     0 ceph-osd
		18015     1  12:21.79     0  6.0  20   0 S 1041m    0 449m     0 ceph-osd
		17799     1  13:29.92     0  5.9  20   0 S 1085m    0 437m     0 ceph-osd
		17320     1  12:19.25     0  5.8  20   0 S 1079m    0 434m     0 ceph-osd
		18946     1  12:38.01     0  5.7  20   0 S 1082m    0 428m     0 ceph-osd
		16983     1  12:04.97     0  5.1  20   0 S 1045m    0 383m     0 ceph-osd
		19120     1  12:57.84     0  4.9  20   0 S 1003m    0 367m     0 ceph-osd
		16300     1  12:17.18     0  4.8  20   0 S  979m    0 361m     0 ceph-osd
		18494     1  13:13.04     0  4.5  20   0 S  950m    0 339m     0 ceph-osd
		16662     1  13:14.84     0  4.3  20   0 S  964m    0 318m     0 ceph-osd
		18285     1  11:07.79     0  3.7  20   0 S  895m    0 276m     0 ceph-osd
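
	For reference, the three 100MB caps described above would look roughly like this in ceph.conf. This is only a sketch; the exact option spellings should be checked against the release in use:

		[osd]
		    ; cap on in-flight client data held by the OSD messenger
		    osd client message size cap = 104857600   ; 100 MB
		    ; cap on bytes queued for the journal
		    journal queue max bytes = 104857600       ; 100 MB
		    ; cap on bytes queued for the filestore
		    filestore queue max bytes = 104857600     ; 100 MB

	Note that these caps bound queued data only; heap overhead, messenger buffers, and PG/OSDMap metadata sit on top of them, which is presumably part of why resident memory ends up above the 300MB sum.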

	Most of the OSD daemons slightly exceed 300MB of resident memory, so which parts of the code/which modules account for that memory?
	Or, to put it as a more general question: is it possible to draw a pie chart showing the percentage of memory used by each module in an OSD?
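
	Not a pie chart, but the tcmalloc heap statistics and heap profiler built into the OSD daemons can give a rough breakdown by allocation site. A sketch of the workflow; the exact command syntax and dump paths vary between releases, so treat this as an outline rather than a recipe:

		# snapshot of current heap usage for one OSD
		ceph tell osd.0 heap stats

		# record allocations over a window of interest, then dump a profile
		ceph tell osd.0 heap start_profiler
		ceph tell osd.0 heap dump
		ceph tell osd.0 heap stop_profiler

		# analyze the dump against the ceph-osd binary with pprof from
		# google-perftools (sometimes installed as google-pprof); paths are assumptions
		pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap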

																								Xiaoxi
-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: June 4, 2013 0:37
To: Chen, Xiaoxi
Cc: ceph-devel@xxxxxxxxxxxxxxx; Mark Nelson (mark.nelson@xxxxxxxxxxx); ceph-users@xxxxxxxx
Subject: Re: [ceph-users] Ceph killed by OS because of OOM under high load

On Mon, Jun 3, 2013 at 8:47 AM, Chen, Xiaoxi <xiaoxi.chen@xxxxxxxxx> wrote:
> Hi,
>         As my previous mail reported some weeks ago, we have been suffering from OSD crashes, OSD flapping, system reboots, etc., and these stability issues are really stopping us from digging further into Ceph characterization.
>         The good news is that we seem to have found the cause; our experiments are explained below:
>
>         Environment:
>                 We have 2 machines, one for the client and one for Ceph, connected via 10GbE.
>                 The client machine is very powerful, with 64 cores and 256GB RAM.
>                 The Ceph machine has 32 cores and 64GB RAM, but we limited the available RAM to 8GB via the grub configuration. There are 12 OSDs on top of 12x 5400 RPM 1TB disks, with 4x DCS 3700 SSDs as journals.
>                 Both the client and Ceph are v0.61.2.
>                 We run 12 rados bench instances on the client node as stress against the Ceph node, each instance with 256 concurrent ops.
>         Experiment and result:
>                 1. Default Ceph + default client:  OK
>                 2. Tuned Ceph + default client:  FAIL. One OSD was killed by the OS due to OOM, and all swap space was exhausted. (Tuning: large queue ops / large queue bytes / no flusher / sync_flush = true)
>                 3. Tuned Ceph WITHOUT large queue bytes + default client:  OK
>                 4. Tuned Ceph WITHOUT large queue bytes + aggressive client:  FAIL. One OSD was killed by OOM and one committed suicide because of a 150s op thread timeout. (Aggressive client: objecter_inflight_ops and objecter_inflight_bytes both set to 10x their defaults; a client-side sketch of these options follows Greg's reply below.)
>
>         Conclusion.
>                 We would like to say,
>                 a.      Under heavy load, some tunings make Ceph unstable, especially the queue-bytes-related ones (deduced from 1+2+3).
>                 b.      Ceph does not bound the length of the OSD queue. This is a critical issue: with an aggressive client or many concurrent clients, the OSD queue can grow too large to fit in memory, and the OSD daemon gets killed. (deduced from 3+4)
>                 c.      Observing OSD daemon memory usage shows that if I use "killall rados" to kill all the rados bench instances, the ceph-osd daemons do not free the allocated memory; they retain very high usage (a freshly started Ceph uses ~0.5GB, under load it uses ~6GB, and after killing rados it still holds 5-6GB; restarting Ceph resolves this).

You don't have enough RAM for your OSDs. We really recommend 1-2GB per daemon; 600MB/daemon is dangerous. You might be able to make it work, but you'll definitely need to change the queue lengths and things.
Speaking of which...yes, the OSDs do control their queue lengths, but it's not dynamic tuning and by default it will let clients stack up 500MB of in-progress writes. With such wimpy systems you'll want to turn that down, probably alongside various journal and disk wait queues.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
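
For anyone reproducing the "aggressive client" run from the quoted message, or conversely throttling clients on a memory-constrained cluster as Greg suggests, the relevant knobs live on the client side. A rough sketch; the option names and the defaults noted in the comments are assumptions for this era (the second option may be spelled objecter_inflight_op_bytes rather than objecter_inflight_bytes), so verify against your release:

	[client]
	    ; client-side throttles on how much a single librados client
	    ; (e.g. one rados bench instance) keeps in flight before blocking
	    objecter inflight ops = 10240            ; assumed 10x a 1024-op default
	    objecter inflight op bytes = 1048576000  ; assumed 10x a 100 MB default

Turning these down rather than up is one way to keep 12 parallel rados bench instances from stacking up more dirty data than an 8GB OSD node can absorb.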