Re: OSD state flipping when cluster-network in high utilization

Hi Sage,
	I can reproduce it with logging enabled, but the log is quite big (30MB+); where should I put it?
	From the log and from reading the code, it seems that, for some as-yet-unknown reason, the OSD op thread (OSD::op_tp) gets blocked for more than 15s, which trips the timeout. Then, in handle_osd_ping, the ping request is dropped because is_healthy() returns false.
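	To check my understanding of that path, I put together a small self-contained toy model (this is NOT the actual Ceph code, just a sketch of the mechanism as I read it; the names and the 15s value are only illustrative):

    // Toy model of the op_tp heartbeat / ping-drop interaction (not Ceph code).
    // A worker thread arms a deadline before doing work; the health check fails
    // once the deadline passes; the ping handler drops pings while unhealthy.
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    struct HeartbeatHandle {
      std::atomic<Clock::rep> deadline{0};            // 0 = not inside a timed section

      void reset(std::chrono::seconds grace) {        // arm/extend the per-op timeout
        deadline = std::chrono::duration_cast<Clock::duration>(
            (Clock::now() + grace).time_since_epoch()).count();
      }
      void clear() { deadline = 0; }                  // op finished
      bool healthy() const {
        Clock::rep d = deadline.load();
        return d == 0 || Clock::now().time_since_epoch().count() < d;
      }
    };

    // Spirit of handle_osd_ping: only answer if the internal map says healthy.
    bool handle_ping(const HeartbeatHandle& op_thread) {
      if (!op_thread.healthy()) {
        std::puts("internal heartbeat not healthy, dropping ping request");
        return false;                                 // the peer counts a missed ping
      }
      std::puts("ping reply sent");
      return true;
    }

    int main() {
      HeartbeatHandle op_thread;

      op_thread.reset(std::chrono::seconds(15));      // op starts, 15s budget
      handle_ping(op_thread);                         // healthy -> reply

      op_thread.reset(std::chrono::seconds(0));       // pretend the op overran its budget
      std::this_thread::sleep_for(std::chrono::milliseconds(10));
      handle_ping(op_thread);                         // unhealthy -> dropped

      op_thread.clear();                              // op finally completes
      handle_ping(op_thread);                         // healthy again -> reply
      return 0;
    }

	If that model is right, any single op that blocks op_tp past its 15s budget will silently eat incoming pings, even though the heartbeat thread itself and the network are fine.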
	I have a full log with debug osd = 20, but still have no idea why the op thread is blocked for that long. Part of the log is pasted below:

2013-05-15 15:29:06.870834 7f0253340700 10 osd.319 4113 dequeue_op 0xf4821e0 prio 63 cost 4096 latency 0.000173 osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 pg pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean]
2013-05-15 15:29:06.870862 7f0253340700 20 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] op_has_sufficient_caps pool=3 (xiaoxi_rbd) owner=0 need_read_cap=0 need_write_cap=1 need_class_read_cap=0 need_class_write_cap=0 -> yes
2013-05-15 15:29:06.870871 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op osd_op(client.4703.0:2114483 rbd_data.11ff6b8b4567.0000000000000339 [write 1605632~4096] 3.3e4c40d7) v4 may_write
2013-05-15 15:29:22.117110 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.127345 7f0269b6d700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.276280 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.356731 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.359808 7f025d354700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f0253340700' had timed out after 15
2013-05-15 15:29:22.513247 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_snapset_context rbd_data.11ff6b8b4567.0000000000000339 0 -> 1
2013-05-15 15:29:22.513270 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] populate_obc_watchers 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3
2013-05-15 15:29:22.513276 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] get_object_context 0xd896840 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 0 -> 1 read 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3(4103'5267 client.4703.0:2109755 wrlock_by=unknown.0.0:0)
2013-05-15 15:29:22.513284 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] find_object_context 3e4c40d7/rbd_data.11ff6b8b4567.0000000000000339/head//3 @head
2013-05-15 15:29:22.513290 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode is idle(wr=0)
2013-05-15 15:29:22.513295 7f0253340700 10 osd.319 pg_epoch: 4113 pg[3.d7( v 4103'5330 (3853'4329,4103'5330] local-les=4092 n=154 ec=100 les/c 4092/4093 4091/4091/4034) [319,46] r=0 lpr=4091 mlcod 4103'5329 active+clean] do_op mode now rmw(wr=0)
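
	As a stop-gap while we keep digging, we are thinking about relaxing the timeouts and keeping the debug logging on. Something like the following in ceph.conf is what I have in mind (option names and defaults are as I understand them from the docs, so please correct me if any of these are wrong):

    [osd]
        # the op_tp warning above fires at 15s; give slow ops a bit more room
        osd op thread timeout = 30
        # peers report an OSD down after ~20s of missed pings by default
        osd heartbeat grace = 35
        # logging used to capture the trace above
        debug ms = 1
        debug osd = 20

	Of course this only hides the symptom; it doesn't explain why the op thread blocks for 15s+ in the first place.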

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxx] 
Sent: May 15, 2013 11:40
To: Chen, Xiaoxi
Cc: Mark Nelson; ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxx
Subject: RE:  OSD state flipping when cluster-network in high utilization

On Wed, 15 May 2013, Chen, Xiaoxi wrote:

> >How responsive generally is the machine under load?  Is there available CPU?
> 	The machine is generally responsive, and the affected OSDs tend to be the same ones, seemingly because they have relatively slower disks (the disk type is the same, but latency is a bit higher, 8ms -> 10ms).
> 	
> 	Top shows no idle %, but still 30%+ io_wait; my colleague tells me that io_wait can be treated as free CPU.
> 
> Another piece of information: offloading the heartbeat to a 1Gb NIC doesn't 
> solve the problem; what's more, when we run a random write test we can 
> still see this flipping happen. So I would say it may be related to the 
> CPU scheduler: the heartbeat thread (in a busy OSD) fails to get 
> enough CPU cycles.

Can you try reproducing with logs?  If we can reproduce it, that will give us some clues.

sage

> 
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx 
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: May 15, 2013 7:23
> To: Chen, Xiaoxi
> Cc: Mark Nelson; ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxx
> Subject: Re:  OSD state flipping when cluster-network in 
> high utilization
> 
> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > I like the idea of leaving the ping on the cluster network, because it can 
> > help us detect switch/NIC failures.
> > 
> > What confuses me is that I keep pinging every Ceph node's cluster IP, and it 
> > is fine during the whole run with less than 1 ms latency, so why does the 
> > heartbeat still suffer? TOP shows my CPU is not 100% utilized (with ~30% 
> > io wait). Enabling jumbo frames **seems** to make things worse (just a 
> > feeling, no data to support it).
> 
> I say "ping" in teh general sense.. it's not using ICMP, but sending small messages over a TCP session, doing some minimal processing on the other end, and sending them back.  If the machine is heavily loaded and that thread doesn't get scheduled or somehow blocks, it may be problematic.
> 
> How responsive generally is the machine under load?  Is there available CPU?
> 
> We can try to enable debugging to see what is going on.. 'debug ms = 1' 
> and 'debug osd = 20' is everything we would need, but will incur additional load itself and may spoil the experiment...
> 
> sage
> 
> > 
> > Sent from my iPhone
> > 
> > On 2013-5-14, at 23:36, "Mark Nelson" <mark.nelson@xxxxxxxxxxx> wrote:
> > 
> > > On 05/14/2013 10:30 AM, Sage Weil wrote:
> > >> On Tue, 14 May 2013, Chen, Xiaoxi wrote:
> > >>> 
> > >>> Hi
> > >>> 
> > >>>   We are suffering from our OSDs flipping between up and down (OSD X 
> > >>> gets voted down due to 3 missed pings, and after a while it 
> > >>> tells the monitor "map xxx wrongly marked me down"). This is 
> > >>> happening while we run a sequential write performance test on top 
> > >>> of RBDs, and the cluster network NICs are heavily utilized (8Gb/s+ on a 10Gb network).
> > >>> 
> > >>>          Is this expected behavior? Or how can I prevent it from happening?
> > >> 
> > >> You can increase the heartbeat grace period.  The pings are 
> > >> handled by a separate thread on the backside interface (if there is one).
> > >> If you are missing pings then the network or scheduler is 
> > >> preventing those (small) messages from being processed (there is 
> > >> almost no lock contention in that path).  Which means it really 
> > >> is taking ~20 seconds or whatever to handle those messages.  It's 
> > >> really a question of how unresponsive you want to permit the OSDs to be before you consider it a failure..
> > >> 
> > >> sage
> > >> 
> > >> 
> > > 
> > > It might be worth testing out how long pings or other network traffic are taking during these tests.  There may be some TCP tuning you can do here, or even consider using a separate network for the mons.
> > > 
> > > Mark
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




