Hi,
I'm running Ceph 0.56.2 on Fedora 17.
The cluster had been running with status HEALTH_OK, serving a single qemu-kvm
guest over rbd.
Suddenly it went into this state:
health HEALTH_WARN 4 pgs peering; 4 pgs stuck inactive; 16 pgs stuck unclean
monmap e1: 3 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.3:6789/0}, election epoch 16642, quorum 0,1,2 a,b,c
osdmap e3326: 12 osds: 12 up, 12 in
pgmap v1484051: 1018 pgs: 12 active, 1002 active+clean, 4 peering; 3603 GB data, 7407 GB used, 14090 GB / 22161 GB avail
All mons and osds have been running the whole time, so there have been no
crashes or anything similar.
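(I can double-check that from the monitors' point of view with, e.g.:

ceph osd tree

plus the osd process start times on each host, if that helps.)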
I ran ceph pg dump_stuck inactive:
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.10 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157129 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 18:13:50.672259 0'0 2013-02-06 18:13:50.672259
0.12 3494 0 0 0 14618128384 147147 147147 peering 2013-02-08 15:51:34.157225 3308'65211 3292'104335 [6,12] [6,12] 3308'48798 2013-02-06 19:04:27.344941 3308'32408 2013-02-02 07:33:58.549687
1.11 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157352 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 17:07:37.263531 0'0 2013-02-06 17:07:37.263531
4.e 6 0 0 0 25165824 4094 4094 peering 2013-02-08 15:51:34.157456 17'27 3292'1039 [6,12] [6,12] 17'27 2013-02-06 18:31:07.711374 17'27 2013-02-06 18:31:07.711374
I notice that all of them have [6,12] as their up/acting set, i.e. OSDs 6
and 12 in common.
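To see where peering is hanging, I was thinking of querying one of the
affected PGs, e.g. (assuming the pg query syntax applies to 0.56):

ceph pg 0.12 query

and looking at the peering/recovery state it reports to see which OSD it
is waiting on.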
The log for osd-12 has lines like this:
2013-02-08 21:07:42.812724 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=0 pgs=0 cs=0
l=0).accept connect_seq 20 vs existing 19 state standby
2013-02-08 21:11:03.945398 7fa0a2ded700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.3:6822/13426 pipe(0x7fa120031c60 sd=43 :6804 s=2 pgs=316 cs=13
l=0).fault with nothing to send, going to standby
2013-02-08 21:21:54.689385 7fa0a3bfb700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6807/10808 pipe(0x7fa10c7d07d0 sd=36 :41295 s=2 pgs=347 cs=43
l=0).fault, initiating reconnect
2013-02-08 21:22:42.951173 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=2 pgs=317 cs=21
l=0).fault with nothing to send, going to standby
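Those repeated pipe faults and reconnects look like the OSDs are having
trouble keeping their connections to each other. If more logging would
help, I could raise the debug levels on osd.12, e.g. (assuming this
injectargs syntax is still right for 0.56):

ceph osd tell 12 injectargs '--debug-ms 1 --debug-osd 20'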
The log for osd-6 has lines like this:
2013-02-08 21:58:56.322592 7ffebe2ab700 0 log [WRN] : 1 slow requests,
1 included below; oldest blocked for > 21968.585254 secs
2013-02-08 21:58:56.322599 7ffebe2ab700 0 log [WRN] : slow request
21968.585254 seconds old, received at 2013-02-08 15:52:47.737306:
osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read
2547712~4096] 0.b2bc4212) v4 currently reached pg
2013-02-08 21:58:57.322864 7ffebe2ab700 0 log [WRN] : 1 slow requests,
1 included below; oldest blocked for > 21969.585525 secs
2013-02-08 21:58:57.322871 7ffebe2ab700 0 log [WRN] : slow request
21969.585525 seconds old, received at 2013-02-08 15:52:47.737306:
osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read
2547712~4096] 0.b2bc4212) v4 currently reached pg
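The blocked request has been stuck at "currently reached pg" since
15:52:47, which matches the state_stamp on the stuck PGs above. If it
helps, I could also dump the ops in flight on osd.6 via its admin socket
(assuming the default socket path and that this command exists in 0.56.2):

ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_ops_in_flight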
What can I do to recover the system? (restart osds 6 and 12?)
How can I debug what caused this?
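If restarting is the way to go, I was planning on something like this on
the hosts carrying osd.6 and osd.12 (assuming the stock sysvinit script
is what applies here on Fedora 17):

service ceph restart osd.6
service ceph restart osd.12

or, less drastically, just marking them down so they re-peer:

ceph osd down 6
ceph osd down 12

but I would rather understand what is going on before poking at it.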
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@xxxxxxxxxxxxxxxxxxxx,
http://www.mermaidconsulting.com/