Hi,
I'm running Ceph 0.56.2 on Fedora 17.
The cluster had been running with status HEALTH_OK, serving a single qemu-kvm
guest over rbd.
Suddenly it went into this state:
health HEALTH_WARN 4 pgs peering; 4 pgs stuck inactive; 16 pgs stuck unclean
monmap e1: 3 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.3:6789/0}, election epoch 16642, quorum 0,1,2 a,b,c
osdmap e3326: 12 osds: 12 up, 12 in
pgmap v1484051: 1018 pgs: 12 active, 1002 active+clean, 4 peering; 3603 GB data, 7407 GB used, 14090 GB / 22161 GB avail
All mons and osds have been running the whole time, so there have been no
crashes or anything similar.
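(I can double-check that from the monitors' point of view with, e.g.:

ceph osd tree

plus the osd process start times on each host, if that helps.)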
I ran ceph pg dump_stuck inactive:
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.10 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157129 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 18:13:50.672259 0'0 2013-02-06 18:13:50.672259
0.12 3494 0 0 0 14618128384 147147 147147 peering 2013-02-08 15:51:34.157225 3308'65211 3292'104335 [6,12] [6,12] 3308'48798 2013-02-06 19:04:27.344941 3308'32408 2013-02-02 07:33:58.549687
1.11 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157352 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 17:07:37.263531 0'0 2013-02-06 17:07:37.263531
4.e 6 0 0 0 25165824 4094 4094 peering 2013-02-08 15:51:34.157456 17'27 3292'1039 [6,12] [6,12] 17'27 2013-02-06 18:31:07.711374 17'27 2013-02-06 18:31:07.711374
I notice that all of them have [6,12] as their up/acting set, i.e. OSDs 6
and 12 in common.
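To see where peering is hanging, I was thinking of querying one of the
affected PGs, e.g. (assuming the pg query syntax applies to 0.56):

ceph pg 0.12 query

and looking at the peering/recovery state it reports to see which OSD it
is waiting on.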
The log for osd-12 has lines like this:
2013-02-08 21:07:42.812724 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=0 pgs=0 cs=0
l=0).accept connect_seq 20 vs existing 19 state standby
2013-02-08 21:11:03.945398 7fa0a2ded700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.3:6822/13426 pipe(0x7fa120031c60 sd=43 :6804 s=2 pgs=316 cs=13
l=0).fault with nothing to send, going to standby
2013-02-08 21:21:54.689385 7fa0a3bfb700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6807/10808 pipe(0x7fa10c7d07d0 sd=36 :41295 s=2 pgs=347 cs=43
l=0).fault, initiating reconnect
2013-02-08 21:22:42.951173 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=2 pgs=317 cs=21
l=0).fault with nothing to send, going to standby
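Those repeated pipe faults and reconnects look like the OSDs are having
trouble keeping their connections to each other. If more logging would
help, I could raise the debug levels on osd.12, e.g. (assuming this
injectargs syntax is still right for 0.56):

ceph osd tell 12 injectargs '--debug-ms 1 --debug-osd 20'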
The log for osd-6 has lines like this:
2013-02-08 21:58:56.322592 7ffebe2ab700 0 log [WRN] : 1 slow requests,
1 included below; oldest blocked for > 21968.585254 secs
2013-02-08 21:58:56.322599 7ffebe2ab700 0 log [WRN] : slow request
21968.585254 seconds old, received at 2013-02-08 15:52:47.737306:
osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read
2547712~4096] 0.b2bc4212) v4 currently reached pg
2013-02-08 21:58:57.322864 7ffebe2ab700 0 log [WRN] : 1 slow requests,
1 included below; oldest blocked for > 21969.585525 secs
2013-02-08 21:58:57.322871 7ffebe2ab700 0 log [WRN] : slow request
21969.585525 seconds old, received at 2013-02-08 15:52:47.737306:
osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read
2547712~4096] 0.b2bc4212) v4 currently reached pg
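The blocked request has been stuck at "currently reached pg" since
15:52:47, which matches the state_stamp on the stuck PGs above. If it
helps, I could also dump the ops in flight on osd.6 via its admin socket
(assuming the default socket path and that this command exists in 0.56.2):

ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_ops_in_flight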
What can I do to recover the system? (restart osds 6 and 12?)
How can I debug what caused this?
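If restarting is the way to go, I was planning on something like this on
the hosts carrying osd.6 and osd.12 (assuming the stock sysvinit script
is what applies here on Fedora 17):

service ceph restart osd.6
service ceph restart osd.12

or, less drastically, just marking them down so they re-peer:

ceph osd down 6
ceph osd down 12

but I would rather understand what is going on before poking at it.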
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@xxxxxxxxxxxxxxxxxxxx,
http://www.mermaidconsulting.com/