PGs stuck peering

Hi,

I'm running Ceph 0.56.2 on Fedora 17.

The system had been running with status HEALTH_OK, serving a single qemu-kvm guest over rbd.

Suddenly the system went into this state:

health HEALTH_WARN 4 pgs peering; 4 pgs stuck inactive; 16 pgs stuck unclean
monmap e1: 3 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.3:6789/0}, election epoch 16642, quorum 0,1,2 a,b,c
osdmap e3326: 12 osds: 12 up, 12 in
pgmap v1484051: 1018 pgs: 12 active, 1002 active+clean, 4 peering; 3603 GB data, 7407 GB used, 14090 GB / 22161 GB avail

All mons and OSDs have been running the whole time, so there were no crashes or anything similar.

I ran ceph pg dump_stuck inactive:

pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.10 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157129 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 18:13:50.672259 0'0 2013-02-06 18:13:50.672259
0.12 3494 0 0 0 14618128384 147147 147147 peering 2013-02-08 15:51:34.157225 3308'65211 3292'104335 [6,12] [6,12] 3308'48798 2013-02-06 19:04:27.344941 3308'32408 2013-02-02 07:33:58.549687
1.11 0 0 0 0 0 0 0 peering 2013-02-08 15:51:34.157352 0'0 3292'1020 [6,12] [6,12] 0'0 2013-02-06 17:07:37.263531 0'0 2013-02-06 17:07:37.263531
4.e 6 0 0 0 25165824 4094 4094 peering 2013-02-08 15:51:34.157456 17'27 3292'1039 [6,12] [6,12] 17'27 2013-02-06 18:31:07.711374 17'27 2013-02-06 18:31:07.711374

I notice that they all have [6,12] in common.
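
If it would help, I can also post the full peering state of one of the stuck PGs, e.g. (assuming pg query is the right way to get that on 0.56):

ceph pg 0.12 query

I haven't attached the output here, since it's fairly long.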

The log for osd-12 has lines like this:

2013-02-08 21:07:42.812724 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >> 10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=0 pgs=0 cs=0 l=0).accept connect_seq 20 vs existing 19 state standby
2013-02-08 21:11:03.945398 7fa0a2ded700 0 -- 10.0.0.1:6804/29923 >> 10.0.0.3:6822/13426 pipe(0x7fa120031c60 sd=43 :6804 s=2 pgs=316 cs=13 l=0).fault with nothing to send, going to standby
2013-02-08 21:21:54.689385 7fa0a3bfb700 0 -- 10.0.0.1:6804/29923 >> 10.0.0.2:6807/10808 pipe(0x7fa10c7d07d0 sd=36 :41295 s=2 pgs=347 cs=43 l=0).fault, initiating reconnect
2013-02-08 21:22:42.951173 7fa1377b9700 0 -- 10.0.0.1:6804/29923 >> 10.0.0.2:6819/11582 pipe(0x7fa120019c70 sd=38 :6804 s=2 pgs=317 cs=21 l=0).fault with nothing to send, going to standby
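
If it would help with debugging, I could turn up messenger and OSD logging on osd.12 to see more of what those pipes are doing, roughly like this (I'm not sure these are the most useful debug levels):

ceph tell osd.12 injectargs '--debug-ms 1 --debug-osd 20'

and then revert it once I've captured a bit of log.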


The log for osd-6 has lines like this:

2013-02-08 21:58:56.322592 7ffebe2ab700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 21968.585254 secs
2013-02-08 21:58:56.322599 7ffebe2ab700 0 log [WRN] : slow request 21968.585254 seconds old, received at 2013-02-08 15:52:47.737306: osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read 2547712~4096] 0.b2bc4212) v4 currently reached pg
2013-02-08 21:58:57.322864 7ffebe2ab700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 21969.585525 secs
2013-02-08 21:58:57.322871 7ffebe2ab700 0 log [WRN] : slow request 21969.585525 seconds old, received at 2013-02-08 15:52:47.737306: osd_op(client.7867.0:38360567 rb.0.11b7.4a933baa.00000008279e [read 2547712~4096] 0.b2bc4212) v4 currently reached pg
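
I can also check what osd.6 thinks it is doing with that blocked request via its admin socket, something like this (assuming dump_ops_in_flight is available in 0.56 and the socket is in the default location on that node):

ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok dump_ops_in_flight

which I expect should show the blocked op and what stage it is stuck in.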


What can I do to recover the system? (Restart OSDs 6 and 12?)

How can I debug what caused this?
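
In case it matters, this is how I would go about restarting them on Fedora 17 with the sysvinit scripts, run on the nodes hosting those OSDs (just a sketch of my plan; I don't know whether restarting only osd.12 would be enough, or whether it's safe while the PGs are peering):

service ceph restart osd.12
service ceph restart osd.6

Alternatively, would marking one of them down with "ceph osd down 12" and letting it rejoin be a gentler way to force re-peering?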


--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@xxxxxxxxxxxxxxxxxxxx,
http://www.mermaidconsulting.com/


