We have simulated the simultaneous crash of multiple OSDs in our environment. After starting all the cosd daemons again, we have the following situation:

2010-12-02 16:18:33.944436    pg v724432: 3712 pgs: 1 active, 3605 active+clean, 1 crashed+peering, 46 down+peering, 56 crashed+down+peering, 3 active+clean+inconsistent; 177 GB data, 365 GB used, 83437 GB / 83834 GB avail; 1/93704 degraded (0.001%)

When I set off an "rbd rm" command for one of our rbd volumes, it seems to hit the "crashed+down+peering" pg, and after that the command is stuck. When I turn on "debug ms = 1" on the client side I can see the following messages:

2010-12-02 16:24:06.364081 7f7f67c887e0 -- 10.255.0.130:0/7505 --> osd14 10.255.0.63:6804/2175 -- osd_op(client97429.0:3 rb.0.4b.000000000001 [delete] 3.2c79) v1 -- ?+0 0x191c720
2010-12-02 16:24:21.358572 7f7f66683710 -- 10.255.0.130:0/7505 --> mon1 10.255.0.21:6789/0 -- mon_subscribe({monmap=2+,osdmap=1730}) v1 -- ?+0 0x7f7f38000a90
2010-12-02 16:24:21.358633 7f7f66683710 -- 10.255.0.130:0/7505 --> osd14 10.255.0.63:6804/2175 -- ping v1 -- ?+0 0x7f7f38000d30
2010-12-02 16:24:21.359298 7f7f67c87710 -- 10.255.0.130:0/7505 <== mon1 10.255.0.21:6789/0 10 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (844714770 0 0) 0x7f7f58000d70
2010-12-02 16:24:26.358836 7f7f66683710 -- 10.255.0.130:0/7505 --> osd14 10.255.0.63:6804/2175 -- ping v1 -- ?+0 0x7f7f38000d30
[...]

The "ping v1" continues forever. When I turn on "debug ms = 1" and "debug osd = 10" on the cosd side, I can see which pg causes the problem:

2010-12-02 16:04:25.701914 7f024363c710 -- 10.255.0.63:6800/19612 <== client98231 10.255.0.130:0/7161 3 ==== osd_op(client98231.0:41 rb.0.4a.000000000027 [delete] 3.2dab) v1 ==== 128+0+0 (4170532710 0 0) 0x7f0154000e10
2010-12-02 16:04:25.701954 7f024363c710 osd12 1718 request for pool=3 (rbd) owner=0 perm=7 may_read=0 may_write=1 may_exec=0 require_exec_caps=0
2010-12-02 16:04:25.701971 7f024363c710 osd12 1718 pg[3.1ab( v 630'8247 (630'8241,630'8247] n=78 ec=2 les=814 1712/1712/1712) [12,10] r=0 lcod 0'0 mlcod 0'0 !hml crashed+down+peering] not active (yet)

What would be the correct way to resolve this situation?

Thanks,
Christian
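
P.S. For reference, the debug options mentioned above ("debug ms" and "debug osd") were turned on roughly like this. This is only a sketch of how we set it in our local ceph.conf; the exact section placement may differ in other setups:

    [client]
            ; client side, for the rbd command (illustrative placement)
            debug ms = 1

    [osd]
            ; cosd side (illustrative placement)
            debug ms = 1
            debug osd = 10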