Hello,
I have a few OSDs in my cluster that are regularly crashing.
In their logs I can see:
osd.7
-1> 2016-10-06 08:09:18.869687 7ffaa037f700 -1 osd.7 pg_epoch:
128840 pg[5.3as0( v 84797'30080 (67219'27080,84797'30080]
local-les=128834 n=13146 ec=61149 les/c 128834/127358
128829/128829/128829) [7,109,4,0,62,32]/[7,109,32,0,62,39] r=0
lpr=128829 pi=127357-128828/12 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0
mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete:
inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
osd.32
-411> 2016-10-06 13:21:15.166968 7fe45b6cb700 -1 osd.32 pg_epoch:
129181 pg[5.3as2( v 84797'30080 (67219'27080,84797'30080]
local-les=129171 n=13146 ec=61149 les/c 129171/127358
129170/129170/129170)
[2147483647,2147483647,4,0,62,32]/[2147483647,2147483647,32,0,62,39] r=2
lpr=129170 pi=121260-129169/43 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0
mlcod 0'0 active+undersized+degraded+remapped+backfilling]
handle_recovery_read_complete: inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
osd.109
-1> 2016-10-06 13:17:36.748340 7fa53d36c700 -1 osd.109 pg_epoch:
129167 pg[5.3as1( v 84797'30080 (66310'24592,84797'30080]
local-les=129163 n=13146 ec=61149 les/c 129163/127358
129162/129162/129162)
[2147483647,109,4,0,62,32]/[2147483647,109,32,0,62,39] r=1 lpr=129162
pi=112552-129161/59 rops=5 bft=4(2),32(5) crt=84797'30076 lcod 0'0 mlcod
0'0 active+undersized+degraded+remapped+backfilling]
handle_recovery_read_complete: inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
Of course, having 3 OSDs dying regularly is not good for my health, so I
have set noout to avoid heavy recoveries.
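For reference, this is the only flag I have set so far:

  # keep the crashing OSDs from being marked out and triggering a full rebalance
  ceph osd set noout
  # 'ceph osd stat' should now list the noout flag in its output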
Googling this error message gives exactly one hit:
https://github.com/ceph/ceph/pull/6946
where it says: "the shard must be removed so it can be reconstructed".
But with my 3 OSDs failing, I am not certain which of them contains the
broken shard (or perhaps all 3 of them?), so I am a bit reluctant to
delete on all 3. I have 4+2 erasure coding (erasure size 6, min_size 4),
so finding out which one is bad would be nice.
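In case it helps: what I was planning to try (unless someone has a
better idea) is to compare the on-disk size of that object's shard on
each of the three OSDs. Something like the following, assuming FileStore
and that the shard directories match what the logs show (5.3as0 on
osd.7, 5.3as1 on osd.109, 5.3as2 on osd.32); the exact paths are my
guess, so please correct me if this is the wrong way to check:

  # on the host holding osd.7 (shard 0 of pg 5.3a):
  find /var/lib/ceph/osd/ceph-7/current/5.3as0_head \
      -name '*33640a238e1f29.000000000003b165*' -exec ls -l {} \;
  # repeat on osd.109 (ceph-109, 5.3as1_head) and osd.32 (ceph-32, 5.3as2_head)
  # and compare the file sizes to see which shard is the odd one out

If one of them clearly disagrees with the others, I assume that is the
shard the message wants removed (ceph-objectstore-tool looks like the
right tool for that), but I would rather hear from someone who has done
this before I delete anything.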
Hope someone has an idea of how to proceed.

Kind regards
Ronny Aasen