Hello,
I have a few OSDs in my cluster that are regularly crashing.
In their logs I can see:
osd.7
-1> 2016-10-06 08:09:18.869687 7ffaa037f700 -1 osd.7 pg_epoch:
128840 pg[5.3as0( v 84797'30080 (67219'27080,84797'30080]
local-les=128834 n=13146 ec=61149 les/c 128834/127358
128829/128829/128829) [7,109,4,0,62,32]/[7,109,32,0,62,39] r=0
lpr=128829 pi=127357-128828/12 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0
mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete:
inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
osd.32
-411> 2016-10-06 13:21:15.166968 7fe45b6cb700 -1 osd.32 pg_epoch:
129181 pg[5.3as2( v 84797'30080 (67219'27080,84797'30080]
local-les=129171 n=13146 ec=61149 les/c 129171/127358
129170/129170/129170)
[2147483647,2147483647,4,0,62,32]/[2147483647,2147483647,32,0,62,39] r=2
lpr=129170 pi=121260-129169/43 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0
mlcod 0'0 active+undersized+degraded+remapped+backfilling]
handle_recovery_read_complete: inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
osd.109
-1> 2016-10-06 13:17:36.748340 7fa53d36c700 -1 osd.109 pg_epoch:
129167 pg[5.3as1( v 84797'30080 (66310'24592,84797'30080]
local-les=129163 n=13146 ec=61149 les/c 129163/127358
129162/129162/129162)
[2147483647,109,4,0,62,32]/[2147483647,109,32,0,62,39] r=1 lpr=129162
pi=112552-129161/59 rops=5 bft=4(2),32(5) crt=84797'30076 lcod 0'0 mlcod
0'0 active+undersized+degraded+remapped+backfilling]
handle_recovery_read_complete: inconsistent shard sizes
5/abc6d43a/rbd_data.33640a238e1f29.000000000003b165/head the offending
shard must be manually removed after verifying there are enough shards
to recover (0, 8388608, [32(2),0, 39(5),0])
Of course, having 3 OSDs dying regularly is not good for my health, so I
have set noout to avoid heavy recoveries.
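For reference, this is the only flag I have set so far:

  # keep the crashing OSDs from being marked out and triggering a full rebalance
  ceph osd set noout
  # 'ceph osd stat' should now list the noout flag in its output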
Googling this error message gives exactly one hit:
https://github.com/ceph/ceph/pull/6946
where it says: "the shard must be removed so it can be reconstructed".
But with my 3 OSDs failing, I am not certain which of them contains the
broken shard (or perhaps all 3 of them?), so I am a bit reluctant to
delete on all 3. I have 4+2 erasure coding (erasure size 6, min_size 4),
so finding out which one is bad would be nice.
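In case it helps: what I was planning to try (unless someone has a
better idea) is to compare the on-disk size of that object's shard on
each of the three OSDs. Something like the following, assuming FileStore
and that the shard directories match what the logs show (5.3as0 on
osd.7, 5.3as1 on osd.109, 5.3as2 on osd.32); the exact paths are my
guess, so please correct me if this is the wrong way to check:

  # on the host holding osd.7 (shard 0 of pg 5.3a):
  find /var/lib/ceph/osd/ceph-7/current/5.3as0_head \
      -name '*33640a238e1f29.000000000003b165*' -exec ls -l {} \;
  # repeat on osd.109 (ceph-109, 5.3as1_head) and osd.32 (ceph-32, 5.3as2_head)
  # and compare the file sizes to see which shard is the odd one out

If one of them clearly disagrees with the others, I assume that is the
shard the message wants removed (ceph-objectstore-tool looks like the
right tool for that), but I would rather hear from someone who has done
this before I delete anything.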
Hope someone has an idea of how to proceed.

Kind regards
Ronny Aasen