Still seeing scrub errors in .80.5

Hello fellow cephalopods,

Every deep scrub seems to dig up inconsistencies (i.e. scrub errors)
that we could use some help diagnosing.

I understand there was a data corruption issue before .80.3, so we made
sure that all the nodes were upgraded to .80.5 and all the daemons were
restarted (they all report .80.5 when queried via the admin socket).
*After* that we ran a deep scrub, which, as expected, found errors,
which we then repaired. But unfortunately, it's now a week later, and
the next deep scrub has dug up new errors, which I think shouldn't have
happened...?

ceph.log shows these errors in between the deep scrub messages:

2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR]
3.335 shard 2: soid
6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
3090820441 != known digest 3787996302
2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR]
3.335 shard 6: soid
6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
3259686791 != known digest 3787996302
2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR]
3.335 deep-scrub 0 missing, 1 inconsistent objects
2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR]
3.335 deep-scrub 2 errors


2014-09-15 08:57:45.340968 osd.1 10.10.10.53:6800/3553 1100 : [ERR]
3.28a shard 1: soid
f0d8268a/rbd_data.590142ae8944a.0000000000000699/head//3 digest
1680449797 != known digest 624976551
2014-09-15 08:57:45.340973 osd.1 10.10.10.53:6800/3553 1101 : [ERR]
3.28a shard 7: soid
f0d8268a/rbd_data.590142ae8944a.0000000000000699/head//3 digest
2880845882 != known digest 624976551
2014-09-15 08:57:50.666323 osd.1 10.10.10.53:6800/3553 1102 : [ERR]
3.28a deep-scrub 0 missing, 1 inconsistent objects
2014-09-15 08:57:50.666329 osd.1 10.10.10.53:6800/3553 1103 : [ERR]
3.28a deep-scrub 2 errors
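
In case it helps with diagnosing: since this is filestore, I assume the
replicas can also be compared by hand on the nodes holding the PG. A
sketch, assuming the default /var/lib/ceph data layout (osds 15, 6 and
2 hold 3.335 per the acting set; the name fragment comes from the soid
in the log above):

# Run on each node holding a replica of pg 3.335:
find /var/lib/ceph/osd/ceph-15/current/3.335_head \
    -name '*59e3c2ae8944a*06b1*' -exec md5sum {} \;
# Comparing the checksums across the three replicas should show
# which copy disagrees with the other two.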

Side question: why do these errors show the public-facing IPs of the
OSDs instead of the cluster network IPs? How much of the deep-scrub
traffic is taking place on the public network side of the OSDs then?
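
For context, our ceph.conf defines both networks along these lines (the
cluster subnet below is an illustrative placeholder, not our real one),
so I would have expected the cluster addresses to show up:

[global]
    public network  = 10.10.10.0/24
    # placeholder for our actual cluster subnet
    cluster network = 10.10.20.0/24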

Obviously we could have just repaired those as well, but getting fresh
scrub errors every week isn't all that appealing, which is why we have
left the cluster as it is for now, so that we can provide further
information if needed.

ceph health detail
HEALTH_ERR 8 pgs inconsistent; 14 scrub errors
pg 3.16c is active+clean+inconsistent, acting [6,4,1]
pg 3.125 is active+clean+inconsistent, acting [1,8,3]
pg 3.103 is active+clean+inconsistent, acting [8,15,4]
pg 3.33 is active+clean+inconsistent, acting [3,10,8]
pg 3.37e is active+clean+inconsistent, acting [10,4,15]
pg 3.335 is active+clean+inconsistent, acting [15,6,2]
pg 3.28a is active+clean+inconsistent, acting [1,8,7]
pg 3.185 is active+clean+inconsistent, acting [6,1,4]
14 scrub errors
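
For when we do decide to repair, I assume something like this untested
sketch would walk all the inconsistent PGs in one go:

# Untested: pull the inconsistent PG ids out of `ceph health detail`
# and issue a repair for each.
ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' |
while read pgid; do
    ceph pg repair "$pgid"
done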

Any input on this?

Thanks in advance,
Marc

