Still seeing scrub errors in .80.5

greg@xxxxxxxxxxx (Gregory Farnum) · Tue, 16 Sep 2014 11:09:22 -0700

On Tue, Sep 16, 2014 at 12:03 AM, Marc <mail at shoowin.de> wrote:
> Hello fellow cephalopods,
>
> every deep scrub seems to dig up inconsistencies (i.e. scrub errors)
> that we could use some help with diagnosing.
>
> I understand there used to be a data corruption issue before .80.3 so we
> made sure that all the nodes were upgraded to .80.5 and all the daemons
> were restarted (they all report .80.5 when contacted via socket).
> *After* that we ran a deep scrub, which obviously found errors, which we
> then repaired. But unfortunately, it's now a week later, and the next
> deep scrub has dug up new errors, which shouldn't have happened I think...?
>
> ceph.log shows these errors in between the deep scrub messages:
>
> 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR]
> 3.335 shard 2: soid
> 6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
> 3090820441 != known digest 3787996302
> 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR]
> 3.335 shard 6: soid
> 6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
> 3259686791 != known digest 3787996302
> 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR]
> 3.335 deep-scrub 0 missing, 1 inconsistent objects
> 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR]
> 3.335 deep-scrub 2 errors

Uh, I'm afraid those errors were never output as a result of bugs in
Firefly. These are indicating actual data differences between the
nodes, whereas the Firefly issue was a metadata flag that wasn't
handled properly in mixed-version OSD clusters.
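
A rough sketch of how one might pull those digest mismatches out of
ceph.log and tally them per PG and shard, to see whether the same
shards keep turning up across scrubs. It assumes each log entry sits on
a single line (the lines quoted above are wrapped by the mail client)
and that the message format matches the one quoted; the function names
are made up for illustration and are not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches entries of the form quoted above:
#   [ERR] <pg> shard <n>: soid <object> digest <a> != known digest <b>
# Assumption: one log entry per line, in exactly this format.
MISMATCH = re.compile(
    r"\[ERR\]\s+(?P<pg>\S+) shard (?P<shard>\d+): soid (?P<soid>\S+) "
    r"digest (?P<digest>\d+) != known digest (?P<known>\d+)"
)


def parse_mismatches(lines):
    """Return one dict per digest-mismatch line; other lines are skipped."""
    found = []
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            found.append(match.groupdict())
    return found


def count_by_shard(mismatches):
    """Count mismatches per (pg, shard) pair."""
    return Counter((m["pg"], m["shard"]) for m in mismatches)


# The entries from the quoted log, unwrapped to one line each.
sample = [
    "2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR] "
    "3.335 shard 2: soid "
    "6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest "
    "3090820441 != known digest 3787996302",
    "2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR] "
    "3.335 shard 6: soid "
    "6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest "
    "3259686791 != known digest 3787996302",
    "2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR] "
    "3.335 deep-scrub 2 errors",
]

found = parse_mismatches(sample)
for (pg, shard), n in sorted(count_by_shard(found).items()):
    print(f"pg {pg} shard {shard}: {n} digest mismatch(es)")
```

Run over the logs from several scrub cycles, a tally like this would
show whether the mismatches cluster on particular shards or are spread
evenly.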

I don't think Ceph has ever had a bug that would change the data
payload between OSDs. Searching the tracker logs, the only entries
with this error message turned out to be cases where:
1) The local filesystem is misbehaving under the workload we give
it (and there are no known filesystem issues exposed by running
Firefly OSDs in the default config that I can think of, certainly
none with this error)
2) The disks themselves are bad.

:/

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com