We ran a for-loop to tell all the OSDs to deep scrub (since the * wildcard
still doesn't work) after the upgrade. The deep scrub this week that
produced these errors is the weekly scheduled one though.

I shall go investigate the mentioned thread...

On 16/09/2014 20:36, Gregory Farnum wrote:
> Ah, you're right -- it wasn't popping up in the same searches and I'd
> forgotten that was so recent.
>
> In that case, did you actually deep scrub *everything* in the cluster,
> Marc? You'll need to run and fix every PG in the cluster, and the
> background deep scrubbing doesn't move through the data very quickly.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Sep 16, 2014 at 11:32 AM, Dan Van Der Ster
> <daniel.vanderster at cern.ch> wrote:
>> Hi Greg,
>> I believe Marc is referring to the corruption triggered by set_extsize on
>> xfs. That option was disabled by default in 0.80.4... See the thread
>> "firefly scrub error".
>> Cheers,
>> Dan
>>
>>
>>
>> From: Gregory Farnum <greg at inktank.com>
>> Sent: Sep 16, 2014 8:15 PM
>> To: Marc
>> Cc: ceph-users at lists.ceph.com
>> Subject: Re: Still seeing scrub errors in .80.5
>>
>> On Tue, Sep 16, 2014 at 12:03 AM, Marc <mail at shoowin.de> wrote:
>>> Hello fellow cephalopods,
>>>
>>> every deep scrub seems to dig up inconsistencies (i.e. scrub errors)
>>> that we could use some help with diagnosing.
>>>
>>> I understand there used to be a data corruption issue before .80.3, so we
>>> made sure that all the nodes were upgraded to .80.5 and all the daemons
>>> were restarted (they all report .80.5 when contacted via socket).
>>> *After* that we ran a deep scrub, which obviously found errors, which we
>>> then repaired. But unfortunately, it's now a week later, and the next
>>> deep scrub has dug up new errors, which shouldn't have happened, I
>>> think...?
>>>
>>> ceph.log shows these errors in between the deep scrub messages:
>>>
>>> 2014-09-15 07:56:23.164818 osd.15 10.10.10.55:6804/23853 364 : [ERR]
>>> 3.335 shard 2: soid
>>> 6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
>>> 3090820441 != known digest 3787996302
>>> 2014-09-15 07:56:23.164827 osd.15 10.10.10.55:6804/23853 365 : [ERR]
>>> 3.335 shard 6: soid
>>> 6ba68735/rbd_data.59e3c2ae8944a.00000000000006b1/head//3 digest
>>> 3259686791 != known digest 3787996302
>>> 2014-09-15 07:56:28.485713 osd.15 10.10.10.55:6804/23853 366 : [ERR]
>>> 3.335 deep-scrub 0 missing, 1 inconsistent objects
>>> 2014-09-15 07:56:28.485734 osd.15 10.10.10.55:6804/23853 367 : [ERR]
>>> 3.335 deep-scrub 2 errors
>>
>> Uh, I'm afraid those errors were never output as a result of bugs in
>> Firefly. These are indicating actual data differences between the
>> nodes, whereas the Firefly issue was a metadata flag that wasn't
>> handled properly in mixed-version OSD clusters.
>>
>> I don't think Ceph has ever had a bug that would change the data
>> payload between OSDs. Searching the tracker logs, the only entries
>> with this error message come down to one of two causes:
>> 1) The local filesystem is misbehaving under the workload we give
>> it (and there are no known filesystem issues that are exposed by
>> running firefly OSDs in default config that I can think of -- certainly
>> none with this error).
>> 2) The disks themselves are bad.
>>
>> :/
>>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
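
For reference, a minimal sketch of the kind of loop described at the top of
this message, assuming the stock ceph CLI subcommands (ceph osd ls, ceph osd
deep-scrub, ceph pg repair); the exact commands used on the cluster above may
have differed:

    # Ask every OSD to start a deep scrub, since the * wildcard form
    # was not accepted on this release.
    for osd in $(ceph osd ls); do
        ceph osd deep-scrub "$osd"
    done

    # Once the scrubs have finished, "ceph health detail" lists the
    # inconsistent PGs, which can then be repaired one at a time,
    # e.g. for the PG from the log excerpt above:
    ceph pg repair 3.335

Repair only resolves the mismatch that the deep scrub reported; if the
underlying filesystem or disk is at fault, new scrub errors will keep turning
up, which is exactly the pattern described in this thread.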