Re: 12.2.6 CRC errors


 



In the meantime I upgraded the cluster to 12.2.7 and added the osd distrust data digest = true setting in ceph.conf, because it is a mixed filestore/bluestore cluster.

But I still see a constantly growing number of inconsistent PGs and scrub errors. If I check the running ceph config with ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show, I can't find this setting.
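For reference, this is roughly how I have been checking and injecting it (assuming the option is spelled osd_distrust_data_digest; underscores and spaces should be interchangeable in ceph.conf):

    # show the value the running OSD actually uses
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_distrust_data_digest

    # push the value to all OSDs at runtime (some options still need a restart to take effect)
    ceph tell osd.* injectargs '--osd_distrust_data_digest=true'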

Did I do something wrong? Is it safe to go ahead with the migration to bluestore?

Stefan



We are in the process of building the 12.2.7 release now, which will fix
this.  (If you don't want to wait you can also install the autobuilt
packages from shaman.ceph.com... official packages are only a few hours
away from being ready though.)

I would pause data migration for the time being (set the norebalance flag).
Once the new version is installed it will stop creating the CRC mismatches
and it will prevent them from triggering an incorrect EIO on read.  However,
scrub doesn't repair them yet.  They will tend to go away on their own
as normal IO touches the affected objects.  In 12.2.8 scrub will repair
the CRCs.
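Pausing the migration is just the usual flag; something like:

    ceph osd set norebalance      # pause data movement while on the affected version
    # ... upgrade to 12.2.7 ...
    ceph osd unset norebalance    # let the filestore -> bluestore migration continue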

In the meantime, while waiting for the fix, you can set
osd_skip_data_digest = true to avoid generating more errors.  But note
that once you upgrade you need to turn that back off (or switch to
osd_distrust_data_digest = true on a mixed cluster) so that the
fix/workaround applies.
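A rough sketch of that sequence (option names as above; treat this as illustrative and double-check it against the release notes):

    # still on 12.2.6: stop generating new whole-object digest errors
    ceph tell osd.* injectargs '--osd_skip_data_digest=true'

    # after upgrading to 12.2.7 on a mixed filestore/bluestore cluster
    ceph tell osd.* injectargs '--osd_skip_data_digest=false --osd_distrust_data_digest=true'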

You'll want to read the 12.2.7 release notes carefully (PR at
https://github.com/ceph/ceph/pull/23057).

The bug doesn't corrupt data; only the whole-object checksums. However, some reads (when the entire object is read) will see the bad checksum and
return EIO.  This could break applications at a higher layer (although
hopefully they will just abort and exit cleanly; it is hard to tell given
the breadth of workloads).

I hope that helps, and I'm very sorry this regression crept in!
sage


On Mon, 16 Jul 2018, Stefan Schneebeli wrote:

hello guys,

Unfortunately I missed the warning on Friday and upgraded my cluster to
12.2.6 on Saturday.
The cluster is in the middle of a migration from filestore to bluestore (10/2),
and I constantly get inconsistent PGs, but only on the two bluestore OSDs.
If I run, for example, rados list-inconsistent-obj 2.17 --format=json-pretty,
I see these mismatches at the end:

            "shards": [
                {
                    "osd": 0,
                    "primary": true,
                    "errors": [],
                    "size": 4194304,
                    "omap_digest": "0xffffffff"
                },
                {
                    "osd": 1,
                    "primary": false,
                    "errors": [
                        "data_digest_mismatch_info"
                    ],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x21b21973"

Is this the issue you are talking about?
I can repair these PGs with ceph pg repair and it reports that the error is fixed.
But is it really fixed?
Do I have to be afraid that I now have corrupted data?
Would it be an option to set noout on these bluestore OSDs and stop them?
When do you expect the new 12.2.7 release? Will it fix all the errors?
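For reference, these are roughly the commands I have been using to track and repair the inconsistencies (the pool name and PG id are just examples from my cluster):

    ceph health detail | grep inconsistent            # which PGs are currently flagged
    rados list-inconsistent-pg rbd                    # inconsistent PGs in one pool
    rados list-inconsistent-obj 2.17 --format=json-pretty
    ceph pg repair 2.17                               # ask the primary to repair the PG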

Thank you in advance for your answers!

Stefan





------ Original Message ------
From: "Sage Weil" <sage@xxxxxxxxxxxx>
To: "Glen Baars" <glen@xxxxxxxxxxxxxxxxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Sent: 14.07.2018 19:15:57
Subject: Re: 12.2.6 CRC errors

> On Sat, 14 Jul 2018, Glen Baars wrote:
> > Hello Ceph users!
> >
> > Note to users, don't install new servers on Friday the 13th!
> >
> > We added a new ceph node on Friday and it has received the latest 12.2.6
> > update. I started to see CRC errors and investigated hardware issues. I
> > have since found that it is caused by the 12.2.6 release. About 80TB
> > copied onto this server.
> >
> > I have set noout,noscrub,nodeepscrub and repaired the affected PGs (
> > ceph pg repair ) . This has cleared the errors.
> >
> > ***** no idea if this is a good way to fix the issue. From the bug
> > report this issue is in the deepscrub and therefore I suppose stopping
> > it will limit the issues. ***
> >
> > Can anyone tell me what to do? Downgrading doesn't seem like it will fix
> > the issue. Maybe remove this node, rebuild it with 12.2.5 and resync the data?
> > Wait a few days for 12.2.7?
>
> I would sit tight for now.  I'm working on the right fix and hope to
> having something to test shortly, and possibly a release by tomorrow.
>
> There is a remaining danger that, for objects with bad full-object
> digests, a read of the entire object will throw an EIO.  It's up
> to you whether you want to try to quiesce workloads to avoid that (to
> prevent corruption at higher layers) or to avoid a service
> degradation/outage. :( Unfortunately I don't have super precise guidance
> as far as how likely that is.
>
> Are you using bluestore only, or is it a mix of bluestore and filestore?
>
> sage
>
>










