Re: 12.2.6 CRC errors

Sage Weil <sage@xxxxxxxxxxxx> · Sat, 14 Jul 2018 17:15:57 +0000 (UTC)

On Sat, 14 Jul 2018, Glen Baars wrote:
> Hello Ceph users!
> 
> Note to users, don't install new servers on Friday the 13th!
> 
> We added a new Ceph node on Friday, and it received the latest 12.2.6
> update. I started to see CRC errors and investigated hardware issues,
> but have since found that they are caused by the 12.2.6 release. About
> 80TB has been copied onto this server.
> 
> I have set noout, noscrub, and nodeep-scrub and repaired the affected
> PGs (ceph pg repair). This has cleared the errors.
> 
> ***** I have no idea if this is a good way to fix the issue. From the
> bug report, the issue is in deep scrub, so I suppose stopping it will
> limit the issues. *****
> 
> Can anyone tell me what to do? It seems a downgrade won't fix the
> issue. Maybe remove this node, rebuild it with 12.2.5, and resync the
> data? Or wait a few days for 12.2.7?
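
For anyone following along, the flags and repair step described in the 
quoted message correspond roughly to the ceph CLI calls below. This is 
only a sketch: the PG id "1.2f" is a made-up placeholder, the helper 
names are hypothetical, and it defaults to printing the commands rather 
than running them. Note that the CLI spells the deep-scrub flag 
"nodeep-scrub".

```python
import subprocess

# Flag names as the ceph CLI spells them (note the hyphen in nodeep-scrub).
FLAGS = ["noout", "noscrub", "nodeep-scrub"]


def build_commands(pg_ids):
    """Return the argv lists for setting the flags and repairing each PG."""
    cmds = [["ceph", "osd", "set", flag] for flag in FLAGS]
    cmds += [["ceph", "pg", "repair", pg_id] for pg_id in pg_ids]
    return cmds


def run(cmds, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "1.2f" is a hypothetical PG id; take real ones from `ceph health detail`.
    run(build_commands(["1.2f"]))
```

The flags can be cleared again later with "ceph osd unset <flag>".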

I would sit tight for now.  I'm working on the right fix and hope to 
have something to test shortly, and possibly a release by tomorrow.

One remaining danger is that, for the objects with bad full-object 
digests, a read of the entire object will return EIO.  It's up to you 
whether you want to try to quiesce workloads to avoid that (and prevent 
corruption at higher layers) or keep them running to avoid a service 
degradation/outage.  :(  Unfortunately I don't have precise guidance on 
how likely that is.
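
To illustrate what that looks like from a client's side, here is a 
minimal sketch of treating EIO on a whole-object read as a hard failure 
instead of passing partial or stale data upward. It is not tied to any 
Ceph API: read_fn is a hypothetical callable that returns the full 
object as bytes, and the ("ok"/"eio") result convention is made up for 
the example.

```python
import errno


def read_whole_object(read_fn):
    """Call read_fn() and report whether the full read hit EIO.

    read_fn is a hypothetical zero-argument callable returning bytes.
    Returns ("ok", data) on success or ("eio", None) on an I/O error;
    any other OSError is re-raised unchanged.
    """
    try:
        return ("ok", read_fn())
    except OSError as exc:
        if exc.errno == errno.EIO:
            # Do not retry blindly or substitute data; surface the failure.
            return ("eio", None)
        raise
```

A caller could use the "eio" result to pause the affected workload 
instead of writing derived data on top of a failed read.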

Are you using bluestore only, or is it a mix of bluestore and filestore?

sage
