Hi again,

Now with all OSDs restarted, I'm getting:

    health: HEALTH_ERR
            777 scrub errors
            Possible data damage: 36 pgs inconsistent
    (...)
    pgs:    4764 active+clean
            36   active+clean+inconsistent

But from what I've read so far, this is expected and should auto-heal when the objects are overwritten - fingers crossed, as pg repair or scrub doesn't seem to help.

New errors in the ceph logs include lines like the following, which I also hope/presume are expected - I still have posts to read on this list about omap and those errors:

2018-07-25 10:20:00.106227 osd.66 osd.66 192.54.207.75:6826/2430367 12 : cluster [ERR] 11.288 shard 207: soid 11:1155c332:::rbd_data.207dce238e1f29.0000000000000527:head data_digest 0xc8997a5b != data_digest 0x2ca15853 from auth oi 11:1155c332:::rbd_data.207dce238e1f29.0000000000000527:head(182554'240410 client.6084296.0:48463693 dirty|data_digest|omap_digest s 4194304 uv 49429318 dd 2ca15853 od ffffffff alloc_hint [0 0 0])
2018-07-25 10:20:00.106230 osd.66 osd.66 192.54.207.75:6826/2430367 13 : cluster [ERR] 11.288 soid 11:1155c332:::rbd_data.207dce238e1f29.0000000000000527:head: failed to pick suitable auth object

But never mind: with the SSD cache tier in writeback, I just saw the same error again - on one VM only, for now (lots of these):

2018-07-25 10:15:19.841746 osd.101 osd.101 192.54.207.206:6859/3392654 116 : cluster [ERR] 1.20 copy from 1:06dd6812:::rbd_data.194b8c238e1f29.00000000000007a3:head to 1:06dd6812:::rbd_data.194b8c238e1f29.00000000000007a3:head data digest 0x27451e3c != source 0x12c05014

(osd.101 is an SSD from the cache pool)

=> yum update => I/O error => set the tier pool to forward => yum update starts.

Weird, but if that only happens on this host, I can cope with it (I have 780+ scrub errors to handle now :/ )

And just to be sure ;)

[root@ceph10 ~]# ceph --admin-daemon /var/run/ceph/*osd*101* version
{"version":"12.2.7","release":"luminous","release_type":"stable"}

On the good side: this update is forcing us to dive into ceph internals - we'll be more ceph-aware tonight than we were this morning ;)

Cheers
Fred
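A minimal sketch of how the inconsistent PGs above can be enumerated and inspected - PG 11.288 is taken from the log excerpt, and "rbd" as the name of the backing data pool is only an assumption:

# ceph health detail | grep inconsistent                    # list the 36 inconsistent PGs
# rados list-inconsistent-pg rbd                            # same thing, per pool (pool name assumed)
# rados list-inconsistent-obj 11.288 --format=json-pretty   # which objects/shards disagree, and on what
# ceph pg deep-scrub 11.288                                 # re-scrub a single PG
# ceph pg repair 11.288                                     # ask the primary to repair it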
-----Original Message-----
From: SCHAER Frederic
Sent: Wednesday, July 25, 2018 09:57
To: 'Dan van der Ster' <dan@xxxxxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxx>
Subject: RE: 12.2.7 + osd skip data digest + bluestore + I/O errors

Hi Dan,

Just checked again: arggghhh...

# grep AUTO_RESTART /etc/sysconfig/ceph
CEPH_AUTO_RESTART_ON_UPGRADE=no

So no :'(

The RPMs were upgraded, but the OSDs were not restarted as I thought. Or at least not restarted with the new 12.2.7 binaries (and since the skip-digest option was already set on the running 12.2.6 OSDs, I guess the 12.2.6 OSDs simply did not understand it).

I just restarted all of the OSDs: I will check the behavior again and report here - thanks for pointing me in the right direction!

Fred
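A minimal sketch of the restart-and-verify step that follows from the above, assuming a stock systemd/RPM deployment (unit and target names are the defaults):

# systemctl restart ceph-osd.target   # on each OSD host: restart every OSD under the new binaries
# ceph versions                       # cluster-wide summary; every OSD should now report 12.2.7
# ceph tell osd.* version             # or ask each OSD daemon individually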
-----Original Message-----
From: Dan van der Ster [mailto:dan@xxxxxxxxxxxxxx]
Sent: Tuesday, July 24, 2018 16:50
To: SCHAER Frederic <frederic.schaer@xxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxx>
Subject: Re: 12.2.7 + osd skip data digest + bluestore + I/O errors

`ceph versions` -- you're sure all the osds are running 12.2.7?

osd_skip_data_digest = true is supposed to skip any crc checks during reads. But maybe the cache tiering IO path is different and checks the crc anyway?

-- dan

On Tue, Jul 24, 2018 at 3:01 PM SCHAER Frederic <frederic.schaer@xxxxxx> wrote:
>
> Hi,
>
> I read the 12.2.7 upgrade notes, and set “osd skip data digest = true” before I started upgrading from 12.2.6 on my Bluestore-only cluster.
>
> As far as I can tell, my OSDs all got restarted during the upgrade and all got the option enabled.
>
> This is what I see for a specific OSD taken at random:
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show | grep data_digest
>     "osd_skip_data_digest": "true",
>
> This is what I see when I try to injectargs the skip-data-digest option:
>
> # ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1 | head
> osd.0: osd_skip_data_digest = 'true' (not observed, change may require restart)
> osd.1: osd_skip_data_digest = 'true' (not observed, change may require restart)
> osd.2: osd_skip_data_digest = 'true' (not observed, change may require restart)
> osd.3: osd_skip_data_digest = 'true' (not observed, change may require restart)
> (…)
>
> It has been like that since I upgraded to 12.2.7.
>
> I read in the release notes that the skip_data_digest option should be sufficient to ignore the 12.2.6 corruptions and that objects should auto-heal on rewrite…
>
> However…
>
> My config:
> - tiering with an SSD hot storage tier
> - HDDs for cold storage
>
> And… I get I/O errors on some VMs when running commands as simple as “yum check-update”.
>
> The qemu/kvm/libvirt logs (in /var/log/libvirt/qemu) show me these:
>
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
>
> In the ceph logs, I can see these errors:
>
> 2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54
> 2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54
>
> (yes, my cluster is seen as healthy)
>
> On the affected OSDs, I can see these errors:
>
> 2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54
> 2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54
> 2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error
> 2018-07-24 11:17:56.429900 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54
> 2018-07-24 11:17:56.429934 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54
> 2018-07-24 11:17:56.429939 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error
>
> And… I don't know how to recover from that.
>
> Pool #1 is my SSD cache tier, hence pg 1.23 is on the SSD side.
>
> I've tried setting the cache pool to “readforward” despite the “not well supported” warning and could immediately get working VMs back (no more I/O errors). But with no SSD tiering, that's not really useful.
>
> As soon as I set the cache tier back to writeback, I got those I/O errors again… (not on the yum command, but in the meantime I had stopped and marked out, then unset out, osd.71 to check it with badblocks, just in case…)
>
> I still have to find out how to reproduce the I/O error on an affected host so I can debug/fix that issue further…
>
> Any ideas?
>
> Thanks && regards
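For reference, a minimal sketch of the cache-mode switches discussed in this thread, assuming the cache tier (pool #1) is named "ssd-cache" - the pool name is an assumption, and luminous asks for --yes-i-really-mean-it on the modes it considers not well supported:

# ceph osd tier cache-mode ssd-cache readforward --yes-i-really-mean-it   # redirect reads to the base tier, keep caching writes
# ceph osd tier cache-mode ssd-cache forward --yes-i-really-mean-it       # redirect all client requests to the base tier
# ceph osd tier cache-mode ssd-cache writeback                            # back to normal writeback caching
# ceph osd pool ls detail | grep cache_mode                               # confirm the currently active mode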