Re: 12.2.7 + osd skip data digest + bluestore + I/O errors


 



Oh my…

 

I tried a yum upgrade with the cache tier in writeback mode and noticed these messages in the syslog on the VM:

 

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896024

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896064

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895552

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895536

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895520

(…)

 

Ceph is also logging many errors:

 

2018-07-24 15:20:24.893872 osd.74 [ERR] 1.33 copy from 1:cd70e921:::rbd_data.21e0fe2ae8944a.0000000000001111:head to 1:cd70e921:::rbd_data.21e0fe2ae8944a.0000000000001111:head data digest 0x1480c7a1 != source 0xe1e7591b

[root@ceph0 ~]# egrep 'copy from.*to.*data digest' /var/log/ceph/ceph.log |wc -l

928
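To see which OSDs, PGs and objects those 928 lines actually concern, the matches can be grouped instead of just counted. This is only a minimal sketch: the function name `summarise_digest_errors` is made up for illustration, and the field positions assume the cluster log line layout shown above (date, time, osd, level, PG id, then "copy from <object>").

```shell
# Summarise "data digest" copy errors from a Ceph cluster log.
# Reads log lines on stdin and prints "<count> <osd> <pg> <object>",
# most frequent first. Assumes the ceph.log layout quoted in this mail.
summarise_digest_errors() {
    grep -E 'copy from .* to .* data digest' |
    awk '{
        # $3 = osd.N, $5 = PG id, $8 = source object of the copy
        count[$3 " " $5 " " $8]++
    }
    END {
        for (key in count) print count[key], key
    }' |
    sort -k1,1nr -k2
}
```

It would be run as `summarise_digest_errors < /var/log/ceph/ceph.log`, which should make it easier to tell whether the errors are spread over the whole cache pool or concentrated on a few objects.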

 

Setting the cache tier back to forward mode makes the I/O errors go away again:

 

In writeback mode :

 

# yum update 2>&1|tail

---> Package glibc-headers.x86_64 0:2.12-1.209.el6_9.2 will be updated

---> Package glibc-headers.x86_64 0:2.12-1.212.el6 will be an update

---> Package gmp.x86_64 0:4.3.1-12.el6 will be updated

---> Package gmp.x86_64 0:4.3.1-13.el6 will be an update

---> Package gnupg2.x86_64 0:2.0.14-8.el6 will be updated

---> Package gnupg2.x86_64 0:2.0.14-9.el6_10 will be an update

---> Package gnutls.x86_64 0:2.12.23-21.el6 will be updated

---> Package gnutls.x86_64 0:2.12.23-22.el6 will be an update

---> Package httpd.x86_64 0:2.2.15-60.sl6.6 will be updated

Error: disk I/O error

 

=> Each time I run a yum update, I get a bit farther in the yum update process.

 

In forward mode : works as expected

I haven’t tried to flush the cache pool while in forward mode… yet…

 

Ugh :/

 

Regards

 

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On behalf of SCHAER Frederic
Sent: Tuesday, 24 July 2018 15:01
To: ceph-users <ceph-users@xxxxxxxx>
Subject: [PROVENANCE INTERNET] 12.2.7 + osd skip data digest + bluestore + I/O errors

 

Hi,

 

I read the 12.2.7 upgrade notes, and set “osd skip data digest = true” before I started upgrading from 12.2.6 on my Bluestore-only cluster.

As far as I can tell, my OSDs all got restarted during the upgrade and all got the option enabled :

 

This is what I see for a specific OSD taken at random:

# ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show|grep data_digest

    "osd_skip_data_digest": "true",

 

This is what I see when I try to set the skip data digest option with injectargs:

 

# ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1|head

osd.0: osd_skip_data_digest = 'true' (not observed, change may require restart)

osd.1: osd_skip_data_digest = 'true' (not observed, change may require restart)

osd.2: osd_skip_data_digest = 'true' (not observed, change may require restart)

osd.3: osd_skip_data_digest = 'true' (not observed, change may require restart)

(…)

 

This has been like that since I upgraded to 12.2.7.
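Since injectargs only reports "not observed", the value each running OSD really uses can be read back from its admin socket, as done for osd.68 above. The loop below is a rough sketch of doing that for every OSD on one host. The function name `check_skip_data_digest` is invented for this example, and the socket naming (`ceph-osd.<id>.asok` under `/var/run/ceph`) is assumed from the path used above.

```shell
# Print the osd_skip_data_digest setting of every OSD admin socket on this host.
# $1: directory holding the ceph-osd.*.asok sockets (defaults to /var/run/ceph).
check_skip_data_digest() {
    sock_dir="${1:-/var/run/ceph}"
    for sock in "$sock_dir"/ceph-osd.*.asok; do
        [ -e "$sock" ] || continue   # the glob matched nothing
        # Same query as the manual check above, trimmed to a single compact token.
        setting=$(ceph --admin-daemon "$sock" config show 2>/dev/null |
                  grep '"osd_skip_data_digest"' | tr -d ' ,')
        echo "$(basename "$sock" .asok): ${setting:-unavailable}"
    done
}
```

Running `check_skip_data_digest` on each storage node would show whether any OSD is still running without the option, which a single random sample cannot rule out.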

I read in the release notes that the skip_data_digest option should be sufficient to ignore the 12.2.6 corruptions and that objects should auto-heal on rewrite…

 

However…

 

My config :

-          Using tiering with an SSD hot storage tier

-          HDDs for cold storage

 

And… I get I/O errors on some VMs when running some commands as simple as “yum check-update”.

 

The qemu/kvm/libvirt logs (in /var/log/libvirt/qemu) show me these:

 

block I/O error in device 'drive-virtio-disk0': Input/output error (5)

 

In the ceph logs, I can see these errors :

 

2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54

 

(yes, my cluster is seen as healthy)

 

On the affected OSDs, I can see these errors :

 

2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error

2018-07-24 11:17:56.429900 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.429934 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00000000000000e7:head data digest 0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.429939 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error

 

And… I don’t know how to recover from that.

Pool #1 is my SSD cache tier, hence pg 1.23 is on the SSD side.

 

I’ve tried setting the cache pool to “readforward” despite the “not well supported” warning and could immediately get back working VMs (no more I/O errors).

But with no SSD tiering : not really useful.

 

As soon as I set the cache tier back to writeback, I got those I/O errors again… (not on the yum command, but in the meantime I had stopped osd.71 and marked it out, then marked it in again, to check it with badblocks just in case…)

I still have to find a way to reproduce the I/O error on an affected host so that I can debug and fix the issue further…

 

Any ideas ?

 

Thanks && regards

 

