Hello,
I wanted to find out (in a lab Ceph setup) what exactly happens
when part of the data on an OSD disk gets corrupted. I created a simple
test where I scanned the block device until I found something that
resembled user data (using dd and hexdump); /dev/sdd is the block
device used by the OSD:
INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
00000000  6e 20 69 64 3d 30 20 65 78 65 3d 22 2f 75 73 72  |n id=0 exe="/usr|
00000010  2f 73 62 69 6e 2f 73 73 68 64 22 20 68 6f 73 74  |/sbin/sshd" host|
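For reference, with bs=32 and skip=33920 the absolute byte offset of that region is just skip times bs:

```shell
# Absolute byte offset of the region read above: skip * bs
echo $((33920 * 32))
# 1085440 bytes, i.e. roughly 1 MiB into /dev/sdd
```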
Then I deliberately overwrote 32 bytes using random data:
INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/urandom of=/dev/sdd bs=32 count=1 seek=33920
INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 | hexdump -C
00000000  25 75 af 3e 87 b0 3b 04 78 ba 79 e3 64 fc 76 d2  |%u.>..;.x.y.d.v.|
00000010  9e 94 00 c2 45 a5 e1 d2 a8 86 f1 25 fc 18 07 5a  |....E......%...Z|
At this point the on-disk data is corrupted. I restarted the OSD
daemon on this host to make sure it flushed any potentially buffered
data. It restarted fine without noticing anything, which was expected.
Then I ran
ceph osd scrub 5
ceph osd deep-scrub 5
and waited for all scheduled scrub operations on all PGs to finish.
No inconsistency was found and no errors were reported; the scrubs just
finished OK, yet the data is still visibly corrupt via hexdump.
Did I just hit some block of data that WAS used by the OSD but was
marked deleted and is therefore no longer referenced, or am I missing
something? I would have expected Ceph to detect the disk corruption and
automatically repair the invalid data from a valid copy.
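To test my own "deleted/no longer used" hypothesis, I suppose I could check whether that offset falls into BlueStore's free space, with the OSD stopped; a rough sketch, assuming the OSD data path is /var/lib/ceph/osd/ceph-5:

```shell
systemctl stop ceph-osd@5
# dump BlueStore's free-extent list as JSON
ceph-bluestore-tool free-dump --path /var/lib/ceph/osd/ceph-5 > free.json
# then check whether byte offset 1085440 (= 33920 * 32) lies inside a free extent
systemctl start ceph-osd@5
```

I have not tried this yet; corrections welcome if there is a better way.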
I use only replicated pools in this lab setup, for RBD and CephFS.
Thanks
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx