Scrubbing for RocksDB

Hi list,

We were wondering if and how the consistency of the BlueStore OSD journals (RocksDB) is checked.

Our cluster runs Luminous (12.2.2), and we migrated all our FileStore OSDs to BlueStore a couple of months ago. During that process we placed each RocksDB on a separate partition on a RAID1 consisting of two SSDs. The cluster has been healthy; we deep-scrub the whole cluster once a week without any errors.
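To give an idea of the layout: each OSD was created with its block.db on one of the partitions of the SSD RAID1, roughly like this (device names are just examples, not our actual devices):

---cut here---
# example only: BlueStore OSD with the data on an HDD and the
# RocksDB (block.db) on a partition of the SSD RAID1 (md126)
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/md126p2
---cut here---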

Then we decided to restructure the disk layout on one of the hosts, because we didn't want that RAID of SSDs anymore. So we failed one disk (diskB), wiped it and assigned a new volume group to it, now containing one logical volume per OSD. We started the journal migration as described in [1] by copying the data from diskA (the degraded RAID1) to diskB (LVM) with dd. The first journal migration worked like a charm, but for the next four partitions dd reported errors like these:

---cut here---
FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sense Key : Medium Error [current]
Add. Sense: Read retries exhausted
CDB: Read(10) 28 00 0a 08 8b a0 00 04 00 00
blk_update_request: critical medium error, dev sdk, sector 168332406
Buffer I/O error on dev md126p6, logical block 1363854, async page read
---cut here---

Four of the six partitions reported these errors; a look at smartctl confirmed that this SSD is corrupt and has non-recoverable errors. That's why we had to rebuild the affected OSDs from scratch, but at least without rearranging the whole cluster (also mentioned in [1]).
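For reference, the per-partition copy and the disk check were nothing special; roughly what we ran (device and LV names are placeholders, not our actual layout):

---cut here---
# copy one journal partition from the degraded RAID1 to the new LV
# (device and LV names are placeholders)
dd if=/dev/md126p6 of=/dev/vg_journal/db_osd6 bs=1M status=progress

# check the suspicious SSD; the error log showed the
# non-recoverable read errors
smartctl -a /dev/sdk
---cut here---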

So my question is: why can't I find anything in the Ceph logs about this? Scrubbing and deep-scrubbing only check the PGs on the data device for consistency, but what about the journal? Is there any tool we haven't found yet, or any mechanism that would detect such an I/O error? Of course, it's possible that the affected blocks on the corrupt partitions hadn't been updated for some time, but IMHO there should be something to check the journal's consistency and report it in the Ceph logs, something like a journal scrub, maybe.
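For completeness: there is an offline fsck in ceph-bluestore-tool, but it requires the OSD to be stopped, so it doesn't really cover the running-cluster case I'm asking about. Roughly (OSD ID and path are just examples):

---cut here---
# offline consistency check of one BlueStore OSD (OSD ID is an example)
systemctl stop ceph-osd@6
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-6
systemctl start ceph-osd@6
---cut here---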

Has someone experienced similar issues and can shed some light on this? Any insights would be very helpful.

Regards,
Eugen

[1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024913.html

--
Eugen Block                             voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail  : eblock@xxxxxx

        Vorsitzende des Aufsichtsrates: Angelika Mozdzen
          Sitz und Registergericht: Hamburg, HRB 90934
                  Vorstand: Jens-U. Mozdzen
                   USt-IdNr. DE 814 013 983



