Hi everyone,
I'm seeing different results when reading files from an erasure-coded pool
in CephFS, depending on which OSDs are running, including some incorrect
reads even with all OSDs up. I'm running Ceph 17.2.6.
# More detail
In particular, I have a relatively large backup of some files, along with
SHA-256 hashes of those files (which were verified when the backup was
created, approximately 7 months ago). Verifying the hashes now produces
several mismatches, in both large and small files, though somewhat skewed
towards larger files.
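For concreteness, the verification is just a standard checksum pass over the
backup tree, roughly like the following (the mount point and hash file name
here are placeholders, not my real paths):

  # Hypothetical paths; the hash file was generated when the backup was taken.
  cd /mnt/cephfs/backups
  sha256sum -c SHA256SUMS | grep -v ': OK$'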
Investigating which PGs store the affected files (using
cephfs-data-scan pg_files) showed the problem isn't isolated to a single
PG, but several of the affected PGs have OSD 15 in their acting set.
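For reference, that investigation looked roughly like this (the path and PG
IDs are placeholders for my real ones):

  # List files under the backup path that touch the suspect PGs.
  cephfs-data-scan pg_files /backups 3.1f 3.2a
  # Show the up/acting OSD sets for a suspect PG.
  ceph pg map 3.1f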
Taking OSD 15 offline leads to *better* reads (more files with correct
SHA-256 hashes), but not entirely correct ones. Further investigation
implicated OSD 34 as another potential culprit; taking it offline likewise
yields more correct files, but still not all of them.
Bringing the stopped OSDs (15 and 34) back online brings back the earlier
(incorrect) hashes when reading the files, as might be expected, but this
suggests that the correct data (or at least more-correct data) is still on
the drives.
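For clarity, "taking an OSD offline" above just means stopping the daemon
with noout set so nothing rebalances in the meantime, along these lines
(the unit name will differ on a cephadm-managed cluster):

  ceph osd set noout
  systemctl stop ceph-osd@15    # and/or ceph-osd@34
  # ... re-run the SHA-256 verification ...
  systemctl start ceph-osd@15
  ceph osd unset noout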
The hashes I get for a given corrupted file are consistent from read to
read (including across different hosts, to rule out caching), although they
can of course change when I take an affected OSD offline.
# Recent history
I have Ceph configured with a deep scrub interval of approximately 30
days, and deep scrubs have completed regularly with no issues identified.
However, within the past two weeks I added two additional drives to the
cluster, and the resulting rebalance took about two weeks to complete: the
placement groups I've noticed issues with have not been deep scrubbed since
the rebalance finished, so it is possible something was corrupted during
the rebalance.
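(For reference, the interval corresponds to osd_deep_scrub_interval, and the
last-deep-scrub timestamps can be checked per PG; the PG ID below is a
placeholder.)

  # ~30 days, in seconds (roughly how I have it configured).
  ceph config set osd osd_deep_scrub_interval 2592000
  # When was this PG last deep scrubbed?
  ceph pg 3.1f query | grep last_deep_scrub_stamp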
Neither OSD 15 nor 34 is a new drive, and as far as I have experienced
(and Ceph's health indications have shown), all of the existing OSDs
have behaved correctly up to this point.
# Configuration
I created an erasure coding profile for the pool in question using the
following command:
ceph osd erasure-code-profile set erasure_k4_m2 \
    plugin=jerasure \
    k=4 m=2 \
    technique=blaum_roth \
    crush-device-class=hdd
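The profile as the cluster reports it can be dumped with the following, in
case any of the defaulted parameters matter:

  ceph osd erasure-code-profile get erasure_k4_m2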
And the following CRUSH rule is used for the pool:
rule erasure_k4_m2_hdd_rule {
    id 3
    type erasure
    min_size 4
    max_size 6
    step take default class hdd
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}
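For what it's worth, the pool's rule and profile assignments can be
confirmed with the following (the pool name is a placeholder for my EC data
pool):

  ceph osd pool get cephfs_data_ec crush_rule
  ceph osd pool get cephfs_data_ec erasure_code_profile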
# Questions
1. Does this behavior ring a bell for anyone? Is there something obvious
I'm missing or should be doing?
2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully
not hurt: I've prioritized deep scrubbing of the PGs on OSDs 15 and 34,
roughly as sketched below the questions, and will likely follow up with
the rest of the pool.)
3. Is there a way to force "full reads" or otherwise use all of the EC
chunks (potentially in tandem with the on-disk checksums) to identify the
correct data, rather than just the combination of data from the primary OSDs?
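For question 2, the prioritized deep scrubbing I mentioned looks roughly
like this, and for question 3 my current plan is to inspect whatever the
scrubs flag using rados list-inconsistent-obj (the awk filter and the PG ID
below are just illustrative):

  # Queue deep scrubs for every PG mapped to osd.15 (and likewise osd.34).
  for pg in $(ceph pg ls-by-osd osd.15 | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
      ceph pg deep-scrub "$pg"
  done
  # Once a scrub reports inconsistencies, inspect them (placeholder PG ID):
  rados list-inconsistent-obj 3.1f --format=json-pretty

I'm deliberately holding off on ceph pg repair until I understand which
copies are actually trustworthy.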
Thanks for any insights you might have,
aschmitz