Corrupted and inconsistent reads from CephFS on EC pool

Hi everyone,

I'm seeing inconsistent results when reading files from CephFS on an erasure-coded pool: the data returned depends on which OSDs are running, and some reads are incorrect even with all OSDs up. I'm running Ceph 17.2.6.

# More detail

In particular, I have a relatively large backup of files, along with SHA-256 hashes of those files (verified when the backup was created, approximately seven months ago). Verifying the hashes now reports several failures, in both large and small files, though skewed somewhat towards larger files.
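
(For what it's worth, the verification is just a plain sha256sum pass over the manifest stored alongside the backup; the manifest name below is only illustrative.)

    # manifest name is illustrative; any "FAILED" lines are the corrupted files
    sha256sum -c backup-manifest.sha256 | grep -v ': OK$'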

Investigating which PGs stored the affected files (using cephfs-data-scan pg_files) showed that the problem isn't isolated to a single PG, but several of the affected PGs do have OSD 15 in their acting set.
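
For reference, the invocation looked roughly like this (the path and PG IDs below are placeholders; the real ones came from my pool):

    # path and PG IDs are placeholders; list one PG ID per suspect PG
    cephfs-data-scan pg_files /mnt/cephfs/backups 7.1a 7.2c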

Taking OSD 15 offline leads to *better* reads (more files with correct SHA-256 hashes), but not completely correct ones. Further investigation implicated OSD 34 as another potential culprit; taking it offline as well yields still more correct files, but again not all of them.

Bringing the stopped OSDs (15 and 34) back online brings back the earlier (incorrect) hashes when reading the files, as might be expected, but this seems to demonstrate that the correct information (or at least more-correct information) is still on the drives.
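
In case the procedure matters, I've been stopping and starting the OSDs roughly like this, with noout set so nothing starts backfilling in the meantime (the systemd unit names are from my non-containerized deployment and may differ elsewhere):

    ceph osd set noout
    systemctl stop ceph-osd@15     # on the host carrying OSD 15
    # ...re-read and re-hash the affected files while the OSD is down...
    systemctl start ceph-osd@15
    ceph osd unset noout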

The hashes I get back for a given corrupted file are consistent from read to read (including across different hosts, to rule out caching), but they obviously sometimes change if I take an affected OSD offline.

# Recent history

I have Ceph configured with a deep scrub interval of approximately 30 days, and scrubs have completed regularly with no issues reported. However, within the past two weeks I added two additional drives to the cluster, and rebalancing took about two weeks to complete. The placement groups I noticed problems with have not been deep scrubbed since that rebalance completed, so it is possible something was corrupted during the rebalance.
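
(The interval is just osd_deep_scrub_interval set to roughly 30 days, along these lines; the value is in seconds.)

    # ~30 days in seconds; confirm with "ceph config get osd osd_deep_scrub_interval"
    ceph config set osd osd_deep_scrub_interval 2592000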

Neither OSD 15 nor 34 is a new drive, and as far as I can tell (and as far as Ceph's health reporting has shown), all of the existing OSDs have behaved correctly up to this point.

# Configuration

I created an erasure coding profile for the pool in question using the following command:

    ceph osd erasure-code-profile set erasure_k4_m2 \
      plugin=jerasure \
      k=4 m=2 \
      technique=blaum_roth \
      crush-device-class=hdd
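
The profile as the cluster actually stores it can be dumped for comparison:

    ceph osd erasure-code-profile get erasure_k4_m2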

And the following CRUSH rule is used for the pool:

    rule erasure_k4_m2_hdd_rule {
        id 3
        type erasure
        min_size 4
        max_size 6
        step take default class hdd
        step choose indep 3 type host
        step chooseleaf indep 2 type osd
        step emit
    }
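
(If it helps, the cluster's own view of that rule can also be dumped; happy to post the output.)

    ceph osd crush rule dump erasure_k4_m2_hdd_rule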

# Questions

1. Does this behavior ring a bell for anyone? Is there something obvious I'm missing or should be doing?

2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully not hurt: I've prioritized deep scrubbing of the PGs on OSD 15 and 34, roughly as sketched after these questions, and will likely follow up with the rest of the pool.)

3. Is there a way to force "full reads", or otherwise use all of the EC chunks (potentially in tandem with the on-disk checksums), to identify the correct data, rather than just the combination of data from the primary OSDs?
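
Regarding question 2, the commands I used to queue those scrubs were roughly the following (the PG ID shown is a placeholder; the real ones came from the ls-by-osd output):

    # repeat for OSD 34, and for each PG ID in the output
    ceph pg ls-by-osd 15
    ceph pg deep-scrub 7.1a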

Thanks for any insights you might have,
aschmitz


