Oh my bad, 12.2.6 was the one that was broken that way. 12.2.5 just had
a pg log+EC issue.

On Tue, Sep 4, 2018 at 11:28 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Fri, Aug 31, 2018 at 2:20 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Thu, 30 Aug 2018, Gregory Farnum wrote:
>>> On Thu, Aug 23, 2018 at 8:38 AM, poi <poiiiicen@xxxxxxxxx> wrote:
>>> > Hello!
>>> >
>>> > Recently, we did data migration from one crush root to another, but
>>> > after that, we found some objects were wrong and their copies on other
>>> > OSDs were also wrong.
>>> >
>>> > Finally, we found that for one pg, the data migration uses only one
>>> > OSD's data to generate three new copies, and does not check the crc
>>> > before migration, as if assuming the data is always correct (but
>>> > actually nobody can promise that). We tried both filestore and
>>> > bluestore, and the results were the same. Copying from one pg without
>>> > a crc check may lack reliability.
>>>
>>> Exactly what version are you running, and what backends? Are you
>>> actually using BlueStore?
>>>
>>> This is certainly the general case with replicated pools on FileStore,
>>> but it shouldn't happen with BlueStore or EC pools at all. We aren't
>>> going to implement "voting" on FileStore-backed OSDs though, as that
>>> would vastly multiply the cost of backfilling. :(
>>
>> If I'm understanding, assuming you're running bluestore, the checksums on
>> each OSD are (internally) correct (so reads succeed), but the actual data
>> stored on the different OSDs is different. And then you trigger a
>> rebalance or recovery. Is that right?
>>
>> This is something a deep scrub will normally pick up in due time, but in a
>> recovery case, we assume the existing replicas are consistent and base any
>> new replica off of the primary's copy.
>
> But if we're returning EIO on a regular read of the object, we're
> obviously doing some checking we don't do in the backfill or recovery
> case. It sounds like we compare the hobject's global checksum to what
> we read, but we don't check that on recovery.
>
> Which, well, we should! There might be some issues recovering from it if
> there's a failure in that path, but such is life.
>
> ...but oh, I just realized this is 12.2.5 (aka: the broken one). So I
> think this is already resolved in the 12.2.7 fixes? David? I created
> http://tracker.ceph.com/issues/35542 before I realized, so we can
> either prioritize that or close it as fixed.
> -Greg
>
>>
>> The way to "fix" that might be to do a deep scrub on an object before it
>> is recovered. I'm pretty sure we don't want to incur that kind of
>> overhead (it would slow recovery way down). And in the recovery case
>> where you are down one or more replicas, you're always going to have to
>> base the new replica off an existing replica. In theory, if you're
>> making multiple replicas, you could source them from different
>> existing copies, but in practice that's inefficient and doesn't align
>> with how the OSD implements recovery (everything is push/pull from the
>> primary).
>>
>> sage
>>
>>
>>
>>> -Greg
>>>
>>> >
>>> > Is there any way to ensure the correctness of the data during data
>>> > migration? We could do a deep scrub before migration, but the cost
>>> > is too high. I think adding a crc check for objects before copying,
>>> > when peering, might work.
>>> >
>>> > Regards
>>> >
>>> > Poi
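
To make the check being discussed a bit more concrete, here is a minimal
standalone sketch: recompute an object's whole-object digest and compare
it to the digest recorded with its metadata before trusting that copy as a
recovery source (the same comparison that turns a bad client read into
EIO). Everything below is invented for illustration -- the struct, the
function names, and the bitwise crc32c -- it is not Ceph's actual recovery
path or on-disk format.

#include <cstdint>
#include <cstddef>
#include <iostream>
#include <vector>

// Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78), the same
// checksum family Ceph uses for its per-object data digest.
static uint32_t crc32c(const std::vector<uint8_t>& data) {
    uint32_t crc = 0xFFFFFFFFu;
    for (uint8_t byte : data) {
        crc ^= byte;
        for (int k = 0; k < 8; ++k)
            crc = (crc & 1u) ? (crc >> 1) ^ 0x82F63B78u : (crc >> 1);
    }
    return ~crc;
}

// Stand-in for one replica of an object plus the digest recorded in its
// metadata when the data was last written (purely illustrative layout).
struct ObjectCopy {
    std::vector<uint8_t> data;
    uint32_t stored_digest;
};

// Recompute the digest and refuse a mismatching copy as a recovery source.
static bool usable_as_recovery_source(const ObjectCopy& copy) {
    return crc32c(copy.data) == copy.stored_digest;
}

int main() {
    ObjectCopy good{{'c', 'e', 'p', 'h'}, 0};
    good.stored_digest = crc32c(good.data);   // consistent replica

    ObjectCopy bad = good;
    bad.data[0] = 'x';                        // silently corrupted payload

    std::cout << "good copy usable: " << usable_as_recovery_source(good) << "\n";  // 1
    std::cout << "bad copy usable:  " << usable_as_recovery_source(bad) << "\n";   // 0
    return 0;
}

On a mismatch, the primary could skip that copy and pick another source
(or surface the error) rather than silently pushing bad data to the new
OSDs -- at the cost of reading and checksumming every object it recovers,
which is the overhead trade-off Sage describes above.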