Multisite sync corruption for large multipart obj

We have a two-zone multisite setup, with zones lvs and slc. It works
fine in general; however, a customer reported data corruption (a size
mismatch) between the two zones:

root@host:~# s3cmd -c .s3cfg_lvs ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 410444223 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
root@host-ump:~# s3cmd -c .s3cfg_slc ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 62158776 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index

The object metadata from each zone can be found here:
LVS: https://pastebin.com/a5JNb9vb
SLC: https://pastebin.com/1MuPJ0k1

The SLC copy is a single flat object while the LVS copy is a multipart
object, which indicates that the object was uploaded by the user in LVS
and mirrored to SLC. The SLC object is truncated after 62158776 bytes;
those first 62158776 bytes match the LVS object.

root@host:~# cmp -l slc_obj lvs_obj
cmp: EOF on slc_obj after byte 62158776
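
For anyone who wants to repeat this check on other objects, here is a minimal sketch in Python of the same comparison. It assumes both copies have already been downloaded to local files (slc_obj and lvs_obj above); the function name compare_prefix is made up for illustration and is not part of any Ceph tooling.

```python
def compare_prefix(short_path, long_path, chunk_size=1 << 20):
    """Check whether short_path is a byte-for-byte prefix of long_path.

    Returns (is_prefix, bytes_matched). Identical files also count as
    a prefix match. Hypothetical helper, not part of Ceph or s3cmd.
    """
    matched = 0
    with open(short_path, "rb") as short_f, open(long_path, "rb") as long_f:
        while True:
            a = short_f.read(chunk_size)
            if not a:
                break  # reached the end of the shorter file
            b = long_f.read(len(a))
            if a != b:
                # Count how many leading bytes of this chunk still agree.
                for x, y in zip(a, b):
                    if x != y:
                        break
                    matched += 1
                return False, matched
            matched += len(a)
    return True, matched
```

Calling compare_prefix("slc_obj", "lvs_obj") returns a pair: whether the first file is a pure truncation of the second, and how many leading bytes agree. A result of (True, n) with n smaller than the size of the second file corresponds to the truncation case that cmp reports above.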

Both the bucket sync status and the overall sync status look healthy,
and the object was created 5 days ago. It looks as if the transfer was
terminated partway through while the object content was being pulled
from the source zone (LVS), leaving an incomplete object. There also
seems to be no checksum verification in the sync agent, so the
corrupted object stayed in place and was treated as a successful sync.
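
As an illustration of the kind of check that could catch this from the outside, here is a rough Python sketch that recomputes an S3-style multipart ETag for a downloaded copy, so it can be compared with the ETag in the source object's metadata. It relies on the common S3 convention (MD5 of the concatenated per-part MD5 digests, followed by "-" and the part count). The part size used for the original upload is not stated in this post and has to be supplied as an assumption. The function name multipart_etag is made up, and this is not how the sync code itself works.

```python
import hashlib


def multipart_etag(path, part_size):
    """Recompute an S3-style multipart ETag for a local file.

    Assumes every part except the last was exactly part_size bytes,
    which must match what the uploading client actually used.
    Hypothetical helper for out-of-band verification only.
    """
    digests = []
    with open(path, "rb") as f:
        while True:
            part = f.read(part_size)  # holds one whole part in memory
            if not part:
                break
            digests.append(hashlib.md5(part).digest())
    # The final ETag hashes the binary part digests, not the data itself.
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"
```

If the value computed from the SLC download does not match the ETag recorded for the LVS object (with the right part size), the copies differ. Comparing the object sizes, as in the listing above, is a cheaper first check.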

root@host:~# radosgw-admin --cluster slc_ceph_ump bucket sync status --bucket=ms-nsn-prod-48
realm 2305f95c-9ec9-429b-a455-77265585ef68 (metrics)
zonegroup 9dad103a-3c3c-4f3b-87a0-a15e17b40dae (ebay)
zone 6205e53d-6ce4-4e25-a175-9420d6257345 (slc)
bucket ms-nsn-prod-48[017a0848-cf64-4879-b37d-251f72ff9750.432063.48]

source zone 017a0848-cf64-4879-b37d-251f72ff9750 (lvs)
                full sync: 0/16 shards
                incremental sync: 16/16 shards
                bucket is caught up with source


Re-syncing the bucket does not resolve the inconsistency:

radosgw-admin bucket sync init --source-zone lvs --bucket=ms-nsn-prod-48

root@host:~# radosgw-admin bucket sync status --bucket=ms-nsn-prod-48
realm 2305f95c-9ec9-429b-a455-77265585ef68 (metrics)
zonegroup 9dad103a-3c3c-4f3b-87a0-a15e17b40dae (ebay)
zone 6205e53d-6ce4-4e25-a175-9420d6257345 (slc)
bucket ms-nsn-prod-48[017a0848-cf64-4879-b37d-251f72ff9750.432063.48]

source zone 017a0848-cf64-4879-b37d-251f72ff9750 (lvs)
                full sync: 0/16 shards
                incremental sync: 16/16 shards
                bucket is caught up with source

root@lvscephmon01-ump:~# s3cmd -c .s3cfg_slc ls s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index
2019-05-14 04:30 62158776 s3://ms-nsn-prod-48/01DAT9KVPEDE4QTA6EWFBZJ5KS/index


A tracker issue has been filed at
https://tracker.ceph.com/issues/39992


