Re: Silent data corruption may destroy all the object copies after data migration


 



hi,



On 08/24/2018 09:25 AM, poi wrote:
Hello!

Recently we did a data migration from one crush root to another because
the resources of the original root were about to run out. After the
migration, we found that some objects were corrupted and that their copies
on the other OSDs were corrupted as well.

Eventually we found that, for one PG, the data migration uses only one
OSD's data to generate the three new copies and does not check the CRC
before migration, effectively assuming the data is always correct (which
nobody can actually guarantee). We tried both FileStore and BlueStore,
and the results are the same. Copying from a single replica without a CRC
check may lack reliability.

Is there any way to ensure the correctness of the data during data
migration? We could run a deep scrub before the migration, but the cost
is too high. I think adding a CRC check for objects before copying during
peering might work.
In the current implementation, deep-scrub may also fail to detect silent
data errors in a timely manner when the object's data is cached by BlueStore.
Please see this pull request's commit message:
https://github.com/ceph/ceph/pull/23629
When doing CRC checks, we should bypass the cache and read directly from disk.
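
For reference, a deep scrub can also be requested manually per PG, which is
cheaper than deep-scrubbing a whole pool before a migration; the PG id below
is only a placeholder:

ceph pg deep-scrub 1.2a
rados list-inconsistent-obj 1.2a --format=json-pretty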

Regards,
Xiaoguang Wang




The experiment we did is roughly as follows:

First we create some OSDs and divide them into two different roots.


-> # ceph osd tree

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF

-7 0.05699 root root1

-5 0.05699 host ceph01

  3 hdd 0.01900 osd.3 up 1.00000 1.00000

  4 hdd 0.01900 osd.4 up 1.00000 1.00000

  5 hdd 0.01900 osd.5 up 1.00000 1.00000

-1 0.05846 root default

-3 0.05846 host ceph02

  0 hdd 0.01949 osd.0 up 1.00000 1.00000

  1 hdd 0.01949 osd.1 up 1.00000 1.00000

  2 hdd 0.01949 osd.2 up 1.00000 1.00000



Then we create a replicated pool on root default. Note that we set the
failure domain to OSD (a possible rule setup is sketched after the command
below).



ceph osd pool create test 128 128
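
Since each root here has only one host, a rule with an OSD-level failure
domain is needed to place 3 replicas. The rules were presumably set up along
these lines (the name replicated_osd is an assumption; root1_rule is the rule
used later for the migration):

ceph osd crush rule create-replicated replicated_osd default osd
ceph osd crush rule create-replicated root1_rule root1 osd
ceph osd pool set test crush_rule replicated_osd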



Next we put an object into the pool.



-> # cat txt

123

-> # rados -p test put test_copy txt

-> # rados -p test get test_copy -

123



Then we take OSD.0 down and modify its copy of the object test_copy.
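
On a systemd deployment, taking the OSD down might look like the following
(ceph-objectstore-tool requires the OSD to be stopped):

systemctl stop ceph-osd@0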



-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0
test_copy get-bytes

123

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0
test_copy set-bytes 120txt
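
Here 120txt is presumably a local file holding the corrupted payload, e.g.
prepared with:

echo 120 > 120txt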



Next we start OSD.0 again and do the data migration.
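
Again assuming a systemd deployment, restarting the OSD might be:

systemctl start ceph-osd@0

The migration itself is then triggered by switching the pool's crush rule: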



ceph osd pool set test crush_rule root1_rule



Finally we try to read the object back with rados and with ceph-objectstore-tool:



-> # rados -p test get test_copy -

error getting test/test_copy: (5) Input/output error

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3
test_copy get-bytes

120

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4
test_copy get-bytes

120

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5
test_copy get-bytes

120



The data of test_copy on OSD.3, OSD.4 and OSD.5 comes from OSD.0, which has
the silent data corruption.

Our config is below:

-> # cat /etc/ceph/ceph.conf

[global]

fsid = 96c5f802-ca66-4d12-974f-5b5658a18353

mon_initial_members = ceph00

mon_host = 10.18.192.27

auth_cluster_required = none

auth_service_required = none

auth_client_required = none

public_network = 10.18.192.0/24

[mon]

mon_allow_pool_delete = true



The Ceph version is below:

-> # ceph -v

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)



Thanks

Jiahui Cen




