Re: Copying without crc check when peering may lack reliability

hi,


On 08/31/2018 09:47 AM, poi wrote:
Thanks for your reply.
The version we are running is Luminous 12.2.5, and we are actually
using BlueStore with replicated pools.

Our config is below:

-> # cat /etc/ceph/ceph.conf

[global]
fsid = 96c5f802-ca66-4d12-974f-5b5658a18353
mon_initial_members = ceph00
mon_host = 10.18.192.27
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
public_network = 10.18.192.0/24

[mon]
mon_allow_pool_delete = true

The experiment we did is as follows:

First, create some OSDs and divide them into two different roots.

-> # ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-7 0.05699 root root1
-5 0.05699 host ceph01
  3 hdd 0.01900 osd.3 up 1.00000 1.00000
  4 hdd 0.01900 osd.4 up 1.00000 1.00000
  5 hdd 0.01900 osd.5 up 1.00000 1.00000
-1 0.05846 root default
-3 0.05846 host ceph02
  0 hdd 0.01949 osd.0 up 1.00000 1.00000
  1 hdd 0.01949 osd.1 up 1.00000 1.00000
  2 hdd 0.01949 osd.2 up 1.00000 1.00000

Then create a replicated pool on root default. Note we set the failure
domain to OSD.

-> # ceph osd pool create test 128 128
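
(For reference, the CRUSH rules with an OSD failure domain, including the
root1_rule used later for the migration, could have been created roughly as
in the sketch below; root1_rule is the name from this thread, while
default_osd_rule is only an assumed name.)

-> # ceph osd crush rule create-replicated default_osd_rule default osd
-> # ceph osd crush rule create-replicated root1_rule root1 osd
-> # ceph osd pool set test crush_rule default_osd_rule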

Next we put an object into the pool.

-> # cat txt
123
-> # rados -p test put test_copy txt
-> # rados -p test get test_copy -
123

Then we take OSD.0 down and modify the data of the object test_copy on it.
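
(ceph-objectstore-tool requires the OSD to be stopped first; on a
systemd-managed deployment that would be something like the line below,
which is an assumption since our exact command is not shown.)

-> # systemctl stop ceph-osd@0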

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0
test_copy get-bytes
123
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0
test_copy set-bytes 120txt

Next we start OSD.0 again and trigger the data migration by switching the pool's CRUSH rule.

-> # ceph osd pool set test crush_rule root1_rule
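
(The backfill triggered by the rule change can then be watched until all
PGs are back to active+clean, e.g.:)

-> # ceph pg stat
-> # ceph -s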

Finally, we try to get the object with rados and ceph-objectstore-tool:

-> # rados -p test get test_copy -
error getting test/test_copy: (5) Input/output error

I tried to reproduce this issue on the current Ceph master branch.
With "osd skip data digest = true" in my test environment, I modified
an object's primary replica's content from "123" to "120". I get:
    [root@localhost build]# rados -p test get test_copy -
    120
Though the data is corrupted, I don't get any EIO error. Indeed, I think this is reasonable: ceph-objectstore-tool changes the object's data but also re-computes the CRC, so from the primary replica's point of view the corruption is not perceptible, and Ceph will use this replica to fill the other new OSDs.
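
(For completeness, this is roughly how I toggled the option between the two
runs, either in the vstart cluster's ceph.conf or at runtime; an
illustrative sketch, not the exact steps:)

    [osd]
    osd skip data digest = true    # "false" for the second run

    # or at runtime:
    ceph tell osd.* injectargs '--osd_skip_data_digest=true'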

With "osd skip data digest = false", object's primary replicate will be repaired automatically, so I get:
    [root@localhost build]# rados -p test get test_copy -
    123

I used vstart.sh to test. Could you please provide a reproduction script and share your CRUSH map settings? Thanks.

Regards,
Xiaoguang Wang


-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3
test_copy get-bytes
120
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4
test_copy get-bytes
120
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5
test_copy get-bytes
120

The data of test_copy on OSD.3, OSD.4 and OSD.5 comes from OSD.0, which
has the silent data corruption.
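
(Which OSDs currently hold the object can be confirmed with, e.g.:)

-> # ceph osd map test test_copy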

Regards,
Poi
Gregory Farnum <gfarnum@xxxxxxxxxx> wrote on Friday, August 31, 2018 at 12:51 AM:

On Thu, Aug 23, 2018 at 8:38 AM, poi <poiiiicen@xxxxxxxxx> wrote:
Hello!

Recently, we did data migration from one CRUSH root to another, but
after that we found that some objects were wrong and their copies on
other OSDs were also wrong.

Finally, we found that for one PG, the data migration uses only one
OSD's data to generate the three new copies, and does not check the
CRC before migration, as if assuming the data is always correct (but
nobody can actually guarantee that). We tried both FileStore and
BlueStore, and the results were the same. Copying from one PG without
a CRC check may lack reliability.

Exactly what version are you running, and what backends? Are you
actually using BlueStore?

This is certainly the general case with replicated pools on FileStore,
but it shouldn't happen with BlueStore or EC pools at all. We aren't
going to implement "voting" on FileStore-backed OSDs though as that
would vastly multiply the cost of backfilling. :(
-Greg
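
(For context, BlueStore's per-block checksums are controlled by
bluestore_csum_type, which defaults to crc32c; the snippet below is just an
illustration of that setting, and setting it to "none" would forfeit this
protection.)

[osd]
bluestore csum type = crc32c    # the default; "none" disables checksumming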


Is there any way to ensure the correctness of the data during data
migration? We could do a deep scrub before migration, but the cost is
too high. I think adding a CRC check for objects before copying,
during peering, may work.
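
(A deep scrub of just the pool's PGs before migration could be kicked off
roughly as in the sketch below; the output parsing may need adjusting for
your release, and as said the cost is high.)

-> # for pg in $(ceph pg ls-by-pool test | awk 'NR>1 {print $1}'); do
       ceph pg deep-scrub "$pg"
   done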

Regards

Poi



