Re: Copying without crc check when peering may lack reliability

Thanks for your reply.
The version we are running is Luminous 12.2.5, and we are actually
using BlueStore with replicated pools.

Our config is below:

-> # cat /etc/ceph/ceph.conf

[global]
fsid = 96c5f802-ca66-4d12-974f-5b5658a18353
mon_initial_members = ceph00
mon_host = 10.18.192.27
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
public_network = 10.18.192.0/24

[mon]
mon_allow_pool_delete = true

The experiment we ran is as follows:

First we create some OSDs and divide them into two different CRUSH roots.

-> # ceph osd tree
ID CLASS WEIGHT  TYPE NAME       STATUS REWEIGHT PRI-AFF
-7       0.05699 root root1
-5       0.05699     host ceph01
 3   hdd 0.01900         osd.3       up  1.00000 1.00000
 4   hdd 0.01900         osd.4       up  1.00000 1.00000
 5   hdd 0.01900         osd.5       up  1.00000 1.00000
-1       0.05846 root default
-3       0.05846     host ceph02
 0   hdd 0.01949         osd.0       up  1.00000 1.00000
 1   hdd 0.01949         osd.1       up  1.00000 1.00000
 2   hdd 0.01949         osd.2       up  1.00000 1.00000
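(The second root and host were arranged with the usual CRUSH bucket commands, roughly like the following; the exact invocation is not in the session transcript above, so this is a reconstruction:)

ceph osd crush add-bucket root1 root
ceph osd crush move ceph01 root=root1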

Then we create a replicated pool on root default. Note that the failure
domain is set to OSD.

-> # ceph osd pool create test 128 128
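(The rule with an osd failure domain is not shown in the transcript; on Luminous it can be created and assigned roughly like this, where the rule name is only illustrative:)

ceph osd crush rule create-replicated default_osd_rule default osd
ceph osd pool set test crush_rule default_osd_rule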

Next we put an object into the pool.

-> # cat txt
123
-> # rados -p test put test_copy txt
-> # rados -p test get test_copy -
123

Then we take OSD.0 down and modify its copy of object test_copy.
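(ceph-objectstore-tool needs the OSD daemon to be stopped before it can open the store; on a systemd host that is something like:)

systemctl stop ceph-osd@0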

-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 test_copy get-bytes
123
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 test_copy set-bytes 120txt

Next we start OSD.0 again and trigger the data migration by switching the pool's CRUSH rule.

-> # ceph osd pool set test crush_rule root1_rule
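(root1_rule does not appear earlier in this mail; it is a replicated rule rooted at root1 with an osd failure domain, created beforehand with something like:)

ceph osd crush rule create-replicated root1_rule root1 osd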

Finally, we try to read the object with rados and with ceph-objectstore-tool:

-> # rados -p test get test_copy -
error getting test/test_copy: (5) Input/output error
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 test_copy get-bytes
120
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 test_copy get-bytes
120
-> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 test_copy get-bytes
120

The copies of test_copy on OSD.3, OSD.4 and OSD.5 all came from OSD.0,
which held the silently corrupted data.
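(For reference, the set of OSDs holding the object after the rule change can be confirmed with:)

ceph osd map test test_copy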

Regards,
Poi
Gregory Farnum <gfarnum@xxxxxxxxxx> wrote on Fri, Aug 31, 2018 at 12:51 AM:
>
> On Thu, Aug 23, 2018 at 8:38 AM, poi <poiiiicen@xxxxxxxxx> wrote:
> > Hello!
> >
> > Recently, we did data migration from one crush root to another, but
> > after that, we found some objects were wrong and their copies on other
> > OSDs were also wrong.
> >
> > Finally, we found that for one pg, the data migration uses only one
> > OSD's data to generate the three new copies, and does not check the
> > crc before migration, as if the data were assumed to always be correct
> > (but actually nobody can promise that). We tried both filestore and
> > bluestore, and the results were the same. Copying from one pg without
> > a crc check may lack reliability.
>
> Exactly what version are you running, and what backends? Are you
> actually using BlueStore?
>
> This is certainly the general case with replicated pools on FileStore,
> but it shouldn't happen with BlueStore or EC pools at all. We aren't
> going to implement "voting" on FileStore-backed OSDs though as that
> would vastly multiply the cost of backfilling. :(
> -Greg
>
> >
> > Is there any way to ensure the correctness of the data during data
> > migration? We could do a deep scrub before migration, but the cost is
> > too high. I think adding a crc check for objects before copying during
> > peering may work.
> >
> > Regards
> >
> > Poi



