Re: rgw replication sync issue

Hi,

According to [1], error 125 means there was a race condition:

failed to sync bucket instance: (125) Operation canceled
A racing condition exists between writes to the same RADOS object.

Can you rewrite just the affected object? I'm not sure about the other error; maybe try rewriting that object as well, though I don't see how that would lead to a 25 TB difference. Or could this condition impact the entire sync? Hopefully someone with more multisite knowledge can comment. Is Ceph healthy? No inactive PGs or anything?
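To rewrite a single object in place, something like this should work (untested sketch on my side; <bucket-name> and <object-name> are placeholders for the values from your sync error list):

 radosgw-admin object rewrite --bucket=<bucket-name> --object=<object-name>

Afterwards you could re-check with "radosgw-admin bucket sync status --bucket=<bucket-name>" to see if the shard count goes down.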

[1] https://www.ibm.com/docs/en/storage-ceph/6?topic=gateway-error-code-definitions-ceph-object
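As for the health question, I would start with:

 ceph -s
 ceph health detail

and make sure all PGs are active+clean before digging deeper into the sync.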

Quoting ankit raikwar <ankit199999raikwar@xxxxxxxxx>:

Hello Users,

We have the environment described below. Both environments are zones of one RGW multisite zonegroup; the DC zone is the primary and the DR zone is the secondary at this point.

DC
Ceph version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of RGW daemons: 25

DR
Ceph version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
Number of RGW daemons: 25

Environment description:
Both zones are in production, and the RGW multisite traffic runs over an MPLS link of around 3 Gbps.

Issue description:
We enabled multisite between DC and DR about a month ago. The total data at the DC zone is around 159 TiB, and the sync had been progressing as expected. But at around 120 TiB synced, the speed dropped drastically: it had been around 2 Gbps and fell to below 10 Mbps, even though the link is not saturated. The output of "# radosgw-admin sync status" says "metadata is caught up with master" and "data is caught up with source", yet the DR zone is almost 25 TB behind the DC. "# radosgw-admin bucket sync status --bucket=<bucket-name>" also shows that the bucket is still behind on shards. The logs and outputs are attached below.

A full resync of the data from the beginning is not feasible in our case. The "# radosgw-admin sync error list" output is also attached below, with some information redacted, and we see errors.

 radosgw-admin sync status
          realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
      zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
           zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source

 radosgw-admin bucket sync status --bucket=<bucket-name>

          realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
      zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
           zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
         bucket :tc******rc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])

    source zone d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
  source bucket :tc*******arc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
                full sync: 14/9221 shards
                full sync: 49448693 objects completed
                incremental sync: 9207/9221 shards
                bucket is behind on 25 shards
                behind shards: [9,111,590,826,1774,2968,3132,3382,3386,3409,3685,3820,4174,4544,4708,4811,5733,6285,6558,7288,7417,7443,7876,8151,8878]

 radosgw-admin sync error list (excerpt)

        "id": "1_1690799008.725414_3926410.1",
        "section": "data",
        "name": "bucket0:d09d3d16-8601-448b-bf3d-609b8a29647d.89871.1:1949",
        "timestamp": "2023-07-31T10:23:28.725414Z",
        "info": {
            "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
            "error_code": 125,
            "message": "failed to sync bucket instance: (125) Operation canceled"
        }

        "id": "1_1690804503.144829_3759212.1",
        "section": "data",
        "name": "bucket1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:1232/S01/1/120/2b7ea802-efad-41d3-9d90-9**************523.txt",
        "timestamp": "2023-07-31T11:54:53.233451Z",
        "info": {
            "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
            "error_code": 5,
            "message": "failed to sync object(5) Input/output error"
        }



Thanks
Ankit
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

