Hi,
According to [1], error 125 means there was a race condition:

failed to sync bucket instance: (125) Operation canceled

"A race condition exists between writes to the same RADOS object."
Can you rewrite just the affected object on the primary? Not sure about
the other error (5, Input/output error), but maybe try rewriting that
object as well; see the sketch below. I'm not sure how that alone would
lead to a 25 TB difference, though. Or could this condition impact the
entire sync? Hopefully someone with more multisite knowledge can
comment. Is Ceph healthy? No inactive PGs or anything? (Quick checks
for that below, too.)
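One way to retry a stuck object is to rewrite it on the primary so that
a fresh bucket index log entry is written and data sync picks the
object up again. A minimal sketch using the AWS CLI against the DC
zone; the endpoint, bucket, and object key are placeholders you would
take from the "name" field of the sync error entry:

   # copy the object onto itself; --metadata-directive REPLACE forces
   # an actual rewrite (endpoint, bucket and key are placeholders)
   aws --endpoint-url https://<dc-rgw-endpoint> s3api copy-object \
     --bucket <bucket> --key <object-key> \
     --copy-source "<bucket>/<object-key>" \
     --metadata-directive REPLACE --metadata resync=1

Copying the object onto itself bumps its mtime and generates a new
bilog entry, which should cause the secondary to fetch it again.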
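And to rule out cluster-side problems, the standard health checks on
both zones should be enough to spot inactive PGs:

   # ceph -s
   # ceph health detail
   # ceph pg dump_stuck inactive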
[1]
https://www.ibm.com/docs/en/storage-ceph/6?topic=gateway-error-code-definitions-ceph-object
Quoting ankit raikwar <ankit199999raikwar@xxxxxxxxx>:
Hello Users,
We have the environment described below. Both sites are zones of one
RGW multisite zonegroup, where the DC zone is currently the primary
and the DR zone is the secondary.
DC
Ceph Version: 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)
Number of rgw daemons : 25
DR
Ceph Version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)
quincy (stable)
Number of rgw daemons : 25
Environment description:
Both zones are in production, and the RGW multisite replication runs
over an MPLS link of around 3 Gbps.
Issue description:
We enabled multisite between DC and DR about a month ago. The total
data in the DC zone is around 159 TiB, and the sync had been
progressing as expected. But when the sync reached around 120 TiB, the
speed dropped drastically: it used to be around 2 Gbps and fell below
10 Mbps, even though the link is not saturated. After checking
"# radosgw-admin sync status", the output says "metadata is caught up
with master" and "data is caught up with source", yet the DR zone is
almost 25 TB behind the DC. The per-bucket status ("# radosgw-admin
bucket sync status --bucket=<bucket-name>") also shows the bucket is
still behind on shards. The log and the output are attached below.
Issuing a full resync of the data from the beginning is not feasible
in our case. The "# radosgw-admin sync error list" output is also
attached, with some information redacted, and we do see errors.
# radosgw-admin sync status
          realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
      zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
           zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
# radosgw-admin bucket sync status --bucket=<bucket-name>
          realm 6a7fab77-64e3-453e-b54b-066bc8af2f00 (realm0)
      zonegroup be660604-d853-4f8e-a576-579cae2e07c2 (zg0)
           zone d06a8dd3-5bcb-486c-945b-2a98969ccd5f (fbd)
         bucket :tc******rc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])

    source zone d09d3d16-8601-448b-bf3d-609b8a29647d (ahd)
  source bucket :tc*******arc-b1[d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1])
                full sync: 14/9221 shards
                full sync: 49448693 objects completed
                incremental sync: 9207/9221 shards
                bucket is behind on 25 shards
                behind shards: [9,111,590,826,1774,2968,3132,3382,3386,3409,3685,3820,4174,4544,4708,4811,5733,6285,6558,7288,7417,7443,7876,8151,8878]
Output of "# radosgw-admin sync error list" (excerpt):

    {
        "id": "1_1690799008.725414_3926410.1",
        "section": "data",
        "name": "bucket0:d09d3d16-8601-448b-bf3d-609b8a29647d.89871.1:1949",
        "timestamp": "2023-07-31T10:23:28.725414Z",
        "info": {
            "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
            "error_code": 125,
            "message": "failed to sync bucket instance: (125) Operation canceled"
        }
    },
    {
        "id": "1_1690804503.144829_3759212.1",
        "section": "data",
        "name": "bucket1:d09d3d16-8601-448b-bf3d-609b8a29647d.38987.1:1232/S01/1/120/2b7ea802-efad-41d3-9d90-9**************523.txt",
        "timestamp": "2023-07-31T11:54:53.233451Z",
        "info": {
            "source_zone": "d09d3d16-8601-448b-bf3d-609b8a29647d",
            "error_code": 5,
            "message": "failed to sync object(5) Input/output error"
        }
    }
Thanks
Ankit
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx