On 08.08.24 08:31, Eugen Block wrote:
>> Redeploying stuff seems like a much too big hammer to get things
>> going again. Surely there must be something more reasonable?
>
> wouldn't a restart suffice?
Probably, but when we were handling this the first time around, a
redeploy was the first thing that we tried and actually made a
difference, so for the time being I stuck with it.
> Do you see anything in the 'radosgw-admin sync error list'? Maybe an
error prevents the sync from continuing?
We did some experiments today. There were entries in 'radosgw-admin sync
error list', but only for other shard numbers. Those looked like this:
{
    "shard_id": 1,
    "entries": [
        {
            "id": "1_1722601657.174881_2599019.1",
            "section": "data",
            "name": "tbackup-zone1:blah:4/workload_blah/snapshot_blah/vm_id_blah/vm_res_id_fblah_vda/blah",
            "timestamp": "2024-08-02T12:27:37.174881Z",
            "info": {
                "source_zone": "blah-blah-blah-blah-e8db1c51b705",
                "error_code": 11,
                "message": "failed to sync object(11) Resource temporarily unavailable"
            }
        },
It seems that the first part, "tbackup-zone1", is the actual bucket name.
The second part between the colons seems to be some long ID that was the
same in other mentions of the bucket, and the final part ("4" here) varied.
The objects mentioned in the errors seemed to have been deleted already
when I checked them later.
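(In case it helps anyone reading along: I believe that long ID is the
bucket's marker/instance ID and the varying number at the end is the bucket
index shard. If I'm not mistaken, the ID can be cross-checked against what
something like

radosgw-admin bucket stats --bucket=tbackup-zone1 | grep -E '"(id|marker)"'

reports for the bucket.)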
Today we had 2 shards behind: [30,39].
# radosgw-admin data sync status --rgw-realm backup --source-zone zone1-backup --shard-id 30
{
    "shard_id": 30,
    "marker": {
        "status": "incremental-sync",
        "marker": "00000000000000000000:00000000000000447240",
        "next_step_marker": "",
        "total_entries": 0,
        "pos": 0,
        "timestamp": "2024-08-08T10:28:40.756653Z"
    },
    "pending_buckets": [
        "tbackup-zone1:blah:1"
    ],
    "recovering_buckets": [],
    "current_time": "2024-08-08T11:30:47Z"
}
The other shard had the same pending bucket.
(Timestamps might not be exactly right because I ran the commands many
times and copied this from terminal history later.)
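(Side note for anyone digging at the same point: the per-bucket view can be
checked as well. Assuming the bucket name from the error entries above,
something like

radosgw-admin bucket sync status --bucket=tbackup-zone1 --source-zone=zone1-backup --rgw-realm backup

should show which of the bucket's index shards are behind on the source zone.)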
What I tried today was:
radosgw-admin bucket sync init --bucket="tbackup-zone1" --rgw-realm backup --source-zone=zone1-backup
followed by a "sync run", started in the background.
That didn't seem to have an immediate effect, but after some minutes we
got from 'radosgw-admin sync status --rgw-realm backup':
current time 2024-08-08T10:49:18Z
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is behind on 2 shards
behind shards: [30,39]
oldest incremental change not applied: 2024-08-08T10:28:41.601323+0000 [39]
1 shards are recovering
recovering shards: [36]
Unfortunately you can't see from this whether the recovery is making any progress.
Interestingly, the output from 'radosgw-admin sync error list
--rgw-realm backup' only goes up to shard 31... and for higher shard
numbers the output is simply "[]" instead of
"[
    {
        "shard_id": 30,
        "entries": []
    }
]".
(If I remember correctly, the sync error log itself only has 32 shards,
independent of the 128 data log shards, which would explain why the list
stops at shard 31.)
I think it was at this point that we decided to restart a single rgw
daemon on the other side. I don't know if we were "lucky" but this
resulted in a change:
# radosgw-admin sync status --rgw-realm backup
realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
zone xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
current time 2024-08-08T11:52:47Z
zonegroup features enabled: resharding
disabled: compress-encrypted
metadata sync no sync (zone is master)
data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 (zone1-backup)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
3 shards are recovering
recovering shards: [30,36,39]
After that it took another hour or so until the recovery was finished.
Probably the "radosgw-admin bucket sync run ... &" command finished
then too but after so long I wasn't looking all the time.
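(For the record, the restart was a single RGW daemon via the orchestrator
rather than a redeploy; with cephadm the daemon name can be looked up and
restarted with something like
ceph orch ps --daemon-type rgw
ceph orch daemon restart rgw.zone1-backup.<host>.<suffix>
where <host>.<suffix> is whatever cephadm generated for that daemon.)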
Now I am very curious if this fixed something permanently, or if we get
the same situation again tomorrow. I guess we will find out. We are
alerting on the sync getting behind...
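(For anyone wanting to do the same: such a check can be as simple as
grepping the status output, e.g. something along the lines of
radosgw-admin sync status --rgw-realm backup | grep 'oldest incremental change not applied'
and alerting when that timestamp gets too old.)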
>
> Quoting Olaf Seibert <o.seibert@xxxxxxxxxxxx>:
>
>> Hi all,
>>
>> we have some Ceph clusters with RGW replication between them. It
>> seems that in the last month at least, it gets stuck at around the same
>> time ~every day. Not 100% the same time, and also not 100% of the days,
>> but in more recent days it seems to happen more often, and for longer.
>>
>> With "stuck" I mean that the "oldest incremental change not applied"
is getting 5 or more minutes old, and not changing. In the past this
seemed to resolve itself in a short time, but recently it didn't. It
remained stuck at the same place for several hours. Also, on several
different occasions I noticed that the shard number in question was the
same.
>>
>> We are using Ceph 18.2.2, image id 719d4c40e096.
>>
>> The output on one end looks like this (I redacted out some of the
>> data because I don't know how much of the naming would be sensitive
>> information):
>>
>> root@zone2:/# radosgw-admin sync status --rgw-realm backup
>> realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
>> zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
>> zone xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
>> current time 2024-08-04T10:22:00Z
>> zonegroup features enabled: resharding
>> disabled: compress-encrypted
>> metadata sync no sync (zone is master)
>> data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 (zone1-backup)
>> syncing
>> full sync: 0/128 shards
>> incremental sync: 128/128 shards
>> data is behind on 3 shards
>> behind shards: [30,90,95]
>> oldest incremental change not applied: 2024-08-04T10:05:54.015403+0000 [30]
>>
>> while on the other side it looks ok (not more than half a minute
>> behind):
>>
>> root@zone1:/# radosgw-admin sync status --rgw-realm backup
>> realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
>> zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
>> zone xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 (zone1-backup)
>> current time 2024-08-04T10:23:05Z
>> zonegroup features enabled: resharding
>> disabled: compress-encrypted
>> metadata sync syncing
>> full sync: 0/64 shards
>> incremental sync: 64/64 shards
>> metadata is caught up with master
>> data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
>> syncing
>> full sync: 0/128 shards
>> incremental sync: 128/128 shards
>> data is behind on 4 shards
>> behind shards: [89,92,95,98]
>> oldest incremental change not applied: 2024-08-04T10:22:53.175975+0000 [95]
>>
>>
>> With some experimenting, we found that redeploying the RGWs on this
>> side resolves the situation: "ceph orch redeploy rgw.zone1-backup". The
>> shards go into "Recovering" state and after a short time it is "caught
>> up with source" as well.
>>
>> Redeploying stuff seems like a much too big hammer to get things
>> going again. Surely there must be something more reasonable?
>>
>> Also, any ideas about how we can find out what is causing this? It
>> may be that some customer has some job running every 24 hours, but that
>> shouldn't cause the replication to get stuck.
>>
>> Thanks in advance,
>>
>> --
>> Olaf Seibert
>> Site Reliability Engineer
--
Olaf Seibert
Site Reliability Engineer
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin
T +49 30 233 2012 0
F +49 30 616 7555 0
https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/
Current system status always at:
https://www.syseleven-status.net/
Company headquarters: Berlin
Registered court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Andreas Hermann, Jens Ihlenfeld, Norbert Müller,
Jens Plogsties
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx