Hi,
Redeploying stuff seems like a much too big hammer to get things
going again. Surely there must be something more reasonable?
Wouldn't a restart suffice?
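For example, with cephadm the RGW service can be restarted instead of redeployed (service name taken from your redeploy command; this is a sketch, it needs a live cluster):

```shell
# restart all daemons of the RGW service via the orchestrator
ceph orch restart rgw.zone1-backup
# or restart one specific daemon, if you want to narrow it down
# (replace the placeholder with a name from 'ceph orch ps')
ceph orch daemon restart rgw.zone1-backup.HOSTNAME.RANDOMID
```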
Do you see anything in 'radosgw-admin sync error list'? Maybe an
error is preventing the sync from continuing?
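For reference, something like this (realm name taken from your status output) would show, and afterwards clear, the recorded sync errors:

```shell
# list recorded sync errors for the realm
radosgw-admin sync error list --rgw-realm backup
# once inspected, old entries can be trimmed
radosgw-admin sync error trim --rgw-realm backup
```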
Quoting Olaf Seibert <o.seibert@xxxxxxxxxxxx>:
Hi all,
we have some Ceph clusters with RGW replication between them. For at
least the last month, replication seems to get stuck at around the
same time almost every day. It is not 100% the same time, and it does
not happen on 100% of the days, but in recent days it seems to happen
more often, and for longer.
By "stuck" I mean that the "oldest incremental change not applied"
timestamp is 5 or more minutes old and not advancing. In the past
this seemed to resolve itself within a short time, but recently it
didn't: it remained stuck at the same place for several hours. On
several different occasions I also noticed that the shard number in
question was the same.
We are using Ceph 18.2.2, image id 719d4c40e096.
The output on one end looks like this (I redacted some of the data
because I don't know how much of the naming would be sensitive
information):
root@zone2:/# radosgw-admin sync status --rgw-realm backup
          realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
      zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
           zone xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
   current time 2024-08-04T10:22:00Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync no sync (zone is master)
      data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 (zone1-backup)
                syncing
                full sync: 0/128 shards
                incremental sync: 128/128 shards
                data is behind on 3 shards
                behind shards: [30,90,95]
                oldest incremental change not applied: 2024-08-04T10:05:54.015403+0000 [30]
while on the other side it looks OK (not more than half a minute behind):
root@zone1:/# radosgw-admin sync status --rgw-realm backup
          realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
      zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
           zone xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 (zone1-backup)
   current time 2024-08-04T10:23:05Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
                syncing
                full sync: 0/128 shards
                incremental sync: 128/128 shards
                data is behind on 4 shards
                behind shards: [89,92,95,98]
                oldest incremental change not applied: 2024-08-04T10:22:53.175975+0000 [95]
With some experimenting, we found that redeploying the RGWs on this
side resolves the situation: "ceph orch redeploy rgw.zone1-backup".
The shards go into "Recovering" state, and after a short time the
status is "caught up with source" again.
Redeploying stuff seems like a much too big hammer to get things
going again. Surely there must be something more reasonable?
Also, any ideas about how we can find out what is causing this? It
may be that some customer has a job running every 24 hours, but that
shouldn't cause the replication to get stuck.
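So far the most detail we can get seems to be per-shard status, e.g. something like this (a sketch, flags as I understand the radosgw-admin CLI; shard 30 is the one that was stuck here):

```shell
# per-shard sync detail for the stuck shard, from the zone that is behind
radosgw-admin data sync status --source-zone zone1-backup --shard-id 30 --rgw-realm backup
# recent datalog entries, to see what keeps landing on that shard
# (assumption: datalog list accepts --shard-id in this release)
radosgw-admin datalog list --shard-id 30 --rgw-realm backup
```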
Thanks in advance,
--
Olaf Seibert
Site Reliability Engineer
SysEleven GmbH
Boxhagener Straße 80
10245 Berlin
T +49 30 233 2012 0
F +49 30 616 7555 0
https://www.syseleven.de
https://www.linkedin.com/company/syseleven-gmbh/
Current system status always at:
https://www.syseleven-status.net/
Company headquarters: Berlin
Registered court: AG Berlin Charlottenburg, HRB 108571 Berlin
Managing directors: Andreas Hermann, Jens Ihlenfeld, Norbert Müller,
Jens Plogsties
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx