Re: RGW sync gets stuck every day

Matthew Darwin <bugs@xxxxxxxxxx> · Thu, 12 Sep 2024 00:20:32 -0400

I'm on quincy.

I had lots of problems with RGW getting stuck.  Once I dedicated 1 
single RGW on each side to do replication, my problems went away.  
Having a cluster of RGW behind a load balancer seemed to be confusing 
things.

I still have multiple RGW for user-facing load, but a single RGW for 
datacentre replication (for each cluster).

On 2024-09-11 08:56, Olaf Seibert wrote:
So we still have this rgw synchronization that gets stuck every day 
and about the same time. We have alerting on it, so our on-call 
people are getting annoyed.

Summarizing: we see on the receiving end of an rgw sync something 
like this:

# radosgw-admin sync status --rgw-realm backup

          realm 0336648b-0fc3-46d6-99b9-8ddf4576ebab (backup)
      zonegroup 8a84e62c-0958-4e2e-8d61-58af9051e063 (backup)
           zone 7521c29d-e7f6-4812-9d0f-e1223ae425a4 (dus2-backup)
   current time 2024-09-11T10:05:58Z
zonegroup features enabled: resharding
                   disabled: compress-encrypted
  metadata sync no sync (zone is master)
      data sync source: 54cbf963-75ca-431d-8dc5-e8db1c51b705 
(ham1-backup)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 1 shards
                        behind shards: [36]
                        oldest incremental change not applied: 
2024-09-11T09:59:02.900939+0000 [36]

and that time at the end makes no progress. Last time we had the 
patience for it, this lasted for several hours. Strangely enough, if 
we look at this at other times, then sometimes it also gets behind, 
but then it usually fixes itself.

On the sending side, we have about 12 rgw daemons. One of them seems 
to be responsible for being stuck. If we restart them slowly one by 
one, after one particular daemon restarts, the shard goes to 
recovering state and after some minutes it has caught up.

So I got the journal from that daemon, a few days ago, and went 
looking for seemingly interesting lines. Here is what I found, with 
some annotations between (parentheses). I sorted them by group and 
the groups are not necessarily in chronological order.

Is there anything in these logs that points to a root cause, and 
more interesting, how to fix it? Is there anything else in there 
that I could look up and report?

Seemingly interesting entries in the rgw daemon that is responsible 
for shard 30.

behind shards: [30,35]
oldest incremental change not applied: 
2024-09-09T10:03:53.033852+0000 [30]

(lots of these are happening)

(it is very annoying that all requests are indicated with an 
address that seems to get reused all the time)
Sep 09 10:24:28 ham1-000215 radosgw[2104175]: ====== starting new 
request req=0x7f6a325e5660 =====
Sep 09 10:24:28 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a325e5660 op status=0 http_status=200 latency=0.000000000s 
======
Sep 09 10:24:28 ham1-000215 radosgw[2104175]: beast: 
0x7f6a325e5660: 10.34.32.5 - anonymous [09/Sep/2024:10:24:28.426 
+0000] "GET / HTTP/1.1" 200 231 - "nginx/1.25.5 (health check 
radosgw_ok)" - latency=0.000000000s

(lots of these are happening)

Sep 09 10:24:39 ham1-000215 radosgw[2104175]: ====== starting new 
request req=0x7f6a325e5660 =====
Sep 09 10:24:39 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a325e5660 op status=0 http_status=200 latency=0.000000000s 
======
Sep 09 10:24:39 ham1-000215 radosgw[2104175]: beast: 
0x7f6a325e5660: 10.34.32.21 - ceph-rgw-sync-backup 
[09/Sep/2024:10:24:39.599 +0000] "GET 
/admin/log/?type=data&id=84&info&rgwx-zonegroup=8a84e62c-0958-4e2e-8d61-58af9051e063 
HTTP/1.1" 200 38 - - - latency=0.000000000s

(there are a few of these, spread through the log)

Sep 09 07:39:55 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0
Sep 09 07:40:35 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0
Sep 09 07:41:35 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0
Sep 09 09:57:36 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0
Sep 09 09:59:16 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0
Sep 09 10:02:56 ham1-000215 radosgw[2104175]: 
RGW-SYNC:data:sync:shard[30]: incremental sync on 
triliobackup-ham1:fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1:1shard: 
30on gen 0

(no more reports about that; the time 10:03:53.033852+0000 "time of 
the oldest incremental change not applied" is ~1 minute after this)

(here is some sort of WARNING - it occurs every day, is this the 
trigger of the lack of progress? The last one occurs before the 
last of the messages above. Unfortunately the req number doesn't 
occur anywhere else in the log so it's not very helpful.)

Sep 09 09:58:07 ham1-000215 radosgw[2104175]: ====== starting new 
request req=0x7f6a32666660 =====
Sep 09 09:58:07 ham1-000215 radosgw[2104175]: req 
10197104386188992433 0.008000183s s3:complete_multipart WARNING: 
failed to unlock 
fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1__multipart_workload_c50c59c2-66d0-4c22-87a9-7ac557b1df50/snapshot_b855da51-1f07-4ca5-9cbe-6cfd4e0faf9b/vm_id_843d7585-8a86-467e-b903-27b1b0c38473/vm_res_id_1fd68158-768f-4ea4-90b0-04d857f7fc49_vda/c36ca5b3-008b-4ede-af0d-1e1c59679f23.2~kC-ezzxlX43AoVgGmbT1c5ccxoTyeZv.meta
Sep 09 09:58:07 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a32666660 op status=0 http_status=200 latency=0.008000183s 
======
Sep 09 09:58:07 ham1-000215 radosgw[2104175]: beast: 
0x7f6a32666660: 10.34.32.21 - ceph-radosgw-trilio-backup 
[09/Sep/2024:09:58:07.718 +0000] "POST 
/workload_c50c59c2-66d0-4c22-87a9-7ac557b1df50/snapshot_b855da51-1f07-4ca5-9cbe-6cfd4e0faf9b/vm_id_843d7585-8a86-467e-b903-27b1b0c38473/vm_res_id_1fd68158-768f-4ea4-90b0-04d857f7fc49_vda/c36ca5b3-008b-4ede-af0d-1e1c59679f23?uploadId=2~kC-ezzxlX43AoVgGmbT1c5ccxoTyeZv 
HTTP/1.1" 200 754 - "s3fs/1.90 (commit hash unknown; 
GnuTLS(gcrypt))" - latency=0.008000183s

Sep 09 10:01:25 ham1-000215 radosgw[2104175]: ====== starting new 
request req=0x7f6a32666660 =====
Sep 09 10:01:25 ham1-000215 radosgw[2104175]: req 
13832955583618729439 0.008000182s s3:complete_multipart WARNING: 
failed to unlock 
fab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1__multipart_workload_c50c59c2-66d0-4c22-87a9-7ac557b1df50/snapshot_8d03fbde-d351-4353-8798-8f84eff675cc/vm_id_843d7585-8a86-467e-b903-27b1b0c38473/vm_res_id_4ff90160-f943-427c-8527-07beced85f59_vda/09716dab-c5d6-4a33-9057-b18429bed3d9.2~tUXO0LVUMkzHEvcdAyYVXSMqOvydOux.meta
Sep 09 10:01:25 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a32666660 op status=0 http_status=200 latency=0.012000274s 
======
Sep 09 10:01:25 ham1-000215 radosgw[2104175]: beast: 
0x7f6a32666660: 10.34.32.5 - ceph-radosgw-trilio-backup 
[09/Sep/2024:10:01:25.003 +0000] "POST 
/workload_c50c59c2-66d0-4c22-87a9-7ac557b1df50/snapshot_8d03fbde-d351-4353-8798-8f84eff675cc/vm_id_843d7585-8a86-467e-b903-27b1b0c38473/vm_res_id_4ff90160-f943-427c-8527-07beced85f59_vda/09716dab-c5d6-4a33-9057-b18429bed3d9?uploadId=2~tUXO0LVUMkzHEvcdAyYVXSMqOvydOux 
HTTP/1.1" 200 754 - "s3fs/1.90 (commit hash unknown; 
GnuTLS(gcrypt))" - latency=0.012000274s

(here is some error - what is error -2? it could be a negative 
"ENOENT 2 No such file or directory". This fits with the 404.)

Sep 09 10:03:59 ham1-000215 radosgw[2104175]: ====== starting new 
request req=0x7f6a325e5660 =====
Sep 09 10:03:59 ham1-000215 radosgw[2104175]: req 
8299556935383576414 0.004000092s op->ERRORHANDLER: err_no=-2 
new_err_no=-2
Sep 09 10:03:59 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a325e5660 op status=0 http_status=404 latency=0.004000092s 
======
Sep 09 10:03:59 ham1-000215 radosgw[2104175]: beast: 
0x7f6a325e5660: 10.34.32.21 - ceph-radosgw-trilio-backup 
[09/Sep/2024:10:03:59.514 +0000] "HEAD 
/contego_tasks/snapshot_b855da51-1f07-4ca5-9cbe-6cfd4e0faf9b 
HTTP/1.1" 404 0 - "s3fs/1.90 (commit hash unknown; GnuTLS(gcrypt))" 
- latency=0.004000092s

(just before the restart; some request seems to be stuck for 1309 s)

Sep 09 10:24:42 ham1-000215 systemd[1]: Stopping Ceph 
rgw.ham1-backup.ham1-000215.eptemh for 
fc678131-d3f5-48f0-91f4-d81e2029f637...
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: received  signal: 
Terminated from /dev/init -- /usr/bin/radosgw -n 
client.rgw.ham1-backup.ham1-000215.eptemh -f --setuser ceph 
--setgroup ceph --default-log-to-file=false 
--default-log-to-journald=true --default-log-to-stderr=false (PID: 
1) UID: 0
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: handle_sigterm
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: handle_sigterm set 
alarm for 120
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: shutting down
Sep 09 10:24:43 ham1-000215 
ceph-fc678131-d3f5-48f0-91f4-d81e2029f637-rgw-ham1-backup-ham1-000215-eptemh[2104169]: 
2024-09-09T10:24:43.391+0000 7f6b63134640 -1 received  signal: 
Terminated from /dev/init -- /usr/bin/radosgw -n 
client.rgw.ham1-backup.ham1-000215.eptemh -f --setuser ceph 
--setgroup ceph --default-log-to-file=false 
--default-log-to-journald=true --default-log-to-stderr=false (PID: 
1) UID: 0
Sep 09 10:24:43 ham1-000215 
ceph-fc678131-d3f5-48f0-91f4-d81e2029f637-rgw-ham1-backup-ham1-000215-eptemh[2104169]: 
2024-09-09T10:24:43.391+0000 7f6b6699f880 -1 shutting down
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: req 
823990580885441354 1309.589843750s s3:get_obj iterate_obj() failed 
with -125
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: ====== req done 
req=0x7f6a32666660 op status=-125 http_status=200 
latency=1309.589843750s ======
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: beast: 
0x7f6a32666660: 10.34.32.21 - ceph-rgw-sync-backup 
[09/Sep/2024:10:02:53.801 +0000] "GET 
/triliobackup-ham1/workload_c50c59c2-66d0-4c22-87a9-7ac557b1df50%2Fsnapshot_8d03fbde-d351-4353-8798-8f84eff675cc%2Fvm_id_843d7585-8a86-467e-b903-27b1b0c38473%2Fvm_res_id_4ff90160-f943-427c-8527-07beced85f59_vda%2F09716dab-c5d6-4a33-9057-b18429bed3d9?rgwx-zonegroup=8a84e62c-0958-4e2e-8d61-58af9051e063&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=7521c29d-e7f6-4812-9d0f-e1223ae425a4%3Atriliobackup-ham1%3Afab149a7-8bf6-4231-aaf7-1fb6e2e2d4ca.68875820.1 
HTTP/1.1" 200 3347057253 - - - latency=1309.589843750s
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: RGW-SYNC:meta: ERROR: 
failed to fetch all metadata keys
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: rgw rados thread: 
ERROR: failed to run sync
Sep 09 10:24:43 ham1-000215 radosgw[2104175]: final shutdown

On 08.08.24 16:24, Olaf Seibert wrote:
On 08.08.24 08:31, Eugen Block wrote:
 >> Redeploying stuff seems like a much too big hammer to get 
things going again. Surely there must be something more reasonable?
 >
 > wouldn't a restart suffice?

Probably, but when we were handling this the first time around, a 
redeploy was the first thing that we tried and actually made a 
difference, so for the time being I stuck with it.

 > Do you see anything in the 'radosgw-admin sync error list'? 
Maybe an error prevents the sync from continuing?

We did some experiments today. There were entries in 'radosgw-admin 
sync error list' but only for other shard numbers. Those looked like

     {
         "shard_id": 1,
         "entries": [
             {
                 "id": "1_1722601657.174881_2599019.1",
                 "section": "data",
                 "name": "tbackup-zone1:blah:4/workload_blah/ 
snapshot_blah/vm_id_blah/vm_res_id_fblah_vda/blah",
                 "timestamp": "2024-08-02T12:27:37.174881Z",
                 "info": {
                     "source_zone": 
"blah-blah-blah-blah-e8db1c51b705",
                     "error_code": 11,
                     "message": "failed to sync object(11) Resource 
temporarily unavailable"
                 }
             },

It seems that the first part, "tbackup-zone1" is the actual bucket 
name. The second part between colons seems to be some long ID that 
seemed the same in other mentions of the bucket. And the final 
part, "4" here, varied.

The object that is mentioned seemed to be deleted when I checked 
them later.

Today we had 2 shards behind: [30,39].

# radosgw-admin data sync status --rgw-realm backup --source-zone 
zone1- backup --shard-id 30
{
     "shard_id": 30,
     "marker": {
         "status": "incremental-sync",
         "marker": "00000000000000000000:00000000000000447240",
         "next_step_marker": "",
         "total_entries": 0,
         "pos": 0,
         "timestamp": "2024-08-08T10:28:40.756653Z"
     },
     "pending_buckets": [
         "tbackup-zone1:blah:1"
     ],
     "recovering_buckets": [],
     "current_time": "2024-08-08T11:30:47Z"
}

The other shard number had the same pending bucket.
(Timestamps might not be exactly right because I ran commands many 
times and I copied this from terminal history later)

What I tried today was

radosgw-admin bucket  sync init --bucket="tbackup-zone1" 
--rgw-realm backup --source-zone=zone1-backup

followed by a "sync run", started in the background.

That didn't seem to have an immediate effect, but after some 
minutes we got from 'radosgw-admin sync status --rgw-realm backup':

    current time 2024-08-08T10:49:18Z
                         syncing
                         full sync: 0/128 shards
                         incremental sync: 128/128 shards
                         data is behind on 2 shards
                         behind shards: [30,39]
                         oldest incremental change not applied: 
2024-08-08T10:28:41.601323+0000 [39]
                         1 shards are recovering
                         recovering shards: [36]

Unfortunately you can't see from this if the recovering makes any 
progress.

Interestingly, the output from 'radosgw-admin sync error list 
--rgw- realm backup' only goes up to shard 31... and for higher 
numbers the output is simply "[]" instead of "
[
     {
         "shard_id": 30,
         "entries": []
     }
]".

I think it was at this point that we decided to restart a single 
rgw daemon on the other side. I don't know if we were "lucky" but 
this resulted in a change:

# radosgw-admin sync status --rgw-realm backup
           realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
       zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
            zone xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 (zone2-backup)
    current time 2024-08-08T11:52:47Z
zonegroup features enabled: resharding
                    disabled: compress-encrypted
   metadata sync no sync (zone is master)
       data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 
(zone1- backup)
                         syncing
                         full sync: 0/128 shards
                         incremental sync: 128/128 shards
                         3 shards are recovering
                         recovering shards: [30,36,39]

After that it took another hour or so until the recovery was 
finished. Probably the "radosgw-admin bucket  sync run ... &" 
command finished then too but after so long I wasn't looking all 
the time.

Now I am very curious if this fixed something permanently, or if we 
get the same situation again tomorrow. I guess we will find out. We 
are alerting on the sync getting behind...

 >
 > Zitat von Olaf Seibert <o.seibert@xxxxxxxxxxxx>:
 >
 >> Hi all,
 >>
 >> we have some Ceph clusters with RGW replication between them. 
It seems that in the last month at least, it gets stuck at around 
the same time ~every day. Not 100% the same time, and also not 100% 
of the days, but in the more recent days seem to happen more, and 
for longer.
 >>
 >> With "stuck" I mean that the "oldest incremental change not 
applied" is getting 5 or more minutes old, and not changing. In the 
past this seemed to resolve itself in a short time, but recently it 
didn't. It remained stuck at the same place for several hours. 
Also, on several different occasions I noticed that the shard 
number in question was the same.
 >>
 >> We are using Ceph 18.2.2, image id 719d4c40e096.
 >>
 >> The output on one end looks like this (I redacted out some of 
the data because I don't know how much of the naming would be 
sensitive information):
 >>
 >> root@zone2:/# radosgw-admin sync status --rgw-realm backup
 >>           realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
 >>       zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
 >>            zone xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 
(zone2-backup)
 >>    current time 2024-08-04T10:22:00Z
 >> zonegroup features enabled: resharding
 >>                    disabled: compress-encrypted
 >>   metadata sync no sync (zone is master)
 >>       data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 
(zone1- backup)
 >>                         syncing
 >>                         full sync: 0/128 shards
 >>                         incremental sync: 128/128 shards
 >>                         data is behind on 3 shards
 >>                         behind shards: [30,90,95]
 >>                         oldest incremental change not applied: 
2024-08-04T10:05:54.015403+0000 [30]
 >>
 >> while on the other side it looks ok (not more than half a 
minute behind):
 >>
 >> root@zone1:/# radosgw-admin sync status --rgw-realm backup
 >>           realm xxxxxxxx-xxxx-xxxx-xxxx-8ddf4576ebab (backup)
 >>       zonegroup xxxxxxxx-xxxx-xxxx-xxxx-58af9051e063 (backup)
 >>            zone xxxxxxxx-xxxx-xxxx-xxxx-e8db1c51b705 
(zone1-backup)
 >>    current time 2024-08-04T10:23:05Z
 >> zonegroup features enabled: resharding
 >>                    disabled: compress-encrypted
 >>   metadata sync syncing
 >>                 full sync: 0/64 shards
 >>                 incremental sync: 64/64 shards
 >>                 metadata is caught up with master
 >>       data sync source: xxxxxxxx-xxxx-xxxx-xxxx-e1223ae425a4 
(zone2- backup)
 >>                         syncing
 >>                         full sync: 0/128 shards
 >>                         incremental sync: 128/128 shards
 >>                         data is behind on 4 shards
 >>                         behind shards: [89,92,95,98]
 >>                         oldest incremental change not applied: 
2024-08-04T10:22:53.175975+0000 [95]
 >>
 >>
 >> With some experimenting, we found that redeploying the RGWs on 
this side resolves the situation: "ceph orch redeploy 
rgw.zone1-backup". The shards go into "Recovering" state and after 
a short time it is "caught up with source" as well.
 >>
 >> Redeploying stuff seems like a much too big hammer to get 
things going again. Surely there must be something more reasonable?
 >>
 >> Also, any ideas about how we can find out what is causing this? 
It may be that some customer has some job running every 24 hours, 
but that shouldn't cause the replication to get stuck.
 >>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx