RGW multisite slowness issue due to the "304 Not Modified" responses on primary zone

praveenkumargpk17@xxxxxxxxx · Tue, 05 Mar 2024 09:48:48 -0000

Hi,
We have 2 clusters (v18.2.1) primarily used for RGW which has over 2+ billion RGW objects. They are also in multisite configuration totaling to 2 zones and we've got around 2 Gbps of bandwidth dedicated (P2P) for the multisite traffic. We see that using "radosgw-admin sync status" on the zone 2, all the 128 shards are recovering and unfortunately there is very less data transfer from primary zone ie., the link utilization is barely 100 Mbps / 2 Gbps. Our objects are quite small as well like avg. of 1 MB in size. 
On further inspection, we noticed the rgw access the logs at primary site are mostly yielding "304 Not Modified" for RGWs at site-2. Is this expected? Here are some of the logs (information is redacted)

root@host-04:~# tail -f /var/log/haproxy-msync.log
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:33730 [12/Feb/2024:05:06:51.047] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/1/0/0 0/0 "GET /bucket1/object1.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"
Feb 12 05:06:51 host-04 haproxy[971171]: 10.1.85.14:59730 [12/Feb/2024:05:06:51.048] https~ backend/host-04-msync 0/0/0/2/2 304 143 - - ---- 56/55/3/1/0 0/0 "GET /bucket1/object91.jpg?rgwx-zonegroup=71dceb3d-3092-4dc6-897f-a9abf60c9972&rgwx-prepend-metadata=true&rgwx-sync-manifest&rgwx-sync-cloudtiered&rgwx-skip-decrypt&rgwx-if-not-replicated-to=a8204ce2-b69e-4d90-bca1-93edd05a1a29%3Abucket1%3A8b96aea5-c763-40a3-8430-efd67cff0c62.20010.7 HTTP/1.1"

We also took a look at our Grafana instance and out of 1000 requests / second, 200 requests are getting  "200 OK" and 800 requests are getting "304 Not Modified". Sync threads are run on only 2 rgw daemons per zone and are behind a Load Balancer. "# radosgw-admin sync error list" also contains around 20 errors which are mostly automatically recoverable.
As we understand, does it mean that RGW multisite sync logs in the log pool are yet to be generated or some sort? Please provide us some insights and let us know how to resolve this.

Thanks,
Praveen
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx