Metadata sync fails after promoting new zone to master - mdlog buffer read issue

I have a 3-zone multi-site setup running Ceph Luminous (12.2.4) on Ubuntu 18.04. I used ceph-deploy to build each cluster and followed https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html-single/object_gateway_guide_for_red_hat_enterprise_linux/index#multi_site for the multi-site setup, and I have purged the multi-site config (i.e. purged all pools and started from scratch) twice now. This is a testing setup, so no data has been lost.
 
On initial setup, I have no issues: metadata and normal data sync as expected and work great. As soon as I promote a different zone to master, though, metadata sync breaks on both secondaries. I'll note that data sync still works, as object counts in each zone's $ZONE.rgw.buckets.data pool grow even when I create new buckets and push data to them. I've narrowed the lack of metadata sync down to a 500 being returned by the rados gateway in the master zone when the secondaries request /admin/log?type=metadata&rgwx-zonegroup=$ZONEGROUPID (triggered by running radosgw-admin metadata sync run).
 
Associated logs on the rgw instance in the master zone:
 
2018-05-16 06:25:17.518662 7f8957499700  1 ====== starting new request req=0x7f89574931e0 =====
2018-05-16 06:25:17.520186 7f8957499700  1 failed to decode the mdlog history: buffer::end_of_buffer
2018-05-16 06:25:17.520195 7f8957499700  1 failed to read mdlog history: (5) Input/output error
2018-05-16 06:25:17.520207 7f8957499700  0 WARNING: set_req_state_err err_no=5 resorting to 500
2018-05-16 06:25:17.520263 7f8957499700  1 ====== req done req=0x7f89574931e0 op status=0 http_status=500 ======
2018-05-16 06:25:17.520314 7f8957499700  1 civetweb: 0x55a584cb9000: $INTERNAL_IP
 
However, a 'radosgw-admin mdlog list' works just fine and returns what appears to be a perfectly valid log.
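Since the gateway complains it cannot decode the mdlog history while the mdlog entries themselves list fine, it may help to inspect the history object directly. A hedged sketch, assuming the history lives in a RADOS object named meta.history in the master zone's log pool ($ZONE.rgw.log is a placeholder for your zone's log pool name):

```shell
# Check whether the mdlog history object exists and what size it is
# (a zero-length or truncated object would explain buffer::end_of_buffer).
rados -p $ZONE.rgw.log stat meta.history

# Dump its raw contents for inspection.
rados -p $ZONE.rgw.log get meta.history - | hexdump -C | head
```

If the object turns out to be empty or truncated, that would point at the history being corrupted during the master-zone promotion rather than at the secondaries' requests.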

I have ensured that each of the secondaries has pulled the latest period, and I have restarted all the gateways. All rgw instances agree on the current period as well as on the master zone.
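For reference, the pull-and-restart steps above look roughly like the following on each secondary. The endpoint URL, credentials, and service instance name are placeholders for illustration:

```shell
# Pull the current period from the new master zone's endpoint
# (requires the system user's credentials).
radosgw-admin period pull --url=http://new-master-rgw:8080 \
    --access-key=$SYSTEM_ACCESS_KEY --secret=$SYSTEM_SECRET_KEY

# Verify the period id matches what the master reports.
radosgw-admin period get-current

# Restart the local gateway so it picks up the new period.
systemctl restart ceph-radosgw@rgw.$(hostname -s)
```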

Any ideas on what may be going on?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
