Radosgw agent and federated config problems


We run a ceph cluster with radosgw on top of it. During installation we never specified any regions or zones, which means that every bucket currently resides in the default region. To prepare for a federated config we built a test cluster that replicates the current production setup, with the same default region and zone. Once that setup was running we went through the following steps to make the switch to a federated config. Our second zone is completely empty to begin with and has no data in it at this point.

1) We created a new region that includes the api_name, master_zone and endpoints for our two zones.
2) We created two users in zone1 and zone2 with the same access and secret key across the two zones.
3) We created two zones with default pools and specified the access and secret key.
4) We changed ceph.conf to include the new region and zone and pushed it to our nodes.
5) We set our new region as the default region through radosgw-admin and removed the old default region.
6) We updated the regionmap to reflect the changes we made to our regions.

This last step proved to be a little difficult, as radosgw-admin regionmap update returns:
7f7b36b7b840 -1 cannot update region map, master_region conflict

The master_region is set to 'ams' in both clusters.

We may run into issues later on because we solved this the 'hard way', by changing the regionmap manually.

7) As we had changed our region and zones, we restarted radosgw. As expected this took our objects offline.
8) We updated all buckets to sit in the new region.

Once the buckets had been updated, all of our objects were back online again.

We have not made any changes to our pools. The new region points to the existing pools, so none of this resulted in any physical movement of data. Once this was all done the cluster was up and running, as expected, but serving its content from the new zone.
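For readers following along, the region definition from step 1 can be sketched as below. This is only an illustration, not the file we actually used: the field names follow the region JSON layout from the federated configuration documentation as far as I recall it, the region name 'ams' and the zone names come from this post, and the endpoint URLs and the build_region helper are made up for the example.

```python
import json


def build_region(name, master_zone, zone_endpoints):
    """Build a region definition dict (hypothetical helper).

    zone_endpoints maps each zone name to a list of endpoint URLs.
    """
    if master_zone not in zone_endpoints:
        raise ValueError("master_zone must be one of the configured zones")
    return {
        "name": name,
        "api_name": name,
        "is_master": "true",
        # The region endpoints are assumed here to be those of the master zone.
        "endpoints": list(zone_endpoints[master_zone]),
        "master_zone": master_zone,
        "zones": [
            {
                "name": zone,
                "endpoints": list(endpoints),
                "log_meta": "true",
                "log_data": "true",
            }
            for zone, endpoints in zone_endpoints.items()
        ],
        "placement_targets": [{"name": "default-placement", "tags": []}],
        "default_placement": "default-placement",
    }


# Placeholder endpoints, not our real hosts.
region = build_region(
    "ams",
    "zone1",
    {
        "zone1": ["http://zone1.example.com:80/"],
        "zone2": ["http://zone2.example.com:80/"],
    },
)
print(json.dumps(region, indent=2))
```

The printed JSON is the kind of document that radosgw-admin region set takes as input; the exact set of fields may differ between releases, so it is worth comparing against the output of radosgw-admin region get on your own cluster.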

At this point we set up radosgw-agent with the users from steps 2 and 3 matching our zones. The first time we did this we ran into a couple of problems. The first was that the radosgw-agent available in the repository is a little older than the one on github. That version lacks a lot of exception handling and proper error output, making it difficult to diagnose issues as they come up. We switched to the latest available version from github, which helped us a lot in getting to where we are now. We also had to switch radosgw from sockets to tcp first, but the manual didn't include a specific parameter, which led to radosgw not being able to handle /-characters properly. Once we added AllowEncodedSlashes it all worked.
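To illustrate why the slash handling matters: when a client percent-encodes an object key that contains a slash, the slash ends up as %2F in the request path, and Apache with AllowEncodedSlashes left at its default (Off) refuses such URLs with a 404 before they ever reach radosgw. The snippet below only shows the encoding itself; whether and where the agent or boto encodes keys this way is an assumption on my part, and the key is the test object from this post.

```python
from urllib.parse import quote, unquote

key = "object/test.txt"

# With safe="" the slash is percent-encoded as well.
encoded = quote(key, safe="")
print(encoded)

# By default quote() leaves "/" alone, so the path is unchanged.
print(quote(key))

# Decoding the encoded form gives back the original key.
print(unquote(encoded))
```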

As it took us quite some time and fiddling around to get to this point, we wanted to replicate the exact same situation on another test environment, to make sure we know what to do when we change this in a live environment. And this is where it all fails. We are unable to get this setup up and running again. We've compared configurations and checked every single setting we've played with, but we're unable to find what's going wrong. The error message is pretty obvious though:

2015-04-24 15:37:55,073 9406 [radosgw_agent.worker][DEBUG ] syncing object object/test.txt
2015-04-24 15:37:55,089 9406 [radosgw_agent.worker][DEBUG ] object "object/test.txt" not found on master, deleting from secondary

I was expecting to find this request in our Apache log files, where it would surely show up as a 404. It turns out, though, that we're not seeing any log entries for it at all. When I look at the logs in zone2, however, I see the following:

[24/Apr/2015:15:45:01 +0000] "PUT /object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1&rgwx-source-zone=zone1&rgwx-client-id=radosgw-agent HTTP/1.1" 404 242 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"
[24/Apr/2015:15:45:01 +0000] "GET /object/?max-keys=0 HTTP/1.1" 200 408 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"
[24/Apr/2015:15:45:01 +0000] "DELETE /object/test.txt HTTP/1.1" 204 126 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"
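For anyone who wants to pick such lines apart, here is a rough sketch that parses the three entries above and lists the requests that failed. The regular expression is written against exactly this log format and is my own; it is not part of radosgw-agent or Apache, and it would need adjusting for a different LogFormat.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches: [time] "METHOD target HTTP/x.y" status size ...
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_line(line):
    """Return a dict describing one access log line, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    target = urlsplit(match.group("target"))
    query = parse_qs(target.query)
    return {
        "time": match.group("time"),
        "method": match.group("method"),
        "path": target.path,
        "status": int(match.group("status")),
        # Only the sync request issued on behalf of the agent carries this parameter.
        "source_zone": query.get("rgwx-source-zone", [None])[0],
    }


lines = [
    '[24/Apr/2015:15:45:01 +0000] "PUT /object/test.txt?rgwx-op-id=radosgw1%3A9727%3A1&rgwx-source-zone=zone1&rgwx-client-id=radosgw-agent HTTP/1.1" 404 242 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
    '[24/Apr/2015:15:45:01 +0000] "GET /object/?max-keys=0 HTTP/1.1" 200 408 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
    '[24/Apr/2015:15:45:01 +0000] "DELETE /object/test.txt HTTP/1.1" 204 126 "-" "Boto/2.20.1 Python/2.7.6 Linux/3.13.0-49-generic"',
]

entries = [parse_line(line) for line in lines]
failures = [entry for entry in entries if entry and entry["status"] >= 400]
for entry in failures:
    print(entry["method"], entry["path"], entry["status"], entry["source_zone"])
```

On these three lines the only failure is the PUT that carries rgwx-source-zone=zone1, which zone2 answers with a 404; the DELETE that follows succeeds with a 204.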

We’re running ceph and radosgw 0.94.1. The agent comes from github, as the one that’s in the repository doesn’t seem entirely stable, nor very clear in its error messages.

We may well be missing something, but it feels like radosgw-agent isn’t production ready yet. Any thoughts?

Thanks,
Thomas

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




