rgw: new multisite update

Orit, Casey and I have been working for quite a while now on a new
multi-site framework that will replace the current sync-agent system.
There are many improvements that we are working on, and we think it
will be worth the wait (and effort!).
First and foremost, the main complaint that we hear about rgw, and
specifically about its multisite feature, is the configuration
complexity. The new scheme will make things much easier to configure,
and will handle changes dynamically. Many new commands will remove
the need to manually edit and inject json configurations as was
previously required. Changes to the zone configuration will be
applied to running gateways, which will be able to handle those
changes without needing to restart the processes.
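
As a quick illustration, updating a zone's configuration should look
something like this (a sketch based on the commands in the new
branch; exact names and flags may still change):

$ radosgw-admin zone modify --rgw-zone=us-1 --endpoints=http://rgw1:80
$ radosgw-admin period update --commit

Running gateways should pick up the committed change without being
restarted.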

We're getting rid of the sync agent; the gateways themselves will
handle the sync process. Removing the sync agent makes the system
much easier to set up and configure, and it also improves the sync
process itself in many respects.
Note that 'region' is now called 'zone group'; there was too much
confusion with the old term. Metadata changes will keep the
master-slave scheme to keep things simpler on that front, but data
sync will now be active-active, so all zones within the same zone
group will be writable.
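
For example, once things are set up (see the sample config below),
objects written through either zone's endpoint should be synced to
the other zone. A rough sketch with s3cmd, assuming one config file
per zone endpoint (the config file names and bucket are made up):

$ s3cmd -c s3cfg-us-1 mb s3://test
$ s3cmd -c s3cfg-us-1 put ./a.txt s3://test/a.txt
$ s3cmd -c s3cfg-us-2 put ./b.txt s3://test/b.txt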
We added a new entity called 'realm', which is a container for zone
groups. Multiple realms can be created, which makes it possible to
run completely different configurations on the same clusters with
minimal effort.
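
For example, using the same command that appears in the sample
config below:

$ radosgw-admin realm create --rgw-realm=earth
$ radosgw-admin realm create --rgw-realm=mars
$ radosgw-admin realm list

Each realm then carries its own zone groups and zones, independent
of the others.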
A new entity called 'period' (might be renamed to 'config') holds the
realm configuration structure. A period changes when the master zone
changes, and the period's epoch is incremented whenever there's a
change in the configuration that does not modify the master.
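
To make that concrete, here's a rough sketch of what a period looks
like (field names are taken from the current branch and may still
change):

$ radosgw-admin period get
{
    "id": "<period id>",
    "epoch": 1,
    "predecessor_uuid": "<id of the previous period>",
    "master_zonegroup": "us",
    "master_zone": "us-1",
    ...
}

A master zone change starts a new period (new id, with
predecessor_uuid pointing at the old one); any other configuration
change just bumps the epoch.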

New radosgw-admin commands were added to provide a better view into
the status of the sync process itself. The scheme still requires
handling 3 different logs (metadata, data, bucket indexes), and the
sync statuses reflect the position in those logs (for incremental
sync), or which entry is being synced (for the full sync).
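
For example:

# high-level view of the metadata and data sync for the current zone
$ radosgw-admin sync status
# the three underlying logs can also be inspected directly
$ radosgw-admin mdlog list
$ radosgw-admin datalog list
$ radosgw-admin bilog list --bucket=<bucket>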

There is also a new admin socket command ('cr dump') that dumps the
current state of the coroutines framework (which was created for this
feature); it helps quite a bit with debugging problems.
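
Assuming a default admin socket location (the actual path depends on
your setup), it can be invoked like this:

$ ceph --admin-daemon /var/run/ceph/ceph-client.rgw.us-1.asok cr dump

which dumps all of the currently running coroutine stacks and their
states.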

Migrating from the old sync agent will require the new sync to start
from scratch. Note that this process should not copy any actual data;
the sync will just need to build the new sync status (and verify that
all the data is in place in the zones).

So, when is this going to be ready?

We're aiming at having it in Jewel. At the moment nothing is merged
yet (still at the wip-rgw-new-multisite branch); we're trying to make
sure that things still work against it (e.g., the sync agent can
still work), and we'll get it merged once we feel comfortable with
the backward compatibility. The metadata sync is still missing some
functionality related to failover recovery, and the error reporting
and retry logic still need some more work. The data sync itself has a
few cases that we don't handle correctly. The realm/period
bootstrapping still needs some more work. Documentation is almost
non-existent. But the most important piece that we actually need to
work on is the testing. We need to make sure that we have test
coverage for all the new functionality. Which brings me to this:

It would be great if we had people outside of the team that could
take an early look at it and help with mapping the pain points. It
would be even better if someone could help with the actual
development of the automated tests (via teuthology), but even just
manually testing and reporting any issues would help a lot. Note:
Danger! Danger! This could and will eat your data! It shouldn't be
tested in a production environment (yet!).

The following is a sample config of a single zone group with two
separate zones. We set up the zones on two machines, rgw1 and rgw2,
where rgw1 serves as the master for metadata.

Note that there are a bunch of commands that we'll be able to drop
later (e.g., all the default-setting ones, and the separate commands
to create a zone and attach it to a zonegroup). In some cases, when
you create the first realm/zonegroup/zone, the entity will
automatically become the default. However, I've run into some issues
when trying to set up multiple realms on a single cluster, where not
having a default caused problems. We'll need to clean that up.


access_key=<access key>
secret=<secret>

# run on rgw1
$ radosgw-admin realm create --rgw-realm=earth
$ radosgw-admin zonegroup create --rgw-zonegroup=us \
    --endpoints=http://rgw1:80 --master
$ radosgw-admin zonegroup default --rgw-zonegroup=us
$ radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-1 \
    --access-key=${access_key} --secret=${secret} \
    --endpoints=http://rgw1:80
$ radosgw-admin zone default --rgw-zone=us-1
$ radosgw-admin zonegroup add --rgw-zonegroup=us --rgw-zone=us-1
$ radosgw-admin user create --uid=zone.jup --display-name="Zone User" \
    --access-key=${access_key} --secret=${secret} --system
$ radosgw-admin period update --commit

$ radosgw --rgw-zone=us-1 --rgw-frontends="civetweb port=80"

# run on rgw2
$ radosgw-admin realm pull --url=http://rgw1 \
    --access-key=${access_key} --secret=${secret}
$ radosgw-admin realm default --rgw-realm=earth
$ radosgw-admin zonegroup default --rgw-zonegroup=us
$ radosgw-admin zone create --rgw-zonegroup=us --rgw-zone=us-2 \
    --access-key=${access_key} --secret=${secret} \
    --endpoints=http://rgw2:80
$ radosgw-admin period update --commit

$ radosgw --rgw-zone=us-2 --rgw-frontends="civetweb port=80"

At this point both zones should be running and syncing from each
other. There are still a lot of rough edges, things that we're
working to fix and clean up. As I said, it would be great to have
some other people trying this, so that we can better understand the
pain points and map the issues. As mentioned before, this whole
feature can be found at the wip-rgw-new-multisite development branch.
It should be used only in test environments, and it *will* eat your
data (and if by any chance it doesn't eat your data, let us know and
we'll try to figure out what went wrong). Beware!
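
A quick way to verify that the two zones are actually syncing is to
run, on either node:

$ radosgw-admin sync status

which should report the metadata and data sync state relative to the
other zone.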

Thanks!
Yehuda