Re: Ceph RBD Mirroring

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Wed, 11 Sep 2019 18:57:05 +0200

Dear Jason,

I played a bit more with rbd mirroring and learned that deleting an image at the source (or disabling journaling on it) immediately moves the image to trash at the target -
but setting rbd_mirroring_delete_delay helps to have some more grace time to catch human mistakes.

However, I have issues restoring such an image which has been moved to trash by the RBD-mirror daemon as user:
-----------------------------------
[root@mon001 ~]# rbd trash ls -la
ID           NAME                             SOURCE    DELETED_AT               STATUS                                   PARENT
d4fbe8f63905 test-vm-XXXXXXXXXXXXXXXXXX-disk2 MIRRORING Wed Sep 11 18:43:14 2019 protected until Thu Sep 12 18:43:14 2019
[root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905
rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 librbd::api::Trash: restore: Current trash source: mirroring does not match expected: user
(22) Invalid argument
-----------------------------------
This is issued on the mon, which has the client.admin key, so it should not be a permission issue.
It also fails when I try that in the Dashboard.

Sadly, the error message is not clear enough for me to figure out what could be the problem - do you see what I did wrong?
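
A side note on the listing above: the gap between DELETED_AT and the "protected until" time presumably reflects the configured rbd_mirroring_delete_delay. The sketch below only does that date arithmetic on the two timestamps from the listing; it assumes the option is expressed in seconds, and the helper name is made up for illustration. It does not address the restore error itself.

```python
from datetime import datetime

# Timestamp layout as printed by "rbd trash ls" in the listing above.
FMT = "%a %b %d %H:%M:%S %Y"

def grace_period_seconds(deleted_at, protected_until):
    """Return the protection window between the two timestamps, in seconds."""
    start = datetime.strptime(deleted_at, FMT)
    end = datetime.strptime(protected_until, FMT)
    return int((end - start).total_seconds())

print(grace_period_seconds("Wed Sep 11 18:43:14 2019",
                           "Thu Sep 12 18:43:14 2019"))  # 86400
```

So the image in the listing is protected for exactly one day after rbd-mirror moved it to the trash.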

Cheers and thanks again,
	Oliver

On 2019-09-10 23:17, Oliver Freyermuth wrote:
Dear Jason,

On 2019-09-10 23:04, Jason Dillaman wrote:
On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

Dear Jason,

On 2019-09-10 18:50, Jason Dillaman wrote:
On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
<freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

Dear Cephalopodians,

I have two questions about RBD mirroring.

1) I can not get it to work - my setup is:

     - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph",
       running latest Mimic.
       I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?):
       ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph
       In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored).

You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps --
but you definitely need more than read-only permissions to the remote
cluster since it needs to be able to create snapshots of remote images
and update/trim the image journals.

These profiles really make life a lot easier. I should have thought of them rather than "guessing" a potentially good configuration...

     - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".
       I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above.
       I've then run:
       ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
       and deployed that key on the rbd-mirror machine, and started the service with:

Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].

That did the trick (in combination with the above)!
Again a case of PEBKAC: I should have read the documentation until the end, clearly my fault.
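
To summarize the caps that ended up working, here is a minimal dry-run sketch that only assembles the two "ceph auth get-or-create" command lines from this thread with the suggested profiles substituted in. It prints the commands and does not run them; the helper name auth_command is hypothetical, while the user names and keyring file names are the ones used above.

```python
import shlex

def auth_command(entity, mon_caps, osd_caps, keyring, cluster="ceph"):
    """Build (but do not execute) a 'ceph auth get-or-create' command line."""
    argv = ["ceph", "auth", "get-or-create", f"client.{entity}",
            "mon", mon_caps, "osd", osd_caps,
            "-o", keyring, "--cluster", cluster]
    return shlex.join(argv)  # quotes the caps strings that contain spaces

# User on the source cluster (the one holding the live images).
print(auth_command("rbd_mirror", "profile rbd", "profile rbd",
                   "ceph.client.rbd_mirror.keyring"))
# User for the rbd-mirror daemon on the receiving cluster.
print(auth_command("rbd_mirror_backup", "profile rbd-mirror", "profile rbd",
                   "client.rbd_mirror_backup.keyring"))
```

If the users already exist with the old caps, "ceph auth caps" is the command that changes the caps of an existing user.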

It works well now, even though it seems to run a bit slow (~35 MB/s for the initial sync when everything is 1 GBit/s),
but that may also be caused by a combination of some very limited hardware on the receiving end (which will be scaled up in the future).
A single host with 6 disks, replica 3 and a RAID controller which can only do RAID0 and not JBOD is certainly not ideal, so commit latency may cause this slow bandwidth.

You could try increasing "rbd_concurrent_management_ops" from the
default of 10 ops to something higher to attempt to account for the
latency. However, I wouldn't expect near-line speed w/ RBD mirroring.

Thanks - I will play with this option once we have more storage available in the target pool ;-).
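
A rough back-of-the-envelope sketch of why more concurrent ops can compensate for latency. It assumes the default 4 MiB RBD object size, that every op copies one full object, and that per-op latency stays constant as concurrency grows; none of these were measured in this thread, and both function names are made up for illustration.

```python
MIB = 1024 * 1024

def implied_latency_s(concurrent_ops, object_size_bytes, observed_mb_s):
    """Per-op latency implied by an observed throughput (Little's law)."""
    return concurrent_ops * object_size_bytes / (observed_mb_s * 1e6)

def estimated_throughput_mb_s(concurrent_ops, object_size_bytes, latency_s):
    """Throughput in MB/s if each op takes latency_s and all ops run in parallel."""
    return concurrent_ops * object_size_bytes / latency_s / 1e6

# 10 ops in flight (the default mentioned above) at the observed ~35 MB/s.
latency = implied_latency_s(10, 4 * MIB, 35)
print(f"implied per-object latency: {latency:.2f} s")

# Same latency with 20 ops in flight.
print(f"estimate with 20 ops: {estimated_throughput_mb_s(20, 4 * MIB, latency):.0f} MB/s")
```

Under these assumptions the estimate scales linearly with the number of ops, up to the roughly 125 MB/s that a 1 GBit/s link can carry; in practice the receiving disks may saturate well before that.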

       systemctl start ceph-rbd-mirror@rbd_mirror_backup.service

    After this, everything looks fine:
     # rbd mirror pool info
       Mode: pool
       Peers:
        UUID                                 NAME      CLIENT
        XXXXXXXXXXX                          ceph-virt client.rbd_mirror

    The service also seems to start fine, but logs show (debug rbd_mirror=20):

    rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXXXXXXXXXX
    rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
    rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed pool replayer for uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror
    rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror
    rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to remote peer uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror: (95) Operation not supported
    rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: pool_id=2, callout_id=2, callout_level=error, text=unable to connect to remote cluster

If it's still broken after fixing your caps above, perhaps increase
debugging for "rados", "monc", "auth", and "ms" to see if you can
determine the source of the op not supported error.

I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the cluster with the live images) on the rbd-mirror machine explicitly (i.e. not only in mon config storage),
and after doing that:
   rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
works fine. So it's not a connectivity issue. Maybe a permission issue? Or did I miss something?

Any idea what "operation not supported" means?
It's unclear to me whether things should work well using Mimic with Nautilus, and whether enabling pool mirroring but only having journaling on for two images is a supported case.

Yes and yes.

2) Since there is a performance drawback (about 2x) for journaling, is it also possible to only mirror snapshots, and leave the live volumes alone?
     This would cover the common backup use case before deferred mirroring is implemented (or is it there already?).

This is in-development right now and will hopefully land for the
Octopus release.

That would be very cool. Just to clarify: You mean the "real" deferred mirroring, not a "snapshot only" mirroring?
Is it already clear if this will require Octopus (or a later release) on both ends, or only on the receiving side?

I might not be sure what you mean by deferred mirroring. You can delay
the replay of the journal via the "rbd_mirroring_replay_delay"
configuration option so that your DR site can be X seconds behind the
primary at a minimum.

This is indeed what I was thinking of...

For Octopus we are working on on-demand and
scheduled snapshot mirroring between sites -- no journal is involved.

... and this is what I was dreaming of. We keep snapshots of VMs to be able to roll them back.
We'd like to also keep those snapshots in a separate Ceph instance as an additional safety-net (in addition to an offline backup of those snapshots with Benji backup).
It is not (yet) clear to me whether we can pay the "2 x" price for journaling in the long run, so this would be the way to go in case we can't.

Since I got you personally, I have two bonus questions.

1) Your talk:
    https://events.static.linuxfound.org/sites/events/files/slides/Disaster%20Recovery%20and%20Ceph%20Block%20Storage-%20Introducing%20Multi-Site%20Mirroring.pdf
    mentions "rbd journal object flush age", which I'd translate as something like the "commit" mount option on a classical file system - correct?
    I don't find this switch documented anywhere, though - is there experience with it / what's the default?

It's a low-level knob that by default causes the journal to flush its
pending IO events before it allows the corresponding IO to be issued
against the backing image. Setting it to a value greater than zero
will allow that many seconds of IO events to be batched together in a
journal append operation, and it's helpful for high-throughput, small IO
operations. Of course it turned out that a bug had broken that option
for a while, so events would never batch; a fix is currently
scheduled for backport to all active releases [1] w/ the goal that no
one should need to tweak it.

That's even better - since our setup is growing and we will keep upgrading, I'll then just keep things as they are now (no manual tweaking)
and tag along the development. Thanks!
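
The batching effect described above can be pictured with a toy model. This is not librbd's implementation: it simply groups event timestamps so that, with a flush age of zero, every event becomes its own journal append, while a positive flush age lets events arriving within that many seconds of the first pending event share one append. The function name and the timestamps are invented for illustration.

```python
def batch_events(event_times, flush_age):
    """Group ascending event timestamps (seconds) into journal appends."""
    batches = []
    for t in event_times:
        # Join the open batch only if batching is enabled and the event
        # arrives within flush_age seconds of that batch's first event.
        if flush_age > 0 and batches and t - batches[-1][0] < flush_age:
            batches[-1].append(t)
        else:
            batches.append([t])
    return batches

events = [0.0, 0.1, 0.2, 1.5, 1.6]
print(len(batch_events(events, 0)))    # 5 appends, one per event
print(batch_events(events, 1.0))       # [[0.0, 0.1, 0.2], [1.5, 1.6]]
```

Fewer, larger appends are what makes the option attractive for many small writes, at the cost of holding events back for up to the flush age.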

2) I read I can run more than one rbd-mirror with Mimic/Nautilus. Do they load-balance the images, or "only" failover in case one of them dies?

Starting with Nautilus, the default configuration for rbd-mirror is to
evenly divide the number of mirrored images between all running
daemons. This does not split the total load since some images might be
hotter than others, but it at least spreads the load.

That's fine enough for our use case. Spreading by "hotness" is a task without a clear answer
and "temperature" may change quickly, so that's all I hoped for.
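
The difference between dividing the image count and dividing the load can be shown with a small sketch. The round-robin assignment below is only an illustration of "evenly divide the number of images" and is not the policy rbd-mirror actually implements; the image names, daemon names and write rates are hypothetical.

```python
def assign_evenly(images, daemons):
    """Round-robin images over daemons so the counts differ by at most one."""
    assignment = {daemon: [] for daemon in daemons}
    for index, image in enumerate(sorted(images)):
        assignment[daemons[index % len(daemons)]].append(image)
    return assignment

def load_per_daemon(assignment, write_rate):
    """Sum the per-image write rates handled by each daemon."""
    return {daemon: sum(write_rate[image] for image in images)
            for daemon, images in assignment.items()}

# Hypothetical per-image write rates in MB/s.
write_rate = {"vm-a": 100, "vm-b": 1, "vm-c": 90, "vm-d": 2}
assignment = assign_evenly(write_rate, ["mirror-1", "mirror-2"])
print(assignment)   # {'mirror-1': ['vm-a', 'vm-c'], 'mirror-2': ['vm-b', 'vm-d']}
print(load_per_daemon(assignment, write_rate))   # {'mirror-1': 190, 'mirror-2': 3}
```

Both daemons handle two images each, yet one of them carries almost all of the write traffic, which is the caveat about "hotter" images made above.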

Many thanks again for the very helpful explanations!

Cheers,
	Oliver

Cheers and many thanks for the quick and perfect help!
         Oliver

Cheers and thanks in advance,
         Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] https://docs.ceph.com/docs/master/rbd/rbd-mirroring/#rbd-mirror-daemon

--
Jason

[1] https://github.com/ceph/ceph/pull/28539
