Re: Ceph RBD Mirroring

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Sun, 15 Sep 2019 00:12:53 +0200

Dear Jason,

Am 15.09.19 um 00:03 schrieb Jason Dillaman:
> I was able to repeat this issue locally by restarting the primary OSD
> for the "rbd_mirroring" object. It seems that a regression was
> introduced w/ the introduction of Ceph msgr2 in that upon reconnect,
> the connection type for the client switches from ANY to V2 -- but only
> for the watcher session and not the status updates. I've opened a
> tracker ticker for this issue [1].
> 
> Thanks.

many thanks to you for the detailed investigation and reproduction!
While I did not restart the first 5 OSDs of the test cluster, I added an OSD and rebalanced - so I guess this can also be triggered if the primary OSD for the object changes,
which should of course also lead to a reconnection. 
I can also add to my observations that now while not touching the cluster anymore things stay in "up+replaying". 

Thanks and all the best,
	Oliver

> 
> On Fri, Sep 13, 2019 at 12:44 PM Oliver Freyermuth
> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> Am 13.09.19 um 18:38 schrieb Jason Dillaman:
>>> On Fri, Sep 13, 2019 at 11:30 AM Oliver Freyermuth
>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> Am 13.09.19 um 17:18 schrieb Jason Dillaman:
>>>>> On Fri, Sep 13, 2019 at 10:41 AM Oliver Freyermuth
>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Am 13.09.19 um 16:30 schrieb Jason Dillaman:
>>>>>>> On Fri, Sep 13, 2019 at 10:17 AM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Fri, Sep 13, 2019 at 10:02 AM Oliver Freyermuth
>>>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Dear Jason,
>>>>>>>>>
>>>>>>>>> thanks for the very detailed explanation! This was very instructive.
>>>>>>>>> Sadly, the watchers look correct - see details inline.
>>>>>>>>>
>>>>>>>>> Am 13.09.19 um 15:02 schrieb Jason Dillaman:
>>>>>>>>>> On Thu, Sep 12, 2019 at 9:55 PM Oliver Freyermuth
>>>>>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Dear Jason,
>>>>>>>>>>>
>>>>>>>>>>> thanks for taking care and developing a patch so quickly!
>>>>>>>>>>>
>>>>>>>>>>> I have another strange observation to share. In our test setup, only a single RBD mirroring daemon is running for 51 images.
>>>>>>>>>>> It works fine with a constant stream of 1-2 MB/s, but at some point after roughly 20 hours, _all_ images go to this interesting state:
>>>>>>>>>>> -----------------------------------------
>>>>>>>>>>> # rbd mirror image status test-vm.XXXXX-disk2
>>>>>>>>>>> test-vm.XXXXX-disk2:
>>>>>>>>>>>       global_id:   XXXXXXXXXXXXXXX
>>>>>>>>>>>       state:       down+replaying
>>>>>>>>>>>       description: replaying, master_position=[object_number=14, tag_tid=6, entry_tid=6338], mirror_position=[object_number=14, tag_tid=6, entry_tid=6338], entries_behind_master=0
>>>>>>>>>>>       last_update: 2019-09-13 03:45:43
>>>>>>>>>>> -----------------------------------------
>>>>>>>>>>> Running this command several times, I see entry_tid increasing at both ends, so mirroring seems to be working just fine.
>>>>>>>>>>>
>>>>>>>>>>> However:
>>>>>>>>>>> -----------------------------------------
>>>>>>>>>>> # rbd mirror pool status
>>>>>>>>>>> health: WARNING
>>>>>>>>>>> images: 51 total
>>>>>>>>>>>         51 unknown
>>>>>>>>>>> -----------------------------------------
>>>>>>>>>>> The health warning is not visible in the dashboard (also not in the mirroring menu), the daemon still seems to be running, dropped nothing in the logs,
>>>>>>>>>>> and claims to be "ok" in the dashboard - it's only that all images show up in unknown state even though all seems to be working fine.
>>>>>>>>>>>
>>>>>>>>>>> Any idea on how to debug this?
>>>>>>>>>>> When I restart the rbd-mirror service, all images come back as green. I already encountered this twice in 3 days.
>>>>>>>>>>
>>>>>>>>>> The dashboard relies on the rbd-mirror daemon to provide it errors and
>>>>>>>>>> warnings. You can see the status reported by rbd-mirror by running
>>>>>>>>>> "ceph service status":
>>>>>>>>>>
>>>>>>>>>> $ ceph service status
>>>>>>>>>> {
>>>>>>>>>>         "rbd-mirror": {
>>>>>>>>>>             "4152": {
>>>>>>>>>>                 "status_stamp": "2019-09-13T08:58:41.937491-0400",
>>>>>>>>>>                 "last_beacon": "2019-09-13T08:58:41.937491-0400",
>>>>>>>>>>                 "status": {
>>>>>>>>>>                     "json":
>>>>>>>>>> "{\"1\":{\"name\":\"mirror\",\"callouts\":{},\"image_assigned_count\":1,\"image_error_count\":0,\"image_local_count\":1,\"image_remote_count\":1,\"image_warning_count\":0,\"instance_id\":\"4154\",\"leader\":true},\"2\":{\"name\":\"mirror_parent\",\"callouts\":{},\"image_assigned_count\":0,\"image_error_count\":0,\"image_local_count\":0,\"image_remote_count\":0,\"image_warning_count\":0,\"instance_id\":\"4156\",\"leader\":true}}"
>>>>>>>>>>                 }
>>>>>>>>>>             }
>>>>>>>>>>         }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> In your case, most likely it seems like rbd-mirror thinks all is good
>>>>>>>>>> with the world so it's not reporting any errors.
>>>>>>>>>
>>>>>>>>> This is indeed the case:
>>>>>>>>>
>>>>>>>>> # ceph service status
>>>>>>>>> {
>>>>>>>>>         "rbd-mirror": {
>>>>>>>>>             "84243": {
>>>>>>>>>                 "status_stamp": "2019-09-13 15:40:01.149815",
>>>>>>>>>                 "last_beacon": "2019-09-13 15:40:26.151381",
>>>>>>>>>                 "status": {
>>>>>>>>>                     "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"84247\",\"leader\":true}}"
>>>>>>>>>                 }
>>>>>>>>>             }
>>>>>>>>>         },
>>>>>>>>>         "rgw": {
>>>>>>>>> ...
>>>>>>>>>         }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>> The "down" state indicates that the rbd-mirror daemon isn't correctly
>>>>>>>>>> watching the "rbd_mirroring" object in the pool. You can see who it
>>>>>>>>>> watching that object by running the "rados" "listwatchers" command:
>>>>>>>>>>
>>>>>>>>>> $ rados -p <pool name> listwatchers rbd_mirroring
>>>>>>>>>> watcher=1.2.3.4:0/199388543 client.4154 cookie=94769010788992
>>>>>>>>>> watcher=1.2.3.4:0/199388543 client.4154 cookie=94769061031424
>>>>>>>>>>
>>>>>>>>>> In my case, the "4154" from "client.4154" is the unique global id for
>>>>>>>>>> my connection to the cluster, which relates back to the "ceph service
>>>>>>>>>> status" dump which also shows status by daemon using the unique global
>>>>>>>>>> id.
>>>>>>>>>
>>>>>>>>> Sadly(?), this looks as expected:
>>>>>>>>>
>>>>>>>>> # rados -p rbd listwatchers rbd_mirroring
>>>>>>>>> watcher=10.160.19.240:0/2922488671 client.84247 cookie=139770046978672
>>>>>>>>> watcher=10.160.19.240:0/2922488671 client.84247 cookie=139771389162560
>>>>>>>>
>>>>>>>> Hmm, the unique id is different (84243 vs 84247). I wouldn't have
>>>>>>>> expected the global id to have changed. Did you restart the Ceph
>>>>>>>> cluster or MONs? Do you see any "peer assigned me a different
>>>>>>>> global_id" errors in your rbd-mirror logs?
>>>>>>>>
>>>>>>>> I'll open a tracker ticket to fix the "ceph service status", though,
>>>>>>>> since clearly your global id changed but it wasn't noticed by the
>>>>>>>> service daemon status updater.
>>>>>>>
>>>>>>> ... also, can you please provide the output from the following via a
>>>>>>> pastebin link?
>>>>>>>
>>>>>>> # rados -p rbd listomapvals rbd_mirroring
>>>>>>
>>>>>> Of course, here you go:
>>>>>> https://0x0.st/zy8J.txt
>>>>>
>>>>> Thanks. For the case above of global image id
>>>>> 1a53fafa-37ef-4edf-9633-c2ba3323ed93, the on-disk status shows that it
>>>>> was last updated by client.84247 / nonce 2922488671, which correctly
>>>>> matches your watcher so the status should be "up":
>>>>>
>>>>> status_global_1a53fafa-37ef-4edf-9633-c2ba3323ed93
>>>>> value (232 bytes) :
>>>>> 00000000  01 01 2c 00 00 00 08 17  49 01 00 00 00 00 00 01
>>>>> |..,.....I.......|     <--- "17  49 01 00 00 00 00 00" (84247) is the
>>>>> instance id
>>>>> 00000010  01 01 1c 00 00 00 03 00  00 00 5f a3 31 ae 10 00
>>>>> |.........._.1...|    <--- "5f a3 31 ae" is the nonce (2922488671)
>>>>> 00000020  00 00 02 00 00 00 0a a0  13 f0 00 00 00 00 00 00
>>>>> |................|     <--- "0a a0  13 f0" is the IP address
>>>>> (10.160.9.240)
>>>>> 00000030  00 00 01 01 b0 00 00 00  04 a2 00 00 00 72 65 70  |.............rep|
>>>>> 00000040  6c 61 79 69 6e 67 2c 20  6d 61 73 74 65 72 5f 70  |laying, master_p|
>>>>> 00000050  6f 73 69 74 69 6f 6e 3d  5b 6f 62 6a 65 63 74 5f  |osition=[object_|
>>>>> 00000060  6e 75 6d 62 65 72 3d 31  39 2c 20 74 61 67 5f 74  |number=19, tag_t|
>>>>> 00000070  69 64 3d 36 2c 20 65 6e  74 72 79 5f 74 69 64 3d  |id=6, entry_tid=|
>>>>> 00000080  32 36 34 34 33 5d 2c 20  6d 69 72 72 6f 72 5f 70  |26443], mirror_p|
>>>>> 00000090  6f 73 69 74 69 6f 6e 3d  5b 6f 62 6a 65 63 74 5f  |osition=[object_|
>>>>> 000000a0  6e 75 6d 62 65 72 3d 31  39 2c 20 74 61 67 5f 74  |number=19, tag_t|
>>>>> 000000b0  69 64 3d 36 2c 20 65 6e  74 72 79 5f 74 69 64 3d  |id=6, entry_tid=|
>>>>> 000000c0  32 36 34 34 33 5d 2c 20  65 6e 74 72 69 65 73 5f  |26443], entries_|
>>>>> 000000d0  62 65 68 69 6e 64 5f 6d  61 73 74 65 72 3d 30 51  |behind_master=0Q|
>>>>> 000000e0  aa 7b 5d 1b 5f 4f 33 00                           |.{]._O3.|
>>>>> 000000e8
>>>>>
>>>>> The only thing I can think of is that somehow the watcher entity
>>>>> instance has a different encoding and its failing a comparison. Can
>>>>> you restart rbd-mirror such that the statuses list "up+replaying" and
>>>>> then run the following?
>>>>>
>>>>> # rados -p rbd getomapval rbd_mirroring
>>>>> status_global_1a53fafa-37ef-4edf-9633-c2ba3323ed93
>>>>
>>>> Interesting! Again, thanks for the detailed context - learning a bit more about the internals is one of the many reasons why we love Ceph so much,
>>>> and something which fully proprietary code will usually never deliver :-).
>>>>
>>>> Here's the output after the restart, image is in up+replaying state:
>>>>
>>>> # rados -p rbd getomapval rbd_mirroring status_global_1a53fafa-37ef-4edf-9633-c2ba3323ed93
>>>> value (232 bytes) :
>>>> 00000000  01 01 2c 00 00 00 08 ec  50 01 00 00 00 00 00 01  |..,.....P.......|
>>>> 00000010  01 01 1c 00 00 00 03 00  00 00 0b 24 cd a5 10 00  |...........$....|
>>>> 00000020  00 00 02 00 00 00 0a a0  13 f0 00 00 00 00 00 00  |................|
>>>> 00000030  00 00 01 01 b0 00 00 00  04 a2 00 00 00 72 65 70  |.............rep|
>>>> 00000040  6c 61 79 69 6e 67 2c 20  6d 61 73 74 65 72 5f 70  |laying, master_p|
>>>> 00000050  6f 73 69 74 69 6f 6e 3d  5b 6f 62 6a 65 63 74 5f  |osition=[object_|
>>>> 00000060  6e 75 6d 62 65 72 3d 31  38 2c 20 74 61 67 5f 74  |number=18, tag_t|
>>>> 00000070  69 64 3d 36 2c 20 65 6e  74 72 79 5f 74 69 64 3d  |id=6, entry_tid=|
>>>> 00000080  32 37 36 32 36 5d 2c 20  6d 69 72 72 6f 72 5f 70  |27626], mirror_p|
>>>> 00000090  6f 73 69 74 69 6f 6e 3d  5b 6f 62 6a 65 63 74 5f  |osition=[object_|
>>>> 000000a0  6e 75 6d 62 65 72 3d 31  38 2c 20 74 61 67 5f 74  |number=18, tag_t|
>>>> 000000b0  69 64 3d 36 2c 20 65 6e  74 72 79 5f 74 69 64 3d  |id=6, entry_tid=|
>>>> 000000c0  32 37 36 32 36 5d 2c 20  65 6e 74 72 69 65 73 5f  |27626], entries_|
>>>> 000000d0  62 65 68 69 6e 64 5f 6d  61 73 74 65 72 3d 30 eb  |behind_master=0.|
>>>> 000000e0  b3 7b 5d 27 9c d8 31 00                           |.{]'..1.|
>>>> 000000e8
>>>>
>>>> IIUC, this decodes to instance ID 86252, IP address of course stayed the same.
>>>>
>>>> Checking the other output:
>>>>
>>>> # ceph service status
>>>> {
>>>>       "rbd-mirror": {
>>>>           "86248": {
>>>>               "status_stamp": "2019-09-13 17:26:15.391048",
>>>>               "last_beacon": "2019-09-13 17:26:25.391759",
>>>>               "status": {
>>>>                   "json": "{\"2\":{\"name\":\"rbd\",\"callouts\":{},\"image_assigned_count\":51,\"image_error_count\":0,\"image_local_count\":51,\"image_remote_count\":51,\"image_warning_count\":0,\"instance_id\":\"86252\",\"leader\":true}}"
>>>>               }
>>>>           }
>>>>       },
>>>> ...
>>>> }
>>>>
>>>> # rados -p rbd listwatchers rbd_mirroring
>>>> watcher=10.160.19.240:0/2781684747 client.86252 cookie=140089552292144
>>>> watcher=10.160.19.240:0/2781684747 client.86252 cookie=140090961572928
>>>>
>>>> This looks as strange as before again: Global instance ID is 86248, but instance ID (and what I find in the omap dump) is 86252.
>>>>
>>>> However, things look okay in the dashboard again and also:
>>>> # rbd mirror pool status
>>>> health: OK
>>>> images: 51 total
>>>>       51 replaying
>>>>
>>>> Cheers,
>>>>          Oliver
>>>
>>> Can you also provide the output from "ceph features"?
>>
>> Here you go:
>> ------------------------------------------------------
>> # ceph features
>> {
>>      "mon": [
>>          {
>>              "features": "0x3ffddff8ffacffff",
>>              "release": "luminous",
>>              "num": 1
>>          }
>>      ],
>>      "osd": [
>>          {
>>              "features": "0x3ffddff8ffacffff",
>>              "release": "luminous",
>>              "num": 6
>>          }
>>      ],
>>      "client": [
>>          {
>>              "features": "0x3ffddff8ffacffff",
>>              "release": "luminous",
>>              "num": 6
>>          }
>>      ],
>>      "mgr": [
>>          {
>>              "features": "0x3ffddff8ffacffff",
>>              "release": "luminous",
>>              "num": 1
>>          }
>>      ]
>> }
>> ------------------------------------------------------
>> This is a rather fresh Nautilus cluster, which has not yet seen any version upgrade in its lifetime.
>>
>> Cheers,
>>         Oliver
>>
>>
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Cheers,
>>>>>> Oliver
>>>>>>
>>>>>>>
>>>>>>>>> However, the dashboard still shows those images in "unknown", and this also shows up via command line:
>>>>>>>>>
>>>>>>>>> # rbd mirror pool status
>>>>>>>>> health: WARNING
>>>>>>>>> images: 51 total
>>>>>>>>>         51 unknown
>>>>>>>>> # rbd mirror image status test-vm.physik.uni-bonn.de-disk1
>>>>>>>>> test-vm.physik.uni-bonn.de-disk2:
>>>>>>>>>       global_id:   1a53fafa-37ef-4edf-9633-c2ba3323ed93
>>>>>>>>>       state:       down+replaying
>>>>>>>>>       description: replaying, master_position=[object_number=18, tag_tid=6, entry_tid=25202], mirror_position=[object_number=18, tag_tid=6, entry_tid=25202], entries_behind_master=0
>>>>>>>>>       last_update: 2019-09-13 15:55:15
>>>>>>>>>
>>>>>>>>> Any ideas on what else could cause this?
>>>>>>>>>
>>>>>>>>> Cheers and thanks,
>>>>>>>>>            Oliver
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Any idea on this (or how I can extract more information)?
>>>>>>>>>>> I fear keeping high-level debug logs active for ~24h is not feasible.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>             Oliver
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 2019-09-11 19:14, Jason Dillaman wrote:
>>>>>>>>>>>> On Wed, Sep 11, 2019 at 12:57 PM Oliver Freyermuth
>>>>>>>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dear Jason,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I played a bit more with rbd mirroring and learned that deleting an image at the source (or disabling journaling on it) immediately moves the image to trash at the target -
>>>>>>>>>>>>> but setting rbd_mirroring_delete_delay helps to have some more grace time to catch human mistakes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, I have issues restoring such an image which has been moved to trash by the RBD-mirror daemon as user:
>>>>>>>>>>>>> -----------------------------------
>>>>>>>>>>>>> [root@mon001 ~]# rbd trash ls -la
>>>>>>>>>>>>> ID           NAME                             SOURCE    DELETED_AT               STATUS                                   PARENT
>>>>>>>>>>>>> d4fbe8f63905 test-vm-XXXXXXXXXXXXXXXXXX-disk2 MIRRORING Wed Sep 11 18:43:14 2019 protected until Thu Sep 12 18:43:14 2019
>>>>>>>>>>>>> [root@mon001 ~]# rbd trash restore --image foo-image d4fbe8f63905
>>>>>>>>>>>>> rbd: restore error: 2019-09-11 18:50:15.387 7f5fa9590b00 -1 librbd::api::Trash: restore: Current trash source: mirroring does not match expected: user
>>>>>>>>>>>>> (22) Invalid argument
>>>>>>>>>>>>> -----------------------------------
>>>>>>>>>>>>> This is issued on the mon, which has the client.admin key, so it should not be a permission issue.
>>>>>>>>>>>>> It also fails when I try that in the Dashboard.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sadly, the error message is not clear enough for me to figure out what could be the problem - do you see what I did wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> Good catch, it looks like we accidentally broke this in Nautilus when
>>>>>>>>>>>> image live-migration support was added. I've opened a new tracker
>>>>>>>>>>>> ticket to fix this [1].
>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers and thanks again,
>>>>>>>>>>>>>             Oliver
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 2019-09-10 23:17, Oliver Freyermuth wrote:
>>>>>>>>>>>>>> Dear Jason,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2019-09-10 23:04, Jason Dillaman wrote:
>>>>>>>>>>>>>>> On Tue, Sep 10, 2019 at 2:08 PM Oliver Freyermuth
>>>>>>>>>>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Dear Jason,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2019-09-10 18:50, Jason Dillaman wrote:
>>>>>>>>>>>>>>>>> On Tue, Sep 10, 2019 at 12:25 PM Oliver Freyermuth
>>>>>>>>>>>>>>>>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Dear Cephalopodians,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I have two questions about RBD mirroring.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1) I can not get it to work - my setup is:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>          - One cluster holding the live RBD volumes and snapshots, in pool "rbd", cluster name "ceph",
>>>>>>>>>>>>>>>>>>            running latest Mimic.
>>>>>>>>>>>>>>>>>>            I ran "rbd mirror pool enable rbd pool" on that cluster and created a cephx user "rbd_mirror" with (is there a better way?):
>>>>>>>>>>>>>>>>>>            ceph auth get-or-create client.rbd_mirror mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd r' -o ceph.client.rbd_mirror.keyring --cluster ceph
>>>>>>>>>>>>>>>>>>            In that pool, two images have the journaling feature activated, all others have it disabled still (so I would expect these two to be mirrored).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can just use "mon 'profile rbd' osd 'profile rbd'" for the caps --
>>>>>>>>>>>>>>>>> but you definitely need more than read-only permissions to the remote
>>>>>>>>>>>>>>>>> cluster since it needs to be able to create snapshots of remote images
>>>>>>>>>>>>>>>>> and update/trim the image journals.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> these profiles really make life a lot easier. I should have thought of them rather than "guessing" a potentially good configuration...
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>          - Another (empty) cluster running latest Nautilus, cluster name "ceph", pool "rbd".
>>>>>>>>>>>>>>>>>>            I've used the dashboard to activate mirroring for the RBD pool, and then added a peer with cluster name "ceph-virt", cephx-ID "rbd_mirror", filled in the mons and key created above.
>>>>>>>>>>>>>>>>>>            I've then run:
>>>>>>>>>>>>>>>>>>            ceph auth get-or-create client.rbd_mirror_backup mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow pool rbd rwx' -o client.rbd_mirror_backup.keyring --cluster ceph
>>>>>>>>>>>>>>>>>>            and deployed that key on the rbd-mirror machine, and started the service with:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please use "mon 'profile rbd-mirror' osd 'profile rbd'" for your caps [1].
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That did the trick (in combination with the above)!
>>>>>>>>>>>>>>>> Again a case of PEBKAC: I should have read the documentation until the end, clearly my fault.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It works well now, even though it seems to run a bit slow (~35 MB/s for the initial sync when everything is 1 GBit/s),
>>>>>>>>>>>>>>>> but that may also be caused by combination of some very limited hardware on the receiving end (which will be scaled up in the future).
>>>>>>>>>>>>>>>> A single host with 6 disks, replica 3 and a RAID controller which can only do RAID0 and not JBOD is certainly not ideal, so commit latency may cause this slow bandwidth.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You could try increasing "rbd_concurrent_management_ops" from the
>>>>>>>>>>>>>>> default of 10 ops to something higher to attempt to account for the
>>>>>>>>>>>>>>> latency. However, I wouldn't expect near-line speed w/ RBD mirroring.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks - I will play with this option once we have more storage available in the target pool ;-).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>            systemctl start ceph-rbd-mirror@rbd_mirror_backup.service
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>         After this, everything looks fine:
>>>>>>>>>>>>>>>>>>          # rbd mirror pool info
>>>>>>>>>>>>>>>>>>            Mode: pool
>>>>>>>>>>>>>>>>>>            Peers:
>>>>>>>>>>>>>>>>>>             UUID                                 NAME      CLIENT
>>>>>>>>>>>>>>>>>>             XXXXXXXXXXX                          ceph-virt client.rbd_mirror
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>         The service also seems to start fine, but logs show (debug rbd_mirror=20):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>         rbd::mirror::ClusterWatcher:0x5575e2a7d390 resolve_peer_config_keys: retrieving config-key: pool_id=2, pool_name=rbd, peer_uuid=XXXXXXXXXXX
>>>>>>>>>>>>>>>>>>         rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: enter
>>>>>>>>>>>>>>>>>>         rbd::mirror::Mirror: 0x5575e29c7240 update_pool_replayers: restarting failed pool replayer for uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror
>>>>>>>>>>>>>>>>>>         rbd::mirror::PoolReplayer: 0x5575e2a7da20 init: replaying for uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror
>>>>>>>>>>>>>>>>>>         rbd::mirror::PoolReplayer: 0x5575e2a7da20 init_rados: error connecting to remote peer uuid: XXXXXXXXXXX cluster: ceph-virt client: client.rbd_mirror: (95) Operation not supported
>>>>>>>>>>>>>>>>>>         rbd::mirror::ServiceDaemon: 0x5575e29c8d70 add_or_update_callout: pool_id=2, callout_id=2, callout_level=error, text=unable to connect to remote cluster
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If it's still broken after fixing your caps above, perhaps increase
>>>>>>>>>>>>>>>>> debugging for "rados", "monc", "auth", and "ms" to see if you can
>>>>>>>>>>>>>>>>> determine the source of the op not supported error.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I already tried storing the ceph.client.rbd_mirror.keyring (i.e. from the cluster with the live images) on the rbd-mirror machine explicitly (i.e. not only in mon config storage),
>>>>>>>>>>>>>>>>>> and after doing that:
>>>>>>>>>>>>>>>>>>        rbd -m mon_ip_of_ceph_virt_cluster --id=rbd_mirror ls
>>>>>>>>>>>>>>>>>> works fine. So it's not a connectivity issue. Maybe a permission issue? Or did I miss something?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Any idea what "operation not supported" means?
>>>>>>>>>>>>>>>>>> It's unclear to me whether things should work well using Mimic with Nautilus, and enabling pool mirroring but only having journaling on for two images is a supported case.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Yes and yes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2) Since there is a performance drawback (about 2x) for journaling, is it also possible to only mirror snapshots, and leave the live volumes alone?
>>>>>>>>>>>>>>>>>>          This would cover the common backup usecase before deferred mirroring is implemented (or is it there already?).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is in-development right now and will hopefully land for the
>>>>>>>>>>>>>>>>> Octopus release.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> That would be very cool. Just to clarify: You mean the "real" deferred mirroring, not a "snapshot only" mirroring?
>>>>>>>>>>>>>>>> Is it already clear if this will require Octopous (or a later release) on both ends, or only on the receiving side?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I might not be sure what you mean by deferred mirroring. You can delay
>>>>>>>>>>>>>>> the replay of the journal via the "rbd_mirroring_replay_delay"
>>>>>>>>>>>>>>> configuration option so that your DR site can be X seconds behind the
>>>>>>>>>>>>>>> primary at a minimum.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is indeed what I was thinking of...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For Octopus we are working on on-demand and
>>>>>>>>>>>>>>> scheduled snapshot mirroring between sites -- no journal is involved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ... and this is what I was dreaming of. We keep snapshots of VMs to be able to roll them back.
>>>>>>>>>>>>>> We'd like to also keep those snapshots in a separate Ceph instance as an additional safety-net (in addition to an offline backup of those snapshots with Benji backup).
>>>>>>>>>>>>>> It is not (yet) clear to me whether we can pay the "2 x" price for journaling in the long run, so this would be the way to go in case we can't.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since I got you personally, I have two bonus questions.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1) Your talk:
>>>>>>>>>>>>>>>>         https://events.static.linuxfound.org/sites/events/files/slides/Disaster%20Recovery%20and%20Ceph%20Block%20Storage-%20Introducing%20Multi-Site%20Mirroring.pdf
>>>>>>>>>>>>>>>>         mentions "rbd journal object flush age", which I'd translate with something like the "commit" mount option on a classical file system - correct?
>>>>>>>>>>>>>>>>         I don't find this switch documented anywhere, though - is there experience with it / what's the default?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It's a low-level knob that by default causes the journal to flush its
>>>>>>>>>>>>>>> pending IO events before it allows the corresponding IO to be issued
>>>>>>>>>>>>>>> against the backing image. Setting it to a value greater that zero
>>>>>>>>>>>>>>> will allow that many seconds of IO events to be batched together in a
>>>>>>>>>>>>>>> journal append operation and its helpful for high-throughout, small IO
>>>>>>>>>>>>>>> operations. Of course it turned out that a bug had broken that option
>>>>>>>>>>>>>>> a while where events would never batch, so a fix is currently
>>>>>>>>>>>>>>> scheduled for backport of all active releases [1] w/ the goal that no
>>>>>>>>>>>>>>> one should need to tweak it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's even better - since our setup is growing and we will keep upgrading, I'll then just keep things as they are now (no manual tweaking)
>>>>>>>>>>>>>> and tag along the development. Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2) I read I can run more than one rbd-mirror with Mimic/Nautilus. Do they load-balance the images, or "only" failover in case one of them dies?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Starting with Nautilus, the default configuration for rbd-mirror is to
>>>>>>>>>>>>>>> evenly divide the number of mirrored images between all running
>>>>>>>>>>>>>>> daemons. This does not split the total load since some images might be
>>>>>>>>>>>>>>> hotter than others, but it at least spreads the load.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's fine enough for our use case. Spreading by "hotness" is a task without a clear answer
>>>>>>>>>>>>>> and "temperature" may change quickly, so that's all I hoped for.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Many thanks again for the very helpful explanations!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>           Oliver
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers and many thanks for the quick and perfect help!
>>>>>>>>>>>>>>>>              Oliver
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cheers and thanks in advance,
>>>>>>>>>>>>>>>>>>              Oliver
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://docs.ceph.com/docs/master/rbd/rbd-mirroring/#rbd-mirror-daemon
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Jason
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://github.com/ceph/ceph/pull/28539
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://tracker.ceph.com/issues/41780
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Jason
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jason
>>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> [1] https://tracker.ceph.com/issues/41833
> 

Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com