On 13.08.24 at 15:02, Ilya Dryomov wrote:
On Mon, Aug 12, 2024 at 1:17 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

On 12.08.24 at 12:16, Ilya Dryomov wrote:

On Mon, Aug 12, 2024 at 11:28 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

On 12.08.24 at 11:09, Ilya Dryomov wrote:

On Mon, Aug 12, 2024 at 10:20 AM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

Dear Cephalopodians,

for the past years we have successfully operated a "good old" Mimic cluster with primary RBD images, replicated via journaling to a "backup cluster" running Octopus (i.e. one-way replication). We have now finally gotten around to upgrading the cluster with the primary images to Octopus (and plan to upgrade further in the near future). After the upgrade, all MON, MGR, OSD and rbd-mirror daemons are running 15.2.17.

We run three rbd-mirror daemons which all share the following client with auth in the "backup" cluster, to which they write:

client.rbd_mirror_backup
    caps: [mon] profile rbd-mirror
    caps: [osd] profile rbd

and the following shared client with auth in the "primary" cluster, from which they read:

client.rbd_mirror
    caps: [mon] profile rbd
    caps: [osd] profile rbd

i.e. the same auth as described in the docs [0].

Checking on the primary cluster, we get:

# rbd mirror pool status
health: UNKNOWN
daemon health: UNKNOWN
image health: OK
images: 288 total
    288 replaying

For some reason, some values are "unknown" here. But mirroring seems to work, as checking on the backup cluster reveals, see for example:

# rbd mirror image status zabbix-test.example.com-disk2
zabbix-test.example.com-disk2:
  global_id:   1bdcb981-c1c5-4172-9583-be6a6cd996ec
  state:       up+replaying
  description: replaying, {"bytes_per_second":8540.27,"entries_behind_primary":0,"entries_per_second":1.8,"non_primary_position":{"entry_tid":869176,"object_number":504,"tag_tid":1},"primary_position":{"entry_tid":11143,"object_number":7,"tag_tid":1}}
  service:     rbd_mirror_backup on rbd-mirror002.example.com
  last_update: 2024-08-12 09:53:17

However, in some seemingly random cases we see that journals are never advanced on the primary cluster. Staying with the example above, on the primary cluster I find the following:

# rbd journal status --image zabbix-test.physik.uni-bonn.de-disk2
minimum_set: 1
active_set: 126
registered clients:
    [id=, commit_position=[positions=[[object_number=7, tag_tid=1, entry_tid=11143], [object_number=6, tag_tid=1, entry_tid=11142], [object_number=5, tag_tid=1, entry_tid=11141], [object_number=4, tag_tid=1, entry_tid=11140]]], state=connected]
    [id=52b80bb0-a090-4f7d-9950-c8691ed8fee9, commit_position=[positions=[[object_number=505, tag_tid=1, entry_tid=869181], [object_number=504, tag_tid=1, entry_tid=869180], [object_number=507, tag_tid=1, entry_tid=869179], [object_number=506, tag_tid=1, entry_tid=869178]]], state=connected]

As you can see, the minimum_set was not advanced. And as visible in "rbd mirror image status" above, the strange result is that the non_primary_position appears to be much more advanced than the primary_position. This seems to happen "at random" for only a few volumes... There are no other active clients apart from the actual VM (libvirt+qemu).

Hi Oliver,

Were the VM clients (i.e. librbd on the hypervisor nodes) upgraded as well?

Hi Ilya,

"some of them". As a matter of fact, we wanted to stress-test VM restarting and live migration first, and in some cases saw VMs stuck for a long time, which is now understandable...

As a quick fix, to purge the journals piling up over and over, the only "solution" we have found is to temporarily disable and then re-enable journaling for the affected VM disks, which can be identified with:

for A in $(rbd ls); do echo -n "$A: "; rbd --format=json journal status --image $A | jq '.active_set - .minimum_set'; done

(A slightly expanded sketch of this check-and-toggle approach follows after the quoted thread below.)

Any idea what is going wrong here? This did not happen before, when the primary cluster was running Mimic and the backup cluster Octopus, and also did not happen when both were running Mimic.

You might be hitting https://tracker.ceph.com/issues/57396.

Indeed, it looks exactly like that, as we do fsfreeze+fstrim every night (before snapshotting) inside all VMs (via qemu-guest-agent). Correlating affected VMs with upgraded hypervisors reveals that only those VMs running on hypervisors with Octopus clients seem affected, and the issue easily explains why we saw problems with VM shutdown/restart or live migration (extreme slowness / VMs almost getting stuck). I can also confirm that these problems seem to vanish when journaling is disabled. So many thanks, this does indeed explain a lot :-). It also means the bug is still present in Octopus, but fixed in Pacific and later.

We will likely switch to snapshot-based mirroring in the next weeks (now that we know that this will avoid the problem), then finish the upgrade of all hypervisors to Octopus, and only then attack Pacific and later.

Are any of your VM images clones (in the "rbd snap create" + "rbd clone" sense)? If so, I'd advise against switching to snapshot-based mirroring, as there are known issues with sync/replication correctness there.

Dear Ilya,

many thanks for the warning! Luckily, none of them are clones (but we did consider using clones, so it's good to know we should avoid that for now); we only have classic RBD volumes with several manual snapshots (which we thin out over time). So the workload would be "manual snapshots" plus "mirror snapshots".

FWIW, due to the nightly fstrim we also run object map checks and rebuilds nightly because of [0], but I hope this tracker will not have an effect on mirroring.

Cheers and thanks,
    Oliver

[0] https://tracker.ceph.com/issues/37876

It shouldn't have any noticeable effect.

FWIW, I don't think rebuilding object maps like that is needed. This inconsistency is benign -- occasionally marking a clean object as dirty isn't a problem. It's the reverse that would be an issue... But I get why you are running "rbd object-map check" nightly and certainly don't have anything against that.
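For reference, the journal-lag check and the disable/re-enable workaround mentioned in the quoted thread could be scripted roughly as below. This is only a sketch built from the commands quoted above; the lag threshold and the assumption that the images live in the default pool are mine, and toggling the journaling feature throws away the piled-up journal, so the backup cluster has to pick the image up again afterwards.

    #!/bin/bash
    # Sketch: report how far each image's journal lags (active_set vs. minimum_set)
    # and, above a threshold, toggle the journaling feature to reset the journal.
    # Assumptions (not from the thread): default pool, lag of >10 object sets is "stuck".
    THRESHOLD=10

    for IMG in $(rbd ls); do
        LAG=$(rbd --format=json journal status --image "$IMG" | jq '.active_set - .minimum_set')
        echo "$IMG: $LAG"
        if [ "$LAG" -gt "$THRESHOLD" ]; then
            # Disabling journaling removes the image's journal; re-enabling recreates it.
            rbd feature disable "$IMG" journaling
            rbd feature enable "$IMG" journaling
        fi
    done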
Thanks for confirming, that matches what I thought, which means the "check and rebuild" we do is surely overkill. The main reason we do it is to feel safer, since we explicitly rely on rbd diffs in our additional backup solution (Benji backup, which uses "rbd diff" output for differential backups). Benji would also deduplicate (and compress) objects that are merely marked as dirty, so even in terms of space consumption that would not be an actual issue (a "real" issue would be missing blocks in the diff, but luckily we have never seen that). So yeah, it's really mostly for "peace of mind" ;-).
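In case it helps other readers, here is a minimal sketch of how such "rbd diff"-based differential backups work in principle. It is not Benji's actual implementation; the pool, image, and snapshot names are made up for illustration, and a real tool would also read and store the changed extents' data, checksums, etc.

    # Create today's consistent snapshot (in the real setup after fsfreeze via qemu-guest-agent).
    TODAY=$(date +%F)
    YESTERDAY=$(date -d yesterday +%F)
    rbd snap create rbd/vm-disk@backup-"$TODAY"

    # List only the extents that changed since yesterday's snapshot; a backup tool
    # reads exactly these regions instead of the whole image.
    rbd diff --from-snap backup-"$YESTERDAY" rbd/vm-disk@backup-"$TODAY" --format=json

    # Alternatively, export the changed extents into a self-contained diff file.
    rbd export-diff --from-snap backup-"$YESTERDAY" rbd/vm-disk@backup-"$TODAY" /backup/vm-disk-"$TODAY".diff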
Perhaps we should consider making "rbd object-map check" a bit more lenient and not invalidate the object map in this case. It could still report the inconsistency, leaving it up to the operator whether they want to follow up with a rebuild.
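For context, the behaviour such a change would relax is what makes a nightly routine like the one discussed here necessary today: the check flags the object map as invalid, and the operator follows up with a rebuild. A rough sketch only (default pool, HEAD object maps only, and not the actual nightly job from this thread):

    # Check each image's object map; if the check found an inconsistency, the image
    # carries the "object map invalid" flag, visible in "rbd info", and gets rebuilt.
    for IMG in $(rbd ls); do
        rbd object-map check "$IMG"
        if rbd info "$IMG" | grep -q "object map invalid"; then
            rbd object-map rebuild "$IMG"
        fi
    done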
That sounds very reasonable. We also noticed a speedup of about 20 % for the checks and rebuilds going from Mimic to Octopus, so for us this is still quite a "cheap" operation (297 volumes of about 20 GB average size, snapshotted daily, about 7800 total snapshots kept in the live system, around 80 object maps to rebuild every day, taking less than 50 minutes to check all object maps serially).

Cheers and thanks again,
    Oliver

--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--