Re: rbd-mirror stops replaying journal on primary cluster

Josef Johansson <josef86@xxxxxxxxx> · Tue, 30 Aug 2022 08:43:52 +0200

Hi,

There's nothing special going on in the cluster when it stops replaying.
It seems that the local replayer hits a journal entry it doesn't handle
and just stops. Since it's the local replayer that stops, there are no
logs in rbd-mirror. The odd part is that rbd-mirror handles this totally
fine and is the one syncing correctly.

What's worse is that this is reported as HEALTHY in the status
information, even though restarting that VM will stall it until
replay is complete. The replay function inside the rbd client seems to
handle the journal fine, but only when the VM starts. I will try
to get a ticket open on tracker.ceph.com as soon as my account is
approved.
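
Since the health status can't be trusted here, this is a rough sketch of
how I'd compare the commit positions from the journal status output
quoted further down. It is not part of Ceph: the function names are made
up, the regex is only fitted to that one sample, and treating the
entry_tid difference as "lag" is my assumption (it only makes sense
while all positions share the same tag_tid).

```python
import re

# Sample taken from the `rbd journal status` output quoted below,
# with the mail quoting markers removed.
SAMPLE = """\
minimum_set: 0
active_set: 2
registered clients:
[id=, commit_position=[positions=[[object_number=0, tag_tid=1,
entry_tid=4592], [object_number=3, tag_tid=1, entry_tid=4591],
[object_number=2, tag_tid=1, entry_tid=4590], [object_number=1,
tag_tid=1, entry_tid=4589]]], state=connected]
[id=bdde9b90-df26-4e3d-84b3-66605dc45608,
commit_position=[positions=[[object_number=5, tag_tid=1,
entry_tid=19913], [object_number=4, tag_tid=1, entry_tid=19912],
[object_number=7, tag_tid=1, entry_tid=19911], [object_number=6,
tag_tid=1, entry_tid=19910]]], state=disconnected]
"""

# One registered client: "[id=<id>, ... state=<state>]".
# DOTALL lets the match span the wrapped lines.
CLIENT_RE = re.compile(
    r"\[id=(?P<id>[^,\]]*),(?P<body>.*?)state=(?P<state>\w+)\]",
    re.DOTALL,
)


def parse_journal_clients(text):
    """Return one dict per registered client (hypothetical helper)."""
    clients = []
    for m in CLIENT_RE.finditer(text):
        tids = [int(t) for t in re.findall(r"entry_tid=(\d+)", m.group("body"))]
        clients.append({
            "id": m.group("id").strip(),  # empty id = the image's own client
            "state": m.group("state"),
            "max_entry_tid": max(tids) if tids else None,
        })
    return clients


def entry_gap(clients):
    """Difference between the furthest and the least advanced client.

    Assumption: all positions carry the same tag_tid, otherwise
    entry_tid values are not comparable.
    """
    tids = [c["max_entry_tid"] for c in clients if c["max_entry_tid"] is not None]
    return max(tids) - min(tids) if len(tids) >= 2 else 0


if __name__ == "__main__":
    parsed = parse_journal_clients(SAMPLE)
    for c in parsed:
        print(c["id"] or "<local>", c["state"], c["max_entry_tid"])
    print("gap:", entry_gap(parsed))
```

Feeding it the real command output on a schedule and alerting when the
gap keeps growing would at least catch the stall before a VM restart
does.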

I have tried to find out which component is responsible for local replay,
but I have not been successful yet.

Thanks for answering :)

On Mon, Aug 22, 2022 at 11:05 AM Eugen Block <eblock@xxxxxx> wrote:
>
> Hi,
>
> IIRC the rbd mirror journals will grow if the sync stops working,
> which seems to be the case here. Does the primary cluster experience
> any high load when the replay stops? How is the connection between the
> two sites and is the link saturated? Does the rbd-mirror log reveal
> anything useful (maybe also in debug mode)?
>
> Regards,
> Eugen
>
> Quoting Josef Johansson <josef@xxxxxxxxxxx>:
>
> > Hi,
> >
> > I'm running Ceph Octopus 15.2.16 and I'm trying out two-way mirroring.
> >
> > Everything seems to be running fine, except that sometimes the replay
> > stops at the primary clusters.
> >
> > This means that VMs will not start properly until all journal
> > entries are replayed, but also that the journal grows over time.
> >
> > I am trying to find out why this occurs, and where to look for more
> > information.
> >
> > I am currently using rbd --pool <pool> --image <image> journal
> > status to see if the clients are in sync or not.
> >
> > Example output when things went sideways:
> >
> > minimum_set: 0
> > active_set: 2
> > registered clients:
> > [id=, commit_position=[positions=[[object_number=0, tag_tid=1,
> > entry_tid=4592], [object_number=3, tag_tid=1, entry_tid=4591],
> > [object_number=2, tag_tid=1, entry_tid=4590], [object_number=1,
> > tag_tid=1, entry_tid=4589]]], state=connected]
> > [id=bdde9b90-df26-4e3d-84b3-66605dc45608,
> > commit_position=[positions=[[object_number=5, tag_tid=1,
> > entry_tid=19913], [object_number=4, tag_tid=1, entry_tid=19912],
> > [object_number=7, tag_tid=1, entry_tid=19911], [object_number=6,
> > tag_tid=1, entry_tid=19910]]], state=disconnected]
> >
> > Right now I'm trying to catch it red-handed in the primary OSD logs.
> > But I'm not even sure that's the process that is replaying the
> > journal.
> >
> > Regards
> > Josef
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx