Re: cephfs snap-mirror stalled

Holger Naundorf <naundorf@xxxxxxxxxxxxxx> · Wed, 7 Dec 2022 15:53:11 +0100

On 06.12.22 14:17, Venky Shankar wrote:
On Tue, Dec 6, 2022 at 6:34 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:

On 06.12.22 09:54, Venky Shankar wrote:
Hi Holger,

On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:

Hello,
we have set up a snap-mirror for a directory on one of our clusters -
running ceph version

ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

to be mirrored to our other cluster - running ceph version

ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)

The initial setup went OK; when the first snapshot was created, data
started to flow at a decent (for our HW) rate of 100-200 MB/s. As the
directory contains ~200 TB this was expected to take some time - but now
the process has stalled completely after ~100 TB were mirrored and ~7 d
of running.
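
As a rough sanity check on these numbers, here is a back-of-the-envelope sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant transfer rate, neither of which is stated in the report; the function names are made up for illustration.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes
SECONDS_PER_DAY = 86_400


def average_rate_mb_s(bytes_copied: float, days: float) -> float:
    """Average throughput in MB/s for bytes_copied transferred over days."""
    return bytes_copied / (days * SECONDS_PER_DAY) / MB


def days_needed(total_bytes: float, rate_mb_s: float) -> float:
    """Days needed to copy total_bytes at a constant rate of rate_mb_s."""
    return total_bytes / (rate_mb_s * MB) / SECONDS_PER_DAY


if __name__ == "__main__":
    # ~100 TB mirrored in ~7 days, as reported
    print(f"average so far: {average_rate_mb_s(100 * TB, 7):.0f} MB/s")
    # Full ~200 TB at both ends of the observed rate range
    for rate in (100, 200):
        print(f"200 TB at {rate} MB/s: {days_needed(200 * TB, rate):.1f} days")
```

Under these assumptions the average so far works out to roughly 165 MB/s, which is within the observed 100-200 MB/s, and the full copy would need somewhere between about 12 and 23 days. So the stall happened well before the transfer could have finished.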

Up to now I do not have any hints why it has stopped - I do not see any
error messages from the cephfs-mirror daemon. Can the small version
mismatch be a problem?

Any hints where to look to find out what has got stuck are welcome.

I'd look at the mirror daemon logs for any errors to start with. You
might want to crank up the log level for debugging (debug
cephfs_mirror=20).

Even on max debug I do not see anything which looks like an error - but
as this is the first time I try to dig into any cephfs-mirror logs I
might not notice (as long as it is not red and flashing).

The log basically shows this type of sequence, repeating forever:

(...)
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7f130) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184655, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run: trying to pick from 1 directories
cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) pick_directory
cephfs::mirror::Watcher handle_notify: notify_id=751516198184656, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
cephfs::mirror::Mirror update_fs_mirrors
cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7fc70) after 2 seconds
cephfs::mirror::Watcher handle_notify: notify_id=751516198184657, handle=93939050205568, notifier_id=25504530
cephfs::mirror::MirrorWatcher handle_notify
(...)

Basically, the interesting bit is not captured since it probably
happened sometime back. Could you please set the following:

debug cephfs_mirror = 20
debug client = 20

and restart the mirror daemon? The daemon would start synchronizing
again. When synchronizing stalls, please share the daemon logs. If the
logs are huge, you could upload them via ceph-post-file.

If I set debug_client to 20 'huge' is an understatement.

I now have three huge logfiles: a pair with cephfs_mirror debug set to 20,
capturing the restart and the point where the sync stalls again, and one
with both mirror and client debug at 20, capturing the restart. As the
latter setting created ~10 GB of logs within 20 min, I reset the client
logging again to spare our small system disks. If these logs are needed I
think I will have to set up some remote logging facility.

What I observed from scanning the logs:

After the restart the mirror daemon spends some hours comparing the
incomplete transfers, with lots of lines like

do_synchronize: 18 entries in stack
do_synchronize: top of stack path=./...(FILENAME)...
do_synchronize: entry=./(FILENAME), data_sync=0, attr_sync=1

Then there is a point where the number of items in the stack goes down:

cephfs::mirror::PeerReplayer () do_synchronize: 8 entries in stack

The top of the stack moves up in the directory levels:

cephfs::mirror::PeerReplayer () do_synchronize: top of stack path=./...FILENAME...

but then it just stops, without any error message visible in the logfile,
and switches to the repeating sequence I already posted.
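
To find the point where the sync stops in logs of this size, a small script could pull out the stack-depth lines. This is only a sketch: it assumes the `do_synchronize: N entries in stack` wording seen in the excerpts in this thread, and the helper names and sample lines are made up for illustration. They are not part of cephfs-mirror.

```python
import gzip
import re
from typing import Iterable, Iterator, Optional, Tuple

# Pattern taken from the log excerpts in this thread; the exact wording in
# other Ceph versions is an assumption.
STACK_RE = re.compile(r"do_synchronize: (\d+) entries in stack")


def open_log(path: str):
    """Open a plain or gzip-compressed log file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "rt", errors="replace")


def stack_depths(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (line_number, stack_depth) for every stack-depth log line."""
    for lineno, line in enumerate(lines, start=1):
        match = STACK_RE.search(line)
        if match:
            yield lineno, int(match.group(1))


def last_sync_activity(lines: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return the last (line_number, stack_depth) seen, or None if absent."""
    last = None
    for item in stack_depths(lines):
        last = item
    return last


# Hypothetical sample mimicking the lines quoted in this thread.
SAMPLE = [
    "cephfs::mirror::PeerReplayer () do_synchronize: 18 entries in stack",
    "cephfs::mirror::PeerReplayer () do_synchronize: top of stack path=./...",
    "cephfs::mirror::PeerReplayer () do_synchronize: 8 entries in stack",
    "cephfs::mirror::MirrorWatcher handle_notify",
]
```

For a real file one would pass the handle from `open_log("syslog.stalled-no_client_debug.gz")` to `last_sync_activity`. The returned line number marks where the last `do_synchronize` activity was logged, so the lines just after it are the ones worth reading closely.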

Should I try to upload the logs? Even gzipped they are quite huge:
 388M syslog.restart-mirror-client_debug_20.gz
  98M syslog.restart-mirror-no_client_debug.gz
  54M syslog.stalled-no_client_debug.gz

(As our servers are in an isolated net I will have to see if the
'ceph-post-file' method works from another system.)

Regards,
Holger Naundorf

--
Dr. Holger Naundorf
Christian-Albrechts-Universität zu Kiel
Rechenzentrum / HPC / Server und Storage
Tel: +49 431 880-1990
Fax:  +49 431 880-1523
naundorf@xxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
