Re: cephfs snap-mirror stalled

Hi Holger,

(sorry for the late reply)

On Fri, Dec 9, 2022 at 6:22 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
>
> As an update:
> After the third restart the mirror daemon is running normally again -
> the only change from the previous restarts was that debug_client was
> set to 20 during the restart. (The first restart was after ~48h of no
> data movement to the receiver side and no changes, the 2nd was with
> debug_mirror set to 20, and the third - maybe too quickly - after only
> ~12h of no data movement, with debug_mirror and debug_client set to
> 20.) For now I have reset the debug levels back down to reduce the
> load on the system disks.

Do you see ESTALE-related messages in the mirror daemon logs (with
debug_client set to 20)? We have run into a couple of instances where
the mirror daemon gets stuck on a directory entry. The workaround is to
find the directory path (where the daemon is stuck) and, from another
client (mount), list the entries in that directory (at times restarting
the daemon also works, as in your case). A sketch of both steps is
below.
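
A rough sketch of what that looks like in practice - the log path,
admin socket name, filesystem name and peer UUID below are
placeholders, so adjust them for your deployment:

    # check the mirror daemon log for stale-handle errors
    grep -i estale /var/log/ceph/ceph-client.cephfs-mirror.*.log

    # ask the daemon which directory/snapshot it is currently syncing
    ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok \
        fs mirror peer status <fs-name>@<fs-id> <peer-uuid>

    # then, from a separate client mount of the source filesystem,
    # list the entries of the directory the daemon is stuck on
    ls -la /mnt/cephfs/<stuck-directory-path>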

This will be fixed in the next pacific release (tracker:
https://tracker.ceph.com/issues/55935).

>
> Regards,
> Holger Naundorf
>
>
> On 07.12.22 15:53, Holger Naundorf wrote:
> > On 06.12.22 14:17, Venky Shankar wrote:
> >> On Tue, Dec 6, 2022 at 6:34 PM Holger Naundorf
> >> <naundorf@xxxxxxxxxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 06.12.22 09:54, Venky Shankar wrote:
> >>>> Hi Holger,
> >>>>
> >>>> On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf
> >>>> <naundorf@xxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hello,
> >>>>> we have set up a snap-mirror for a directory on one of our clusters -
> >>>>> running ceph version
> >>>>>
> >>>>> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific
> >>>>> (stable)
> >>>>>
> >>>>> to be mirrored to our other cluster - running ceph version
> >>>>>
> >>>>> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific
> >>>>> (stable)
> >>>>>
> >>>>> The initial setup went OK; when the first snapshot was created, data
> >>>>> started to flow at a decent (for our HW) rate of 100-200MB/s. As the
> >>>>> directory contains ~200TB this was expected to take some time - but now
> >>>>> the process has stalled completely after ~100TB were mirrored and ~7d
> >>>>> of running.
> >>>>>
> >>>>> So far I do not have any hints as to why it has stopped - I do not
> >>>>> see any error messages from the cephfs-mirror daemon. Could the small
> >>>>> version mismatch be a problem?
> >>>>>
> >>>>> Any hints where to look to find out what has got stuck are welcome.
> >>>>
> >>>> I'd look at the mirror daemon logs for any errors to start with. You
> >>>> might want to crank up the log level for debugging (debug
> >>>> cephfs_mirror=20).
> >>>>
> >>>
> >>> Even at max debug I do not see anything that looks like an error - but
> >>> as this is the first time I have tried to dig into cephfs-mirror logs I
> >>> might not notice one (as long as it is not red and flashing).
> >>>
> >>> The log basically shows this type of sequence, repeating forever:
> >>>
> >>> (...)
> >>> cephfs::mirror::MirrorWatcher handle_notify
> >>> cephfs::mirror::Mirror update_fs_mirrors
> >>> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror
> >>> update (0x556fe3a7f130) after 2 seconds
> >>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184655,
> >>> handle=93939050205568, notifier_id=25504530
> >>> cephfs::mirror::MirrorWatcher handle_notify
> >>> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run:
> >>> trying to pick from 1 directories
> >>> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1)
> >>> pick_directory
> >>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184656,
> >>> handle=93939050205568, notifier_id=25504530
> >>> cephfs::mirror::MirrorWatcher handle_notify
> >>> cephfs::mirror::Mirror update_fs_mirrors
> >>> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror
> >>> update (0x556fe3a7fc70) after 2 seconds
> >>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184657,
> >>> handle=93939050205568, notifier_id=25504530
> >>> cephfs::mirror::MirrorWatcher handle_notify
> >>> (...)
> >>
> >> Basically, the interesting bit is not captured since it probably
> >> happened sometime back. Could you please set the following:
> >>
> >> debug cephfs_mirror = 20
> >> debug client = 20
> >>
> >> and restart the mirror daemon? The daemon would start synchronizing
> >> again. When synchronization stalls, please share the daemon logs. If
> >> the logs are huge, you could upload them via ceph-post-file.
> >>
> > If I set debug_client to 20, 'huge' is an understatement.
> >
> > I now have three huge logfiles - a pair with debug_mirror set to 20,
> > capturing the restart and the point where the sync stalls again, and one
> > with both mirror and client debug at 20, capturing the restart. But as
> > this setting created ~10GB of logs within 20min, I reset the client
> > logging again to spare our small system disks - if these logs are needed
> > I think I will have to set up some remote logging facility.
> >
> > The observations I made from scanning the logs:
> >
> > After the restart the mirror daemon spends some hours comparing the
> > incomplete transfers - lots of lines like:
> >
> > do_synchronize: 18 entries in stack
> > do_synchronize: top of stack path=./...(FILENAME)...
> > do_synchronize: entry=./(FILENAME), data_sync=0, attr_sync=1
> >
> > Then there is a point where the number of items in the stack goes down:
> >
> > cephfs::mirror::PeerReplayer () do_synchronize: 8 entries in stack
> >
> > and the top of the stack moves up through the directory levels:
> >
> > cephfs::mirror::PeerReplayer () do_synchronize: top of stack
> > path=./...FILENAME...
> >
> > But then it just stops, without any error message visible in the
> > logfile, and switches to the repeating sequence I posted already.
> >
> > Should I try to upload the logs? Even gzipped they are quite large:
> >   388M syslog.restart-mirror-client_debug_20.gz
> >    98M syslog.restart-mirror-no_client_debug.gz
> >    54M syslog.stalled-no_client_debug.gz
> >
> > (As our servers are on an isolated net, I will have to see if the
> > 'ceph-post-file' method works from another system.)
> >
> > Regards,
> > Holger Naundorf
> >
> >>>
> >>>
> >>>
> >>>>>
> >>>>> Regards,
> >>>>> Holger
> >>>>>
> >>>>> --
> >>>>> Dr. Holger Naundorf
> >>>>> Christian-Albrechts-Universität zu Kiel
> >>>>> Rechenzentrum / HPC / Server und Storage
> >>>>> Tel: +49 431 880-1990
> >>>>> Fax:  +49 431 880-1523
> >>>>> naundorf@xxxxxxxxxxxxxx
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Dr. Holger Naundorf
> >>> Christian-Albrechts-Universität zu Kiel
> >>> Rechenzentrum / HPC / Server und Storage
> >>> Tel: +49 431 880-1990
> >>> Fax:  +49 431 880-1523
> >>> naundorf@xxxxxxxxxxxxxx
> >>
> >>
> >>
> >
>
> --
> Dr. Holger Naundorf
> Christian-Albrechts-Universität zu Kiel
> Rechenzentrum / HPC / Server und Storage
> Tel: +49 431 880-1990
> Fax:  +49 431 880-1523
> naundorf@xxxxxxxxxxxxxx



-- 
Cheers,
Venky

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



