On Fri, Dec 16, 2022 at 1:27 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
>
> On 15.12.22 14:06, Venky Shankar wrote:
> > Hi Holger,
> >
> > (sorry for the late reply)
> >
> > On Fri, Dec 9, 2022 at 6:22 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
> >>
> >> As an update:
> >> After the third restart the mirror daemon is running normally again - the only change compared to the earlier restarts was that debug_client was set to 20 during the restart. (The first restart was after ~48h of no data movement to the receiver side and no changes, the second was with debug_mirror set to 20, and the third - maybe too quickly - after only ~12h of no data movement, with both debug_mirror and debug_client set to 20.) I have since reset the debug levels back down to reduce the load on the system disks.
> >
> > Do you see ESTALE-related messages in the mirror daemon logs (with debug_client: 20)? We have run into a couple of instances where the mirror daemon would be stuck on a directory entry. The workaround for that is to find the directory path (where the daemon is stuck) and, from another client (mount), list the entries in that directory (or at times restarting the daemon works, as in your case).
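> >
> > Something along these lines usually does it - just a rough sketch, the /var/log/syslog file and the /mnt/cephfs mount point below are placeholders for whatever your setup uses:
> >
> >   # with debug_cephfs_mirror/debug_client at 20, the last do_synchronize
> >   # entries show the path the daemon was working on before the ESTALE loop
> >   grep 'do_synchronize' /var/log/syslog | tail -n 20
> >   grep 'got ESTALE' /var/log/syslog | tail -n 5
> >
> >   # then, from any other client mount of the source filesystem,
> >   # list the directory the daemon is stuck on
> >   ls -l /mnt/cephfs/<stuck-directory-path>
> >
> > After that the daemon should get past the entry on its next retry; if not, restarting it (as you did) has the same effect.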
>
> Sounds like our bug - I do have the ESTALE messages with debug 20:
>
> syslog.restart-mirror-client_debug_20.gz:Dec 7 10:05:30 ceph07 bash[1740788]: debug 2022-12-07T09:05:30.299+0000 7f05907a7700 20 client.29690499 got ESTALE on tid 2258671 from mds.0
> syslog.restart-mirror-client_debug_20.gz:Dec 7 10:05:30 ceph07 bash[1740788]: debug 2022-12-07T09:05:30.299+0000 7f05907a7700 20 client.29690499 got ESTALE on tid 2258671 from mds.0
> syslog.restart-mirror-client_debug_20.gz:Dec 7 10:05:30 ceph07 bash[1740788]: debug 2022-12-07T09:05:30.299+0000 7f05907a7700 20 client.29690499 got ESTALE on tid 2258671 from mds.0
> (...)
>
> And when the mirror got stuck again after some time it mysteriously (at the time) started to work again while I was poking around in the logfiles and the system - and doing an 'ls' in the stuck dir was definitely part of that poking.
>
> > This will be fixed in the next pacific release (tracker: https://tracker.ceph.com/issues/55935).
>
> Next means the .11 release, or the current .10 that we have not yet upgraded to?

16.2.11

> Thanks for the update.
>
> Regards,
> Holger Naundorf
>
> >>
> >> Regards,
> >> Holger Naundorf
> >>
> >> On 07.12.22 15:53, Holger Naundorf wrote:
> >>> On 06.12.22 14:17, Venky Shankar wrote:
> >>>> On Tue, Dec 6, 2022 at 6:34 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> On 06.12.22 09:54, Venky Shankar wrote:
> >>>>>> Hi Holger,
> >>>>>>
> >>>>>> On Tue, Dec 6, 2022 at 1:42 PM Holger Naundorf <naundorf@xxxxxxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Hello,
> >>>>>>> we have set up a snap-mirror for a directory on one of our clusters - running ceph version
> >>>>>>>
> >>>>>>> ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
> >>>>>>>
> >>>>>>> to get it mirrored to our other cluster - running ceph version
> >>>>>>>
> >>>>>>> ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830) pacific (stable)
> >>>>>>>
> >>>>>>> The initial setup went OK; when the first snapshot was created, data started to flow at a decent (for our HW) rate of 100-200 MB/s. As the directory contains ~200TB this was expected to take some time - but now the process has stalled completely after ~100TB were mirrored and ~7d of running.
> >>>>>>>
> >>>>>>> Up to now I do not have any hints as to why it has stopped - I do not see any error messages from the cephfs-mirror daemon. Can the small version mismatch be a problem?
> >>>>>>>
> >>>>>>> Any hints on where to look to find out what has got stuck are welcome.
> >>>>>>
> >>>>>> I'd look at the mirror daemon logs for any errors to start with. You might want to crank up the log level for debugging (debug cephfs_mirror=20).
> >>>>>
> >>>>> Even on max debug I do not see anything which looks like an error - but as this is the first time I have tried to dig into cephfs-mirror logs I might not notice it (as long as it is not red and flashing).
> >>>>>
> >>>>> The log basically shows this type of sequence, repeating forever:
> >>>>>
> >>>>> (...)
> >>>>> cephfs::mirror::MirrorWatcher handle_notify
> >>>>> cephfs::mirror::Mirror update_fs_mirrors
> >>>>> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7f130) after 2 seconds
> >>>>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184655, handle=93939050205568, notifier_id=25504530
> >>>>> cephfs::mirror::MirrorWatcher handle_notify
> >>>>> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) run: trying to pick from 1 directories
> >>>>> cephfs::mirror::PeerReplayer(19361031-928d-4366-99bd-50df70d3adf1) pick_directory
> >>>>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184656, handle=93939050205568, notifier_id=25504530
> >>>>> cephfs::mirror::MirrorWatcher handle_notify
> >>>>> cephfs::mirror::Mirror update_fs_mirrors
> >>>>> cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x556fe3a7fc70) after 2 seconds
> >>>>> cephfs::mirror::Watcher handle_notify: notify_id=751516198184657, handle=93939050205568, notifier_id=25504530
> >>>>> cephfs::mirror::MirrorWatcher handle_notify
> >>>>> (...)
> >>>>
> >>>> Basically, the interesting bit is not captured since it probably happened some time back. Could you please set the following:
> >>>>
> >>>> debug cephfs_mirror = 20
> >>>> debug client = 20
> >>>>
> >>>> and restart the mirror daemon? The daemon would start synchronizing again. When synchronizing stalls, please share the daemon logs. If the logs are huge, you could upload them via ceph-post-file.
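> >>>>
> >>>> On a cephadm-managed cluster something like the following should do it - a sketch only, the config section and service name may differ on your deployment:
> >>>>
> >>>>   # persist the debug levels in the mon config db; the mirror daemon
> >>>>   # runs as a client.cephfs-mirror.* entity, so the generic 'client'
> >>>>   # section should cover it (or target its exact name from 'ceph auth ls')
> >>>>   ceph config set client debug_cephfs_mirror 20
> >>>>   ceph config set client debug_client 20
> >>>>
> >>>>   # restart the daemon so it starts a fresh synchronization run
> >>>>   ceph orch restart cephfs-mirror
> >>>>
> >>>> For the upload, ceph-post-file only needs the ceph-common package and outbound network access, so you can copy the compressed logs to any machine that can reach the internet and run, for example:
> >>>>
> >>>>   ceph-post-file -d 'cephfs-mirror stalled sync' <mirror-daemon-log>.gz
> >>>>
> >>>> and paste the upload id it prints back here.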
> >>>
> >>> If I set debug_client to 20, 'huge' is an understatement.
> >>>
> >>> I now have three huge logfiles - one pair with debug_mirror set to 20, capturing the restart and the point where the sync stalls again, and one with both mirror and client debug at 20, capturing the restart - but as this setting created ~10GB of logs within 20 minutes I reset the client logging again to spare our small system disks. If those logs are needed I think I will have to set up some remote logging facility.
> >>>
> >>> The observations I made from scanning the logs:
> >>>
> >>> After the restart the mirror daemon spends some hours comparing the incomplete transfers (lots of lines with
> >>>
> >>> do_synchronize: 18 entries in stack
> >>> do_synchronize: top of stack path=./...(FILENAME)...
> >>> do_synchronize: entry=./(FILENAME), data_sync=0, attr_sync=1
> >>>
> >>> ), then there is the point where the number of items in the stack goes down:
> >>>
> >>> cephfs::mirror::PeerReplayer () do_synchronize: 8 entries in stack
> >>>
> >>> The top of stack moves up through the directory levels:
> >>>
> >>> cephfs::mirror::PeerReplayer () do_synchronize: top of stack path=./...FILENAME...
> >>>
> >>> but then it just stops, without any error message visible in the logfile, and switches to the repeating sequence I posted already.
> >>>
> >>> Should I try to upload the logs? Even gzipped they are quite huge:
> >>> 388M syslog.restart-mirror-client_debug_20.gz
> >>> 98M syslog.restart-mirror-no_client_debug.gz
> >>> 54M syslog.stalled-no_client_debug.gz
> >>>
> >>> (As our servers are in an isolated net I will have to see if the 'ceph-post-file' method works from another system.)
> >>>
> >>> Regards,
> >>> Holger Naundorf
> >>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Holger
>
> --
> Dr. Holger Naundorf
> Christian-Albrechts-Universität zu Kiel
> Rechenzentrum / HPC / Server und Storage
> Tel: +49 431 880-1990
> Fax: +49 431 880-1523
> naundorf@xxxxxxxxxxxxxx

--
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx