Hi Dan,

it is possible that the payload reduction also solved or at least reduced a really bad problem that looks related (beware, that's a long one):
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/FBGIJZNFG445NMYGO73PFNQL2ZB3ZF2Z/#FBGIJZNFG445NMYGO73PFNQL2ZB3ZF2Z

Since reducing the payload size I still observe these large peaks in the MON network activity. However, it seems that the cluster no longer goes down like it did before. During these peaks, I see warnings like these:

2021-01-22 12:00:00.000102 [WRN] overall HEALTH_WARN 1 pools nearfull
2021-01-22 11:04:09.156796 [INF] Health check cleared: SLOW_OPS (was: 5 slow ops, oldest one blocked for 75 sec, mon.ceph-02 has slow ops)
2021-01-22 11:04:07.994416 [WRN] Health check update: 5 slow ops, oldest one blocked for 75 sec, mon.ceph-02 has slow ops (SLOW_OPS)
2021-01-22 11:04:01.469498 [WRN] Health check failed: 124 slow ops, oldest one blocked for 82 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops. (SLOW_OPS)
2021-01-22 11:00:00.000104 [WRN] overall HEALTH_WARN 1 pools nearfull
2021-01-22 10:36:44.576663 [INF] Health check cleared: SLOW_OPS (was: 25 slow ops, oldest one blocked for 42 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops.)
2021-01-22 10:36:38.543763 [WRN] Health check failed: 18 slow ops, oldest one blocked for 38 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops. (SLOW_OPS)

So, at least stuff is working. I now lean towards the hypothesis that these outages were caused by some synchronisation process between MONs that became less problematic after reducing the payload size.

I might be able to reduce my insane beacon time-outs again, but before doing so: do you know of any other communication parameters similar to mon_sync_max_payload_size that might be relevant in MON-[MON, MGR, OSD] communication?

In general, I have the impression that, because of such little bugs, the recommendation for production clusters should be raised to at least 5 MONs, so that one can afford 2 MONs going out of quorum temporarily. I will upgrade our cluster to 5 MONs as soon as I can.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
Sent: 06 January 2021 20:53:14
To: Frank Schilder
Subject: Re: Re: Storage down due to MON sync very slow

Yeah I was going to say -- ignore all of the rsync advice in that thread, it is unnecessary.
Setting a small mon sync payload works like magic :)

-- dan

On Wed, Jan 6, 2021 at 8:49 PM Frank Schilder <frans@xxxxxx> wrote:
>
> OK, sorry for all my questions.
>
> Setting mon_sync_max_payload_size=4096 actually makes the MON sync in no time! Thank you so much :)
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder
> Sent: 06 January 2021 20:40:26
> To: Dan van der Ster
> Subject: Re: Re: Storage down due to MON sync very slow
>
> OK, thanks a lot! I will try it now. Hope the cluster remains responsive.
>
> I'm wondering about this approach someone brought up in your thread:
>
> Eventually I stopped one MON, tarballed its database and used that to
> bring back the MON which was upgraded to 13.2.8
>
> That worked without any hiccups. The MON joined again within a few seconds.
>
> Stopping one MON for a copy would be a much shorter storage outage than the sync I'm doing. I guess it's the entire mon data directory to copy. I always wondered if this contains data tied to a specific MON. If not, the copy approach could speed things up a lot. What do you think?
>
> Thanks again and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> Sent: 06 January 2021 20:36:15
> To: Frank Schilder
> Subject: Re: Re: Storage down due to MON sync very slow
>
> We have used mon_sync_max_payload_size 4096 on our largest, most important prod cluster since that thread.
> The PR from Sage makes something like that the default anyway (the PR counts keys rather than bytes, but the effect is the same).
>
> mon_sync_max_payload_size 4096 should not impact the speed of syncing -- it simply breaks the sync into smaller, more manageable pieces.
> (Without this, if you have lots of keys in the mon db, in our case caused by lots of rbd snapshots, then syncing will never ever complete.)
>
> -- dan
>
> On Wed, Jan 6, 2021 at 8:32 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Dan,
> >
> > thanks for that. Will it slow down or accelerate the syncing (I will read your post after this e-mail), or will it just allow I/O to continue and sync more in the background? The current value is
> >
> > mon_sync_max_payload_size 1048576
> >
> > Related to that, would building a MON store from OSDs following https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#recovery-using-osds help provide a head start? Not sure if this procedure works on an active cluster.
> >
> > Will study your thread now ...
> >
> > Thanks again and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > Sent: 06 January 2021 20:26:46
> > To: Frank Schilder
> > Subject: Re: Re: Storage down due to MON sync very slow
> >
> > (obviously just put that config in the ceph.conf on the mons if mimic
> > doesn't have ceph config... I don't quite remember.)
> >
> > -- dan
> >
> > On Wed, Jan 6, 2021 at 8:25 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > This sounds a lot like an old thread of mine:
> > > https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/M5ZKF7PTEO2OGDDY5L74EV4QS5SDCZTH/
> > >
> > > See the discussion about mon_sync_max_payload_size, and the PR that
> > > fixed this at some point in nautilus.
> > >
> > > Our workaround was:
> > >
> > > ceph config set mon mon_sync_max_payload_size 4096
> > >
> > > Hope that helps,
> > >
> > > Dan
> > >
> > >
> > > On Wed, Jan 6, 2021 at 8:18 PM Frank Schilder <frans@xxxxxx> wrote:
> > > >
> > > > Dear Dan,
> > > >
> > > > thanks for your fast response.
> > > >
> > > > Version: mimic 13.2.10.
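The two quoted suggestions above boil down to one setting. A minimal sketch of both ways to apply it, plus a check of the running value; the MON name ceph-01 is taken from this thread, while the ceph.conf path and the MON restart after editing it are assumptions:

    # Preferred where the central config database is available
    # (the thread above is unsure whether Mimic ships "ceph config"):
    ceph config set mon mon_sync_max_payload_size 4096

    # Fallback: add to /etc/ceph/ceph.conf on each MON host, then restart the MONs:
    #   [mon]
    #       mon_sync_max_payload_size = 4096

    # Verify what a running MON actually uses, via its local admin socket:
    ceph daemon mon.ceph-01 config get mon_sync_max_payload_size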
> > > >
> > > > Here is the mon_status of the "new" MON during syncing:
> > > >
> > > > [root@ceph-01 ~]# ceph daemon mon.ceph-01 mon_status
> > > > {
> > > >     "name": "ceph-01",
> > > >     "rank": 0,
> > > >     "state": "synchronizing",
> > > >     "election_epoch": 0,
> > > >     "quorum": [],
> > > >     "features": {
> > > >         "required_con": "144115188346404864",
> > > >         "required_mon": [
> > > >             "kraken",
> > > >             "luminous",
> > > >             "mimic",
> > > >             "osdmap-prune"
> > > >         ],
> > > >         "quorum_con": "0",
> > > >         "quorum_mon": []
> > > >     },
> > > >     "outside_quorum": [
> > > >         "ceph-01"
> > > >     ],
> > > >     "extra_probe_peers": [],
> > > >     "sync_provider": [],
> > > >     "sync": {
> > > >         "sync_provider": "mon.2 192.168.32.67:6789/0",
> > > >         "sync_cookie": 33302773774,
> > > >         "sync_start_version": 38355711
> > > >     },
> > > >     "monmap": {
> > > >         "epoch": 3,
> > > >         "fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> > > >         "modified": "2019-03-14 23:08:34.717223",
> > > >         "created": "2019-03-14 22:18:15.088212",
> > > >         "features": {
> > > >             "persistent": [
> > > >                 "kraken",
> > > >                 "luminous",
> > > >                 "mimic",
> > > >                 "osdmap-prune"
> > > >             ],
> > > >             "optional": []
> > > >         },
> > > >         "mons": [
> > > >             {
> > > >                 "rank": 0,
> > > >                 "name": "ceph-01",
> > > >                 "addr": "192.168.32.65:6789/0",
> > > >                 "public_addr": "192.168.32.65:6789/0"
> > > >             },
> > > >             {
> > > >                 "rank": 1,
> > > >                 "name": "ceph-02",
> > > >                 "addr": "192.168.32.66:6789/0",
> > > >                 "public_addr": "192.168.32.66:6789/0"
> > > >             },
> > > >             {
> > > >                 "rank": 2,
> > > >                 "name": "ceph-03",
> > > >                 "addr": "192.168.32.67:6789/0",
> > > >                 "public_addr": "192.168.32.67:6789/0"
> > > >             }
> > > >         ]
> > > >     },
> > > >     "feature_map": {
> > > >         "mon": [
> > > >             {
> > > >                 "features": "0x3ffddff8ffacfffb",
> > > >                 "release": "luminous",
> > > >                 "num": 1
> > > >             }
> > > >         ],
> > > >         "mds": [
> > > >             {
> > > >                 "features": "0x3ffddff8ffacfffb",
> > > >                 "release": "luminous",
> > > >                 "num": 2
> > > >             }
> > > >         ],
> > > >         "client": [
> > > >             {
> > > >                 "features": "0x2f018fb86aa42ada",
> > > >                 "release": "luminous",
> > > >                 "num": 1
> > > >             },
> > > >             {
> > > >                 "features": "0x3ffddff8eeacfffb",
> > > >                 "release": "luminous",
> > > >                 "num": 1
> > > >             },
> > > >             {
> > > >                 "features": "0x3ffddff8ffacfffb",
> > > >                 "release": "luminous",
> > > >                 "num": 17
> > > >             }
> > > >         ]
> > > >     }
> > > > }
> > > >
> > > > I'm a bit surprised that the other 2 MONs don't remain in quorum until this MON has caught up. Is there any way to monitor the syncing progress? Right now I need to interrupt regularly to allow some I/O, but I have no clue how long I need to wait.
> > > >
> > > > Thanks for your help!
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Dan van der Ster <dan@xxxxxxxxxxxxxx>
> > > > Sent: 06 January 2021 20:16:44
> > > > To: Frank Schilder
> > > > Cc: Ceph Users
> > > > Subject: Re: Re: Storage down due to MON sync very slow
> > > >
> > > > Which version of Ceph are you running?
> > > >
> > > > .. dan
> > > >
> > > >
> > > > On Wed, Jan 6, 2021, 8:14 PM Frank Schilder <frans@xxxxxx> wrote:
> > > > In the output of the MON I see slow ops warnings:
> > > >
> > > > debug 2021-01-06 20:12:48.854 7f1a3d29f700 -1 mon.ceph-01@0(synchronizing) e3 get_health_metrics reporting 20 slow ops, oldest is log(1 entries from seq 1 at 2021-01-06 20:00:12.014861)
> > > >
> > > > There appears to be no progress on this operation, it is stuck.
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Frank Schilder <frans@xxxxxx>
> > > > Sent: 06 January 2021 20:11:25
> > > > To: ceph-users@xxxxxxx
> > > > Subject: Storage down due to MON sync very slow
> > > >
> > > > Dear all,
> > > >
> > > > I had to restart one out of 3 MONs on an empty MON DB dir. It is in state syncing right now, but I'm not sure if there is any progress. The cluster is completely unresponsive even though I have 2 healthy MONs. Is there any way to sync the DB directory faster and/or without downtime?
> > > >
> > > > Thanks a lot!
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
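
One open question above was how to monitor the syncing progress. A minimal sketch of one way to watch it, assuming shell access to the syncing MON's host, the MON name ceph-01 as in this thread, and jq installed; the 10-second interval is arbitrary:

    # Poll the syncing MON's state and sync position via its admin socket
    while sleep 10; do
        ceph daemon mon.ceph-01 mon_status | jq '{state, sync}'
    done

Once the reported state is no longer "synchronizing" and eventually shows "peon" or "leader", the MON has caught up and rejoined the quorum.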