Re: Manager carries wrong information until killing it

I don't have a solution to offer, but I've seen this for years.
Any time a MGR bounces, be it for upgrades, a new daemon coming online, etc., I'll see a stats spike like the one reported below.

Just out of curiosity, which MGR plugins are you using?
I have historically used the influx plugin for stats export, and the spike shows up in those values as well, throwing everything off.
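
If you want to compare, this is roughly how I check what's loaded on my end (exact output varies by release, and the config key name is from memory, so double-check it):

  # list enabled/available mgr modules
  ceph mgr module ls
  # show any influx module settings that have been set
  ceph config dump | grep mgr/influx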

I don't see it in my Zabbix stats, though those are scraped at a longer interval that may not catch this.
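
For what it's worth, if anyone is using the zabbix mgr module rather than an external agent, its send interval should be visible with something like the following (command name from memory, so verify against your release):

  ceph zabbix config-show    # prints zabbix_host, interval, etc.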

Just looking for any common threads.
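
FWIW, when I need to clear the bogus values I don't usually kill the daemon outright; forcing a failover to a standby does the same thing. Something along these lines, substituting whatever your active mgr is (e.g. server2 in the output below):

  # ask the mons to fail the active mgr over to a standby
  ceph mgr fail server2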

Reed

> On May 4, 2021, at 3:46 AM, Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:
> 
> 
> Hello,
> 
> we have a recurring, funky problem with managers on Nautilus (and
> probably also earlier versions): the manager displays incorrect
> information.
> 
> This is a recurring pattern, and it also breaks the Prometheus graphs, as
> the I/O is reported wildly incorrectly: "recovery: 43 TiB/s, 3.62k
> keys/s, 11.40M objects/s" - which basically makes the scale of any
> related graph unusable.
> 
> The latest example from today shows slow ops for an OSD
> that has been down for 17h:
> 
> --------------------------------------------------------------------------------
> [09:50:31] black2.place6:~# ceph -s
>  cluster:
>    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>    health: HEALTH_WARN
>            18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
> 
>  services:
>    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>    mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
>    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
> 
>  data:
>    pools:   4 pools, 2624 pgs
>    objects: 42.52M objects, 162 TiB
>    usage:   486 TiB used, 298 TiB / 784 TiB avail
>    pgs:     2616 active+clean
>             8    active+clean+scrubbing+deep
> 
>  io:
>    client:   522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
> --------------------------------------------------------------------------------
> 
> Killing the manager on server2 changes the status to another temporarily
> incorrect one: it still shows a rebalance that actually finished hours
> ago, paired with the incorrect rebalance speed that we see from time to time:
> 
> --------------------------------------------------------------------------------
> [09:51:59] black2.place6:~# ceph -s
>  cluster:
>    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>    health: HEALTH_OK
> 
>  services:
>    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>    mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
>    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
> 
>  data:
>    pools:   4 pools, 2624 pgs
>    objects: 42.52M objects, 162 TiB
>    usage:   486 TiB used, 298 TiB / 784 TiB avail
>    pgs:     2616 active+clean
>             8    active+clean+scrubbing+deep
> 
>  io:
>    client:   214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
>    recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s
> 
>  progress:
>    Rebalancing after osd.53 marked out
>      [========================......]
> --------------------------------------------------------------------------------
> 
> Then a bit later, the status on the newly started manager is correct:
> 
> --------------------------------------------------------------------------------
> [09:52:18] black2.place6:~# ceph -s
>  cluster:
>    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>    health: HEALTH_OK
> 
>  services:
>    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>    mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
>    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
> 
>  data:
>    pools:   4 pools, 2624 pgs
>    objects: 42.52M objects, 162 TiB
>    usage:   486 TiB used, 298 TiB / 784 TiB avail
>    pgs:     2616 active+clean
>             8    active+clean+scrubbing+deep
> 
>  io:
>    client:   422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
> --------------------------------------------------------------------------------
> 
> Question: is this a known bug? Is anyone else seeing it, or are we doing
> something wrong?
> 
> Best regards,
> 
> Nico
> 
> --
> Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


