Re: 18.2.2 dashboard really messed up.

Thanks!  Oddly, all the dashboard checks you suggested come back normal, yet the dashboard remains broken.

Even before applying your suggestion about the dashboard setting, I already had this result:

root@noc3:~# ceph dashboard get-prometheus-api-host
http://noc3.1.quietfountain.com:9095
root@noc3:~# netstat -6nlp | grep 9095
tcp6       0      0 :::9095                :::*                    LISTEN      80963/prometheus
root@noc3:~#

To confirm the setting is actually being used, I changed it to something random; the browser pointed at the dashboard then reported it could not connect.  The error went away when I restored the value above.  But the graphs remain empty, showing only the numbers 1 and 0.5 on each.
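One more check that should be possible (this is just the generic Prometheus HTTP API, nothing dashboard-specific, and the hostname is simply the one from my config above) is to query Prometheus directly and confirm it actually holds ceph metrics:

# Does Prometheus answer queries at all?
curl -s 'http://noc3.1.quietfountain.com:9095/api/v1/query?query=up'

# Does it know about any ceph_* series? (lists stored metric names, filtered)
curl -s 'http://noc3.1.quietfountain.com:9095/api/v1/label/__name__/values' | grep -o '"ceph_[^"]*"' | head

If those come back empty, the dashboard would have nothing to graph even with the correct API host configured.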

Regarding the used storage: the overall usage is 43.6 of 111 TiB, which seems a long way from the warning trigger points of 85% and 95%.  The default ratio values are in use, and all the OSDs sit between 37% and 42% usage.  What am I missing?
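For reference, here is how I am reading those numbers (standard commands; the defaults I quote are the documented ones, so treat the exact figures as my assumption):

# Show the nearfull/full ratios the capacity chart thresholds come from
# (documented defaults: nearfull_ratio 0.85, full_ratio 0.95).
ceph osd dump | grep ratio

# Per-OSD utilization; every OSD here reports roughly 37-42% use,
# nowhere near the 85% nearfull threshold.
ceph osd df

So the overall usage works out to roughly 39% (43.6 of 111 TiB), well below both thresholds.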

Thanks!



On 3/12/24 02:07, Nizamudeen A wrote:
Hi,

The warning and danger indicators in the capacity chart point to the nearfull and full ratios set on the cluster, and the default values for them are 85% and 95% respectively. You can run `ceph osd dump | grep ratio` to see them.

When this was introduced, there was a blog post <https://ceph.io/en/news/blog/2023/landing-page/#capacity-card> explaining how these ratios are mapped onto the chart. When your used storage crosses the 85% mark, the chart is colored yellow to warn the user, and when it crosses 95% (or the full ratio) the chart is colored red. That doesn't mean the cluster is in bad shape; it's a visual indicator that you
are running out of storage.

Regarding the Cluster Utilization chart, it gets its metrics directly from Prometheus so that it can show time-series data in the UI rather than only the metrics at the current point in time (which is what was used before). So if you have Prometheus configured for the dashboard and its URL is provided in the dashboard settings via `ceph dashboard set-prometheus-api-host <url-of-prometheus>`,
then you should be able to see the metrics.
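For example, a minimal sketch of the wiring (the hostname is just a placeholder, and the ssl-verify step only matters if Prometheus sits behind a self-signed certificate):

# Point the dashboard at the Prometheus HTTP API
ceph dashboard set-prometheus-api-host http://<prometheus-host>:9095

# Confirm what the dashboard currently has configured
ceph dashboard get-prometheus-api-host

# Optional: skip TLS verification for a self-signed Prometheus endpoint
ceph dashboard set-prometheus-api-ssl-verify False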

In case you want to read more about the new page, you can check the docs here <https://docs.ceph.com/en/latest/mgr/dashboard/#overview-of-the-dashboard-landing-page>.

Regards,
Nizam



On Mon, Mar 11, 2024 at 11:47 PM Harry G Coin <hgcoin@xxxxxxxxx> wrote:

    Looking at ceph -s, all is well.  Looking at the dashboard, 85% of my
    capacity is 'warned', and 95% is 'in danger'.  There is no hint given
    as to the nature of the danger or reason for the warning.  Though
    apparently with merely 5% of my ceph world 'normal', the cluster
    reports 'ok'.  Which, you know, seems contradictory.  I've used just
    under 40% of capacity.

    Further down the dashboard, all the subsections of 'Cluster
    Utilization' are '1' and '0.5' with nothing whatever in the graphics
    area.

    Previous versions of ceph presented a normal dashboard.

    It's just a little half rack, 5 hosts, a few physical drives each,
    been running ceph for a couple of years now.  Orchestrator is
    cephadm.  It's just about as 'plain vanilla' as it gets.  I've had to
    mute one alert, because cephadm refresh aborts when it finds drives
    on any host that have nothing to do with ceph and don't have a
    blkid_ip 'TYPE' key.  That seems unrelated to a totally messed up
    dashboard.  (The tracker for that is here:
    https://tracker.ceph.com/issues/63502 ).

    Any idea what the steps are to get useful stuff back on the
    dashboard?  Any idea where I can learn what my 85% danger and 95%
    warning are 'about'?  (You'd think 'danger' (the volcano is blowing
    up now!) would be worse than 'warning' (the volcano might blow up
    soon), so how can warning + danger > 100%, or if not additive, how
    can warning < danger?)

      Here's a bit of detail:

    root@noc1:~# ceph -s
      cluster:
        id:     4067126d-01cb-40af-824a-881c130140f8
        health: HEALTH_OK
                (muted: CEPHADM_REFRESH_FAILED)

      services:
        mon: 5 daemons, quorum noc4,noc2,noc1,noc3,sysmon1 (age 70m)
        mgr: noc2.yhyuxd(active, since 82m), standbys: noc4.tvhgac,
    noc3.sybsfb, noc1.jtteqg
        mds: 1/1 daemons up, 3 standby
        osd: 27 osds: 27 up (since 20m), 27 in (since 2d)

      data:
        volumes: 1/1 healthy
        pools:   16 pools, 1809 pgs
        objects: 12.29M objects, 17 TiB
        usage:   44 TiB used, 67 TiB / 111 TiB avail
        pgs:     1793 active+clean
                 9    active+clean+scrubbing
                 7    active+clean+scrubbing+deep

      io:
        client:   5.6 MiB/s rd, 273 KiB/s wr, 41 op/s rd, 58 op/s wr

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



