Greetings group!

We recently reloaded a cluster from scratch using cephadm and Reef, and it came up with no issues. We then decided to upgrade two existing cephadm clusters that were on Quincy. Both came up just fine, but the Grafana graphs on both clusters (which were working before the upgrade) are now blank.

Prometheus is firing a PrometheusJobMissing alert that states the following:

  The prometheus job that scrapes from Ceph is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance.
  summary: The scrape job for Ceph is missing from Prometheus

When I look at the prometheus.yml file on the performance monitoring node, this is what is there (I replaced the IP with x.x.x.x):

global:
  scrape_interval: 10s
  evaluation_interval: 10s

rule_files:
  - /etc/prometheus/alerting/*

alerting:
  alertmanagers:
    - scheme: http
      http_sd_configs:
        - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=alertmanager

scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=mgr-prometheus

  - job_name: 'node'
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=node-exporter

  - job_name: 'ceph-exporter'
    honor_labels: true
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=ceph-exporter

When I run "netstat -ntlp" on the active mgr node, I see port 8765 being used by docker. However, when I try to access the URLs listed in prometheus.yml from the Chrome browser, the page times out. If I do the same against the active manager on the cluster that was installed from scratch (not upgraded), each URL returns output (different for each URL). So it appears to me that the service discovery function is not working after an upgrade from Quincy.

Also, the ceph-exporter service was not installed on the cluster during the upgrade process. I added the service manually when I noticed it was missing (while comparing the from-scratch cluster to the upgraded cluster).

Not sure if this will help or is even related, but I saw this in the cephadm log:

2023-11-15T04:22:30.789998+0000 mgr.CEPH-MON-01.mlmups (mgr.144601) 753 : cephadm 4 host CEPH-MON-02 `cephadm gather-facts` failed: Cannot decode JSON: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1425, in _run_cephadm_json
    return json.loads(''.join(out))
  File "/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is there any way to fix the service discovery?

Thanks!
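P.S. In case it helps anyone reproduce or compare, below is roughly how I have been checking the upgraded clusters against the from-scratch one from the command line. The curl call is just the shell equivalent of the browser test I described above (same x.x.x.x URL from prometheus.yml), and the mgr failover / redeploy at the end is only what I am considering trying next, not something I have confirmed actually fixes the service discovery.

# Query the cephadm service-discovery endpoint directly; in my browser test
# this times out on the upgraded clusters but returns scrape targets on the
# from-scratch cluster.
curl -s 'http://x.x.x.x:8765/sd/prometheus/sd-config?service=mgr-prometheus'

# Check which mgr is active and which monitoring services cephadm has deployed.
ceph mgr services
ceph orch ls
ceph orch ps

# Candidate next steps (unverified): fail over the active mgr so cephadm's
# service-discovery server is restarted, then push a fresh prometheus config.
ceph mgr fail
ceph orch redeploy prometheus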
-Brent

Existing Clusters:
Test: Reef 18.2.0 (all virtual on nvme)
US Production (HDD): Reef 18.2.0 with 11 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
UK Production (HDD): Nautilus 14.2.22 with 18 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
US Production (SSD): Reef 18.2.0 cephadm with 6 osd servers, 5 mons, 4 gateways
UK Production (SSD): Reef 18.2.0 cephadm with 7 osd servers, 5 mons, 4 gateways