Greetings group!

We recently reloaded a cluster from scratch using cephadm and Reef, and it came up with no issues. We then decided to upgrade two existing cephadm clusters that were on Quincy. Both came up just fine, but the Grafana graphs on both clusters (which were working before the upgrade) are now blank.

Prometheus is firing a PrometheusJobMissing alert that states the following:

  The prometheus job that scrapes from Ceph is no longer defined, this will effectively mean you'll have no metrics or alerts for the cluster. Please review the job definitions in the prometheus.yml file of the prometheus instance.
  summary: The scrape job for Ceph is missing from Prometheus

When I look at the prometheus.yml file on the performance monitoring node, this is what is there (I replaced the IP with x.x.x.x):

global:
  scrape_interval: 10s
  evaluation_interval: 10s

rule_files:
  - /etc/prometheus/alerting/*

alerting:
  alertmanagers:
    - scheme: http
      http_sd_configs:
        - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=alertmanager

scrape_configs:
  - job_name: 'ceph'
    honor_labels: true
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=mgr-prometheus

  - job_name: 'node'
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=node-exporter

  - job_name: 'ceph-exporter'
    honor_labels: true
    http_sd_configs:
      - url: http://x.x.x.x:8765/sd/prometheus/sd-config?service=ceph-exporter

When I run "netstat -ntlp" on the active mgr node, I see port 8765 being used by docker. However, when I try to access the URLs listed in prometheus.yml from the Chrome browser, the page times out. If I do the same against the active manager on the cluster that was installed from scratch (not upgraded), each URL returns output (different for each URL). So it appears to me that the service discovery function is not working after an upgrade from Quincy.

Also, the ceph-exporter service was not installed on the cluster during the upgrade process. I added the service manually when I noticed it was missing (while comparing the from-scratch cluster to the upgraded cluster).

Not sure if this will help or is even related, but I saw this in the cephadm log:

2023-11-15T04:22:30.789998+0000 mgr.CEPH-MON-01.mlmups (mgr.144601) 753 : cephadm 4 host CEPH-MON-02 `cephadm gather-facts` failed: Cannot decode JSON: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1425, in _run_cephadm_json
    return json.loads(''.join(out))
  File "/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is there any way to fix the service discovery?

Thanks!
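P.S. In case it helps anyone reproduce or compare, below is roughly how I have been checking the upgraded clusters against the from-scratch one from the command line. The curl call is just the shell equivalent of the browser test I described above (same x.x.x.x URL from prometheus.yml), and the mgr failover / redeploy at the end is only what I am considering trying next, not something I have confirmed actually fixes the service discovery.

# Query the cephadm service-discovery endpoint directly; in my browser test
# this times out on the upgraded clusters but returns scrape targets on the
# from-scratch cluster.
curl -s 'http://x.x.x.x:8765/sd/prometheus/sd-config?service=mgr-prometheus'

# Check which mgr is active and which monitoring services cephadm has deployed.
ceph mgr services
ceph orch ls
ceph orch ps

# Candidate next steps (unverified): fail over the active mgr so cephadm's
# service-discovery server is restarted, then push a fresh prometheus config.
ceph mgr fail
ceph orch redeploy prometheus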
-Brent

Existing Clusters:
Test: Reef 18.2.0 (all virtual on nvme)
US Production (HDD): Reef 18.2.0 with 11 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
UK Production (HDD): Nautilus 14.2.22 with 18 osd servers, 3 mons, 4 gateways, 2 iscsi gateways
US Production (SSD): Reef 18.2.0 cephadm with 6 osd servers, 5 mons, 4 gateways
UK Production (SSD): Reef 18.2.0 cephadm with 7 osd servers, 5 mons, 4 gateways