I think I see something similar on a Pacific cluster: the alertmanager
doesn't seem to be aware of a mgr failover. One of the active alerts
is CephMgrPrometheusModuleInactive, stating:
The mgr/prometheus module at storage04.fqdn:9283 is unreachable.
...
That is true, because the active mgr failed over two days ago. I found
this tracker issue [1] which, according to the PR [2], apparently
disabled a redirect:
Prevent Alertmanager alerts from being redirected to the active mgr
dashboard instance. There are two reasons for it:
- It doesn't bring any additional benefit. The Alertmanager config
  includes all available mgr instances - active and passive ones. In
  case of an alert, it will be sent to all of them. This ensures that
  the active mgr dashboard will receive the alert in any case.
- The redirect URL includes the mgr IP and NOT the FQDN. This leads
  to issues in environments where an SSL certificate is configured
  and matches only the FQDNs.
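For reference, the cephadm-generated alertmanager.yml should then list
one webhook URL per mgr under the ceph-dashboard receiver. A quick
sketch to dump them (the config path is only my assumption for a
cephadm deployment; adjust fsid and hostname for your cluster):

import yaml  # PyYAML

# Path is an assumption for cephadm deployments; adjust <fsid>/<hostname>.
CONF = "/var/lib/ceph/<fsid>/alertmanager.<hostname>/etc/alertmanager/alertmanager.yml"

with open(CONF) as f:
    cfg = yaml.safe_load(f)

for receiver in cfg.get("receivers", []):
    if receiver.get("name") == "ceph-dashboard":
        for wh in receiver.get("webhook_configs", []):
            # One URL per mgr instance is expected, active and passive.
            print(wh.get("url"))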
So this seems to be expected, I guess. But should the log be flooded
with these messages in that case? Seems unnecessary to me.
I also found [3] with slightly different error messages; there I don't
see the "unexpected status code 500":
Jul 27 13:08:34 storage06 conmon[241445]: level=warn
ts=2023-07-27T11:08:34.288Z caller=notify.go:674 component=dispatcher
receiver=ceph-dashboard integration=webhook[3] msg="Notify attempt
failed, will retry later" attempts=1 err="Post
\"https://storage04.fqdn:8443/api/prometheus_receiver\": dial tcp
<IP>:8443: i/o timeout"
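To double-check which mgr is currently active versus the host
Alertmanager is still targeting, a quick sketch (I'm assuming the JSON
output of "ceph mgr stat" exposes the active mgr name as "active_name";
field names may differ by release):

import json, subprocess

# Ask the cluster which mgr is active right now and compare it with the
# host that Alertmanager keeps trying to notify (storage04 above).
stat = json.loads(subprocess.check_output(
    ["ceph", "mgr", "stat", "--format", "json"]))
print("active mgr:", stat.get("active_name"))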
Regards,
Eugen
[1] https://tracker.ceph.com/issues/56401
[2] https://github.com/ceph/ceph/pull/47011
[3] https://tracker.ceph.com/issues/61256
Quoting Robert Sander <r.sander@xxxxxxxxxxxxxxxxxxx>:
Hi,
we noticed a strange error message in the logfiles:
The alert-manager deployed with cephadm receives an HTTP 500 error from
the inactive MGR when trying to call the URI /api/prometheus_receiver:
Jul 25 09:35:25 alert-manager conmon[2426]: level=error
ts=2023-07-25T07:35:25.171Z caller=dispatch.go:354
component=dispatcher msg="Notify for alerts failed" num_alerts=45
err="ceph-dashboard/webhook[0]: notify retry canceled after 7
attempts: unexpected status code 500:
https://mgr001.example.net:8443/api/prometheus_receiver;
ceph-dashboard/webhook[2]: notify retry canceled after 8 attempts:
unexpected status code 500:
https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn
ts=2023-07-25T07:35:25.175Z caller=notify.go:724
component=dispatcher receiver=ceph-dashboard integration=webhook[2]
msg="Notify attempt failed, will retry later" attempts=1
err="unexpected status code 500:
https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:25 alert-manager conmon[2426]: level=warn
ts=2023-07-25T07:35:25.177Z caller=notify.go:724
component=dispatcher receiver=ceph-dashboard integration=webhook[0]
msg="Notify attempt failed, will retry later" attempts=1
err="unexpected status code 500:
https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=error
ts=2023-07-25T07:35:35.171Z caller=dispatch.go:354
component=dispatcher msg="Notify for alerts failed" num_alerts=45
err="ceph-dashboard/webhook[2]: notify retry canceled after 7
attempts: unexpected status code 500:
https://mgr003.example.net:8443/api/prometheus_receiver;
ceph-dashboard/webhook[0]: notify retry canceled after 8 attempts:
unexpected status code 500:
https://mgr001.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn
ts=2023-07-25T07:35:35.176Z caller=notify.go:724
component=dispatcher receiver=ceph-dashboard integration=webhook[2]
msg="Notify attempt failed, will retry later" attempts=1
err="unexpected status code 500:
https://mgr003.example.net:8443/api/prometheus_receiver"
Jul 25 09:35:35 alert-manager conmon[2426]: level=warn
ts=2023-07-25T07:35:35.176Z caller=notify.go:724
component=dispatcher receiver=ceph-dashboard integration=webhook[0]
msg="Notify attempt failed, will retry later" attempts=1
err="unexpected status code 500:
https://mgr001.example.net:8443/api/prometheus_receiver"
This is from the log file of mgr002, which was passive first and then
became active. After it became active, the errors were gone on this MGR
but showed up on the newly passive MGR.
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B]
[581dce66-9c65-4e84-a41a-8d72b450791e] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B]
[26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[b'{"status": "500 Internal Server Error", "detail": "The server
encountered an unexpected condition which prevented it from
fulfilling the request.", "request_id":
"26e1854a-3b93-49c4-8afc-1a96426a3dab"}
']
Jul 25 09:25:25 mgr002 ceph-mgr[1841]: [dashboard INFO request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B]
[26e1854a-3b93-49c4-8afc-1a96426a3dab] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.001s] [513.0B]
[46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[b'{"status": "500 Internal Server Error", "detail": "The server
encountered an unexpected condition which prevented it from
fulfilling the request.", "request_id":
"46d7e78c-49d5-4652-9877-973129ad3977"}
']
Jul 25 09:25:26 mgr002 ceph-mgr[1841]: [dashboard INFO request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B]
[46d7e78c-49d5-4652-9877-973129ad3977] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B]
[a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard ERROR request]
[b'{"status": "500 Internal Server Error", "detail": "The server
encountered an unexpected condition which prevented it from
fulfilling the request.", "request_id":
"a9b25e54-f1e1-42eb-90b2-af5aa22769cf"}
']
Jul 25 09:25:27 mgr002 ceph-mgr[1841]: [dashboard INFO request]
[::ffff:10.54.226.222:49904] [POST] [500] [0.002s] [513.0B]
[a9b25e54-f1e1-42eb-90b2-af5aa22769cf] /api/prometheus_receiver
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map Activating!
Jul 25 09:25:28 mgr002 ceph-mgr[1841]: mgr handle_mgr_map I am now activating
We also have a test cluster running version 17.2.6 where this does not
happen. In that test cluster the passive MGRs return HTTP 204 when the
alert-manager tries to request /api/prometheus_receiver.
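For comparison, a minimal sketch to probe the receiver endpoint on each
mgr directly and print the status code (the empty payload and the
disabled certificate verification are assumptions for the test; the
hostnames are our mgr hosts):

import json, ssl, urllib.error, urllib.request

mgrs = ["mgr001.example.net", "mgr002.example.net", "mgr003.example.net"]
payload = json.dumps({"alerts": []}).encode()  # minimal body, not a full Alertmanager payload

# The dashboard often runs with a self-signed certificate, so skip verification for the test.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

for host in mgrs:
    req = urllib.request.Request(
        f"https://{host}:8443/api/prometheus_receiver",
        data=payload, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, context=ctx, timeout=5) as resp:
            print(host, resp.status)   # passive MGRs on the test cluster answer 204 here
    except urllib.error.HTTPError as e:
        print(host, e.code)            # on the affected cluster the passive MGRs answer 500
    except OSError as e:
        print(host, "unreachable:", e)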
What is happening here?
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx