I've finally solved this. There has been a change in behaviour in 17.2.6.
For cluster 2 (the one that failed):
* When they were built the hosts were configured with a hostname
without a domain (so hostname returned a short name)
* The hosts as reported by ceph all had short hostnames
* In ceph.conf each of the RGWs has a section like:
    [client.rgw.host1]
    host = host1
    rgw frontends = "beast port=80"
    rgw dns name = host1.my.domain
    rgw_crypt_require_ssl = false
* The dashboard connections to the RGW servers all had a Host header
of the FQDN as specified in ceph.conf (observed using tcpdump)
* The RGW processes allowed the connections based on knowledge of
their own FQDN
But after the upgrade:
* The dashboard connections to the RGW all have a Host header of the
short host name (observed using tcpdump)
* The RGW processes are rejecting the requests, as the short name doesn't match their FQDN
* By adding the short names to the zonegroup "hostnames" it now works
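The workaround in the last bullet amounts to editing the zonegroup JSON so that its "hostnames" list contains the short names as well. A minimal sketch of that edit, assuming the zonegroup has been dumped to JSON first; the function name add_hostnames and the sample values are hypothetical, and only the "hostnames" key comes from the description above:

```python
from typing import Iterable


def add_hostnames(zonegroup: dict, extra: Iterable[str]) -> dict:
    """Return a copy of the zonegroup dict with extra names appended to "hostnames".

    Existing entries are kept in order and duplicates are not added.
    """
    updated = dict(zonegroup)
    hostnames = list(updated.get("hostnames", []))
    for name in extra:
        if name not in hostnames:
            hostnames.append(name)
    updated["hostnames"] = hostnames
    return updated


# Hypothetical, heavily trimmed zonegroup document for illustration only.
sample = {"name": "default", "hostnames": ["host1.my.domain"]}
patched = add_hostnames(sample, ["host1", "host1.my.domain"])
```

The dump and reload would typically be done with "radosgw-admin zonegroup get" and "radosgw-admin zonegroup set", followed by a period commit (if a realm is configured) and an RGW restart; check the exact invocation against the documentation for your release.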
Cluster 1 (which didn't fail) had been built with FQDN hostnames, so
were still supplying an FQDN in the Host headers.
So my hypothesis is that in 17.2.6 the dashboard no longer honours the
"rgw dns name" field in ceph.conf. There may be some other subtleties
but that's my best guess.
If you were running TLS to the RGWs, that may well be sufficient to
cause certificate name mismatches too unless the certificate SANs
contained the short names. I guess you would hit that first, masking the
other problem.
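To illustrate the TLS point: a certificate issued only for the FQDN would not cover a client that connects using the short name. The sketch below is a simplified name check (exact match plus a single leftmost wildcard label), not a full implementation of certificate hostname verification; the function names and the example SAN lists are hypothetical.

```python
def san_matches(hostname: str, san: str) -> bool:
    """Simplified check of one hostname against one SAN entry."""
    host_labels = hostname.lower().rstrip(".").split(".")
    san_labels = san.lower().rstrip(".").split(".")
    if len(host_labels) != len(san_labels):
        return False
    # A leading "*" stands in for exactly one leftmost label.
    if san_labels[0] == "*" and len(san_labels) > 1:
        return host_labels[1:] == san_labels[1:]
    return host_labels == san_labels


def san_covers(hostname: str, sans: list) -> bool:
    """True if any SAN entry in the list matches the hostname."""
    return any(san_matches(hostname, san) for san in sans)


# Hypothetical certificates: one with the FQDN only, one with both names.
fqdn_only = ["host1.my.domain"]
both_names = ["host1.my.domain", "host1"]
short_ok_before = san_covers("host1", fqdn_only)
short_ok_after = san_covers("host1", both_names)
```

With the FQDN-only list the short name is not covered, which is the mismatch described above; adding the short name as a SAN would be one way around it.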
Although cluster 2 should probably have been configured with FQDN
hostnames I do still think this is a regression. The "rgw dns name"
field should be honoured.
Thanks, Chris
On 13/04/2023 17:20, Chris Palmer wrote:
Hi
I have 3 Ceph clusters, all configured similarly, which have been
happy for some months on 17.2.5:
1. A test cluster
2. A small production cluster
3. A larger production cluster
All run Debian 11 and were installed from packages (no cephadm).
I upgraded (1) to 17.2.6 without any problems at all. In particular
the Object Gateway sections of the dashboard work as usual.
I then upgraded (2). Nothing seemed amiss, and everything seems to
work except... when I try to access the Object Gateway sections of the
dashboard I always get:
*The Object Gateway Service is not configured*
Error connecting to Object Gateway: RGW REST API failed request
with status code 403
(b'{"Code":"SignatureDoesNotMatch","RequestId":"tx0000022ba920e82ac4a9c-0064381'
b'934-10e73385-default","HostId":"10e73385-default-default"}')
(Just the RequestId changes each time). Before the upgrade it worked
just fine.
Other info:
* RGW requests using awscli and rclone all work with normal RGW
accounts. It just seems to be the dashboard that's died.
* Just the one zonegroup, no multisite/replication
* "radosgw-admin user info --uid=rgwadmin" gives the correct output
with the right access_key & secret_key. The other fields are as in
(1).
* "ceph dashboard get-rgw-api-access-key/get-rgw-api-secret-key" both
give the right values.
The RGW logs from (2), which fails, show:
2023-04-13T16:36:28.720+0100 7fcc7966a700 1 ====== starting new request req=0x7fcd88c10720 =====
2023-04-13T16:36:28.720+0100 7fcc80e79700 1 req 8090309398268968541 0.000000000s op->ERRORHANDLER: err_no=-2027 new_err_no=-2027
2023-04-13T16:36:28.724+0100 7fcc80e79700 1 ====== req done req=0x7fcd88c10720 op status=0 http_status=403 latency=0.003999980s ======
2023-04-13T16:36:28.724+0100 7fcc80e79700 1 beast: 0x7fcd88c10720: 192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] "GET /admin/metadata/user?myself HTTP/1.1" 403 134 - "python-requests/2.25.1" - latency=0.003999980s
(Note this does not have rgwadmin as the user, and is always the same
URL)
Whereas the RGW logs from (1), which works, show things like:
2023-04-13T15:44:19.396+0000 7f8478da1700 1 ====== starting new request req=0x7f86284f5720 =====
2023-04-13T15:44:19.412+0000 7f8478da1700 1 ====== req done req=0x7f86284f5720 op status=0 http_status=200 latency=0.016000060s ======
2023-04-13T15:44:19.412+0000 7f8478da1700 1 beast: 0x7f86284f5720: 10.xx.xx.xx - rgwadmin [13/Apr/2023:15:44:19.396 +0000] "GET /admin/realm?list HTTP/1.1" 200 31 - "python-requests/2.25.1" - latency=0.016000060s
(Note this has rgwadmin as the user, and various URLs)
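For comparing the two clusters it can help to pull the user, request and status out of the "beast:" access lines. A rough sketch, assuming each log entry is on a single line and follows the layout shown in the two excerpts above; the regular expression and the function name parse_beast_line are my own and are not part of Ceph:

```python
import re

# Pattern derived from the log excerpts above; other RGW versions or
# log settings may lay the line out differently.
BEAST_LINE = re.compile(
    r'beast: \S+: (?P<client>\S+) - (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_beast_line(line: str):
    """Return a dict of fields from a beast access line, or None if no match."""
    match = BEAST_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    # A "-" in the user position means no user was recorded for the request.
    if fields["user"] == "-":
        fields["user"] = None
    return fields


failing = parse_beast_line(
    '2023-04-13T16:36:28.724+0100 7fcc80e79700 1 beast: 0x7fcd88c10720: '
    '192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] '
    '"GET /admin/metadata/user?myself HTTP/1.1" 403 134 - '
    '"python-requests/2.25.1" - latency=0.003999980s'
)
working = parse_beast_line(
    '2023-04-13T15:44:19.412+0000 7f8478da1700 1 beast: 0x7f86284f5720: '
    '10.xx.xx.xx - rgwadmin [13/Apr/2023:15:44:19.396 +0000] '
    '"GET /admin/realm?list HTTP/1.1" 200 31 - '
    '"python-requests/2.25.1" - latency=0.016000060s'
)
```

Filtering the parsed entries for a missing user and a 403 status would pick out the failing dashboard requests on (2).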
The only thing I can see in the release notes that looks even vaguely
related is https://github.com/ceph/ceph/pull/47547, but it doesn't
seem likely.
I am really stumped on this, with no idea what has gone wrong on (2),
and what the difference is between (1) and (2). I'm not going to touch
(3) until I have resolved this.
Grateful for any help...
And thanks for all the good work.
Regards, Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx