Re: 17.2.6 Dashboard/RGW Signature Mismatch

Chris Palmer <chris.palmer@xxxxxxxxx> · Fri, 14 Apr 2023 16:48:40 +0100

I've finally solved this. There has been a change in behaviour in 17.2.6.

For cluster 2 (the one that failed):

 * When they were built the hosts were configured with a hostname
   without a domain (so hostname returned a short name)
 * The hosts as reported by ceph all had short hostnames
 * In ceph.conf each of the RGWs has a section like:

[client.rgw.host1]
    host = host1
    rgw frontends = "beast port=80"
    rgw dns name = host1.my.domain
    rgw_crypt_require_ssl = false

 * The dashboard connections to the RGW servers all had a Host header
   of the FQDN as specified in ceph.conf (observed using tcpdump)
 * The RGW processes allowed the connections based on knowledge of
   their own FQDN
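
For anyone wanting to check whether their own clusters are in the same position, here is a rough Python sketch that reads a ceph.conf and lists the RGW sections where "host" is a short name but "rgw dns name" is an FQDN. The function names and the sample text are mine, purely for illustration, and it assumes the file is plain enough for Python's configparser to read.

```python
import configparser

# Sample taken from the section shown above.
SAMPLE_CONF = """
[client.rgw.host1]
    host = host1
    rgw frontends = "beast port=80"
    rgw dns name = host1.my.domain
    rgw_crypt_require_ssl = false
"""


def rgw_names(conf_text):
    """Return {section: (host, rgw dns name)} for every client.rgw.* section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    result = {}
    for section in parser.sections():
        if not section.startswith("client.rgw."):
            continue
        # Ceph treats spaces and underscores in option names alike, so normalise.
        opts = {k.replace("_", " "): v.strip('"') for k, v in parser.items(section)}
        result[section] = (opts.get("host"), opts.get("rgw dns name"))
    return result


def short_name_sections(conf_text):
    """Sections where `host` is a short name but `rgw dns name` is an FQDN."""
    return [
        section
        for section, (host, dns) in rgw_names(conf_text).items()
        if host and dns and "." not in host and "." in dns
    ]


print(short_name_sections(SAMPLE_CONF))
```

Any section it reports is one where the Host header sent by the dashboard could differ from the name the RGW expects.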

But after the upgrade:

 * The dashboard connections to the RGW all have a Host header of the
   short host name (observed using tcpdump)
 * The RGW processes are disallowing it as it doesn't match their FQDN
 * By adding the short names to the zonegroup "hostnames" it now works
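
As a sketch of that workaround: the zonegroup JSON (as printed by "radosgw-admin zonegroup get") has a "hostnames" list, and the short names can be merged into it before writing it back. The helper below is hypothetical and only edits the JSON. It assumes you then apply the result yourself with "radosgw-admin zonegroup set", followed by an RGW restart or, in a realm setup, a period commit. The zonegroup content shown is a cut-down made-up example.

```python
import json


def add_zonegroup_hostnames(zonegroup, names):
    """Return a copy of the zonegroup dict with `names` merged into "hostnames"."""
    updated = dict(zonegroup)
    hostnames = list(updated.get("hostnames", []))
    for name in names:
        if name not in hostnames:  # keep existing entries, avoid duplicates
            hostnames.append(name)
    updated["hostnames"] = hostnames
    return updated


# Cut-down, made-up zonegroup; a real one has many more fields.
zonegroup = json.loads('{"name": "default", "hostnames": ["host1.my.domain"]}')
print(json.dumps(add_zonegroup_hostnames(zonegroup, ["host1", "host2"]), indent=2))
```

Keeping the existing FQDN entries in the list means clients that already use the FQDN are unaffected.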

Cluster 1 (which didn't fail) had been built with FQDN hostnames, so 
the dashboard was still supplying an FQDN in the Host header.

So my hypothesis is that in 17.2.6 the dashboard no longer honours the 
"rgw dns name" field in ceph.conf. There may be some other subtleties 
but that's my best guess.

If you were running TLS to the RGWs, that may well be sufficient to 
cause certificate name mismatches too unless the certificate SANs 
contained the short names. I guess you would hit that first, masking the 
other problem.
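
To illustrate the certificate point, here is a simplified Python sketch of the name check. It is not a full implementation of TLS hostname verification: it only compares DNS names label by label and honours a wildcard in the left-most label. The function name and the example names are assumptions for illustration.

```python
def san_matches(hostname, san_dns_names):
    """Return True if `hostname` matches any DNS name in the SAN list."""
    host_labels = hostname.lower().rstrip(".").split(".")
    for san in san_dns_names:
        san_labels = san.lower().rstrip(".").split(".")
        if len(san_labels) != len(host_labels):
            continue
        # A wildcard is only honoured as the whole left-most label.
        if san_labels[0] == "*" and len(san_labels) > 1:
            if san_labels[1:] == host_labels[1:]:
                return True
        elif san_labels == host_labels:
            return True
    return False


sans = ["host1.my.domain", "*.my.domain"]
print(san_matches("host1.my.domain", sans))  # FQDN is covered
print(san_matches("host1", sans))            # short name is not
```

So a certificate issued only for the FQDN (or a wildcard under the domain) would not cover a connection made to the short name.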

Although cluster 2 should probably have been configured with FQDN 
hostnames, I still think this is a regression. The "rgw dns name" 
field should be honoured.

Thanks, Chris

On 13/04/2023 17:20, Chris Palmer wrote:
Hi

I have 3 Ceph clusters, all configured similarly, which have been 
happy for some months on 17.2.5:

1. A test cluster
2. A small production cluster
3. A larger production cluster

All are Debian 11, built from packages - no cephadm.

I upgraded (1) to 17.2.6 without any problems at all. In particular 
the Object Gateway sections of the dashboard work as usual.

I then upgraded (2). Nothing seemed amiss, and everything seems to 
work except... when I try to access the Object Gateway sections of the 
dashboard I always get:

     *The Object Gateway Service is not configured*

       Error connecting to Object Gateway: RGW REST API failed request
       with status code 403
(b'{"Code":"SignatureDoesNotMatch","RequestId":"tx0000022ba920e82ac4a9c-0064381'
b'934-10e73385-default","HostId":"10e73385-default-default"}')

(Just the RequestId changes each time). Before the upgrade it worked 
just fine.

Other info:

 * RGW requests using awscli and rclone all work with normal RGW
   accounts. It just seems to be the dashboard that's died.
 * Just the one zonegroup, no multisite/replication
 * "radosgw-admin user info --uid=rgwadmin" gives the correct output
   with the right access_key & secret_key. The other fields are as in
   (1).
 * "ceph dashboard get-rgw-api-access-key/get-rgw-api-secret-key" both
   give the right values.

The rgw logs from (2) which fails show:

2023-04-13T16:36:28.720+0100 7fcc7966a700  1 ====== starting new request req=0x7fcd88c10720 =====
2023-04-13T16:36:28.720+0100 7fcc80e79700  1 req 8090309398268968541 0.000000000s op->ERRORHANDLER: err_no=-2027 new_err_no=-2027
2023-04-13T16:36:28.724+0100 7fcc80e79700  1 ====== req done req=0x7fcd88c10720 op status=0 http_status=403 latency=0.003999980s ======
2023-04-13T16:36:28.724+0100 7fcc80e79700  1 beast: 0x7fcd88c10720: 192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] "GET /admin/metadata/user?myself HTTP/1.1" 403 134 - "python-requests/2.25.1" - latency=0.003999980s

(Note this does not have rgwadmin as the user, and is always the same 
URL)

Whereas the rgw logs from (1) which works show things like:

2023-04-13T15:44:19.396+0000 7f8478da1700  1 ====== starting new request req=0x7f86284f5720 =====
2023-04-13T15:44:19.412+0000 7f8478da1700  1 ====== req done req=0x7f86284f5720 op status=0 http_status=200 latency=0.016000060s ======
2023-04-13T15:44:19.412+0000 7f8478da1700  1 beast: 0x7f86284f5720: 10.xx.xx.xx - rgwadmin [13/Apr/2023:15:44:19.396 +0000] "GET /admin/realm?list HTTP/1.1" 200 31 - "python-requests/2.25.1" - latency=0.016000060s

(Note this has rgwadmin as the user, and various URLs)
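
To compare the two cases across a larger log, something like the following Python sketch could pull the user, URL and status out of the "beast:" access lines. The regular expression is my own guess, based only on the two lines above (joined back onto one line each), so treat the field layout as an assumption.

```python
import re

# Field layout inferred from the two sample lines in this message only.
BEAST_RE = re.compile(
    r'beast: \S+ (?P<client>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

FAILING = (
    '2023-04-13T16:36:28.724+0100 7fcc80e79700  1 beast: 0x7fcd88c10720: '
    '192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] '
    '"GET /admin/metadata/user?myself HTTP/1.1" 403 134 - '
    '"python-requests/2.25.1" - latency=0.003999980s'
)
WORKING = (
    '2023-04-13T15:44:19.412+0000 7f8478da1700  1 beast: 0x7f86284f5720: '
    '10.xx.xx.xx - rgwadmin [13/Apr/2023:15:44:19.396 +0000] '
    '"GET /admin/realm?list HTTP/1.1" 200 31 - '
    '"python-requests/2.25.1" - latency=0.016000060s'
)


def parse_beast_line(line):
    """Return user, path and status for a beast access line, or None."""
    match = BEAST_RE.search(line)
    if match is None:
        return None
    user = match.group("user")
    return {
        "user": None if user == "-" else user,  # "-" means no user was logged
        "path": match.group("path"),
        "status": int(match.group("status")),
    }


for sample in (FAILING, WORKING):
    print(parse_beast_line(sample))
```

Lines with no user and a 403 status are the ones matching the failing case above.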

The only thing I can see in the release notes that looks even vaguely 
related is https://github.com/ceph/ceph/pull/47547, but it doesn't 
seem likely.

I am really stumped on this, with no idea what has gone wrong on (2), 
and what the difference is between (1) and (2). I'm not going to touch 
(3) until I have resolved this.

Grateful for any help...

And thanks for all the good work.

Regards, Chris

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx