Re: 17.2.6 Dashboard/RGW Signature Mismatch


 



I've finally solved this. There has been a change in behaviour in 17.2.6.

For cluster 2 (the one that failed):

 * When they were built the hosts were configured with a hostname
   without a domain (so hostname returned a short name)
 * The hosts as reported by ceph all had short hostnames
 * In ceph.conf each of the RGWs has a section like:

[client.rgw.host1]
    host = host1
    rgw frontends = "beast port=80"
    rgw dns name = host1.my.domain
    rgw_crypt_require_ssl = false

 * The dashboard connections to the RGW servers all had a Host header
   of the FQDN as specified in ceph.conf (observed using tcpdump)
 * The RGW processes allowed the connections based on knowledge of
   their own FQDN

But after the upgrade:

 * The dashboard connections to the RGW all have a Host header of the
   short host name (observed using tcpdump)
 * The RGW processes reject these connections because the Host header
   doesn't match their FQDN
 * By adding the short names to the zonegroup "hostnames" it now works
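For anyone wanting to script the workaround in the last bullet, here is a minimal sketch. It assumes the zonegroup has been dumped to JSON (for example with "radosgw-admin zonegroup get") and that the document carries a top-level "hostnames" array. The function name add_hostnames and the sample data are made up for illustration; the message above does not say how the hostnames were actually edited.

```python
import json


def add_hostnames(zonegroup, extra_names):
    """Return a copy of the zonegroup dict with extra_names merged into "hostnames".

    Existing entries keep their order and duplicates are skipped.
    """
    updated = dict(zonegroup)
    hostnames = list(updated.get("hostnames", []))
    for name in extra_names:
        if name not in hostnames:
            hostnames.append(name)
    updated["hostnames"] = hostnames
    return updated


if __name__ == "__main__":
    # Hypothetical, trimmed-down zonegroup document; a real one has many more fields.
    zonegroup = {"name": "default", "hostnames": ["host1.my.domain"]}
    print(json.dumps(add_hostnames(zonegroup, ["host1", "host2"]), indent=4))
```

The edited JSON would then be loaded back with "radosgw-admin zonegroup set" and the RGWs restarted. Setups that use a realm normally also need "radosgw-admin period update --commit"; whether that applies here depends on the cluster.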

Cluster 1 (which didn't fail) had been built with FQDN hostnames, so its dashboard connections were still supplying an FQDN in the Host header.

So my hypothesis is that in 17.2.6 the dashboard no longer honours the "rgw dns name" field in ceph.conf. There may be some other subtleties but that's my best guess.

If you were running TLS to the RGWs, that may well be sufficient to cause certificate name mismatches too unless the certificate SANs contained the short names. I guess you would hit that first, masking the other problem.
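To illustrate the certificate point, the sketch below checks whether a name the client connects to would be covered by a list of SAN DNS entries. It is a simplified approximation of hostname matching (exact match, or a wildcard covering exactly one left-most label), not a full implementation of what a TLS library does. The function name and the SAN lists are hypothetical.

```python
def host_matches_san(host, san_dns_names):
    """Return True if host is covered by any of the SAN DNS names.

    Simplified rules: case-insensitive exact match, or a "*.domain"
    wildcard that stands in for exactly one left-most label.
    """
    host = host.lower().rstrip(".")
    for san in san_dns_names:
        san = san.lower().rstrip(".")
        if san == host:
            return True
        if san.startswith("*."):
            head, sep, tail = host.partition(".")
            if sep and head and tail == san[2:]:
                return True
    return False


if __name__ == "__main__":
    sans = ["host1.my.domain"]  # hypothetical certificate contents
    for name in ("host1.my.domain", "host1"):
        print(name, host_matches_san(name, sans))
```

With a certificate that lists only the FQDN, the short name fails the check, which is the mismatch described above.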

Although cluster 2 should probably have been configured with FQDN hostnames I do still think this is a regression. The "rgw dns name" field should be honoured.

Thanks, Chris


On 13/04/2023 17:20, Chris Palmer wrote:
Hi

I have 3 Ceph clusters, all configured similarly, which have been happy for some months on 17.2.5:

1. A test cluster
2. A small production cluster
3. A larger production cluster

All are Debian 11, built from packages - no cephadm.

I upgraded (1) to 17.2.6 without any problems at all. In particular the Object Gateway sections of the dashboard work as usual.

I then upgraded (2). Nothing seemed amiss, and everything seems to work except... when I try to access the Object Gateway sections of the dashboard I always get:


     *The Object Gateway Service is not configured*


       Error connecting to Object Gateway: RGW REST API failed request
       with status code 403
(b'{"Code":"SignatureDoesNotMatch","RequestId":"tx0000022ba920e82ac4a9c-0064381'
b'934-10e73385-default","HostId":"10e73385-default-default"}')

(Just the RequestId changes each time). Before the upgrade it worked just fine.

Other info:

 * RGW requests using awscli and rclone all work with normal RGW
   accounts. It just seems to be the dashboard that's died.
 * Just the one zonegroup, no multisite/replication
 * "radosgw-admin user info --uid=rgwadmin" gives the correct output
   with the right access_key & secret_key. The other fields are as in (1).
 * "ceph dashboard get-rgw-api-access-key/get-rgw-api-secret-key" both
   give the right values.

The rgw logs from (2) which fails show:

2023-04-13T16:36:28.720+0100 7fcc7966a700  1 ====== starting new request req=0x7fcd88c10720 =====
2023-04-13T16:36:28.720+0100 7fcc80e79700  1 req 8090309398268968541 0.000000000s op->ERRORHANDLER: err_no=-2027 new_err_no=-2027
2023-04-13T16:36:28.724+0100 7fcc80e79700  1 ====== req done req=0x7fcd88c10720 op status=0 http_status=403 latency=0.003999980s ======
2023-04-13T16:36:28.724+0100 7fcc80e79700  1 beast: 0x7fcd88c10720: 192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] "GET /admin/metadata/user?myself HTTP/1.1" 403 134 - "python-requests/2.25.1" - latency=0.003999980s

(Note this does not have rgwadmin as the user, and is always the same URL)


Whereas the rgw logs from (1) which works show things like:

2023-04-13T15:44:19.396+0000 7f8478da1700  1 ====== starting new request req=0x7f86284f5720 =====
2023-04-13T15:44:19.412+0000 7f8478da1700  1 ====== req done req=0x7f86284f5720 op status=0 http_status=200 latency=0.016000060s ======
2023-04-13T15:44:19.412+0000 7f8478da1700  1 beast: 0x7f86284f5720: 10.xx.xx.xx - rgwadmin [13/Apr/2023:15:44:19.396 +0000] "GET /admin/realm?list HTTP/1.1" 200 31 - "python-requests/2.25.1" - latency=0.016000060s

(Note this has rgwadmin as the user, and various URLs)
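The difference between the two beast access lines (no user and a 403, versus rgwadmin and a 200) can be pulled out mechanically when comparing many requests. The sketch below is based only on the two log lines quoted above. The regular expression and the function name parse_beast_line are assumptions made for illustration, not a documented log format.

```python
import re

# Pattern inferred from the two sample lines above; other RGW versions or
# log settings may produce a different layout.
BEAST_LINE = re.compile(
    r'beast: \S+ (?P<client>\S+) - (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" (?P<status>\d{3}) '
)


def parse_beast_line(line):
    """Return a dict with client, user, uri and status, or None if no match.

    A user field of "-" (request not authenticated) is reported as None.
    """
    match = BEAST_LINE.search(line)
    if match is None:
        return None
    user = match.group("user")
    return {
        "client": match.group("client"),
        "user": None if user == "-" else user,
        "uri": match.group("uri"),
        "status": int(match.group("status")),
    }


if __name__ == "__main__":
    sample = (
        '2023-04-13T16:36:28.724+0100 7fcc80e79700  1 beast: 0x7fcd88c10720: '
        '192.168.xx.xx - - [13/Apr/2023:16:36:28.720 +0100] '
        '"GET /admin/metadata/user?myself HTTP/1.1" 403 134 - '
        '"python-requests/2.25.1" - latency=0.003999980s'
    )
    print(parse_beast_line(sample))
```

Filtering for entries where the user is None and the status is 403 would pick out the failing dashboard requests.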

The only thing I can see in the release notes that looks even vaguely related is https://github.com/ceph/ceph/pull/47547, but it doesn't seem likely.

I am really stumped on this, with no idea what has gone wrong on (2), and what the difference is between (1) and (2). I'm not going to touch (3) until I have resolved this.

Grateful for any help...

And thanks for all the good work.

Regards, Chris



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



