Re: Expose rgw using consul or service discovery

Sebastian Wagner <sewagner@xxxxxxxxxx> · Tue, 9 Nov 2021 16:57:30 +0100

On 09.11.21 at 15:58, Pierre GINDRAUD wrote:
> Coming back to the radosgw deployment: I've tested the cephadm ingress
> service, and these are my findings:
>
> The HAProxy service is deployed but not "managed" by cephadm; here are
> the sources:
> https://github.com/ceph/ceph/blob/9ab9cc26e200cdc3108525770353b91b3dd6c6d8/src/pybind/mgr/cephadm/services/ingress.py
> So, when cephadm shuts down a radosgw backend, it does not "drain" it or
> "put it in maintenance" in HAProxy beforehand. HAProxy continues to send
> requests to the failed backend until the healthcheck marks it down.
> Fortunately, the new retry feature of HAProxy 2
> https://www.haproxy.com/fr/blog/haproxy-layer-7-retries-and-chaos-engineering/
> will retry failed requests on another backend. But, as the document
> says, not all failure cases are handled. So when the server
> (radosgw) returns an empty answer, HAProxy does not retry the request
> and forwards the 502 code to the client. We could enable the "retry-on
> all-retryable-errors" option, but what about retrying a POST or a PUT
> on an API? If the first request went through fine and only its
> answer was broken, the first "action" may still have completed successfully.
> In addition, the HAProxy configuration file is not fully customizable:
> https://github.com/ceph/ceph/blob/9ab9cc26e200cdc3108525770353b91b3dd6c6d8/src/pybind/mgr/cephadm/templates/services/ingress/haproxy.cfg.j2
> does not allow a custom log format, for example.
You can override this template (compare it to the monitoring templates
https://docs.ceph.com/en/latest/cephadm/services/monitoring/#using-custom-configuration-files
). This way you can do whatever you want right now.
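
On the retry concern above, a minimal sketch of the underlying rule, assuming the proxy can tell whether the request ever reached the backend. The names `may_retry`, `IDEMPOTENT_METHODS` and `request_reached_backend` are hypothetical and are not part of HAProxy or cephadm; the sketch only illustrates why replaying a POST is risky while replaying an idempotent method is not.

```python
# Hypothetical helper for illustration only; HAProxy exposes no such function.
# Methods that the HTTP specification defines as idempotent: sending them
# twice has the same effect on the server as sending them once.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "OPTIONS", "TRACE", "PUT", "DELETE"})


def may_retry(method: str, request_reached_backend: bool) -> bool:
    """Return True if replaying the request on another backend is safe."""
    if not request_reached_backend:
        # For example a refused connection: the backend never saw the request.
        return True
    # The backend may have applied the request before its answer was lost,
    # so only idempotent methods can be replayed without side effects.
    return method.upper() in IDEMPOTENT_METHODS
```

Under this rule a PUT whose answer was lost can be replayed, while a POST cannot, which is the case that makes a blanket "retry-on all-retryable-errors" questionable.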
>
> Given these findings, I'm wondering whether cephadm should approach
> the problem differently. For example, my previous proposal for
> "pre-tasks" and "post-tasks", or allowing service registration in a
> backend such as Consul.
Maybe!
>
> Finally, in our setup, I will most likely deploy my own HAProxy (using
> our infrastructure tools) and use Consul and healthchecks to get a
> setup similar to the ingress service, but following our standards.
>
> Do you think any of my proposals could actually be submitted to the
> Ceph developer teams?

At some point the amount of flexibility provided by cephadm's services
will reach its limit, and at that point one will need an escape hatch.
Right now the escape hatch is making those templates overridable.
Is that enough?

>
> On 23/10/2021 01:47, Maged Mokhtar wrote:
>>
>>>> In PetaSAN we use Consul to provide a service mesh for running
>>>> services active/active over Ceph.
>>>>
>>>> For rgw, we use nginx to load-balance the rgw gateways. The nginx
>>>> instances themselves run in an active/active HA setup, so they do not
>>>> become a bottleneck, as you pointed out with the haproxy setup.
>>>
>>
>>> How do you manage rgw upgrades? Do you use cephadm or another
>>> automation tool?
>>>
>>> How is nginx configured to talk to rgw? Using an upstream and a
>>> proxy pass?
>>>
>>>
>> PetaSAN is a Ceph storage appliance based on Ubuntu OS and the SUSE
>> kernel. We rely on the Consul service mesh to scale the service/gateway
>> layer in a scale-out active/active fashion; this covers iSCSI, NFS,
>> SMB and S3.
>> Upgrades are done live via apt upgrade. We do not use cephadm; we
>> provide a web-based deployment UI (wizard-like steps) as well as a UI
>> for cluster management.
>> For nginx, we use the upstream method to configure the load balancing
>> of the rgws. The nginx config file is dynamically created/updated by
>> a python script which receives notifications from Consul (nodes
>> added/nodes down/ip changes..).
>> You can read more on our website:
>> http://www.petasan.org
>>
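
A minimal sketch of the approach described above (regenerating the nginx upstream block from Consul data), not PetaSAN's actual script. It assumes the input has the shape returned by Consul's health endpoint for a service (a list of entries with `Node` and `Service` objects); the function names `backends_from_consul` and `render_upstream` and the upstream name are hypothetical.

```python
def backends_from_consul(entries):
    """Extract sorted (address, port) pairs from a Consul health response."""
    backends = []
    for entry in entries:
        service = entry["Service"]
        # Service.Address is empty when the service uses the node's address.
        address = service.get("Address") or entry["Node"]["Address"]
        backends.append((address, service["Port"]))
    return sorted(backends)


def render_upstream(name, backends):
    """Render an nginx `upstream` block listing one server per backend."""
    lines = [f"upstream {name} {{"]
    lines += [f"    server {address}:{port};" for address, port in backends]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

A caller would write the rendered text to an nginx include file and reload nginx whenever the set of healthy backends changes; sorting the backends keeps the output stable so unchanged data does not trigger a reload.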
>>
>>>>
>>>> /Maged
>>>>
>>>> On 22/10/2021 16:41, Pierre GINDRAUD wrote:
>>>>>
>>>>> On 20/10/2021 10:17, Sebastian Wagner wrote:
>>>>>> On 20.10.21 at 09:12, Pierre GINDRAUD wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm migrating from puppet to cephadm to deploy a ceph cluster, and
>>>>>>> I'm using consul to expose the rados gateway. Before, with puppet, we
>>>>>>> deployed the rados gateway with "apt install radosgw" and applied
>>>>>>> upgrades using "apt upgrade radosgw". In our consul service, a simple
>>>>>>> healthcheck on the URL "/swift/healthcheck" worked fine, because we
>>>>>>> were able to put the consul agent in maintenance mode before operations.
>>>>>>> I've seen this thread
>>>>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/32JZAIU45KDTOWEW6LKRGJGXOFCTJKSS/#N7EGVSDHMMIXHCTPEYBA4CYJBWLD3LLP
>>>>>>>
>>>>>>> which shows that consul is a viable approach.
>>>>>>>
>>>>>>> So, with cephadm, the upgrade process decides by itself when to
>>>>>>> stop, upgrade and start each radosgw instance.
>>>>>> Right
>>>>>>
>>>>>>> This is an issue because the consul healthcheck must detect the
>>>>>>> instance going down "as fast as possible" to minimize the number of
>>>>>>> application requests that hit the down instance's IP.
>>>>>>>
>>>>>>> In some applications, like traefik
>>>>>>> https://doc.traefik.io/traefik/reference/static-configuration/cli/
>>>>>>> we have an option "requestacceptgracetimeout" that allows the "http
>>>>>>> server" to keep handling requests for some time after a stop signal
>>>>>>> has been received, while the healthcheck endpoint immediately starts
>>>>>>> responding with an "error". This allows the load balancer (consul
>>>>>>> here) to mark the instance down and stop sending traffic to it before
>>>>>>> it actually goes down.
>>>>>>>
>>>>>>> In https://docs.ceph.com/en/latest/radosgw/config-ref/ I haven't
>>>>>>> seen any option like that. And in cephadm I haven't seen "pre-tasks"
>>>>>>> and "post-tasks" to, for example, touch a file somewhere that consul
>>>>>>> could test, or to put a host in maintenance.
>>>>>>>
>>>>>>> How do you expose the radosgw service to your applications?
>>>>>> cephadm nowadays ships an ingress service using haproxy for this
>>>>>> use case:
>>>>>>
>>>>>> https://docs.ceph.com/en/latest/cephadm/services/rgw/#high-availability-service-for-rgw
>>>>>>
>>>>> Thanks for the link. I've analysed the high-availability pattern,
>>>>> but I've found the following drawbacks in the ceph proposal:
>>>>> * the currently active haproxy node can be considered a
>>>>> bottleneck because it handles all TCP connections. In addition, it
>>>>> adds significant overhead because it requires 2 TCP connections in
>>>>> total to talk to rgw
>>>>> * the keepalived failover mechanism "breaks" TCP connections at the
>>>>> moment of the failover
>>>>> * Does the cephadm module properly "drain" a node before acting
>>>>> (stop/restart...) on it? If not, haproxy does not bring
>>>>> anything better than my consul service setup.
>>>>>
>>>>> I think haproxy+keepalived adds quite a bit of complexity; a
>>>>> service-discovery-oriented approach is simpler and provides
>>>>> "zero downtime" during all types of "planned maintenance"
>>>>> (
>>>>> https://www.consul.io/use-cases/service-discovery-and-health-checking
>>>>> )
>>>>>
>>>>> What do you think ?
>>>>>
>>>>> Is anyone already using this "high-availability-service-for-rgw" in
>>>>> a production environment?
>>>>>
>>>>>
>>>>>>
>>>>>>> Do you have any idea how to work around my issue?
>>>>>> Plenty actually. cephadm itself does not provide a notification
>>>>>> mechanism, but other components in the deployment stack might.
>>>>>>
>>>>>> On the highest level we have the config-key store of the MONs. You
>>>>>> should be able to get notifications for config-key changes.
>>>>>> Unfortunately this would involve some coding.
>>>>>>
>>>>>> On the systemd level we have systemd-notify. I haven't looked
>>>>>> into it,
>>>>>> but maybe you can get events about the rgw unit deployed by cephadm.
>>>>>>
>>>>>> On the container level we have "podman events", which prints state
>>>>>> changes of containers.
>>>>>>
>>>>>> To me, a script that calls podman events on one hand and pushes
>>>>>> updates to consul on the other sounds like the most promising solution.
>>>>>>
>>>>>> In case you get this setup working properly, I'd love to read a blog
>>>>>> post about it.
>>>>>>
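
A minimal sketch of the "podman events to consul" idea, not a tested integration. It assumes events are read as JSON lines from `podman events --format json` with `Type` and `Status` fields, and that Consul's agent service-maintenance endpoint is used to take an instance out of rotation. The function name `maintenance_request`, the status mapping and the service ID are hypothetical. Only the translation step is shown, so it can be checked without podman or Consul running.

```python
import json
from urllib.parse import quote, urlencode

# Assumed mapping from container event status to Consul maintenance mode:
# True enables maintenance (instance out of rotation), False disables it.
STATUS_TO_MAINTENANCE = {"start": False, "stop": True, "died": True}


def maintenance_request(event_line, service_id, consul="http://127.0.0.1:8500"):
    """Translate one JSON event line into a (method, url) pair for Consul.

    Returns None for events that should not change the maintenance state.
    """
    event = json.loads(event_line)
    if event.get("Type") != "container":
        return None
    enable = STATUS_TO_MAINTENANCE.get(event.get("Status"))
    if enable is None:
        return None
    query = urlencode(
        {"enable": str(enable).lower(), "reason": f"podman {event['Status']}"}
    )
    path = f"/v1/agent/service/maintenance/{quote(service_id, safe='')}"
    return "PUT", f"{consul}{path}?{query}"
```

A wrapper would read the event stream line by line and send each returned request to the local Consul agent. Note that such events arrive once the container is already stopping, so this shortens the detection delay but does not by itself drain the instance beforehand.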
>>>>>>> Regards
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx 
>>>>