Re: RGW Bucket notification troubleshooting

On Thu, Jan 28, 2021 at 7:34 PM Schoonjans, Tom (RFI,RAL,-) <
Tom.Schoonjans@xxxxxxxxx> wrote:

> Hi Yuval,
>
>
> Together with Tom Byrne I ran some more tests today while keeping an eye
> on the logs as well.
>
> We immediately noticed that the nodes were logging errors when uploading
> files like:
>
> 2021-01-28 16:10:45.825 7f56ff5cf700  1 ====== starting new request req=0x7f56ff5c87f0 =====
> 2021-01-28 16:10:45.828 7f5721e14700  1 AMQP connect: exchange mismatch
> 2021-01-28 16:10:45.828 7f5721e14700  1 ERROR: failed to create push endpoint: amqp://<username>:<password>@<my.rabbitmq.server>:5672 due to: pubsub endpoint configuration error: AMQP: failed to create connection to: amqp://<username>:<password>@<my.rabbitmq.server>:5672
> 2021-01-28 16:10:45.828 7f571ee0e700  1 ====== req done req=0x7f571ee077f0 op status=0 http_status=200 latency=0.0569997s ======
>
>
> Which resulted in no connections being established to the RabbitMQ server.
>
> Tom then restarted the Ceph services on one gateway node, which led to
> events being sent to RabbitMQ without blocking, but only if this particular
> node was picked up by the boto3 upload request in the round-robin DNS.
>
> Restarting the Ceph service on all nodes fixed the problem and I got a
> nice steady stream of events to my consumer Python script!
>
>
we should fix it. no restart should be needed if one of the connection
parameters was wrong



> I did notice that any events that were sent while my consumer script was
> not running are lost, as they are not picked up after I restart the script.
> Any thoughts on this?
>
>
this is strange. in our code [1] we don't require immediate transfer of
messages.
how is the exchange declared?
can you check if this is happening when you send messages from a python
producer as well?

[1] https://github.com/ceph/ceph/blob/master/src/rgw/rgw_amqp.cc#L575
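
A minimal sketch of that check, assuming the third-party pika client is
installed. The function names, the queue name "rgw-events" and the "#"
binding key are illustrative and not taken from this thread. The idea: the
broker normally keeps messages for an offline consumer only if a durable,
non-exclusive queue is already bound to the exchange. A queue that the
consumer declares as exclusive or auto-delete goes away with the consumer,
and messages published in the meantime usually have nowhere to go.

```python
def durable_queue_args(queue_name):
    """Queue settings that let the queue outlive the consumer (illustrative helper)."""
    return {
        "queue": queue_name,
        "durable": True,       # queue definition survives a broker restart
        "exclusive": False,    # not tied to the declaring connection
        "auto_delete": False,  # not removed when the last consumer leaves
    }


def bind_durable_queue(amqp_url, exchange, queue_name, routing_key):
    """Declare a durable queue and bind it to an exchange that already exists."""
    import pika  # third-party; imported here so the helper above has no dependencies

    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    try:
        channel = connection.channel()
        # passive=True only checks that the exchange exists; it does not
        # create it or change its settings.
        channel.exchange_declare(exchange=exchange, exchange_type="topic", passive=True)
        channel.queue_declare(**durable_queue_args(queue_name))
        channel.queue_bind(queue=queue_name, exchange=exchange, routing_key=routing_key)
    finally:
        connection.close()


def publish_test_message(amqp_url, exchange, routing_key, body):
    """Publish one persistent message, standing in for the radosgw as producer."""
    import pika  # third-party

    connection = pika.BlockingConnection(pika.URLParameters(amqp_url))
    try:
        channel = connection.channel()
        channel.basic_publish(
            exchange=exchange,
            routing_key=routing_key,
            body=body,
            properties=pika.BasicProperties(delivery_mode=2),  # 2 = persistent
        )
    finally:
        connection.close()
```

Calling bind_durable_queue("amqp://user:pass@host:5672", "ceph-test",
"rgw-events", "#") once, then publish_test_message(...) while the consumer
is stopped, should show whether the message is still waiting in the queue
when the consumer starts again. On a topic exchange the "#" binding key
matches every routing key.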



> Many thanks!!
>
> Best,
>
> Tom
>
>
>
> Dr Tom Schoonjans
>
> Research Software Engineer - HPC and Cloud
>
> Rosalind Franklin Institute
> Harwell Science & Innovation Campus
> Didcot
> Oxfordshire
> OX11 0FA
> United Kingdom
>
> https://www.rfi.ac.uk
>
> The Rosalind Franklin Institute is a registered charity in England and
> Wales, No. 1179810 Company Limited by Guarantee Registered in England
> and Wales, No.11266143. Funded by UK Research and Innovation through
> the Engineering and Physical Sciences Research Council.
>
> On 27 Jan 2021, at 16:21, Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>
>
> On Wed, Jan 27, 2021 at 5:34 PM Schoonjans, Tom (RFI,RAL,-) <
> Tom.Schoonjans@xxxxxxxxx> wrote:
>
>> Looks like there’s already a ticket open for AMQP SSL support:
>> https://tracker.ceph.com/issues/42902 (you opened it ;-))
>>
>> I will give it a try myself if I have some time, but don’t hold your breath
>> with lockdown and home schooling. Also I am not much of a C++ coder.
>>
>> I need to go over the logs with Tom Byrne to see why it is not working
>> properly. And perhaps I will be able to come up with a fix then.
>>
>> However this is what I have run into so far today:
>>
>> 1. After configuring a bucket with a topic using the non-SSL port, I
>> tried a couple of uploads to this bucket. They all hung, which seemed
>> like something was very wrong, so I Ctrl-C’ed every time. After some time I
>> figured out from the RabbitMQ admin UI that Ceph was indeed connecting to
>> it, and the connections remained so I killed them from the UI.
>>
>
> sending the notification to the rabbitmq server is synchronous with the
> upload to the bucket. so, if the server is slow or not acking the
> notification, the upload request would hang. note that the upload itself is
> done first, but the reply to the client does not happen until rabbitmq
> server acks.
>
> would be great if you can share the radosgw logs.
> maybe the issue is related to the user/password method we use? we use:
> AMQP_SASL_METHOD_PLAIN
>
> one possible workaround would be to set "amqp-ack-level" to "none". in
> this case the radosgw does not wait for an ack
>
> in "pacific" you could use "persistent topics" where the notifications are
> sent asynchronously to the endpoint.
>
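
A minimal sketch of the "amqp-ack-level" workaround, assuming the topic is
created through the SNS-compatible API with boto3. The helper names are
illustrative, and sns_client is assumed to be a boto3 "sns" client that
already points at the radosgw endpoint with valid credentials. The exchange
and topic names are the ones from the log excerpt in the first mail of this
thread.

```python
def amqp_topic_attributes(push_endpoint, exchange, ack_level="none"):
    """Build the Attributes dict for a CreateTopic call against the radosgw.

    With ack_level "none" the gateway does not wait for the broker to ack,
    so a slow broker should not hold up the reply to the upload request.
    """
    return {
        "push-endpoint": push_endpoint,
        "amqp-exchange": exchange,
        "amqp-ack-level": ack_level,
    }


def create_topic(sns_client, name, attributes):
    """Create (or update) the topic and return its ARN."""
    response = sns_client.create_topic(Name=name, Attributes=attributes)
    return response["TopicArn"]


if __name__ == "__main__":
    # Prints the attributes only; no connection to the cluster is made here.
    print(amqp_topic_attributes("amqp://user:pass@host:5672", "ceph-test"))
```

The bucket notification then refers to the returned ARN as usual. The
trade-off is that with "none" a failed delivery to the broker is not
reported back to the uploading client.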
>> 2. I then wrote a python script with Pika to consume the events, hoping
>> that would stop the blocking. I had some minor success with this. Usually
>> the first three or four uploaded files would generate events that I could
>> consume with my script.
>>
>
> the radosgw is waiting for an ack from the broker, not the end consumer,
> so this should not have mattered...
> did you actually see any notifications delivered to the consumer?
>
>
>> However, the rest would block forever. I repeated this a couple of times
>> but always the same result. I noticed that after I stopped uploading,
>> removed the bucket and the topic, the connection from Ceph in the RabbitMQ
>> UI remained. I killed it but it came back seconds later from another port
>> on the Ceph cluster. I ended up playing whack-a-mole with this until no
>> more connections would be established from Ceph to RabbitMQ. I probably
>> killed 100 or so of them.
>>
>
> once you remove the bucket there cannot be new notifications sent. if you
> create the bucket again you may see notifications again (this is fixed in
> "pacific").
> either way, even if the connection to the rabbitmq server is still open,
> no new notifications should be sent there. just having the
> connection should not be an issue but would be nice to fix that as well:
> https://tracker.ceph.com/issues/49033
>
>> 3. After this I couldn’t get any events sent anymore. There is no more
>> blocking when uploading, files get written but nothing else happens. No
>> connections are made anymore from Ceph to RabbitMQ.
>>
>> Hope this helps…
>>
>
> yes, this is very helpful!
>
>
>> Best,
>>
>> Tom
>>
>>
>> On 27 Jan 2021, at 13:04, Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>>
>>
>>
>> On Wed, Jan 27, 2021 at 11:33 AM Schoonjans, Tom (RFI,RAL,-) <
>> Tom.Schoonjans@xxxxxxxxx> wrote:
>>
>>> Hi Yuval,
>>>
>>>
>>> Switching to non-SSL connections to RabbitMQ allowed us to get things
>>> working, although currently it’s not very reliable.
>>>
>>
>> can you please add more about that? what reliability issues did you see?
>>
>>
>>> I will open a new ticket over this if we can’t fix things ourselves.
>>>
>>>
>> this would be great. we have ssl support for kafka and http endpoint, so,
>> if you decide to give it a try you can look at them as examples.
>> and let me know if you have questions or need help.
>>
>>
>>
>>> I will open an issue on the tracker as soon as my account request has
>>> been approved :-)
>>>
>>> Best,
>>>
>>> Tom
>>>
>>>
>>> On 26 Jan 2021, at 20:02, Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>>>
>>>
>>>
>>> On Tue, Jan 26, 2021 at 9:48 PM Schoonjans, Tom (RFI,RAL,-) <
>>> Tom.Schoonjans@xxxxxxxxx> wrote:
>>>
>>>> Hi Yuval,
>>>>
>>>>
>>>> I worked on this earlier today with Tom Byrne and I think I may be able
>>>> to provide some more information.
>>>>
>>>> I set up the RabbitMQ server myself, and created the exchange with type
>>>> ’topic’ before configuring the bucket.
>>>>
>>>> Not sure if this matters, but the RabbitMQ endpoint is reached over
>>>> SSL, using certificates generated with Letsencrypt.
>>>>
>>>>
>>> it actually does. we don't support amqp over ssl.
>>> feel free to open a tracker for that - as we should probably support
>>> that!
>>> but note that it would probably be backported only to later versions
>>> than nautilus.
>>>
>>>
>>>
>>>> Many thanks,
>>>>
>>>> Tom
>>>>
>>>>
>>>> On 26 Jan 2021, at 19:37, Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
>>>>
>>>> Hi Tom,
>>>> Did you create the exchange in rabbitmq? The RGW does not create it and
>>>> assumes it already exists.
>>>> Could you increase the log level in RGW and see if there are more log
>>>> messages that have "AMQP" in them?
>>>>
>>>> Thanks,
>>>>
>>>> Yuval
>>>>
>>>> On Tue, Jan 26, 2021 at 7:33 PM Byrne, Thomas (STFC,RAL,SC) <
>>>> tom.byrne@xxxxxxxxxx> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> We've been trying to get RGW Bucket notifications working with a
>>>>> RabbitMQ endpoint on our Nautilus 14.2.15 cluster. The gateway host can
>>>>> communicate with the rabbitMQ server just fine, but when RGW tries to send
>>>>> a message to the endpoint, the message never appears in the queue, and we
>>>>> get this error in the RGW logs:
>>>>>
>>>>> 2021-01-26 16:28:17.271 7f0468b1f700  1 push to endpoint AMQP(0.9.1)
>>>>> Endpoint
>>>>> URI: amqp://user:pass@host:5671
>>>>> Topic: ceph-topic-test
>>>>> Exchange: ceph-test
>>>>> Ack Level: broker failed, with error: -4098
>>>>>
>>>>> We've confirmed the URI is correct, and that the gateway host can send
>>>>> messages to the RabbitMQ via a standalone script (using the same
>>>>> information as in the URI). Does anyone have any hints about how to dig
>>>>> into this?
>>>>>
>>>>> Cheers,
>>>>> Tom
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



