Re: [EXTERNAL] RGW bucket notifications stop working after a while and blocking requests

Hi Alex,

thank you for the script. We will monitor how the queue fills up to see whether this is the issue.
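In the meantime, the reservation failures can be tracked from the RGW log. Below is a minimal sketch, assuming the log lines look like the one quoted further down in this thread; the regular expression and the function names are illustrative, not part of any Ceph tooling. Error -28 corresponds to ENOSPC ("No space left on device"), which fits the reading that the queue is full.

```python
import errno
import re
from collections import Counter

# Pattern modelled on the single log line quoted in this thread (assumption:
# other RGW versions or log settings may format the message differently).
RESERVE_FAIL = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\S+) .*"
    r"ERROR: failed to reserve notification on queue: "
    r"(?P<queue>\S+?)\. error: (?P<code>-?\d+)"
)


def parse_reserve_failures(lines):
    """Yield (timestamp, queue, errno_name) for each reservation failure."""
    for line in lines:
        match = RESERVE_FAIL.search(line)
        if match:
            code = abs(int(match.group("code")))
            # Translate the numeric code into its symbolic errno name.
            name = errno.errorcode.get(code, str(code))
            yield match.group("ts"), match.group("queue"), name


def failures_per_hour(lines):
    """Count failures per (queue, hour) to see when the problem starts."""
    counts = Counter()
    for ts, queue, _name in parse_reserve_failures(lines):
        counts[(queue, ts[:13])] += 1  # ts[:13] is "YYYY-MM-DDTHH"
    return counts
```

Feeding the RGW log through `failures_per_hour` (for example `failures_per_hour(open(path))`, with `path` pointing at a saved log file) gives a rough picture of when the failures begin, which can then be compared with the queue contents reported by the script.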


Cheers,
Florian

> On 5. Aug 2024, at 14:01, Alex Hussein-Kershaw (HE/HIM) <alexhus@xxxxxxxxxxxxx> wrote:
> 
> Hi Florian,
> 
> We are also gearing up to use persistent bucket notifications, but have not got as far as you yet, so we are quite interested in this. As I understand it, a bunch of new functionality is coming to the radosgw-admin command in Squid to allow gathering metrics from the queues, but it is not available yet in Reef.
> 
> I've used this script, parse-notifications.py <https://gist.github.com/yuvalif/b44a67b6278fe811aa38dd81a91eb3ba>, to parse all the objects in the queue; hopefully it helps you (credit to Yuval, who wrote it). To me, the reservation failure does look like the queue is full, so it would be interesting to see what is in the queue.
> 
> Best wishes,
> Alex
> 
> From: Florian Schwab <fschwab@xxxxxxxxxxxxxxxxxxx>
> Sent: Monday, August 5, 2024 11:02 AM
> To: ceph-users@xxxxxxx
> Subject: [EXTERNAL]  RGW bucket notifications stop working after a while and blocking requests
>  
> Hi,
> 
> we just set up two new Ceph clusters (using Rook). To process the user activity, we configured a topic that sends events to Kafka.
> 
> After 5-12 hours this stops working with a 503 SlowDown response:
> debug 2024-08-02T09:17:58.205+0000 7ff4359ad700 1 req 13681579273117692719 0.005000019s ERROR: failed to reserve notification on queue: private.rgw. error: -28
> 
> Our first thought was that the queue is full, but up to this point we see messages arriving in Kafka, and there is not much activity on the RGW itself (only a few requests against the S3 API), so it can’t be a load issue.
> 
> What helps is to remove the notification configuration on the buckets (put-bucket-notification-configuration). If we directly re-add the previous notification configuration, it continues working for a few hours before failing again with the same error/behaviour.
> 
> We haven’t been able to reproduce this with persistence disabled for the topic, so it looks related to the persistence option; without persistence, events are not queued before being sent to Kafka.
> This also suggests that the issue is not with Kafka, which is what we suspected first (e.g. that it can’t handle the amount of messages).
> 
> Has anyone else had this issue and found the cause, or does anyone have a suggestion on how best to continue debugging? Are there detailed metrics on the size and usage of the event queue?
> 
> 
> Here is the configuration for the topic and for a bucket:
> 
> $ radosgw-admin topic list
> {
>    "topics": [
>        {
>            "user": "",
>            "name": "private.rgw",
>            "dest": {
>                "push_endpoint": "kafka://rgw-sasl-kafka-user:XXX@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512",
>                "push_endpoint_args": "OpaqueData=&Version=2010-03-31&kafka-ack-level=broker&persistent=false&push-endpoint=kafka://rgw-sasl-kafka-user:XXX@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:9094/private.rgw?sasl.mechanism=SCRAM-SHA-512&mechanism=SCRAM-SHA-512&use-ssl=true&verify-ssl=true",
>                "push_endpoint_topic": "private.rgw",
>                "stored_secret": true,
>                "persistent": true
>            },
>            "arn": "arn:aws:sns:ceph-objectstore::private.rgw",
>            "opaqueData": ""
>        }
>    ]
> }
> 
> $ aws s3api get-bucket-notification-configuration --bucket=XXX
> {
>    "TopicConfigurations": [
>        {
>            "Id": "my-id",
>            "TopicArn": "arn:aws:sns:ceph-objectstore::private.rgw",
>            "Events": [
>                "s3:ObjectCreated:*",
>                "s3:ObjectRemoved:*"
>            ]
>        }
>    ]
> }
> 
> 
> Thank you for any input to solve this!
> 
> 
> Cheers,
> Florian
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx