Dear Community,

I would like to collect your feedback on this issue. This is a follow-up to a discussion that started in the RGW refactoring meeting on 31-May-23 (thanks @Krunal Chheda <kchheda3@xxxxxxxxxxxxx> for bringing up this topic!).

Currently, persistent notifications are retried indefinitely. The only limiting mechanism is that all notifications to a specific topic are stored in one RADOS object (of size 128MB). Assuming notifications are ~1KB at most, this gives us room for at least 128K notifications waiting in the queue (the arithmetic is spelled out in the P.S. below). When the queue fills up (e.g. the Kafka broker is down for 20 minutes while we are sending ~100 notifications per second), we start sending "slow down" replies to the client, and in that case the S3 operation is not performed. This means that, for example, an outage of the Kafka system would eventually cause an outage of our service. Note that this may also be the result of a misconfigured Kafka broker, or of a broker being decommissioned.

To avoid that, we propose several options:

* use a FIFO instead of a queue. This would let us hold more than 128K messages, surviving longer broker outages and higher message rates. There should still probably be a limit set on the size of the FIFO.
* define the maximum number of retries allowed for a notification.
* define the maximum time a notification may stay in the queue before it is removed.

We should probably start with these definitions done as topic attributes, reflecting our delivery guarantees for this specific destination (a rough sketch of what that could look like is in the P.P.S. below).

I will try to capture the results of the discussion in this tracker: https://tracker.ceph.com/issues/61532

Thanks,

Yuval
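
P.S. For anyone who wants to check the numbers above, a small back-of-the-envelope sketch in Python. The 128MB object size, ~1KB per notification, and ~100 notifications/second are the assumed figures from the example, not measured values:

    queue_bytes = 128 * 1024 * 1024    # one RADOS object backs the per-topic queue
    notification_bytes = 1024          # assumed upper bound per notification (~1KB)
    capacity = queue_bytes // notification_bytes
    print(capacity)                    # 131072 -> the "at least 128K" above

    rate = 100                         # notifications per second (example rate)
    print(capacity / rate / 60)        # ~21.8 -> minutes to fill the queue,
                                       # matching the ~20 minute broker outage above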
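
P.P.S. To make the "topic attributes" idea concrete, here is a rough sketch of what setting such limits could look like through the SNS-compatible API that RGW already uses for topic creation. 'push-endpoint' and 'persistent' are existing topic attributes; 'max_retries' and 'time_to_live' are hypothetical names for the proposed limits, not an existing API, and all endpoints/credentials below are placeholders:

    import boto3

    # sketch only: point an SNS client at the RGW endpoint
    client = boto3.client(
        'sns',
        endpoint_url='http://rgw.example.com:8000',   # hypothetical RGW endpoint
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        region_name='default',
    )

    client.create_topic(
        Name='mytopic',
        Attributes={
            'push-endpoint': 'kafka://broker.example.com:9092',  # hypothetical broker
            'persistent': 'true',
            'max_retries': '3',      # proposed: give up after N delivery attempts
            'time_to_live': '3600',  # proposed: drop a notification after this
                                     # many seconds in the queue
        },
    )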