On 1/16/20 12:13 PM, Yuval Lifshitz wrote:
two updates on the design (after some discussions):
(1) "best effort queue" (stretch goal) is probably not needed:
- cls queue performance should be high enough when put on fast media pool
- the "acl level" settings allow for existing mechanism to perform as
"best effort" and non-blocking for topics that does not need delivery
guarantees
(2) since the cls queue does not allow random access (without a
linear search), retries will have to be implemented based only on
the end of the queue. This means we must assume that the acks or
nacks arrive in the same order in which the notifications were
sent. This holds only for a specific endpoint (e.g. a specific
kafka broker), which means there will have to be a separate cls
queue instance for each endpoint
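To make the "one queue per endpoint" point concrete, here is a
minimal sketch (the naming scheme and function are my own
illustration, not the actual radosgw code):

    #include <string>

    // one cls queue RADOS object per (topic, endpoint) pair, so that
    // acks/nacks can be assumed to arrive in the order in which the
    // notifications for that endpoint were sent
    std::string notification_queue_oid(const std::string& topic,
                                       const std::string& endpoint_id) {
      return "pubsub.queue." + topic + "." + endpoint_id;
    }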
On Tue, Jan 14, 2020 at 3:47 PM Yuval Lifshitz <ylifshit@xxxxxxxxxx> wrote:
Dear Community,
I would like to share some design ideas around the above topic.
Feedback is welcome!
Current State
- in "pull mode" [1] we have the same guarantees as the multisite
syncing mechanism (guarantee against HW/SW failures). On top of
that, if writing the event to RADOS fails, this trickle back as
sync failure, which means that the master zone will try to sync
the pubsub zone
- in "push mode" [2] we send the notification from the ops context
that triggered the notification. The original operation is blocked
until we get a reply from the endpoint. As part of the
configuration for the endpoint, we also configure the "ack level",
indicating whether we block until we get a reply from the endpoint
or not.
Since the operation response is not sent back to the client until
the endpoint acks, this method guarantees against any failure in
the radosgw (at the cost of adding latency to the operation).
It does not, however, guarantee delivery if the endpoint is down
or disconnected. The endpoints we interact with (rabbitmq, kafka)
usually have built-in redundancy mechanisms, but that does not
cover the case where there is a network disconnect between our
gateways and these systems.
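For illustration, the existing "push mode" flow could be sketched
roughly as below (Endpoint, Event and push_notification are
stand-ins for this example, not the actual radosgw types):

    #include <string>

    enum class AckLevel { None, Broker };

    struct Event { std::string bucket, object; };

    struct Endpoint {
      int send(const Event&) { return 0; }   // stand-in for the amqp/kafka publish
      int wait_for_ack() { return 0; }       // stand-in for waiting on the broker reply
    };

    // called from the ops context that triggered the notification
    int push_notification(Endpoint& ep, const Event& ev, AckLevel level) {
      int r = ep.send(ev);
      if (r < 0) {
        return r;                            // sending itself failed
      }
      if (level == AckLevel::Broker) {
        return ep.wait_for_ack();            // block the op until the endpoint acks
      }
      return 0;                              // ack level "none": don't wait for the ack
    }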
In some cases we can get a nack from the endpoint, indicating that
our message will never be delivered. But all we can do is log
these cases:
- we cannot fail the operation that triggered the notification,
because we send the notification only after the actual operation
(e.g. "put object") was done (= no atomicity)
- there is no retry mechanism (in theory, we could add one)
Next Phase Requirements
We would like to add a delivery guarantee to "push mode" for
endpoint failures. For that we would use a message queue with the
following features:
- rados backed, so it would survive HW/SW failures
- blocking only on local reads/writes (so it introduces lower
latency than over-the-wire endpoint acks)
- has reserve/commit semantics (sketched at the end of this
section), so we can "reserve" a slot before the operation (e.g.
"put object") is performed, fail the operation if we cannot
reserve a slot on the queue, and commit the notification to the
queue only after the operation succeeds (and unreserve if the
operation fails)
I guess this reservation piece is only a requirement because of the
choice of cls_queue, which resides in a single rados object and so
enforces a bound on the total space used. The maximum size is
configurable, but can't exceed osd_max_object_size=128M. How many
notifications could we fit within that 128M limit? I worry that
clusters at a sufficient scale could fill that pretty quickly if the
notification endpoint is unavailable or slow, and that would leave
radosgw unable to satisfy any requests that would generate a notification.
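As a rough back-of-the-envelope (my own assumption, not a number
from the design): if each queued notification serializes to around
1KB, the 128M object holds on the order of 128K pending entries per
queue, so a sustained rate of 1,000 notifications/sec toward an
unreachable endpoint would fill it in roughly two minutes.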
- we would have a retry mechanism based on the queue, which means
that if a notification was successfully pushed into the queue, we
can assume it would (eventually) be successfully delivered to the
endpoint
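A minimal sketch of what the reserve/commit semantics listed above
could look like (a hypothetical interface, not an existing
cls_queue API; std::string stands in for a bufferlist payload):

    #include <cstdint>
    #include <string>

    class NotificationQueue {
    public:
      virtual ~NotificationQueue() = default;
      // called before the operation (e.g. "put object"); the operation
      // is failed if no room can be reserved on the queue
      virtual int reserve(uint64_t entry_size, uint64_t& reservation_id) = 0;
      // called after the operation succeeded: turn the reservation into
      // a queued notification
      virtual int commit(uint64_t reservation_id, const std::string& entry) = 0;
      // called if the operation failed: release the reserved space
      virtual int unreserve(uint64_t reservation_id) = 0;
    };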
Proposed Solution
- use the cls_queue [3] (cls_queue is not omap based, hence no
built-in iops limitations)
- add reserve/commit functionality (probably store that info in
the queue head)
- one or more dedicated threads should read requests from the
queue, send the notifications to the endpoints, and wait for
the replies (if needed) - this should be done via coroutines
(see the sketch below)
- acked requests are removed from the queue; nacked or
timed-out requests should be retried (at least for a while)
- both mechanisms would coexist, as this would be configurable per
topic
- as a stretch goal, we may add a "best effort queue". This would
be similar to the cls_queue solution, but won't address
radosgw failures (as the queue would be in-memory), only endpoint
failures/disconnects
- for now, this mechanism won't be supported for pushing events
from the pubsub zone (="pull+push mode"), but it might be added if
users find it useful
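To illustrate the consumer side described in the list above, a very
rough sketch (all types and helpers here are hypothetical
stand-ins, and the real implementation would be coroutine based
rather than a blocking loop):

    #include <chrono>
    #include <cstddef>
    #include <string>
    #include <thread>
    #include <vector>

    struct QueueReader {               // stand-in for a cls_queue client
      int list_front(size_t, std::vector<std::string>& out) { out.clear(); return 0; }
      int remove_front(size_t) { return 0; }
    };

    struct PushEndpoint {              // stand-in for an amqp/kafka endpoint
      int send_and_wait(const std::string&) { return 0; }
    };

    void consume_notifications(QueueReader& q, PushEndpoint& ep) {
      for (;;) {
        std::vector<std::string> entries;
        if (q.list_front(100, entries) < 0 || entries.empty()) {
          std::this_thread::sleep_for(std::chrono::milliseconds(100));
          continue;                    // queue empty or read error: back off
        }
        size_t acked = 0;
        for (const auto& e : entries) {
          if (ep.send_and_wait(e) != 0)
            break;                     // nack/timeout: keep this and later entries for retry
          ++acked;                     // acked in order, safe to trim
        }
        if (acked > 0)
          q.remove_front(acked);       // remove only the acked prefix
      }
    }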
Yuval
[1] https://docs.ceph.com/docs/master/radosgw/pubsub-module/
[2] https://docs.ceph.com/docs/master/radosgw/notifications/
[3] https://github.com/ceph/ceph/tree/master/src/cls/queue
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx