Re: bucket notification delivery guarantees

Well, the need for reservation is general, arising from the need for
reliable delivery.  You could substitute the fifo abstraction (which
spreads the queue across multiple rados objects) if 128M is too small.

Matt

On Thu, Jan 16, 2020 at 1:23 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
>
> On 1/16/20 12:13 PM, Yuval Lifshitz wrote:
> > two updates on the design (after some discussions):
> >
> > (1) "best effort queue" (stretch goal) is probably not needed:
> >  - cls queue performance should be high enough when put on fast media pool
> >  - the "acl level" settings allow for existing mechanism to perform as
> > "best effort" and non-blocking for topics that does not need delivery
> > guarantees
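> >
> > As a concrete (illustrative) example, using the parameter names from
> > the notifications doc [2], a non-blocking topic could be created with
> > an ack level of "none":
> >
> >     POST
> >     Action=CreateTopic
> >     &Name=besteffort-topic
> >     &push-endpoint=kafka://mybroker:9092
> >     &kafka-ack-level=none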
> >
> > (2) since the cls queue does not allow for random access (without a
> > linear search), retries will have to be implemented based only on
> > the end of the queue. This means that we must assume that acks or
> > nacks arrive in the same order in which the notifications were sent.
> > This is true only for a specific endpoint (e.g. a specific kafka
> > broker), which means that there will have to be a separate cls queue
> > instance per endpoint (see the sketch below)
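> >
> > A minimal sketch of that in-order retry window, one queue instance
> > per endpoint (all names here are hypothetical, not the actual
> > cls_queue API):
> >
> >     #include <deque>
> >     #include <string>
> >
> >     struct Entry { std::string payload; };  // a queued notification
> >     void queue_remove(const Entry&);  // hypothetical: drop from queue
> >     void retry(const Entry&);         // hypothetical: resend to endpoint
> >
> >     // one cls queue instance per endpoint, so replies come back in
> >     // the same order the notifications were sent
> >     struct EndpointQueue {
> >       std::deque<Entry> inflight;     // sent, awaiting ack/nack
> >
> >       void on_reply(bool ack) {
> >         // ordering assumption: the reply is for the oldest entry
> >         Entry e = inflight.front();
> >         inflight.pop_front();
> >         if (ack)
> >           queue_remove(e);            // delivered: drop from the queue
> >         else
> >           retry(e);                   // nack or timeout: resend
> >       }
> >     };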
> >
> >
> > On Tue, Jan 14, 2020 at 3:47 PM Yuval Lifshitz <ylifshit@xxxxxxxxxx
> > <mailto:ylifshit@xxxxxxxxxx>> wrote:
> >
> >     Dear Community,
> >     I would like to share some design ideas around the above topic.
> >     Feedback is welcome!
> >
> >     Current State
> >
> >     - in "pull mode" [1] we have the same guarantees as the multisite
> >     syncing mechanism (guarantee against HW/SW failures). On top of
> >     that, if writing the event to RADOS fails, this trickle back as
> >     sync failure, which means that the master zone will try to sync
> >     the pubsub zone
> >
> >     - in "push mode" [2] we send the notification from the ops context
> >     that triggered the notification. The original operation is blocked
> >     until we get a reply from the endpoint. As part of the
> >     configuration for the endpoint, we also configure the "ack level",
> >     indicating whether we block until we get a reply from the endpoint
> >     or not.
> >     Since the operation response is not sent back to the client until
> >     the endpoint acks, this method guarantees against any failure in
> >     the radosgw (at the cost of adding latency to the operation).
> >     This, however, does not guarantee delivery if the endpoint is down
> >     or disconnected. The endpoint we interact with (rabbitmq, kafka) ,
> >     usually have built in redundancy mechanism, but this does not
> >     cover the case where there is a network disconnect between our
> >     gateways and these systems.
> >     In some cases we can get a nack from the endpoint, indicating that
> >     our message will never reach it. But we can only log these cases:
> >     - we cannot fail the operation that triggered the notification,
> >     because we send the notification only after the actual operation
> >     (e.g. "put object") was done (= no atomicity)
> >     - there is no retry mechanism (in theory, we could add one)
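> >
> >     To make the above concrete, a simplified sketch of the current
> >     push-mode flow (illustrative C++; Op, do_put_object,
> >     push_notification and log_failure are hypothetical names, not the
> >     actual radosgw code):
> >
> >         // the notification is sent only after the op succeeds
> >         int complete_put_object(Op& op) {
> >           int r = do_put_object(op);   // the actual operation
> >           if (r < 0)
> >             return r;                  // op failed, nothing to notify
> >           r = push_notification(op);   // blocks per the ack level
> >           if (r < 0)
> >             log_failure(op, r);  // object already written: we cannot
> >                                  // fail the client op or retry
> >           return 0;
> >         }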
> >
> >     Next Phase Requirements
> >
> >     We would like to add a delivery guarantee to "push mode" for
> >     endpoint failures. For that we would use a message queue with the
> >     following features:
> >     - rados backed, so it survives HW/SW failures
> >     - blocking only on local reads/writes (so it introduces smaller
> >     latency than over-the-wire endpoint acks)
> >     - reserve/commit semantics, so we can "reserve" a slot before the
> >     operation (e.g. "put object") is performed, fail the operation if
> >     we cannot reserve a slot on the queue, commit the notification to
> >     the queue only after the operation succeeds, and unreserve if the
> >     operation fails (see the sketch below)
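> >
> >     A minimal sketch of those reserve/commit semantics around the op
> >     (hypothetical wrapper names, not an actual cls_queue interface):
> >
> >         // reserve before the op, commit after, unreserve on failure
> >         int put_object_with_notification(Queue& q, Op& op) {
> >           ReservationId res;
> >           int r = q.reserve(notification_size(op), &res); // local write
> >           if (r < 0)
> >             return r;            // queue full: fail the op up front
> >           r = do_put_object(op);
> >           if (r < 0) {
> >             q.unreserve(res);    // op failed: release the slot
> >             return r;
> >           }
> >           // durable from here on; the retry mechanism takes over
> >           return q.commit(res, make_notification(op));
> >         }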
> >
> I guess this reservation piece is only a requirement because of the
> choice of cls_queue, which resides in a single rados object and so
> enforces a bound on the total space used. The maximum size is
> configurable, but can't exceed osd_max_object_size=128M. How many
> notifications could we fit within that 128M limit? I worry that
> clusters at sufficient scale could fill that pretty quickly if the
> notification endpoint is unavailable or slow, and that would leave
> radosgw unable to satisfy any request that would generate a
> notification.
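>
> Back of the envelope, assuming each queued entry (JSON record plus
> header overhead) is around 1 KiB: 128 MiB holds roughly 130,000
> notifications, so a cluster generating 1,000 notifying writes per
> second would fill the queue in about two minutes of endpoint downtime.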
>
> >     - we would have a retry mechanism based on the queue, which means
> >     that if a notification was successfully pushed into the queue, we
> >     can assume it will (eventually) be successfully delivered to the
> >     endpoint
> >
> >     Proposed Solution
> >
> >     - use cls_queue [3] (cls_queue is not omap based, hence no
> >     built-in iops limitations)
> >     - add reserve/commit functionality (probably storing that info in
> >     the queue head)
> >     - dedicated thread(s) should be reading requests from the queue,
> >     sending the notifications to the endpoints, and waiting for the
> >     replies (if needed) - this should be done via coroutines (see the
> >     sketch after this list)
> >     - acked requests are removed from the queue; nacked or timed-out
> >     requests should be retried (at least for a while)
> >     - both mechanisms would coexist, as this would be configurable per
> >     topic
> >     - as a stretch goal, we may add a "best effort queue". This would
> >     be similar to the cls_queue solution, but won't address radosgw
> >     failures (as the queue would be in-memory), only endpoint
> >     failures/disconnects
> >     - for now, this mechanism won't be supported for pushing events
> >     from the pubsub zone (="pull+push mode"), but might be added if
> >     users find it useful
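> >
> >     A rough sketch of that consumer loop (illustrative names only;
> >     the real implementation would use coroutines and the cls_queue
> >     client calls):
> >
> >         // drain the queue end: send, await replies in order, remove
> >         // acked entries, leave nacked/timed-out ones for a later pass
> >         void queue_worker(Queue& q, Endpoint& ep) {
> >           while (!stopping()) {
> >             std::vector<Entry> batch = q.list_end(kMaxBatch);
> >             for (Entry& e : batch)
> >               ep.send(e);              // async push to the endpoint
> >             for (Entry& e : batch) {
> >               if (ep.wait_reply(e, kTimeout) == Reply::Ack) {
> >                 q.remove(e);           // delivered
> >               } else {
> >                 break;  // replies are ordered per endpoint, so stop
> >                         // here and retry from this entry next pass
> >               }
> >             }
> >           }
> >         }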
> >
> >     Yuval
> >
> >     [1] https://docs.ceph.com/docs/master/radosgw/pubsub-module/
> >     [2] https://docs.ceph.com/docs/master/radosgw/notifications/
> >     [3] https://github.com/ceph/ceph/tree/master/src/cls/queue
> >
> >



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309