Re: Appending to a rados object with feedback

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 26 Jan 2015 22:38:53 -0800

On Mon, Jan 26, 2015 at 6:47 PM, Kim Vandry <vandry@xxxxxxxxx> wrote:
> Hello Ceph users,
>
> In our application, we found that we have a use case for appending to a
> rados object in such a way that the client knows afterwards at what offset
> the append happened, even while there may be other concurrent clients doing
> the same thing.
>
> At first I thought the client might use a write op for this purpose, which
> allows multiple OSD operations to happen atomically. My understanding is
> that successful write ops cannot return any data, so one cannot stat the
> object, then append, then return the size obtained from the stat (which is
> guaranteed to be the append offset). Instead, the following algorithm can be
> used:
>
> 1. client stats the object to get its size
> 2. client issues an (atomic) write op which first verifies that the size is
> still equal to what it was in step 1, and if yes then appends data. If no,
> then the write op fails and the client returns to step 1.
>
> But while there exists rados_write_op_cmpxattr() which offers a similar
> validation feature for xattrs, there does not seem to be a way to validate
> the size of an object in a write op.
>
> To get around this, we wrote a Ceph class to implement step 2 above. It
> takes an offset and some data as input, and appends the data to the object
> only if the offset matches the object's size.
>
> Did we miss another, simpler way of doing this? Is using a class a good idea
> in this case?
>
> By the way, I have a question about the class. Following the example in
> cls_hello.cc method record_hello, our method calls cls_cxx_stat() and yet is
> declared CLS_METHOD_WR, not CLS_METHOD_RD|CLS_METHOD_WR. Is stating an
> object not considered reading it? How come the method does not need the
> CLS_METHOD_RD flag? I tried including that flag to see what would happen but
> then my method was unable to create new objects, which we want to support
> with the same meaning as appending to a 0-size object. It seems that in that
> case Ceph asserts that the object exists before calling the method.

Mmmm, this actually might be an issue. Write ops don't always force an
object into a readable state before being processed, so you could read
out-of-date status in some cases. :/
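
For what it's worth, the stat-then-guarded-append loop from the quoted
steps 1 and 2 can be modelled in a few lines. This is only an in-memory
sketch, not librados or cls code: FakeObject, append_if_size() and
append_with_offset() are made-up names, and a plain function call stands
in for the atomicity of a single op.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical in-memory stand-in for a RADOS object (not a librados type).
class FakeObject {
public:
    // Models a stat: report the current size in bytes.
    uint64_t stat() const { return data_.size(); }

    // Models the class method from step 2: append only if the object is
    // still `expected_size` bytes long. One function call stands in for
    // one atomic op on the OSD.
    bool append_if_size(uint64_t expected_size, const std::string& buf) {
        if (data_.size() != expected_size) {
            return false;  // someone else appended first; caller retries
        }
        data_ += buf;
        assert(data_.size() == expected_size + buf.size());
        return true;
    }

    const std::string& contents() const { return data_; }

private:
    std::string data_;
};

// Client side: stat, attempt the guarded append, retry on a lost race.
// Returns the offset at which `buf` was written.
uint64_t append_with_offset(FakeObject& obj, const std::string& buf) {
    for (;;) {
        const uint64_t size = obj.stat();
        if (obj.append_if_size(size, buf)) {
            return size;
        }
    }
}
```

Appending "abc" and then "de" to an empty FakeObject returns offsets 0
and 3. The model assumes the size check always sees current state, so it
does not capture the stale-read concern mentioned above.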

> We also briefly tried an alternative method using locking:
> rados_lock_exclusive(), rados_stat(), rados_append(), rados_unlock() but I
> felt that wasn't as good a solution because locks don't block waiting to
> be acquired, can remain stuck if a client terminates abnormally, and that
> solution involves more round trips between the client and server anyway.
>
> Finally, is native support for this feature something that the Ceph team
> would consider including?

I don't have the exact API calls to hand, but librados exposes
versions on op completion and you can assert the version when
submitting ops, too. Did you check that out?
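
A rough model of the version-assert idea, with the caveat that it does
not call librados at all. VersionedObject, stat_with_version(),
append_assert_version() and append_at_version() are hypothetical names
for illustration. The C API entry points that appear to correspond are
rados_get_last_version() and rados_write_op_assert_version(), but that
is an assumption worth confirming against librados.h.

```cpp
#include <cstdint>
#include <string>

// Hypothetical in-memory model of an object with a version counter.
struct VersionedObject {
    std::string data;
    uint64_t version = 0;  // bumped on every successful write
};

// What the client learns from a read: the size plus the object version.
struct StatResult {
    uint64_t size;
    uint64_t version;
};

StatResult stat_with_version(const VersionedObject& obj) {
    return StatResult{obj.data.size(), obj.version};
}

// Models a write op that carries a version assertion: the append is
// applied only if the object is still at `expected_version`.
bool append_assert_version(VersionedObject& obj, uint64_t expected_version,
                           const std::string& buf) {
    if (obj.version != expected_version) {
        return false;  // object changed since the read; caller retries
    }
    obj.data += buf;
    ++obj.version;
    return true;
}

// Client loop: read size and version, append under the assertion, and
// retry if the assertion fails. Returns the offset of the append.
uint64_t append_at_version(VersionedObject& obj, const std::string& buf) {
    for (;;) {
        const StatResult st = stat_with_version(obj);
        if (append_assert_version(obj, st.version, buf)) {
            return st.size;
        }
    }
}
```

The shape is the same as the size check, except that the guard is the
version observed with the stat rather than the size itself.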

Depending on your application, you might also want to explore a few
other options:
1) a class op that does the write and records the offset into a
user-specified or well-defined omap key
2) just using omap keys instead of blobs
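
Option 1 might look roughly like this. Again an in-memory sketch under
assumptions, not real cls code: ObjectWithOmap, append_and_record() and
lookup_offset() are invented names, and the key format is whatever the
caller picks (it needs to be unique per append).

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical in-memory model of an object with data and an omap.
struct ObjectWithOmap {
    std::string data;
    std::map<std::string, std::string> omap;
};

// Models a class method that appends `buf` and, in the same op, stores
// the offset it landed at under the caller-supplied omap key.
void append_and_record(ObjectWithOmap& obj, const std::string& key,
                       const std::string& buf) {
    const uint64_t offset = obj.data.size();
    obj.data += buf;
    obj.omap[key] = std::to_string(offset);
}

// Models the follow-up read: fetch the recorded offset for `key`.
// Throws std::out_of_range if the key was never written.
uint64_t lookup_offset(const ObjectWithOmap& obj, const std::string& key) {
    return std::stoull(obj.omap.at(key));
}
```

The client never has to retry: it appends under a key it chose, then
reads that key back to learn the offset, at the cost of one extra read.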

-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com