Appending to a rados object with feedback

Kim Vandry <vandry@xxxxxxxxx> · Tue, 27 Jan 2015 11:47:53 +0900

Hello Ceph users,

In our application, we found that we have a use case for appending to a 
rados object in such a way that the client knows afterwards at what 
offset the append happened, even while there may be other concurrent 
clients doing the same thing.

At first I thought the client might use a write op for this purpose, 
which allows multiple OSD operations to happen atomically. My 
understanding is that successful write ops cannot return any data, so 
one cannot stat the object, then append, then return the size obtained 
from the stat (which is guaranteed to be the append offset). Instead, 
the following algorithm can be used:

1. client stats the object to get its size
2. client issues an (atomic) write op which first verifies that the size 
is still equal to what it was in step 1, and if so appends the data. 
Otherwise the write op fails and the client returns to step 1.
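
To make the two steps above concrete, here is a minimal sketch of the retry 
loop in Python. It does not use librados: FakeObject is a hypothetical 
in-memory stand-in for a single rados object, and append_if_size() is an 
assumed name for the guarded append in step 2. The internal lock only models 
the fact that each operation on an object is applied atomically.

```python
import threading


class FakeObject:
    """Hypothetical in-memory stand-in for one rados object (not librados)."""

    def __init__(self):
        self._data = bytearray()
        # Models per-object atomicity: each operation runs on its own.
        self._lock = threading.Lock()

    def stat(self):
        """Return the current size of the object in bytes."""
        with self._lock:
            return len(self._data)

    def append_if_size(self, expected_size, payload):
        """Append payload only if the object is still expected_size bytes."""
        with self._lock:
            if len(self._data) != expected_size:
                return False  # someone else appended first
            self._data += payload
            return True

    def read(self, offset, length):
        """Return length bytes starting at offset."""
        with self._lock:
            return bytes(self._data[offset:offset + length])


def append_with_offset(obj, payload):
    """Retry steps 1 and 2 until the append lands; return its offset."""
    while True:
        size = obj.stat()                      # step 1: get the size
        if obj.append_if_size(size, payload):  # step 2: guarded append
            return size
        # The size changed between the two steps, so start over.
```

Each caller of append_with_offset() gets back the offset at which its own 
payload was written, even when several threads append concurrently. Under 
heavy contention a client may loop several times before it succeeds.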

But while there exists rados_write_op_cmpxattr() which offers a similar 
validation feature for xattrs, there does not seem to be a way to 
validate the size of an object in a write op.

To get around this, we wrote a Ceph class to implement step 2 above. It 
takes an offset and some data as input, and appends the data to the 
object only if the offset matches the object's size.
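
The behaviour of such a method can be sketched as follows. This is only an 
illustration in Python, not the class itself (a real class method would be 
written in C++ against the objclass interface). The dict-based store, the 
name append_at, and the choice of -ERANGE as the error code are all 
assumptions made for the example. A missing object is treated as having 
size 0, which is the behaviour we want for new objects.

```python
import errno


def append_at(store, name, offset, data):
    """Append data to store[name] only if offset equals its current size.

    store is a plain dict standing in for a pool (hypothetical).
    Returns 0 on success, or a negative errno value on a size mismatch.
    """
    current = store.get(name, b"")  # a missing object counts as size 0
    if offset != len(current):
        return -errno.ERANGE        # assumed error code for a mismatch
    store[name] = current + data
    return 0
```

A client that receives the mismatch error knows another writer got there 
first, and can stat the object again and retry with the new size.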

Did we miss another, simpler way of doing this? Is using a class a good 
idea in this case?

By the way, I have a question about the class. Following the example in 
cls_hello.cc (method record_hello), our method calls cls_cxx_stat() and 
yet is declared CLS_METHOD_WR, not CLS_METHOD_RD|CLS_METHOD_WR. Is 
stat'ing an object not considered reading it? How come the method does 
not need the CLS_METHOD_RD flag? I tried including that flag to see what 
would happen, but then my method was unable to create new objects, which 
we want to support with the same meaning as appending to a 0-size 
object. It seems that in that case Ceph asserts that the object exists 
before calling the method.

We also briefly tried an alternative method using locking: 
rados_lock_exclusive(), rados_stat(), rados_append(), rados_unlock(), but 
I felt that wasn't as good a solution because lock attempts don't block 
waiting for the lock to be acquired, locks can remain stuck if a client 
terminates abnormally, and that solution involves more round trips between 
the client and server anyway.

Finally, is native support for this feature something that the Ceph team 
would consider including?

-kv