rgw atomic operations, revisited

Yehuda Sadeh <yehuda@xxxxxxxxxxx> · Mon, 14 May 2012 23:09:49 -0700

I was recently asked how the RADOS gateway reads and writes its
objects. We revisited the way we handle atomic operations in rgw a
while ago, so here is a brief description of the new scheme.

In the original scheme, an uploaded object was first written to a
temporary location, and once the upload completed it was cloned in a
single atomic operation to its final location. This had a few issues:
 - on non-btrfs backends the clone operation translated to another
full write of the object, thus we ended up writing the object twice
 - in order to clone the object we needed to read it, which meant that
it had to be flushed to disk
 - the clone operation required the use of object locators that could
potentially affect the balancing

The new scheme

We now hold a manifest in the object header that describes where all
the object data is located. It is a map that records, for each object
part, its offset and size within the object, along with the RADOS
object that holds it and the offset within that RADOS object. It is
now common for an object to be spread across more than one RADOS
object: the 'head' object, where the object attributes (and usually
the first object chunk) are located, and zero or more tail objects
(zero for small objects). It is guaranteed that the tail object names
are unique to that specific object instance.
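
To make the manifest description above concrete, here is a minimal
sketch in Python. It is not the rgw implementation: the Extent type,
its field names, the locate() helper and the tail object names are all
hypothetical, chosen only to mirror the four pieces of information the
manifest is said to hold for each part.

```python
from bisect import bisect_right
from dataclasses import dataclass

CHUNK = 512 * 1024  # the 512k chunk size mentioned in this post


@dataclass(frozen=True)
class Extent:
    logical_ofs: int  # offset of the part within the logical object
    size: int         # length of the part in bytes
    rados_name: str   # RADOS object holding the part (hypothetical name)
    rados_ofs: int    # offset of the part inside that RADOS object


# Hypothetical manifest for a 1.25 MiB object: the first chunk lives in
# the head, the rest in two tail objects with a random suffix.
manifest = [
    Extent(0, CHUNK, "photo.jpg", 0),
    Extent(CHUNK, CHUNK, "photo.jpg.x7Kq.1", 0),
    Extent(2 * CHUNK, CHUNK // 2, "photo.jpg.x7Kq.2", 0),
]


def locate(manifest, ofs):
    """Return (rados_name, rados_ofs) for logical byte offset `ofs`.

    Assumes the manifest is sorted by logical_ofs.
    """
    starts = [e.logical_ofs for e in manifest]
    i = bisect_right(starts, ofs) - 1
    if i < 0:
        raise IndexError("offset is before the start of the object")
    e = manifest[i]
    if ofs >= e.logical_ofs + e.size:
        raise IndexError("offset is past the end of the object")
    return e.rados_name, e.rados_ofs + (ofs - e.logical_ofs)
```

For example, locate(manifest, CHUNK + 10) resolves to byte 10 of the
first tail object.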

One thing that hasn't changed with the new scheme is that when we read
an object, we read its first chunk and all its attributes in a single
atomic RADOS operation. This means that reading objects up to the size
of the read chunk (512k) requires only a single round trip.

In the old scheme we kept a tag for each object that represented the
name the object would take once it had been overwritten. That is, if
the object was overwritten, it was guaranteed that the old contents
had been copied to a new location. This required the writer to clone
the object it was overwriting. The tag no longer exists in the new
scheme; the manifest replaces it.

Reading an object

We access the object's head and read its attributes and the first data
chunk. We then continue reading the object using the data locations
specified in the manifest. Note that the first chunk won't
necessarily reside in the head; in such objects the first chunk of
data is part of the tail. Reading the head is always atomic -- done
in a single RADOS operation. Reading the tail is not atomic, but
since the tail resides in RADOS objects whose names are unique to
this object instance, a concurrent overwrite cannot modify it, and we
don't need to access it atomically.
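
The read path can be sketched with a toy in-memory store standing in
for RADOS. Everything here is an assumption made for illustration: the
store dict, the read_head() and read_object() helpers, the tuple layout
of manifest entries, and the 4-byte chunk size used so the example
stays readable. In rgw the head read is a single RADOS operation; the
sketch models that as one function call.

```python
CHUNK = 4  # tiny stand-in for the 512k read chunk

# Toy store: RADOS object name -> {"data": bytes, "attrs": dict}.
# Manifest entries are (logical_ofs, size, rados_name, rados_ofs),
# sorted by logical_ofs.
store = {
    "obj": {
        "data": b"abcd",
        "attrs": {"manifest": [(0, 4, "obj", 0),
                               (4, 6, "obj.r4nd", 0)]},
    },
    "obj.r4nd": {"data": b"efghij", "attrs": {}},
}


def read_head(name):
    """Stand-in for the atomic head read: attributes plus first chunk."""
    head = store[name]
    return dict(head["attrs"]), head["data"][:CHUNK]


def read_object(name):
    """Read the head once, then follow the manifest for the rest."""
    attrs, first = read_head(name)
    out = bytearray()
    for _logical_ofs, size, rados_name, rados_ofs in attrs["manifest"]:
        if rados_name == name and rados_ofs + size <= len(first):
            # Already fetched together with the attributes.
            out += first[rados_ofs:rados_ofs + size]
        else:
            # Tail read: a separate, non-atomic access.
            out += store[rados_name]["data"][rados_ofs:rados_ofs + size]
    return bytes(out)
```

An object that fits within one chunk is served entirely by read_head(),
which corresponds to the single round trip described earlier.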

Writing an object (regular case)

We first generate a unique object name for the tail by appending a
random string to the original object name. The tail object is also
created in a separate namespace.
We skip writing the first chunk (512k) of data and cache it in
memory. We then write the tail.
When we finish writing the tail, we write the head object (the first
chunk, the object manifest, and the attributes) in a single RADOS
compound operation.
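
The ordering above (tail first, head last) is what makes the write
appear atomic to readers, and can be sketched as follows. This is a
simplified model under stated assumptions, not rgw code: the store
dict, write_head() and put_object() are hypothetical, the "tail/"
prefix merely stands in for the separate namespace, the whole tail is
written to one object, and the chunk size is shrunk to 4 bytes.

```python
import secrets

CHUNK = 4   # tiny stand-in for the 512k first chunk
store = {}  # toy store: name -> {"data": bytes, "attrs": dict}


def write_head(name, data, attrs):
    """Stand-in for the compound op: data, manifest and attrs together."""
    store[name] = {"data": data, "attrs": attrs}


def put_object(name, data, user_attrs=None):
    # Hold the first chunk back in memory; it is written last.
    first, rest = data[:CHUNK], data[CHUNK:]
    manifest = [(0, len(first), name, 0)]
    if rest:
        # Unique tail name: original name plus a random string.
        tail = f"tail/{name}.{secrets.token_hex(8)}"
        store[tail] = {"data": rest, "attrs": {}}
        manifest.append((len(first), len(rest), tail, 0))
    # Only now does the new version become visible to readers.
    attrs = dict(user_attrs or {}, manifest=manifest)
    write_head(name, first, attrs)
    return manifest
```

Because each write uses a fresh tail name, overwriting an object leaves
the previous tail untouched in this sketch, so a reader that already
holds the old manifest can still follow it. The sketch does not remove
old tails.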

Writing an object (multipart upload)

Each part is uploaded separately to a unique location in the
'multipart' namespace. When the multipart upload completes, we
generate a head object with a manifest that points to where all the
object parts reside. Note that in the multipart case the head object
contains only the object manifest and attributes; it does not contain
any data.
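
A rough sketch of the multipart case, again against a toy in-memory
store. The upload_part() and complete_upload() helpers, the upload id,
and the "multipart/" name prefix (standing in for the namespace) are
hypothetical names used only for illustration.

```python
store = {}  # toy store: name -> {"data": bytes, "attrs": dict}


def upload_part(name, upload_id, part_num, data):
    """Write one part to its own uniquely named object."""
    part_name = f"multipart/{name}.{upload_id}.{part_num}"
    store[part_name] = {"data": data, "attrs": {}}
    return part_name, len(data)


def complete_upload(name, parts, user_attrs=None):
    """Build the head from (part_name, size) pairs given in part order."""
    manifest, ofs = [], 0
    for part_name, size in parts:
        manifest.append((ofs, size, part_name, 0))
        ofs += size
    # The head carries the manifest and attributes, but no data.
    attrs = dict(user_attrs or {}, manifest=manifest)
    store[name] = {"data": b"", "attrs": attrs}
    return manifest
```

Completing the upload only writes the head; the part data stays where
it was uploaded and is reached through the manifest.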

HTH,
Yehuda