Re: RBD layering design draft

Tommi Virtanen <tv@xxxxxxxxxxx> · Fri, 22 Jun 2012 09:27:11 -0700

On Thu, Jun 21, 2012 at 2:51 PM, Alex Elder <elder@xxxxxxxxxxxxx> wrote:
>> Before cloning a snapshot, you must mark it as preserved, to prevent
>> it from being deleted while child images refer to it:
>> ::
>>
>>     $ rbd preserve pool/image@snap
>
> Why is it necessary to do this?  I think it may be desirable to

So the snapshot will not be removed.

See this: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6595/focus=6675

>>     $ rbd clone --parent pool/parent@snap pool2/child1
>
> Based on my comments above, if the parent had not been "preserved"
> it would automatically be at this point, by virtue of the fact it
> has a clone associated with it.

The client creating the child typically has no write access to the
parent, and cannot do anything to it.

>> To delete the parent, you must first mark it unpreserved, which checks
>> that there are no children left:
>> ::
>
> Please show what happens here if this is done at this point:
>
>      $ rbd snap rm pool/image@snap

rbd: Cannot remove a preserved snapshot: pool/image@snap

or something like that.

> Note that the "preserve" and "unpreserve" operations are
> valid on snapshots, not RBD images or clones.

That's a very good point. Perhaps the command should be "rbd snap
preserve" and "rbd snap unpreserve".

>> In the initial implementation, called 'trivial layering', there will
>> be no tracking of which objects exist in a clone. A read that hits a
>> non-existent object will attempt to read from the parent object, and
>> this will continue recursively until an object exists or an image with
>> no parent is found.
>
> So a non-existent object in a clone is a bit like a hole in a file, but
> instead of implicitly backing it with zeroes it backs it with the data
> found at the same range as the snapshot the clone was based on?

Yes.

Continuation of that: will the clone store sparse objects, or always
copy all the data for that object from the parent? That is, what
happens if I write 1 byte to a fresh clone? (And remember that block
sizes can differ.)

> If a clone had snapshots, does this mean a snapshot can include
> non-existent objects in it?

I don't like the phrase "include non-existent objects", and find that
an overambitious topological exercise, but yes, a snapshot may be
sparse.

Reads fall through toward parents until they find something -- or run
out of parents, in which case they read zeros.
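
A rough sketch of that fall-through, in Python. The names here (`Image`, `objects`, `read_object`) are hypothetical illustrations, not the librbd API, and the sketch ignores resizing and differing object sizes between layers:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

OBJECT_SIZE = 4  # tiny object size, for illustration only


@dataclass
class Image:
    """Hypothetical in-memory stand-in for an image or snapshot layer."""
    objects: Dict[int, bytes] = field(default_factory=dict)  # object number -> data
    parent: Optional["Image"] = None


def read_object(image: Image, objno: int) -> bytes:
    """Walk up the parent chain until some layer has the object."""
    layer: Optional[Image] = image
    while layer is not None:
        if objno in layer.objects:
            return layer.objects[objno]
        layer = layer.parent
    # Ran out of parents: the range reads as zeros.
    return bytes(OBJECT_SIZE)


base = Image(objects={0: b"AAAA"})
mid = Image(objects={1: b"BBBB"}, parent=base)
clone = Image(parent=mid)

print(read_object(clone, 0))  # served by base
print(read_object(clone, 1))  # served by mid
print(read_object(clone, 2))  # no layer has it, so zeros
```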

> Does this mean that an attempt to read beyond the end of an RBD snapshot
> is not an error if the read is being done for a clone whose size has
> been increased from what it was originally?  (In that case, the correct
> action would be to read the range as zeroes.)

This was discussed later in the email, and I see you responded to that part.

>> In addition to knowing which parent a given image has, we want to be
>> able to tell if a preserved image still has children. This is
>> accomplished with a new per-pool object, `rbd_children`, which maps
>> (parent pool, parent id, parent snapshot id) to a list of child
>
> My first thought was, why does the parent snapshot need to know the
> *identity* of its descendant clones?  The main thing it seems to need
> is a count of the number of clones it has.

Maintaining that count in a distributed system, without listing the
things that are in it, gets challenging. Idempotent counters are
challenging. Maintaining it as a set is easier, significantly more
debuggable, and unlikely to be too costly. Plus it lets us serve "rbd
children" faster.
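
To illustrate why the set is easier, here is a minimal sketch of a client that resends a "register child" request after losing the reply. `CounterRecord`, `SetRecord`, and `register_with_retry` are made-up names for this example, not part of the design:

```python
class CounterRecord:
    """Parent-side record that only keeps a count of children."""

    def __init__(self):
        self.count = 0

    def add_child(self, child_id):
        # Not idempotent: a replayed request is counted again.
        self.count += 1


class SetRecord:
    """Parent-side record that keeps the set of child ids."""

    def __init__(self):
        self.children = set()

    def add_child(self, child_id):
        # Idempotent: adding the same id twice changes nothing.
        self.children.add(child_id)


def register_with_retry(record, child_id):
    """Simulate a request that is applied, then resent after a lost reply."""
    record.add_child(child_id)
    record.add_child(child_id)  # the retry


counter = CounterRecord()
members = SetRecord()
register_with_retry(counter, "child1")
register_with_retry(members, "child1")

print(counter.count)          # 2, although only one child exists
print(len(members.children))  # 1
```

The set also tells you which child is keeping the parent alive, which a bare count cannot.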

> The other thing though is that you shouldn't store the mapping
> in the "rbd_children" object.  Instead, you should only store
> the child object ids there, and consult those objects to identify
> their parents.  Otherwise you end up with problems related to
> possible discrepancy between what a child points to and what the
> "rbd_children" mapping says.

The question we need to ask is "who here is a child of $FOO". Needing
an indirection for every member makes that cost a lot more.

>> image ids. This is stored in the same pool as the child image
>> because the client creating a clone already has read/write access to
>> everything in this pool. This lets a client with read-only access to
>> one pool clone a snapshot from that pool into a pool they have full
>> access to. It increases the cost of unpreserving an image, since this
>
> This is really a bad feature of this design because it doesn't scale.
> So we ought to be thinking about a better way to do it if possible.

That would be nice. Good luck! We await your email, though not holding
our breath ;)

>> To support resizing of layered images, we need to keep track of the
>> minimum size the image ever was, so that if a child image is shrunk
>
> We don't want the minimum size.  We want to know the highest valid
> offset in the image:
> - Upon cloning, the last valid offset of the clone is set to the last
>  valid offset of the snapshot.
> - If an image is resized larger, the last valid offset remains the same.
> - If an image is resized smaller, the last valid offset is reduced to
>  the new, smaller size.
> - If data is written to an image at an offset between the last valid
>  offset and the image size, the last valid offset is updated to
>  reflect the newly-written data.

If I resize the child down, then resize it up again, and write in the
middle of the resized range, will the non-written parts above your
valid_offset be zero? That sounds like a difference in your & Josh's
designs, and something you two need to sort out.
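
To make the question concrete, here is a sketch of the four quoted rules as I read them. It assumes the "last valid offset" is an exclusive end offset and that a write moves it to the end of the written range; `LayeredImage` and `valid_end` are hypothetical names:

```python
class LayeredImage:
    """Tracks only the size and the 'last valid offset' of a clone."""

    def __init__(self, snapshot_valid_end):
        # Rule 1: a clone inherits the snapshot's last valid offset.
        self.size = snapshot_valid_end
        self.valid_end = snapshot_valid_end

    def resize(self, new_size):
        if new_size < self.size:
            # Rule 3: shrinking reduces the valid offset to the new size.
            self.valid_end = min(self.valid_end, new_size)
        # Rule 2: growing leaves the valid offset unchanged.
        self.size = new_size

    def write(self, offset, length):
        end = offset + length
        if end > self.size:
            raise ValueError("write past end of image")
        # Rule 4: a write beyond the valid offset extends it.
        self.valid_end = max(self.valid_end, end)


img = LayeredImage(100)
img.resize(40)    # shrink: valid_end drops to 40
img.resize(100)   # grow: valid_end stays at 40
gap_start = img.valid_end
img.write(70, 10)  # write in the middle of the regrown range

gap = (gap_start, 70)
print(img.valid_end)  # 80
print(gap)            # (40, 70): below valid_end, but never written
```

The range in `gap` is the part I'm asking about: it now sits below the valid offset without ever having been written in the child.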

>>     get_children(uint64_t parent_pool_id, string parent_image_id,
>>                  uint64_t parent_snap_id, uint64_t max_return,
>>                  string start);
>
> This is the one that requires an exhaustive query across all pools.
>
> This kind of interface implies there is a well-defined ordering of
> image ids.  What does "start" look like?  I guess this generally
> raises a lot of questions about whether this (or perhaps some other)
> interface can reliably produce an accurate list.  (It's like the
> directory interfaces--opendir(), readdir(), seekdir(), etc.)

It's racy against concurrent changes, sure. But we only care about the
races when the parent is preserved, and that guarantees there won't be
new (relevant) children created.
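
One possible reading of `start` and `max_return`, sketched in Python: `start` is the last id returned by the previous call, and each call returns the next ids in sorted order. This is an assumption for illustration, not the interface in the draft; the empty string standing for "from the beginning" is also made up here:

```python
from bisect import bisect_right


def get_children(children_sorted, max_return, start=""):
    """Return up to max_return ids that sort strictly after `start`."""
    i = bisect_right(children_sorted, start)
    return children_sorted[i:i + max_return]


children = ["a", "b", "c", "d"]

page1 = get_children(children, 2)
print(page1)  # ['a', 'b']

# A child goes away between the two calls.
children.remove("b")

# Resuming by value rather than by position does not skip anyone.
page2 = get_children(children, 2, start=page1[-1])
print(page2)  # ['c', 'd']
```

Because the cursor is an id and not an index, children disappearing between calls do not cause the remaining ones to be skipped.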

>>     get_parents(uint64_t max_return, uint64_t start_pool_id,
>>                 string start_image_id, string start_snap_id);
>
> This interface implies an ordering across pool ids, image ids,
> and snapshot ids that I'm not sure we want to rely on.

They're all either numbers or strings, and have a clear correct
hierarchical order of (pool, image, snap).
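
As a small illustration, plain tuple comparison already gives that order. The ids below are invented, and the sketch assumes pool and snapshot ids are compared as numbers and image ids as strings:

```python
# (pool id, image id, snapshot id); values are made up for the example.
keys = [
    (2, "img-a", 1),
    (1, "img-b", 7),
    (1, "img-a", 10),
    (1, "img-a", 9),
]

# Tuples compare element by element: pool first, then image, then snap.
ordered = sorted(keys)
for key in ordered:
    print(key)
# (1, 'img-a', 9)
# (1, 'img-a', 10)
# (1, 'img-b', 7)
# (2, 'img-a', 1)
```

Note that snapshot 9 sorts before snapshot 10 only because the ids are compared as numbers; as strings, "10" would sort before "9".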