Re: rbd layering

Christian Brunner <chb@xxxxxx> · Thu, 3 Feb 2011 21:36:29 +0100

2011/2/2 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
> On Wed, Feb 2, 2011 at 9:47 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> When I mentioned allocation bitmap before, I meant simply a bitmap
>> specifying whether the block exists, that would let us avoid looking for
>> an object in the parent image.  In its simplest form, you would mark the
>> image read-only, then generate the bitmap once.
>> ...
>> Mainly I'm interested in feedback on the simple layering use-case...
> So my thought with the bitmap was that it might make more sense for
> rbd to maintain a bitmap specifying whether the child has overwritten
> the parent block device. Then when doing a read in the parent region,
> rbd defaults to reading the parent block device unless the bitmap says
> the child has overwritten it.
> This is reasonably fast in terms of failed reads and such: assuming
> the bitmap is kept in-memory, you don't need to do multiple attempts
> to do a read, and if the client wants to zero a block, or overwrites a
> block and then deletes it, there's no concern about preventing
> inappropriate fall-through to the parent since the bitmap still has
> that block set as overwritten.
>
> The other advantage to something like this is that it could allow
> overwriting at a finer level than the size of the child's blocks. For
> instance, you might store in 4MB chunks, but that's a bit large for
> some things that are going to commonly change between images like
> config files. So with a bitmap with, say, 1KB resolution you could
> change a config file and have rbd read the block from the parent and
> then plug in the 1KB containing the config file that the child
> overwrote. This doesn't require too much space: storing a
> 1KB-granularity bitmap for a 1GB image only requires 128KB (one bit
> per 1KB block).
> -Greg

I would go for this kind of allocation bitmap, too. However, I'm
wondering whether we could add TRIM support this way as well.

In my scenario the bitmap would be present in every image and would
have a 512-byte resolution, to match the sector size of common hard
disks. Each entry needs to support three states:

0: Block is not allocated
1: Block is allocated in this image (child)
2: Block is allocated in the parent image
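
As a rough size estimate: three states need two bits per entry, so the
map is twice as large as a plain one-bit bitmap at the same resolution.
The helper below is only an illustration; the name bitmap_bytes and its
parameters are made up for this mail and are not part of rbd.

```python
def bitmap_bytes(image_bytes, block_size=512, bits_per_block=2):
    """Bytes needed for an allocation map covering image_bytes of data."""
    # Round up so a partial trailing block still gets an entry.
    blocks = (image_bytes + block_size - 1) // block_size
    # Round up to whole bytes.
    return (blocks * bits_per_block + 7) // 8


GIB = 2 ** 30
three_state_512b = bitmap_bytes(GIB)           # 512-byte blocks, 2 bits each
one_bit_1kb = bitmap_bytes(GIB, 1024, 1)       # 1 KB blocks, 1 bit each
print(three_state_512b // 1024, "KiB")
print(one_bit_1kb // 1024, "KiB")
```

For a 1 GiB image this comes to 512 KiB for the three-state map at
512-byte resolution, against 128 KiB for a one-bit map at 1 KB
resolution.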

- When we create a new image, the bitmap is filled with zeros.
- When we clone an image, we have to copy the bitmap and switch every
block allocated in the source from state 1 to state 2.
- When we write to a block in state 0 or state 2, we have to set it to
state 1 and sync the bitmap to disk.
- When a block is discarded, we set its state to 0 and sync the bitmap
to disk.
- When all blocks of an object are in state 0, we can delete the object.
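
The rules above could be sketched roughly as follows. This is an
in-memory toy, not rbd code: the class AllocationMap, its method names
and the syncs counter (standing in for "sync the bitmap to disk") are
all invented for illustration, and it keeps one list entry per block
instead of packing two bits.

```python
from enum import IntEnum


class BlockState(IntEnum):
    UNALLOCATED = 0  # block is not allocated
    IN_CHILD = 1     # block is allocated in this image
    IN_PARENT = 2    # block is allocated in the parent image


class AllocationMap:
    """Hypothetical three-state map with one entry per 512-byte block."""

    def __init__(self, num_blocks, blocks_per_object):
        self.blocks_per_object = blocks_per_object
        # A new image starts with every block in state 0.
        self.states = [BlockState.UNALLOCATED] * num_blocks
        self.syncs = 0  # how often the map would have been written out

    def _sync(self):
        # Placeholder: a real client would persist the bitmap here.
        self.syncs += 1

    def clone(self):
        # Copy the map; blocks the source owned (1) become parent blocks (2).
        child = AllocationMap(len(self.states), self.blocks_per_object)
        child.states = [
            BlockState.IN_PARENT if s == BlockState.IN_CHILD else s
            for s in self.states
        ]
        return child

    def write(self, block):
        # Only a write to a block in state 0 or 2 changes the map.
        if self.states[block] != BlockState.IN_CHILD:
            self.states[block] = BlockState.IN_CHILD
            self._sync()

    def discard(self, block):
        # Assumption: skip the sync if the block was already in state 0.
        if self.states[block] != BlockState.UNALLOCATED:
            self.states[block] = BlockState.UNALLOCATED
            self._sync()
        # True means the object holding this block could be deleted.
        return self.object_is_empty(block // self.blocks_per_object)

    def object_is_empty(self, obj):
        start = obj * self.blocks_per_object
        chunk = self.states[start:start + self.blocks_per_object]
        return all(s == BlockState.UNALLOCATED for s in chunk)
```

Writing twice to the same block only bumps syncs once, and a discard
that empties an object returns True so the caller knows the object can
go away.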

This way, the only added cost would be on the first write to a block
(and on a discard), since those are the only operations that change the
bitmap.
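
On the read side, the three states would presumably map to three
sources, so a read never has to probe for a missing object first. The
sketch below assumes that mapping (it is not spelled out above) and uses
plain byte strings as stand-ins for the child and parent images;
read_block is a made-up name.

```python
UNALLOCATED, IN_CHILD, IN_PARENT = 0, 1, 2
BLOCK_SIZE = 512


def read_block(state, block, child, parent):
    """Return BLOCK_SIZE bytes for one block, chosen by its map state."""
    offset = block * BLOCK_SIZE
    if state == IN_CHILD:
        return child[offset:offset + BLOCK_SIZE]
    if state == IN_PARENT:
        return parent[offset:offset + BLOCK_SIZE]
    # State 0 (never written, or discarded): assumed to read back as zeros.
    return bytes(BLOCK_SIZE)


child_image = b"\x01" * (2 * BLOCK_SIZE)
parent_image = b"\x02" * (2 * BLOCK_SIZE)
from_child = read_block(IN_CHILD, 1, child_image, parent_image)
from_parent = read_block(IN_PARENT, 1, child_image, parent_image)
zeros = read_block(UNALLOCATED, 1, child_image, parent_image)
```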

Regards
Christian