Re: Redirect snapshot COW to alternative pool

On Tue, Mar 29, 2016 at 1:07 PM, Jason Dillaman <dillaman@xxxxxxxxxx> wrote:
>> I think this is where I see slow performance. If you are doing large IO,
>> then copying 4MB objects (assuming defaults) is maybe only 2x the original
>> IO to the disk. However, if you are doing smaller IO, from what I can see a
>> single 4kB write would lead to a 4MB object being copied to the snapshot;
>> with 3x replication, this could be amplification in the thousands. Is my
>> understanding correct here? It's certainly what I see.
>
> The first write (4K or otherwise) to a recently snapshotted object will result in CoW to a new clone of the snapshotted object.  Subsequent writes to the same object will not have the same penalty.  In the parent/child image case, the first write to the child would also result in a full object CoW from the parent to the child.
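
To put rough numbers on that, here is a minimal sketch in Python of the amplification being described, using illustrative assumptions (default 4MB objects, a size=3 pool, 4kB client writes) rather than anything measured:

    # Rough model of the CoW amplification discussed above. The numbers are
    # illustrative assumptions (default 4MB objects, size=3 pool, 4kB writes),
    # not Ceph internals or measurements.
    OBJECT_SIZE = 4 * 1024 * 1024   # default RBD object size
    REPLICATION = 3                 # size=3 pool
    WRITE_SIZE = 4 * 1024           # small client write

    def bytes_to_disk(first_write_after_snapshot):
        """Approximate bytes hitting disk for one client write."""
        cow = OBJECT_SIZE * REPLICATION if first_write_after_snapshot else 0
        return cow + WRITE_SIZE * REPLICATION

    first = bytes_to_disk(True)
    later = bytes_to_disk(False)
    print("first write after snapshot: %.0fx amplification" % (first / float(WRITE_SIZE)))
    print("subsequent writes:          %.0fx amplification" % (later / float(WRITE_SIZE)))
    # -> roughly 3075x for the first 4kB write to a freshly snapshotted object,
    #    and 3x (replication only) for every write after that.

So the thousands-fold figure is about right, but only for the first small write to each object after a snapshot is taken; once the object has been cloned, further writes to it cost only the usual replication overhead.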
>
>> >With RBD layering, you do whole-object copy-on-write from the client.
>> > Doing it from the client does let you put "child" images inside of a faster
>> > pool,
>> > yes. But creating new objects doesn't make the *old* ones slow, so why do
>> > you think there's still the same problem? (Other than "the pool is faster"
>> > being perhaps too optimistic about the improvement you'd get under this
>> > workload.)
>>
>> From reading the RBD layering docs it looked like you could also specify a
>> different object size for the target. If there were some way for the
>> snapshot to have a different object size or some sort of dirty bitmap,
>> then this would reduce the amount of data that has to be copied on each
>> write.
>
> Have you tried using a different object size for your RBD image?  I think your proposal is effectively the same as just reducing the object size (with the added overhead of an OSD<->client round trip for CoW instead of handling it within the OSD directly).  The default 4MB object size was an attempt to strike a balance between the CoW cost and the number of objects the OSDs would have to manage.
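
As a quick illustration of that trade-off, here is a sketch (again in Python, with a hypothetical 100GB image and assumed candidate sizes) of how the object size moves the cost between the per-write CoW penalty and the number of objects the OSDs have to track:

    # Sketch of the object-size trade-off: smaller objects mean less data
    # copied on the first write after a snapshot, but more objects per image
    # for the OSDs to manage. Image size and candidate sizes are assumptions.
    GiB = 1024 ** 3
    MiB = 1024 ** 2
    IMAGE_SIZE = 100 * GiB   # hypothetical RBD image
    REPLICATION = 3

    for object_size in (512 * 1024, 1 * MiB, 4 * MiB, 8 * MiB):
        objects_per_image = IMAGE_SIZE // object_size
        cow_per_first_write = object_size * REPLICATION
        print("%5d KiB objects: %8d objects/image, %5.1f MiB copied on first post-snapshot write"
              % (object_size // 1024, objects_per_image, cow_per_first_write / float(MiB)))

The object size is fixed when an image is created (via the rbd CLI's object size / order option, if I remember the tooling correctly), so experimenting with it means creating a new image and migrating data onto it.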

Let's not overstate things: 4MB was chosen for RBD objects because
that's the size we use for objects in our cluster. That size comes
from the filesystem work Sage did at Santa Cruz and is not entirely
made up, but I don't think there was any kind of realistic testing or
much in the way of numbers behind it. :)
-Greg