> I think this is where I see slow performance. If you are doing large IO,
> then copying 4MB objects (assuming defaults) is maybe only 2x the original
> IO to the disk. However, if you are doing smaller IO, from what I can see a
> single 4KB write would lead to a 4MB object being copied to the snapshot;
> with 3x replication, this could be amplification in the thousands. Is my
> understanding correct here? It's certainly what I see.

The first write (4K or otherwise) to a recently snapshotted object will
result in CoW to a new clone of the snapshotted object. Subsequent writes to
the same object will not have the same penalty. In the parent/child image
case, the first write to the child would also result in a full-object CoW
from the parent to the child.

> > With RBD layering, you do whole-object copy-on-write from the client.
> > Doing it from the client does let you put "child" images inside of a
> > faster pool, yes. But creating new objects doesn't make the *old* ones
> > slow, so why do you think there's still the same problem? (Other than
> > "the pool is faster" being perhaps too optimistic about the improvement
> > you'd get under this workload.)
>
> From reading the RBD layering docs it looked like you could also specify a
> different object size for the target. If there was some way that the
> snapshot could have a different object size, or some sort of dirty bitmap,
> then this would reduce the amount of data that would have to be copied on
> each write.

Have you tried using a different object size for your RBD image? I think
your proposal is effectively the same as just reducing the object size (with
the added overhead of an OSD<->client round-trip for CoW instead of handling
it within the OSD directly). The default 4MB object size was an attempt to
strike a balance between the CoW cost and the number of objects the OSDs
would have to manage.

> What I meant about it slowing down the pool is that, due to the extra 4MB
> copy writes, the maximum small IO you can do is dramatically reduced, as
> each small IO is now a 4MB IO. By shifting the CoW to a different pool you
> could reduce the load on the primary pool and the effect on primary
> workloads. You are effectively shifting this snapshot "tax" onto an
> isolated set of disks/SSDs.

Except eventually all your IO will be against the new "fast" pool, as enough
snapshotted objects will have been CoW'd over to the new pool?

> To give it some context, here is the background on what I am trying to
> achieve. We are currently migrating our OLB service from LVM thinpools to
> Ceph. As part of the service we offer, we take regular archive backups to
> tape and also offer DR tests. Both of these require snapshots, to allow the
> normal OLB backups to continue uninterrupted, and these snapshots may
> potentially be left open for several days at a time. As it's OLB, as you
> can imagine, there is a lot of write IO.
>
> Currently with LVM, although there is a slight performance hit, the block
> size in LVM roughly matches the average IO size (128-512KB) and so the CoW
> process doesn't cause much overhead. When I did some quick fio tests with
> Ceph it seemed to have a much greater knock-on effect when using 4MB
> object RBDs.
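If you want to gauge how much of that knock-on effect is just the 4MB object
size, one quick experiment is to create a test image with a smaller object
order, snapshot it, and re-run your fio workload against it. A rough,
untested sketch follows; the pool name "rbd", the image name "cow-test", and
the /dev/rbd0 device are only placeholders (check the output of "rbd map"
for the real device), and mapping a format 2 image needs a reasonably recent
kernel client:

  # 1MB objects (order 20) instead of the default 4MB (order 22);
  # format 2 is needed for layering/cloning.
  rbd create --pool rbd --image-format 2 --order 20 --size 10240 cow-test

  # Snapshot it so the first write to each object takes the CoW path.
  rbd snap create rbd/cow-test@base

  # Map it and drive small random writes at it, then compare against an
  # identical run on a default-order image.
  rbd map rbd/cow-test
  fio --name=cow-test --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=4k --iodepth=16 --runtime=60 --time_based

That won't change the fundamental whole-object CoW behaviour, but it makes
each first-write copy 4x smaller, which should give you a feel for where the
sweet spot is for your workload before committing to a non-default order in
production.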
> We can probably work around this by having a cluster with more disks, or
> reducing the RBD object size, but I thought it was worth asking in case
> there was any other way round it,
>
> Nick
>
> > There's definitely nothing integrated into the Ceph codebase about
> > internal layering, or a way to redirect snapshots outside of the OSD,
> > though you could always experiment with flashcache et al.
> > -Greg

--
Jason Dillaman

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com