Re: Redirect snapshot COW to alternative pool

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Gregory Farnum
> Sent: 29 March 2016 18:52
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Redirect snapshot COW to alternative pool
> 
> On Sat, Mar 26, 2016 at 3:13 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >
> > Evening All,
> >
> > I’ve been testing the RBD snapshot functionality, and one thing that I have
> seen is that once you take a snapshot of an RBD and perform small random IO
> on the original RBD, performance is really bad due to the amount of write
> amplification going on doing the COWs, i.e. every IO to the parent, no matter
> what size, equals 12MB of writes.
> >
> > I was wondering if there was any way to redirect these writes to a different
> pool. Since only a small capacity would be required, an SSD/NVMe pool could
> be provisioned very cheaply and would hopefully provide enough
> performance to allow the IO operations to the parent to be unaffected.
> >
> > I’ve looked at the RBD layering, which sort of looks like you can do stuff like
> this and also change the order. But it looks like you have to base it on an
> existing snapshot, so I believe I would still have the same problem. Or is
> there a “hidden feature” to make normal snapshots use this layering
> functionality?
> >
> > Nick
> 

Thanks for your response, Greg.

> This isn't quite making sense to me. When you do a snapshot, then as you
> say it's copy-on-write and every operation copies the data to new blocks
> (whole-object copies with XFS; mere local blocks with btrfs) inside of the
> OSD. 

I think this is where I see the slow performance. If you are doing large IO, then copying 4MB objects (assuming defaults) is maybe only 2x the original IO to the disk. However, for smaller IO, from what I can see a single 4kb write leads to a whole 4MB object being copied for the snapshot; with 3x replication that could be amplification in the thousands. Is my understanding correct here? It's certainly what I see.
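To put rough numbers on what I think is happening (back-of-envelope only, assuming default 4MB objects, 3x replication and whole-object COW; these are my assumptions, not something I've measured inside the OSD):

# Rough write amplification for a small write after a snapshot,
# assuming the whole object is copied on every replica.
client_write = 4 * 1024          # 4kb random write from the client
object_size = 4 * 1024 * 1024    # default RBD object size
replication = 3

# The first write to an object after a snapshot copies the whole object
# before the 4kb update is applied, on every replica.
bytes_hitting_disk = (object_size + client_write) * replication
print(bytes_hitting_disk / client_write)   # ~3075x amplification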

> With RBD layering, you do whole-object copy-on-write from the client.
> Doing it from the client does let you put "child" images inside of a faster pool,
> yes. But creating new objects doesn't make the *old* ones slow, so why do
> you think there's still the same problem? (Other than "the pool is faster"
> being perhaps too optimistic about the improvement you'd get under this
> workload.) 

From reading the RBD layering docs, it looked like you could also specify a different object size for the target. If there were some way for the snapshot to have a different object size, or some sort of dirty bitmap, it would reduce the amount of data that has to be copied on each write.
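To illustrate the point, here is the same arithmetic for a few hypothetical object sizes (purely illustrative; I realise there is currently no way to give the snapshot side of a plain snapshot its own object size):

# How the per-write COW cost would shrink with a smaller object size
# (or, equivalently, a finer-grained dirty bitmap). Assumes the same
# whole-object copy and 3x replication as above.
client_write = 4 * 1024
replication = 3

for object_size in (4 * 1024 * 1024, 1024 * 1024, 256 * 1024, 128 * 1024):
    amplification = (object_size + client_write) * replication / client_write
    print(f"{object_size // 1024:>5} KB objects -> ~{amplification:.0f}x")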

What I meant about it slowing down the pool is that, because of the extra 4MB copy writes, the maximum small IO you can do is dramatically reduced, as each small IO is now effectively a 4MB IO. By shifting the COW to a different pool you could reduce the load on the primary pool and the effect on primary workloads; you are effectively shifting this snapshot "tax" onto an isolated set of disks/SSDs.
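For what it's worth, I had a quick go at sketching Greg's suggestion (clone the writable child into a faster pool) with the python-rbd bindings. The pool and image names ('ssd-pool', 'olb-image', etc.) are placeholders for illustration, not anything we have deployed:

# Sketch: snapshot the parent in the spinning pool and clone the writable
# child into a faster SSD/NVMe pool, so new writes (and the COW "tax")
# land on the fast pool.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    hdd_ioctx = cluster.open_ioctx('rbd')        # parent image lives here
    ssd_ioctx = cluster.open_ioctx('ssd-pool')   # hypothetical fast pool

    with rbd.Image(hdd_ioctx, 'olb-image') as parent:
        parent.create_snap('dr-test')
        parent.protect_snap('dr-test')           # clones need a protected snap

    rbd.RBD().clone(hdd_ioctx, 'olb-image', 'dr-test',
                    ssd_ioctx, 'olb-image-dr',
                    features=rbd.RBD_FEATURE_LAYERING)
finally:
    cluster.shutdown()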

To give it some context, here is the background on what I am trying to achieve. We are currently migrating our OLB service from LVM thin pools to Ceph. As part of the service we offer, we take regular archive backups to tape and also offer DR tests. Both of these require snapshots so that the normal OLB backups can continue uninterrupted, and these snapshots can potentially be left open for several days at a time. As it's OLB, as you can imagine, there is a lot of write IO.

Currently with LVM, although there is a slight performance hit, the block size in LVM roughly matches the average IO size (128-512KB), so the COW process doesn't cause much overhead. When I did some quick fio tests with Ceph, it seemed to have a much greater knock-on effect when using RBDs with 4MB objects.

We can probably work around this by having a cluster with more disks, or by reducing the RBD object size, but I thought it was worth asking in case there was any other way around it.
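For completeness, reducing the object size would look something like this via python-rbd (image name and size are placeholders; order is log2 of the object size, so order=17 gives 128KB objects, closer to our typical 128-512KB IOs, at the cost of many more objects per image):

# Sketch: create an RBD image with a smaller object size.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')
    rbd.RBD().create(ioctx, 'olb-image-small-objects',
                     size=500 * 1024**3,   # 500GB placeholder
                     order=17)             # 2**17 = 128KB objects
finally:
    cluster.shutdown()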

Nick

>There's definitely nothing integrated into the Ceph codebase
> about internal layering, or a way to redirect snapshots outside of the OSD,
> though you could always experiment with flashcache et al.
> -Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



