Re: Redirect snapshot COW to alternative pool

> > I think this is where I see slow performance. If you are doing large
> > IO, then copying 4MB objects (assuming defaults) is maybe only 2x
> > the original IO to the disk. However, if you are doing smaller IO,
> > from what I can see a single 4KB write would lead to a 4MB object
> > being copied to the snapshot; with 3x replication, this could be
> > amplification in the thousands. Is my understanding correct here? It's
> > certainly what I see.
> 
> The first write (4K or otherwise) to a recently snapshotted object will result in
> CoW to a new clone of the snapshotted object.  Subsequent writes to the
> same object will not have the same penalty.  In the parent/child image case,
> the first write to the child would also result in a full object CoW from the
> parent to the child.

The IO can sometimes be fairly random depending on the changed blocks in the backup, but yes, sequential writes are much less affected. At the end of each backup it merges the oldest incremental into the full, which is also very random. My tests were more worst-case, but I like to at least know what that limit is, so it doesn't surprise you late on a Friday evening. :-)
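
To put a rough number on that limit, here is a back-of-envelope sketch in Python (purely illustrative), assuming the worst case where every small write lands in a distinct object that has not yet been copied since the snapshot; the 4MB object size and 3x replication are just the defaults discussed above:

object_size = 4 * 1024 * 1024   # default RBD object size (bytes)
io_size     = 4 * 1024          # client write size (bytes)
replication = 3                 # replica count of the pool

# Disk writes across replicas for one 4KB client write that triggers
# a full-object CoW clone plus the write itself.
backend_bytes = (object_size + io_size) * replication
print("~%.0fx write amplification" % (backend_bytes / io_size))  # roughly 3075x

In practice writes that hit an already-copied object pay nothing extra, so this is the ceiling rather than the expected rate.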

> 
> > > With RBD layering, you do whole-object copy-on-write from the client.
> > > Doing it from the client does let you put "child" images inside of a
> > > faster pool, yes. But creating new objects doesn't make the *old*
> > > ones slow, so why do you think there's still the same problem?
> > > (Other than "the pool is faster" being perhaps too optimistic about
> > > the improvement you'd get under this workload.)
> >
> > From reading the RBD layering docs it looked like you could also
> > specify a different object size for the target. If there was some way
> > that the snapshot could have a different object size or some sort of
> > dirty bitmap, then this would reduce the amount of data that would
> > have to be copied on each write.
> 
> Have you tried using a different object size for your RBD image?  I think your
> proposal is effectively the same as just reducing the object size (with the
> added overhead of an OSD<->client round-trip for CoW instead of handling it
> within the OSD directly).  The default 4MB object size was an attempt to
> strike a balance between the CoW cost and the number of objects the OSDs
> would have to manage.

Yeah, this is something we are most likely going to have to do. It's a lot more performant with 1MB objects when using snapshots, but that causes problems in other areas (backfilling, PG splitting...) and overall large-IO performance also seems slightly lower. Using 6TB disks means that there is going to be a ton of objects as well. I was more interested in whether any enhancements were planned around something like a bitmap where the COW would be more granular, but I understand that this is probably quite a unique usage scenario.
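
For reference, a minimal sketch of creating an image with 1MB objects via the python-rados/python-rbd bindings; the pool name 'rbd', the image name 'backup-image' and the 100GB size are placeholders, and order=20 simply means 2^20-byte objects instead of the default order=22 (4MB). The rbd CLI's --order option should achieve the same thing.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('rbd')   # placeholder pool name
    try:
        # order=20 -> 1MB objects; old_format=False gives a format 2 image
        # so layering and snapshot features remain available.
        rbd.RBD().create(ioctx, 'backup-image', 100 * 1024 ** 3,
                         order=20, old_format=False)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()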

The other option might just be to start with a larger cluster, so this snapshot COW stuff is a lower percentage of total performance. 

> 
> >
> > What I meant about it slowing down the pool is that, due to the extra
> > 4MB copy writes, the max small IO you can do is dramatically reduced,
> > as each small IO is now a 4MB IO. By shifting the COW to a different
> > pool you could reduce the load on the primary pool and the effect on
> > primary workloads. You are effectively shifting this snapshot "tax"
> > onto an isolated set of disks/SSDs.
> 
> Except eventually all your IO will be against the new "fast" pool as enough
> snapshotted objects have been CoW'd over to the new pool?

That potentially could be a problem, but I was hoping that 8-10 cheap SSDs should easily cover the required write bandwidth (the writes are infrequent, so low-DWPD drives would do). The incoming write bandwidth for 1 customer might only be about 10MB/s, but at points where the writes were random, I was seeing this saturate a 48-disk cluster once a snapshot was taken. Worst case, this pool might slow down, but that could be an acceptable risk of doing a DR test; what we don't want is to slow down everyone else's backups. I guess QoS could also be another option here, which I know has been discussed.
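
To put rough numbers on why the random phases hit the main pool so hard (purely illustrative, same worst-case assumption as before that every 4KB write hits a distinct not-yet-copied object):

incoming_bw = 10 * 1024 * 1024    # one customer's write bandwidth (bytes/s)
io_size     = 4 * 1024            # assumed random write size (bytes)
object_size = 4 * 1024 * 1024     # RBD object size being cloned (bytes)
replication = 3

iops   = incoming_bw / io_size                 # ~2560 writes/s
cow_bw = iops * object_size * replication      # disk writes across replicas
print("worst-case CoW disk traffic: ~%.0f GiB/s" % (cow_bw / 1024 ** 3))  # ~30

Each object only gets cloned once per snapshot, so the sustained rate is far lower than that, but it shows why a burst of random writes straight after a snapshot can flatten a 48-disk spinning pool, and why isolating that tax onto a few SSDs looks attractive.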

> 
> > To give it some context, here is the background on what I am trying
> > to achieve. We are currently migrating our OLB service from LVM
> > thinpools to Ceph. As part of the service we offer, we take regular
> > archive backups to tape and also offer DR tests. Both of these require
> > snapshots to allow the normal OLB backups to continue uninterrupted
> > and for these snapshots to potentially be left open for several days
> > at a time. As it's OLB, as you can imagine, there is a lot of write IO.
> >
> > Currently with LVM, although there is a slight performance hit, the
> > block size in LVM roughly matches the average IO size (128-512KB) and
> > so the COW process doesn't cause much overhead. When I did some quick
> > FIO tests with Ceph it seemed to have a much greater knock-on effect
> > when using 4MB object RBDs.
> >
> > We can probably work around this by having a cluster with more disks,
> > or reducing the RBD object size, but I thought it was worth asking in
> > case there was any other way round it.
> >
> > Nick
> >
> > > There's definitely nothing integrated into the Ceph codebase about
> > > internal layering, or a way to redirect snapshots outside of the OSD,
> > > though you could always experiment with flashcache et al.
> > > -Greg
> >
> 
> 
> --
> 
> Jason Dillaman
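
P.S. For anyone wanting to see the first-write-after-snapshot penalty discussed above without setting up a full fio run, a minimal sketch using the python-rbd bindings; the pool and image names are placeholders and the image is assumed to already exist:

import time
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')          # placeholder pool name
image = rbd.Image(ioctx, 'backup-image')   # placeholder image name
data = b'\0' * 4096

def timed_write(offset):
    start = time.time()
    image.write(data, offset)
    image.flush()
    return time.time() - start

print("before snapshot:            %.3fs" % timed_write(0))
image.create_snap('cow-test')              # every object is now "cold" again
print("first write after snapshot: %.3fs" % timed_write(0))     # full-object CoW
print("second write, same object:  %.3fs" % timed_write(4096))  # object already cloned

image.remove_snap('cow-test')
image.close()
ioctx.close()
cluster.shutdown()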


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



