> > > > I think this is where I see slow performance. If you are doing
> > > > large IO, then copying 4MB objects (assuming defaults) is maybe
> > > > only 2x the original IO to the disk. However, if you are doing
> > > > smaller IO, from what I can see a single 4KB write would lead to
> > > > a 4MB object being copied to the snapshot; with 3x replication
> > > > this could be amplification in the thousands. Is my
> > > > understanding correct here? It's certainly what I see.
> > >
> > > The first write (4K or otherwise) to a recently snapshotted object
> > > will result in CoW to a new clone of the snapshotted object.
> > > Subsequent writes to the same object will not have the same
> > > penalty. In the parent/child image case, the first write to the
> > > child would also result in a full object CoW from the parent to
> > > the child.
> >
> > The IO can sometimes be fairly random depending on the changed
> > blocks in the backup, but yes, sequential writes are much less
> > affected. At the end of each backup it merges the oldest incremental
> > into the full, which is also very random. My tests were more worst
> > case, but I like to at least know what that limit is, so it doesn't
> > surprise you late on a Friday evening. :-)
>
> Makes sense that sequential would be less affected as compared to
> random. Are you snapshotting all images in parallel or are you doing
> the backup in batches? Note that snap removal does have some cost, as
> the snap trimmer process of the OSD needs to eventually clean up the
> objects associated with the deleted snapshot.

The snapshots are done one at a time and are fairly spaced out.

> > > > > With RBD layering, you do whole-object copy-on-write from the
> > > > > client. Doing it from the client does let you put "child"
> > > > > images inside of a faster pool, yes. But creating new objects
> > > > > doesn't make the *old* ones slow, so why do you think there's
> > > > > still the same problem? (Other than "the pool is faster" being
> > > > > perhaps too optimistic about the improvement you'd get under
> > > > > this workload.)
> > > >
> > > > From reading the RBD layering docs it looked like you could also
> > > > specify a different object size for the target. If there was
> > > > some way that the snapshot could have a different object size or
> > > > some sort of dirty bitmap, then this would reduce the amount of
> > > > data that would have to be copied on each write.
> > >
> > > Have you tried using a different object size for your RBD image? I
> > > think your proposal is effectively the same as just reducing the
> > > object size (with the added overhead of an OSD<->client round-trip
> > > for CoW instead of handling it within the OSD directly). The
> > > default 4MB object size was an attempt to strike a balance between
> > > the CoW cost and the number of objects the OSDs would have to
> > > manage.
> >
> > Yeah, this is something we are most likely going to have to do. It's
> > a lot more performant with 1MB objects when using snapshots, but
> > that causes problems in other areas (backfilling, PG splitting...)
> > and overall large-IO performance also seems slightly lower. Using
> > 6TB disks means that there is going to be a ton of objects as well.
> > I was more interested in whether there were any enhancements planned
> > around something like a bitmap, where the CoW would be more
> > granular, but I understand that this is probably quite a unique
> > usage scenario.
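> >
> > To put a number on the worst case: a 4KB write that triggers a CoW
> > of a 4MB object copies 1024x the data, and with 3x replication
> > that's roughly 12MB written for 4KB changed, i.e. around 3000x
> > amplification. For anyone wanting to test smaller objects, the size
> > is set at create time (pool/image names and size here are made up;
> > order 20 gives 2^20 = 1MB objects versus the default order 22 =
> > 4MB):
> >
> >     rbd create backups/customer1 --size 1048576 --order 20
> >     rbd snap create backups/customer1@dr-test
> >
> > The trade-off is four times as many objects for the OSDs to track.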
> >
> > The other option might just be to start with a larger cluster, so
> > this snapshot CoW stuff is a lower percentage of total performance.
> >
> > > >
> > > > What I meant about it slowing down the pool is that, due to the
> > > > extra 4MB copy writes, the max small IO you can do is
> > > > dramatically reduced, as each small IO is now a 4MB IO. By
> > > > shifting the CoW to a different pool you could reduce the load
> > > > on the primary pool and the effect on primary workloads. You are
> > > > effectively shifting this snapshot "tax" onto an isolated set of
> > > > disks/SSDs.
> > >
> > > Except eventually all your IO will be against the new "fast" pool,
> > > as enough snapshotted objects have been CoW'ed over to the new
> > > pool?
> >
> > That potentially could be a problem, but I was hoping that 8-10
> > cheap SSDs should easily cover the required write bandwidth
> > (infrequent, so low-DWPD drives would do). The incoming write
> > bandwidth for one customer might only be about 10MB/s, but at points
> > where the writes were random, I was seeing this saturate a 48-disk
> > cluster once a snapshot was taken. Worst case this pool might slow
> > down, but that could be an acceptable risk of doing a DR test; what
> > we don't want is to slow down everyone else's backups. I guess QoS
> > could also be another option here, which I know has been discussed.
>
> If this is for DR, may I shamelessly plug the forthcoming RBD
> mirroring support in Jewel [1]? ;-) All modifications are continuously
> replicated to a DR cluster in a crash-consistent manner. The journal
> that it uses to accomplish this can be stored in a separate (faster,
> smaller) pool as compared to your bulk image storage.

Oh believe me, I'm really looking forward to the mirroring
functionality and will be checking it out on Jewel's release (I've
sketched my reading of the setup at the bottom of this mail). But for
this case I don't think it's what I am looking for. When we perform a
DR test, it's on the data the customer has uploaded to us. We take a
snapshot and then present this to our hypervisors. In effect we are the
DR site for the customer.

> > > >
> > > > To give it some context, here is the background on what I am
> > > > trying to achieve. We are currently migrating our OLB service
> > > > from LVM thinpools to Ceph. As part of the service we offer, we
> > > > take regular archive backups to tape and also offer DR tests.
> > > > Both of these require snapshots, to allow the normal OLB backups
> > > > to continue uninterrupted, and these snapshots can potentially
> > > > be left open for several days at a time. As it's OLB, as you can
> > > > imagine, there is a lot of write IO.
> > > >
> > > > Currently with LVM, although there is a slight performance hit,
> > > > the block size in LVM roughly matches the average IO size
> > > > (128-512KB) and so the CoW process doesn't cause much overhead.
> > > > When I did some quick fio tests with Ceph, it seemed to have a
> > > > much greater knock-on effect when using 4MB-object RBDs.
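> > > >
> > > > For anyone wanting to reproduce it, the worst case shows up with
> > > > small random writes against a freshly snapshotted image, along
> > > > these lines (pool/image names are made up; the rbd ioengine
> > > > options are per the fio docs):
> > > >
> > > >     rbd snap create rbd/test@before-writes
> > > >     fio --name=snap-cow --ioengine=rbd --clientname=admin \
> > > >         --pool=rbd --rbdname=test --rw=randwrite --bs=4k \
> > > >         --iodepth=32 --direct=1 --runtime=60 --time_based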
> > > >
> > > > We can probably work around this by having a cluster with more
> > > > disks, or by reducing the RBD object size, but I thought it was
> > > > worth asking in case there was any other way round it.
> > > >
> > > > Nick
> > > >
> > > > > There's definitely nothing integrated into the Ceph codebase
> > > > > about internal layering, or a way to redirect snapshots
> > > > > outside of the OSD, though you could always experiment with
> > > > > flashcache et al.
> > > > > -Greg
> > >
> > > --
> > > Jason Dillaman
>
> [1] http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>
> --
> Jason Dillaman
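
From my reading of [1], the setup would look roughly like the below.
This is only a sketch of my understanding, not something I've run yet;
pool, image, and cluster names are invented and the exact flags may
differ once Jewel is released:

    # enable per-image mirroring on the pool (run on both clusters)
    rbd mirror pool enable backups image
    # journaling requires exclusive-lock; the journal data can
    # apparently be directed to a separate fast pool
    rbd feature enable backups/customer1 exclusive-lock journaling \
        --journal-pool fastssd
    # register the remote cluster as a peer, then run the rbd-mirror
    # daemon on the DR site to replay the journal
    rbd mirror pool peer add backups client.admin@site-a

The appeal would be that the constant small writes land in the fast
journal pool rather than the bulk pool, though as above it doesn't
cover our DR-test-on-uploaded-data case.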