> > > > I think this is where I see slow performance. If you are doing
> > > > large IO, then copying 4MB objects (assuming defaults) is maybe
> > > > only 2x the original IO to the disk. However, if you are doing
> > > > smaller IO, from what I can see a single 4KB write would lead to
> > > > a 4MB object being copied to the snapshot; with 3x replication
> > > > this could be amplification in the thousands. Is my
> > > > understanding correct here? It's certainly what I see.
> > >
> > > The first write (4K or otherwise) to a recently snapshotted object
> > > will result in CoW to a new clone of the snapshotted object.
> > > Subsequent writes to the same object will not have the same
> > > penalty. In the parent/child image case, the first write to the
> > > child would also result in a full object CoW from the parent to
> > > the child.
> >
> > The IO can sometimes be fairly random depending on the changed
> > blocks in the backup, but yes, sequential writes are much less
> > affected. At the end of each backup it merges the oldest incremental
> > into the full, which is also very random. My tests were more worst
> > case, but I like to at least know what that limit is, so it doesn't
> > surprise you late on a Friday evening. :-)
>
> Makes sense that sequential would be less affected as compared to
> random. Are you snapshotting all images in parallel or are you doing
> the backup in batches? Note that snap removal does have some cost, as
> the snap trimmer process of the OSD needs to eventually clean up the
> objects associated with the deleted snapshot.

The snapshots are done one at a time and are fairly spaced out.

> > > > > With RBD layering, you do whole-object copy-on-write from the
> > > > > client. Doing it from the client does let you put "child"
> > > > > images inside of a faster pool, yes. But creating new objects
> > > > > doesn't make the *old* ones slow, so why do you think there's
> > > > > still the same problem? (Other than "the pool is faster" being
> > > > > perhaps too optimistic about the improvement you'd get under
> > > > > this workload.)
> > > >
> > > > From reading the RBD layering docs it looked like you could also
> > > > specify a different object size for the target. If there was
> > > > some way that the snapshot could have a different object size or
> > > > some sort of dirty bitmap, then this would reduce the amount of
> > > > data that would have to be copied on each write.
> > >
> > > Have you tried using a different object size for your RBD image? I
> > > think your proposal is effectively the same as just reducing the
> > > object size (with the added overhead of an OSD<->client round-trip
> > > for CoW instead of handling it within the OSD directly). The
> > > default 4MB object size was an attempt to strike a balance between
> > > the CoW cost and the number of objects the OSDs would have to
> > > manage.
> >
> > Yeah, this is something we are most likely going to have to do. It's
> > a lot more performant with 1MB objects when using snapshots, but
> > that causes problems in other areas (backfilling, PG splitting...)
> > and overall large-IO performance also seems slightly lower. Using
> > 6TB disks means that there is going to be a ton of objects as well.
> > I was more interested in whether there were any enhancements planned
> > around something like a bitmap, where the CoW would be more
> > granular, but I understand that this is probably quite a unique
> > usage scenario.
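> >
> > To put a number on the worst case: a 4KB write that triggers a CoW
> > of a 4MB object copies 1024x the data, and with 3x replication
> > that's roughly 12MB written for 4KB changed, i.e. around 3000x
> > amplification. For anyone wanting to test smaller objects, the size
> > is set at create time (pool/image names and size here are made up;
> > order 20 gives 2^20 = 1MB objects versus the default order 22 =
> > 4MB):
> >
> >     rbd create backups/customer1 --size 1048576 --order 20
> >     rbd snap create backups/customer1@dr-test
> >
> > The trade-off is four times as many objects for the OSDs to track.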
> >
> > The other option might just be to start with a larger cluster, so
> > this snapshot CoW stuff is a lower percentage of total performance.
> >
> > > >
> > > > What I meant about it slowing down the pool is that, due to the
> > > > extra 4MB copy writes, the max small IO you can do is
> > > > dramatically reduced, as each small IO is now a 4MB IO. By
> > > > shifting the CoW to a different pool you could reduce the load
> > > > on the primary pool and the effect on primary workloads. You are
> > > > effectively shifting this snapshot "tax" onto an isolated set of
> > > > disks/SSDs.
> > >
> > > Except eventually all your IO will be against the new "fast" pool,
> > > as enough snapshotted objects have been CoW'ed over to the new
> > > pool?
> >
> > That potentially could be a problem, but I was hoping that 8-10
> > cheap SSDs should easily cover the required write bandwidth
> > (infrequent, so low-DWPD drives would do). The incoming write
> > bandwidth for one customer might only be about 10MB/s, but at points
> > where the writes were random, I was seeing this saturate a 48-disk
> > cluster once a snapshot was taken. Worst case this pool might slow
> > down, but that could be an acceptable risk of doing a DR test; what
> > we don't want is to slow down everyone else's backups. I guess QoS
> > could also be another option here, which I know has been discussed.
>
> If this is for DR, may I shamelessly plug the forthcoming RBD
> mirroring support in Jewel [1]? ;-) All modifications are continuously
> replicated to a DR cluster in a crash-consistent manner. The journal
> that it uses to accomplish this can be stored in a separate (faster,
> smaller) pool as compared to your bulk image storage.

Oh believe me, I'm really looking forward to the mirroring
functionality and will be checking it out on Jewel's release (I've
sketched my reading of the setup at the bottom of this mail). But for
this case I don't think it's what I am looking for. When we perform a
DR test, it's on the data the customer has uploaded to us. We take a
snapshot and then present this to our hypervisors. In effect we are the
DR site for the customer.

> > > >
> > > > To give it some context, here is the background on what I am
> > > > trying to achieve. We are currently migrating our OLB service
> > > > from LVM thinpools to Ceph. As part of the service we offer, we
> > > > take regular archive backups to tape and also offer DR tests.
> > > > Both of these require snapshots, to allow the normal OLB backups
> > > > to continue uninterrupted, and these snapshots can potentially
> > > > be left open for several days at a time. As it's OLB, as you can
> > > > imagine, there is a lot of write IO.
> > > >
> > > > Currently with LVM, although there is a slight performance hit,
> > > > the block size in LVM roughly matches the average IO size
> > > > (128-512KB) and so the CoW process doesn't cause much overhead.
> > > > When I did some quick fio tests with Ceph, it seemed to have a
> > > > much greater knock-on effect when using 4MB-object RBDs.
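> > > >
> > > > For anyone wanting to reproduce it, the worst case shows up with
> > > > small random writes against a freshly snapshotted image, along
> > > > these lines (pool/image names are made up; the rbd ioengine
> > > > options are per the fio docs):
> > > >
> > > >     rbd snap create rbd/test@before-writes
> > > >     fio --name=snap-cow --ioengine=rbd --clientname=admin \
> > > >         --pool=rbd --rbdname=test --rw=randwrite --bs=4k \
> > > >         --iodepth=32 --direct=1 --runtime=60 --time_based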
> > > >
> > > > We can probably work around this by having a cluster with more
> > > > disks, or by reducing the RBD object size, but I thought it was
> > > > worth asking in case there was any other way round it.
> > > >
> > > > Nick
> > > >
> > > > > There's definitely nothing integrated into the Ceph codebase
> > > > > about internal layering, or a way to redirect snapshots
> > > > > outside of the OSD, though you could always experiment with
> > > > > flashcache et al.
> > > > > -Greg
> > >
> > > --
> > > Jason Dillaman
>
> [1] http://docs.ceph.com/docs/master/rbd/rbd-mirroring/
>
> --
> Jason Dillaman
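
From my reading of [1], the setup would look roughly like the below.
This is only a sketch of my understanding, not something I've run yet;
pool, image, and cluster names are invented and the exact flags may
differ once Jewel is released:

    # enable per-image mirroring on the pool (run on both clusters)
    rbd mirror pool enable backups image
    # journaling requires exclusive-lock; the journal data can
    # apparently be directed to a separate fast pool
    rbd feature enable backups/customer1 exclusive-lock journaling \
        --journal-pool fastssd
    # register the remote cluster as a peer, then run the rbd-mirror
    # daemon on the DR site to replay the journal
    rbd mirror pool peer add backups client.admin@site-a

The appeal would be that the constant small writes land in the fast
journal pool rather than the bulk pool, though as above it doesn't
cover our DR-test-on-uploaded-data case.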