On Fri, Aug 10, 2018 at 8:29 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
On Fri, 10 Aug 2018, Paweł Sadowski wrote:
> On 08/09/2018 04:39 PM, Alex Elder wrote:
> > On 08/09/2018 08:15 AM, Sage Weil wrote:
> >> On Thu, 9 Aug 2018, Piotr Dałek wrote:
> >>> Hello,
> >>>
> >>> At OVH we're heavily utilizing snapshots for our backup system. We think
> >>> there's an interesting optimization opportunity regarding snapshots I'd like
> >>> to discuss here.
> >>>
> >>> The idea is to introduce a concept of "lightweight" snapshots - such a
> >>> snapshot would not contain data, only information about what has changed
> >>> on the image since it was created (so basically only the object map part
> >>> of a snapshot).
> >>>
> >>> Our backup solution (which seems to be a pretty common practice) is as
> >>> follows:
> >>>
> >>> 1. Create snapshot of the image we want to backup
> >>> 2. If there's a previous backup snapshot, export diff and apply it on the
> >>> backup image
> >>> 3. If there's no older snapshot, just do a full backup of image
> >>>
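(In rbd CLI terms, the workflow above looks roughly like the sketch below; pool,
image and snapshot names are placeholders.)

  # first backup: full export/import, then record the reference snapshot on the copy
  rbd snap create mypool/myimage@backup1
  rbd export mypool/myimage@backup1 - | rbd import - backuppool/myimage
  rbd snap create backuppool/myimage@backup1
  # subsequent backups: export only what changed since the previous backup snapshot
  # (import-diff also creates backup2 on the copy, ready for the next cycle)
  rbd snap create mypool/myimage@backup2
  rbd export-diff --from-snap backup1 mypool/myimage@backup2 - | rbd import-diff - backuppool/myimage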
> >>> This introduces one big issue: it enforces a CoW snapshot on the image, meaning
> >>> that original image access latency and consumed space increase. "Lightweight"
> >>> snapshots would remove these inefficiencies - no CoW performance or storage
> >>> overhead.
> >>
> >> The snapshot in 1 would be lightweight you mean? And you'd do the backup
> >> some (short) time later based on a diff with changed extents?
> >>
> >> I'm pretty sure this will export a garbage image. I mean, it will usually
> >> be non-garbage, but the result won't be crash consistent, and in some
> >> (many?) cases won't be usable.
> >>
> >> Consider:
> >>
> >> - take reference snapshot
> >> - back up this image (assume for now it is perfect)
> >> - write A to location 1
> >> - take lightweight snapshot
> >> - write B to location 1
> >> - backup process copies location 1 (B) to target
>
> The way I (we) see it working is a bit different:
> - take snapshot (1)
> - data writes might occur; that's ok - CoW kicks in here to preserve the data
> - export data
> - convert snapshot (1) to a lightweight one (not create a new one):
>  * from now on just remember which blocks have been modified instead
> of doing CoW
>  * you can get rid of the previously CoW'd data blocks (they've been
> exported already)
> - more writes
> - take snapshot (2)
> - export diff - only blocks modified since snap (1)
> - convert snapshot (2) to a lightweight one
> - ...
>
>
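(Purely for illustration, the proposed per-cycle flow might look like the sketch
below. Note that "rbd snap convert-lightweight" is a hypothetical subcommand, no
such command exists today, and the other names are placeholders.)

  rbd snap create mypool/myimage@backup2
  rbd export-diff --from-snap backup1 mypool/myimage@backup2 - | rbd import-diff - backuppool/myimage
  # hypothetical: drop the CoW'd data blocks, keep only the modified-block tracking
  rbd snap convert-lightweight mypool/myimage@backup2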
> That way I don't see a place for data corruption. Of course this has
> some drawbacks - you can't rollback/export data from such a lightweight
> snapshot anymore. But on the other hand we are reducing the need for CoW -
> and that's the main goal of this idea. Instead of doing CoW ~all the
> time, it's needed only while exporting the image/modified blocks.
Ok, so this is a bit different. I'm still a bit fuzzy on how the
'lightweight' (1) snapshot will be implemented, but basically I think
you just mean saving on its storage overhead, while keeping enough metadata
to make a fully consistent snapshot (2) for the purposes of the backup.
Maybe Jason has a better idea for how this would work in practice? I
haven't thought about the RBD snapshots in a while (not above the rados
layer at least).
The 'fast-diff' object map already tracks which objects have been updated since a snapshot was taken. So I think such an approach would just require deleting the RADOS self-managed snapshot when converting to "lightweight" mode, and then using the existing "--whole-object" option of "rbd export-diff" so that deltas are calculated from the 'fast-diff' object map instead of from RADOS snap diffs.
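(For concreteness, once a snapshot has been converted the delta step would look
something like the line below; names are placeholders.)

  # whole-object granularity, computed from the fast-diff object map rather than RADOS snap diffs
  rbd export-diff --whole-object --from-snap backup1 mypool/myimage@backup2 - | rbd import-diff - backuppool/myimage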
If you don't mind getting your hands dirty writing a little Python code to invoke "remove_self_managed_snap" using the snap id provided by "rbd snap ls", you should be able to test it out now. If it were to be incorporated into RBD core, I think it would need some sanity checks to ensure it relies on 'fast-diff' when handling a lightweight snapshot. However, I would also be interested to know if bluestore alleviates a lot of your latency concerns given that it attempts to redirect-on-write by updating metadata instead of copy-on-write.
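A rough, untested sketch of that experiment in Python, assuming the rados bindings'
Ioctx.remove_self_managed_snap; the pool name and snap id below are placeholders
(take the id from "rbd snap ls"):

  import rados

  POOL = 'rbd'     # pool that holds the image (placeholder)
  SNAP_ID = 123    # numeric id column from "rbd snap ls <image>" (placeholder)

  # Deleting the RADOS self-managed snapshot behind an RBD snapshot leaves only
  # the RBD snapshot metadata and fast-diff tracking, i.e. the "lightweight"
  # mode discussed above. The snapshot data is gone afterwards, so no rollback.
  cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
  cluster.connect()
  try:
      ioctx = cluster.open_ioctx(POOL)
      try:
          ioctx.remove_self_managed_snap(SNAP_ID)
      finally:
          ioctx.close()
  finally:
      cluster.shutdown()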
> >> That's the wrong data. Maybe that change is harmless, but maybe location
> >> 1 belongs to the filesystem journal, and you have some records that now
> >> reference location 10, which has an A-era value or hasn't been written at
> >> all yet, and now your file system journal won't replay and you can't
> >> mount...
> >
> > Forgive me if I'm misunderstanding; this just caught my attention.
> >
> > The goal here seems to be to reduce the storage needed to do backups of an
> > RBD image, and I think there's something to that.
>
> Storage reduction is only a side effect here. We want to get rid of CoW as
> much as possible. For example, we do a snapshot every 24h - this
> means that every 24h we start doing CoW from scratch on every
> image. This has a big impact on cluster latency.
>
> As for storage needs, with a 24h backup period we see space usage
> increase by about 5% on our clusters. But this clearly depends on client
> traffic.
One thing to keep in mind here is that the CoW/clone overhead goes *way*
down with BlueStore. On FileStore we are literally blocking to make
a copy of each 4MB object. With BlueStore there is a bit of metadata
overhead for the tracking but it is doing CoW at the lowest layer.
Lightweight snapshots might be a big win for FileStore but that advantage
will mostly evaporate once you repave the OSDs.
sage
> > This seems to be no different from any other incremental backup scheme. It's
> > layered, and it's ultimately based on an "epoch" complete backup image (what
> > you call the reference snapshot).
> >
> > If you're using that model, it would be useful to be able to back up only
> > the data present in a second snapshot that's the child of the reference
> > snapshot. (And so on, with snapshot 2 building on snapshot 1, etc.)
> > RBD internally *knows* this information, but I'm not sure how (or whether)
> > it's formally exposed.
> >
> > Restoring an image in this scheme requires restoring the epoch, then the
> > incrementals, in order. The cost to restore is higher, but the cost
> > of incremental backups is significantly smaller than doing full ones.
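(Restore in that model is the usual rbd import / import-diff chain, roughly as
below; file and image names are placeholders, and the reference snapshot has to
be recreated on the restored image before the first diff can be applied.)

  rbd import epoch.img mypool/restored
  rbd snap create mypool/restored@backup1      # snapshot the first diff starts from
  rbd import-diff backup1-to-backup2.diff mypool/restored
  rbd import-diff backup2-to-backup3.diff mypool/restored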
>
> It depends on how we store the exported data. We might just want to merge
> all diffs into the base image right after export, to keep only a single copy.
> But that is out of scope of the main topic here, IMHO.
>
> > I'm not sure how the "lightweight" snapshot would work though. Without
> > references to objects there's no guarantee the data taken at the time of
> > the snapshot still exists when you want to back it up.
> >
> > -Alex
> >
> >>
> >> sage
> >>
> >>> At first glance, it seems like it could be implemented as an extension to the
> >>> current RBD snapshot system, leaving out the machinery required for copy-on-write.
> >>> In theory it could even co-exist with regular snapshots. Removal of these
> >>> "lightweight" snapshots would be instant (or near instant).
> >>>
> >>> So what do others think about this?
> >>>
> >>> --
> >>> Piotr Dałek
> >>> piotr.dalek@xxxxxxxxxxxx
> >>> https://www.ovhcloud.com
> >
>
Jason
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com