Snapshot Costs (Was: Re: Pool Sizes)

gfarnum@xxxxxxxxxx (Gregory Farnum) · Tue, 7 Mar 2017 13:35:03 -0800

On Tue, Mar 7, 2017 at 12:43 PM, Kent Borg <kentborg at borg.org> wrote:
> On 01/04/2017 03:41 PM, Brian Andrus wrote:
>>
>> Think "many objects, few pools". The number of pools does not scale well
>> because of PG limitations. Keep a small number of pools with the proper
>> number of PGs.
>
>
> I finally got it through my head, seems the larger answer is: Not only is it
> okay to have a (properly configured) pool grow to insane numbers of objects,
> the inverse is also true; keep the number of pools not just small, but to a
> very bare minimum. For example, Cephfs, which aspires to scale to crazy
> sizes, only uses two pools. And when Cephfs picks up the ability to offer
> multiple Cephfs file systems in a single cluster...it will probably still
> only be using two pools.
>
>
> Continuing along with my theme of trying to understand Ceph (specifically
> RADOS, if that matters): Snapshots!
>
> What does a snapshot cost? In time? In other resources? When do those costs
> hit? What does it cost to destroy a snapshot? What does it cost to
> accumulate multiple snapshots? What does it cost to alter a snapshotted
> object? (Does that alteration cost hit only once or does it linger?)
> Whatever their costs, what makes them greater and what makes them smaller?
> Is it sensible to make snapshots programmatically? If so, how rapidly?

Creating a snapshot generally involves a round-trip to the monitor,
which requires a new OSDMap epoch (although it can coalesce), i.e., the
monitor paxos commit and processing the new map on all the OSDs/PGs.
Destroying a snapshot involves adding the snapshot ID to an
interval_set in a new OSDMap epoch, and then going through the snap
trimming process (which can be fairly expensive).
If you send a write to a snapshotted object, it is (for
FileStore-on-xfs) copied on write. (FileStore-on-btrfs does
filesystem-level copy-on-write, which is one reason we kept hoping it
would be our stable future...) I think BlueStore also does block-level
copy-on-write. It's a one-time penalty: only the first write after a
snapshot pays for the copy.
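To make the "one-time penalty" concrete, here is a toy Python model of a
single snapshotted object. It is only a sketch of copy-on-write semantics,
not Ceph code: the class SnapshottedObject and all of its methods are
made up for illustration, and it copies the whole object where a real
backend may copy only the affected blocks.

```python
class SnapshottedObject:
    """Toy model of one object with copy-on-write snapshots (not Ceph code)."""

    def __init__(self, data: bytes):
        self.head = bytearray(data)  # live, writable version
        self.clones = {}             # snap id -> frozen contents
        self.pending_snaps = []      # snapshots taken since the last write
        self.copies_made = 0         # how many times the copy penalty was paid

    def snapshot(self, snap_id: int) -> None:
        # Taking a snapshot copies no data; it only records the id.
        self.pending_snaps.append(snap_id)

    def write(self, offset: int, payload: bytes) -> None:
        if self.pending_snaps:
            # First write since the snapshot(s): preserve the old contents once.
            frozen = bytes(self.head)
            self.copies_made += 1
            for snap_id in self.pending_snaps:
                self.clones[snap_id] = frozen
            self.pending_snaps.clear()
        # Later writes land here directly, with no further copying.
        self.head[offset:offset + len(payload)] = payload

    def read(self, snap_id=None) -> bytes:
        if snap_id is None or snap_id in self.pending_snaps:
            return bytes(self.head)  # unmodified since the snapshot
        return self.clones[snap_id]


obj = SnapshottedObject(b"hello")
obj.snapshot(1)
obj.write(0, b"J")  # pays the copy
obj.write(1, b"E")  # does not
print(obj.copies_made, obj.read(1), obj.read())  # 1 b'hello' b'JEllo'
```

Two snapshots taken back to back with no write in between share one
clone in this model, so the copy is still paid only once.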

> For example, one idea in the back of my mind is whether there would be a way
> to use snapshots as a way to kinda fake transactions. I have no idea whether
> that might be clever or an abuse of the feature...

I don't really think so; they're read-only, so it's a linear structure.

>
> I would love it if someone could toss out some examples of the sorts of
> things snapshots are good for and the sorts of things they are terrible for.
> (And some hints as to why, please.)

They're good for CephFS snapshots. They're good for RBD snapshots as
long as you don't take them too frequently. In general if you're using
self-managed snapshots and *can* reuse snapids across objects, that
better mimics their original design goal (CephFS subtree snapshots)
and minimizes the associated costs.
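
As a rough way to see why reuse helps: every snapid costs one creation
round trip, and on deletion one entry in the removed-snaps interval set.
The Python below is a back-of-the-envelope sketch under those
assumptions, not the real interval_set implementation; the function
names are invented for illustration.

```python
def add_to_interval_set(intervals, snap_id):
    """Add snap_id to a sorted list of half-open (start, end) intervals,
    merging any intervals that overlap or touch the new one."""
    new_start, new_end = snap_id, snap_id + 1
    merged = []
    for start, end in intervals:
        if end < new_start or start > new_end:
            merged.append((start, end))      # disjoint: keep as is
        else:
            new_start = min(start, new_start)  # overlapping or adjacent: absorb
            new_end = max(end, new_end)
    merged.append((new_start, new_end))
    merged.sort()
    return merged


def remove_snaps(snap_ids):
    """Return the interval set left behind after deleting these snapids."""
    removed = []
    for snap_id in snap_ids:
        removed = add_to_interval_set(removed, snap_id)
    return removed


def snapids_per_round(num_objects, shared):
    """Snapids needed to snapshot num_objects once: one if shared, else one each."""
    return 1 if shared else num_objects


print(remove_snaps([1, 2, 3, 7]))      # [(1, 4), (7, 8)]
print(snapids_per_round(1000, True))   # 1
print(snapids_per_round(1000, False))  # 1000
```

Contiguous deleted ids collapse into one interval, but each snapid still
has to be created and trimmed, so one shared snapid for 1000 objects is
much cheaper in this model than 1000 separate ones.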

I'll be giving a developer-focused talk on this at Vault (and it looks
like an admin-focused one at the OpenStack Boston Ceph day) which will
involve gathering up the data in one place and presenting it more
accessibly, so keep an eye out for those if you're interested. :)