Manifest object (at the base pool) --> chunked object (at the chunk pool).

I'd prefer to maintain the manifest object by adding metadata. The
manifest object at the base pool tracks the chunks it has. Unlike the
existing cache tier implementation, the manifest object is not flushed,
so if we delete the manifest object there is no way to recover it (it
also holds the chunk info and the refs needed to reassemble the
original object).

To maintain a manifest object that is snapshotted, as Greg's comment
suggests, we might need something like an additional reference count
for clones and the manifest object. I will investigate this in more
detail to find a reasonable way to examine whether a ref is truly
gone, and then open a new PR to discuss it.

Anyway, I am currently working on misc fixes and adding stress tests
for the dedup tier. After finishing that work, I will start on this.

Myoungwon

On Sat, Jun 29, 2019 at 7:12 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Fri, Jun 28, 2019 at 7:50 AM Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > Hi Myoungwon,
> >
> > I was thinking about how a refcounted cas pool would interact with
> > snapshots and it occurred to me that dropping refs when an object is
> > deleted may break snapshotted versions of that object. If object A has
> > a ref to chunk X, is snapshotted, then A is deleted, we'll (currently)
> > drop the ref to X and remove it. That means that A can't be read.
> >
> > One way to get around that would be to mirror snaps from the source pool
> > to the chunk pool--this is how cache tiering works. The problem I see
> > there is that I'd hoped to allow multiple pools to share/consume the same
> > chunk pool, but each pool has its own snapid namespace.
> >
> > Another would be to bake the refs more deeply into the source rados pool
> > so that the refs are only dropped after all clones also drop the ref.
> > That is harder to track, though, since I think you'd need to examine all
> > of the clones to know whether the ref is truly gone. Unless we embed
> > even more metadata in the SnapSet--something analogous to clone_overlap to
> > identify the chunks. That seems like it will bloat that structure,
> > though.
> >
> > Other ideas?
>
> Is there much design work around refcounting and snapshots yet?
>
> I haven't thought it through much, but one possibility is that each
> on-disk clone counts as its own reference, and on a write to the
> manifest object you increment the reference to all the chunks in
> common. When snaptrimming finally removes a clone, it has to decrement
> all the chunk references contained in the manifest.
>
> I don't love this for the extra trimming work and remote reference
> updates, but it's one way to keep the complexity of the data
> structures down.
>
> Other options:
> * Force a 1:1 mapping. Not sure how good or bad this is since I haven't
> seen a lot of CAS pool discussion.
> * No longer give each pool its own snapshot namespace. Not sure this
> was a great design decision to begin with; it would require updating
> CephFS snap allocation, but I don't think anything else outside the
> monitors.
> * Disallow snapshots on manifest-based objects/pools. What are the
> target workloads for these?
> -Greg
>
> >
> > sage

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
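
To make the per-clone reference counting that Greg describes above more
concrete, here is a minimal C++ sketch. All of the types and functions
in it (ChunkPool, ManifestObject, make_clone, trim_clone, and so on) are
hypothetical and exist only for illustration; they are not Ceph classes,
and the sketch collapses RADOS's copy-on-first-write clone creation into
an explicit step.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Chunk pool: chunk id -> reference count.
struct ChunkPool {
  std::map<std::string, int> refs;

  void get(const std::string& chunk) { ++refs[chunk]; }

  void put(const std::string& chunk) {
    if (--refs[chunk] == 0) {
      refs.erase(chunk);  // last reference gone: the chunk can be reclaimed
      std::cout << "chunk " << chunk << " removed\n";
    }
  }
};

// A manifest maps logical offsets to chunk ids stored in the chunk pool.
using Manifest = std::map<uint64_t, std::string>;

// A manifest-based object: one head plus zero or more snapshotted clones.
struct ManifestObject {
  Manifest head;
  std::vector<Manifest> clones;

  // Preserving the head as a clone: the clone takes its own reference
  // on every chunk it maps (in RADOS this would happen at the first
  // write after the snapshot, not at snapshot time).
  void make_clone(ChunkPool& pool) {
    for (auto& kv : head) pool.get(kv.second);
    clones.push_back(head);
  }

  // Deleting the head drops only the head's references; clones keep theirs.
  void delete_head(ChunkPool& pool) {
    for (auto& kv : head) pool.put(kv.second);
    head.clear();
  }

  // Snaptrim removing clone i drops that clone's references.
  void trim_clone(ChunkPool& pool, size_t i) {
    for (auto& kv : clones[i]) pool.put(kv.second);
    clones.erase(clones.begin() + i);
  }
};

int main() {
  ChunkPool pool;
  ManifestObject a;

  // Object A maps offset 0 to chunk X and holds a ref for its head.
  a.head[0] = "X";
  pool.get("X");

  a.make_clone(pool);    // snapshot + write: clone also refs X (count = 2)
  a.delete_head(pool);   // delete A's head: count drops to 1, X survives
  std::cout << "refs on X after head delete: " << pool.refs["X"] << "\n";

  a.trim_clone(pool, 0); // snaptrim removes the clone: X is reclaimed
}

Run as-is, this plays out the A/X scenario from Sage's mail, except that
the snapshotted clone keeps chunk X alive until snaptrim drops the last
reference, which is the behavior Greg's option is meant to guarantee.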