On Fri, 10 Jan 2020 at 02:00, Sam Just <sjust@xxxxxxxxxx> wrote:
>
> On Wed, Jan 8, 2020 at 7:35 PM Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> >
> > > I added a few comments. My high level perspective is that it looks
> > > like an approach for dealing with multiversioned extents, which might
> > > be a component of rados pool level point-in-time globally consistent
> > > snapshots for purposes like rados pool level cross-cluster
> > > replication. However, that sort of thing would require a great deal
> > > of higher level support, so I'd consider the disk layout portion to be
> > > out of scope for now. Is there another use case you are hoping to
> > > address with this?
> >
> > Hi, Sam. Thanks for reviewing the doc :-)
> >
> > The main focus of this initiative is doing efficient
> > replication/backup. Specifically, we intended to provide higher level
> > modules, especially rbd and cephfs, with the ability to do very high
> > rate snapshots (like one snapshot every 5 seconds, or even multiple
> > snapshots within a second) and efficient snapshot diff and
> > export-diff.
>
> I didn't mention it explicitly, but the refcounts in the seastore doc's
> lba tree are intended to permit extent sharing to support the existing
> snapshot machinery via clone().
>
> > We thought that, with this ability, upper level applications could
> > achieve near real-time replication comparable to the common op-by-op
> > replication, but with less overhead, because it doesn't involve any
> > extra replication-dedicated journal operations. And since multiple
> > write operations' target extents may overlap with each other, even if
> > op-by-op replication can avoid extra journal operations, it inevitably
> > replicates overlapped extents multiple times, while, in snapshot diff
> > export, only the latest version of the overlapped extents needs to be
> > replicated.
>
> It's not clear to me how this versioning scheme changes journaling or
> pg logging. For recovery, we already track overlapping extents
> between versions and use cloning appropriately. Can you expand on
> this portion?

Oh, sorry that I caused this misunderstanding. By op-by-op replication,
I meant mechanisms that, to do cross-cluster replication, perform an
extra journaling operation for every rbd image write operation, like
rbd mirroring. The upper layer applications do this because RADOS
doesn't provide cross-cluster replication, which leaves them no other
choice. And even if we implemented op-by-op cross-cluster replication
within RADOS, there would still be some drawbacks: multiple write
operations' target extents can overlap with each other, and op-by-op
replication will replicate every one of those write ops, many of which
may already be outdated, while lightweight-snapshot-based cross-cluster
replication wouldn't have these overheads.

> >
> > We thought maybe we could let upper layer applications choose whether
> > to replicate their data, instead of doing the replication forcibly at
> > the whole rados pool scale.
>
> The existing self-managed snapshot scheme already gives rbd image
> granularity snapshots and cephfs recursive, subtree granularity
> snapshots. The difference is that the versioning lives in the
> hobject_t tuple -- each version is a different object with shared
> extents.
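
For concreteness, my understanding of the scheme you're describing --
versioning in the hobject_t tuple, each version a separate object
sharing extents -- is roughly the following (a much-simplified sketch,
not the actual Ceph definitions; the real hobject_t and SnapSet carry
more fields and track overlaps with interval_set rather than a plain
offset/length map):

// Much-simplified sketch of the versioning-in-the-object-name scheme
// (illustrative only).

#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

using snapid_t = uint64_t;
// A sentinel snap id marks the writable "head" version (CEPH_NOSNAP in Ceph).
constexpr snapid_t SNAP_HEAD = ~0ull;

// Each version of a logical object is a distinct on-disk object, keyed by
// the (name, pool, snap) tuple -- clones differ from head only in `snap`.
struct object_key {
  std::string name;
  int64_t pool;
  snapid_t snap;
  bool operator<(const object_key& o) const {
    return std::tie(name, pool, snap) < std::tie(o.name, o.pool, o.snap);
  }
};

// Per-object snapshot metadata kept with the head: which clones exist, and
// which byte ranges of each clone are still shared with the newer version.
// On the first write to head after a snapshot, the OSD clones head to
// (name, pool, snap); later writes to head shrink clone_overlap[snap], so
// recovery and snapshot diff know which extents are no longer shared.
struct snap_metadata {
  std::vector<snapid_t> clones;                                   // clone ids, oldest first
  std::map<snapid_t, std::map<uint64_t, uint64_t>> clone_overlap; // snap -> {offset: length}
};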
Yes, but, if I'm understanding correctly, in the current snapshot
mechanism we have to create clone objects, copy the attrs and omaps of
the cloned objects, store them on disk, and calculate and record
clone_overlaps for every write to a snapshotted object. And in the
upper layer applications, snapshot-creating clients need to request a
snapshot id from the MONs and cooperate with other clients that are
writing to the same set of objects to finish the snapshot creation.
All of this may have a non-negligible impact on the performance of
normal read/write operations when doing high rate snapshots.

On the other hand, HLC gives a system a way to query any past system
state by physical clock time, as long as those states are retained,
and it promises point-in-time consistency within the scope of the
system, just as a Lamport clock does. So if write ops are tagged with
an HLC timestamp and recorded somewhere (in the journal, for example),
they already are snapshots, and all we need to do when creating a
snapshot is remember which snapshot is needed and tell the OSDs not to
clean up that state, right? So, for upper layer applications to take
snapshots, they just need to record the time at which they want the
snapshot and tell the OSDs not to clean up the corresponding write
ops. There's no need to cooperate with other clients through some kind
of mutual locking, or to request snapshot ids.

With an R-tree, we can do a range search to easily read the data as of
any point in time out of the journal, or calculate a snapshot diff, so
there's no need to do the object clone work (a rough sketch of what I
mean is appended at the end of this mail). So, generally speaking,
with HLC and an R-tree, the system doesn't need to do anything that
affects normal R/W performance when doing snapshot-related work, which
makes the snapshots really lightweight. And as a side effect, since we
can read data out of the journal with the help of the R-tree, there's
no need for the OSDs to flush dirty data blocks to the underlying
disk, because that data is already recorded in the journal, which I
think could simplify the design of the OSDs.

> >
> > Whether this approach can really achieve that goal and whether to do
> > it is to be discussed, as we also realised that it may not be
> > cost-effective with respect to the amount of development work :-)
>
> I guess I'm not sure what this approach gets us that the existing
> cloning scheme does not. The main problem with high snapshot rates
> currently isn't that the ondisk structure doesn't support it, but
> rather that snapshot stamps are mediated through the monitor. There
> are reasons for doing it that way -- an rbd client need only issue a
> single monitor command to get rid of a snapshot, and all involved osds
> will remove the now unnecessary clones asynchronously without
> requiring the client to track them down. Similarly, the mds needn't
> find every clone within a subtree -- a potentially expensive
> operation.
>
> I think what I'm missing is how this structure fits into some higher
> level snapshot scheme you are proposing.

Um... actually I haven't thought much about what the higher level
snapshot mechanism should look like to keep the advantages of both the
current snapshot mechanism and the OSD structure I'm proposing,
because I assumed the higher level mechanism would be relatively easy
once we have a clear in-OSD snapshot mechanism, which has obviously
been proved wrong now.... I think I can take some time to try to
figure out such a higher level snapshot mechanism :-)

Thanks.

> -Sam
>
> >
> > Thanks.
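
To make the HLC + R-tree idea above a bit more concrete, here is the
rough sketch I mentioned. All names are illustrative, the HLC tick and
merge rules follow Kulkarni et al.'s hybrid logical clocks, and a flat
scan of the journal stands in for the R-tree range query a real
implementation would use:

#include <algorithm>
#include <cstdint>
#include <vector>

// Hybrid logical clock: physical component plus a logical counter to break
// ties among events within the same wall-clock tick. Comparison is
// lexicographic, so HLC order respects causality.
struct hlc_t {
  uint64_t wall;     // physical clock reading (e.g. microseconds)
  uint32_t logical;  // logical counter within the same wall tick
  bool operator<(const hlc_t& o) const {
    return wall != o.wall ? wall < o.wall : logical < o.logical;
  }
  bool operator<=(const hlc_t& o) const { return !(o < *this); }
};

// Advance the local HLC for a local event or message send, given the current
// physical clock reading `now`.
inline hlc_t hlc_tick(hlc_t local, uint64_t now) {
  if (now > local.wall) return {now, 0};
  return {local.wall, local.logical + 1};
}

// Merge a timestamp received from a peer (message receive rule).
inline hlc_t hlc_update(hlc_t local, hlc_t remote, uint64_t now) {
  hlc_t m = std::max({local, remote, hlc_t{now, 0}});
  if (m.wall == local.wall && m.wall == remote.wall)
    m.logical = std::max(local.logical, remote.logical) + 1;
  else if (m.wall == local.wall)
    m.logical = local.logical + 1;
  else if (m.wall == remote.wall)
    m.logical = remote.logical + 1;
  else
    m.logical = 0;
  return m;
}

// One journaled write: which byte range of which object changed, and when.
struct write_record {
  uint64_t object_id;   // stand-in for hobject_t
  uint64_t offset, length;
  hlc_t stamp;
};

// Snapshot diff: extents of `object_id` written after `from` and up to `to`.
// With an R-tree keyed on (object, offset range, stamp) this becomes a range
// query over the index instead of a full scan of the journal.
inline std::vector<write_record>
snapshot_diff(const std::vector<write_record>& journal,
              uint64_t object_id, hlc_t from, hlc_t to) {
  std::vector<write_record> out;
  for (const auto& w : journal)
    if (w.object_id == object_id && from < w.stamp && w.stamp <= to)
      out.push_back(w);
  return out;
}

The point being: creating a snapshot is just retaining an HLC
timestamp, the per-write cost is tagging the journal record, and
computing a diff or reading an old version is a read-only query over
the index.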