cc devel list

On Tue, Apr 16, 2019 at 9:38 AM zengran zhang <z13121369189@xxxxxxxxx> wrote:
>
> On Tue, Apr 16, 2019 at 5:59 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Mon, Apr 15, 2019 at 5:28 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >
> > > On Sat, 13 Apr 2019, zengran zhang wrote:
> > > > Hi developers:
> > > >
> > > > We often see slow requests on omap listing; some even cause the
> > > > op thread to suicide. Recently we introduced the RocksDB
> > > > DeleteRange API, but found that the cluster complains about omap
> > > > listing timeouts much more easily. I checked the RocksDB instance
> > > > of one OSD, and I think I found the reason.
> > > >
> > > > When listing omap, the omap iterator uses '~' as the tail. Once
> > > > the iterator has reached the last key of the omap range we want,
> > > > we call an extra next(); usually this lands on another object's
> > > > omap header (marked with '-'). *IF* there are deleted keys or
> > > > tombstones in between, RocksDB falls into the loop in
> > > > `FindNextUserEntryInternal` until it finds a valid key, so it
> > > > traverses all the dead keys in the middle and reads the SST
> > > > files heavily.
> > >
> > > If I'm understanding the problem correctly, it's that object foo
> > > has a bunch of omap items, but the object(s) that came after foo
> > > are all deleted, so that when we enumerate foo's omap, we have to
> > > skip over all the deleted objects' SST files to reach the header
> > > for the next non-deleted object, right?
>
> Right; forgive me for not being good at English :(
>
> > > > I think there may be 3 approaches:
> > > > 1) Change the omap header from '-' to '~', so it plays the role
> > > >    of the end marker when iterating.
> > > > 2) Force an omap end key (using '~') to be added in the metadata
> > > >    pool.
> > >
> > > This seems like the way to fix it. We just need to make sure we
> > > create the end record in a backward-compatible way. Two options:
> > >
> > > 1) We create the tail when we first set FLAG_OMAP. This will leave
> > >    the non-optimal behavior in place for existing objects. Or,
> > > 2) We add a new FLAG_OMAP_TAIL and create the tail if we don't
> > >    have one yet.
> > >
> > > Either way, we have to make sure that on deletion, we clean up the
> > > tail keys...
> >
> > I'm not totally sure this will help. We'll still be deleting all the
> > intermediate key ranges and need to go through them for a lookup,
> > right? So it helps if we have a bunch of deleted objects, but I
> > suspect the common case is just that we removed a single bucket
> > index or something, and iterating through those levels takes a
> > while?
>
> Yeah, I agree this will not help the common case, so let's discuss
> whether the approach above is really necessary. I think the worst
> case is not delete operations inside a bucket but recovery of an
> index/dir object, because that has to clear all of its omap; the
> second-worst case may be the deletion of a whole bucket. I mean that
> a bunch of deleted objects may well be the case we need to consider:
> object/file deletes inside a bucket/dir usually happen at low
> frequency, even lower than RocksDB compaction.
>
> > > > 3) When iterating, find the rbegin of the omap keys; then we
> > > >    have the end key and can avoid the extra next().
> > >
> > > sage
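
For concreteness, a minimal sketch of the tail-key idea (option 2
above). This is not the actual Ceph patch: the flat key layout
("<oid>-" header, "<oid>." key prefix, "<oid>~" tail) and the helper
names write_omap_tail/clear_omap are hypothetical stand-ins for
BlueStore's real omap key encoding, written directly against RocksDB:

  // Hypothetical sketch of the omap tail key (not Ceph's real encoding).
  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>
  #include <string>

  // Written once, when FLAG_OMAP (or a new FLAG_OMAP_TAIL) is first
  // set: a sentinel that sorts after every omap key of <oid>, so an
  // iterator walking <oid>'s omap always terminates on a live key of
  // the same object instead of scanning into whatever tombstones
  // follow it.
  void write_omap_tail(rocksdb::WriteBatch& batch, const std::string& oid) {
    batch.Put(oid + "~", "");  // empty value; only its presence matters
  }

  // On object deletion the tail must be cleaned up together with the
  // real omap keys, or dead tails accumulate in the metadata pool.
  void clear_omap(rocksdb::WriteBatch& batch, const std::string& oid) {
    batch.DeleteRange(oid + "-", oid + "~");  // end key is exclusive...
    batch.Delete(oid + "~");                  // ...so drop the tail too
  }

With the tail in place, the extra next() at the end of a listing lands
on the object's own tail key, which is live, so the
FindNextUserEntryInternal loop returns immediately.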
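
Approach 3 amounts to handing the iterator an explicit end key instead
of probing past the last entry. RocksDB can enforce that on the read
side via ReadOptions::iterate_upper_bound, which invalidates the
iterator at the bound rather than skipping tombstones beyond it. A
standalone sketch, again using the simplified flat key layout rather
than BlueStore's real one:

  #include <rocksdb/db.h>
  #include <rocksdb/slice.h>
  #include <iostream>
  #include <memory>
  #include <string>

  int main() {
    rocksdb::DB* raw = nullptr;
    rocksdb::Options opts;
    opts.create_if_missing = true;
    rocksdb::Status s =
        rocksdb::DB::Open(opts, "/tmp/omap-bound-demo", &raw);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }
    std::unique_ptr<rocksdb::DB> db(raw);

    // foo's omap keys sort below "foo~"; keys of later (now deleted)
    // objects would sort after it.
    db->Put(rocksdb::WriteOptions(), "foo.k1", "v1");
    db->Put(rocksdb::WriteOptions(), "foo.k2", "v2");

    std::string bound = "foo~";
    rocksdb::Slice upper(bound);
    rocksdb::ReadOptions ro;
    ro.iterate_upper_bound = &upper;  // hard stop for the iterator

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
    for (it->Seek("foo."); it->Valid(); it->Next()) {
      std::cout << it->key().ToString() << " = "
                << it->value().ToString() << "\n";
    }
    // The final Next() simply invalidates the iterator at the bound;
    // it never enters the tombstone-skipping loop for keys past "foo~".
    return 0;
  }

This trades an on-disk format change for a purely read-side fix,
though it requires every omap listing call site to know the object's
end key up front.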