On Mon, Apr 15, 2019 at 5:28 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Sat, 13 Apr 2019, zengran zhang wrote:
> > Hi developers:
> >
> > We regularly see slow requests on list-omap operations; some even
> > cause the op thread to suicide. Recently we introduced the rocksdb
> > delete-range API, but found the cluster complains about list-omap
> > timeouts even more easily. I checked the rocksdb of one osd, and I
> > think I found the reason.
> >
> > When listing omap, the omap iterator uses '~' as the tail. Once the
> > iterator has moved to the last key of the omap we want, we call an
> > extra next(); usually this lands on another object's omap header
> > (with '-'). *IF* there are deleted keys or tombstones in between,
> > rocksdb falls into the loop in `FindNextUserEntryInternal` until it
> > finds a valid key, so it traverses all the dead keys in the middle
> > and reads the sst files heavily.
>
> If I'm understanding the problem correctly, it's that object foo has a
> bunch of omap items, but the object(s) that came after foo are all
> deleted, so that when we enumerate foo's omap, we have to skip over
> all the deleted objects' sst files to reach the head for the next
> non-deleted object, right?
>
> > I think there may be 3 approaches:
> > 1) Change the omap header from '-' to '~', and let it play the role
> >    of the end key when iterating.
> > 2) Force an omap end key (using '~') to be added on the metadata pool.
>
> This seems like the way to fix it. We just need to make sure we create
> the end record in a backward-compatible way. Two options:
>
> 1) We create the tail when we first set FLAG_OMAP. This will leave
> the non-optimal behavior in place for existing objects. Or,
> 2) We add a new FLAG_OMAP_TAIL and create the tail if we don't have one
> yet.
>
> Either way, we have to make sure that on deletion, we clean up the tail
> keys...

I'm not totally sure this will help. We'll still be deleting all the
intermediate key ranges and need to go through them for a lookup, right?
So it helps if we have a bunch of deleted objects, but I suspect the
common case is just that we removed a single bucket index or something,
and iterating through those levels takes a while?

> > 3) When iterating, find the rbegin of the omap keys first; then we
> >    have the end key and can avoid the extra next().
>
> sage
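
For reference, here is a minimal standalone sketch of the general idea behind
approaches 2) and 3): give the RocksDB iterator an explicit end bound so it
never has to step past the object's last key and grind through tombstones in
FindNextUserEntryInternal. This is plain RocksDB, not Ceph's KeyValueDB
wrapper, and the key layout (object_omap_prefix, the '~' tail) is hypothetical,
just to illustrate the bounded-iteration pattern.

```cpp
// Minimal sketch, not actual Ceph/BlueStore code. Assumes a hypothetical
// omap layout where every key of an object shares a common prefix and a
// '~' suffix sorts after all real omap keys of that object.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical helper: all omap keys of one object share this prefix.
static std::string object_omap_prefix(const std::string& oid) {
  return oid + ".";
}

void list_omap(rocksdb::DB* db, const std::string& oid) {
  const std::string start = object_omap_prefix(oid);
  // Explicit end key: prefix + '~'. With iterate_upper_bound set, the
  // iterator becomes invalid at this bound instead of calling Next()
  // past the last real key and scanning tombstones of deleted objects
  // while looking for the next live key.
  const std::string end = object_omap_prefix(oid) + "~";
  rocksdb::Slice upper(end);

  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &upper;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(start); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << " = "
              << it->value().ToString() << "\n";
  }
}
```

The same effect can be had without iterate_upper_bound by comparing each key
against the precomputed end key and stopping early, which is closer to what
option 3) describes; either way the point is that the caller knows the end of
the range up front and never asks RocksDB for the "next" key beyond it.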