On Mon, Apr 15, 2019 at 5:28 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> On Sat, 13 Apr 2019, zengran zhang wrote:
> > Hi developers:
> >
> > We regularly see slow requests on list-omap operations; some even
> > cause the op thread to suicide. Recently we introduced the rocksdb
> > delete-range API, but found the cluster complains about list-omap
> > timeouts even more easily. I checked the rocksdb of one osd, and I
> > think I found the reason.
> >
> > When listing omap, the omap iterator uses '~' as the tail. Once the
> > iterator has moved to the last key of the omap we want, we call an
> > extra next(); usually this lands on another object's omap header
> > (with '-'). *IF* there are deleted keys or tombstones in between,
> > rocksdb falls into the loop in `FindNextUserEntryInternal` until it
> > finds a valid key, so it traverses all the dead keys in the middle
> > and reads the sst files heavily.
>
> If I'm understanding the problem correctly, it's that object foo has a
> bunch of omap items, but the object(s) that came after foo are all
> deleted, so that when we enumerate foo's omap, we have to skip over
> all the deleted objects' sst files to reach the head for the next
> non-deleted object, right?
>
> > I think there may be 3 approaches:
> > 1) Change the omap header from '-' to '~', and let it play the role
> >    of the end key when iterating.
> > 2) Force an omap end key (using '~') to be added on the metadata pool.
>
> This seems like the way to fix it. We just need to make sure we create
> the end record in a backward-compatible way. Two options:
>
> 1) We create the tail when we first set FLAG_OMAP. This will leave
> the non-optimal behavior in place for existing objects. Or,
> 2) We add a new FLAG_OMAP_TAIL and create the tail if we don't have one
> yet.
>
> Either way, we have to make sure that on deletion, we clean up the tail
> keys...

I'm not totally sure this will help. We'll still be deleting all the
intermediate key ranges and need to go through them for a lookup, right?
So it helps if we have a bunch of deleted objects, but I suspect the
common case is just that we removed a single bucket index or something,
and iterating through those levels takes a while?

> > 3) When iterating, find the rbegin of the omap keys first; then we
> >    have the end key and can avoid the extra next().
>
> sage
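
For reference, here is a minimal standalone sketch of the general idea behind
approaches 2) and 3): give the RocksDB iterator an explicit end bound so it
never has to step past the object's last key and grind through tombstones in
FindNextUserEntryInternal. This is plain RocksDB, not Ceph's KeyValueDB
wrapper, and the key layout (object_omap_prefix, the '~' tail) is hypothetical,
just to illustrate the bounded-iteration pattern.

```cpp
// Minimal sketch, not actual Ceph/BlueStore code. Assumes a hypothetical
// omap layout where every key of an object shares a common prefix and a
// '~' suffix sorts after all real omap keys of that object.
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/slice.h>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical helper: all omap keys of one object share this prefix.
static std::string object_omap_prefix(const std::string& oid) {
  return oid + ".";
}

void list_omap(rocksdb::DB* db, const std::string& oid) {
  const std::string start = object_omap_prefix(oid);
  // Explicit end key: prefix + '~'. With iterate_upper_bound set, the
  // iterator becomes invalid at this bound instead of calling Next()
  // past the last real key and scanning tombstones of deleted objects
  // while looking for the next live key.
  const std::string end = object_omap_prefix(oid) + "~";
  rocksdb::Slice upper(end);

  rocksdb::ReadOptions ro;
  ro.iterate_upper_bound = &upper;

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek(start); it->Valid(); it->Next()) {
    std::cout << it->key().ToString() << " = "
              << it->value().ToString() << "\n";
  }
}
```

The same effect can be had without iterate_upper_bound by comparing each key
against the precomputed end key and stopping early, which is closer to what
option 3) describes; either way the point is that the caller knows the end of
the range up front and never asks RocksDB for the "next" key beyond it.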