cc devel list

On Tue, Apr 16, 2019 at 9:38 AM zengran zhang <z13121369189@xxxxxxxxx> wrote:
>
> On Tue, Apr 16, 2019 at 5:59 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >
> > On Mon, Apr 15, 2019 at 5:28 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > >
> > > On Sat, 13 Apr 2019, zengran zhang wrote:
> > > > Hi developers:
> > > >
> > > > We often see slow requests on omap listing; some even cause the
> > > > op thread to suicide. Recently we introduced the RocksDB
> > > > DeleteRange API, but found that the cluster complains about omap
> > > > listing timeouts much more easily. I checked the RocksDB instance
> > > > of one OSD, and I think I found the reason.
> > > >
> > > > When listing omap, the omap iterator uses '~' as the tail. Once
> > > > the iterator has reached the last key of the omap range we want,
> > > > we call an extra next(); usually this lands on another object's
> > > > omap header (marked with '-'). *IF* there are deleted keys or
> > > > tombstones in between, RocksDB falls into the loop in
> > > > `FindNextUserEntryInternal` until it finds a valid key, so it
> > > > traverses all the dead keys in the middle and reads the SST
> > > > files heavily.
> > >
> > > If I'm understanding the problem correctly, it's that object foo
> > > has a bunch of omap items, but the object(s) that came after foo
> > > are all deleted, so that when we enumerate foo's omap, we have to
> > > skip over all the deleted objects' SST files to reach the header
> > > for the next non-deleted object, right?
>
> Right; forgive me for not being good at English :(
>
> > > > I think there may be 3 approaches:
> > > > 1) Change the omap header from '-' to '~', so it plays the role
> > > >    of the end marker when iterating.
> > > > 2) Force an omap end key (using '~') to be added in the metadata
> > > >    pool.
> > >
> > > This seems like the way to fix it. We just need to make sure we
> > > create the end record in a backward-compatible way. Two options:
> > >
> > > 1) We create the tail when we first set FLAG_OMAP. This will leave
> > >    the non-optimal behavior in place for existing objects. Or,
> > > 2) We add a new FLAG_OMAP_TAIL and create the tail if we don't
> > >    have one yet.
> > >
> > > Either way, we have to make sure that on deletion, we clean up the
> > > tail keys...
> >
> > I'm not totally sure this will help. We'll still be deleting all the
> > intermediate key ranges and need to go through them for a lookup,
> > right? So it helps if we have a bunch of deleted objects, but I
> > suspect the common case is just that we removed a single bucket
> > index or something, and iterating through those levels takes a
> > while?
>
> Yeah, I agree this will not help the common case, so let's discuss
> whether the approach above is really necessary. I think the worst
> case is not delete operations inside a bucket but recovery of an
> index/dir object, because that has to clear all of its omap; the
> second-worst case may be the deletion of a whole bucket. I mean that
> a bunch of deleted objects may well be the case we need to consider:
> object/file deletes inside a bucket/dir usually happen at low
> frequency, even lower than RocksDB compaction.
>
> > > > 3) When iterating, find the rbegin of the omap keys; then we
> > > >    have the end key and can avoid the extra next().
> > >
> > > sage
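
For concreteness, a minimal sketch of the tail-key idea (option 2
above). This is not the actual Ceph patch: the flat key layout
("<oid>-" header, "<oid>." key prefix, "<oid>~" tail) and the helper
names write_omap_tail/clear_omap are hypothetical stand-ins for
BlueStore's real omap key encoding, written directly against RocksDB:

  // Hypothetical sketch of the omap tail key (not Ceph's real encoding).
  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>
  #include <string>

  // Written once, when FLAG_OMAP (or a new FLAG_OMAP_TAIL) is first
  // set: a sentinel that sorts after every omap key of <oid>, so an
  // iterator walking <oid>'s omap always terminates on a live key of
  // the same object instead of scanning into whatever tombstones
  // follow it.
  void write_omap_tail(rocksdb::WriteBatch& batch, const std::string& oid) {
    batch.Put(oid + "~", "");  // empty value; only its presence matters
  }

  // On object deletion the tail must be cleaned up together with the
  // real omap keys, or dead tails accumulate in the metadata pool.
  void clear_omap(rocksdb::WriteBatch& batch, const std::string& oid) {
    batch.DeleteRange(oid + "-", oid + "~");  // end key is exclusive...
    batch.Delete(oid + "~");                  // ...so drop the tail too
  }

With the tail in place, the extra next() at the end of a listing lands
on the object's own tail key, which is live, so the
FindNextUserEntryInternal loop returns immediately.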
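
Approach 3 amounts to handing the iterator an explicit end key instead
of probing past the last entry. RocksDB can enforce that on the read
side via ReadOptions::iterate_upper_bound, which invalidates the
iterator at the bound rather than skipping tombstones beyond it. A
standalone sketch, again using the simplified flat key layout rather
than BlueStore's real one:

  #include <rocksdb/db.h>
  #include <rocksdb/slice.h>
  #include <iostream>
  #include <memory>
  #include <string>

  int main() {
    rocksdb::DB* raw = nullptr;
    rocksdb::Options opts;
    opts.create_if_missing = true;
    rocksdb::Status s =
        rocksdb::DB::Open(opts, "/tmp/omap-bound-demo", &raw);
    if (!s.ok()) { std::cerr << s.ToString() << "\n"; return 1; }
    std::unique_ptr<rocksdb::DB> db(raw);

    // foo's omap keys sort below "foo~"; keys of later (now deleted)
    // objects would sort after it.
    db->Put(rocksdb::WriteOptions(), "foo.k1", "v1");
    db->Put(rocksdb::WriteOptions(), "foo.k2", "v2");

    std::string bound = "foo~";
    rocksdb::Slice upper(bound);
    rocksdb::ReadOptions ro;
    ro.iterate_upper_bound = &upper;  // hard stop for the iterator

    std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
    for (it->Seek("foo."); it->Valid(); it->Next()) {
      std::cout << it->key().ToString() << " = "
                << it->value().ToString() << "\n";
    }
    // The final Next() simply invalidates the iterator at the bound;
    // it never enters the tombstone-skipping loop for keys past "foo~".
    return 0;
  }

This trades an on-disk format change for a purely read-side fix,
though it requires every omap listing call site to know the object's
end key up front.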