Hi, ----- "Zhiqiang Wang" <zhiqiang.wang@xxxxxxxxx> wrote: > Hi Matt, > > > -----Original Message----- > > From: Matt W. Benjamin [mailto:matt@xxxxxxxxxxxx] > > Sent: Tuesday, July 21, 2015 10:16 PM > > To: Wang, Zhiqiang > > Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx; Sage Weil > > Subject: Re: The design of the eviction improvement > > > > Hi, > > > > Couple of points. > > > > 1) a successor to 2Q is MQ (Li et al). We have an intrusive MQ LRU > > implementation with 2 levels currently, plus a pinned queue, that > addresses > > stuff like partitioning ("sharding"), scan resistance, and > coordination w/lookup > > tables. We might extend/re-use it. > > The MQ algorithm is more complex, and seems like it has more overhead > than 2Q. The approximate 2Q algorithm here combines some benefits of > the clock algorithm, and works well on the linux kernel. MQ could be > another choice. There are some other candidates like LIRS, ARC, etc., > which have been deployed in some practical systems. MQ has been deployed in practical systems, and is more general. > > > > > 2) I'm a bit confused by active/inactive vocabulary, dimensioning of > cache > > segments (are you proposing to/do we now always cache whole > objects?), and > > cost of "looking for" dirty objects; I suspect that it makes sense > to amortize > > the cost of locating segments eligible to be flushed, rather than > minimize > > bookkeeping. > > Though currently the caching decisions are made in the unit of object > as Greg pointed out in another mail, I think we still have something > to improve here. I'll come back to this some time later. > > > > > Matt > > > > ----- "Zhiqiang Wang" <zhiqiang.wang@xxxxxxxxx> wrote: > > > > > > -----Original Message----- > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage > Weil > > > > Sent: Tuesday, July 21, 2015 6:38 AM > > > > To: Wang, Zhiqiang > > > > Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx > > > > Subject: Re: The design of the eviction improvement > > > > > > > > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote: > > > > > Hi all, > > > > > > > > > > This is a follow-up of one of the CDS session at > > > > > > > > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_ti > > > eri > > > > ng_eviction. We discussed the drawbacks of the current eviction > > > algorithm and > > > > several ways to improve it. Seems like the LRU variants is the > right > > > way to go. I > > > > come up with some design points after the CDS, and want to > discuss > > > it with you. > > > > It is an approximate 2Q algorithm, combining some benefits of > the > > > clock > > > > algorithm, similar to what the linux kernel does for the page > > > cache. > > > > > > > > Unfortunately I missed this last CDS so I'm behind on the > > > discussion. I have a > > > > few questions though... > > > > > > > > > # Design points: > > > > > > > > > > ## LRU lists > > > > > - Maintain LRU lists at the PG level. > > > > > The SharedLRU and SimpleLRU implementation in the current > code > > > have a > > > > > max_size, which limits the max number of elements in the > list. > > > This > > > > > mostly looks like a MRU, though its name implies they are > LRUs. > > > Since > > > > > the object size may vary in a PG, it's not possible to > caculate > > > the > > > > > total number of objects which the cache tier can hold ahead > of > > > time. > > > > > We need a new LRU implementation with no limit on the size. 
> > > > This last sentence seems to me to be the crux of it. Assuming
> > > > we have an OSD backed by flash storing O(n) objects, we need a
> > > > way to maintain an LRU of O(n) objects in memory. The current
> > > > hitset-based approach was taken based on the assumption that
> > > > this wasn't feasible--or at least we didn't know how to
> > > > implement such a thing. If it is, or we simply want to
> > > > stipulate that cache tier OSDs get gobs of RAM to make it
> > > > possible, then lots of better options become possible...
> > > >
> > > > Let's say you have a 1TB SSD, with an average object size of
> > > > 1MB -- that's 1 million objects. At maybe ~100 bytes per object
> > > > of RAM for an LRU entry, that's 100MB... so not so
> > > > unreasonable, perhaps!
> > >
> > > I was having the same question before proposing this. I did a
> > > similar calculation and thought it would be OK to use this much
> > > memory :-)
> > >
> > > > > - Two lists for each PG: active and inactive. Objects are
> > > > > first put into the inactive list when they are accessed, and
> > > > > moved between these two lists based on some criteria.
> > > > > Object flags: active, referenced, unevictable, dirty.
> > > > > - When an object is accessed:
> > > > > 1) If it's in neither of the lists, it's put on the top of
> > > > > the inactive list.
> > > > > 2) If it's in the inactive list and the referenced flag is
> > > > > not set, the referenced flag is set, and it's moved to the
> > > > > top of the inactive list.
> > > > > 3) If it's in the inactive list and the referenced flag is
> > > > > set, the referenced flag is cleared, and it's removed from
> > > > > the inactive list and put on top of the active list.
> > > > > 4) If it's in the active list and the referenced flag is not
> > > > > set, the referenced flag is set, and it's moved to the top of
> > > > > the active list.
> > > > > 5) If it's in the active list and the referenced flag is set,
> > > > > it's moved to the top of the active list.
> > > > > - When selecting objects to evict:
> > > > > 1) Objects at the bottom of the inactive list are selected to
> > > > > evict. They are removed from the inactive list.
> > > > > 2) If the number of objects in the inactive list becomes low,
> > > > > some of the objects at the bottom of the active list are
> > > > > moved to the inactive list. Objects which have the referenced
> > > > > flag set are given one more chance in the active list: they
> > > > > are moved to the top of the active list with the referenced
> > > > > flag cleared. Objects which don't have the referenced flag
> > > > > set are moved to the inactive list with the referenced flag
> > > > > set, so that they can be quickly promoted to the active list
> > > > > when necessary.
> > > > >
> > > > > ## Combine flush with eviction
> > > > > - When evicting an object, if it's dirty, it's flushed first.
> > > > > After flushing, it's evicted. If it's not dirty, it's evicted
> > > > > directly.
> > > > > - This means that we won't have separate activities, and
> > > > > won't set different ratios for flush and evict. Is there a
> > > > > need to do so?
> > > > > - Number of objects to evict at a time: 'evict_effort' acts
> > > > > as the priority, which is used to calculate the number of
> > > > > objects to evict.
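To make the access and eviction rules above concrete, here is a rough
sketch of the two-list bookkeeping, one instance per PG. Not actual
Ceph code; TwoListLRU, Entry, touch, refill_inactive, and
evict_one_clean are hypothetical names, the "active" flag is encoded by
which list an entry lives on, and the unevictable flag is omitted for
brevity.

  #include <iterator>
  #include <list>
  #include <string>
  #include <unordered_map>

  struct Entry {
    std::string oid;
    bool referenced = false;
    bool dirty = false;
  };

  class TwoListLRU {
    using List = std::list<Entry>;
    List active_, inactive_;                    // front = top
    struct Loc { List* list; List::iterator it; };
    std::unordered_map<std::string, Loc> index_;

    void move_top(List& from, List& to, List::iterator it) {
      to.splice(to.begin(), from, it);          // O(1), iterator stays valid
      index_[it->oid] = {&to, it};
    }
  public:
    void touch(const std::string& oid) {
      auto f = index_.find(oid);
      if (f == index_.end()) {                  // rule 1: new -> inactive top
        inactive_.push_front(Entry{oid});
        index_[oid] = {&inactive_, inactive_.begin()};
        return;
      }
      List* l = f->second.list;
      auto it = f->second.it;
      if (l == &inactive_ && !it->referenced) { // rule 2
        it->referenced = true;
        move_top(inactive_, inactive_, it);
      } else if (l == &inactive_) {             // rule 3: promote to active
        it->referenced = false;
        move_top(inactive_, active_, it);
      } else if (!it->referenced) {             // rule 4
        it->referenced = true;
        move_top(active_, active_, it);
      } else {                                  // rule 5
        move_top(active_, active_, it);
      }
    }

    // Eviction rule 2: refill the inactive list from the bottom of
    // the active list when it runs low.
    void refill_inactive(size_t low_water) {
      while (inactive_.size() < low_water && !active_.empty()) {
        auto it = std::prev(active_.end());
        if (it->referenced) {                   // one more chance
          it->referenced = false;
          move_top(active_, active_, it);
        } else {                                // demote, pre-referenced so
          it->referenced = true;                // one hit promotes it back
          move_top(active_, inactive_, it);
        }
      }
    }

    // Eviction rule 1, with the flush tweak discussed below: take
    // clean objects from the bottom of the inactive list, skipping
    // dirty ones and leaving them for the flusher.
    bool evict_one_clean(std::string* oid_out) {
      for (auto rit = inactive_.rbegin(); rit != inactive_.rend(); ++rit) {
        if (!rit->dirty) {
          *oid_out = rit->oid;
          index_.erase(rit->oid);
          inactive_.erase(std::next(rit).base());
          return true;
        }
      }
      return false;
    }
  };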
> > > > As someone else mentioned in a follow-up, the reason we let
> > > > the dirty level be set lower than the full level is that it
> > > > provides headroom so that objects can be quickly evicted
> > > > (delete, no flush) to make room for new writes or new
> > > > promotions.
> > > >
> > > > That said, we probably can/should streamline the flush so that
> > > > an evict can immediately follow without waiting for the agent
> > > > to come around again. (I don't think we do that now?)
> > >
> > > I was afraid of having to maintain different lists for
> > > flush/evict, so I proposed to combine flush with evict. But it
> > > seems we don't need to. When reaching the dirty ratio and
> > > selecting objects to flush, we look for dirty objects in the
> > > inactive list. The objects are marked clean and kept in the same
> > > position in the LRU list after flushing. When evicting objects,
> > > clean objects in the inactive list are selected; dirty objects
> > > are ignored. This keeps flush/evict the same as in the current
> > > implementation. Make sense?
> > >
> > > > sage
> > > >
> > > > > ## LRU list snapshotting
> > > > > - The two lists are snapshotted and persisted periodically.
> > > > > - Only one copy needs to be saved. The old copy is removed
> > > > > when persisting the lists. The saved lists are used to
> > > > > restore the LRU lists when the OSD reboots.
> > > > >
> > > > > Any comments/feedback are welcome.

Matt

--
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel.  734-761-4689
fax.  734-769-8938
cel.  734-216-5309
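P.S. On the snapshotting bullets above, a rough sketch of the "only one
copy needs to be saved" idea: write the new snapshot first, then
atomically replace the old one, so a crash always leaves one complete
copy. Hypothetical, not actual Ceph code; a plain file stands in for
whatever persistence the OSD would really use.

  #include <cstdio>
  #include <fstream>
  #include <string>
  #include <vector>

  bool snapshot_lists(const std::vector<std::string>& active,
                      const std::vector<std::string>& inactive,
                      const std::string& path) {
    const std::string tmp = path + ".new";
    std::ofstream out(tmp, std::ios::trunc);
    out << "active " << active.size() << "\n";
    for (const auto& oid : active)
      out << oid << "\n";
    out << "inactive " << inactive.size() << "\n";
    for (const auto& oid : inactive)
      out << oid << "\n";
    out.close();
    if (!out)
      return false;
    // rename() replaces the old snapshot in one step on POSIX systems
    return std::rename(tmp.c_str(), path.c_str()) == 0;
  }

Restore on OSD boot would be the mirror image: read the ids back in
order and rebuild the two lists; anything written since the snapshot
simply starts cold in the inactive list.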