On Tue, 21 Jul 2015, Wang, Zhiqiang wrote:
> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > Sent: Tuesday, July 21, 2015 6:38 AM
> > To: Wang, Zhiqiang
> > Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: The design of the eviction improvement
> >
> > On Mon, 20 Jul 2015, Wang, Zhiqiang wrote:
> > > Hi all,
> > >
> > > This is a follow-up of one of the CDS sessions at
> > > http://tracker.ceph.com/projects/ceph/wiki/Improvement_on_the_cache_tiering_eviction.
> > > We discussed the drawbacks of the current eviction algorithm and
> > > several ways to improve it. It seems like the LRU variants are the
> > > right way to go. I came up with some design points after the CDS, and
> > > want to discuss them with you. It is an approximate 2Q algorithm,
> > > combining some benefits of the clock algorithm, similar to what the
> > > Linux kernel does for the page cache.
> >
> > Unfortunately I missed this last CDS so I'm behind on the discussion.
> > I have a few questions though...
> >
> > > # Design points:
> > >
> > > ## LRU lists
> > > - Maintain LRU lists at the PG level.
> > > The SharedLRU and SimpleLRU implementations in the current code have
> > > a max_size, which limits the max number of elements in the list.
> > > These mostly look like MRUs, though their names imply they are LRUs.
> > > Since the object size may vary in a PG, it's not possible to
> > > calculate the total number of objects which the cache tier can hold
> > > ahead of time. We need a new LRU implementation with no limit on the
> > > size.
> >
> > This last sentence seems to me to be the crux of it. Assuming we have
> > an OSD backed by flash storing O(n) objects, we need a way to maintain
> > an LRU of O(n) objects in memory. The current hitset-based approach
> > was taken based on the assumption that this wasn't feasible--or at
> > least we didn't know how to implement such a thing. If it is, or we
> > simply want to stipulate that cache tier OSDs get gobs of RAM to make
> > it possible, then lots of better options become possible...
> >
> > Let's say you have a 1TB SSD, with an average object size of 1MB --
> > that's 1 million objects. At maybe ~100 bytes per object of RAM for an
> > LRU entry that's 100MB... so not so unreasonable, perhaps!
>
> I was having the same question before proposing this. I did a similar
> calculation and thought it would be ok to use this much memory :-)

The part that worries me now is the speed with which we can load and
manage such a list. Assuming it is several hundred MB, it'll take a
while to load that into memory and set up all the pointers (assuming a
conventional linked list structure). Maybe tens of seconds...

I wonder if instead we should construct some sort of flat model where we
load slabs of contiguous memory, 10's of MB each, and have the
next/previous pointers be a (slab, position) pair. That way we can load
it into memory in big chunks, quickly, and be able to operate on it
(adjust links) immediately.

Another thought: currently we use the hobject_t hash only instead of the
full object name. We could continue to do the same, or we could do a
hash pair (hobject_t hash + a different hash of the rest of the object)
to keep the representation compact. With a model like the above, that
could get the object representation down to 2 u32's. A link could be a
slab + position (2 more u32's), and if we have prev + next that'd be
just 6x4=24 bytes per object.

With fixed-sized slots on the slabs, the slab allocator could be very
simple... maybe just a bitmap, a free counter, and any other trivial
optimizations to make finding a slab's next free slot nice and quick.
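Something along these lines is what I'm picturing -- just a rough sketch
to make the layout concrete, with made-up names rather than existing
code:

  // Rough sketch only -- hypothetical names, not existing Ceph code.
  // Each LRU entry is a fixed-size slot in a slab; links are
  // (slab, position) pairs instead of raw pointers, so slabs of entries
  // can be loaded in bulk and linked up immediately.
  #include <cstdint>
  #include <vector>

  struct lru_link_t {
    uint32_t slab;              // which slab
    uint32_t pos;               // slot index within that slab
  };

  struct lru_entry_t {          // 6 u32's = 24 bytes per object
    uint32_t hobject_hash;      // hobject_t hash, as we use today
    uint32_t name_hash;         // hash of the rest of the object name
    lru_link_t prev;
    lru_link_t next;
  };

  struct lru_slab_t {
    std::vector<lru_entry_t> slots;  // fixed number of fixed-size slots
    std::vector<bool> in_use;        // simple allocation bitmap
    uint32_t free_count;             // quick "any room in this slab?" check

    explicit lru_slab_t(uint32_t n)
      : slots(n), in_use(n, false), free_count(n) {}

    // Find the next free slot; a real version would keep a hint or scan
    // the bitmap a word at a time.
    int64_t alloc() {
      if (free_count == 0)
        return -1;
      for (uint32_t i = 0; i < in_use.size(); ++i) {
        if (!in_use[i]) {
          in_use[i] = true;
          --free_count;
          return i;
        }
      }
      return -1;
    }

    void free_slot(uint32_t pos) {
      in_use[pos] = false;
      ++free_count;
    }
  };

Loading a list would then be a handful of large reads, and the links are
usable as-is since they never encode memory addresses.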
> > > - Two lists for each PG: active and inactive. Objects are first put
> > > into the inactive list when they are accessed, and moved between
> > > these two lists based on some criteria.
> > > Object flags: active, referenced, unevictable, dirty.
> > > - When an object is accessed:
> > > 1) If it's not in either of the lists, it's put on the top of the
> > > inactive list.
> > > 2) If it's in the inactive list, and the referenced flag is not set,
> > > the referenced flag is set, and it's moved to the top of the
> > > inactive list.
> > > 3) If it's in the inactive list, and the referenced flag is set, the
> > > referenced flag is cleared, and it's removed from the inactive list
> > > and put on top of the active list.
> > > 4) If it's in the active list, and the referenced flag is not set,
> > > the referenced flag is set, and it's moved to the top of the active
> > > list.
> > > 5) If it's in the active list, and the referenced flag is set, it's
> > > moved to the top of the active list.
> > > - When selecting objects to evict:
> > > 1) Objects at the bottom of the inactive list are selected to evict.
> > > They are removed from the inactive list.
> > > 2) If the number of objects in the inactive list becomes low, some
> > > of the objects at the bottom of the active list are moved to the
> > > inactive list. Objects which have the referenced flag set are given
> > > one more chance in the active list: they are moved to the top of the
> > > active list with the referenced flag cleared. Objects which don't
> > > have the referenced flag set are moved to the inactive list with the
> > > referenced flag set, so that they can be quickly promoted back to
> > > the active list when necessary.
> > >
> > > ## Combine flush with eviction
> > > - When evicting an object, if it's dirty, it's flushed first. After
> > > flushing, it's evicted. If not dirty, it's evicted directly.
> > > - This means that we won't have separate activities and won't set
> > > different ratios for flush and evict. Is there a need to do so?
> > > - Number of objects to evict at a time. 'evict_effort' acts as the
> > > priority, which is used to calculate the number of objects to evict.
> >
> > As someone else mentioned in a follow-up, the reason we let the dirty
> > level be set lower than the full level is that it provides headroom so
> > that objects can be quickly evicted (delete, no flush) to make room
> > for new writes or new promotions.
> >
> > That said, we probably can/should streamline the flush so that an
> > evict can immediately follow without waiting for the agent to come
> > around again. (I don't think we do that now?)
>
> I was afraid of having to maintain different lists for flush/evict, so
> I proposed to combine flush with evict. But it seems like we don't need
> to. When reaching the dirty ratio and selecting objects to flush, we
> look for dirty objects in the inactive list. The objects are marked
> clean and kept in the same position in the LRU list after flushing.
> When evicting objects, clean objects in the inactive list are selected;
> dirty objects are ignored. This keeps flush/evict the same as in the
> current implementation. Make sense?

Yep!
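Just to make sure we mean the same thing, the agent pass could then look
roughly like this -- a sketch only, with hypothetical structures rather
than the real agent code: walk the inactive list from the cold end,
flush dirty objects in place (they keep their position, just get marked
clean), and evict clean ones.

  #include <list>
  #include <string>

  // Sketch only -- hypothetical types, not the existing tier agent.
  struct inactive_entry_t {
    std::string oid;     // stand-in for the (hashed) object id
    bool dirty = false;
  };

  struct pg_lists_t {
    std::list<inactive_entry_t> inactive;  // front = warmest, back = coldest

    void flush(inactive_entry_t& e) {
      // write the object back to the base tier (omitted), then mark it
      // clean; the entry keeps its position in the list
      e.dirty = false;
    }

    // One agent pass. 'budget' plays the role of evict_effort: how many
    // objects we are willing to touch this time around.
    void agent_pass(bool need_flush, bool need_evict, int budget) {
      auto it = inactive.end();
      while (it != inactive.begin() && budget > 0) {
        --it;                        // step toward the coldest entries
        if (it->dirty) {
          if (need_flush) {
            flush(*it);              // flushed in place, now clean
            --budget;
          }
          // dirty objects are never evicted directly
        } else if (need_evict) {
          it = inactive.erase(it);   // clean: evict (drop) immediately
          --budget;
        }
      }
    }
  };

That way flushing never reorders anything, and eviction only ever drops
clean entries off the cold end of the inactive list.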
sage