Hi Allen,

> -----Original Message-----
> From: Allen Samuels [mailto:Allen.Samuels@xxxxxxxxxxx]
> Sent: Thursday, July 23, 2015 2:41 AM
> To: Sage Weil; Wang, Zhiqiang
> Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: The design of the eviction improvement
>
> I'm very concerned about designing around the assumption that objects are
> ~1MB in size. That's probably a good assumption for block and HDFS dominated
> systems, but likely a very poor assumption about many object and file
> dominated systems.

This is true. If we have lots of small objects/files, the memory used for the LRU lists could become extremely large.

>
> If I understand the proposals that have been discussed, each of them assumes
> an in-memory data structure with an entry per object (the exact size of the
> entry varies with the different proposals).
>
> Under that assumption, I have another concern which is the lack of graceful
> degradation as the object counts grow and the in-memory data structures get
> larger. Everything seems fine until just a few objects get added then the system
> starts to page and performance drops dramatically (likely) to the point where
> Linux will start killing OSDs.
>
> What's really needed is some kind of way to extend the lists into storage in a way
> that doesn't cause a zillion I/O operations.
>
> I have some vague idea that some data structure like the LSM mechanism
> ought to be able to accomplish what we want. Some amount of the data
> structure (the most likely to be used) is held in DRAM [and backed to storage
> for restart] and the least likely to be used is flushed to storage with some
> mechanism that allows batched updates.

An LSM-like mechanism could solve the memory consumption problem. But I suspect that choosing which objects to evict would then become complex and inefficient. Also, after evicting some objects, we would need to update the on-disk files to remove the entries for those objects, which is inefficient, too.
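
To put a rough number on the small-object concern above, here is a back-of-the-envelope sketch. It assumes one list entry per object and the ~24-byte record discussed later in this thread; the 1 TiB cache size, the object sizes, and the function name are purely illustrative assumptions, not measurements.

```python
# Back-of-the-envelope estimate; every number here is an illustrative assumption.
ENTRY_BYTES = 24  # per-object record size taken from the 2-hash + prev/next proposal

KiB = 1 << 10
MiB = 1 << 20
TiB = 1 << 40


def list_memory_bytes(cache_bytes, object_bytes, entry_bytes=ENTRY_BYTES):
    """Memory needed to hold one list entry per object in a cache of this size."""
    num_objects = cache_bytes // object_bytes
    return num_objects * entry_bytes


if __name__ == "__main__":
    # Same hypothetical 1 TiB cache, filled with objects of different sizes.
    for obj_size in (4 * MiB, 1 * MiB, 64 * KiB, 4 * KiB):
        mem = list_memory_bytes(1 * TiB, obj_size)
        print(f"{obj_size // KiB:>5} KiB objects -> {mem / MiB:8.1f} MiB of list entries")
```

Under these assumptions the list memory grows linearly as the object size shrinks, so going from 1 MiB objects to 4 KiB objects multiplies the footprint by 256.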
>
> Allen Samuels
> Software Architect, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030| M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: The design of the eviction improvement
>
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load and
> > > manage such a list. Assuming it is several hundred MB, it'll take a
> > > while to load that into memory and set up all the pointers (assuming
> > > a conventional linked list structure). Maybe tens of seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to say,
> > we have an active/inactive list for every PG. We can load the lists in
> > parallel during rebooting. Also, the ~100 MB lists are split among
> > different OSD nodes. Perhaps it does not need such long time to load
> > them?
> >
> > >
> > > I wonder if instead we should construct some sort of flat model
> > > where we load slabs of contiguous memory, 10's of MB each, and have
> > > the next/previous pointers be a (slab,position) pair. That way we
> > > can load it into memory in big chunks, quickly, and be able to
> > > operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead of
> > > the full object name. We could continue to do the same, or we could
> > > do a hash pair (hobject_t hash + a different hash of the rest of the
> > > object) to keep the representation compact. With a model like the
> > > above, that could get the object representation down to 2 u32's. A
> > > link could be a slab + position (2 more u32's), and if we have prev
> > > + next that'd be just 6x4=24 bytes per object.
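
For reference, the 24-byte record and the (slab, position) links described above could be sketched roughly as follows. This is only an illustration of the layout, not Ceph code: the names `RECORD`, `NIL`, `Slab` and `follow_next`, the little-endian field order, and the sentinel value are all assumptions of mine.

```python
import struct

# One record: hobject hash, second hash, prev (slab, pos), next (slab, pos).
# Six little-endian u32 fields, so 6 x 4 = 24 bytes per object.
RECORD = struct.Struct("<6I")
NIL = 0xFFFFFFFF  # hypothetical sentinel meaning "no link"


class Slab:
    """A fixed number of records stored in one contiguous buffer."""

    def __init__(self, num_records):
        self.buf = bytearray(RECORD.size * num_records)

    @classmethod
    def from_bytes(cls, data):
        # Bulk "load": adopt the raw bytes as-is, with no per-record fix-ups.
        slab = cls(0)
        slab.buf = bytearray(data)
        return slab

    def read(self, pos):
        return RECORD.unpack_from(self.buf, pos * RECORD.size)

    def write(self, pos, hobj_hash, name_hash, prev, nxt):
        RECORD.pack_into(self.buf, pos * RECORD.size,
                         hobj_hash, name_hash, *prev, *nxt)


def follow_next(slabs, slab_idx, pos):
    """Return the (slab, position) of the next record, or None at the tail."""
    *_, next_slab, next_pos = slabs[slab_idx].read(pos)
    if next_slab == NIL:
        return None
    return next_slab, next_pos


slabs = [Slab(4), Slab(4)]
# Record A lives at (slab 0, pos 0), record B at (slab 1, pos 2); A <-> B.
slabs[0].write(0, 0x1234, 0xABCD, (NIL, NIL), (1, 2))
slabs[1].write(2, 0x5678, 0xEF01, (0, 0), (NIL, NIL))

# Simulate writing the slabs out and reading them back in big chunks.
reloaded = [Slab.from_bytes(bytes(s.buf)) for s in slabs]
```

Because the links are (slab, position) indices rather than memory addresses, they stay valid after the buffers are reloaded, which is the property that would let us operate on the list immediately after loading.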
> >
> > Looks like for an object, the head and the snapshot version have the
> > same hobject hash. Thus we have to use the hash pair instead of just
> > the hobject hash. But I still have two questions if we use the hash
> > pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say, is
> > it possible for two objects to have the same hash pair?
>
> With two hashes collisions would be rare but could happen
>
> > 2) We need a way to get the full object name from the hash pair, so
> > that we know what objects to evict. But seems like we don't have a
> > good way to do this?
>
> Ah, yeah--I'm a little stuck in the current hitset view of things. I think we can
> either embed the full ghobject_t (which means we lose the fixed-size property,
> and the per-object overhead goes way up.. probably from ~24 bytes to more
> like 80 or 100). Or, we can enumerate objects starting at the (hobject_t) hash
> position to find the object. That's somewhat inefficient for FileStore (it'll list a
> directory of a hundred or so objects, probably, and iterate over them to find the
> right one), but for NewStore it will be quite fast (NewStore has all objects
> sorted into keys in rocksdb, so we just start listing at the right offset). Usually
> we'll get the object right off, unless there are hobject_t hash collisions (already
> reasonably rare since it's a 2^32 space for the pool).
>
> Given that, I would lean toward the 2-hash fixed-sized records (of these 2
> options)...
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body
> of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at
> http://vger.kernel.org/majordomo-info.html
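
The second option Sage describes (enumerate objects starting at the hobject_t hash position, then pick the one whose second hash matches) could look roughly like the sketch below. It models the object store as a plain sorted list, which is only a stand-in for a sorted key listing; `second_hash`, `find_objects`, the use of crc32, and the object names are all hypothetical choices for illustration.

```python
import bisect
import zlib


def second_hash(name):
    # Stand-in for "a different hash of the rest of the object";
    # crc32 is an arbitrary choice made for this sketch.
    return zlib.crc32(name.encode()) & 0xFFFFFFFF


def find_objects(sorted_objects, hobj_hash, name_hash):
    """Return the names matching a (hobject hash, second hash) pair.

    sorted_objects is a list of (hobject_hash, name) tuples sorted by hash.
    More than one name can come back if both hashes collide.
    """
    # Jump straight to the first entry whose hash is >= hobj_hash.
    idx = bisect.bisect_left(sorted_objects, (hobj_hash,))
    matches = []
    while idx < len(sorted_objects) and sorted_objects[idx][0] == hobj_hash:
        name = sorted_objects[idx][1]
        if second_hash(name) == name_hash:
            matches.append(name)
        idx += 1
    return matches


# The head and a snapshot share the same hobject hash (0x20 here),
# so only the second hash tells them apart.
objects = sorted([
    (0x10, "obj_k"),
    (0x20, "obj_a"),
    (0x20, "obj_a:snap1"),
    (0x30, "obj_z"),
])

found = find_objects(objects, 0x20, second_hash("obj_a:snap1"))
```

The scan only touches entries that share the hobject hash, so in the common case it inspects one or two entries. It returns a list rather than a single name because, as noted above, a collision on both hashes is rare but possible.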