Hi,

----- "Allen Samuels" <Allen.Samuels@xxxxxxxxxxx> wrote:

> I'm very concerned about designing around the assumption that objects
> are ~1MB in size. That's probably a good assumption for block- and
> HDFS-dominated systems, but likely a very poor assumption for many
> object- and file-dominated systems.

++

> If I understand the proposals that have been discussed, each of them
> assumes an in-memory data structure with an entry per object (the
> exact size of the entry varies with the different proposals).
>
> Under that assumption, I have another concern, which is the lack of
> graceful degradation as the object counts grow and the in-memory
> data structures get larger. Everything seems fine until just a few
> more objects get added; then the system starts to page and
> performance drops dramatically, likely to the point where Linux will
> start killing OSDs.

I'm not clear why that needs to be the case (but I don't think it
matters just now whether I do; I was just letting folks know that we
have MQ implementation(s)). Still, what you're describing seems
consistent with the model Sage and Greg, at least, are describing.

Matt

> What's really needed is some kind of way to extend the lists into
> storage in a way that doesn't cause a zillion I/O operations.
>
> I have some vague idea that a data structure like the LSM mechanism
> ought to be able to accomplish what we want. Some amount of the data
> structure (the entries most likely to be used) is held in DRAM [and
> backed to storage for restart], and the least likely to be used is
> flushed to storage with some mechanism that allows batched updates.
> Allen Samuels
> Software Architect, Systems and Software Solutions
>
> 2880 Junction Avenue, San Jose, CA 95134
> T: +1 408 801 7030 | M: +1 408 780 6416
> allen.samuels@xxxxxxxxxxx
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Wednesday, July 22, 2015 5:57 AM
> To: Wang, Zhiqiang
> Cc: sjust@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: The design of the eviction improvement
>
> On Wed, 22 Jul 2015, Wang, Zhiqiang wrote:
> > > The part that worries me now is the speed with which we can load
> > > and manage such a list. Assuming it is several hundred MB, it'll
> > > take a while to load that into memory and set up all the pointers
> > > (assuming a conventional linked list structure). Maybe tens of
> > > seconds...
> >
> > I'm thinking of maintaining the lists at the PG level. That's to
> > say, we have an active/inactive list for every PG. We can load the
> > lists in parallel during rebooting. Also, the ~100 MB lists are
> > split among different OSD nodes. Perhaps it does not need such a
> > long time to load them?
> >
> > > I wonder if instead we should construct some sort of flat model
> > > where we load slabs of contiguous memory, 10's of MB each, and
> > > have the next/previous pointers be a (slab, position) pair. That
> > > way we can load it into memory in big chunks, quickly, and be
> > > able to operate on it (adjust links) immediately.
> > >
> > > Another thought: currently we use the hobject_t hash only instead
> > > of the full object name. We could continue to do the same, or we
> > > could do a hash pair (hobject_t hash + a different hash of the
> > > rest of the object name) to keep the representation compact. With
> > > a model like the above, that could get the object representation
> > > down to 2 u32's.
> > > A link could be a slab + position (2 more u32's), and if we have
> > > prev + next that'd be just 6x4 = 24 bytes per object.
> >
> > Looks like for an object, the head and the snapshot version have
> > the same hobject hash. Thus we have to use the hash pair instead of
> > just the hobject hash. But I still have two questions if we use the
> > hash pair to represent an object.
> >
> > 1) Does the hash pair uniquely identify an object? That's to say,
> > is it possible for two objects to have the same hash pair?
>
> With two hashes, collisions would be rare but could happen.
>
> > 2) We need a way to get the full object name from the hash pair,
> > so that we know what objects to evict. But it seems like we don't
> > have a good way to do this?
>
> Ah, yeah--I'm a little stuck in the current hitset view of things. I
> think we can either embed the full ghobject_t (which means we lose
> the fixed-size property, and the per-object overhead goes way up,
> probably from ~24 bytes to more like 80 or 100), or we can enumerate
> objects starting at the (hobject_t) hash position to find the object.
> That's somewhat inefficient for FileStore (it'll list a directory of
> a hundred or so objects, probably, and iterate over them to find the
> right one), but for NewStore it will be quite fast (NewStore has all
> objects sorted into keys in rocksdb, so we just start listing at the
> right offset). Usually we'll get the object right off, unless there
> are hobject_t hash collisions (already reasonably rare since it's a
> 2^32 space for the pool).
>
> Given that, I would lean toward the 2-hash fixed-size records (of
> these two options)...
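[Editor's note: the fixed-size record Sage describes could look
something like the following. This is only a sketch of the arithmetic
in the thread, not actual Ceph code; the type and field names are made
up, but the layout matches the numbers discussed: two u32 hashes to
identify the object, plus (slab, position) pairs for the prev and next
links, giving 6 x 4 = 24 bytes per entry, so slabs load as flat arrays
with no pointer fix-up.]

```cpp
#include <cstdint>
#include <vector>

// A (slab, position) pair takes the place of a raw pointer, so slabs
// of contiguous entries can be loaded in bulk and used immediately.
struct SlabRef {
  uint32_t slab;
  uint32_t pos;
};

// One list entry: the existing hobject_t hash plus a second hash of
// the rest of the object name (to keep collisions rare), and the
// prev/next links of a doubly linked list.
struct LruEntry {
  uint32_t hobject_hash;  // hobject_t hash (2^32 space per pool)
  uint32_t name_hash;     // hash of the remainder of the object name
  SlabRef prev;
  SlabRef next;
};

static_assert(sizeof(LruEntry) == 24, "6 x u32 = 24 bytes per object");

// A "slab" is just a contiguous array of entries; following a link is
// two array indexes rather than a pointer dereference.
struct Slab {
  std::vector<LruEntry> entries;
};

inline LruEntry& deref(std::vector<Slab>& slabs, SlabRef r) {
  return slabs[r.slab].entries[r.pos];
}
```

Because links are indexes rather than addresses, a slab read from disk
is usable as-is, which is the "load big chunks and operate immediately"
property the flat model is after.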
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> in the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Matt Benjamin
CohortFS, LLC.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://cohortfs.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309