Made some changes: BufferTransaction::buffers is now using map< <coll_t, ghobject_t>, map<pair<prefix, key>, bufferlist> > to records IOs in one transaction, codes is under test with 700s seq write 64k and 700s random write 4k without osd crash. Best Regards, -Chendi -----Original Message----- From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Xue, Chendi Sent: Sunday, November 9, 2014 10:32 PM To: Haomai Wang Cc: ceph-devel@xxxxxxxxxxxxxxx; Sage Weil Subject: RE: ObjectStore collections Hi, Haomai As discussed this afternoon, I moved buffers from strip_header to BufferTransaction The codes has been tested, there seems to be a slight performance enhancement After a pretty long time seq/random write test, I observed no segment fault and growing memory usage problem or any other osd crash problem. Hope you can help to review the codes, thanks https://github.com/ceph/ceph/pull/2889 =================================== For others, there is a brief explanation: 'strip_header' in KeyValueStore is used to cache all headers, 'buffers' is used to ensure all io in one transaction can read the lastest update. Move buffers from strip_header to BufferTransaction can ensure 'buffers' being destructed after the transaction being submitted, which prevent growing 'strip_header->buffers' causing OOM problem. Also, for all 'meta collection' will be cached in strip_header, and there lacks of lock of parallel io accessing 'meta collection', multi-thread accessing the 'meta collection' strip_header->buffers causes segmentation fault, so moving 'buffers' from strip_header to BufferTransaction can also prevent such scenario. Best Regards, -Chendi -----Original Message----- From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx] Sent: Sunday, November 9, 2014 5:21 PM To: Sage Weil Cc: Xue, Chendi; ceph-devel@xxxxxxxxxxxxxxx Subject: Re: ObjectStore collections On Sun, Nov 9, 2014 at 5:59 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote: > On Sat, 8 Nov 2014, Haomai Wang wrote: >> As for OOM, I think the root cause is the mistake commit above too. >> Because "meta" collection will be updated each transaction and >> StripObjectHeader::buffers will be always kept in memory because of >> the strategy of cache. So this object's buffers will keep in >> increasing all the time. So I think if we avoid cache "meta" >> collection's object will just be fine. Although we don't observe OOM >> for previous release except this mistake commit, I prefer to add >> codes to discard "buffers" each submit transaction time to avoid >> potential unpredicted memory growing. >> >> Do you have a more clear impl about it? I'm just thinking a better >> way to solve the performance bottleneck for "meta" collections. > > I would really like to see if we can eliminate collections from the > API entirely. Or, perhaps more importantly, if that would be helpful. > For the most part, hobject_t's already sort themselves into > collections based on the hash. The exceptions are: > > - The 'meta' collection. Mostly this includes the pg logs and pg info > objects (which are per-pg and would otherwise need no locking) and the > osdmap objects. > > - collection_move and collection_move_rename. I think if we move > everything to collection_move_rename and use temporary objects with > unique names for everything (I think in-progress recovery objects is > the main user of collection_move) then this really just turns into a > rename operation. > > - object listing is currently in terms of pg, but could just as easily > specify a hash range. > > - collection attributes can be moved to the pginfo objects. > > It sounds like the problem in KeyValueStore is that the pglog/pginfo > objects are written by transactions in all PGs but the per-collection > index/cache structures weren't being locked. If we can find a way to > fit these into the sorted hash in the right position that is > conceptually simpler. But I'm not sure if that simplicity actually > helps with the implementation, where the data structure locking is the important part. > Perhaps we need to keep a collection concept simply for that purpose > and the only real problem is 'meta'? Original, KeyvalueStore is true by just avoiding cache "meta" collection for concurrent purpose. I'm just thinking about is there a way to make "meta" collection ops more parallelization. BTW, FileStore also can get the benefits because of the same reason. > > sage -- Best Regards, Wheat 칻 & ~ & +- ݶ w ˛ m ^ b ^n r z h & G h ( 階 ݢj" m z ޖ f h ~ m ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f