RE: ObjectStore collections

Hi, Haomai

As discussed this afternoon, I moved 'buffers' from strip_header to BufferTransaction.

The code has been tested, and there seems to be a slight performance improvement.
After a fairly long sequential/random write test, I observed no segmentation faults, no growing memory usage, and no other OSD crashes.

Hope you can help review the code, thanks.

https://github.com/ceph/ceph/pull/2889
 
===================================
For others, here is a brief explanation:

'strip_header' in KeyValueStore is used to cache all object headers, and
'buffers' is used to ensure that all I/O in one transaction can read the latest updates.

Moving 'buffers' from strip_header to BufferTransaction ensures that 'buffers' is destructed once the transaction has been submitted, which prevents 'strip_header->buffers' from growing without bound and causing the OOM problem.

Also, since the 'meta' collection objects are cached in strip_header and there is no lock protecting parallel I/O on the 'meta' collection, multiple threads accessing the 'meta' collection's strip_header->buffers can cause a segmentation fault. Moving 'buffers' from strip_header to BufferTransaction prevents that scenario as well.
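For anyone not familiar with the KeyValueStore internals, below is a rough sketch of the data-structure change. The member types are simplified stand-ins (e.g. std::string instead of bufferlist) and are not the exact definitions from the pull request; please refer to the PR for the real code.

    #include <map>
    #include <string>
    #include <utility>

    // Before: pending writes lived in the cached per-object header, so a
    // hot header (e.g. one in the 'meta' collection) accumulated buffers
    // for as long as it stayed in the cache, and was touched by multiple
    // threads without a lock.
    struct StripObjectHeader {
      // ...strip metadata (object name, strip size, bitmap, ...)...
      // old location of the pending-write cache:
      // std::map<std::pair<std::string, std::string>, std::string> buffers;
    };

    // After: the pending writes belong to the transaction itself.  A
    // BufferTransaction goes away right after the transaction is
    // submitted, so 'buffers' is freed at that point and is never shared
    // between concurrently running transactions.
    struct BufferTransaction {
      std::map<std::pair<std::string, std::string>, std::string> buffers;
      // Later ops in the same transaction look here first, so they still
      // read the latest (not yet submitted) updates.
    };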

Best Regards,
-Chendi

-----Original Message-----
From: Haomai Wang [mailto:haomaiwang@xxxxxxxxx] 
Sent: Sunday, November 9, 2014 5:21 PM
To: Sage Weil
Cc: Xue, Chendi; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: ObjectStore collections

On Sun, Nov 9, 2014 at 5:59 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Sat, 8 Nov 2014, Haomai Wang wrote:
>> As for the OOM, I think the root cause is the mistaken commit above too.
>> Because the "meta" collection is updated on every transaction and
>> StripObjectHeader::buffers is always kept in memory by the cache
>> strategy, this object's buffers keep growing all the time. So I think
>> that if we avoid caching the "meta" collection's object, it will be
>> fine. Although we didn't observe OOM in previous releases except with
>> this mistaken commit, I prefer to add code to discard "buffers" each
>> time a transaction is submitted, to avoid potential unpredictable
>> memory growth.
>>
>> Do you have a clearer implementation in mind? I'm just thinking about a
>> better way to solve the performance bottleneck for "meta" collections.
>
> I would really like to see if we can eliminate collections from the 
> API entirely.  Or, perhaps more importantly, if that would be helpful.  
> For the most part, hobject_t's already sort themselves into 
> collections based on the hash.  The exceptions are:
>
> - The 'meta' collection.  Mostly this includes the pg logs and pg info 
> objects (which are per-pg and would otherwise need no locking) and the 
> osdmap objects.
>
> - collection_move and collection_move_rename.  I think if we move 
> everything to collection_move_rename and use temporary objects with 
> unique names for everything (I think in-progress recovery objects is 
> the main user of collection_move) then this really just turns into a 
> rename operation.
>
> - object listing is currently in terms of pg, but could just as easily 
> specify a hash range.
>
> - collection attributes can be moved to the pginfo objects.
>
> It sounds like the problem in KeyValueStore is that the pglog/pginfo 
> objects are written by transactions in all PGs but the per-collection 
> index/cache structures weren't being locked.  If we can find a way to 
> fit these into the sorted hash in the right position that is 
> conceptually simpler.  But I'm not sure if that simplicity actually 
> helps with the implementation, where the data structure locking is the important part.
> Perhaps we need to keep a collection concept simply for that purpose 
> and the only real problem is 'meta'?

Originally, KeyValueStore did exactly that: it simply avoided caching the "meta"
collection, for concurrency reasons.

I'm just wondering whether there is a way to make "meta" collection ops more parallel. BTW, FileStore could also benefit for the same reason.

>
> sage



--
Best Regards,

Wheat