ObjectStore collections

Sage Weil <sage@xxxxxxxxxxxx> · Sat, 8 Nov 2014 20:59:54 -0800 (PST)

On Sat, 8 Nov 2014, Haomai Wang wrote:
> As for the OOM, I think the root cause is also the mistaken commit
> above. The "meta" collection is updated by every transaction, and
> StripObjectHeader::buffers is always kept in memory because of the
> caching strategy, so this object's buffers keep growing all the
> time. I think that if we avoid caching the "meta" collection's
> objects we will be fine. Although we have not observed OOM in
> previous releases, only with this mistaken commit, I would prefer to
> add code that discards "buffers" at each transaction submit to avoid
> potential unpredictable memory growth.
> 
> Do you have a clearer implementation in mind? I'm just looking for a
> better way to solve the performance bottleneck for the "meta"
> collection.

I would really like to see if we can eliminate collections from the API 
entirely.  Or, perhaps more importantly, if that would be helpful.  For 
the most part, hobject_t's already sort themselves into collections based 
on the hash.  The exceptions are:

- The 'meta' collection.  Mostly this includes the pg logs and pg info 
objects (which are per-pg and would otherwise need no locking) and the 
osdmap objects.

- collection_move and collection_move_rename.  I think if we move 
everything to collection_move_rename and use temporary objects with 
unique names for everything (I think in-progress recovery objects are 
the main user of collection_move), then this really just turns into a 
rename operation.

- object listing is currently in terms of pg, but could just as easily 
specify a hash range.

- collection attributes can be moved to the pginfo objects.
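
A minimal sketch of the hash-range listing idea, assuming one flat index 
sorted by hash and that a pg corresponds to a contiguous range of that 
sort key.  Obj, Store and list_range are hypothetical stand-ins for 
illustration, not the real hobject_t or ObjectStore interfaces:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for hobject_t: ordered by hash first, then name.
struct Obj {
  uint32_t hash;
  std::string name;
  bool operator<(const Obj& o) const {
    if (hash != o.hash) return hash < o.hash;
    return name < o.name;
  }
};

// One flat, sorted object index instead of one index per collection.
using Store = std::map<Obj, std::string>;

// List the names of all objects whose hash lies in [begin, end).
// With an exclusive 32-bit end, the single value 0xffffffff cannot be
// covered; a real interface would need to handle that boundary.
std::vector<std::string> list_range(const Store& s, uint32_t begin,
                                    uint32_t end) {
  assert(begin <= end);
  std::vector<std::string> out;
  // The empty name sorts first, so this finds the first hash >= begin.
  for (auto it = s.lower_bound(Obj{begin, ""});
       it != s.end() && it->first.hash < end; ++it) {
    out.push_back(it->first.name);
  }
  return out;
}
```

Listing a pg would then be a call like list_range(store, lo, hi) with the 
pg's hash bounds, with no collection argument at all.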

It sounds like the problem in KeyValueStore is that the pglog/pginfo 
objects are written by transactions in all PGs but the per-collection 
index/cache structures weren't being locked.  If we can find a way to fit 
these objects into the sorted hash in the right position, that is 
conceptually simpler.  But I'm not sure that simplicity actually helps 
with the implementation, where the data structure locking is the 
important part.  
Perhaps we need to keep a collection concept simply for that purpose and 
the only real problem is 'meta'?
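
To make the locking concern concrete, here is a rough sketch, assuming 
each collection carries its own lock and index.  Collection, Store, apply 
and the "pglog_" key are hypothetical names, not KeyValueStore code.  
Every pg transaction also writes into "meta", so it has to take that lock 
as well as its own:

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical per-collection state: a lock guarding an index/cache.
struct Collection {
  std::mutex lock;
  std::map<std::string, std::string> index;
};

struct Store {
  // Assumes all collections are created before transactions run, so
  // the map itself is only read concurrently.
  std::map<std::string, Collection> colls;

  // Apply one pg transaction: a data object in the pg's collection
  // plus a pg log entry that lives in the shared "meta" collection.
  void apply(const std::string& pg, const std::string& oid,
             const std::string& data) {
    assert(pg != "meta");  // locking the same mutex twice would deadlock
    Collection& c = colls.at(pg);
    Collection& meta = colls.at("meta");
    // C++17 scoped_lock takes both locks without lock-order deadlock.
    std::scoped_lock guard(c.lock, meta.lock);
    c.index[oid] = data;
    meta.index["pglog_" + pg] = oid;
  }
};
```

In this sketch the pg collections never contend with each other, but 
every transaction serializes on meta.lock, which is the bottleneck being 
discussed.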

sage