Judging from debug output, the problem is in journal recovery, when it tries to delete object with huge (several milion keys - it is radosgw index* for bucket with over 50mln objects) amount of keys, using leveldb's rmkeys_by_prefix() method. Looking at the source code, rmkeys_by_prefix() batches all operations into one list and then submit_transaction() executes them all atomically. I'd love to write a patch for this issue, but it seems unfixable (or is it?) with current API and method behaviour. Could you offer any advice on how to proceed?
Answering myself, could anyone verify if attached patch looks ok? Should reduce memory footprint a bit.
When I first read this code, I assumed that data pointed by leveldb::Slice have to be reachable until db->Write is called.
However, looking into leveldb and into its source code, there is no such requirement - leveldb makes its own copy of key, so we're effectivly doubling memory footprint for no reason.
-- mg
--- a/src/os/LevelDBStore.cc +++ b/src/os/LevelDBStore.cc @@ -156,9 +156,8 @@ void LevelDBStore::LevelDBTransactionImpl::set( buffers.push_back(to_set_bl); bufferlist &bl = *(buffers.rbegin()); string key = combine_strings(prefix, k); - keys.push_back(key); - bat.Delete(leveldb::Slice(*(keys.rbegin()))); - bat.Put(leveldb::Slice(*(keys.rbegin())), + bat.Delete(leveldb::Slice(key)); + bat.Put(leveldb::Slice(key), leveldb::Slice(bl.c_str(), bl.length())); } @@ -166,8 +165,7 @@ void LevelDBStore::LevelDBTransactionImpl::rmkey(const string &prefix, const string &k) { string key = combine_strings(prefix, k); - keys.push_back(key); - bat.Delete(leveldb::Slice(*(keys.rbegin()))); + bat.Delete(leveldb::Slice(key)); } void LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(const string &prefix) @@ -177,8 +175,7 @@ void LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(const string &prefix it->valid(); it->next()) { string key = combine_strings(prefix, it->key()); - keys.push_back(key); - bat.Delete(*(keys.rbegin())); + bat.Delete(key); } } diff --git a/src/os/LevelDBStore.h b/src/os/LevelDBStore.h index 4617c5c..dd248dd 100644 --- a/src/os/LevelDBStore.h +++ b/src/os/LevelDBStore.h @@ -175,7 +175,6 @@ public: public: leveldb::WriteBatch bat; list<bufferlist> buffers; - list<string> keys; LevelDBStore *db; LevelDBTransactionImpl(LevelDBStore *db) : db(db) {}