Re: RocksDB Incorrect API Usage

On Wednesday 01 June 2016 08:54 AM, Jianjian Huo wrote:
> Hi Allen,
>
> On Tue, May 31, 2016 at 7:52 PM, Allen Samuels <Allen.Samuels@xxxxxxxxxxx> wrote:
>>> -----Original Message-----
>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
>>> Sent: Tuesday, May 31, 2016 7:38 PM
>>> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
>>> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
>>> ceph-devel@xxxxxxxxxxxxxxx
>>> Subject: Re: RocksDB Incorrect API Usage
>>>
>>> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
>>> wrote:
>>>> On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx>
>>> wrote:
>>>>> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo
>>> <jianjian.huo@xxxxxxxxxxx> wrote:
>>>>>> Hi Haomai,
>>>>>>
>>>>>> I noticed this as well, and made the same changes to RocksDBStore in
>>>>>> this PR last week:
>>>>>> https://github.com/ceph/ceph/pull/9215
>>>>>>
>>>>>> One thing which is even worse: Seek will bypass the row cache, so kv
>>>>>> pairs won't be cached in the row cache at all.
>>>>>> I am working on benchmarking the performance impact, and will publish
>>>>>> the results after I am done this week.
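
For reference, a minimal sketch of the row-cache behaviour being discussed, using the stock RocksDB C++ API (the db path and cache size are made-up values, and this is not the code from the PR):

#include <cassert>
#include <string>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // The row cache holds individual key/value pairs, but it is only consulted
  // and populated on the Get() path; an Iterator::Seek() on the same key goes
  // straight to the block cache / SST files.
  options.row_cache = rocksdb::NewLRUCache(64 << 20 /* 64 MB */);

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rowcache-demo", &db);
  assert(s.ok());

  db->Put(rocksdb::WriteOptions(), "somekey", "somevalue");
  db->Flush(rocksdb::FlushOptions());  // push the pair down into an SST file

  std::string value;
  // Point lookup through Get(): may be answered from (and will fill) the row cache.
  s = db->Get(rocksdb::ReadOptions(), "somekey", &value);
  assert(s.ok() && value == "somevalue");

  delete db;
  return 0;
}
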
>>>>> Oh, cool! I think you can cherry-pick another leveldb fix of mine.
>>>> No problem, I will do that.
>>>>> BTW, have you paid attention to the prefix seek api? I think it will be
>>>>> more suitable than column families in the ceph case. If we can have a
>>>>> well-defined prefix rule, we can make most range queries cheaper!
>>>> When keys with the same prefix are stored in their own SST files (the CF
>>>> case), wouldn't even seeking without a prefix be faster than seeking with
>>>> a prefix among keys mixed with other prefixes?
>>>
>>> From my current view, a prefix could be more flexible. For example, each rgw
>>> bucket index could use its own prefix so that seeks on each bucket index
>>> object stay separated. A CF per bucket would be too heavy.
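
To illustrate the prefix idea, here is a rough sketch using RocksDB's prefix-seek support; the 8-byte fixed prefix and the "bucket01/..." style keys are made-up placeholders, not an actual rgw or ceph key scheme:

#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // Treat the first 8 bytes of every key as its prefix; keys belonging to
  // one "bucket" then share a prefix and can be scanned as a group.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

  // A bloom filter lets lookups skip SST files that cannot contain the prefix.
  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/prefix-demo", &db);
  assert(s.ok());

  db->Put(rocksdb::WriteOptions(), "bucket01/obj-a", "1");
  db->Put(rocksdb::WriteOptions(), "bucket01/obj-b", "2");
  db->Put(rocksdb::WriteOptions(), "bucket02/obj-a", "3");

  rocksdb::ReadOptions ro;
  ro.prefix_same_as_start = true;  // stop iterating once the prefix changes
  rocksdb::Iterator* it = db->NewIterator(ro);
  for (it->Seek("bucket01"); it->Valid(); it->Next()) {
    // visits only the bucket01/* keys, never bucket02/*
  }
  delete it;
  delete db;
  return 0;
}
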
>> This might cause other problems. Would it dramatically increase the number of files that BlueFS needs to manage? If so, that might effectively break that code too (of course it's fixable also :))
> Thanks for bringing up this issue. What's the current limit on the number of files for BlueFS, in your experience?
> Jianjian
There is no limit right now; the fnode ino is a uint64_t. A Directory is a
ref-counted object that contains a file map. As the number of inodes grows,
the file_map will grow and consume more memory.

Varada
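
Not the actual BlueFS code, just a toy sketch of the ref-counted directory / file-map shape described above, to show why memory grows with the number of files (all names here are made up):

#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Illustrative only: field and type names do not match BlueFS's real ones.
struct Fnode {
  uint64_t ino = 0;   // 64-bit inode number, so no practical numeric limit
  uint64_t size = 0;
  // ... extents, mtime, etc.
};

struct Dir {
  // One entry per file: each entry costs heap space for the name, the map
  // node and the shared_ptr control block, so memory scales with file count.
  std::map<std::string, std::shared_ptr<Fnode>> file_map;
};

int main() {
  auto dir = std::make_shared<Dir>();   // directories are ref-counted
  for (uint64_t i = 0; i < 100000; ++i) {
    auto f = std::make_shared<Fnode>();
    f->ino = i + 1;
    dir->file_map["object." + std::to_string(i)] = f;
  }
  return 0;
}
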
>>>> I am not sure what optimization prefix seek uses internally for the
>>>> block-based format, but to me it's hard to beat the case where keys with
>>>> a single prefix are stored separately.
>>>>>> Jianjian
>>>>>>
>>>>>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang <haomai@xxxxxxxx>
>>> wrote:
>>>>>>> Hi Sage and Mark,
>>>>>>>
>>>>>>> As mentioned in the BlueStore standup, I found that the rocksdb iterator
>>>>>>> *Seek* won't use the bloom filter the way *Get* does.
>>>>>>>
>>>>>>> *Get* impl: it looks at the filter first
>>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block_based_table_reader.cc#L1369
>>>>>>>
>>>>>>> Iterator *Seek*: it does a binary search; by default we don't specify the
>>>>>>> prefix feature (https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes).
>>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L94
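
To make the two code paths concrete, here is a small sketch of the two point-lookup styles (plain RocksDB C++ API; the helper names and db path are just for illustration):

#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Point lookup via Get(): the block-based table can consult the bloom filter
// first and skip SST files that cannot contain the key.
bool lookup_get(rocksdb::DB* db, const rocksdb::Slice& key, std::string* value) {
  return db->Get(rocksdb::ReadOptions(), key, value).ok();
}

// Point lookup via Iterator::Seek(): with no prefix extractor configured this
// is a total-order seek, i.e. a binary search through the index and data
// blocks with no bloom-filter short cut.
bool lookup_seek(rocksdb::DB* db, const rocksdb::Slice& key, std::string* value) {
  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  it->Seek(key);
  bool found = it->Valid() && it->key().compare(key) == 0;
  if (found) *value = it->value().ToString();
  delete it;
  return found;
}

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/seek-vs-get-demo", &db);
  assert(s.ok());

  db->Put(rocksdb::WriteOptions(), "k1", "v1");
  std::string value;
  assert(lookup_get(db, "k1", &value));   // filter-assisted point lookup
  assert(lookup_seek(db, "k1", &value));  // iterator-based lookup, no filter
  delete db;
  return 0;
}
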
>>>>>>>
>>>>>>> So I ran a simple test:
>>>>>>>
>>>>>>> ./db_bench -num 10000000 -benchmarks fillbatch
>>>>>>> to fill the db first with 10 million records.
>>>>>>>
>>>>>>> ./db_bench -use_existing_db  -benchmarks readrandomfast
>>>>>>> The readrandomfast case uses the *Get* API to retrieve data
>>>>>>>
>>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
>>>>>>> -benchmarks readrandomfast
>>>>>>>
>>>>>>> LevelDB:    version 4.3
>>>>>>> Date:       Wed Jun  1 00:29:16 2016
>>>>>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>>>>>>> CPUCache:   20480 KB
>>>>>>> Keys:       16 bytes each
>>>>>>> Values:     100 bytes each (50 bytes after compression)
>>>>>>> Entries:    1000000
>>>>>>> Prefix:    0 bytes
>>>>>>> Keys per prefix:    0
>>>>>>> RawSize:    110.6 MB (estimated)
>>>>>>> FileSize:   62.9 MB (estimated)
>>>>>>> Writes per second: 0
>>>>>>> Compression: Snappy
>>>>>>> Memtablerep: skip_list
>>>>>>> Perf Level: 0
>>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
>>>>>>> ------------------------------------------------
>>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
>>>>>>> readrandomfast :       4.570 micros/op 218806 ops/sec; (1000100 of
>>>>>>> 1000100 found, issued 46639 non-exist keys)
>>>>>>>
>>>>>>> ===========================
>>>>>>> Then I modified readrandomfast to use the Iterator API [0]:
>>>>>>>
>>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
>>>>>>> -benchmarks readrandomfast
>>>>>>> LevelDB:    version 4.3
>>>>>>> Date:       Wed Jun  1 00:33:03 2016
>>>>>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>>>>>>> CPUCache:   20480 KB
>>>>>>> Keys:       16 bytes each
>>>>>>> Values:     100 bytes each (50 bytes after compression)
>>>>>>> Entries:    1000000
>>>>>>> Prefix:    0 bytes
>>>>>>> Keys per prefix:    0
>>>>>>> RawSize:    110.6 MB (estimated)
>>>>>>> FileSize:   62.9 MB (estimated)
>>>>>>> Writes per second: 0
>>>>>>> Compression: Snappy
>>>>>>> Memtablerep: skip_list
>>>>>>> Perf Level: 0
>>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
>>>>>>> ------------------------------------------------
>>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
>>>>>>> readrandomfast :      45.188 micros/op 22129 ops/sec; (1000100 of
>>>>>>> 1000100 found, issued 46639 non-exist keys)
>>>>>>>
>>>>>>>
>>>>>>> 45.18 us/op vs 4.57 us/op!
>>>>>>>
>>>>>>> The test can be repeated and is easy to do! Please correct me if I'm
>>>>>>> doing something foolish that I'm not aware of...
>>>>>>>
>>>>>>> So I propose this PR: https://github.com/ceph/ceph/pull/9411
>>>>>>>
>>>>>>> We can still make further improvements by scanning all iterator usage
>>>>>>> and making it better!
>>>>>>>
>>>>>>> [0]:
>>>>>>> --- a/db/db_bench.cc
>>>>>>> +++ b/db/db_bench.cc
>>>>>>> @@ -2923,14 +2923,13 @@ class Benchmark {
>>>>>>>          int64_t key_rand = thread->rand.Next() & (pot - 1);
>>>>>>>          GenerateKeyFromInt(key_rand, FLAGS_num, &key);
>>>>>>>          ++read;
>>>>>>> -        auto status = db->Get(options, key, &value);
>>>>>>> -        if (status.ok()) {
>>>>>>> -          ++found;
>>>>>>> -        } else if (!status.IsNotFound()) {
>>>>>>> -          fprintf(stderr, "Get returned an error: %s\n",
>>>>>>> -                  status.ToString().c_str());
>>>>>>> -          abort();
>>>>>>> -        }
>>>>>>> +        Iterator* iter = db->NewIterator(options);
>>>>>>> +        iter->Seek(key);
>>>>>>> +        if (iter->Valid() && iter->key().compare(key) == 0) {
>>>>>>> +          found++;
>>>>>>> +        }
>>>>>>> +        delete iter;
>>>>>>> +
>>>>>>>          if (key_rand >= FLAGS_num) {
>>>>>>>            ++nonexist;
>>>>>>>          }




