RE: RocksDB Incorrect API Usage

On Tue, May 31, 2016 at 7:38 PM, Haomai Wang <haomai@xxxxxxxx> wrote:
> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo <jianjian.huo@xxxxxxxxxxx> wrote:
>>
>> On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>>> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo <jianjian.huo@xxxxxxxxxxx> wrote:
>>>> Hi Haomai,
>>>>
>>>> I noticed this as well, and made same changes to RocksDBStore in this PR last week:
>>>> https://github.com/ceph/ceph/pull/9215
>>>>
>>>> Something even worse: seek will also bypass the row cache, so kv pairs read that way can't be cached in the row cache at all.
>>>> I am working on benchmarking the performance impact and will publish the results once I'm done this week.
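
Just to make the row cache point above concrete, here is a minimal sketch of enabling it (this is not RocksDBStore code; the cache size, path and keys are arbitrary). Only Get() fills and consults this cache, which is why Seek-based reads bypass it:

#include <cassert>
#include <string>
#include "rocksdb/cache.h"
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Only Get() fills and consults the row cache; iterator Seek bypasses it.
  options.row_cache = rocksdb::NewLRUCache(64 << 20);  // 64 MB, size picked arbitrarily

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rowcache_test", &db);
  assert(s.ok());

  std::string value;
  db->Put(rocksdb::WriteOptions(), "key1", "value1");
  s = db->Get(rocksdb::ReadOptions(), "key1", &value);  // a repeated Get of "key1" can now hit the row cache
  assert(s.ok());

  delete db;
  return 0;
}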
>>>
>>> Oh, cool! I think you can also cherry-pick another leveldb fix of mine.
>>
>> No problem, I will do that.
>>>
>>> BTW, have you looked at the prefix seek API? I think it will be
>>> more suitable than column families in the ceph case. If we have a
>>> well-defined prefix rule, we can make most range queries cheaper!
>>
>> When keys with the same prefix are stored in their own SST files (the CF case), won't even a plain seek without a prefix be faster than a prefix seek over files that mix keys with different prefixes?
>
> From my current view, a prefix could be more flexible. For example, each
> rgw bucket index could use its own prefix so that seeks for each bucket
> index object stay separated. A CF per bucket would be too heavy.

Sure, no doubt prefix seek will help some cases and CF will help others. Prefix will mostly benefit seek, while CF has a broader impact. A rough sketch of the prefix-seek setup I have in mind is below.
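
As far as I understand it, the setup would look roughly like this. The fixed 8-byte prefix, the db path and the "bucket01" key are all made up for illustration; the real prefix rule would have to match how we actually encode keys. The attraction, as I understand it, is that with a prefix extractor configured RocksDB can also build prefix bloom filters, so Seek gets some of the filtering help that Get already enjoys.

#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Assumption: keys start with a fixed-length 8-byte prefix (e.g. a bucket id).
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/prefix_test", &db);
  if (!s.ok()) return 1;

  rocksdb::ReadOptions ropts;
  ropts.prefix_same_as_start = true;  // stop once the iterator leaves the seek key's prefix
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ropts));
  for (it->Seek("bucket01"); it->Valid(); it->Next()) {
    // only keys sharing the "bucket01" prefix are visited here
  }

  delete db;
  return 0;
}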
>
>> I am not sure what optimization prefix seek uses internally for the block-based format, but to me it's hard to beat the case where keys with a single prefix are stored separately.
>>>
>>>>
>>>> Jianjian
>>>>
>>>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
>>>>> Hi Sage and Mark,
>>>>>
>>>>> As mentioned in the BlueStore standup, I found that the rocksdb iterator *Seek*
>>>>> won't use the bloom filter the way *Get* does.
>>>>>
>>>>> *Get* impl: it checks the filter first:
>>>>> https://github.com/facebook/rocksdb/blob/master/table/block_based_table_reader.cc#L1369
>>>>>
>>>>> Iterator *Seek*: it does a binary search over the index, because by default we
>>>>> don't enable the prefix feature (https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes).
>>>>> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L94
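
To spell out the two code paths for a single point lookup, a minimal sketch (not code from either PR; it only assumes an already-open rocksdb::DB with a bloom-filter-enabled table format):

#include <memory>
#include <string>
#include "rocksdb/db.h"

// Get() can consult the bloom filter (and the row cache) and skip
// SST files that cannot contain the key.
bool exists_via_get(rocksdb::DB* db, const rocksdb::Slice& key) {
  std::string value;
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key, &value);
  return s.ok();
}

// Seek() binary-searches the index blocks of every overlapping SST;
// without a prefix extractor it gets no help from the bloom filter.
bool exists_via_seek(rocksdb::DB* db, const rocksdb::Slice& key) {
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  it->Seek(key);
  return it->Valid() && it->key() == key;
}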
>>>>>
>>>>> So I ran a simple test:
>>>>>
>>>>> ./db_bench -num 10000000  -benchmarks fillbatch
>>>>> first fill the db with 10 million records.
>>>>>
>>>>> ./db_bench -use_existing_db  -benchmarks readrandomfast
>>>>> the readrandomfast case uses the *Get* API to retrieve data
>>>>>
>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db  -benchmarks
>>>>> readrandomfast
>>>>>
>>>>> LevelDB:    version 4.3
>>>>> Date:       Wed Jun  1 00:29:16 2016
>>>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>>>>> CPUCache:   20480 KB
>>>>> Keys:       16 bytes each
>>>>> Values:     100 bytes each (50 bytes after compression)
>>>>> Entries:    1000000
>>>>> Prefix:    0 bytes
>>>>> Keys per prefix:    0
>>>>> RawSize:    110.6 MB (estimated)
>>>>> FileSize:   62.9 MB (estimated)
>>>>> Writes per second: 0
>>>>> Compression: Snappy
>>>>> Memtablerep: skip_list
>>>>> Perf Level: 0
>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
>>>>> ------------------------------------------------
>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
>>>>> readrandomfast :       4.570 micros/op 218806 ops/sec; (1000100 of
>>>>> 1000100 found, issued 46639 non-exist keys)
>>>>>
>>>>> ===========================
>>>>> then I modified readrandomfast to use the Iterator API[0]:
>>>>>
>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db  -benchmarks
>>>>> readrandomfast
>>>>> LevelDB:    version 4.3
>>>>> Date:       Wed Jun  1 00:33:03 2016
>>>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>>>>> CPUCache:   20480 KB
>>>>> Keys:       16 bytes each
>>>>> Values:     100 bytes each (50 bytes after compression)
>>>>> Entries:    1000000
>>>>> Prefix:    0 bytes
>>>>> Keys per prefix:    0
>>>>> RawSize:    110.6 MB (estimated)
>>>>> FileSize:   62.9 MB (estimated)
>>>>> Writes per second: 0
>>>>> Compression: Snappy
>>>>> Memtablerep: skip_list
>>>>> Perf Level: 0
>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
>>>>> ------------------------------------------------
>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
>>>>> readrandomfast :      45.188 micros/op 22129 ops/sec; (1000100 of
>>>>> 1000100 found, issued 46639 non-exist keys)
>>>>>
>>>>>
>>>>> 45.18 us/op vs 4.57 us/op!
>>>>>
>>>>> The test is easy to repeat! Please correct me if I'm doing something
>>>>> foolish that I'm not aware of.
>>>>>
>>>>> So I propose this PR: https://github.com/ceph/ceph/pull/9411
>>>>>
>>>>> We can still make further improvements by auditing all of our iterator
>>>>> usage!
>>>>>
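
On the iterator-usage audit: where we genuinely need a range scan, one thing worth checking is whether the iterator is bounded. A rough sketch, with purely illustrative key names, of using ReadOptions::iterate_upper_bound so a scan cannot wander past the range we care about:

#include <memory>
#include "rocksdb/db.h"

// Rough sketch of a bounded range scan; `db` is an already-open rocksdb::DB*.
void scan_range(rocksdb::DB* db) {
  rocksdb::Slice upper("prefix1\xff");   // must outlive the iterator
  rocksdb::ReadOptions ropts;
  ropts.iterate_upper_bound = &upper;    // iterator becomes !Valid() at the bound

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ropts));
  for (it->Seek("prefix1"); it->Valid(); it->Next()) {
    // process it->key() / it->value(); never goes past `upper`
  }
}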
>>>>> [0]:
>>>>> --- a/db/db_bench.cc
>>>>> +++ b/db/db_bench.cc
>>>>> @@ -2923,14 +2923,13 @@ class Benchmark {
>>>>>          int64_t key_rand = thread->rand.Next() & (pot - 1);
>>>>>          GenerateKeyFromInt(key_rand, FLAGS_num, &key);
>>>>>          ++read;
>>>>> -        auto status = db->Get(options, key, &value);
>>>>> -        if (status.ok()) {
>>>>> -          ++found;
>>>>> -        } else if (!status.IsNotFound()) {
>>>>> -          fprintf(stderr, "Get returned an error: %s\n",
>>>>> -                  status.ToString().c_str());
>>>>> -          abort();
>>>>> -        }
>>>>> +        // Do the same point lookup through an iterator Seek instead of Get.
>>>>> +        Iterator* iter = db->NewIterator(options);
>>>>> +        iter->Seek(key);
>>>>> +        if (iter->Valid() && iter->key().compare(key) == 0) {
>>>>> +          ++found;
>>>>> +        }
>>>>> +        delete iter;  // avoid leaking one iterator per op
>>>>>          if (key_rand >= FLAGS_num) {
>>>>>            ++nonexist;
>>>>>          }