RE: RocksDB Incorrect API Usage

Allen Samuels <Allen.Samuels@xxxxxxxxxxx> · Wed, 1 Jun 2016 02:52:23 +0000



> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
> Sent: Tuesday, May 31, 2016 7:38 PM
> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: RocksDB Incorrect API Usage
> 
> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
> wrote:
> >
> > On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx>
> wrote:
> >> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo
> <jianjian.huo@xxxxxxxxxxx> wrote:
> >>> Hi Haomai,
> >>>
> >>> I noticed this as well, and made same changes to RocksDBStore in this PR
> last week:
> >>> https://github.com/ceph/ceph/pull/9215
> >>>
> >>> One thing which is even worse,  seek will bypass row cache, so kv pairs
> won't be able to be cached in row cache.
> >>> I am working to benchmark the performance impact, will publish the
> results after I am done this week.
> >>
> >> Oh, cool! I think you can cherry-pick my another leveldb fix.
> >
> > No problem, I will do that.
> >>
> >> BTW, do you pay an attention to prefix seek api? I think it will be
> >> more suitable than column family in ceph case. If we can have
> >> well-defined prefix rule, we can make most of range query cheaper!
> >
> > When keys with same prefix are stored in their own SST files(CF case),
> even seeking without prefix will be faster than seeking with prefix but mixed
> with different prefixed keys?
> 
> From my current view, prefix could be more flexible. For example, each rgw
> bucket index could use one prefix to make each bucket index object seek
> separated. For CF, it would be too heavry.

This might cause other problems. Would it dramatically increase the number of files that BlueFS needs to manage? If so, that might effectively break that code too (of course it's fixable also :))
> 
> > I am not sure what optimization prefix seek will use internally for block
> based format, but to me, it's hard to beat the case when you only have one
> prefixed keys stored separately.
> >>
> >>>
> >>> Jianjian
> >>>
> >>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang <haomai@xxxxxxxx>
> wrote:
> >>>> Hi Sage and Mark,
> >>>>
> >>>> As mentioned in BlueStore standup, I found rocksdb iterator *Seek*
> >>>> won't use bloom filter like *Get*.
> >>>>
> >>>> *Get* impl: it will look at filter firstly
> >>>>
> https://github.com/facebook/rocksdb/blob/master/table/block_based_t
> >>>> able_reader.cc#L1369
> >>>>
> >>>> Iterator *Seek*: it will do binary search, by default we don't
> >>>> specify prefix
> feature(https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-
> Changes).
> >>>> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L94
> >>>>
> >>>> So I use a simple tests:
> >>>>
> >>>> ./db_bench -num 10000000  -benchmarks fillbatch fill the db firstly
> >>>> with 1000w records.
> >>>>
> >>>> ./db_bench -use_existing_db  -benchmarks readrandomfast
> >>>> readrandomfast case will use *Get* API to retrive data
> >>>>
> >>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>> -benchmarks readrandomfast
> >>>>
> >>>> LevelDB:    version 4.3
> >>>> Date:       Wed Jun  1 00:29:16 2016
> >>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>> CPUCache:   20480 KB
> >>>> Keys:       16 bytes each
> >>>> Values:     100 bytes each (50 bytes after compression)
> >>>> Entries:    1000000
> >>>> Prefix:    0 bytes
> >>>> Keys per prefix:    0
> >>>> RawSize:    110.6 MB (estimated)
> >>>> FileSize:   62.9 MB (estimated)
> >>>> Writes per second: 0
> >>>> Compression: Snappy
> >>>> Memtablerep: skip_list
> >>>> Perf Level: 0
> >>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>> ------------------------------------------------
> >>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>> readrandomfast :       4.570 micros/op 218806 ops/sec; (1000100 of
> >>>> 1000100 found, issued 46639 non-exist keys)
> >>>>
> >>>> ===========================
> >>>> then I modify readrandomfast to use Iterator API[0]:
> >>>>
> >>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>> -benchmarks readrandomfast
> >>>> LevelDB:    version 4.3
> >>>> Date:       Wed Jun  1 00:33:03 2016
> >>>> CPU:        32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>> CPUCache:   20480 KB
> >>>> Keys:       16 bytes each
> >>>> Values:     100 bytes each (50 bytes after compression)
> >>>> Entries:    1000000
> >>>> Prefix:    0 bytes
> >>>> Keys per prefix:    0
> >>>> RawSize:    110.6 MB (estimated)
> >>>> FileSize:   62.9 MB (estimated)
> >>>> Writes per second: 0
> >>>> Compression: Snappy
> >>>> Memtablerep: skip_list
> >>>> Perf Level: 0
> >>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>> ------------------------------------------------
> >>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>> readrandomfast :      45.188 micros/op 22129 ops/sec; (1000100 of
> >>>> 1000100 found, issued 46639 non-exist keys)
> >>>>
> >>>>
> >>>> 45.18 us/op vs 4.57us/op!
> >>>>
> >>>> The test can be repeated and easy to do! Plz correct if I'm doing
> >>>> foolish thing I'm not aware..
> >>>>
> >>>> So I proposal this PR: https://github.com/ceph/ceph/pull/9411
> >>>>
> >>>> We still can make further improvements by scanning all iterate
> >>>> usage to make it better!
> >>>>
> >>>> [0]:
> >>>> --- a/db/db_bench.cc
> >>>> +++ b/db/db_bench.cc
> >>>> @@ -2923,14 +2923,12 @@ class Benchmark {
> >>>>          int64_t key_rand = thread->rand.Next() & (pot - 1);
> >>>>          GenerateKeyFromInt(key_rand, FLAGS_num, &key);
> >>>>          ++read;
> >>>> -        auto status = db->Get(options, key, &value);
> >>>> -        if (status.ok()) {
> >>>> -          ++found;
> >>>> -        } else if (!status.IsNotFound()) {
> >>>> -          fprintf(stderr, "Get returned an error: %s\n",
> >>>> -                  status.ToString().c_str());
> >>>> -          abort();
> >>>> -        }
> >>>> +        Iterator* iter = db->NewIterator(options);
> >>>> +      iter->Seek(key);
> >>>> +      if (iter->Valid() && iter->key().compare(key) == 0) {
> >>>> +        found++;
> >>>> +      }
> >>>> +
> >>>>          if (key_rand >= FLAGS_num) {
> >>>>            ++nonexist;
> >>>>          }
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe
> >>>> ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the
> body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at
> http://vger.kernel.org/majordomo-info.html
��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f