> -----Original Message-----
> From: Varada Kari
> Sent: Tuesday, May 31, 2016 8:38 PM
> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>; Allen Samuels
> <Allen.Samuels@xxxxxxxxxxx>; Haomai Wang <haomai@xxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: RocksDB Incorrect API Usage
>
> On Wednesday 01 June 2016 08:54 AM, Jianjian Huo wrote:
> > Hi Allen,
> >
> > On Tue, May 31, 2016 at 7:52 PM, Allen Samuels
> > <Allen.Samuels@xxxxxxxxxxx> wrote:
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
> >>> Sent: Tuesday, May 31, 2016 7:38 PM
> >>> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
> >>> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> >>> ceph-devel@xxxxxxxxxxxxxxx
> >>> Subject: Re: RocksDB Incorrect API Usage
> >>>
> >>> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo
> >>> <jianjian.huo@xxxxxxxxxxx> wrote:
> >>>> On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >>>>> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo
> >>>>> <jianjian.huo@xxxxxxxxxxx> wrote:
> >>>>>> Hi Haomai,
> >>>>>>
> >>>>>> I noticed this as well, and made the same changes to RocksDBStore
> >>>>>> in this PR last week:
> >>>>>> https://github.com/ceph/ceph/pull/9215
> >>>>>>
> >>>>>> One thing which is even worse: seek will bypass the row cache, so
> >>>>>> kv pairs won't be cached in the row cache.
> >>>>>> I am working to benchmark the performance impact, and will publish
> >>>>>> the results after I am done this week.
> >>>>> Oh, cool! I think you can cherry-pick my other leveldb fix.
> >>>> No problem, I will do that.
> >>>>> BTW, have you paid attention to the prefix seek API? I think it will
> >>>>> be more suitable than column families in the ceph case. If we can
> >>>>> have a well-defined prefix rule, we can make most range queries
> >>>>> cheaper!
> >>>> When keys with the same prefix are stored in their own SST files
> >>>> (the CF case), won't even seeking without a prefix be faster than
> >>>> seeking with a prefix among keys mixed with different prefixes?
> >>>
> >>> From my current view, prefix could be more flexible. For example,
> >>> each rgw bucket index could use one prefix to keep seeks on each
> >>> bucket index object separated. A CF per bucket would be too heavy.
> >> This might cause other problems. Would it dramatically increase the
> >> number of files that BlueFS needs to manage? If so, that might
> >> effectively break that code too (of course it's fixable also :))
> > Thanks for bringing up this issue. What's the current limit on the
> > number of files for BlueFS, from your experience?
> >
> > Jianjian
> There is no limit right now; the fnode ino is a uint64_t. A directory is
> a ref-counted object containing a file map. As the number of inodes
> grows, the file_map will grow and consume more memory.
>
> Varada

Technically correct; my concern is one of performance for BlueFS journal
compaction.

> >>>> I am not sure what optimization prefix seek uses internally for the
> >>>> block-based format, but to me it's hard to beat the case where you
> >>>> only have one set of prefixed keys stored separately.
> >>>>>> Jianjian
> >>>>>>
> >>>>>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >>>>>>> Hi Sage and Mark,
> >>>>>>>
> >>>>>>> As mentioned in the BlueStore standup, I found that the rocksdb
> >>>>>>> iterator *Seek* won't use the bloom filter like *Get* does.
> >>>>>>>
> >>>>>>> *Get* impl: it will look at the filter first:
> >>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block_based_table_reader.cc#L1369
> >>>>>>>
> >>>>>>> Iterator *Seek*: it will do a binary search; by default we don't
> >>>>>>> specify the prefix feature
> >>>>>>> (https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes):
> >>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L94
> >>>>>>>
> >>>>>>> So I ran a simple test:
> >>>>>>>
> >>>>>>> ./db_bench -num 10000000 -benchmarks fillbatch
> >>>>>>> fills the db first with 10 million records.
> >>>>>>>
> >>>>>>> ./db_bench -use_existing_db -benchmarks readrandomfast
> >>>>>>> the readrandomfast case uses the *Get* API to retrieve data:
> >>>>>>>
> >>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>>>>> -benchmarks readrandomfast
> >>>>>>>
> >>>>>>> LevelDB: version 4.3
> >>>>>>> Date: Wed Jun 1 00:29:16 2016
> >>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>>>>> CPUCache: 20480 KB
> >>>>>>> Keys: 16 bytes each
> >>>>>>> Values: 100 bytes each (50 bytes after compression)
> >>>>>>> Entries: 1000000
> >>>>>>> Prefix: 0 bytes
> >>>>>>> Keys per prefix: 0
> >>>>>>> RawSize: 110.6 MB (estimated)
> >>>>>>> FileSize: 62.9 MB (estimated)
> >>>>>>> Writes per second: 0
> >>>>>>> Compression: Snappy
> >>>>>>> Memtablerep: skip_list
> >>>>>>> Perf Level: 0
> >>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>>>>> ------------------------------------------------
> >>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>>>>> readrandomfast : 4.570 micros/op 218806 ops/sec; (1000100 of
> >>>>>>> 1000100 found, issued 46639 non-exist keys)
> >>>>>>>
> >>>>>>> ===========================
> >>>>>>> then I modified readrandomfast to use the Iterator API [0]:
> >>>>>>>
> >>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>>>>> -benchmarks readrandomfast
> >>>>>>> LevelDB: version 4.3
> >>>>>>> Date: Wed Jun 1 00:33:03 2016
> >>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>>>>> CPUCache: 20480 KB
> >>>>>>> Keys: 16 bytes each
> >>>>>>> Values: 100 bytes each (50 bytes after compression)
> >>>>>>> Entries: 1000000
> >>>>>>> Prefix: 0 bytes
> >>>>>>> Keys per prefix: 0
> >>>>>>> RawSize: 110.6 MB (estimated)
> >>>>>>> FileSize: 62.9 MB (estimated)
> >>>>>>> Writes per second: 0
> >>>>>>> Compression: Snappy
> >>>>>>> Memtablerep: skip_list
> >>>>>>> Perf Level: 0
> >>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>>>>> ------------------------------------------------
> >>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>>>>> readrandomfast : 45.188 micros/op 22129 ops/sec; (1000100 of
> >>>>>>> 1000100 found, issued 46639 non-exist keys)
> >>>>>>>
> >>>>>>>
> >>>>>>> 45.18 us/op vs 4.57 us/op!
> >>>>>>>
> >>>>>>> The test can be repeated and is easy to run! Please correct me
> >>>>>>> if I'm doing something foolish I'm not aware of.
> >>>>>>>
> >>>>>>> So I proposed this PR: https://github.com/ceph/ceph/pull/9411
> >>>>>>>
> >>>>>>> We can still make further improvements by scanning all iterate
> >>>>>>> usage to make it better!
> >>>>>>>
> >>>>>>> [0]:
> >>>>>>> --- a/db/db_bench.cc
> >>>>>>> +++ b/db/db_bench.cc
> >>>>>>> @@ -2923,14 +2923,13 @@ class Benchmark {
> >>>>>>>          int64_t key_rand = thread->rand.Next() & (pot - 1);
> >>>>>>>          GenerateKeyFromInt(key_rand, FLAGS_num, &key);
> >>>>>>>          ++read;
> >>>>>>> -        auto status = db->Get(options, key, &value);
> >>>>>>> -        if (status.ok()) {
> >>>>>>> -          ++found;
> >>>>>>> -        } else if (!status.IsNotFound()) {
> >>>>>>> -          fprintf(stderr, "Get returned an error: %s\n",
> >>>>>>> -                  status.ToString().c_str());
> >>>>>>> -          abort();
> >>>>>>> -        }
> >>>>>>> +        Iterator* iter = db->NewIterator(options);
> >>>>>>> +        iter->Seek(key);
> >>>>>>> +        if (iter->Valid() && iter->key().compare(key) == 0) {
> >>>>>>> +          found++;
> >>>>>>> +        }
> >>>>>>> +        delete iter;  // don't leak an iterator per op
> >>>>>>>          if (key_rand >= FLAGS_num) {
> >>>>>>>            ++nonexist;
> >>>>>>>          }
> >>>>>>> --
> >>>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>>> ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html