> -----Original Message-----
> From: Varada Kari
> Sent: Tuesday, May 31, 2016 8:38 PM
> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>; Allen Samuels
> <Allen.Samuels@xxxxxxxxxxx>; Haomai Wang <haomai@xxxxxxxx>
> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: RocksDB Incorrect API Usage
>
> On Wednesday 01 June 2016 08:54 AM, Jianjian Huo wrote:
> > Hi Allen,
> >
> > On Tue, May 31, 2016 at 7:52 PM, Allen Samuels
> > <Allen.Samuels@xxxxxxxxxxx> wrote:
> >>> -----Original Message-----
> >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
> >>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang
> >>> Sent: Tuesday, May 31, 2016 7:38 PM
> >>> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>
> >>> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>;
> >>> ceph-devel@xxxxxxxxxxxxxxx
> >>> Subject: Re: RocksDB Incorrect API Usage
> >>>
> >>> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo
> >>> <jianjian.huo@xxxxxxxxxxx> wrote:
> >>>> On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >>>>> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo
> >>>>> <jianjian.huo@xxxxxxxxxxx> wrote:
> >>>>>> Hi Haomai,
> >>>>>>
> >>>>>> I noticed this as well, and made the same changes to RocksDBStore
> >>>>>> in this PR last week:
> >>>>>> https://github.com/ceph/ceph/pull/9215
> >>>>>>
> >>>>>> One thing which is even worse: seek will bypass the row cache, so
> >>>>>> kv pairs won't be cached in the row cache.
> >>>>>> I am working to benchmark the performance impact, and will publish
> >>>>>> the results after I am done this week.
> >>>>> Oh, cool! I think you can cherry-pick my other leveldb fix.
> >>>> No problem, I will do that.
> >>>>> BTW, have you paid attention to the prefix seek API? I think it will
> >>>>> be more suitable than column families in the ceph case. If we can
> >>>>> have a well-defined prefix rule, we can make most range queries
> >>>>> cheaper!
> >>>> When keys with the same prefix are stored in their own SST files
> >>>> (the CF case), won't even seeking without a prefix be faster than
> >>>> seeking with a prefix among keys mixed with different prefixes?
> >>>
> >>> From my current view, prefix could be more flexible. For example,
> >>> each rgw bucket index could use one prefix to keep seeks on each
> >>> bucket index object separated. A CF per bucket would be too heavy.
> >> This might cause other problems. Would it dramatically increase the
> >> number of files that BlueFS needs to manage? If so, that might
> >> effectively break that code too (of course it's fixable also :))
> > Thanks for bringing up this issue. What's the current limit on the
> > number of files for BlueFS, from your experience?
> >
> > Jianjian
> There is no limit right now; the fnode ino is a uint64_t. A directory is
> a ref-counted object containing a file map. As the number of inodes
> grows, the file_map will grow and consume more memory.
>
> Varada

Technically correct; my concern is one of performance for BlueFS journal
compaction.

> >>>> I am not sure what optimization prefix seek uses internally for the
> >>>> block-based format, but to me it's hard to beat the case where you
> >>>> only have one set of prefixed keys stored separately.
> >>>>>> Jianjian
> >>>>>>
> >>>>>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang <haomai@xxxxxxxx> wrote:
> >>>>>>> Hi Sage and Mark,
> >>>>>>>
> >>>>>>> As mentioned in the BlueStore standup, I found that the rocksdb
> >>>>>>> iterator *Seek* won't use the bloom filter like *Get* does.
> >>>>>>>
> >>>>>>> *Get* impl: it will look at the filter first:
> >>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block_based_table_reader.cc#L1369
> >>>>>>>
> >>>>>>> Iterator *Seek*: it will do a binary search; by default we don't
> >>>>>>> specify the prefix feature
> >>>>>>> (https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API-Changes):
> >>>>>>> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L94
> >>>>>>>
> >>>>>>> So I ran a simple test:
> >>>>>>>
> >>>>>>> ./db_bench -num 10000000 -benchmarks fillbatch
> >>>>>>> fills the db first with 10 million records.
> >>>>>>>
> >>>>>>> ./db_bench -use_existing_db -benchmarks readrandomfast
> >>>>>>> the readrandomfast case uses the *Get* API to retrieve data:
> >>>>>>>
> >>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>>>>> -benchmarks readrandomfast
> >>>>>>>
> >>>>>>> LevelDB: version 4.3
> >>>>>>> Date: Wed Jun 1 00:29:16 2016
> >>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>>>>> CPUCache: 20480 KB
> >>>>>>> Keys: 16 bytes each
> >>>>>>> Values: 100 bytes each (50 bytes after compression)
> >>>>>>> Entries: 1000000
> >>>>>>> Prefix: 0 bytes
> >>>>>>> Keys per prefix: 0
> >>>>>>> RawSize: 110.6 MB (estimated)
> >>>>>>> FileSize: 62.9 MB (estimated)
> >>>>>>> Writes per second: 0
> >>>>>>> Compression: Snappy
> >>>>>>> Memtablerep: skip_list
> >>>>>>> Perf Level: 0
> >>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>>>>> ------------------------------------------------
> >>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>>>>> readrandomfast : 4.570 micros/op 218806 ops/sec; (1000100 of
> >>>>>>> 1000100 found, issued 46639 non-exist keys)
> >>>>>>>
> >>>>>>> ===========================
> >>>>>>> then I modified readrandomfast to use the Iterator API [0]:
> >>>>>>>
> >>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db
> >>>>>>> -benchmarks readrandomfast
> >>>>>>> LevelDB: version 4.3
> >>>>>>> Date: Wed Jun 1 00:33:03 2016
> >>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
> >>>>>>> CPUCache: 20480 KB
> >>>>>>> Keys: 16 bytes each
> >>>>>>> Values: 100 bytes each (50 bytes after compression)
> >>>>>>> Entries: 1000000
> >>>>>>> Prefix: 0 bytes
> >>>>>>> Keys per prefix: 0
> >>>>>>> RawSize: 110.6 MB (estimated)
> >>>>>>> FileSize: 62.9 MB (estimated)
> >>>>>>> Writes per second: 0
> >>>>>>> Compression: Snappy
> >>>>>>> Memtablerep: skip_list
> >>>>>>> Perf Level: 0
> >>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow
> >>>>>>> ------------------------------------------------
> >>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench]
> >>>>>>> readrandomfast : 45.188 micros/op 22129 ops/sec; (1000100 of
> >>>>>>> 1000100 found, issued 46639 non-exist keys)
> >>>>>>>
> >>>>>>>
> >>>>>>> 45.18 us/op vs 4.57 us/op!
> >>>>>>>
> >>>>>>> The test can be repeated and is easy to run! Please correct me
> >>>>>>> if I'm doing something foolish I'm not aware of.
> >>>>>>>
> >>>>>>> So I proposed this PR: https://github.com/ceph/ceph/pull/9411
> >>>>>>>
> >>>>>>> We can still make further improvements by scanning all iterate
> >>>>>>> usage to make it better!
> >>>>>>>
> >>>>>>> [0]:
> >>>>>>> --- a/db/db_bench.cc
> >>>>>>> +++ b/db/db_bench.cc
> >>>>>>> @@ -2923,14 +2923,13 @@ class Benchmark {
> >>>>>>>          int64_t key_rand = thread->rand.Next() & (pot - 1);
> >>>>>>>          GenerateKeyFromInt(key_rand, FLAGS_num, &key);
> >>>>>>>          ++read;
> >>>>>>> -        auto status = db->Get(options, key, &value);
> >>>>>>> -        if (status.ok()) {
> >>>>>>> -          ++found;
> >>>>>>> -        } else if (!status.IsNotFound()) {
> >>>>>>> -          fprintf(stderr, "Get returned an error: %s\n",
> >>>>>>> -                  status.ToString().c_str());
> >>>>>>> -          abort();
> >>>>>>> -        }
> >>>>>>> +        Iterator* iter = db->NewIterator(options);
> >>>>>>> +        iter->Seek(key);
> >>>>>>> +        if (iter->Valid() && iter->key().compare(key) == 0) {
> >>>>>>> +          found++;
> >>>>>>> +        }
> >>>>>>> +        delete iter;  // don't leak an iterator per op
> >>>>>>>          if (key_rand >= FLAGS_num) {
> >>>>>>>            ++nonexist;
> >>>>>>>          }
> >>>>>>> --
> >>>>>>> To unsubscribe from this list: send the line "unsubscribe
> >>>>>>> ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html