On Wednesday 01 June 2016 07:16 PM, Allen Samuels wrote: >> -----Original Message----- >> From: Varada Kari >> Sent: Tuesday, May 31, 2016 8:38 PM >> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx>; Allen Samuels >> <Allen.Samuels@xxxxxxxxxxx>; Haomai Wang <haomai@xxxxxxxx> >> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson <mnelson@xxxxxxxxxx>; >> ceph-devel@xxxxxxxxxxxxxxx >> Subject: Re: RocksDB Incorrect API Usage >> >> >> >> On Wednesday 01 June 2016 08:54 AM, Jianjian Huo wrote: >>> Hi Allen, >>> >>> On Tue, May 31, 2016 at 7:52 PM, Allen Samuels >> <Allen.Samuels@xxxxxxxxxxx> wrote: >>>>> -----Original Message----- >>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- >>>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Haomai Wang >>>>> Sent: Tuesday, May 31, 2016 7:38 PM >>>>> To: Jianjian Huo <jianjian.huo@xxxxxxxxxxx> >>>>> Cc: Sage Weil <sweil@xxxxxxxxxx>; Mark Nelson >> <mnelson@xxxxxxxxxx>; >>>>> ceph-devel@xxxxxxxxxxxxxxx >>>>> Subject: Re: RocksDB Incorrect API Usage >>>>> >>>>> On Wed, Jun 1, 2016 at 3:40 AM, Jianjian Huo >>>>> <jianjian.huo@xxxxxxxxxxx> >>>>> wrote: >>>>>> On Tue, May 31, 2016 at 11:12 AM, Haomai Wang <haomai@xxxxxxxx> >>>>> wrote: >>>>>>> On Wed, Jun 1, 2016 at 2:08 AM, Jianjian Huo >>>>> <jianjian.huo@xxxxxxxxxxx> wrote: >>>>>>>> Hi Haomai, >>>>>>>> >>>>>>>> I noticed this as well, and made same changes to RocksDBStore in >>>>>>>> this PR >>>>> last week: >>>>>>>> https://github.com/ceph/ceph/pull/9215 >>>>>>>> >>>>>>>> One thing which is even worse, seek will bypass row cache, so kv >>>>>>>> pairs >>>>> won't be able to be cached in row cache. >>>>>>>> I am working to benchmark the performance impact, will publish >>>>>>>> the >>>>> results after I am done this week. >>>>>>> Oh, cool! I think you can cherry-pick my another leveldb fix. >>>>>> No problem, I will do that. >>>>>>> BTW, do you pay an attention to prefix seek api? I think it will >>>>>>> be more suitable than column family in ceph case. If we can have >>>>>>> well-defined prefix rule, we can make most of range query cheaper! >>>>>> When keys with same prefix are stored in their own SST files(CF >>>>>> case), >>>>> even seeking without prefix will be faster than seeking with prefix >>>>> but mixed with different prefixed keys? >>>>> >>>>> From my current view, prefix could be more flexible. For example, >>>>> each rgw bucket index could use one prefix to make each bucket index >>>>> object seek separated. For CF, it would be too heavry. >>>> This might cause other problems. Would it dramatically increase the >>>> number of files that BlueFS needs to manage? If so, that might >>>> effectively break that code too (of course it's fixable also :)) >>> Thanks for bringing up this issue. What's current limit of number of files for >> BlueFS, from your experience? >>> Jianjian >> There is no limit right now, fnode ino is a unit64_t. Directory is a ref counted >> object contains a file map. As the number of inodes grow, file_map will grow, >> will consume more memory. >> >> Varada > Technically correct, my concern is one of performance for BlueFS journal compaction. > Yes. I am working on optimizing the latency part during the compaction. Varada >>>>>> I am not sure what optimization prefix seek will use internally for >>>>>> block >>>>> based format, but to me, it's hard to beat the case when you only >>>>> have one prefixed keys stored separately. >>>>>>>> Jianjian >>>>>>>> >>>>>>>> On Tue, May 31, 2016 at 9:49 AM, Haomai Wang >> <haomai@xxxxxxxx> >>>>> wrote: >>>>>>>>> Hi Sage and Mark, >>>>>>>>> >>>>>>>>> As mentioned in BlueStore standup, I found rocksdb iterator >>>>>>>>> *Seek* won't use bloom filter like *Get*. >>>>>>>>> >>>>>>>>> *Get* impl: it will look at filter firstly >>>>>>>>> >> https://github.com/facebook/rocksdb/blob/master/table/block_based_t >>>>>>>>> able_reader.cc#L1369 >>>>>>>>> >>>>>>>>> Iterator *Seek*: it will do binary search, by default we don't >>>>>>>>> specify prefix >>>>> feature(https://github.com/facebook/rocksdb/wiki/Prefix-Seek-API- >>>>> Changes). >> https://github.com/facebook/rocksdb/blob/master/table/block.cc#L >>>>>>>>> 94 >>>>>>>>> >>>>>>>>> So I use a simple tests: >>>>>>>>> >>>>>>>>> ./db_bench -num 10000000 -benchmarks fillbatch fill the db >>>>>>>>> firstly with 1000w records. >>>>>>>>> >>>>>>>>> ./db_bench -use_existing_db -benchmarks readrandomfast >>>>>>>>> readrandomfast case will use *Get* API to retrive data >>>>>>>>> >>>>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db >>>>>>>>> -benchmarks readrandomfast >>>>>>>>> >>>>>>>>> LevelDB: version 4.3 >>>>>>>>> Date: Wed Jun 1 00:29:16 2016 >>>>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz >>>>>>>>> CPUCache: 20480 KB >>>>>>>>> Keys: 16 bytes each >>>>>>>>> Values: 100 bytes each (50 bytes after compression) >>>>>>>>> Entries: 1000000 >>>>>>>>> Prefix: 0 bytes >>>>>>>>> Keys per prefix: 0 >>>>>>>>> RawSize: 110.6 MB (estimated) >>>>>>>>> FileSize: 62.9 MB (estimated) >>>>>>>>> Writes per second: 0 >>>>>>>>> Compression: Snappy >>>>>>>>> Memtablerep: skip_list >>>>>>>>> Perf Level: 0 >>>>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow >>>>>>>>> ------------------------------------------------ >>>>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench] >>>>>>>>> readrandomfast : 4.570 micros/op 218806 ops/sec; (1000100 of >>>>>>>>> 1000100 found, issued 46639 non-exist keys) >>>>>>>>> >>>>>>>>> =========================== >>>>>>>>> then I modify readrandomfast to use Iterator API[0]: >>>>>>>>> >>>>>>>>> [root@hunter-node2 rocksdb]# ./db_bench -use_existing_db >>>>>>>>> -benchmarks readrandomfast >>>>>>>>> LevelDB: version 4.3 >>>>>>>>> Date: Wed Jun 1 00:33:03 2016 >>>>>>>>> CPU: 32 * Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz >>>>>>>>> CPUCache: 20480 KB >>>>>>>>> Keys: 16 bytes each >>>>>>>>> Values: 100 bytes each (50 bytes after compression) >>>>>>>>> Entries: 1000000 >>>>>>>>> Prefix: 0 bytes >>>>>>>>> Keys per prefix: 0 >>>>>>>>> RawSize: 110.6 MB (estimated) >>>>>>>>> FileSize: 62.9 MB (estimated) >>>>>>>>> Writes per second: 0 >>>>>>>>> Compression: Snappy >>>>>>>>> Memtablerep: skip_list >>>>>>>>> Perf Level: 0 >>>>>>>>> WARNING: Assertions are enabled; benchmarks unnecessarily slow >>>>>>>>> ------------------------------------------------ >>>>>>>>> DB path: [/tmp/rocksdbtest-0/dbbench] >>>>>>>>> readrandomfast : 45.188 micros/op 22129 ops/sec; (1000100 of >>>>>>>>> 1000100 found, issued 46639 non-exist keys) >>>>>>>>> >>>>>>>>> >>>>>>>>> 45.18 us/op vs 4.57us/op! >>>>>>>>> >>>>>>>>> The test can be repeated and easy to do! Plz correct if I'm >>>>>>>>> doing foolish thing I'm not aware.. >>>>>>>>> >>>>>>>>> So I proposal this PR: https://github.com/ceph/ceph/pull/9411 >>>>>>>>> >>>>>>>>> We still can make further improvements by scanning all iterate >>>>>>>>> usage to make it better! >>>>>>>>> >>>>>>>>> [0]: >>>>>>>>> --- a/db/db_bench.cc >>>>>>>>> +++ b/db/db_bench.cc >>>>>>>>> @@ -2923,14 +2923,12 @@ class Benchmark { >>>>>>>>> int64_t key_rand = thread->rand.Next() & (pot - 1); >>>>>>>>> GenerateKeyFromInt(key_rand, FLAGS_num, &key); >>>>>>>>> ++read; >>>>>>>>> - auto status = db->Get(options, key, &value); >>>>>>>>> - if (status.ok()) { >>>>>>>>> - ++found; >>>>>>>>> - } else if (!status.IsNotFound()) { >>>>>>>>> - fprintf(stderr, "Get returned an error: %s\n", >>>>>>>>> - status.ToString().c_str()); >>>>>>>>> - abort(); >>>>>>>>> - } >>>>>>>>> + Iterator* iter = db->NewIterator(options); >>>>>>>>> + iter->Seek(key); >>>>>>>>> + if (iter->Valid() && iter->key().compare(key) == 0) { >>>>>>>>> + found++; >>>>>>>>> + } >>>>>>>>> + >>>>>>>>> if (key_rand >= FLAGS_num) { >>>>>>>>> ++nonexist; >>>>>>>>> } >>>>>>>>> -- >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe >>>>>>>>> ceph-devel" in the body of a message to >>>>>>>>> majordomo@xxxxxxxxxxxxxxx More majordomo info at >>>>>>>>> http://vger.kernel.org/majordomo-info.html >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More >>>>> majordomo >>>>>>> info at http://vger.kernel.org/majordomo-info.html >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe >>>>> ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> N r y b X ǧv ^ ){.n + z ]z {ay ʇڙ ,j f h z w j >>> :+v w j m zZ+ ݢj" ! i > PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f