Hi Somnath, Mark,

I'm trying to install your patch (https://github.com/somnathr/ceph/tree/wip-bluestore-multi-kv-sync-thread) and test BlueStore performance with ZetaScale. However, I'm getting two errors.

When I restarted the ceph-osd process using "killall ceph-osd" and "ceph-osd -i ${process_id} --pid-file=/var/run/ceph/osd.${process_id}.pid", I got this error:

2016-12-27 13:21:10.384809 7f3149ee3a00 30 bluestore.OnodeSpace(0x7f338b0557d8 in 0x7f3154725340) lookup
2016-12-27 13:21:10.384810 7f3149ee3a00 30 bluestore.OnodeSpace(0x7f338b0557d8 in 0x7f3154725340) lookup #0:4c000000::::head# hit 0x7f315475b680
2016-12-27 13:21:10.384812 7f3149ee3a00 20 bluestore.onode(0x7f315475b680) flush
2016-12-27 13:21:10.384812 7f3149ee3a00 20 bluestore.onode(0x7f315475b680) flush done
2016-12-27 13:21:10.384814 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [M0000000000000006c6._biginfo]
2016-12-27 13:21:10.384825 7f3149ee3a00 30 zs: _get:869 ZSReadObject logging: [1]110
2016-12-27 13:21:10.384828 7f3149ee3a00 30 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values got 0x00000000000006c6'._biginfo' -> _biginfo
2016-12-27 13:21:10.384831 7f3149ee3a00 30 zs: is_logging:120 M 00000000000006c6._fastinfoEkÿ^?(18)
2016-12-27 13:21:10.384835 7f3149ee3a00 30 zs: append_logging_prefix:104 2_1734_(7)
2016-12-27 13:21:10.384837 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [2_1734_M0000000000000006c6._fastinfo]
2016-12-27 13:21:10.384854 7f3149ee3a00 30 zs: _get:857 ZSReadObject logging: [12]110
2016-12-27 13:21:10.384857 7f3149ee3a00 30 zs: is_logging:120 M 00000000000006c6._info(14)
2016-12-27 13:21:10.384860 7f3149ee3a00 30 zs: append_logging_prefix:104 2_1734_(7)
2016-12-27 13:21:10.384868 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [2_1734_M0000000000000006c6._info]
2016-12-27 13:21:10.384872 7f3149ee3a00 30 zs: _get:857 ZSReadObject logging: [12]110
2016-12-27 13:21:10.384874 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [M0000000000000006c6._infoverinfo]
2016-12-27 13:21:10.384879 7f3149ee3a00 30 zs: _get:869 ZSReadObject logging: [1]1
2016-12-27 13:21:10.384889 7f3149ee3a00 30 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values got 0x00000000000006c6'._infover' -> _infover
2016-12-27 13:21:10.384892 7f3149ee3a00 10 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values 0.32_head oid #0:4c000000::::head# = 0
2016-12-27 13:21:10.386868 7f3149ee3a00 -1 /root/ceph/src/osd/PG.cc: In function 'static int PG::read_info(ObjectStore*, spg_t, const coll_t&, ceph::bufferlist&, pg_info_t&, std::map<unsigned int, pg_interval_t>&, __u8&)' thread 7f3149ee3a00 time 2016-12-27 13:21:10.384897
/root/ceph/src/osd/PG.cc: 3142: FAILED assert(values.size() == 3 || values.size() == 4)

 ceph version 11.0.2-2332-gd5d734c (d5d734ce16ab7c0770a00972478c39e07db4aed0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f314a9aa1db]
 2: (PG::read_info(ObjectStore*, spg_t, coll_t const&, ceph::buffer::list&, pg_info_t&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned char&)+0x174) [0x7f314a401d54]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x76) [0x7f314a402856]
 4: (OSD::load_pgs()+0x9b4) [0x7f314a3580e4]
 5: (OSD::init()+0x2026) [0x7f314a366c76]
 6: (main()+0x2a7e) [0x7f314a29cf1e]
 7: (__libc_start_main()+0xf5) [0x7f3147628f45]
 8: (()+0x4134a6) [0x7f314a3164a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

When I ran a 4k random write test, I hit a 'transaction submit sync' error after about 100s, as shown below.

2016-12-28 15:50:13.001185 7ff810261700 -1 /root/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread(uint32_t)' thread 7ff810261700 time 2016-12-28 15:50:12.999799
/root/ceph/src/os/bluestore/BlueStore.cc: 6978: FAILED assert(r == 0)

 ceph version 11.0.2-2332-gd5d734c (d5d734ce16ab7c0770a00972478c39e07db4aed0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7ff840ad31db]
 2: (BlueStore::_kv_sync_thread(unsigned int)+0x295f) [0x7ff84084be3f]
 3: (BlueStore::KVSyncThread::entry()+0x10) [0x7ff84086d970]
 4: (()+0x8184) [0x7ff83f0d6184]
 5: (clone()+0x6d) [0x7ff83d82a37d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Nonetheless, during the 100s test we get more stable performance with ZS than with RocksDB. Rocks can provide higher peak performance, but once compaction becomes frequent, it issues a great many read ops, which throttles Rocks throughput a lot. On average, BlueStore delivered almost the same performance with Rocks as with ZS. Furthermore, we applied some tuning to Rocks but none to ZS. Min alloc size for both tests is 64K. The detailed configuration is as follows.

OSD side:
"bluestore_min_alloc_size": 65536
"osd_op_num_threads_per_shard": "2"
"osd_op_num_shards": "8"

For Rocks: separate partition (40G, P3700) for db & wal
"bluestore_rocksdb_options": "max_write_buffer_number=64,min_write_buffer_number_to_merge=2,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,compression=kNoCompression,write_buffer_size=134217728,target_file_size_base=134217728,max_background_compactions=32,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"

For ZS: separate partition (20G, P3700) for db
"bluestore_kvbackend": zs
"bluestore_sync_submit_transaction": false

Somnath, any suggestions for a better ZS configuration?

Thanks,
Haodong