RE: Bluestore with ZetaScale

Thanks Haodong for trying. Please see my response inline.

Regards
Somnath

-----Original Message-----
From: Tang, Haodong [mailto:haodong.tang@xxxxxxxxx]
Sent: Thursday, December 29, 2016 11:33 PM
To: mnelson@xxxxxxxxxx; Somnath Roy
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Bluestore with ZetaScale

Hi Somnath, Mark,

I'm trying to get your patch (https://github.com/somnathr/ceph/tree/wip-bluestore-multi-kv-sync-thread) installed and test BlueStore performance with ZetaScale, but I'm hitting two errors.

When I restarted a ceph-osd process using "killall ceph-osd" followed by "ceph-osd -i ${process_id} --pid-file=/var/run/ceph/osd.${process_id}.pid", I got this error.
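
Here ${process_id} is the OSD id; for example, for osd.58 (matching the osd-device-58 data path in the log below) the restart amounts to:

killall ceph-osd
ceph-osd -i 58 --pid-file=/var/run/ceph/osd.58.pid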

[Somnath] Unfortunately, there is a known bug in the ZS shim and restart is not yet supported. We are working on a fix ASAP.

2016-12-27 13:21:10.384809 7f3149ee3a00 30 bluestore.OnodeSpace(0x7f338b0557d8 in 0x7f3154725340) lookup
2016-12-27 13:21:10.384810 7f3149ee3a00 30 bluestore.OnodeSpace(0x7f338b0557d8 in 0x7f3154725340) lookup #0:4c000000::::head# hit 0x7f315475b680
2016-12-27 13:21:10.384812 7f3149ee3a00 20 bluestore.onode(0x7f315475b680) flush
2016-12-27 13:21:10.384812 7f3149ee3a00 20 bluestore.onode(0x7f315475b680) flush done
2016-12-27 13:21:10.384814 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [M0000000000000006c6._biginfo]
2016-12-27 13:21:10.384825 7f3149ee3a00 30 zs: _get:869 ZSReadObject logging: [1]110
2016-12-27 13:21:10.384828 7f3149ee3a00 30 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values  got 0x00000000000006c6'._biginfo' -> _biginfo
2016-12-27 13:21:10.384831 7f3149ee3a00 30 zs: is_logging:120  M 00000000000006c6._fastinfoEkÿ^?(18)
2016-12-27 13:21:10.384835 7f3149ee3a00 30 zs: append_logging_prefix:104  2_1734_(7)
2016-12-27 13:21:10.384837 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [2_1734_M0000000000000006c6._fastinfo]
2016-12-27 13:21:10.384854 7f3149ee3a00 30 zs: _get:857 ZSReadObject logging: [12]110
2016-12-27 13:21:10.384857 7f3149ee3a00 30 zs: is_logging:120  M 00000000000006c6._info(14)
2016-12-27 13:21:10.384860 7f3149ee3a00 30 zs: append_logging_prefix:104  2_1734_(7)
2016-12-27 13:21:10.384868 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [2_1734_M0000000000000006c6._info]
2016-12-27 13:21:10.384872 7f3149ee3a00 30 zs: _get:857 ZSReadObject logging: [12]110
2016-12-27 13:21:10.384874 7f3149ee3a00 30 zs: _get:844 ZSReadObject: [M0000000000000006c6._infoverinfo]
2016-12-27 13:21:10.384879 7f3149ee3a00 30 zs: _get:869 ZSReadObject logging: [1]1
2016-12-27 13:21:10.384889 7f3149ee3a00 30 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values  got 0x00000000000006c6'._infover' -> _infover
2016-12-27 13:21:10.384892 7f3149ee3a00 10 bluestore(/var/lib/ceph/mnt/osd-device-58-data) omap_get_values 0.32_head oid #0:4c000000::::head# = 0
2016-12-27 13:21:10.386868 7f3149ee3a00 -1 /root/ceph/src/osd/PG.cc: In function 'static int PG::read_info(ObjectStore*, spg_t, const coll_t&, ceph::bufferlist&, pg_info_t&, std::map<unsigned int, pg_interval_t>&, __u8&)' thread 7f3149ee3a00 time 2016-12-27 13:21:10.384897
/root/ceph/src/osd/PG.cc: 3142: FAILED assert(values.size() == 3 || values.size() == 4)

 ceph version 11.0.2-2332-gd5d734c (d5d734ce16ab7c0770a00972478c39e07db4aed0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f314a9aa1db]
 2: (PG::read_info(ObjectStore*, spg_t, coll_t const&, ceph::buffer::list&, pg_info_t&, std::map<unsigned int, pg_interval_t, std::less<unsigned int>, std::allocator<std::pair<unsigned int const, pg_interval_t> > >&, unsigned char&)+0x174) [0x7f314a401d54]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x76) [0x7f314a402856]
 4: (OSD::load_pgs()+0x9b4) [0x7f314a3580e4]
 5: (OSD::init()+0x2026) [0x7f314a366c76]
 6: (main()+0x2a7e) [0x7f314a29cf1e]
 7: (__libc_start_main()+0xf5) [0x7f3147628f45]
 8: (()+0x4134a6) [0x7f314a3164a6]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


When I ran a 4K random write test, I hit a 'transaction submit sync' error after about 100s, as shown below.

[Somnath] What is your db partition size? Unlike Rocks + BlueStore, we have yet to implement data partition sharing with ZS. Also, there is a known bug in the shim that does not handle pglog trimming properly and ends up leaking space; the fix will be available soon. In the meantime, I would suggest trying a ~100G db partition, which should let you run for about an hour with, say, a 600G image. Since all of us are focused on seeing a performance gain with ZS first, these issues have become a bit lower priority, but hopefully by the first week of January I will be able to upload a more robust and optimized version.
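
For example, a minimal ceph.conf sketch pointing the BlueStore db at a dedicated ~100G partition could look like the following (the device path is hypothetical; use whatever partition you carve out on the P3700):

[osd]
# hypothetical partition path; substitute your own ~100G db partition
bluestore_block_db_path = /dev/nvme0n1p2
# bluestore_block_db_size is only relevant when the db is file-backed rather
# than a raw partition; for a raw partition the partition size itself governs
#bluestore_block_db_size = 107374182400    # ~100G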


2016-12-28 15:50:13.001185 7ff810261700 -1 /root/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_kv_sync_thread(uint32_t)' thread 7ff810261700 time 2016-12-28 15:50:12.999799
/root/ceph/src/os/bluestore/BlueStore.cc: 6978: FAILED assert(r == 0)

 ceph version 11.0.2-2332-gd5d734c (d5d734ce16ab7c0770a00972478c39e07db4aed0)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7ff840ad31db]
 2: (BlueStore::_kv_sync_thread(unsigned int)+0x295f) [0x7ff84084be3f]
 3: (BlueStore::KVSyncThread::entry()+0x10) [0x7ff84086d970]
 4: (()+0x8184) [0x7ff83f0d6184]
 5: (clone()+0x6d) [0x7ff83d82a37d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Nonetheless, we got more stable performance during the 100s test with ZS than with RocksDB. Rocks can deliver higher peak performance, but once compaction becomes frequent there are a great many read ops from compaction, which throttles Rocks throughput considerably. On average, BlueStore with Rocks and with ZS delivered almost the same performance. Furthermore, we applied some tuning to Rocks but none to ZS. The min alloc size for both tests was 64K. The detailed configuration follows, with a consolidated ceph.conf sketch after the list.

OSD side:
"bluestore_min_alloc_size": 65536
"osd_op_num_threads_per_shard": "2"
"osd_op_num_shards": "8"

For Rocks:
separate partition (40G, P3700) for db & wal
"bluestore_rocksdb_options": "max_write_buffer_number=64,min_write_buffer_number_to_merge=2,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,compression=kNoCompression,write_buffer_size=134217728,target_file_size_base=134217728,max_background_compactions=32,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,num_levels=5,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_threads=32,flusher_threads=8"

For ZS:
separate partition (20G, P3700) for db
"bluestore_kvbackend": zs
"bluestore_sync_submit_transaction": false

Somnath, do you have any suggestions for a better ZS configuration?

[Somnath] Unfortunately, a 64K min alloc size will induce a lot of write amplification (WA) for ZS. Use these settings for the 4K RW test:

bluestore_min_alloc_size = 4096
bluestore_num_kv_sync_threads = 4
bluestore_zs_options="ZS_LOG_LEVEL=info,ZS_BTREE_L1CACHE_SIZE=3073741824,ZS_SCAVENGER_ENABLE=0,ZS_SCAVENGER_THREADS=0,ZS_O_DIRECT=0"

If you are memory constrained, don't set ZS_BTREE_L1CACHE_SIZE=<cache-size-in-bytes>; the default is 1G.
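
Putting these together, a ceph.conf sketch for the 4K RW test would look roughly like this (drop the ZS_BTREE_L1CACHE_SIZE entry if memory is tight and ZS will fall back to its 1G default):

[osd]
bluestore_kvbackend = zs
bluestore_min_alloc_size = 4096
bluestore_num_kv_sync_threads = 4
# remove ZS_BTREE_L1CACHE_SIZE=... below if memory-constrained (ZS default is 1G)
bluestore_zs_options = ZS_LOG_LEVEL=info,ZS_BTREE_L1CACHE_SIZE=3073741824,ZS_SCAVENGER_ENABLE=0,ZS_SCAVENGER_THREADS=0,ZS_O_DIRECT=0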

Also, I hope you are following our discussion of the decode_some() overhead and the shard size tweaking that gives a ~40% performance improvement for both Rocks and ZS. Use those shard size settings. The multi kv branch is quite outdated; I will upload a new one tomorrow, synced with the latest master, which has some improvements along those lines.


Thanks,
Haodong

