NewStore performance analysis

[Resend in plain text]

Hi,
       I have been playing with some RocksDB tunables these days, trying to optimize the performance of NewStore. From the data so far, it seems the write amplification (WA) of RocksDB is not the issue blocking performance, and neither is the fragment part (aio/dio, etc.). The issue might be how many ops RocksDB can offer under a 1-write-per-sync workload. I cannot find that number online, so I will measure it myself; if the number is low, maybe we need to hold multiple RocksDB instances in one OSD and do some sharding.
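As a quick way to get that 1-write-per-sync number, a minimal sketch could look like the following (the path, block size, and duration are placeholders; in practice you would point it at a file on the SSD under test, e.g. the WAL partition):

```python
import os, time

def sync_write_iops(path="synctest", bs=4096, duration=2.0):
    """How many write+fdatasync pairs per second can this device sustain?"""
    buf = os.urandom(bs)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        n = 0
        start = time.monotonic()
        while time.monotonic() - start < duration:
            os.pwrite(fd, buf, 0)   # overwrite the same block each time
            os.fdatasync(fd)        # one sync per write, like a WAL append
            n += 1
        return n / (time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)

print(round(sync_write_iops(duration=1.0)))
```

This is only an upper-bound sketch (single thread, single block, no filesystem journal interference accounted for), but it gives the raw sync-write ceiling to compare RocksDB against.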

The RocksDB WAL, the RocksDB data files, and the NewStore directory were backed by 3 separate SSDs:
/dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
/dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
/dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0

Some interesting findings:

1. avgrq-sz on sdb (the NewStore FS part) is 2KB, half of the request block size (4KB), and the write IOPS in iostat (~2K) is ~2x the number reported by FIO. The bandwidth matches.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73  

I believe NewStore does not split requests, so some very small IOs (~0KB) must be going out together with the 4KB data writes. Where do the small IOs come from?
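A quick sanity check on the iostat arithmetic above, using the numbers from the sdb row:

```python
# sdb row from iostat: 2038 w/s, 3.98 wMB/s, avgrq-sz = 4.00 sectors
w_per_s = 2038.0
wmb_per_s = 3.98                        # iostat MB = 1024*1024 bytes
avg_bytes = wmb_per_s * 1024 * 1024 / w_per_s
print(round(avg_bytes))                 # ~2048 bytes = 4 sectors * 512B

# If every 4KB FIO write were accompanied by one ~0KB write, the write
# count would double and the average request size would halve:
fio_iops = w_per_s / 2
print(round(fio_iops))                  # ~1019, the half we expect FIO to see
```

So the observed 2KB average and doubled IOPS are exactly what one tiny companion IO per 4KB data write would produce.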

I also checked Filestore: this behavior is not present there, and changing the WBThrottle affects the numbers. So it seems this behavior is related to the flushing mechanism? In NewStore we issue fdatasync much more aggressively.



2. Notice that by tuning write_buffer_size, write_buffer_num and min_write_buffer_number_to_merge, we can bring the DB data writes down to ZERO.

Looking at the iostat output for sdc, there is almost no IO happening there; that is because most of the WAL entries were merged before being flushed to Level 0.

The other RocksDB tunables were originally meant to optimize compaction behavior, but since little data is written to Level 0, compaction is almost unmeasurable here.
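The buffering arithmetic behind this, using the write-buffer tunables from the configuration section of this mail: with min_write_buffer_number_to_merge = 2, two full 512MB memtables are merged before anything can be flushed to Level 0, and repeated overwrites of the same key collapse to one entry during that merge.

```python
# Tunables from the configuration section below
write_buffer_size = 536870912        # rocksdb_write_buffer_size, 512MB
min_merge = 2                        # rocksdb_min_write_buffer_number_to_merge

# Bytes buffered in memtables before any flush to Level 0 can happen
bytes_before_flush = write_buffer_size * min_merge
print(bytes_before_flush // 2**20, "MB")   # 1024 MB
```

With a 4KB random-overwrite workload, a large fraction of the keys buffered in that 1GB window are overwrites of each other, which is why so little survives to be flushed.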


3. Disabling the RocksDB WAL gives 3x the performance (although this is definitely the WRONG way).

I was just curious what the performance would look like if no extra IO happened on the DB side.
With the RocksDB WAL turned off, performance is 3x (IOPS from 799 to 2464, latency from 10ms down to 3.2ms).
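Those numbers line up with Little's law at QD=8 (assuming the latencies are in ms): IOPS = queue depth / latency.

```python
# Little's law check: IOPS = QD / latency
qd = 8
for lat_ms, reported in ((10.0, 799), (3.2, 2464)):
    predicted = qd / (lat_ms / 1000.0)
    print(f"lat={lat_ms}ms -> {predicted:.0f} IOPS (reported: {reported})")
```

The prediction gives 800 and 2500, matching the reported 799 and 2464 closely, i.e. latency and IOPS are just two views of the same bottleneck at a fixed queue depth.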

4. The average queue size is <1 in every case, for both the DB WAL part and the fragment part.

I guess there is some lock around rocksdb::WriteBatch() preventing multiple OSD_OP_THREADs from working concurrently; I have not analyzed this carefully.

An easy way to check might be to comment out db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, and see whether we get a higher queue depth in the fragment part without issuing to the DB.
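A toy model of the suspected serialization (this is not the actual RocksDB code path, just an illustration of how a single lock caps the effective queue depth of 8 submitter threads at 1):

```python
import threading, time

def run(serialize):
    """Return the peak number of threads observed inside the submit path."""
    big_lock = threading.Lock()          # the suspected global lock
    meter = threading.Lock()             # protects the counters only
    in_flight, peak = 0, 0
    gate = threading.Barrier(8)

    def submit():
        nonlocal in_flight, peak
        gate.wait()                      # start all 8 "op threads" together
        if serialize:
            big_lock.acquire()
        with meter:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.05)                 # stand-in for the KV submit work
        with meter:
            in_flight -= 1
        if serialize:
            big_lock.release()

    threads = [threading.Thread(target=submit) for _ in range(8)]
    for t in threads: t.start()
    for t in threads: t.join()
    return peak

print(run(serialize=True))    # 1: the lock serializes everything
print(run(serialize=False))   # up to 8: real concurrency
```

If the queue depth seen by iostat stays near 1 no matter how many op threads we run, the behavior matches the serialized case of this toy.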


-------------------------------------------------- Configuration --------------------------------------------------
       My setup is SSD-based: 1 OSD, a pool with 100 PGs and size=1. The workload is 4KB random write (QD=8) on top of RBD (using fio with the librbd engine). The FIO configuration is:

    bs=4k
    iodepth=8
    size=10g
    iodepth_batch_submit=1
    iodepth_batch_complete=1

       The tunables I am using are listed below; they might not be optimal, but they already show something.

    rocksdb_stats_dump_period_sec = 5
    rocksdb_max_background_compactions = 4
    rocksdb_compaction_threads = 4
    rocksdb_write_buffer_size = 536870912  // 512MB
    rocksdb_write_buffer_num = 4
    rocksdb_min_write_buffer_number_to_merge = 2
    rocksdb_level0_file_num_compaction_trigger = 4
    rocksdb_max_bytes_for_level_base = 104857600  // 100MB
    rocksdb_target_file_size_base = 10485760  // 10MB
    rocksdb_num_levels = 3  // So the max DB size would be ~10GB (100MB * 10^2 at the deepest level), fair enough.
    rocksdb_compression = none
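A quick check of that "~10GB" estimate, assuming RocksDB's default max_bytes_for_level_multiplier of 10 and counting the three sized levels up from the 100MB base:

```python
base = 104857600                       # rocksdb_max_bytes_for_level_base, 100MB
multiplier = 10                        # RocksDB default max_bytes_for_level_multiplier
levels = [base * multiplier**i for i in range(3)]
print([l // 2**20 for l in levels])    # [100, 1000, 10000] MB per level
print(sum(levels) // 2**30, "GB")      # ~10 GB total, dominated by the deepest level
```

The total is ~11GB, but the deepest level alone is 10GB, so "~10GB" is the right order of magnitude.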


                                                                                                Xiaoxi





