Re: High io wait when osd rocksdb is compacting

Mark Nelson <mnelson@xxxxxxxxxx> · Wed, 29 Jul 2020 09:53:58 -0500

Wow, that's crazy.  You only had 13 compaction events for that OSD over 
roughly 15 days but the average compaction time was 116 seconds!  Notice 
too though that the average compaction output size is 422MB with an 
average output throughput of 3.5MB!  That's really slow with RocksDB 
sitting on an NVMe drive.  You are only processing about 16K records/second.

Here are some of the results from our internal NVMe (Intel P4510) test 
cluster looking at Sharded vs Unsharded rocksdb.  This was based on 
master from last fall so figure it's about halfway between Nautilus and 
Octopus.  These results are not exactly comparable to yours since we're 
using some experimental settings, but your compaction events look like 
they are orders of magnitude slower.

https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fag0MELTQo/edit?usp=sharing

No wonder you are seeing periodic stalls.  How many DBs per NVMe drive?  
What's your cluster workload typically like? Also, can you see if the 
NVMe drive aqu-sz is getting big waiting for the requests to be serviced?

Mark

On 7/29/20 8:35 AM, Raffael Bachmann wrote:
Hi Mark

Unfortunately it is the production cluster and I don't have another 
one :-(

This is the output of the log parser. I'have nothing to compare them 
to. Stupid me has no more logs from before the upgrade.

python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics   ceph-osd.1.log
Total OSD Log Duration (seconds)        55500.457
Number of Compaction Events     13
Avg Compaction Time (seconds)   116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB)   422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records     21019590
Total Output Records    18093259
Avg Output Throughput (MB/s)    3.53010211372
Avg Input Records/second        17994.0419635
Avg Output Records/second       16449.9710169
Avg Output/Input Ratio  0.891530624966

ceph-osd.1.log

start_offset    compaction_time_seconds output_level 
num_output_files        total_output_size num_input_records 
num_output_records      output (MB/s) input (r/s)     output (r/s)    
output/input ratio
417.204 70.247058       1       5       261853019       1476689 
1384444 3.55491754393   21021.3643396   19708.2132607 0.937532547476
546.271 128.652685      2       7       473883973       1674393 
1098908 3.51279861751   13014.8313655   8541.66393807 0.656302313734
5761.795        60.460736       1       4       211033833 1041408 
1013909 3.32873133441   17224.5339521   16769.7098494 0.973594402962
14912.985       64.958415       1       4       231336608 1316575 
1249120 3.3963233477    20267.9668215   19229.5332329 0.948764787422
15152.316       238.925764      2       14      944635417 2445094 
1902084 3.77052068592   10233.6975262   7960.98322825 0.77791855855
24607.857       53.022134       1       4       188414045 1029179 
988116  3.38887973778   19410.36549     18635.915333 0.960101206884
31259.993       55.442826       1       4       210856392 1296725 
1221474 3.62694941814   23388.5083708   22031.2362865 0.941968420444
31574.193       313.736584      2       18      1213247010 2928742 
2359960 3.68794259867   9335.03502416   7522.10650703 0.805793067467
37708.375       49.78089        1       3       171888381 974097 
939847  3.29294101107   19567.6895291   18879.6745096 0.96483923059
43219.745       51.798215       1       4       193360867 1246101 
1172257 3.5600318014    24056.8328465   22631.2238752 0.940739956071
48041.751       56.559014       1       4       208216413 1451105 
1367052 3.5108576209    25656.4762604   24170.3647804 0.942076555453
48368.403       325.833185      2       19      1289359869 3196156 
2489088 3.77380036251   9809.17889011   7639.1482347 0.778775504074
52693.952       45.057464       1       3       164730093 943326 
907000  3.48663339848   20936.0651101   20129.8501842 0.961491573433

cheers
Raffael

On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,

Adam made a PR this year that shards rocksdb data across different 
column families to help reduce compaction overhead. The goal is to 
reduce write-amplification during compaction by storing multiple 
small LSM hierarchies rather than 1 big one. We've seen evidence that 
this lowers compaction time and overhead, sometimes significantly.  
That PR was merged to master on April 26th so I don't believe it's in 
any of the releases yet but you can test it if you have a 
non-production cluster available.  That PR is here:

https://github.com/ceph/ceph/pull/34006

Normally though you should have about 1GB of WAL to absorb writes 
during compaction and rocksdb automatically slows writes down if the 
buffers start filling up.  You should only see a write stall from 
compaction if you completely fill all of the buffers.  Also, you 
shouldn't see compaction at one level blocking IO to the entire 
database.  Something seems off to me here.

If you have OSD logs, you can see a history of the compaction events 
by running this script:

https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py

That can give you an idea of how long your compaction events are 
lasting and what they are doing.

Mark

On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,

I'm kind of crossposting this from here: 
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/
But since I'm more and more sure that it's a ceph problem I'll try 
my luck here.

Since updating from Luminous to Nautilus I have a big problem.

I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T 
net for ceph.
Every few minutes a osd seems to compact the rocksdb. While doing 
this it uses alot of I/O and blocks.
This basically blocks the whole cluster and no VM/Container can read 
data for some seconds (minutes).

While it happens "iostat -x" looks like this:

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  
%rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
nvme0n1          0.00    2.00      0.00     24.00     0.00 46.00   
0.00  95.83    0.00    0.00   0.00     0.00    12.00 2.00   0.40
nvme1n1          0.00 1495.00      0.00   3924.00     0.00 6099.00   
0.00  80.31    0.00  352.39 523.78     0.00     2.62 0.67 100.00

And iotop:

Total DISK READ:         0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE: 3.43 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO> COMMAND
   2306 be/4 ceph        0.00 B/s 1533.22 K/s  0.00 % 99.99 % 
ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph 
[rocksdb:low1]

In the ceph-osd log I see that rocksdb is compacting. 
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08

The pool and one OSD is nearfull. I'd planed to move some data away 
to another ceph pool. But now I'm not sure anymore if I should go 
with ceph.
I'l move some data away anyway today to see if that helps, but 
before the upgrade there was the same amount of data an I haven't 
had a problem.

Any hints to solve this are appreciated.

Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx