Re: High io wait when osd rocksdb is compacting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Mark

I think its 15 hours not 15 days. But the compaction time seems really to be slow. I' destroying and recreating all nvme osds one by one. And the recreated ones don't have latency problems and are also much faster compacting the disk.

This is since two hours:
Compaction Statistics   /var/log/ceph/ceph-osd.0.log
Total OSD Log Duration (seconds)        7909.104
Number of Compaction Events     11
Avg Compaction Time (seconds)   1.26702554545
Total Compaction Time (seconds) 13.937281
Avg Output Size: (MB)   268.282840729
Total Output Size: (MB) 2951.11124802
Total Input Records     7693669
Total Output Records    7670229
Avg Output Throughput (MB/s)    225.29745104
Avg Input Records/second        533134.954087
Avg Output Records/second       531386.725197
Avg Output/Input Ratio  0.996558805862

Not sure if you are interested in the answers to your questions anymore but: DBs per drive: I think one? (Not yet familiar with all the ceph details. It was "just" a normal osd. the full disk, bluestore including db) Workload: Rather low. Having three node proxmox/ceph cluster is just for not have a single point of failure.  The cpus and disks are mostly bored.
In iostat the aqu-sz value is getting to about 520 when this occours

There is one difference between the old osds and the recreated ones. The old ones were partitioned and the mount /var/lib/ceph/osd/ceph-1 was the first partition as xfs. Now they are lvm and /var/lib/ceph/osd/ceph-1 is tmpfs. Both, old and new, are bluestore.

I'm still in the middle recreating one by one. Luckily it's not a petabyte cluster with thousand of disks ;-)

Anyway, thanks everyone for answering and helping so fast. Having a mailing list this active is really nice.
Cheers,
Raffael

On 29/07/2020 16:53, Mark Nelson wrote:
Wow, that's crazy.  You only had 13 compaction events for that OSD over roughly 15 days but the average compaction time was 116 seconds! Notice too though that the average compaction output size is 422MB with an average output throughput of 3.5MB!  That's really slow with RocksDB sitting on an NVMe drive.  You are only processing about 16K records/second.


Here are some of the results from our internal NVMe (Intel P4510) test cluster looking at Sharded vs Unsharded rocksdb.  This was based on master from last fall so figure it's about halfway between Nautilus and Octopus.  These results are not exactly comparable to yours since we're using some experimental settings, but your compaction events look like they are orders of magnitude slower.


https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fag0MELTQo/edit?usp=sharing


No wonder you are seeing periodic stalls.  How many DBs per NVMe drive?  What's your cluster workload typically like? Also, can you see if the NVMe drive aqu-sz is getting big waiting for the requests to be serviced?


Mark


On 7/29/20 8:35 AM, Raffael Bachmann wrote:
Hi Mark

Unfortunately it is the production cluster and I don't have another one :-(

This is the output of the log parser. I'have nothing to compare them to. Stupid me has no more logs from before the upgrade.

python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics   ceph-osd.1.log
Total OSD Log Duration (seconds)        55500.457
Number of Compaction Events     13
Avg Compaction Time (seconds)   116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB)   422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records     21019590
Total Output Records    18093259
Avg Output Throughput (MB/s)    3.53010211372
Avg Input Records/second        17994.0419635
Avg Output Records/second       16449.9710169
Avg Output/Input Ratio  0.891530624966

ceph-osd.1.log

start_offset    compaction_time_seconds output_level num_output_files        total_output_size num_input_records num_output_records      output (MB/s) input (r/s)     output (r/s)    output/input ratio 417.204 70.247058       1       5       261853019       1476689 1384444 3.55491754393   21021.3643396   19708.2132607 0.937532547476 546.271 128.652685      2       7       473883973       1674393 1098908 3.51279861751   13014.8313655   8541.66393807 0.656302313734 5761.795        60.460736       1       4       211033833 1041408 1013909 3.32873133441   17224.5339521   16769.7098494 0.973594402962 14912.985       64.958415       1       4       231336608 1316575 1249120 3.3963233477    20267.9668215   19229.5332329 0.948764787422 15152.316       238.925764      2       14      944635417 2445094 1902084 3.77052068592   10233.6975262   7960.98322825 0.77791855855 24607.857       53.022134       1       4       188414045 1029179 988116  3.38887973778   19410.36549     18635.915333 0.960101206884 31259.993       55.442826       1       4       210856392 1296725 1221474 3.62694941814   23388.5083708   22031.2362865 0.941968420444 31574.193       313.736584      2       18      1213247010 2928742 2359960 3.68794259867   9335.03502416   7522.10650703 0.805793067467 37708.375       49.78089        1       3       171888381 974097 939847  3.29294101107   19567.6895291   18879.6745096 0.96483923059 43219.745       51.798215       1       4       193360867 1246101 1172257 3.5600318014    24056.8328465   22631.2238752 0.940739956071 48041.751       56.559014       1       4       208216413 1451105 1367052 3.5108576209    25656.4762604   24170.3647804 0.942076555453 48368.403       325.833185      2       19      1289359869 3196156 2489088 3.77380036251   9809.17889011   7639.1482347 0.778775504074 52693.952       45.057464       1       3       164730093 943326 907000  3.48663339848   20936.0651101   20129.8501842 0.961491573433

cheers
Raffael


On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,


Adam made a PR this year that shards rocksdb data across different column families to help reduce compaction overhead. The goal is to reduce write-amplification during compaction by storing multiple small LSM hierarchies rather than 1 big one. We've seen evidence that this lowers compaction time and overhead, sometimes significantly.  That PR was merged to master on April 26th so I don't believe it's in any of the releases yet but you can test it if you have a non-production cluster available.  That PR is here:


https://github.com/ceph/ceph/pull/34006


Normally though you should have about 1GB of WAL to absorb writes during compaction and rocksdb automatically slows writes down if the buffers start filling up.  You should only see a write stall from compaction if you completely fill all of the buffers.  Also, you shouldn't see compaction at one level blocking IO to the entire database.  Something seems off to me here.

If you have OSD logs, you can see a history of the compaction events by running this script:

https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py


That can give you an idea of how long your compaction events are lasting and what they are doing.


Mark


On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,

I'm kind of crossposting this from here: https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/ But since I'm more and more sure that it's a ceph problem I'll try my luck here.

Since updating from Luminous to Nautilus I have a big problem.

I have a 3 node cluster. Each cluster has 2 nvme ssd and a 10GBASE-T net for ceph. Every few minutes a osd seems to compact the rocksdb. While doing this it uses alot of I/O and blocks. This basically blocks the whole cluster and no VM/Container can read data for some seconds (minutes).

While it happens "iostat -x" looks like this:

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util nvme0n1          0.00    2.00      0.00     24.00     0.00 46.00   0.00  95.83    0.00    0.00   0.00     0.00    12.00 2.00   0.40 nvme1n1          0.00 1495.00      0.00   3924.00     0.00 6099.00   0.00  80.31    0.00  352.39 523.78     0.00 2.62 0.67 100.00

And iotop:

Total DISK READ:         0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE: 3.43 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO> COMMAND
   2306 be/4 ceph        0.00 B/s 1533.22 K/s  0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]


In the ceph-osd log I see that rocksdb is compacting. https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08

The pool and one OSD is nearfull. I'd planed to move some data away to another ceph pool. But now I'm not sure anymore if I should go with ceph. I'l move some data away anyway today to see if that helps, but before the upgrade there was the same amount of data an I haven't had a problem.

Any hints to solve this are appreciated.

Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux