Re: High io wait when osd rocksdb is compacting

Hi Raffael,


Adam made a PR this year that shards RocksDB data across different column families to help reduce compaction overhead.  The goal is to reduce write amplification during compaction by storing multiple small LSM hierarchies rather than one big one.  We've seen evidence that this lowers compaction time and overhead, sometimes significantly.  That PR was merged to master on April 26th, so I don't believe it's in any of the releases yet, but you can test it if you have a non-production cluster available.  That PR is here:


https://github.com/ceph/ceph/pull/34006
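
If you do test it, I believe the sharding layout is controlled by a new OSD option that PR adds (bluestore_rocksdb_cfs), and as far as I know it only takes effect for OSDs created after the option is set.  In ceph.conf it would look something like the sketch below -- the layout string here is just a placeholder, so take the real default from the PR:

    [osd]
    # placeholder value -- use the default layout string from the PR
    bluestore_rocksdb_cfs = <prefix-to-column-family layout string>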


Normally, though, you should have about 1GB of WAL to absorb writes during compaction, and RocksDB automatically slows writes down if the buffers start filling up.  You should only see a write stall from compaction if you completely fill all of the buffers.  Also, you shouldn't see compaction at one level blocking IO to the entire database.  Something seems off to me here.
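
To see what the OSDs are actually running with, you can dump the RocksDB tuning that BlueStore passes down (the write buffer size and count live in bluestore_rocksdb_options), e.g. via the admin socket of the OSD from your iotop output:

    # show the RocksDB options BlueStore hands to rocksdb
    # (write_buffer_size, max_write_buffer_number, compaction settings, ...)
    ceph daemon osd.3 config get bluestore_rocksdb_options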

If you have OSD logs, you can see a history of the compaction events by running this script:

https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py


That can give you an idea of how long your compaction events are lasting and what they are doing.
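
Usage is roughly the following -- I'm assuming it just takes the OSD log path as an argument, so check the top of the script for the exact invocation:

    # summarize the compaction events recorded in an OSD log (invocation assumed)
    python ceph_rocksdb_log_parser.py /var/log/ceph/ceph-osd.3.log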


Mark


On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,

I'm kind of cross-posting this from here: https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/ But since I'm more and more sure that it's a Ceph problem, I'll try my luck here.

Since updating from Luminous to Nautilus I have a big problem.

I have a 3-node cluster. Each node has two NVMe SSDs and a 10GBASE-T network for Ceph. Every few minutes an OSD seems to compact its RocksDB. While doing this it uses a lot of I/O and blocks. This basically blocks the whole cluster and no VM/container can read data for some seconds (or minutes).

While it happens, "iostat -x" looks like this:

Device            r/s      w/s     rkB/s     wkB/s   rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme0n1          0.00     2.00      0.00     24.00     0.00   46.00   0.00  95.83    0.00    0.00   0.00     0.00    12.00   2.00   0.40
nvme1n1          0.00  1495.00      0.00   3924.00     0.00 6099.00   0.00  80.31    0.00  352.39 523.78     0.00     2.62   0.67 100.00

And iotop:

Total DISK READ:         0.00 B/s | Total DISK WRITE:      1573.47 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       3.43 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN IO>    COMMAND
   2306 be/4 ceph        0.00 B/s 1533.22 K/s  0.00 % 99.99 % ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]


In the ceph-osd log I see that rocksdb is compacting. https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08

The pool and one OSD are nearfull. I'd planned to move some data away to another Ceph pool, but now I'm not sure anymore if I should stick with Ceph. I'll move some data away today anyway to see if that helps, but before the upgrade there was the same amount of data and I didn't have a problem.

Any hints to solve this are appreciated.

Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
