Hi All,
I'm kind of crossposting this from here:
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/
But since I'm more and more sure that it's a ceph problem I'll try my
luck here.
Since updating from Luminous to Nautilus I have a big problem.
I have a 3-node cluster. Each node has 2 NVMe SSDs and a 10GBASE-T
network for Ceph.
Every few minutes an OSD seems to compact its RocksDB. While doing this
it uses a lot of I/O and blocks.
This effectively stalls the whole cluster, and no VM/container can read
data for several seconds (sometimes minutes).
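In case it helps, this is roughly how I catch the culprit while it
happens; nothing fancy, just health detail plus iotop (the grep
patterns are only examples):

ceph health detail | grep -i slow
iotop -obn1 | grep -i rocksdb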
While it happens "iostat -x" looks like this:
Device     r/s      w/s  rkB/s    wkB/s  rrqm/s   wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
nvme0n1   0.00     2.00   0.00    24.00    0.00    46.00   0.00  95.83     0.00     0.00    0.00      0.00     12.00   2.00    0.40
nvme1n1   0.00  1495.00   0.00  3924.00    0.00  6099.00   0.00  80.31     0.00   352.39  523.78      0.00      2.62   0.67  100.00
And iotop:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s
  TID  PRIO  USER   DISK READ   DISK WRITE  SWAPIN      IO>  COMMAND
 2306  be/4  ceph    0.00 B/s  1533.22 K/s  0.00 %  99.99 %  ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]
In the ceph-osd log I can see that RocksDB is compacting:
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
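One thing I'm wondering about is whether compacting the OSDs manually in
a quiet window would at least keep this out of production hours. As far
as I understand it can be done online via the admin socket or offline
with ceph-kvstore-tool; osd.3 and the default data path below are only
examples, I haven't verified this on Nautilus yet:

# online, against the running daemon
ceph daemon osd.3 compact

# offline, with the OSD stopped and noout set
ceph osd set noout
systemctl stop ceph-osd@3
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-3 compact
systemctl start ceph-osd@3
ceph osd unset noout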
The pool and one OSD are nearfull. I had planned to move some data away
to another Ceph pool, but now I'm not sure anymore whether I should stay
with Ceph.
I'll move some data away today anyway to see if that helps, but before
the upgrade the same amount of data was there and I didn't have this
problem.
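For the nearfull side, the plan is simply to check utilisation per pool
and per OSD before and after moving data; the 0.90 below is only an
example in case I need temporary headroom, not something I'd leave set:

ceph df
ceph osd df tree
# temporarily raise the nearfull threshold while data is being moved
ceph osd set-nearfull-ratio 0.90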
Any hints to solve this are appreciated.
Cheers
Raffael