On 7/29/20 7:47 PM, Raffael Bachmann wrote:
Hi Mark
I think it's 15 hours, not 15 days. But the compaction time really does
seem to be slow. I'm destroying and recreating all NVMe OSDs one by one,
and the recreated ones don't have latency problems and are also much
faster at compacting.
This is from the last two hours:
Compaction Statistics /var/log/ceph/ceph-osd.0.log
Total OSD Log Duration (seconds) 7909.104
Number of Compaction Events 11
Avg Compaction Time (seconds) 1.26702554545
Total Compaction Time (seconds) 13.937281
Avg Output Size: (MB) 268.282840729
Total Output Size: (MB) 2951.11124802
Total Input Records 7693669
Total Output Records 7670229
Avg Output Throughput (MB/s) 225.29745104
Avg Input Records/second 533134.954087
Avg Output Records/second 531386.725197
Avg Output/Input Ratio 0.996558805862
Not sure if you are interested in the answers to your questions
anymore, but:
DBs per drive: I think one? (Not yet familiar with all the Ceph
details. It was "just" a normal OSD: the whole disk, BlueStore
including the DB.)
Workload: Rather low. The three-node Proxmox/Ceph cluster is mainly
there to avoid a single point of failure. The CPUs and disks are mostly
bored.
In iostat the aqu-sz value is getting to about 520 when this occurs
Well now that is an interesting detail. That's the number of requests
that are backed up waiting to be serviced by the device below Ceph.
That would indicate that the device wasn't servicing requests quickly.
Not sure what that means in relation to all of your other new findings,
but something definitely seems to be behaving strangely.
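As a rough cross-check (Little's law: average queue depth is roughly
the request rate times the average wait time), the nvme1n1 numbers in
the iostat sample further down the thread are consistent with that:

# Little's law: average queue depth ~= request rate * average wait time.
# Values taken from the nvme1n1 iostat line quoted later in this thread.
w_per_s = 1495.0        # write requests per second
w_await_s = 352.39e-3   # average write wait, in seconds
print(w_per_s * w_await_s)   # ~526.8, in line with the reported aqu-sz of ~524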
There is one difference between the old OSDs and the recreated ones.
The old ones were partitioned, and the mount /var/lib/ceph/osd/ceph-1
was the first partition, formatted as XFS.
Now they are LVM, and /var/lib/ceph/osd/ceph-1 is tmpfs. Both old and
new are BlueStore.
I'm still in the middle of recreating them one by one. Luckily it's not
a petabyte cluster with thousands of disks ;-)
Anyway, thanks everyone for answering and helping so fast. Having a
mailing list this active is really nice.
Cheers,
Raffael
On 29/07/2020 16:53, Mark Nelson wrote:
Wow, that's crazy. You only had 13 compaction events for that OSD
over roughly 15 days but the average compaction time was 116 seconds!
Notice too, though, that the average compaction output size is 422MB
with an average output throughput of only 3.5MB/s! That's really slow
for RocksDB sitting on an NVMe drive. You are only processing about 16K
records/second.
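For reference, those per-event throughput numbers can be reproduced
straight from the parser output quoted below; e.g. for the first
compaction event (a quick check in Python, output sizes are in bytes):

# First per-event row from the parser output quoted below:
# 261853019 bytes written over 70.247058 seconds.
bytes_out = 261853019
seconds = 70.247058
print(bytes_out / 2**20 / seconds)   # ~3.55 MiB/s, matching the parser's value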
Here are some of the results from our internal NVMe (Intel P4510)
test cluster looking at Sharded vs Unsharded rocksdb. This was based
on master from last fall so figure it's about halfway between
Nautilus and Octopus. These results are not exactly comparable to
yours since we're using some experimental settings, but your
compaction events look like they are orders of magnitude slower.
https://docs.google.com/spreadsheets/d/1FYFBxwvE1i28AKoLyqrksHptE1Z523NU3Fag0MELTQo/edit?usp=sharing
No wonder you are seeing periodic stalls. How many DBs per NVMe
drive? What's your cluster workload typically like? Also, can you
see if the NVMe drive aqu-sz is getting big waiting for the requests
to be serviced?
Mark
On 7/29/20 8:35 AM, Raffael Bachmann wrote:
Hi Mark
Unfortunately it is the production cluster and I don't have another
one :-(
This is the output of the log parser. I have nothing to compare it to;
stupid me has no more logs from before the upgrade.
python ceph_rocksdb_log_parser.py ceph-osd.1.log
Compaction Statistics ceph-osd.1.log
Total OSD Log Duration (seconds) 55500.457
Number of Compaction Events 13
Avg Compaction Time (seconds) 116.498074615
Total Compaction Time (seconds) 1514.47497
Avg Output Size: (MB) 422.757656391
Total Output Size: (MB) 5495.84953308
Total Input Records 21019590
Total Output Records 18093259
Avg Output Throughput (MB/s) 3.53010211372
Avg Input Records/second 17994.0419635
Avg Output Records/second 16449.9710169
Avg Output/Input Ratio 0.891530624966
ceph-osd.1.log
start_offset  compaction_time_seconds  output_level  num_output_files  total_output_size  num_input_records  num_output_records  output (MB/s)  input (r/s)  output (r/s)  output/input ratio
417.204    70.247058   1   5   261853019   1476689  1384444  3.55491754393  21021.3643396  19708.2132607  0.937532547476
546.271    128.652685  2   7   473883973   1674393  1098908  3.51279861751  13014.8313655  8541.66393807  0.656302313734
5761.795   60.460736   1   4   211033833   1041408  1013909  3.32873133441  17224.5339521  16769.7098494  0.973594402962
14912.985  64.958415   1   4   231336608   1316575  1249120  3.3963233477   20267.9668215  19229.5332329  0.948764787422
15152.316  238.925764  2   14  944635417   2445094  1902084  3.77052068592  10233.6975262  7960.98322825  0.77791855855
24607.857  53.022134   1   4   188414045   1029179  988116   3.38887973778  19410.36549    18635.915333   0.960101206884
31259.993  55.442826   1   4   210856392   1296725  1221474  3.62694941814  23388.5083708  22031.2362865  0.941968420444
31574.193  313.736584  2   18  1213247010  2928742  2359960  3.68794259867  9335.03502416  7522.10650703  0.805793067467
37708.375  49.78089    1   3   171888381   974097   939847   3.29294101107  19567.6895291  18879.6745096  0.96483923059
43219.745  51.798215   1   4   193360867   1246101  1172257  3.5600318014   24056.8328465  22631.2238752  0.940739956071
48041.751  56.559014   1   4   208216413   1451105  1367052  3.5108576209   25656.4762604  24170.3647804  0.942076555453
48368.403  325.833185  2   19  1289359869  3196156  2489088  3.77380036251  9809.17889011  7639.1482347   0.778775504074
52693.952  45.057464   1   3   164730093   943326   907000   3.48663339848  20936.0651101  20129.8501842  0.961491573433
cheers
Raffael
On 29/07/2020 15:19, Mark Nelson wrote:
Hi Raffael,
Adam made a PR this year that shards rocksdb data across different
column families to help reduce compaction overhead. The goal is to
reduce write-amplification during compaction by storing multiple
small LSM hierarchies rather than 1 big one. We've seen evidence
that this lowers compaction time and overhead, sometimes
significantly. That PR was merged to master on April 26th so I
don't believe it's in any of the releases yet but you can test it
if you have a non-production cluster available. That PR is here:
https://github.com/ceph/ceph/pull/34006
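To illustrate the idea (a toy sketch in plain Python, not Ceph's actual
code): if keys are routed by prefix into independent little stores, a
flush or compaction only ever rewrites that one shard instead of the
whole keyspace, which is where the reduced write-amplification comes
from.

# Toy illustration of prefix sharding; nothing Ceph-specific here.
class ToyShard:
    def __init__(self):
        self.memtable = {}   # recent writes held in memory
        self.sstable = []    # "on-disk" sorted (key, value) pairs

    def put(self, key, value):
        self.memtable[key] = value

    def compact(self):
        # Merge the memtable into the sorted run; the cost is
        # proportional to this shard's size only.
        merged = dict(self.sstable)
        merged.update(self.memtable)
        self.sstable = sorted(merged.items())
        self.memtable = {}

class ShardedStore:
    def __init__(self, prefixes):
        self.shards = {p: ToyShard() for p in prefixes}

    def put(self, key, value):
        # Route by the first character of the key (the "column family").
        self.shards[key[0]].put(key, value)

store = ShardedStore(prefixes="OMP")
store.put("O/object-1", b"data")
store.put("M/omap-1", b"meta")
store.shards["O"].compact()   # only the "O" shard gets rewritten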
Normally though you should have about 1GB of WAL to absorb writes
during compaction and rocksdb automatically slows writes down if
the buffers start filling up. You should only see a write stall
from compaction if you completely fill all of the buffers. Also,
you shouldn't see compaction at one level blocking IO to the entire
database. Something seems off to me here.
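For context, these are the standard RocksDB option names behind that
behaviour (the names are RocksDB's own; the values below are purely
illustrative, not what BlueStore actually sets):

# Standard RocksDB write-buffer / stall knobs; values here are only
# placeholders for illustration, not BlueStore's defaults.
rocksdb_stall_knobs = {
    "write_buffer_size":              256 * 2**20,  # size of one memtable
    "max_write_buffer_number":        4,            # memtables kept before flush must finish
    "level0_slowdown_writes_trigger": 20,           # L0 files before writes get throttled
    "level0_stop_writes_trigger":     36,           # L0 files before writes stall completely
}
for name, value in rocksdb_stall_knobs.items():
    print(f"{name} = {value}")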
If you have OSD logs, you can see a history of the compaction
events by running this script:
https://github.com/ceph/cbt/blob/master/tools/ceph_rocksdb_log_parser.py
That can give you an idea of how long your compaction events are
lasting and what they are doing.
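If you just want a quick first look before running the full parser,
something like this rough sketch counts the compaction-related lines
(assuming the RocksDB compaction events show up in the OSD log as lines
containing the word "compaction"):

# Usage: python count_compactions.py /var/log/ceph/ceph-osd.1.log
import sys

path = sys.argv[1]
with open(path, errors="replace") as f:
    hits = sum(1 for line in f if "compaction" in line.lower())
print(f"{hits} compaction-related lines in {path}")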
Mark
On 7/29/20 7:52 AM, Raffael Bachmann wrote:
Hi All,
I'm kind of crossposting this from here:
https://forum.proxmox.com/threads/i-o-wait-after-upgrade-5-x-to-6-2-and-ceph-luminous-to-nautilus.73581/
But since I'm more and more sure that it's a ceph problem I'll try
my luck here.
Since updating from Luminous to Nautilus I have a big problem.
I have a 3-node cluster. Each node has 2 NVMe SSDs and a 10GBASE-T
network for Ceph.
Every few minutes an OSD seems to compact its RocksDB. While doing this
it uses a lot of I/O and blocks.
This basically blocks the whole cluster, and no VM/container can read
data for several seconds (sometimes minutes).
While it happens "iostat -x" looks like this:
Device    r/s    w/s      rkB/s   wkB/s    rrqm/s  wrqm/s   %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme0n1   0.00   2.00     0.00    24.00    0.00    46.00    0.00   95.83  0.00     0.00     0.00    0.00      12.00     2.00   0.40
nvme1n1   0.00   1495.00  0.00    3924.00  0.00    6099.00  0.00   80.31  0.00     352.39   523.78  0.00      2.62      0.67   100.00
And iotop:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 1573.47 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 3.43 M/s
 TID   PRIO  USER  DISK READ  DISK WRITE   SWAPIN  IO>      COMMAND
 2306  be/4  ceph  0.00 B/s   1533.22 K/s  0.00 %  99.99 %  ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph [rocksdb:low1]
In the ceph-osd log I see that rocksdb is compacting.
https://gist.github.com/qwasli/3bd0c7d535ee462feff8aaee618f3e08
The pool and one OSD are nearfull. I'd planned to move some data away
to another Ceph pool, but now I'm not sure anymore if I should go with
Ceph for that.
I'll move some data away anyway today to see if that helps, but before
the upgrade there was the same amount of data and I didn't have this
problem.
Any hints to solve this are appreciated.
Cheers
Raffael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx