Hi Stefan,
Given that allocation probes include every allocation (including short
4K ones), your stats do look pretty high indeed.
You omitted the historic probes, though, so it's hard to tell whether
there is a negative trend in them.
As I mentioned in my reply to Hector, one might want to investigate
further by e.g. building a histogram (chunk size, num chunks)
from the output of the 'ceph tell osd.N bluestore allocator dump block'
command and monitoring how it evolves over time. A script to build such
a histogram is still to be written. ;)
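To get started, here is a rough, untested sketch of what such a script
could look like. It assumes the dump is JSON with an "extents" list of
free chunks whose "length" field is either a decimal int or a "0x..."
hex string; please adjust the field names to whatever your Ceph version
actually prints.

#!/usr/bin/env python3
# Rough sketch: build a (chunk-size, num-chunks) histogram from
#   ceph tell osd.N bluestore allocator dump block > dump.json
# Assumption: the dump is JSON with an "extents" list of free chunks whose
# "length" is a decimal int or a "0x..." hex string.
import json
import sys
from collections import Counter

def to_int(v):
    # accept both decimal ints and hex strings like "0x1000"
    return v if isinstance(v, int) else int(str(v), 0)

def main(path):
    with open(path) as f:
        dump = json.load(f)

    hist = Counter()
    for ext in dump.get("extents", []):
        length = to_int(ext["length"])
        if length <= 0:
            continue
        # round up to the next power of two: 4K, 8K, 16K, ...
        bucket = 1 << (length - 1).bit_length()
        hist[bucket] += 1

    total = sum(hist.values()) or 1
    print(f"{'chunk size <=':>14} {'num chunks':>12} {'%':>7}")
    for bucket in sorted(hist):
        n = hist[bucket]
        print(f"{bucket:>14} {n:>12} {100.0 * n / total:>7.2f}")

if __name__ == "__main__":
    main(sys.argv[1])

Feeding it the dump from the same OSD every few days should show
whether the small-chunk buckets keep growing over time.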
As for the Pacific release being the culprit - likely it is. But there
were two major updates that could have had an impact. Both came in the
same PR (https://github.com/ceph/ceph/pull/34588):
1. 4K allocation unit for spinners
2. Switch to the avl/hybrid allocator.
Honestly, I'd rather bet on 1.
> BlueFS 4K allocation unit will not be backported to Pacific [3]. Would
> it make sense to skip re-provisioning OSDs in Pacific altogether and do
> re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?
IIRC this feature doesn't require OSD redeployment - the new superblock
format is applied on-the-fly and 4K allocations are enabled immediately.
So there is no specific requirement to re-provision OSDs on Quincy+.
Hence you're free to go with Pacific and enable 4K for BlueFS later in
Quincy.
Thanks,
Igor
On 26/05/2023 16:03, Stefan Kooman wrote:
On 5/25/23 22:12, Igor Fedotov wrote:
On 25/05/2023 20:36, Stefan Kooman wrote:
On 5/25/23 18:17, Igor Fedotov wrote:
Perhaps...
I don't like the idea of using the fragmentation score as a real index.
IMO it's mostly a very imprecise first-level marker to alert
that something might be wrong, but not a real quantitative,
high-quality estimate.
Chiming in on the high fragmentation issue. We started collecting the
"fragmentation_rating" of each OSD this afternoon. All OSDs that
were provisioned a year ago have a fragmentation rating of ~0.9.
Not sure for how long they have been at this level.
Could you please collect allocation probes from existing OSD logs?
Just a few samples from different OSDs...
10 OSDs from one host, but I have checked other nodes and they are
similar:
CNT (requests)  FRAG (fragments)  Size (bytes)  Ratio (FRAG/CNT)  Avg frag size (Size/FRAG, bytes)
21350923 37146899 317040259072 1.73982637659271 8534.77053554322
20951932 38122769 317841477632 1.8195347808498 8337.31352599283
21188454 37298950 278389411840 1.76034315670223 7463.73321072041
21605451 39369462 270427185152 1.82220042525379 6868.95810646333
19215230 36063713 290967818240 1.87682962941375 8068.16032059705
19293599 35464928 269238423552 1.83817068033807 7591.68109835159
19963538 36088151 315796836352 1.80770317365589 8750.70702159277
18030613 31753098 297826177024 1.76106591606176 9379.43683554909
17889602 31718012 299550142464 1.77298589426417 9444.16511551859
18475332 33264944 266053271552 1.80050588536109 7998.0074985847
18618154 31914219 254801883136 1.71414518324427 7983.96110323113
16437108 29421873 275350355968 1.78996651965784 9358.69568766067
17164338 28605353 249404649472 1.66655731202683 8718.81040838755
17895480 29658102 309047177216 1.65729569701399 10420.3288941416
19546560 34588509 301368737792 1.76954456436324 8712.97279081905
18525784 34806856 314875801600 1.87883309014075 9046.37297893266
18550989 35236438 273069948928 1.89943716747393 7749.64679823767
19085807 34605572 255512043520 1.81315738967705 7383.55209155335
17203820 31205542 277097357312 1.81387284916954 8879.74826112618
18003801 33723670 269696761856 1.87314167713807 7997.25420916525
18655425 33227176 306511810560 1.78109992133655 9224.7325069094
26380965 45627920 335281111040 1.72957736762093 7348.15680925188
24923956 44721109 328790982656 1.79430219664968 7352.03106559813
25312482 43035393 287792226304 1.70016488308021 6687.33817079351
25841471 46276699 288168476672 1.79079197929561 6227.07502693742
25618384 43785917 321591488512 1.70915999229303 7344.63294469772
26006097 45056206 298747666432 1.73252472295247 6630.55532088077
26684805 45196730 351100243968 1.69372532420604 7768.26650883814
24025872 42450135 353265467392 1.76685095966548 8321.89267223768
24080466 45510525 371726323712 1.88993539410741 8167.91991988666
23195936 45095051 326473826304 1.94409274969546 7239.68193990955
23653302 43312705 307549573120 1.83114835298683 7100.67803707942
21589455 40034670 322982109184 1.85436223378497 8067.56017182107
22469039 42042723 314323701760 1.87114023879704 7476.29266924504
23647633 43486098 370003841024 1.83891969230071 8508.55464254346
23750561 37387139 320471453696 1.57415814304344 8571.70305799542
23142315 38640274 329341046784 1.66968058294946 8523.25857689312
23539469 39573256 292528910336 1.68114480407353 7392.08596674481
23810938 37968499 277270380544 1.59458224619291 7302.64266027477
19361754 33610252 286391676928 1.73590946357443 8520.96190555191
20331818 34119736 256076865536 1.67814486633709 7505.24170339419
21017537 35862221 318755282944 1.70629988661374 8888.33078531305
21660731 42648077 329217507328 1.96891217567865 7719.39863380007
20708620 42285124 344562262016 2.04190931119505 8148.54562129225
21371937 43158447 312754188288 2.01939800777066 7246.65065654471
21447150 40034134 283613331456 1.86664120873869 7084.28790931259
18906469 36598724 302526169088 1.93577785465916 8266.03050663734
20086704 36824872 280208515072 1.83329589563325 7609.21898308296
20912511 40116356 340019290112 1.91829455582833 8475.82691987278
17728197 30717152 270751887360 1.73267208165613 8814.35516417668
16778676 30875765 267493560320 1.84017886751017 8663.54437922429
17700395 31528725 239652761600 1.78124414737637 7601.09270514428
17727766 31338207 232399462400 1.76774710361136 7415.85063880649
15488369 27225173 246367821824 1.75778179096844 9049.26561252705
16332731 29287976 227973730304 1.7932075168568 7783.86769724204
17043318 31659676 274151649280 1.85760049774346 8659.33211950748
21627836 34504152 279215091712 1.59535850003671 8092.2171833697
21244729 35619286 303324131328 1.67661757417569 8515.72744405938
22132156 38534232 281272401920 1.74109707160929 7299.28656473548
22035014 34627308 246920048640 1.57146748352418 7130.78962534425
20277457 33126067 265162657792 1.63364010585746 8004.65258347754
20669142 34587911 254815776768 1.67340816566067 7367.1918714027
21648239 34364823 292156514304 1.58741886580243 8501.61557078295
21117643 34737044 292367892480 1.64492997632359 8416.60253186771
20531946 37038043 292538568704 1.8039226773731 7898.32682855301
21393711 35682241 257189515264 1.66788459468299 7207.77361668512
21738966 34753281 252140285952 1.59866301828707 7255.1505554828
19197606 32922066 269381632000 1.71490476468785 8182.40361950553
20044574 33864896 245486792704 1.68947945713389 7249.00477190304
20601681 35851902 305202065408 1.74024158514055 8512.85561943129
Average: 1.76995040322111 (Ratio)  8014.69622126768 (Avg frag size)
So the average fragment size is around 8 KiB, and the ratio of
fragments to allocation requests is a bit below two.
And after reading your mails it might not be a problem at all. But
we will start collecting this information in the coming weeks.
We will be re-provisioning all our OSDs, so that might be a good
time to look at the behavior and development of "cnt versus frags"
ratio.
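For that collection we could use something like the following
(untested) sketch to pull the probes out of the existing OSD logs. It
assumes the probe lines look roughly like "allocation stats probe N:
cnt: X frags: Y size: Z", so the regex may need adjusting for other
versions.

#!/usr/bin/env python3
# Minimal sketch: extract the daily allocation probes from OSD logs and
# print the "cnt versus frags" ratio plus the average fragment size.
# Assumption: probe lines look roughly like
#   "... allocation stats probe <n>: cnt: <cnt> frags: <frags> size: <size>"
import re
import sys

PROBE_RE = re.compile(
    r"allocation stats probe\s+\d+:\s*cnt:\s*(\d+)\s+frags:\s*(\d+)\s+size:\s*(\d+)")

def main(log_paths):
    for path in log_paths:
        with open(path, errors="replace") as f:
            for line in f:
                m = PROBE_RE.search(line)
                if not m:
                    continue
                cnt, frags, size = (int(g) for g in m.groups())
                if cnt == 0 or frags == 0:
                    continue
                # frags/cnt  = fragments per allocation request
                # size/frags = average fragment size in bytes
                print(f"{path}: ratio={frags / cnt:.3f} avg_frag={size / frags:.1f} B")

if __name__ == "__main__":
    main(sys.argv[1:])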
After we completely emptied a host, even after having the OSDs run
idle for a couple of hours, the fragmentation ratio would not drop
below 0.27 for some OSDs, and stayed as high as 0.62 for others. Is it
expected that this will not go to ~zero?
You might be facing the issue fixed by
https://github.com/ceph/ceph/pull/49885
Possibly.
I have read some tracker tickets that were mentioned in PRs [1,2]. The
problem seems to reveal itself in the Pacific release. I wonder if this
has something to do with the change of the default allocator from
bitmap to hybrid in Pacific.
BlueFS 4K allocation unit will not be backported to Pacific [3]. Would
it make sense to skip re-provisioning OSDs in Pacific altogether and do
re-provisioning in the Quincy release with BlueFS 4K alloc size support [4]?
Gr. Stefan
[1]: https://tracker.ceph.com/issues/58022
[2]: https://tracker.ceph.com/issues/57672
[3]: https://tracker.ceph.com/issues/58589
[4]: https://tracker.ceph.com/issues/58588