Re: BlueStore fragmentation woes

Hi Stefan,

given that allocation probes include every allocation (including short 4K ones), your stats do look pretty high indeed.

That said, you omitted historic probes, so it's hard to tell whether there is a negative trend over time...

As I mentioned in my reply to Hector, one might want to investigate further by e.g. building a histogram (chunk size, number of chunks) from the output of the 'ceph tell osd.N bluestore allocator dump block' command and monitoring how it evolves over time. A script to build such a histogram is still to be written. ;)
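Something along these lines could work as a starting point (a rough sketch only; it assumes the dump comes back as JSON with an "extents" list whose entries carry "offset"/"length" fields -- field names and number formats may differ between releases, so adjust the parsing to what your OSDs actually emit):

#!/usr/bin/env python3
# Rough sketch: build a (chunk-size bucket -> number of chunks) histogram of
# free space from 'ceph tell osd.N bluestore allocator dump block'.
# Assumption: the dump is JSON with an "extents" list whose entries carry
# "offset"/"length" fields (decimal or 0x-prefixed hex) -- adjust if your
# release formats the dump differently.
import json
import subprocess
import sys
from collections import Counter

def dump_free_extents(osd_id):
    out = subprocess.check_output(
        ["ceph", "tell", f"osd.{osd_id}", "bluestore", "allocator", "dump", "block"])
    return json.loads(out).get("extents", [])

def build_histogram(extents):
    hist = Counter()
    for ext in extents:
        length = int(str(ext["length"]), 0)            # handles decimal and 0x-hex
        bucket = 1 << (max(length, 1) - 1).bit_length()  # round up to next power of two
        hist[bucket] += 1
    return hist

if __name__ == "__main__":
    hist = build_histogram(dump_free_extents(sys.argv[1]))
    for size in sorted(hist):
        print(f"{size:>14} bytes: {hist[size]} chunks")

Run it periodically and compare the buckets over time -- a growing count in the small buckets is what you'd be looking for.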


As for the Pacific release being the culprit - likely it is. But there were two major updates which could have had an impact. Both came in the same PR (https://github.com/ceph/ceph/pull/34588):

1. 4K allocation unit for spinners

2. Switch to avl/hybrid allocator.

Honestly I'd rather bet on 1.

>BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release, with BlueFS 4K alloc size support [4]?

IIRC this feature doesn't require OSD redeployment - the new superblock format is applied on the fly and 4K allocations are enabled immediately. So there is no specific requirement to re-provision OSDs at Quincy+. Hence you're free to go with Pacific now and enable 4K for BlueFS later in Quincy.


Thanks,

Igor

On 26/05/2023 16:03, Stefan Kooman wrote:
On 5/25/23 22:12, Igor Fedotov wrote:

On 25/05/2023 20:36, Stefan Kooman wrote:
On 5/25/23 18:17, Igor Fedotov wrote:
Perhaps...

I don't like the idea of using the fragmentation score as a real index. IMO it's mostly a very imprecise first-pass marker to alert that something might be wrong, not a real quantitative, high-quality estimate.

Chiming in on the high fragmentation issue. We started collecting the "fragmentation_rating" of each OSD this afternoon. All OSDs that were provisioned a year ago have a fragmentation rating of ~0.9. Not sure for how long they have been at this level.

Could you please collect allocation probes from existing OSD logs? Just a few samples from different OSDs...

10 OSDs from one host, but I have checked other nodes and they are similar:

CNT (allocations)    FRAG (fragments)    Size (bytes)    Ratio (FRAG/CNT)    Avg frag size (Size/FRAG, bytes)
21350923    37146899    317040259072    1.73982637659271 8534.77053554322
20951932    38122769    317841477632    1.8195347808498 8337.31352599283
21188454    37298950    278389411840    1.76034315670223 7463.73321072041
21605451    39369462    270427185152    1.82220042525379 6868.95810646333
19215230    36063713    290967818240    1.87682962941375 8068.16032059705
19293599    35464928    269238423552    1.83817068033807 7591.68109835159
19963538    36088151    315796836352    1.80770317365589 8750.70702159277
18030613    31753098    297826177024    1.76106591606176 9379.43683554909
17889602    31718012    299550142464    1.77298589426417 9444.16511551859
18475332    33264944    266053271552    1.80050588536109 7998.0074985847
18618154    31914219    254801883136    1.71414518324427 7983.96110323113
16437108    29421873    275350355968    1.78996651965784 9358.69568766067
17164338    28605353    249404649472    1.66655731202683 8718.81040838755
17895480    29658102    309047177216    1.65729569701399 10420.3288941416
19546560    34588509    301368737792    1.76954456436324 8712.97279081905
18525784    34806856    314875801600    1.87883309014075 9046.37297893266
18550989    35236438    273069948928    1.89943716747393 7749.64679823767
19085807    34605572    255512043520    1.81315738967705 7383.55209155335
17203820    31205542    277097357312    1.81387284916954 8879.74826112618
18003801    33723670    269696761856    1.87314167713807 7997.25420916525
18655425    33227176    306511810560    1.78109992133655 9224.7325069094
26380965    45627920    335281111040    1.72957736762093 7348.15680925188
24923956    44721109    328790982656    1.79430219664968 7352.03106559813
25312482    43035393    287792226304    1.70016488308021 6687.33817079351
25841471    46276699    288168476672    1.79079197929561 6227.07502693742
25618384    43785917    321591488512    1.70915999229303 7344.63294469772
26006097    45056206    298747666432    1.73252472295247 6630.55532088077
26684805    45196730    351100243968    1.69372532420604 7768.26650883814
24025872    42450135    353265467392    1.76685095966548 8321.89267223768
24080466    45510525    371726323712    1.88993539410741 8167.91991988666
23195936    45095051    326473826304    1.94409274969546 7239.68193990955
23653302    43312705    307549573120    1.83114835298683 7100.67803707942
21589455    40034670    322982109184    1.85436223378497 8067.56017182107
22469039    42042723    314323701760    1.87114023879704 7476.29266924504
23647633    43486098    370003841024    1.83891969230071 8508.55464254346
23750561    37387139    320471453696    1.57415814304344 8571.70305799542
23142315    38640274    329341046784    1.66968058294946 8523.25857689312
23539469    39573256    292528910336    1.68114480407353 7392.08596674481
23810938    37968499    277270380544    1.59458224619291 7302.64266027477
19361754    33610252    286391676928    1.73590946357443 8520.96190555191
20331818    34119736    256076865536    1.67814486633709 7505.24170339419
21017537    35862221    318755282944    1.70629988661374 8888.33078531305
21660731    42648077    329217507328    1.96891217567865 7719.39863380007
20708620    42285124    344562262016    2.04190931119505 8148.54562129225
21371937    43158447    312754188288    2.01939800777066 7246.65065654471
21447150    40034134    283613331456    1.86664120873869 7084.28790931259
18906469    36598724    302526169088    1.93577785465916 8266.03050663734
20086704    36824872    280208515072    1.83329589563325 7609.21898308296
20912511    40116356    340019290112    1.91829455582833 8475.82691987278
17728197    30717152    270751887360    1.73267208165613 8814.35516417668
16778676    30875765    267493560320    1.84017886751017 8663.54437922429
17700395    31528725    239652761600    1.78124414737637 7601.09270514428
17727766    31338207    232399462400    1.76774710361136 7415.85063880649
15488369    27225173    246367821824    1.75778179096844 9049.26561252705
16332731    29287976    227973730304    1.7932075168568 7783.86769724204
17043318    31659676    274151649280    1.85760049774346 8659.33211950748
21627836    34504152    279215091712    1.59535850003671 8092.2171833697
21244729    35619286    303324131328    1.67661757417569 8515.72744405938
22132156    38534232    281272401920    1.74109707160929 7299.28656473548
22035014    34627308    246920048640    1.57146748352418 7130.78962534425
20277457    33126067    265162657792    1.63364010585746 8004.65258347754
20669142    34587911    254815776768    1.67340816566067 7367.1918714027
21648239    34364823    292156514304    1.58741886580243 8501.61557078295
21117643    34737044    292367892480    1.64492997632359 8416.60253186771
20531946    37038043    292538568704    1.8039226773731 7898.32682855301
21393711    35682241    257189515264    1.66788459468299 7207.77361668512
21738966    34753281    252140285952    1.59866301828707 7255.1505554828
19197606    32922066    269381632000    1.71490476468785 8182.40361950553
20044574    33864896    245486792704    1.68947945713389 7249.00477190304
20601681    35851902    305202065408    1.74024158514055 8512.85561943129
Average:                            1.76995040322111    8014.69622126768


So the average fragment size is around 8 KiB, and the ratio of fragments to allocation requests is a bit below two.
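For anyone wanting to reproduce this table from their own logs: the two derived columns are simply Ratio = FRAG / CNT and Avg frag size = Size / FRAG. A minimal sketch that pulls the probes out of an OSD log and prints the same columns could look like the one below; it assumes probe lines of roughly the form "... allocation stats probe 12: cnt: 21350923 frags: 37146899 size: 317040259072 ...", so adjust the regular expression to whatever your logs actually contain.

#!/usr/bin/env python3
# Minimal sketch: extract BlueStore allocation probes from an OSD log and
# print the derived columns used in the table above
# (Ratio = FRAG / CNT, Avg frag size = Size / FRAG).
import re
import sys

# Assumed probe line format -- adjust to your actual logs.
PROBE_RE = re.compile(r"allocation stats probe \d+: cnt: (\d+) frags: (\d+) size: (\d+)")

def main(log_path):
    print(f"{'CNT':>12} {'FRAG':>12} {'Size':>16} {'Ratio':>8} {'AvgFrag':>10}")
    with open(log_path) as log:
        for line in log:
            m = PROBE_RE.search(line)
            if not m:
                continue
            cnt, frags, size = (int(x) for x in m.groups())
            ratio = frags / cnt        # fragments per allocation request
            avg_frag = size / frags    # average fragment size in bytes
            print(f"{cnt:>12} {frags:>12} {size:>16} {ratio:>8.2f} {avg_frag:>10.1f}")

if __name__ == "__main__":
    main(sys.argv[1])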





And after reading your mails it might not be a problem at all. But we will start collecting this information in the coming weeks.

We will be re-provisioning all our OSDs, so that might be a good time to look at the behavior and development of the "cnt versus frags" ratio.

After we completely emptied a host, and even after letting the OSDs run idle for a couple of hours, the fragmentation ratio would not drop below 0.27 for some OSDs, and stayed as high as 0.62 for others. Is it expected that this will not go to ~zero?

You might be facing the issue fixed by https://github.com/ceph/ceph/pull/49885

Possibly.


I have read some tracker tickets that got mentioned in PRs [1,2]. The problem seems to reveal itself in the Pacific release. I wonder if this has something to do with the change of the default allocator from bitmap to hybrid in Pacific.




BlueFS 4K allocation unit will not be backported to Pacific [3]. Would it make sense to skip re-provisioning OSDs in Pacific altogether and do the re-provisioning in the Quincy release, with BlueFS 4K alloc size support [4]?

Gr. Stefan

[1]: https://tracker.ceph.com/issues/58022
[2]: https://tracker.ceph.com/issues/57672
[3]: https://tracker.ceph.com/issues/58589
[4]: https://tracker.ceph.com/issues/58588
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



