Re: 3 node CEPH PVE hyper-converged cluster serious fragmentation and performance loss in matter of days.

Igor Fedotov <igor.fedotov@xxxxxxxx> · Thu, 10 Mar 2022 12:35:54 +0300

Hi Sasa,

jsut a few thoughts/questions on your issue in attempt to understand 
what's happening.

First of all I'd like to clarify what exact command are you using to 
assess the fragmentation. There are two options: "bluestore allocator 
score" and "bluestore allocator fragmentation"

Both are not very accurate though but it would be interesting to have 
both numbers for the case with presumably high fragmentation.

Secondly - I can imagine two performance issues when writing to 
all-flash OSD under heavy fragmentation:

1) Bluestore Allocator takes too long to allocate a new block.

2) Bluestore to invoke a large bunch of disk write requests to process 
single 4M user writing. Which might be less efficient.

I've never seen the latter being a significant issue when SSDs are in 
use (it definitely is for spinners) .

But I recall we've seen some issues with 1), e.g. 
https://tracker.ceph.com/issues/52804

In this respect could you please try to switch bluestore and bluefs 
allocators to bitmap and run some smoke benchmarking again.

Additionally you might want to upgrade to 15.2.16 which includes a bunch 
of improvements for Avl/Hybrid allocators tail latency numbers as per 
the ticket above.

And finally it would be great to get bluestore performance counters for 
both good and bad benchmarks. This can be obtained via: ceph tell osd.N 
perf dump bluestore

but please reset the counters before each benchmarking with: ceph tell 
osd.N perf reset all

Thanks,

Igor

On 3/8/2022 12:50 PM, Sasa Glumac wrote:
Proxmox = 6.4-8

CEPH =  15.2.15

Nodes = 3

Network = 2x100G / node

Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB

             nvme Samsung PM-1733 MZWLJ1T9HBJR  2TB

CPU = EPYC 7252

CEPH pools = 2 separate pools for each disk type and each disk spliced in 2
OSD's

Replica = 3

VM don't do many writes and i migrated main testing VM's to 2TB pool which
in turns fragments faster.

[SPOILER="ceph osd df"]

[CODE]ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP
META     AVAIL    %USE   VAR   PGS  STATUS

  3   nvme  1.74660   1.00000  1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB
1.3 TiB  24.18  0.90  186      up

10   nvme  1.74660   1.00000  1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB
1.4 TiB  21.38  0.79  151      up

  7  ssd2n  0.87329   1.00000  894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB
615 GiB  31.19  1.16  113      up

  8  ssd2n  0.87329   1.00000  894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB
544 GiB  39.22  1.46  143      up

  4   nvme  1.74660   1.00000  1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB
1.3 TiB  23.85  0.89  180      up

11   nvme  1.74660   1.00000  1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB
1.4 TiB  21.72  0.81  157      up

  2  ssd2n  0.87329   1.00000  894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB
598 GiB  33.18  1.23  121      up

  6  ssd2n  0.87329   1.00000  894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB
561 GiB  37.23  1.38  135      up

  5   nvme  1.74660   1.00000  1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB
1.3 TiB  23.18  0.86  176      up

  9   nvme  1.74660   1.00000  1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB
1.4 TiB  22.38  0.83  161      up

  0  ssd2n  0.87329   1.00000  894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB
563 GiB  37.07  1.38  135      up

  1  ssd2n  0.87329   1.00000  894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB
596 GiB  33.35  1.24  121      up

                        TOTAL   16 TiB  4.2 TiB  4.2 TiB   55 MiB   16 GiB
11 TiB  26.92

MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88[/CODE]

[/SPOILER]

[SPOILER="ceph osd crush tree"]

[CODE]ID   CLASS  WEIGHT    TYPE NAME

-12  ssd2n   5.23975  root default~ssd2n

  -9  ssd2n   1.74658      host pmx-s01~ssd2n

   7  ssd2n   0.87329          osd.7

   8  ssd2n   0.87329          osd.8

-10  ssd2n   1.74658      host pmx-s02~ssd2n

   2  ssd2n   0.87329          osd.2

   6  ssd2n   0.87329          osd.6

-11  ssd2n   1.74658      host pmx-s03~ssd2n

   0  ssd2n   0.87329          osd.0

   1  ssd2n   0.87329          osd.1

  -2   nvme  10.47958  root default~nvme

  -4   nvme   3.49319      host pmx-s01~nvme

   3   nvme   1.74660          osd.3

  10   nvme   1.74660          osd.10

  -6   nvme   3.49319      host pmx-s02~nvme

   4   nvme   1.74660          osd.4

  11   nvme   1.74660          osd.11

  -8   nvme   3.49319      host pmx-s03~nvme

   5   nvme   1.74660          osd.5

   9   nvme   1.74660          osd.9

  -1         15.71933  root default

  -3          5.23978      host pmx-s01

   3   nvme   1.74660          osd.3

  10   nvme   1.74660          osd.10

   7  ssd2n   0.87329          osd.7

   8  ssd2n   0.87329          osd.8

  -5          5.23978      host pmx-s02

   4   nvme   1.74660          osd.4

  11   nvme   1.74660          osd.11

   2  ssd2n   0.87329          osd.2

   6  ssd2n   0.87329          osd.6

  -7          5.23978      host pmx-s03

   5   nvme   1.74660          osd.5

   9   nvme   1.74660          osd.9

   0  ssd2n   0.87329          osd.0

   1  ssd2n   0.87329          osd.1[/CODE]

[/SPOILER]

Did a lot of tests and recreated pools and OSD's in many ways but in a
matter of days every time each OSD's gets severely fragmented and loses up
to 80% of write performance (tested with many FIO tests , rados benches ,
osd benches , RBD benches).

If i delete the osd's from node and let it sync from 2 nodes it will be
perfect for a few days 0.1 - 0.2 bluestore fragmentation but then it is in
0.8+ state soon. We are using only block devices for VM's with ext4 FS and
no SWAP on them.

[SPOILER="CEPH bluestore fragmentation"]

[CODE]osd.3  "fragmentation_rating": 0.090421032864104897

osd.10 "fragmentation_rating": 0.093359029842755931

osd.7  "fragmentation_rating": 0.083908842581664561

osd.8  "fragmentation_rating": 0.067356428512611116

after 5 days

osd.3  "fragmentation_rating": 0.2567613553223777

osd.10 "fragmentation_rating": 0.25025098722978778

osd.7  "fragmentation_rating": 0.77481281469969676

osd.8  "fragmentation_rating": 0.82260745733487917

after few weeks

0,882571391878622

0,891192311159292

..etc

[/CODE]

[/SPOILER]

[SPOILER="CEPH OSD bench degradation"]

[CODE]after recreating OSD's and syncing data to them

osd.0: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.41652934400000002,

     "bytes_per_sec": 2577829964.3638072,

     "iops": 614.60255726905041

}

osd.1: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.42986965700000002,

     "bytes_per_sec": 2497831160.0160232,

     "iops": 595.52935600662784

}

osd.2: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.424486221,

     "bytes_per_sec": 2529509253.4935308,

     "iops": 603.08200204218167

}

osd.3: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.31504493500000003,

     "bytes_per_sec": 3408218018.1693759,

     "iops": 812.58249716028593

}

osd.4: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.26949361700000002,

     "bytes_per_sec": 3984294084.412396,

     "iops": 949.92973432836436

}

osd.5: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.278853238,

     "bytes_per_sec": 3850562509.8748178,

     "iops": 918.04564234610029

}

osd.6: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.41076984700000002,

     "bytes_per_sec": 2613974301.7700129,

     "iops": 623.22003883600541

}

osd.7: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.42715592699999999,

     "bytes_per_sec": 2513699930.4705892,

     "iops": 599.31276571049432

}

osd.8: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.42246709999999998,

     "bytes_per_sec": 2541598680.7020001,

     "iops": 605.96434609937671

}

osd.9: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.27906448499999997,

     "bytes_per_sec": 3847647700.4947443,

     "iops": 917.35069763535125

}

osd.10: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.29398438999999998,

     "bytes_per_sec": 3652376998.6562896,

     "iops": 870.79453436286201

}

osd.11: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.29044762800000001,

     "bytes_per_sec": 3696851757.3846393,

     "iops": 881.39814314475996

}[/CODE]

[CODE]5 days later when 2TB pool fragmented

osd.0: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.2355760659999999,

     "bytes_per_sec": 869021223.01226258,

     "iops": 207.19080519968571

}

osd.1: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.2537920739999999,

     "bytes_per_sec": 856395447.27254355,

     "iops": 204.18058568776692

}

osd.2: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.109058316,

     "bytes_per_sec": 968156325.51462686,

     "iops": 230.82645547738716

}

osd.3: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.303978943,

     "bytes_per_sec": 3532290142.8734818,

     "iops": 842.16359683835071

}

osd.4: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.29256520600000002,

     "bytes_per_sec": 3670094057.5961719,

     "iops": 875.01861038116738

}

osd.5: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.34798205999999998,

     "bytes_per_sec": 3085624080.7356563,

     "iops": 735.67010897056014

}

osd.6: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.037829675,

     "bytes_per_sec": 1034603124.0627226,

     "iops": 246.6686067730719

}

osd.7: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.1761135300000001,

     "bytes_per_sec": 912957632.58501065,

     "iops": 217.66606154084459

}

osd.8: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 1.154277314,

     "bytes_per_sec": 930228646.9436754,

     "iops": 221.78379224388013

}

osd.9: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.27671432299999998,

     "bytes_per_sec": 3880326151.3860998,

     "iops": 925.14184746410842

}

osd.10: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.301649371,

     "bytes_per_sec": 3559569245.7121019,

     "iops": 848.66744177629994

}

osd.11: {

     "bytes_written": 1073741824,

     "blocksize": 4194304,

     "elapsed_sec": 0.269951261,

     "bytes_per_sec": 3977539575.1902046,

     "iops": 948.3193338370811

}[/CODE]

[CODE]Diff between them

4,6c4,6

<     "elapsed_sec": 0.41652934400000002,

<     "bytes_per_sec": 2577829964.3638072,

<     "iops": 614.60255726905041

---

     "elapsed_sec": 1.2355760659999999,
     "bytes_per_sec": 869021223.01226258,
     "iops": 207.19080519968571
11,13c11,13

<     "elapsed_sec": 0.42986965700000002,

<     "bytes_per_sec": 2497831160.0160232,

<     "iops": 595.52935600662784

---

     "elapsed_sec": 1.2537920739999999,
     "bytes_per_sec": 856395447.27254355,
     "iops": 204.18058568776692
18,20c18,20

<     "elapsed_sec": 0.424486221,

<     "bytes_per_sec": 2529509253.4935308,

<     "iops": 603.08200204218167

---

     "elapsed_sec": 1.109058316,
     "bytes_per_sec": 968156325.51462686,
     "iops": 230.82645547738716
25,27c25,27

<     "elapsed_sec": 0.31504493500000003,

<     "bytes_per_sec": 3408218018.1693759,

<     "iops": 812.58249716028593

---

     "elapsed_sec": 0.303978943,
     "bytes_per_sec": 3532290142.8734818,
     "iops": 842.16359683835071
32,34c32,34

<     "elapsed_sec": 0.26949361700000002,

<     "bytes_per_sec": 3984294084.412396,

<     "iops": 949.92973432836436

---

     "elapsed_sec": 0.29256520600000002,
     "bytes_per_sec": 3670094057.5961719,
     "iops": 875.01861038116738
39,41c39,41

<     "elapsed_sec": 0.278853238,

<     "bytes_per_sec": 3850562509.8748178,

<     "iops": 918.04564234610029

---

     "elapsed_sec": 0.34798205999999998,
     "bytes_per_sec": 3085624080.7356563,
     "iops": 735.67010897056014
46,48c46,48

<     "elapsed_sec": 0.41076984700000002,

<     "bytes_per_sec": 2613974301.7700129,

<     "iops": 623.22003883600541

---

     "elapsed_sec": 1.037829675,
     "bytes_per_sec": 1034603124.0627226,
     "iops": 246.6686067730719
53,55c53,55

<     "elapsed_sec": 0.42715592699999999,

<     "bytes_per_sec": 2513699930.4705892,

<     "iops": 599.31276571049432

---

     "elapsed_sec": 1.1761135300000001,
     "bytes_per_sec": 912957632.58501065,
     "iops": 217.66606154084459
60,62c60,62

<     "elapsed_sec": 0.42246709999999998,

<     "bytes_per_sec": 2541598680.7020001,

<     "iops": 605.96434609937671

---

     "elapsed_sec": 1.154277314,
     "bytes_per_sec": 930228646.9436754,
     "iops": 221.78379224388013
67,69c67,69

<     "elapsed_sec": 0.27906448499999997,

<     "bytes_per_sec": 3847647700.4947443,

<     "iops": 917.35069763535125

---

     "elapsed_sec": 0.27671432299999998,
     "bytes_per_sec": 3880326151.3860998,
     "iops": 925.14184746410842
74,76c74,76

<     "elapsed_sec": 0.29398438999999998,

<     "bytes_per_sec": 3652376998.6562896,

<     "iops": 870.79453436286201

---

     "elapsed_sec": 0.301649371,
     "bytes_per_sec": 3559569245.7121019,
     "iops": 848.66744177629994
81,83c81,83

<     "elapsed_sec": 0.29044762800000001,

<     "bytes_per_sec": 3696851757.3846393,

<     "iops": 881.39814314475996

---

     "elapsed_sec": 0.269951261,
     "bytes_per_sec": 3977539575.1902046,
     "iops": 948.3193338370811[/CODE]

[/SPOILER]

IOPS in osd bench after some time go to as low as 108 with 455MB/s

I noticed posts on the internet asking how to prevent or fix fragmentation
but no replies to them and RedHat CEPH documentation says "to call Redhat
to assist with fragmentation."

Anyone knows what causes fragmentation and how to solve it without deleting
OSD's on each node 1by1 and syncing in between. (Operation is 90 minutes
for all 3 nodes with 5TB of data but this is a testing cluster so for
production it is not acceptable).

I tried changing these values :

ceph config set osd osd_memory_target 17179869184

ceph config set osd osd_memory_expected_fragmentation 0.800000

ceph config set osd osd_memory_base 2147483648

ceph config set osd osd_memory_cache_min 805306368

ceph config set osd bluestore_cache_size 17179869184

ceph config set osd bluestore_cache_size_ssd 17179869184

Cluster is really not in use.

[SPOILER="RADOS df"]

[CODE]POOL_NAME                 USED  OBJECTS  CLONES  COPIES
MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD     WR_OPS
WR  USED COMPR  UNDER COMPR

cephfs_data                0 B        0       0       0
0        0         0         0      0 B          0      0 B         0 B
       0 B

cephfs_metadata         15 MiB       22       0      66
0        0         0       176  716 KiB        206  195 KiB         0 B
       0 B

containers              12 KiB        3       0       9
0        0         0  14890830  834 GiB   11371993  641 GiB         0 B
       0 B

device_health_metrics  1.2 MiB        6       0      18
0        0         0      1389  3.6 MiB       1713  1.4 MiB         0 B
       0 B

machines               2.4 TiB   221068       0  663204
0        0         0  35032709  3.3 TiB  433971410  7.3 TiB         0 B
       0 B

two_tb_pool            1.8 TiB   186662       0  559986
0        0         0  12742384  864 GiB  217071088  5.0 TiB         0 B
       0 B

total_objects    407761

total_used       4.2 TiB

total_avail      11 TiB

total_space      16 TiB[/CODE]

[/SPOILER]

[SPOILER="ceph -s"]

[CODE]  cluster:

     id:     REMOVED-for-pricvacy

     health: HEALTH_OK

   services:

     mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)

     mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02

     mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby

     osd: 12 osds: 12 up (since 4h), 12 in (since 3d)

   data:

     pools:   6 pools, 593 pgs

     objects: 407.76k objects, 1.5 TiB

     usage:   4.2 TiB used, 11 TiB / 16 TiB avail

     pgs:     593 active+clean

   io:

     client:   55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr[/CODE]

[/SPOILER]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx