3-node CEPH PVE hyper-converged cluster: serious fragmentation and performance loss within a matter of days.

Proxmox = 6.4-8

CEPH =  15.2.15

Nodes = 3

Network = 2x100G / node

Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB

            nvme Samsung PM-1733 MZWLJ1T9HBJR  2TB

CPU = EPYC 7252

CEPH pools = 2 separate pools, one per disk type, with each disk split into 2 OSDs

Replica = 3
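
For reference, one way to build this kind of two-OSDs-per-device, per-class layout (not necessarily how it was done here; device paths, OSD IDs and the "ssd2n" class assignment are examples):

[CODE]# split each NVMe into two OSDs (example device paths)
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1    # 4TB PM1733
ceph-volume lvm batch --osds-per-device 2 /dev/nvme1n1    # 2TB PM1733

# give the 2TB OSDs their own device class so each pool can target one class
ceph osd crush rm-device-class osd.7 osd.8
ceph osd crush set-device-class ssd2n osd.7 osd.8[/CODE]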


The VMs don't do many writes, and I migrated the main testing VMs to the 2TB pool, which in turn fragments faster.


[SPOILER="ceph osd df"]

[CODE]ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3   nvme  1.74660   1.00000  1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186      up
10   nvme  1.74660   1.00000  1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151      up
 7  ssd2n  0.87329   1.00000  894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113      up
 8  ssd2n  0.87329   1.00000  894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143      up
 4   nvme  1.74660   1.00000  1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180      up
11   nvme  1.74660   1.00000  1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157      up
 2  ssd2n  0.87329   1.00000  894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121      up
 6  ssd2n  0.87329   1.00000  894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135      up
 5   nvme  1.74660   1.00000  1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176      up
 9   nvme  1.74660   1.00000  1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161      up
 0  ssd2n  0.87329   1.00000  894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135      up
 1  ssd2n  0.87329   1.00000  894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121      up
                       TOTAL   16 TiB  4.2 TiB  4.2 TiB   55 MiB   16 GiB   11 TiB  26.92
MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88[/CODE]

[/SPOILER]


[SPOILER="ceph osd crush tree"]

[CODE]ID   CLASS  WEIGHT    TYPE NAME

-12  ssd2n   5.23975  root default~ssd2n

 -9  ssd2n   1.74658      host pmx-s01~ssd2n

  7  ssd2n   0.87329          osd.7

  8  ssd2n   0.87329          osd.8

-10  ssd2n   1.74658      host pmx-s02~ssd2n

  2  ssd2n   0.87329          osd.2

  6  ssd2n   0.87329          osd.6

-11  ssd2n   1.74658      host pmx-s03~ssd2n

  0  ssd2n   0.87329          osd.0

  1  ssd2n   0.87329          osd.1

 -2   nvme  10.47958  root default~nvme

 -4   nvme   3.49319      host pmx-s01~nvme

  3   nvme   1.74660          osd.3

 10   nvme   1.74660          osd.10

 -6   nvme   3.49319      host pmx-s02~nvme

  4   nvme   1.74660          osd.4

 11   nvme   1.74660          osd.11

 -8   nvme   3.49319      host pmx-s03~nvme

  5   nvme   1.74660          osd.5

  9   nvme   1.74660          osd.9

 -1         15.71933  root default

 -3          5.23978      host pmx-s01

  3   nvme   1.74660          osd.3

 10   nvme   1.74660          osd.10

  7  ssd2n   0.87329          osd.7

  8  ssd2n   0.87329          osd.8

 -5          5.23978      host pmx-s02

  4   nvme   1.74660          osd.4

 11   nvme   1.74660          osd.11

  2  ssd2n   0.87329          osd.2

  6  ssd2n   0.87329          osd.6

 -7          5.23978      host pmx-s03

  5   nvme   1.74660          osd.5

  9   nvme   1.74660          osd.9

  0  ssd2n   0.87329          osd.0

  1  ssd2n   0.87329          osd.1[/CODE]

[/SPOILER]
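
The shadow roots in the tree above (default~nvme and default~ssd2n) are what device-class-based replicated rules create; the pools presumably use rules along these lines (rule names are illustrative, and the pool-to-class mapping is my assumption):

[CODE]# one replicated rule per device class, host failure domain
ceph osd crush rule create-replicated nvme_rule  default host nvme
ceph osd crush rule create-replicated ssd2n_rule default host ssd2n

# pools then pick the matching rule
ceph osd pool set machines    crush_rule nvme_rule
ceph osd pool set two_tb_pool crush_rule ssd2n_rule[/CODE]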


I did a lot of tests and recreated the pools and OSDs in many ways, but every time, within a matter of days, each OSD gets severely fragmented and loses up to 80% of its write performance (tested with many FIO runs, rados benches, OSD benches and RBD benches; an example is sketched below).
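
Roughly the kind of FIO job used (the image name and parameters here are illustrative, not the exact jobs run):

[CODE]# illustrative 4k random-write test against an image in the 2TB pool
fio --name=rbdwrite --ioengine=rbd --pool=two_tb_pool --rbdname=testimg \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based --group_reporting[/CODE]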


If I delete the OSDs on a node and let them resync from the other 2 nodes, everything is fine for a few days (0.1 - 0.2 BlueStore fragmentation), but it is soon back in a 0.8+ state. We are using only RBD block devices for the VMs, with ext4 filesystems and no swap on them.
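
The fragmentation numbers below are the BlueStore allocator score, queried per OSD on each node roughly like this (the OSD IDs are just those of one node):

[CODE]# BlueStore allocator fragmentation score for a single OSD
ceph daemon osd.3 bluestore allocator score block

# quick loop over the OSDs local to this node (sketch)
for id in 3 10 7 8; do
    printf 'osd.%s ' "$id"
    ceph daemon osd."$id" bluestore allocator score block | grep fragmentation_rating
done[/CODE]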


[SPOILER="CEPH bluestore fragmentation"]

[CODE]osd.3  "fragmentation_rating": 0.090421032864104897

osd.10 "fragmentation_rating": 0.093359029842755931

osd.7  "fragmentation_rating": 0.083908842581664561

osd.8  "fragmentation_rating": 0.067356428512611116


after 5 days


osd.3  "fragmentation_rating": 0.2567613553223777

osd.10 "fragmentation_rating": 0.25025098722978778

osd.7  "fragmentation_rating": 0.77481281469969676

osd.8  "fragmentation_rating": 0.82260745733487917


after few weeks


0.882571391878622

0.891192311159292

..etc


[/CODE]

[/SPOILER]
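
The per-OSD write numbers in the next spoiler are from the built-in OSD bench; the byte and block sizes in the output match the defaults (1 GiB written in 4 MiB blocks), i.e. something like:

[CODE]# default OSD write bench: 1 GiB total in 4 MiB blocks
ceph tell osd.0 bench

# or run it against every OSD at once
ceph tell osd.* bench[/CODE]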


[SPOILER="CEPH OSD bench degradation"]

[CODE]after recreating OSD's and syncing data to them


osd.0: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.41652934400000002,

    "bytes_per_sec": 2577829964.3638072,

    "iops": 614.60255726905041

}

osd.1: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.42986965700000002,

    "bytes_per_sec": 2497831160.0160232,

    "iops": 595.52935600662784

}

osd.2: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.424486221,

    "bytes_per_sec": 2529509253.4935308,

    "iops": 603.08200204218167

}

osd.3: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.31504493500000003,

    "bytes_per_sec": 3408218018.1693759,

    "iops": 812.58249716028593

}

osd.4: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.26949361700000002,

    "bytes_per_sec": 3984294084.412396,

    "iops": 949.92973432836436

}

osd.5: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.278853238,

    "bytes_per_sec": 3850562509.8748178,

    "iops": 918.04564234610029

}

osd.6: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.41076984700000002,

    "bytes_per_sec": 2613974301.7700129,

    "iops": 623.22003883600541

}

osd.7: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.42715592699999999,

    "bytes_per_sec": 2513699930.4705892,

    "iops": 599.31276571049432

}

osd.8: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.42246709999999998,

    "bytes_per_sec": 2541598680.7020001,

    "iops": 605.96434609937671

}

osd.9: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.27906448499999997,

    "bytes_per_sec": 3847647700.4947443,

    "iops": 917.35069763535125

}

osd.10: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.29398438999999998,

    "bytes_per_sec": 3652376998.6562896,

    "iops": 870.79453436286201

}

osd.11: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.29044762800000001,

    "bytes_per_sec": 3696851757.3846393,

    "iops": 881.39814314475996

}[/CODE]

[CODE]5 days later when 2TB pool fragmented


osd.0: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.2355760659999999,

    "bytes_per_sec": 869021223.01226258,

    "iops": 207.19080519968571

}

osd.1: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.2537920739999999,

    "bytes_per_sec": 856395447.27254355,

    "iops": 204.18058568776692

}

osd.2: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.109058316,

    "bytes_per_sec": 968156325.51462686,

    "iops": 230.82645547738716

}

osd.3: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.303978943,

    "bytes_per_sec": 3532290142.8734818,

    "iops": 842.16359683835071

}

osd.4: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.29256520600000002,

    "bytes_per_sec": 3670094057.5961719,

    "iops": 875.01861038116738

}

osd.5: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.34798205999999998,

    "bytes_per_sec": 3085624080.7356563,

    "iops": 735.67010897056014

}

osd.6: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.037829675,

    "bytes_per_sec": 1034603124.0627226,

    "iops": 246.6686067730719

}

osd.7: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.1761135300000001,

    "bytes_per_sec": 912957632.58501065,

    "iops": 217.66606154084459

}

osd.8: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 1.154277314,

    "bytes_per_sec": 930228646.9436754,

    "iops": 221.78379224388013

}

osd.9: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.27671432299999998,

    "bytes_per_sec": 3880326151.3860998,

    "iops": 925.14184746410842

}

osd.10: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.301649371,

    "bytes_per_sec": 3559569245.7121019,

    "iops": 848.66744177629994

}

osd.11: {

    "bytes_written": 1073741824,

    "blocksize": 4194304,

    "elapsed_sec": 0.269951261,

    "bytes_per_sec": 3977539575.1902046,

    "iops": 948.3193338370811

}[/CODE]

[CODE]Diff between them


4,6c4,6

<     "elapsed_sec": 0.41652934400000002,

<     "bytes_per_sec": 2577829964.3638072,

<     "iops": 614.60255726905041

---

>     "elapsed_sec": 1.2355760659999999,

>     "bytes_per_sec": 869021223.01226258,

>     "iops": 207.19080519968571

11,13c11,13

<     "elapsed_sec": 0.42986965700000002,

<     "bytes_per_sec": 2497831160.0160232,

<     "iops": 595.52935600662784

---

>     "elapsed_sec": 1.2537920739999999,

>     "bytes_per_sec": 856395447.27254355,

>     "iops": 204.18058568776692

18,20c18,20

<     "elapsed_sec": 0.424486221,

<     "bytes_per_sec": 2529509253.4935308,

<     "iops": 603.08200204218167

---

>     "elapsed_sec": 1.109058316,

>     "bytes_per_sec": 968156325.51462686,

>     "iops": 230.82645547738716

25,27c25,27

<     "elapsed_sec": 0.31504493500000003,

<     "bytes_per_sec": 3408218018.1693759,

<     "iops": 812.58249716028593

---

>     "elapsed_sec": 0.303978943,

>     "bytes_per_sec": 3532290142.8734818,

>     "iops": 842.16359683835071

32,34c32,34

<     "elapsed_sec": 0.26949361700000002,

<     "bytes_per_sec": 3984294084.412396,

<     "iops": 949.92973432836436

---

>     "elapsed_sec": 0.29256520600000002,

>     "bytes_per_sec": 3670094057.5961719,

>     "iops": 875.01861038116738

39,41c39,41

<     "elapsed_sec": 0.278853238,

<     "bytes_per_sec": 3850562509.8748178,

<     "iops": 918.04564234610029

---

>     "elapsed_sec": 0.34798205999999998,

>     "bytes_per_sec": 3085624080.7356563,

>     "iops": 735.67010897056014

46,48c46,48

<     "elapsed_sec": 0.41076984700000002,

<     "bytes_per_sec": 2613974301.7700129,

<     "iops": 623.22003883600541

---

>     "elapsed_sec": 1.037829675,

>     "bytes_per_sec": 1034603124.0627226,

>     "iops": 246.6686067730719

53,55c53,55

<     "elapsed_sec": 0.42715592699999999,

<     "bytes_per_sec": 2513699930.4705892,

<     "iops": 599.31276571049432

---

>     "elapsed_sec": 1.1761135300000001,

>     "bytes_per_sec": 912957632.58501065,

>     "iops": 217.66606154084459

60,62c60,62

<     "elapsed_sec": 0.42246709999999998,

<     "bytes_per_sec": 2541598680.7020001,

<     "iops": 605.96434609937671

---

>     "elapsed_sec": 1.154277314,

>     "bytes_per_sec": 930228646.9436754,

>     "iops": 221.78379224388013

67,69c67,69

<     "elapsed_sec": 0.27906448499999997,

<     "bytes_per_sec": 3847647700.4947443,

<     "iops": 917.35069763535125

---

>     "elapsed_sec": 0.27671432299999998,

>     "bytes_per_sec": 3880326151.3860998,

>     "iops": 925.14184746410842

74,76c74,76

<     "elapsed_sec": 0.29398438999999998,

<     "bytes_per_sec": 3652376998.6562896,

<     "iops": 870.79453436286201

---

>     "elapsed_sec": 0.301649371,

>     "bytes_per_sec": 3559569245.7121019,

>     "iops": 848.66744177629994

81,83c81,83

<     "elapsed_sec": 0.29044762800000001,

<     "bytes_per_sec": 3696851757.3846393,

<     "iops": 881.39814314475996

---

>     "elapsed_sec": 0.269951261,

>     "bytes_per_sec": 3977539575.1902046,

>     "iops": 948.3193338370811[/CODE]



[/SPOILER]


After some time, IOPS in the OSD bench drop as low as 108, at around 455 MB/s.


I noticed posts on the internet asking how to prevent or fix fragmentation, but none of them got replies, and the Red Hat Ceph documentation only says to contact Red Hat for assistance with fragmentation.


Does anyone know what causes the fragmentation and how to solve it without deleting the OSDs on each node one by one and letting them resync in between? (That operation takes about 90 minutes for all 3 nodes with 5 TB of data, but this is a testing cluster, so it would not be acceptable in production. A sketch of that workaround follows.)
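
For context, that workaround is the usual drain-and-recreate cycle, one node at a time (sketch only; OSD IDs and device paths are examples):

[CODE]# on one node, once the cluster is HEALTH_OK (example IDs/devices)
ceph osd out 7 8
systemctl stop ceph-osd@7 ceph-osd@8
ceph osd purge 7 --yes-i-really-mean-it
ceph osd purge 8 --yes-i-really-mean-it
ceph-volume lvm zap /dev/nvme1n1 --destroy

# recreate the OSDs and wait for backfill before moving to the next node
ceph-volume lvm batch --osds-per-device 2 /dev/nvme1n1[/CODE]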


I tried changing these values (verification sketch after the list):

ceph config set osd osd_memory_target 17179869184

ceph config set osd osd_memory_expected_fragmentation 0.800000

ceph config set osd osd_memory_base 2147483648

ceph config set osd osd_memory_cache_min 805306368

ceph config set osd bluestore_cache_size 17179869184

ceph config set osd bluestore_cache_size_ssd 17179869184
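
To confirm the overrides actually reach the running daemons, something along these lines can be used (a sketch, not output from this cluster):

[CODE]# list the central-config overrides
ceph config dump | grep -E 'osd_memory|bluestore_cache'

# check what a specific running OSD is actually using
ceph config show osd.0 osd_memory_target
ceph daemon osd.0 config get bluestore_cache_size[/CODE]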


The cluster is otherwise barely in use.


[SPOILER="RADOS df"]

[CODE]POOL_NAME                 USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
cephfs_data                0 B        0       0       0                   0        0         0         0      0 B          0      0 B         0 B          0 B
cephfs_metadata         15 MiB       22       0      66                   0        0         0       176  716 KiB        206  195 KiB         0 B          0 B
containers              12 KiB        3       0       9                   0        0         0  14890830  834 GiB   11371993  641 GiB         0 B          0 B
device_health_metrics  1.2 MiB        6       0      18                   0        0         0      1389  3.6 MiB       1713  1.4 MiB         0 B          0 B
machines               2.4 TiB   221068       0  663204                   0        0         0  35032709  3.3 TiB  433971410  7.3 TiB         0 B          0 B
two_tb_pool            1.8 TiB   186662       0  559986                   0        0         0  12742384  864 GiB  217071088  5.0 TiB         0 B          0 B

total_objects    407761
total_used       4.2 TiB
total_avail      11 TiB
total_space      16 TiB[/CODE]

[/SPOILER]


[SPOILER="ceph -s"]

[CODE]  cluster:

    id:     REMOVED-for-privacy

    health: HEALTH_OK


  services:

    mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)

    mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02

    mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby

    osd: 12 osds: 12 up (since 4h), 12 in (since 3d)


  data:

    pools:   6 pools, 593 pgs

    objects: 407.76k objects, 1.5 TiB

    usage:   4.2 TiB used, 11 TiB / 16 TiB avail

    pgs:     593 active+clean


  io:

    client:   55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr[/CODE]

[/SPOILER]
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


