3-node CEPH PVE hyper-converged cluster: serious fragmentation and performance loss in a matter of days



Proxmox = 6.4-8

CEPH =  15.2.15

Nodes = 3

Network = 2x100G / node

Disk = nvme Samsung PM-1733 MZWLJ3T8HBLS 4TB

            nvme Samsung PM-1733 MZWLJ1T9HBJR  2TB

CPU = EPYC 7252

CEPH pools = 2 separate pools, one per disk type; each disk is split into 2 OSDs

Replica = 3

The VMs don't do many writes. I migrated the main testing VMs to the 2TB
pool, which in turn fragments faster.

[SPOILER="ceph osd df"]


[CODE]
ID  CLASS   WEIGHT  REWEIGHT     SIZE  RAW USE     DATA     OMAP     META    AVAIL   %USE   VAR  PGS  STATUS
 3   nvme  1.74660   1.00000  1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186      up
10   nvme  1.74660   1.00000  1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151      up
 7  ssd2n  0.87329   1.00000  894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113      up
 8  ssd2n  0.87329   1.00000  894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143      up
 4   nvme  1.74660   1.00000  1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180      up
11   nvme  1.74660   1.00000  1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157      up
 2  ssd2n  0.87329   1.00000  894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121      up
 6  ssd2n  0.87329   1.00000  894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135      up
 5   nvme  1.74660   1.00000  1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176      up
 9   nvme  1.74660   1.00000  1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161      up
 0  ssd2n  0.87329   1.00000  894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135      up
 1  ssd2n  0.87329   1.00000  894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121      up
                       TOTAL   16 TiB  4.2 TiB  4.2 TiB   55 MiB   16 GiB   11 TiB  26.92
MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88
[/CODE]
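
The imbalance between the two device classes is easier to see when the %USE column is averaged per class. The sketch below does that with values hand-copied from the ceph osd df output above. It also reproduces the VAR column under the assumption that VAR is each OSD's %USE divided by the cluster-wide %USE (26.92 in the TOTAL row). The function names are illustrative only.

```python
from statistics import mean

# (osd id, device class, %USE), hand-copied from the ceph osd df output above
USE = [
    (3, "nvme", 24.18), (10, "nvme", 21.38), (7, "ssd2n", 31.19),
    (8, "ssd2n", 39.22), (4, "nvme", 23.85), (11, "nvme", 21.72),
    (2, "ssd2n", 33.18), (6, "ssd2n", 37.23), (5, "nvme", 23.18),
    (9, "nvme", 22.38), (0, "ssd2n", 37.07), (1, "ssd2n", 33.35),
]
CLUSTER_USE = 26.92  # %USE from the TOTAL row


def mean_use_by_class(rows):
    """Average %USE per device class, rounded to two decimals."""
    by_class = {}
    for _osd, cls, use in rows:
        by_class.setdefault(cls, []).append(use)
    return {cls: round(mean(vals), 2) for cls, vals in by_class.items()}


def var(use, cluster_use=CLUSTER_USE):
    """Assumption: VAR = OSD utilization / cluster-wide utilization."""
    return round(use / cluster_use, 2)


print(mean_use_by_class(USE))
print({f"osd.{osd}": var(use) for osd, _cls, use in USE})
```

With these inputs the ssd2n OSDs come out noticeably fuller on average than the nvme OSDs, which fits the observation that the 2TB pool is the one carrying the migrated test VMs.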


[SPOILER="ceph osd crush tree"]


[CODE]
-12  ssd2n   5.23975  root default~ssd2n
 -9  ssd2n   1.74658      host pmx-s01~ssd2n
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
-10  ssd2n   1.74658      host pmx-s02~ssd2n
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
-11  ssd2n   1.74658      host pmx-s03~ssd2n
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1
 -2   nvme  10.47958  root default~nvme
 -4   nvme   3.49319      host pmx-s01~nvme
  3   nvme   1.74660          osd.3
 10   nvme   1.74660          osd.10
 -6   nvme   3.49319      host pmx-s02~nvme
  4   nvme   1.74660          osd.4
 11   nvme   1.74660          osd.11
 -8   nvme   3.49319      host pmx-s03~nvme
  5   nvme   1.74660          osd.5
  9   nvme   1.74660          osd.9
 -1         15.71933  root default
 -3          5.23978      host pmx-s01
  3   nvme   1.74660          osd.3
 10   nvme   1.74660          osd.10
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
 -5          5.23978      host pmx-s02
  4   nvme   1.74660          osd.4
 11   nvme   1.74660          osd.11
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
 -7          5.23978      host pmx-s03
  5   nvme   1.74660          osd.5
  9   nvme   1.74660          osd.9
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1
[/CODE]


I ran a lot of tests and recreated the pools and OSDs in many ways, but
every time, within a matter of days, each OSD becomes severely fragmented
and loses up to 80% of its write performance (tested with many fio runs,
rados bench, osd bench and RBD bench).

If I delete the OSDs from a node and let it resync from the other 2 nodes,
it is perfect for a few days (0.1 - 0.2 bluestore fragmentation), but it is
soon back in a 0.8+ state. We are using only block devices for the VMs,
with ext4 filesystems and no swap on them.
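
The fragmentation figures quoted in this post have the shape of BlueStore's allocator score output. Below is a minimal sketch for collecting them on an OSD host. It assumes that `ceph daemon osd.<id> bluestore allocator score block` is available on the node that runs the OSD and that it prints a JSON object with a `fragmentation_rating` key. The function names and the 0.8 threshold are illustrative, not part of any Ceph API.

```python
import json
import subprocess


def parse_rating(raw: str) -> float:
    """Extract fragmentation_rating from the JSON text printed by the command."""
    return float(json.loads(raw)["fragmentation_rating"])


def osd_fragmentation(osd_id: int) -> float:
    """Query one OSD through its local admin socket (run on the OSD's host)."""
    out = subprocess.run(
        ["ceph", "daemon", f"osd.{osd_id}",
         "bluestore", "allocator", "score", "block"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_rating(out)


def flag_fragmented(ratings: dict, threshold: float = 0.8) -> list:
    """Return the names of OSDs whose rating is at or above the threshold."""
    return sorted(name for name, score in ratings.items() if score >= threshold)
```

For example, `flag_fragmented({f"osd.{i}": osd_fragmentation(i) for i in (7, 8)})` would list whichever of those two OSDs has crossed the threshold. Running it from cron and logging the result would show how quickly the rating climbs after a rebuild.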

[SPOILER="CEPH bluestore fragmentation"]

[CODE]osd.3  "fragmentation_rating": 0.090421032864104897
osd.10 "fragmentation_rating": 0.093359029842755931
osd.7  "fragmentation_rating": 0.083908842581664561
osd.8  "fragmentation_rating": 0.067356428512611116

after 5 days

osd.3  "fragmentation_rating": 0.2567613553223777
osd.10 "fragmentation_rating": 0.25025098722978778
osd.7  "fragmentation_rating": 0.77481281469969676
osd.8  "fragmentation_rating": 0.82260745733487917

after few weeks
[/CODE]

[SPOILER="CEPH OSD bench degradation"]

[CODE]after recreating OSD's and syncing data to them

osd.0: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.41652934400000002,
    "bytes_per_sec": 2577829964.3638072,
    "iops": 614.60255726905041
}
osd.1: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42986965700000002,
    "bytes_per_sec": 2497831160.0160232,
    "iops": 595.52935600662784
}
osd.2: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.424486221,
    "bytes_per_sec": 2529509253.4935308,
    "iops": 603.08200204218167
}
osd.3: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.31504493500000003,
    "bytes_per_sec": 3408218018.1693759,
    "iops": 812.58249716028593
}
osd.4: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.26949361700000002,
    "bytes_per_sec": 3984294084.412396,
    "iops": 949.92973432836436
}
osd.5: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.278853238,
    "bytes_per_sec": 3850562509.8748178,
    "iops": 918.04564234610029
}
osd.6: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.41076984700000002,
    "bytes_per_sec": 2613974301.7700129,
    "iops": 623.22003883600541
}
osd.7: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42715592699999999,
    "bytes_per_sec": 2513699930.4705892,
    "iops": 599.31276571049432
}
osd.8: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.42246709999999998,
    "bytes_per_sec": 2541598680.7020001,
    "iops": 605.96434609937671
}
osd.9: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.27906448499999997,
    "bytes_per_sec": 3847647700.4947443,
    "iops": 917.35069763535125
}
osd.10: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29398438999999998,
    "bytes_per_sec": 3652376998.6562896,
    "iops": 870.79453436286201
}
osd.11: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29044762800000001,
    "bytes_per_sec": 3696851757.3846393,
    "iops": 881.39814314475996
}
[/CODE]

[CODE]5 days later when 2TB pool fragmented

osd.0: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.2355760659999999,
    "bytes_per_sec": 869021223.01226258,
    "iops": 207.19080519968571
}
osd.1: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.2537920739999999,
    "bytes_per_sec": 856395447.27254355,
    "iops": 204.18058568776692
}
osd.2: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.109058316,
    "bytes_per_sec": 968156325.51462686,
    "iops": 230.82645547738716
}
osd.3: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.303978943,
    "bytes_per_sec": 3532290142.8734818,
    "iops": 842.16359683835071
}
osd.4: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.29256520600000002,
    "bytes_per_sec": 3670094057.5961719,
    "iops": 875.01861038116738
}
osd.5: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.34798205999999998,
    "bytes_per_sec": 3085624080.7356563,
    "iops": 735.67010897056014
}
osd.6: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.037829675,
    "bytes_per_sec": 1034603124.0627226,
    "iops": 246.6686067730719
}
osd.7: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.1761135300000001,
    "bytes_per_sec": 912957632.58501065,
    "iops": 217.66606154084459
}
osd.8: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 1.154277314,
    "bytes_per_sec": 930228646.9436754,
    "iops": 221.78379224388013
}
osd.9: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.27671432299999998,
    "bytes_per_sec": 3880326151.3860998,
    "iops": 925.14184746410842
}
osd.10: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.301649371,
    "bytes_per_sec": 3559569245.7121019,
    "iops": 848.66744177629994
}
osd.11: {
    "bytes_written": 1073741824,
    "blocksize": 4194304,
    "elapsed_sec": 0.269951261,
    "bytes_per_sec": 3977539575.1902046,
    "iops": 948.3193338370811
}
[/CODE]

[CODE]Diff between them

<     "elapsed_sec": 0.41652934400000002,
<     "bytes_per_sec": 2577829964.3638072,
<     "iops": 614.60255726905041

>     "elapsed_sec": 1.2355760659999999,
>     "bytes_per_sec": 869021223.01226258,
>     "iops": 207.19080519968571

<     "elapsed_sec": 0.42986965700000002,
<     "bytes_per_sec": 2497831160.0160232,
<     "iops": 595.52935600662784

>     "elapsed_sec": 1.2537920739999999,
>     "bytes_per_sec": 856395447.27254355,
>     "iops": 204.18058568776692

<     "elapsed_sec": 0.424486221,
<     "bytes_per_sec": 2529509253.4935308,
<     "iops": 603.08200204218167

>     "elapsed_sec": 1.109058316,
>     "bytes_per_sec": 968156325.51462686,
>     "iops": 230.82645547738716

<     "elapsed_sec": 0.31504493500000003,
<     "bytes_per_sec": 3408218018.1693759,
<     "iops": 812.58249716028593

>     "elapsed_sec": 0.303978943,
>     "bytes_per_sec": 3532290142.8734818,
>     "iops": 842.16359683835071

<     "elapsed_sec": 0.26949361700000002,
<     "bytes_per_sec": 3984294084.412396,
<     "iops": 949.92973432836436

>     "elapsed_sec": 0.29256520600000002,
>     "bytes_per_sec": 3670094057.5961719,
>     "iops": 875.01861038116738

<     "elapsed_sec": 0.278853238,
<     "bytes_per_sec": 3850562509.8748178,
<     "iops": 918.04564234610029

>     "elapsed_sec": 0.34798205999999998,
>     "bytes_per_sec": 3085624080.7356563,
>     "iops": 735.67010897056014

<     "elapsed_sec": 0.41076984700000002,
<     "bytes_per_sec": 2613974301.7700129,
<     "iops": 623.22003883600541

>     "elapsed_sec": 1.037829675,
>     "bytes_per_sec": 1034603124.0627226,
>     "iops": 246.6686067730719

<     "elapsed_sec": 0.42715592699999999,
<     "bytes_per_sec": 2513699930.4705892,
<     "iops": 599.31276571049432

>     "elapsed_sec": 1.1761135300000001,
>     "bytes_per_sec": 912957632.58501065,
>     "iops": 217.66606154084459

<     "elapsed_sec": 0.42246709999999998,
<     "bytes_per_sec": 2541598680.7020001,
<     "iops": 605.96434609937671

>     "elapsed_sec": 1.154277314,
>     "bytes_per_sec": 930228646.9436754,
>     "iops": 221.78379224388013

<     "elapsed_sec": 0.27906448499999997,
<     "bytes_per_sec": 3847647700.4947443,
<     "iops": 917.35069763535125

>     "elapsed_sec": 0.27671432299999998,
>     "bytes_per_sec": 3880326151.3860998,
>     "iops": 925.14184746410842

<     "elapsed_sec": 0.29398438999999998,
<     "bytes_per_sec": 3652376998.6562896,
<     "iops": 870.79453436286201

>     "elapsed_sec": 0.301649371,
>     "bytes_per_sec": 3559569245.7121019,
>     "iops": 848.66744177629994

<     "elapsed_sec": 0.29044762800000001,
<     "bytes_per_sec": 3696851757.3846393,
<     "iops": 881.39814314475996

>     "elapsed_sec": 0.269951261,
>     "bytes_per_sec": 3977539575.1902046,
>     "iops": 948.3193338370811[/CODE]


After some more time, IOPS in osd bench drop as low as 108, at about 455MB/s.
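
To put numbers on the degradation, the sketch below compares the bytes_per_sec values of the six ssd2n OSDs from the two osd bench runs above. It then converts the 108 IOPS worst case to bytes per second using the 4194304-byte block size that osd bench reports. The values are hand-copied from this post and the helper name `loss_pct` is illustrative.

```python
BLOCK_SIZE = 4194304  # bytes, the "blocksize" reported by osd bench

# bytes_per_sec of the ssd2n OSDs, hand-copied from the two osd bench runs above
FRESH = {0: 2577829964.3638072, 1: 2497831160.0160232, 2: 2529509253.4935308,
         6: 2613974301.7700129, 7: 2513699930.4705892, 8: 2541598680.7020001}
AGED = {0: 869021223.01226258, 1: 856395447.27254355, 2: 968156325.51462686,
        6: 1034603124.0627226, 7: 912957632.58501065, 8: 930228646.9436754}


def loss_pct(before, after):
    """Throughput lost between two measurements, in percent."""
    return (1 - after / before) * 100


for osd in sorted(FRESH):
    print(f"osd.{osd}: {loss_pct(FRESH[osd], AGED[osd]):.1f}% slower after 5 days")

# Worst case from the post: 108 IOPS at the same 4 MiB block size
worst_bps = 108 * BLOCK_SIZE
fresh_avg = sum(FRESH.values()) / len(FRESH)
print(f"108 IOPS = {worst_bps / 1e6:.0f} MB/s, "
      f"{loss_pct(fresh_avg, worst_bps):.0f}% below the fresh ssd2n average")
```

With these inputs the five-day loss on the ssd2n OSDs is roughly 60 to 66 percent, and the 108 IOPS worst case is a bit over 80 percent below the freshly rebuilt average. Both are in line with the "up to 80%" figure quoted earlier.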

I found posts on the internet asking how to prevent or fix fragmentation,
but none of them got any replies, and the Red Hat Ceph documentation only
says to call Red Hat for assistance with fragmentation.

Does anyone know what causes the fragmentation and how to solve it without
deleting the OSDs on each node one by one and resyncing in between? (The
operation takes 90 minutes for all 3 nodes with 5TB of data, but this is a
testing cluster; for production that is not acceptable.)

I tried changing these values:

ceph config set osd osd_memory_target 17179869184
ceph config set osd osd_memory_expected_fragmentation 0.800000
ceph config set osd osd_memory_base 2147483648
ceph config set osd osd_memory_cache_min 805306368
ceph config set osd bluestore_cache_size 17179869184
ceph config set osd bluestore_cache_size_ssd 17179869184
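
The byte values in these settings are easier to judge in binary units. The small sketch below converts them and multiplies the memory target by the number of OSDs per node. The 4 OSDs per node is read off the crush tree above (2 nvme plus 2 ssd2n per host), and the helper name `human` is illustrative.

```python
GIB = 2**30
MIB = 2**20

# Byte values from the ceph config set commands above
SETTINGS = {
    "osd_memory_target": 17179869184,
    "osd_memory_base": 2147483648,
    "osd_memory_cache_min": 805306368,
    "bluestore_cache_size": 17179869184,
    "bluestore_cache_size_ssd": 17179869184,
}
OSDS_PER_NODE = 4  # from the crush tree: 2 nvme + 2 ssd2n OSDs per host


def human(n):
    """Format a byte count in GiB, or in MiB when it is below 1 GiB."""
    return f"{n / GIB:g} GiB" if n >= GIB else f"{n / MIB:g} MiB"


for name, value in SETTINGS.items():
    print(f"{name} = {human(value)}")

per_node = OSDS_PER_NODE * SETTINGS["osd_memory_target"]
print(f"memory target summed over one node's OSDs: {human(per_node)}")
```

So the memory target and both cache sizes are 16 GiB per OSD, the base is 2 GiB and the cache minimum is 768 MiB. Four OSDs per node add up to a 64 GiB memory target per host.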

The cluster is really not in use.



[CODE]
POOL_NAME                 USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED    RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
cephfs_data                0 B        0       0       0                   0        0         0         0      0 B          0      0 B         0 B          0 B
cephfs_metadata         15 MiB       22       0      66                   0        0         0       176  716 KiB        206  195 KiB         0 B          0 B
containers              12 KiB        3       0       9                   0        0         0  14890830  834 GiB   11371993  641 GiB         0 B          0 B
device_health_metrics  1.2 MiB        6       0      18                   0        0         0      1389  3.6 MiB       1713  1.4 MiB         0 B          0 B
machines               2.4 TiB   221068       0  663204                   0        0         0  35032709  3.3 TiB  433971410  7.3 TiB         0 B          0 B
two_tb_pool            1.8 TiB   186662       0  559986                   0        0         0  12742384  864 GiB  217071088  5.0 TiB         0 B          0 B

total_objects    407761
total_used       4.2 TiB
total_avail      11 TiB
total_space      16 TiB
[/CODE]


[SPOILER="ceph -s"]

[CODE]  cluster:
    id:     REMOVED-for-privacy
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)
    mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02
    mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 4h), 12 in (since 3d)

  data:
    pools:   6 pools, 593 pgs
    objects: 407.76k objects, 1.5 TiB
    usage:   4.2 TiB used, 11 TiB / 16 TiB avail
    pgs:     593 active+clean

  io:
    client:   55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr[/CODE]

ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
