Proxmox = 6.4-8
Ceph = 15.2.15
Nodes = 3
Network = 2x 100G per node
Disks = NVMe Samsung PM-1733 MZWLJ3T8HBLS 4TB and NVMe Samsung PM-1733 MZWLJ1T9HBJR 2TB
CPU = EPYC 7252
Ceph pools = 2 separate pools, one per disk type, with each disk split into 2 OSDs
Replica = 3

The VMs don't do many writes; I migrated the main testing VMs to the 2TB pool, which in turn fragments faster.

[SPOILER="ceph osd df"]
[CODE]ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3  nvme   1.74660  1.00000   1.7 TiB  432 GiB  431 GiB  4.3 MiB  1.3 GiB  1.3 TiB  24.18  0.90  186  up
10  nvme   1.74660  1.00000   1.7 TiB  382 GiB  381 GiB  599 KiB  1.4 GiB  1.4 TiB  21.38  0.79  151  up
 7  ssd2n  0.87329  1.00000   894 GiB  279 GiB  278 GiB  2.0 MiB  1.2 GiB  615 GiB  31.19  1.16  113  up
 8  ssd2n  0.87329  1.00000   894 GiB  351 GiB  349 GiB  5.8 MiB  1.2 GiB  544 GiB  39.22  1.46  143  up
 4  nvme   1.74660  1.00000   1.7 TiB  427 GiB  425 GiB  9.6 MiB  1.4 GiB  1.3 TiB  23.85  0.89  180  up
11  nvme   1.74660  1.00000   1.7 TiB  388 GiB  387 GiB  3.5 MiB  1.5 GiB  1.4 TiB  21.72  0.81  157  up
 2  ssd2n  0.87329  1.00000   894 GiB  297 GiB  296 GiB  4.1 MiB  1.1 GiB  598 GiB  33.18  1.23  121  up
 6  ssd2n  0.87329  1.00000   894 GiB  333 GiB  332 GiB  8.6 MiB  1.2 GiB  561 GiB  37.23  1.38  135  up
 5  nvme   1.74660  1.00000   1.7 TiB  415 GiB  413 GiB  5.9 MiB  1.3 GiB  1.3 TiB  23.18  0.86  176  up
 9  nvme   1.74660  1.00000   1.7 TiB  400 GiB  399 GiB  4.3 MiB  1.7 GiB  1.4 TiB  22.38  0.83  161  up
 0  ssd2n  0.87329  1.00000   894 GiB  332 GiB  330 GiB  4.3 MiB  1.3 GiB  563 GiB  37.07  1.38  135  up
 1  ssd2n  0.87329  1.00000   894 GiB  298 GiB  297 GiB  1.7 MiB  1.3 GiB  596 GiB  33.35  1.24  121  up
                    TOTAL     16 TiB   4.2 TiB  4.2 TiB  55 MiB   16 GiB   11 TiB   26.92
MIN/MAX VAR: 0.79/1.46  STDDEV: 6.88[/CODE]
[/SPOILER]

[SPOILER="ceph osd crush tree"]
[CODE]ID   CLASS  WEIGHT    TYPE NAME
-12  ssd2n   5.23975  root default~ssd2n
 -9  ssd2n   1.74658      host pmx-s01~ssd2n
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
-10  ssd2n   1.74658      host pmx-s02~ssd2n
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
-11  ssd2n   1.74658      host pmx-s03~ssd2n
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1
 -2  nvme   10.47958  root default~nvme
 -4  nvme    3.49319      host pmx-s01~nvme
  3  nvme    1.74660          osd.3
 10  nvme    1.74660          osd.10
 -6  nvme    3.49319      host pmx-s02~nvme
  4  nvme    1.74660          osd.4
 11  nvme    1.74660          osd.11
 -8  nvme    3.49319      host pmx-s03~nvme
  5  nvme    1.74660          osd.5
  9  nvme    1.74660          osd.9
 -1         15.71933  root default
 -3          5.23978      host pmx-s01
  3  nvme    1.74660          osd.3
 10  nvme    1.74660          osd.10
  7  ssd2n   0.87329          osd.7
  8  ssd2n   0.87329          osd.8
 -5          5.23978      host pmx-s02
  4  nvme    1.74660          osd.4
 11  nvme    1.74660          osd.11
  2  ssd2n   0.87329          osd.2
  6  ssd2n   0.87329          osd.6
 -7          5.23978      host pmx-s03
  5  nvme    1.74660          osd.5
  9  nvme    1.74660          osd.9
  0  ssd2n   0.87329          osd.0
  1  ssd2n   0.87329          osd.1[/CODE]
[/SPOILER]

I did a lot of tests and recreated the pools and OSDs in many ways, but every time, within a matter of days, each OSD gets severely fragmented and loses up to 80% of its write performance (tested with many fio tests, rados benches, osd benches and RBD benches). If I delete the OSDs on one node and let them resync from the other 2 nodes, everything is fine for a few days (BlueStore fragmentation around 0.1-0.2), but it is soon back in the 0.8+ state. We are using only block devices for the VMs, with ext4 filesystems and no swap on them.
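For reference, the fragmentation ratings and per-OSD bench numbers in the spoilers below were collected roughly like this (only a sketch, with osd.3 as the example):

[CODE]# BlueStore fragmentation rating, via the admin socket on the node hosting the OSD
ceph daemon osd.3 bluestore allocator score block

# simple per-OSD write bench (defaults: 1 GiB written in 4 MiB blocks)
ceph tell osd.3 bench[/CODE]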
[SPOILER="CEPH bluestore fragmentation"]
[CODE]osd.3  "fragmentation_rating": 0.090421032864104897
osd.10 "fragmentation_rating": 0.093359029842755931
osd.7  "fragmentation_rating": 0.083908842581664561
osd.8  "fragmentation_rating": 0.067356428512611116

after 5 days

osd.3  "fragmentation_rating": 0.2567613553223777
osd.10 "fragmentation_rating": 0.25025098722978778
osd.7  "fragmentation_rating": 0.77481281469969676
osd.8  "fragmentation_rating": 0.82260745733487917

after few weeks

0.882571391878622
0.891192311159292
..etc[/CODE]
[/SPOILER]

[SPOILER="CEPH OSD bench degradation"]
[CODE]after recreating OSD's and syncing data to them

osd.0:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.41652934400000002, "bytes_per_sec": 2577829964.3638072, "iops": 614.60255726905041 }
osd.1:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.42986965700000002, "bytes_per_sec": 2497831160.0160232, "iops": 595.52935600662784 }
osd.2:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.424486221, "bytes_per_sec": 2529509253.4935308, "iops": 603.08200204218167 }
osd.3:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.31504493500000003, "bytes_per_sec": 3408218018.1693759, "iops": 812.58249716028593 }
osd.4:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.26949361700000002, "bytes_per_sec": 3984294084.412396, "iops": 949.92973432836436 }
osd.5:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.278853238, "bytes_per_sec": 3850562509.8748178, "iops": 918.04564234610029 }
osd.6:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.41076984700000002, "bytes_per_sec": 2613974301.7700129, "iops": 623.22003883600541 }
osd.7:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.42715592699999999, "bytes_per_sec": 2513699930.4705892, "iops": 599.31276571049432 }
osd.8:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.42246709999999998, "bytes_per_sec": 2541598680.7020001, "iops": 605.96434609937671 }
osd.9:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.27906448499999997, "bytes_per_sec": 3847647700.4947443, "iops": 917.35069763535125 }
osd.10: { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.29398438999999998, "bytes_per_sec": 3652376998.6562896, "iops": 870.79453436286201 }
osd.11: { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.29044762800000001, "bytes_per_sec": 3696851757.3846393, "iops": 881.39814314475996 }[/CODE]

[CODE]5 days later when 2TB pool fragmented

osd.0:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.2355760659999999, "bytes_per_sec": 869021223.01226258, "iops": 207.19080519968571 }
osd.1:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.2537920739999999, "bytes_per_sec": 856395447.27254355, "iops": 204.18058568776692 }
osd.2:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.109058316, "bytes_per_sec": 968156325.51462686, "iops": 230.82645547738716 }
osd.3:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.303978943, "bytes_per_sec": 3532290142.8734818, "iops": 842.16359683835071 }
osd.4:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.29256520600000002, "bytes_per_sec": 3670094057.5961719, "iops": 875.01861038116738 }
osd.5:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.34798205999999998, "bytes_per_sec": 3085624080.7356563, "iops": 735.67010897056014 }
osd.6:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.037829675, "bytes_per_sec": 1034603124.0627226, "iops": 246.6686067730719 }
osd.7:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.1761135300000001, "bytes_per_sec": 912957632.58501065, "iops": 217.66606154084459 }
osd.8:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 1.154277314, "bytes_per_sec": 930228646.9436754, "iops": 221.78379224388013 }
osd.9:  { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.27671432299999998, "bytes_per_sec": 3880326151.3860998, "iops": 925.14184746410842 }
osd.10: { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.301649371, "bytes_per_sec": 3559569245.7121019, "iops": 848.66744177629994 }
osd.11: { "bytes_written": 1073741824, "blocksize": 4194304, "elapsed_sec": 0.269951261, "bytes_per_sec": 3977539575.1902046, "iops": 948.3193338370811 }[/CODE]

[CODE]Diff between them

4,6c4,6
< "elapsed_sec": 0.41652934400000002,
< "bytes_per_sec": 2577829964.3638072,
< "iops": 614.60255726905041
---
> "elapsed_sec": 1.2355760659999999,
> "bytes_per_sec": 869021223.01226258,
> "iops": 207.19080519968571
11,13c11,13
< "elapsed_sec": 0.42986965700000002,
< "bytes_per_sec": 2497831160.0160232,
< "iops": 595.52935600662784
---
> "elapsed_sec": 1.2537920739999999,
> "bytes_per_sec": 856395447.27254355,
> "iops": 204.18058568776692
18,20c18,20
< "elapsed_sec": 0.424486221,
< "bytes_per_sec": 2529509253.4935308,
< "iops": 603.08200204218167
---
> "elapsed_sec": 1.109058316,
> "bytes_per_sec": 968156325.51462686,
> "iops": 230.82645547738716
25,27c25,27
< "elapsed_sec": 0.31504493500000003,
< "bytes_per_sec": 3408218018.1693759,
< "iops": 812.58249716028593
---
> "elapsed_sec": 0.303978943,
> "bytes_per_sec": 3532290142.8734818,
> "iops": 842.16359683835071
32,34c32,34
< "elapsed_sec": 0.26949361700000002,
< "bytes_per_sec": 3984294084.412396,
< "iops": 949.92973432836436
---
> "elapsed_sec": 0.29256520600000002,
> "bytes_per_sec": 3670094057.5961719,
> "iops": 875.01861038116738
39,41c39,41
< "elapsed_sec": 0.278853238,
< "bytes_per_sec": 3850562509.8748178,
< "iops": 918.04564234610029
---
> "elapsed_sec": 0.34798205999999998,
> "bytes_per_sec": 3085624080.7356563,
> "iops": 735.67010897056014
46,48c46,48
< "elapsed_sec": 0.41076984700000002,
< "bytes_per_sec": 2613974301.7700129,
< "iops": 623.22003883600541
---
> "elapsed_sec": 1.037829675,
> "bytes_per_sec": 1034603124.0627226,
> "iops": 246.6686067730719
53,55c53,55
< "elapsed_sec": 0.42715592699999999,
< "bytes_per_sec": 2513699930.4705892,
< "iops": 599.31276571049432
---
> "elapsed_sec": 1.1761135300000001,
> "bytes_per_sec": 912957632.58501065,
> "iops": 217.66606154084459
60,62c60,62
< "elapsed_sec": 0.42246709999999998,
< "bytes_per_sec": 2541598680.7020001,
< "iops": 605.96434609937671
---
> "elapsed_sec": 1.154277314,
> "bytes_per_sec": 930228646.9436754,
> "iops": 221.78379224388013
67,69c67,69
< "elapsed_sec": 0.27906448499999997,
< "bytes_per_sec": 3847647700.4947443,
< "iops": 917.35069763535125
---
> "elapsed_sec": 0.27671432299999998,
> "bytes_per_sec": 3880326151.3860998,
> "iops": 925.14184746410842
74,76c74,76
< "elapsed_sec": 0.29398438999999998,
< "bytes_per_sec": 3652376998.6562896,
< "iops": 870.79453436286201
---
> "elapsed_sec": 0.301649371,
> "bytes_per_sec": 3559569245.7121019,
> "iops": 848.66744177629994
81,83c81,83
< "elapsed_sec": 0.29044762800000001,
< "bytes_per_sec": 3696851757.3846393,
< "iops": 881.39814314475996
---
> "elapsed_sec": 0.269951261,
> "bytes_per_sec": 3977539575.1902046,
> "iops": 948.3193338370811[/CODE]
[/SPOILER]

After some more time, IOPS in the osd bench drop as low as 108, at about 455 MB/s.

I have noticed posts on the internet asking how to prevent or fix fragmentation, but they got no replies, and the Red Hat Ceph documentation only says to contact Red Hat for assistance with fragmentation.

Does anyone know what causes the fragmentation and how to solve it without deleting the OSDs on each node one by one and resyncing in between? (That operation takes about 90 minutes for all 3 nodes with 5 TB of data; this is a testing cluster, but for production it would not be acceptable.)

I tried changing these values:

[CODE]ceph config set osd osd_memory_target 17179869184
ceph config set osd osd_memory_expected_fragmentation 0.800000
ceph config set osd osd_memory_base 2147483648
ceph config set osd osd_memory_cache_min 805306368
ceph config set osd bluestore_cache_size 17179869184
ceph config set osd bluestore_cache_size_ssd 17179869184[/CODE]

The cluster is really not in use.

[SPOILER="RADOS df"]
[CODE]POOL_NAME              USED     OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS    RD       WR_OPS     WR       USED COMPR  UNDER COMPR
cephfs_data            0 B      0        0       0       0                   0        0         0         0 B      0          0 B      0 B         0 B
cephfs_metadata        15 MiB   22       0       66      0                   0        0         176       716 KiB  206        195 KiB  0 B         0 B
containers             12 KiB   3        0       9       0                   0        0         14890830  834 GiB  11371993   641 GiB  0 B         0 B
device_health_metrics  1.2 MiB  6        0       18      0                   0        0         1389      3.6 MiB  1713       1.4 MiB  0 B         0 B
machines               2.4 TiB  221068   0       663204  0                   0        0         35032709  3.3 TiB  433971410  7.3 TiB  0 B         0 B
two_tb_pool            1.8 TiB  186662   0       559986  0                   0        0         12742384  864 GiB  217071088  5.0 TiB  0 B         0 B

total_objects    407761
total_used       4.2 TiB
total_avail      11 TiB
total_space      16 TiB[/CODE]
[/SPOILER]

[SPOILER="ceph -s"]
[CODE]  cluster:
    id:     REMOVED-for-privacy
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum pmx-s01,pmx-s02,pmx-s03 (age 2w)
    mgr: pmx-s03(active, since 4d), standbys: pmx-s01, pmx-s02
    mds: cephfs:1 {0=pmx-s03=up:active} 2 up:standby
    osd: 12 osds: 12 up (since 4h), 12 in (since 3d)

  data:
    pools:   6 pools, 593 pgs
    objects: 407.76k objects, 1.5 TiB
    usage:   4.2 TiB used, 11 TiB / 16 TiB avail
    pgs:     593 active+clean

  io:
    client:   55 KiB/s rd, 11 MiB/s wr, 2 op/s rd, 257 op/s wr[/CODE]
[/SPOILER]
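One more note on the config overrides above: a value in the local ceph.conf takes precedence over values set with ceph config set, so it is worth double-checking that the overrides actually reach the running OSDs. A quick sketch of how to verify that (osd.0 is just an example):

[CODE]# effective value for the running daemon, as seen from the cluster
ceph config show osd.0 osd_memory_target
ceph config show osd.0 bluestore_cache_size

# the same value queried through the admin socket on the OSD's node
ceph daemon osd.0 config get osd_memory_target[/CODE]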