Glad you're sorted out. I had a feeling it was a function of not being able to satisfy pool / rule constraints.

> On Nov 18, 2024, at 1:58 AM, Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>
> On 2024/11/17 18:12, Anthony D'Atri wrote:
>> I see 5 OSDs with 0 CRUSH weight, is that intentional?
>
> Yes, I set the weight to 0 to ensure all the pg's are removed from them, since I'm removing them (worn-out ssd's).
>
> I think I found the problem. I had created a CRUSH rule called old_ssd (and a corresponding pool) into which I had earlier attempted to move the ssd's by changing their respective device class to a custom class I created. Although I had removed the devices from that new class, the rule and pool remained. Eventually I saw that there was actually ghost data assigned to that pool. Since I was sure there were no legitimate pg's in the pool, I deleted it and voila! All the "stuck" pg's vanished. I have now removed those ssd's from their pool again and so far so good.
>
>> Notably:
>>
>>> All the problem pg's are on osd.39.
>>
>> osd.39 has 0 CRUSH weight, so CRUSH shouldn't be placing any PGs there. Yet there appear to be PGs mapped to the 4x 0-weight OSDs that are up. I had hoped that the health detail would show the PG IDs, and thus the pool(s) they are in. 7 pools is a lot for such a small cluster. In case nobody has called it out yet, with size=2 you run a risk of data unavailability and eventual loss.
>>
>> You replied to Eugen asserting that rule 0 is the rule for the `ssd` class. Is it possible that ...? And later you showed a query for PG 28.42 — you must have a LOT of pools, or have created and removed a lot in the past. So at least some and perhaps all of the problem PGs are in pool #28.
>>
>> Please send `ceph osd dump | grep pool`; I don't see that in the thread so far. Similarly `ceph osd tree`, which I don't think we've seen yet. Your cluster is … more complicated than I had expected.
>>
>> I see that a number of your OSDs have the class 'old-ssd'; I'll guess those are the smaller ones?
>>
>> Mind you it's Sunday morning for me (yawwwn), but I see 4x OSDs with the device class 'ssd' that appear to be directly under the CRUSH root, vs being under a host bucket. That could be part of the problem as well.
>>
>> 25  ssd  0.28319  1.00000  290 GiB  152 GiB  151 GiB  2.8 MiB  1020 MiB  138 GiB  52.52  1.06   39  up
>> 41  ssd  3.49309  1.00000  3.5 TiB  1.6 TiB  1.6 TiB   22 MiB   5.4 GiB  1.9 TiB  44.58  0.90  411  up
>> 38  ssd  0        0            0 B      0 B      0 B      0 B       0 B      0 B      0     0    41  up
>>  0  ssd  0        0            0 B      0 B      0 B      0 B       0 B      0 B      0     0     0  down
>> 36  ssd  0        1.00000  290 GiB  1.3 GiB   23 MiB  2.8 MiB   1.2 GiB  289 GiB   0.44  0.01   13  up
>> 37  ssd  0        1.00000  290 GiB  1.1 GiB   23 MiB  2.5 MiB   1.1 GiB  289 GiB   0.38  0.01   18  up
>> 39  ssd  0        0            0 B      0 B      0 B      0 B       0 B      0 B      0     0    86  up
>>
>> The SSD OSDs are really small, but you know that - just be aware that overhead is such that you won't be able to fill them as much as you might think. This is one of the wrinkles with having OSDs of considerably different sizes used for a common pool. You didn't ask about performance, but the 3.84TB SSDs are getting 12x the op workload of the 300GB SSDs. Which may not be all bad, since a 300GB SSD may well be old, client-class, and/or have low performance/endurance.
>>
>>>> Disabling mclock as described here https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/ might help
>>>
>>> I cannot see any option that allows me to disable mclock...
>>>
>> The option to enable override is what I meant; also cf. later replies to this thread.
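To spell that out, roughly (from memory, so double-check against the mclock doc linked above before relying on it; the numeric values here are just illustrative):

    # keep mclock, but allow the usual recovery/backfill knobs to be overridden
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 2

    # or switch the op queue scheduler away from mclock entirely
    # (takes effect only after the OSDs are restarted)
    ceph config set osd osd_op_queue wpq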
>>
>>     "choose_total_tries": 50,
>>
>> Upping this to say 100 may help. I've seen placement snags on very small clusters.
>>
>>>> Also, you have a small cluster with a bunch of small OSDs. Please send `ceph health detail`
>>>
>>> # ceph health detail
>>> HEALTH_WARN 1 pool(s) have no replicas configured
>>> [WRN] POOL_NO_REDUNDANCY: 1 pool(s) have no replicas configured
>>>     pool 'backups' has no replicas configured
>>>
>>> The backups pool is on spinners and 1 copy is sufficient since we're also replicating this offsite in real time. Otherwise too many copies take too much space.
>>>
>>>> Please also send `ceph osd df` and `ceph osd crush dump`
>>>
>>> NodeC:~# ceph osd df
>>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE   DATA      OMAP     META      AVAIL    %USE   VAR   PGS  STATUS
>>>  2  hdd    1.86029  1.00000   1.9 TiB  1.1 TiB   1.1 TiB    16 MiB  2.7 GiB   746 GiB  60.97  1.23   68  up
>>>  3  hdd    1.86029  1.00000   1.9 TiB  949 GiB   902 GiB    12 MiB  2.6 GiB   961 GiB  49.68  1.00   57  up
>>>  4  hdd    1.86029  1.00000   1.9 TiB  848 GiB   800 GiB    10 MiB  2.3 GiB   1.0 TiB  44.37  0.89   51  up
>>>  5  hdd    1.86589  1.00000   1.9 TiB  1.0 TiB   982 GiB    15 MiB  2.4 GiB   881 GiB  53.87  1.08   66  up
>>>  0  ssd    0        0             0 B      0 B       0 B       0 B      0 B       0 B      0     0     0  down
>>>  1  ssd    0.28319  1.00000   290 GiB  123 GiB   122 GiB   2.3 MiB  1.2 GiB   167 GiB  42.38  0.85   46  up
>>> 36  ssd    0        1.00000   290 GiB  1.3 GiB    23 MiB   2.8 MiB  1.2 GiB   289 GiB   0.44  0.01   13  up
>>> 37  ssd    0        1.00000   290 GiB  1.1 GiB    23 MiB   2.5 MiB  1.1 GiB   289 GiB   0.38  0.01   18  up
>>> 38  ssd    0        0             0 B      0 B       0 B       0 B      0 B       0 B      0     0    41  up
>>> 39  ssd    0        0             0 B      0 B       0 B       0 B      0 B       0 B      0     0    86  up
>>> 40  ssd    3.30690  1.00000   3.3 TiB  1.7 TiB   1.7 TiB    23 MiB  4.8 GiB   1.6 TiB  50.49  1.02  424  up
>>> 10  hdd    1.86029  1.00000   1.9 TiB  1.0 TiB   990 GiB    13 MiB  2.8 GiB   873 GiB  54.19  1.09   64  up
>>> 11  hdd    1.86029  1.00000   1.9 TiB  1.3 TiB   1.2 TiB    15 MiB  3.2 GiB   625 GiB  67.20  1.35   72  up
>>> 26  hdd    1.86029  1.00000   1.9 TiB  843 GiB   801 GiB    12 MiB  2.3 GiB   1.0 TiB  44.25  0.89   50  up
>>> 27  hdd    1.86029  1.00000   1.9 TiB  1.1 TiB   1.0 TiB    16 MiB  2.7 GiB   793 GiB  58.38  1.18   70  up
>>>  6  ssd    0.28319  1.00000   290 GiB  115 GiB   115 GiB   2.3 MiB  743 MiB   175 GiB  39.82  0.80   28  up
>>>  7  ssd    0.28319  1.00000   290 GiB   94 GiB    93 GiB   2.0 MiB  921 MiB   196 GiB  32.51  0.65   23  up
>>>  8  ssd    0.28319  1.00000   290 GiB  103 GiB   102 GiB   2.1 MiB  963 MiB   187 GiB  35.46  0.71   27  up
>>>  9  ssd    0.28319  1.00000   290 GiB  131 GiB   130 GiB   2.5 MiB  1.1 GiB   159 GiB  45.28  0.91   33  up
>>> 24  ssd    0.28319  1.00000   290 GiB  102 GiB   101 GiB   2.1 MiB  719 MiB   188 GiB  35.10  0.71   26  up
>>> 25  ssd    0.28319  1.00000   290 GiB  152 GiB   151 GiB   2.8 MiB  1020 MiB  138 GiB  52.52  1.06   39  up
>>> 41  ssd    3.49309  1.00000   3.5 TiB  1.6 TiB   1.6 TiB    22 MiB  5.4 GiB   1.9 TiB  44.58  0.90  411  up
>>> 14  hdd    1.59999  1.00000   1.9 TiB  1.0 TiB   1.0 TiB    13 MiB  2.4 GiB   837 GiB  56.04  1.13   63  up
>>> 15  hdd    1.86029  1.00000   1.9 TiB  1017 GiB  975 GiB    13 MiB  2.1 GiB   888 GiB  53.36  1.07   60  up
>>> 16  hdd    1.86029  1.00000   1.9 TiB  990 GiB   948 GiB    13 MiB  2.2 GiB   915 GiB  51.97  1.05   59  up
>>> 17  hdd    1.86029  1.00000   1.9 TiB  1.0 TiB   992 GiB    14 MiB  2.5 GiB   871 GiB  54.28  1.09   64  up
>>> 12  ssd    0.28319  1.00000   290 GiB  122 GiB   121 GiB   2.3 MiB  1.1 GiB   168 GiB  41.96  0.84   31  up
>>> 13  ssd    0.28319  1.00000   290 GiB  145 GiB   144 GiB   2.6 MiB  1.2 GiB   145 GiB  49.95  1.01   37  up
>>> 28  ssd    0.28319  1.00000   290 GiB  142 GiB   141 GiB   2.6 MiB  1.1 GiB   148 GiB  48.89  0.98   35  up
>>> 29  ssd    0.28319  1.00000   290 GiB  109 GiB   108 GiB   2.2 MiB  1.3 GiB   181 GiB  37.58  0.76   29  up
>>> 30  ssd    0.28319  1.00000   290 GiB  146 GiB   145 GiB   2.7 MiB  1.3 GiB   144 GiB  50.37  1.01   37  up
>>> 31  ssd    0.28319  1.00000   290 GiB  131 GiB   130 GiB   2.5 MiB  1.3 GiB   159 GiB  45.24  0.91   34  up
>>> 43  ssd    3.30690  1.00000   3.3 TiB  1.4 TiB   1.4 TiB    19 MiB  4.0 GiB   1.9 TiB  42.50  0.86  353  up
>>> 20  hdd    1.86029  1.00000   1.9 TiB  1.2 TiB   1.2 TiB    12 MiB  2.8 GiB   659 GiB  65.38  1.32   67  up
>>> 21  hdd    1.86029  1.00000   1.9 TiB  1.1 TiB   1.0 TiB    15 MiB  2.8 GiB   815 GiB  57.20  1.15   68  up
>>> 22  hdd    1.86029  1.00000   1.9 TiB  878 GiB   836 GiB    13 MiB  2.4 GiB   1.0 TiB  46.10  0.93   54  up
>>> 23  hdd    1.86029  1.00000   1.9 TiB  1018 GiB  977 GiB    14 MiB  2.8 GiB   886 GiB  53.46  1.08   59  up
>>> 18  ssd    0.28319  1.00000   290 GiB  115 GiB   114 GiB   2.3 MiB  1.3 GiB   175 GiB  39.74  0.80   29  up
>>> 19  ssd    0.28319  1.00000   290 GiB  115 GiB   114 GiB   2.3 MiB  961 MiB   175 GiB  39.59  0.80   28  up
>>> 32  ssd    0.28319  1.00000   290 GiB  103 GiB   102 GiB   2.1 MiB  1.0 GiB   187 GiB  35.65  0.72   26  up
>>> 33  ssd    0.28319  1.00000   290 GiB   98 GiB    97 GiB   2.1 MiB  1.4 GiB   192 GiB  33.92  0.68   26  up
>>> 34  ssd    0.28319  1.00000   290 GiB  141 GiB   140 GiB   2.6 MiB  987 MiB   149 GiB  48.49  0.98   35  up
>>> 35  ssd    0.28319  1.00000   290 GiB  116 GiB   114 GiB   2.3 MiB  1.3 GiB   174 GiB  39.92  0.80   31  up
>>> 42  ssd    3.49309  1.00000   3.5 TiB  1.4 TiB   1.4 TiB    20 MiB  4.5 GiB   2.1 TiB  40.45  0.81  372  up
>>> TOTAL  49 TiB  25 TiB  24 TiB  351 MiB  83 GiB  25 TiB  49.67
>>> MIN/MAX VAR: 0.01/1.35  STDDEV: 13.87
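Side note on the df output above: the PGS column still shows placement groups on the zero-weight OSDs 38 and 39. A couple of read-only commands will show exactly which PGs and pools those are (the OSD IDs here are just the ones from this thread):

    ceph osd dump | grep pool
    ceph pg ls-by-osd osd.39
    ceph pg ls-by-osd osd.38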
>>>
>>> NodeC:~# ceph osd crush dump
>>> {
>>>     "devices": [
>>>         { "id": 0, "name": "osd.0", "class": "ssd" },
>>>         { "id": 1, "name": "osd.1", "class": "ssd" },
>>>         { "id": 2, "name": "osd.2", "class": "hdd" },
>>>         { "id": 3, "name": "osd.3", "class": "hdd" },
>>>         { "id": 4, "name": "osd.4", "class": "hdd" },
>>>         { "id": 5, "name": "osd.5", "class": "hdd" },
>>>         { "id": 6, "name": "osd.6", "class": "ssd" },
>>>         { "id": 7, "name": "osd.7", "class": "ssd" },
>>>         { "id": 8, "name": "osd.8", "class": "ssd" },
>>>         { "id": 9, "name": "osd.9", "class": "ssd" },
>>>         { "id": 10, "name": "osd.10", "class": "hdd" },
>>>         { "id": 11, "name": "osd.11", "class": "hdd" },
>>>         { "id": 12, "name": "osd.12", "class": "ssd" },
>>>         { "id": 13, "name": "osd.13", "class": "ssd" },
>>>         { "id": 14, "name": "osd.14", "class": "hdd" },
>>>         { "id": 15, "name": "osd.15", "class": "hdd" },
>>>         { "id": 16, "name": "osd.16", "class": "hdd" },
>>>         { "id": 17, "name": "osd.17", "class": "hdd" },
>>>         { "id": 18, "name": "osd.18", "class": "ssd" },
>>>         { "id": 19, "name": "osd.19", "class": "ssd" },
>>>         { "id": 20, "name": "osd.20", "class": "hdd" },
>>>         { "id": 21, "name": "osd.21", "class": "hdd" },
>>>         { "id": 22, "name": "osd.22", "class": "hdd" },
>>>         { "id": 23, "name": "osd.23", "class": "hdd" },
>>>         { "id": 24, "name": "osd.24", "class": "ssd" },
>>>         { "id": 25, "name": "osd.25", "class": "ssd" },
>>>         { "id": 26, "name": "osd.26", "class": "hdd" },
>>>         { "id": 27, "name": "osd.27", "class": "hdd" },
>>>         { "id": 28, "name": "osd.28", "class": "ssd" },
>>>         { "id": 29, "name": "osd.29", "class": "ssd" },
>>>         { "id": 30, "name": "osd.30", "class": "ssd" },
>>>         { "id": 31, "name": "osd.31", "class": "ssd" },
>>>         { "id": 32, "name": "osd.32", "class": "ssd" },
>>>         { "id": 33, "name": "osd.33", "class": "ssd" },
>>>         { "id": 34, "name": "osd.34", "class": "ssd" },
>>>         { "id": 35, "name": "osd.35", "class": "ssd" },
>>>         { "id": 36, "name": "osd.36", "class": "ssd" },
>>>         { "id": 37, "name": "osd.37", "class": "ssd" },
>>>         { "id": 38, "name": "osd.38", "class": "ssd" },
>>>         { "id": 39, "name": "osd.39", "class": "ssd" },
>>>         { "id": 40, "name": "osd.40", "class": "ssd" },
>>>         { "id": 41, "name": "osd.41", "class": "ssd" },
>>>         { "id": 42, "name": "osd.42", "class": "ssd" },
>>>         { "id": 43, "name": "osd.43", "class": "ssd" }
>>>     ],
>>>     "types": [
>>>         { "type_id": 0, "name": "osd" },
>>>         { "type_id": 1, "name": "host" },
>>>         { "type_id": 2, "name": "chassis" },
>>>         { "type_id": 3, "name": "rack" },
>>>         { "type_id": 4, "name": "row" },
>>>         { "type_id": 5, "name": "pdu" },
>>>         { "type_id": 6, "name": "pod" },
>>>         { "type_id": 7, "name": "room" },
>>>         { "type_id": 8, "name": "datacenter" },
>>>         { "type_id": 9, "name": "zone" },
>>>         { "type_id": 10, "name": "region" },
>>>         { "type_id": 11, "name": "root" }
>>>     ],
>>>     "buckets": [
>>>         { "id": -1, "name": "default", "type_id": 11, "type_name": "root", "weight": 3177873, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": -3, "weight": 723311, "pos": 0 }, { "id": -7, "weight": 827941, "pos": 1 },
>>>                      { "id": -10, "weight": 798680, "pos": 2 }, { "id": -13, "weight": 827941, "pos": 3 } ] },
>>>         { "id": -2, "name": "default~ssd", "type_id": 11, "type_name": "root", "weight": 1243909, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": -4, "weight": 235280, "pos": 0 }, { "id": -8, "weight": 340277, "pos": 1 },
>>>                      { "id": -11, "weight": 328075, "pos": 2 }, { "id": -14, "weight": 340277, "pos": 3 } ] },
>>>         { "id": -3, "name": "FT1-NodeA", "type_id": 1, "type_name": "host", "weight": 723311, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 1, "weight": 18559, "pos": 0 }, { "id": 36, "weight": 0, "pos": 1 },
>>>                      { "id": 37, "weight": 0, "pos": 2 }, { "id": 0, "weight": 0, "pos": 3 },
>>>                      { "id": 38, "weight": 0, "pos": 4 }, { "id": 39, "weight": 0, "pos": 5 },
>>>                      { "id": 2, "weight": 121916, "pos": 6 }, { "id": 3, "weight": 121916, "pos": 7 },
>>>                      { "id": 4, "weight": 121916, "pos": 8 }, { "id": 40, "weight": 216721, "pos": 9 },
>>>                      { "id": 5, "weight": 122283, "pos": 10 } ] },
>>>         { "id": -4, "name": "FT1-NodeA~ssd", "type_id": 1, "type_name": "host", "weight": 235280, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 1, "weight": 18559, "pos": 0 }, { "id": 36, "weight": 0, "pos": 1 },
>>>                      { "id": 37, "weight": 0, "pos": 2 }, { "id": 0, "weight": 0, "pos": 3 },
>>>                      { "id": 38, "weight": 0, "pos": 4 }, { "id": 39, "weight": 0, "pos": 5 },
>>>                      { "id": 40, "weight": 216721, "pos": 6 } ] },
>>>         { "id": -5, "name": "FT1-NodeA~hdd", "type_id": 1, "type_name": "host", "weight": 488031, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 2, "weight": 121916, "pos": 0 }, { "id": 3, "weight": 121916, "pos": 1 },
>>>                      { "id": 4, "weight": 121916, "pos": 2 }, { "id": 5, "weight": 122283, "pos": 3 } ] },
>>>         { "id": -6, "name": "default~hdd", "type_id": 11, "type_name": "root", "weight": 1933964, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": -5, "weight": 488031, "pos": 0 }, { "id": -9, "weight": 487664, "pos": 1 },
>>>                      { "id": -12, "weight": 470605, "pos": 2 }, { "id": -15, "weight": 487664, "pos": 3 } ] },
>>>         { "id": -7, "name": "FT1-NodeB", "type_id": 1, "type_name": "host", "weight": 827941, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 6, "weight": 18559, "pos": 0 }, { "id": 24, "weight": 18559, "pos": 1 },
>>>                      { "id": 25, "weight": 18559, "pos": 2 }, { "id": 11, "weight": 121916, "pos": 3 },
>>>                      { "id": 10, "weight": 121916, "pos": 4 }, { "id": 7, "weight": 18559, "pos": 5 },
>>>                      { "id": 8, "weight": 18559, "pos": 6 }, { "id": 9, "weight": 18559, "pos": 7 },
>>>                      { "id": 26, "weight": 121916, "pos": 8 }, { "id": 27, "weight": 121916, "pos": 9 },
>>>                      { "id": 41, "weight": 228923, "pos": 10 } ] },
>>>         { "id": -8, "name": "FT1-NodeB~ssd", "type_id": 1, "type_name": "host", "weight": 340277, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 6, "weight": 18559, "pos": 0 }, { "id": 24, "weight": 18559, "pos": 1 },
>>>                      { "id": 25, "weight": 18559, "pos": 2 }, { "id": 7, "weight": 18559, "pos": 3 },
>>>                      { "id": 8, "weight": 18559, "pos": 4 }, { "id": 9, "weight": 18559, "pos": 5 },
>>>                      { "id": 41, "weight": 228923, "pos": 6 } ] },
>>>         { "id": -9, "name": "FT1-NodeB~hdd", "type_id": 1, "type_name": "host", "weight": 487664, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 11, "weight": 121916, "pos": 0 }, { "id": 10, "weight": 121916, "pos": 1 },
>>>                      { "id": 26, "weight": 121916, "pos": 2 }, { "id": 27, "weight": 121916, "pos": 3 } ] },
>>>         { "id": -10, "name": "FT1-NodeC", "type_id": 1, "type_name": "host", "weight": 798680, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 13, "weight": 18559, "pos": 0 }, { "id": 28, "weight": 18559, "pos": 1 },
>>>                      { "id": 29, "weight": 18559, "pos": 2 }, { "id": 14, "weight": 104857, "pos": 3 },
>>>                      { "id": 12, "weight": 18559, "pos": 4 }, { "id": 30, "weight": 18559, "pos": 5 },
>>>                      { "id": 31, "weight": 18559, "pos": 6 }, { "id": 15, "weight": 121916, "pos": 7 },
>>>                      { "id": 16, "weight": 121916, "pos": 8 }, { "id": 17, "weight": 121916, "pos": 9 },
>>>                      { "id": 43, "weight": 216721, "pos": 10 } ] },
>>>         { "id": -11, "name": "FT1-NodeC~ssd", "type_id": 1, "type_name": "host", "weight": 328075, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 13, "weight": 18559, "pos": 0 }, { "id": 28, "weight": 18559, "pos": 1 },
>>>                      { "id": 29, "weight": 18559, "pos": 2 }, { "id": 12, "weight": 18559, "pos": 3 },
>>>                      { "id": 30, "weight": 18559, "pos": 4 }, { "id": 31, "weight": 18559, "pos": 5 },
>>>                      { "id": 43, "weight": 216721, "pos": 6 } ] },
>>>         { "id": -12, "name": "FT1-NodeC~hdd", "type_id": 1, "type_name": "host", "weight": 470605, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 14, "weight": 104857, "pos": 0 }, { "id": 15, "weight": 121916, "pos": 1 },
>>>                      { "id": 16, "weight": 121916, "pos": 2 }, { "id": 17, "weight": 121916, "pos": 3 } ] },
>>>         { "id": -13, "name": "FT1-NodeD", "type_id": 1, "type_name": "host", "weight": 827941, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 18, "weight": 18559, "pos": 0 }, { "id": 32, "weight": 18559, "pos": 1 },
>>>                      { "id": 33, "weight": 18559, "pos": 2 }, { "id": 22, "weight": 121916, "pos": 3 },
>>>                      { "id": 19, "weight": 18559, "pos": 4 }, { "id": 34, "weight": 18559, "pos": 5 },
>>>                      { "id": 35, "weight": 18559, "pos": 6 }, { "id": 23, "weight": 121916, "pos": 7 },
>>>                      { "id": 21, "weight": 121916, "pos": 8 }, { "id": 20, "weight": 121916, "pos": 9 },
>>>                      { "id": 42, "weight": 228923, "pos": 10 } ] },
>>>         { "id": -14, "name": "FT1-NodeD~ssd", "type_id": 1, "type_name": "host", "weight": 340277, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 18, "weight": 18559, "pos": 0 }, { "id": 32, "weight": 18559, "pos": 1 },
>>>                      { "id": 33, "weight": 18559, "pos": 2 }, { "id": 19, "weight": 18559, "pos": 3 },
>>>                      { "id": 34, "weight": 18559, "pos": 4 }, { "id": 35, "weight": 18559, "pos": 5 },
>>>                      { "id": 42, "weight": 228923, "pos": 6 } ] },
>>>         { "id": -15, "name": "FT1-NodeD~hdd", "type_id": 1, "type_name": "host", "weight": 487664, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": 22, "weight": 121916, "pos": 0 }, { "id": 23, "weight": 121916, "pos": 1 },
>>>                      { "id": 21, "weight": 121916, "pos": 2 }, { "id": 20, "weight": 121916, "pos": 3 } ] },
>>>         { "id": -16, "name": "FT1-NodeA~old-ssd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] },
>>>         { "id": -17, "name": "FT1-NodeB~old-ssd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] },
>>>         { "id": -18, "name": "FT1-NodeC~old-ssd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] },
>>>         { "id": -19, "name": "FT1-NodeD~old-ssd", "type_id": 1, "type_name": "host", "weight": 0, "alg": "straw2", "hash": "rjenkins1", "items": [] },
>>>         { "id": -20, "name": "default~old-ssd", "type_id": 11, "type_name": "root", "weight": 0, "alg": "straw2", "hash": "rjenkins1",
>>>           "items": [ { "id": -16, "weight": 0, "pos": 0 }, { "id": -17, "weight": 0, "pos": 1 },
>>>                      { "id": -18, "weight": 0, "pos": 2 }, { "id": -19, "weight": 0, "pos": 3 } ] }
>>>     ],
>>>     "rules": [
>>>         { "rule_id": 0, "rule_name": "ssd_rule", "type": 1,
>>>           "steps": [ { "op": "take", "item": -2, "item_name": "default~ssd" },
>>>                      { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
>>>                      { "op": "emit" } ] },
>>>         { "rule_id": 1, "rule_name": "hdd_rule", "type": 1,
>>>           "steps": [ { "op": "take", "item": -6, "item_name": "default~hdd" },
>>>                      { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
>>>                      { "op": "emit" } ] },
>>>         { "rule_id": 2, "rule_name": "old_nvme", "type": 1,
>>>           "steps": [ { "op": "take", "item": -20, "item_name": "default~old-ssd" },
>>>                      { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
>>>                      { "op": "emit" } ] }
>>>     ],
>>>     "tunables": {
>>>         "choose_local_tries": 0,
>>>         "choose_local_fallback_tries": 0,
>>>         "choose_total_tries": 50,
>>>         "chooseleaf_descend_once": 1,
>>>         "chooseleaf_vary_r": 1,
>>>         "chooseleaf_stable": 1,
>>>         "straw_calc_version": 1,
>>>         "allowed_bucket_algs": 54,
>>>         "profile": "jewel",
>>>         "optimal_tunables": 1,
>>>         "legacy_tunables": 0,
>>>         "minimum_required_version": "jewel",
>>>         "require_feature_tunables": 1,
>>>         "require_feature_tunables2": 1,
>>>         "has_v2_rules": 0,
>>>         "require_feature_tunables3": 1,
>>>         "has_v3_rules": 0,
>>>         "has_v4_buckets": 1,
>>>         "require_feature_tunables5": 1,
>>>         "has_v5_rules": 0
>>>     },
>>>     "choose_args": {}
>>> }
>>>
>>>> It may help to manually edit the CRUSH map as detailed here:
>>>>
>>>> https://docs.ceph.com/en/reef/rados/operations/crush-map-edits/
>>>>
>>>> At the end of the decompiled CRUSH map is a "tunables" section. Look for the "choose_total_tries" line, which should have the default value of 50. Change that to 100.
>>>>
>>>>> On Nov 16, 2024, at 11:13 AM, Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On 2024/11/15 16:37, Anthony D'Atri wrote:
>>>>>> Only 2 OSDs in the acting set. This isn't a size=2 pool, is it? Did you restart osd.39?
>>>>>
>>>>> Restarted both repeatedly. Has no effect.
>>>>>
>>>>>>> On Nov 15, 2024, at 9:32 AM, Roland Giesler <roland@xxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>     "acting": [
>>>>>>>         39,
>>>>>>>         1