Ah, the `debug_osd=10` was the missing piece of information for me. It looks like repeering actually triggered the necessary (chaotic ;-)) deletion. With increased log level I'm now seeing the following in the logs:

2021-07-28 12:07:56.661 7f688f301700 10 osd.0 pg_epoch: 284426 pg[3.114s1( v 279813'6291438 (279807'6288384,279813'6291438] lb MIN (bitwise) local-lis/les=280030/280031 n=117894 ec=366/91 lis/c 280038/214516 les/c/f 280039/214517/78343 280030/284422/85949) [113,148,152]p113(0) r=-1 lpr=284422 DELETING pi=[214516,284422)/2 crt=279813'6291438 lcod 0'0 unknown NOTIFY mbc={}] do_peering_event: epoch_sent: 284426 epoch_requested: 284426 DeleteSome
2021-07-28 12:07:56.661 7f688f301700 10 osd.0 pg_epoch: 284426 pg[3.114s1( v 279813'6291438 (279807'6288384,279813'6291438] lb MIN (bitwise) local-lis/les=280030/280031 n=117894 ec=366/91 lis/c 280038/214516 les/c/f 280039/214517/78343 280030/284422/85949) [113,148,152]p113(0) r=-1 lpr=284422 DELETING pi=[214516,284422)/2 crt=279813'6291438 lcod 0'0 unknown NOTIFY mbc={}] _delete_some
2021-07-28 12:08:00.370 7f688e2ff700 10 osd.0 pg_epoch: 284426 pg[3.e9s0( v 281567'6292700 (279807'6289659,281567'6292700] lb MIN (bitwise) local-lis/les=281537/281538 n=118646 ec=366/91 lis/c 281608/280406 les/c/f 281609/280410/78343 284424/284424/281654) [104,2147483647,67]p104(0) r=-1 lpr=284424 DELETING pi=[280406,284424)/3 crt=281567'6292700 lcod 0'0 unknown NOTIFY mbc={}] do_peering_event: epoch_sent: 284426 epoch_requested: 284426 DeleteSome
2021-07-28 12:08:00.370 7f688e2ff700 10 osd.0 pg_epoch: 284426 pg[3.e9s0( v 281567'6292700 (279807'6289659,281567'6292700] lb MIN (bitwise) local-lis/les=281537/281538 n=118646 ec=366/91 lis/c 281608/280406 les/c/f 281609/280410/78343 284424/284424/281654) [104,2147483647,67]p104(0) r=-1 lpr=284424 DELETING pi=[280406,284424)/3 crt=281567'6292700 lcod 0'0 unknown NOTIFY mbc={}] _delete_some

So, I'll keep my fingers crossed and try to be more patient to see whether the pg counts go towards the expected ones over time.
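To keep an eye on this, a minimal sketch along these lines should do -- it just polls the same `ceph daemon osd.N status` output that is quoted further down in the thread, so it has to run on the OSD's host; the OSD id and the polling interval are only examples:

#!/usr/bin/env python
# Minimal sketch: poll the OSD admin socket and print "num_pgs" over time,
# so the effect of the ongoing PG deletions becomes visible.
# Run this on the OSD's host; OSD id and interval are examples only.
import json
import subprocess
import time

OSD_ID = 0          # example OSD id, adjust as needed
INTERVAL_SEC = 60   # example polling interval in seconds

while True:
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % OSD_ID, "status"])
    status = json.loads(out)
    print("%s osd.%d num_pgs=%d" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), OSD_ID, status["num_pgs"]))
    time.sleep(INTERVAL_SEC)

If the deletions are really running, "num_pgs" should drift down over time.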
Thanks!

On Wed, Jul 28, 2021 at 11:52 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Yes I expect `ceph pg repeer` would work.
> Instead of doing all PGs at once, which is sort of chaotic, just pick
> one PG which is on osd.0 but shouldn't be there.
>
> To find that, you need to restart osd.0 with debug_osd=10 and look
> for lines like:
>
> 10 osd.74 47719 load_pgs loaded pg[2.6d(
>
> Then check the up/acting set in the pg map and find one which is not
> supposed to be there.
>
> Then just `ceph pg repeer` that PG, and see if it starts deleting from
> the OSD. Deleting a PG can take a while. If you still have debug_osd=10
> you'll see the ongoing PG deletion in messages like "delete_some" or
> similar with the expected PG id.
>
> If repeer doesn't work, then try restarting the primary OSD of the
> relevant PG.
>
> -- dan
>
> On Wed, Jul 28, 2021 at 11:44 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > would it not be simpler to find the "bad" pgs and call "ceph pg repeer"
> > on them to force them to peer? Or is this a different kind of peering
> > than the one you are describing?
> >
> > My approach would be to get a list of ALL pgs and then call "ceph pg
> > repeer" on them. The first command line call gets the JSON file, then
> > the Python snippet prints the PG names, and then I loop over this with
> > "for x in [list]; do ceph pg repeer $x; done" (not shown).
> >
> > ## first ## ceph pg dump -f json > ceph-pg-dump.json
> >
> > #!/usr/bin/env python
> > import json
> >
> > with open('ceph-pg-dump.json', 'rt') as inputf:
> >     data = json.loads(inputf.read())
> >
> > pgs = []
> > for r_pg_stat in data['pg_map']['pg_stats']:
> >     if r_pg_stat["pgid"].startswith("3."):
> >         pgs.append(r_pg_stat["pgid"])
> >
> > print(" ".join(sorted(set(pgs))))
> >
> > I just did this but it did not have the desired effect. The cluster
> > appears to correctly re-peer all PGs, some OSDs go "down" on the way,
> > but in the end I have a system in HEALTH_OK state and an unchanged
> > number of PGs per OSD.
> >
> > My problem is that >50% of all OSDs seem to be affected somehow, so
> > I'd rather not zap all of them, but I'd like to have all of them
> > fixed eventually.
> >
> > How can I find out which PGs are actually on osd.0? I guess I can use
> > a Python script similar to the one above to find out what the central
> > bookkeeping thinks should be on osd.0 (change the condition in the
> > script to `if 0 in (r_pg_stat["up"] + r_pg_stat["acting"])`).
> >
> > I believe the command for removing a pg from an osd would be
> >
> > # ceph-objectstore-tool --op export-remove --data-path /var/lib/ceph/osd/ceph-0 --pgid PGID
> >
> > So I'd know how to proceed if I knew the pg id. Is there a check that
> > I should perform after removing a pg from an osd with
> > ceph-objectstore-tool to make sure that I did not remove a wrong one?
> > Or would Ceph notice?
> >
> > Thanks,
> > Manuel
> >
> >
> > On Wed, Jul 28, 2021 at 10:41 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >>
> >> Wait, after re-reading my own ticket I realized you can more easily
> >> remove the leftover PGs by re-peering the *other* osds.
> >>
> >> "I found a way to remove those leftover PGs (without using
> >> ceph-objectstore-tool): If the PG re-peers, then osd.74 notices he's
> >> not in the up/acting set then starts deleting the PG. So at the moment
> >> I'm restarting those former peers to trim this OSD."
> >>
> >> -- dan
> >>
> >>
> >> On Wed, Jul 28, 2021 at 10:37 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> >
> >> > Cool, looks like the second problem is the real issue here :)
> >> >
> >> > IIRC, you can remove the leftover PGs with ceph-objectstore-tool. I
> >> > don't recall the exact syntax, but you'd need to find out which PGs
> >> > are not mapped there by the current crush rule and remove the others.
> >> > Or, you can zap and re-create the OSD.
> >> >
> >> > -- dan
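Regarding "which PGs are not mapped there by the current crush rule": a minimal sketch of that check, along the lines of the script further up, could look like the following (the OSD id 0 and the pool prefix "3." are just examples; the field names assume the same `ceph pg dump -f json` layout as above). Any PG that osd.0 itself reports -- e.g. in the "load_pgs loaded pg[...]" debug lines -- but that is missing from this list would be a leftover candidate:

#!/usr/bin/env python
# Minimal sketch: from the pg dump JSON, list the PGs that the cluster
# currently maps to a given OSD (up or acting), i.e. the PGs that *should*
# be on it. OSD id and pool prefix are examples only.
import json

OSD_ID = 0          # example OSD id
POOL_PREFIX = "3."  # example pool id prefix

with open('ceph-pg-dump.json', 'rt') as inputf:
    data = json.loads(inputf.read())

expected = set()
for r_pg_stat in data['pg_map']['pg_stats']:
    if not r_pg_stat["pgid"].startswith(POOL_PREFIX):
        continue
    if OSD_ID in (r_pg_stat["up"] + r_pg_stat["acting"]):
        expected.add(r_pg_stat["pgid"])

print("\n".join(sorted(expected)))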
> >> > On Wed, Jul 28, 2021 at 10:34 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >> > >
> >> > > How "wide" is "wide"? I have 4 nodes and 140 HDD OSDs. Here is the
> >> > > info as reported by Ceph:
> >> > >
> >> > > # ceph osd erasure-code-profile get hdd_ec
> >> > > crush-device-class=hdd
> >> > > crush-failure-domain=host
> >> > > crush-root=default
> >> > > jerasure-per-chunk-alignment=false
> >> > > k=2
> >> > > m=1
> >> > > plugin=jerasure
> >> > > technique=reed_sol_van
> >> > > w=8
> >> > >
> >> > > Here is what your script gives:
> >> > >
> >> > > # python tools/ceph-pool-pg-distribution 3
> >> > > Searching for PGs in pools: ['3']
> >> > > Summary: 2048 PGs on 140 osds
> >> > >
> >> > > Num OSDs with X PGs:
> >> > >   43: 16
> >> > >   44: 124
> >> > >
> >> > > ... and finally your last proposal, so it looks like I have some
> >> > > left-over pgs, see below. I'm also observing PG counts other than
> >> > > 43/44 on other OSDs in the system.
> >> > >
> >> > > # ceph daemon osd.0 status
> >> > > {
> >> > >     "cluster_fsid": "55633ec3-6c0c-4a02-990c-0f87e0f7a01f",
> >> > >     "osd_fsid": "85e266f1-8d8c-4c2a-b03c-0aef9bc4e532",
> >> > >     "whoami": 0,
> >> > >     "state": "active",
> >> > >     "oldest_map": 99775,
> >> > >     "newest_map": 281713,
> >> > >     "num_pgs": 77
> >> > > }
> >> > >
> >> > > I found this ticket (https://tracker.ceph.com/issues/38931 -- I
> >> > > believe you actually opened it ;-)) and tried restarting osd.0;
> >> > > now the OSD is scrubbing some of its PGs. However, I'm uncertain
> >> > > whether this is actually trimming the left-over pgs.
> >> > >
> >> > > Thanks for all your help up to this point already!
> >> > >
> >> > > Best wishes,
> >> > > Manuel
> >> > >
> >> > > On Wed, Jul 28, 2021 at 9:55 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> > >>
> >> > >> How wide is hdd_ec? With a wide EC rule and relatively few OSDs and
> >> > >> relatively few PGs per OSD for the pool, it can be impossible for the
> >> > >> balancer to make things perfect.
> >> > >> It would help to look at the PG distribution for only the hdd_ec pool
> >> > >> -- this script can help:
> >> > >> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-pool-pg-distribution
> >> > >>
> >> > >> Another possibility is that osd.0 has some leftover data from PGs that
> >> > >> should have been deleted. From the box, check: `ceph daemon osd.0
> >> > >> status` and compare the number of PGs it holds vs the value in your
> >> > >> osd df output (48).
> >> > >>
> >> > >> -- dan
> >> > >>
> >> > >> On Wed, Jul 28, 2021 at 9:24 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >> > >> >
> >> > >> > Hi,
> >> > >> >
> >> > >> > thanks for your quick response. I already did this earlier this week:
> >> > >> >
> >> > >> > # ceph config dump | grep upmap_max_deviation
> >> > >> > mgr    advanced    mgr/balancer/upmap_max_deviation    1
> >> > >> >
> >> > >> > Cheers,
> >> > >> > Manuel
> >> > >> >
> >> > >> > On Wed, Jul 28, 2021 at 9:15 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >> > >> >>
> >> > >> >> Hi,
> >> > >> >>
> >> > >> >> Start by setting:
> >> > >> >>
> >> > >> >> ceph config set mgr mgr/balancer/upmap_max_deviation 1
> >> > >> >>
> >> > >> >> This configures the balancer to squeeze the OSDs to within 1 PG of each other.
> >> > >> >>
> >> > >> >> I'm starting to think this should be the default.
> >> > >> >>
> >> > >> >> Cheers, dan
> >> > >> >>
> >> > >> >>
> >> > >> >> On Wed, Jul 28, 2021 at 9:08 AM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:
> >> > >> >> >
> >> > >> >> > Dear all,
> >> > >> >> >
> >> > >> >> > I'm running Ceph 14.2.11. I have 140 HDDs in my cluster of 4 nodes,
> >> > >> >> > 35 HDDs per node. I am observing fill ratios of 66% to 70% for most
> >> > >> >> > OSDs and then one with 82% (see attached ceph-osd-df.txt for the
> >> > >> >> > output of "ceph osd df").
> >> > >> >> >
> >> > >> >> > Previously, I had problems with single OSDs filling up to 85% and
> >> > >> >> > then everything coming to a grinding halt. Ideally, I would like
> >> > >> >> > all OSD fill grades to be close to the mean of 67%... At the very
> >> > >> >> > least I need to get the 82% OSD back into the range.
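A quick way to quantify that spread, as a minimal sketch -- this assumes `ceph osd df -f json` reports a "nodes" list with an "id" and a "utilization" field per OSD, which may need adjusting on other releases:

#!/usr/bin/env python
# Minimal sketch: summarize the OSD fill-ratio spread from `ceph osd df`.
# Assumes the JSON output has a "nodes" list with "id" and "utilization"
# per OSD; adjust the field names if your release reports them differently.
import json
import subprocess

out = subprocess.check_output(["ceph", "osd", "df", "-f", "json"])
nodes = json.loads(out)["nodes"]

utils = sorted((n["utilization"], n["id"]) for n in nodes)
mean = sum(u for u, _ in utils) / float(len(utils))
lo, hi = utils[0], utils[-1]

print("mean %.1f%%, min %.1f%% (osd.%d), max %.1f%% (osd.%d)"
      % (mean, lo[0], lo[1], hi[0], hi[1]))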
> >> > >> >> > I have upmap balancing enabled and the balancer says:
> >> > >> >> >
> >> > >> >> > # ceph balancer status
> >> > >> >> > {
> >> > >> >> >     "last_optimize_duration": "0:00:00.053686",
> >> > >> >> >     "plans": [],
> >> > >> >> >     "mode": "upmap",
> >> > >> >> >     "active": true,
> >> > >> >> >     "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
> >> > >> >> >     "last_optimize_started": "Wed Jul 28 09:03:02 2021"
> >> > >> >> > }
> >> > >> >> >
> >> > >> >> > Creating an offline balancing plan looks like this:
> >> > >> >> >
> >> > >> >> > # ceph osd getmap -o om
> >> > >> >> > got osdmap epoch 281708
> >> > >> >> > # osdmaptool om --upmap out.txt --upmap-pool hdd_ec --upmap-deviation 1 --upmap-active
> >> > >> >> > osdmaptool: osdmap file 'om'
> >> > >> >> > writing upmap command output to: out.txt
> >> > >> >> > checking for upmap cleanups
> >> > >> >> > upmap, max-count 10, max deviation 1
> >> > >> >> >  limiting to pools hdd_ec ([3])
> >> > >> >> > pools hdd_ec
> >> > >> >> > prepared 0/10 changes
> >> > >> >> > Time elapsed 0.0275739 secs
> >> > >> >> > Unable to find further optimization, or distribution is already perfect
> >> > >> >> > osd.0 pgs 43
> >> > >> >> > [...]
> >> > >> >> > # wc -l out.txt
> >> > >> >> > 0 out.txt
> >> > >> >> >
> >> > >> >> > Does anyone have a suggestion on how to proceed to get the 82% OSD
> >> > >> >> > closer to the mean fill ratio (and maybe the other OSDs as well)?
> >> > >> >> >
> >> > >> >> > Thanks,
> >> > >> >> > Manuel
> >> > >> >> > _______________________________________________
> >> > >> >> > ceph-users mailing list -- ceph-users@xxxxxxx
> >> > >> >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx