Don't purge anything!

On Fri, Apr 1, 2022 at 9:38 AM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Ciao Dan,
>    thanks for your time!
>
> So you are suggesting that my problems with PG 85.25 may somehow resolve
> if I manage to bring up the three OSDs currently "down" (possibly due to
> PG 85.12, and other PGs)?
>
> Looking for the string 'start interval does not contain the required
> bound' I found similar errors in the three OSDs:
>    osd.158: 85.12s0
>    osd.145: 85.33s0
>    osd.121: 85.11s0
>
> Here is the output of "ceph pg 85.12 query":
>    https://pastebin.ubuntu.com/p/ww3JdwDXVd/
> and its status (also showing the other 85.XX PGs, for reference; columns
> are PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS*
> LOG STATE SINCE VERSION REPORTED UP ACTING SCRUB_STAMP DEEP_SCRUB_STAMP):
>
> 85.11  39501      0       0  0  165479411712  0  0  3000
>        stale+active+clean                   3d  606021'532631  617659:1827554
>        [124,157,68,72,102]p124  [124,157,68,72,102]p124
>        2022-03-28 07:21:00.566032  2022-03-28 07:21:00.566032
> 85.12  39704  39704  158816  0  166350008320  0  0  3028
>        active+undersized+degraded+remapped  3d  606021'573200  620336:1839924
>        [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
>        [67,91,82,2147483647,112]p67
>        2022-03-15 03:25:28.478280  2022-03-12 19:10:45.866650
> 85.25  39402      0       0  0  165108592640  0  0  3098
>        stale+down+remapped                  3d  606021'521273  618930:1734492
>        [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
>        [2147483647,2147483647,96,2147483647,2147483647]p96
>        2022-03-15 04:08:42.561720  2022-03-09 17:05:34.205121
> 85.33  39319      0       0  0  164740796416  0  0  3000
>        stale+active+clean                   3d  606021'513259  617659:2125167
>        [174,112,85,102,124]p174  [174,112,85,102,124]p174
>        2022-03-28 07:21:12.097873  2022-03-28 07:21:12.097873
>
> So 85.11 and 85.33 do not look bad after all: why are the relevant OSDs
> complaining? Is there a way to force those OSDs to forget about the
> chunks they hold, since apparently the chunks have already migrated
> safely elsewhere?
>
> Indeed, 85.12 is not really healthy...
> As for chunks of 85.12 and 85.25, the three down OSDs have:
>    osd.121: 85.12s3, 85.25s3
>    osd.158: 85.12s0
>    osd.145: none
> I guess I can safely purge osd.145 and re-create it, then.
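>
> (Before re-creating anything, a prudent first step might be to export
> the shards the down OSDs still hold, so they could be re-imported later
> if needed. A minimal ceph-objectstore-tool sketch, run with the OSD
> daemon stopped; the data path assumes the usual
> /var/lib/ceph/osd/$cluster-$id layout, and the export file name is only
> illustrative:
>
>    ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158 \
>        --op export --pgid 85.12s0 --file /root/pg85.12s0-osd158.export
>
> The same could be done for 85.12s3 and 85.25s3 on osd.121.)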
>
> As for the history of the pool: this is an EC pool, with metadata in an
> SSD-backed replicated pool. At some point I realized I had made a
> mistake in the allocation rule for the "data" part, so I changed the
> relevant rule to:
>
> ~]$ ceph --cluster cephpa1 osd lspools | grep 85
> 85 csd-dataonly-ec-pool
> ~]$ ceph --cluster cephpa1 osd pool get csd-dataonly-ec-pool crush_rule
> crush_rule: csd-data-pool
>
> rule csd-data-pool {
>         id 5
>         type erasure
>         min_size 3
>         max_size 5
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class big
>         step choose indep 0 type host    <--- this was "osd", before
>         step emit
> }
>
> At the time I changed the rule there was no 'down' PG; all PGs in the
> cluster were 'active', possibly plus some other state (remapped,
> degraded, whatever), as I had added some new disk servers a few days
> before. The rule change, of course, caused some data movement, and after
> a while I found those three OSDs down.
>
>    Thanks!
>
>            Fulvio
>
> On 3/30/22 16:48, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I'm not sure why that PG doesn't register.
> > But let's look into your log. The relevant lines are:
> >
> >    -635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
> > log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
> > not contain the required bound [605868,616454) start
> >
> >    -628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
> > 616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
> > 605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
> > [158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
> > pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
> > past_intervals [616435,616454) start interval does not contain the
> > required bound [605868,616454) start
> >
> >    -355> 2022-03-30 14:49:57.816 7ff904970700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > In function 'void PG::check_past_interval_bounds() const' thread
> > 7ff904970700 time 2022-03-30 14:49:57.811165
> >
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > 956: ceph_abort_msg("past_interval start interval mismatch")
> >
> > What is the output of `ceph pg 85.12 query`?
> >
> > What's the history of that PG? Was it moved around recently, prior to
> > this crash?
> > Are the other down OSDs also hosting broken parts of PG 85.12?
> >
> > Cheers, Dan
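
P.S. for the archives: a CRUSH rule edit like the one above is normally
applied by round-tripping the map through crushtool. A minimal sketch,
with illustrative file names:

   ceph osd getcrushmap -o crushmap.bin
   crushtool -d crushmap.bin -o crushmap.txt
   # edit crushmap.txt: "step choose indep 0 type osd" -> "... type host"
   crushtool -c crushmap.txt -o crushmap-new.bin
   ceph osd setcrushmap -i crushmap-new.bin

Note that setcrushmap immediately triggers the kind of data movement
described above, so it is best planned for a time when the cluster is
healthy.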