What's with the OSDs having loopback addresses? E.g.:

    v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667

Does `ceph osd dump` show those same loopback addresses for each OSD?
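
A quick way to check that from the cluster side (a sketch - it assumes you run
it on a node with an admin keyring, and the grep pattern is only illustrative):

    # list every OSD entry in the current OSD map, keeping only those
    # whose registered public/cluster addresses are loopback
    ceph osd dump | grep '^osd\.' | grep '127.0.0.1'

If that matches most of your OSDs, the daemons registered themselves with a
loopback address when they booted, and their peers will never be able to reach
them on those endpoints.
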
This sounds familiar... I'm trying to find the recent ticket.

.. dan

On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:

> hi Dan:
>
> So, unsetting nodown results in... almost all of the OSDs being marked
> down (231 down out of 328).
> Checking the actual OSD services, most of them were actually up and active
> on the nodes, even when the mons had marked them down.
> (On a few nodes, the down services corresponded to OSDs that had been
> flapping - but increasing osd_max_markdown locally to keep them up despite
> the previous flapping, and restarting the services... didn't help.)
>
> In fact, starting up the few OSD services which had actually stopped
> resulted in a different set of OSDs being marked down, and some others
> coming up.
> We currently have a sort of "rolling OSD outness" passing through the
> cluster - there are always ~230 OSDs marked down now, but which ones those
> are changes (we've had everything from 1 host down to 4 hosts down over the
> past 14 minutes as things fluctuate).
>
> A log from one of the "down" OSDs [which is actually running, and on the
> same host as OSDs which are marked up] shows this worrying snippet:
>
> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:45.298 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:46.340 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:47.376 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:48.395 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:49.407 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:50.400 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1 protocol
> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1 protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1 protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1 protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner peer [v2:127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1 protocol
> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:51.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:52.370 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:53.377 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:54.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:55.385 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:56.362 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 is_healthy false -- only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:57.324 7f6c9c883700 1 osd.127 253515 not healthy; waiting to boot
>
> Any suggestions?
>
> Sam
>
> P.S. an example ceph status as it is now [with everything now on 14.2.18,
> since we had to restart osds anyway]:
>
>   cluster:
>     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>     health: HEALTH_WARN
>             pauserd,pausewr,noout,nobackfill,norebalance flag(s) set
>             230 osds down
>             4 hosts (80 osds) down
>             Reduced data availability: 2048 pgs inactive
>             8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has slow ops
>
>   services:
>     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h)
>     mgr: cephs01(active, since 77m)
>     osd: 329 osds: 98 up (since 4s), 328 in (since 4d)
>          flags pauserd,pausewr,noout,nobackfill,norebalance
>
>   data:
>     pools:   3 pools, 2048 pgs
>     objects: 0 objects, 0 B
>     usage:   0 B used, 0 B / 0 B avail
>     pgs:     100.000% pgs unknown
>              2048 unknown
>
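
(As an aside on "checking the actual OSD services": the comparison I'd make on
each host is roughly the following - a sketch, assuming the stock systemd units
and a made-up hostname:)

    # the cluster's view: OSDs currently marked down, grouped by host
    ceph osd tree down

    # one host's view: the ceph-osd units that are actually running there
    ssh ceph-host-01 systemctl list-units 'ceph-osd@*' --state=running

(Any OSD that is running on the host but down in the tree is either being
reported down by its peers or is refusing to mark itself up, which fits the
"is_healthy false ... waiting to boot" messages in the log above.)
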
>
> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
>> Hi,
>>
>> I would unset nodown (hiding osd failures) and norecover (blocking PGs
>> from recovering degraded objects), then start starting osds.
>> As soon as you have some osd logs reporting some failures, then share
>> those...
>>
>> - Dan
>>
>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:
>> >
>> > So, we started the mons and mgr up again, and here are the relevant
>> > logs, including ceph versions. We've also turned off all of the
>> > firewalls on all of the nodes so we know that there can't be network
>> > issues [and, indeed, all of our management of the OSDs happens via
>> > logins from the service nodes or to each other].
>> >
>> > > ceph status
>> >
>> >   cluster:
>> >     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>> >     health: HEALTH_WARN
>> >             pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>> >             1 nearfull osd(s)
>> >             3 pool(s) nearfull
>> >             Reduced data availability: 2048 pgs inactive
>> >             mons cephs01,cephs02,cephs03 are using a lot of disk space
>> >
>> >   services:
>> >     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s)
>> >     mgr: cephs01(active, since 76s)
>> >     osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped pgs
>> >          flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
>> >
>> >   data:
>> >     pools:   3 pools, 2048 pgs
>> >     objects: 0 objects, 0 B
>> >     usage:   0 B used, 0 B / 0 B avail
>> >     pgs:     100.000% pgs unknown
>> >              2048 unknown
>> >
>> > > ceph health detail
>> >
>> > HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set; 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space
>> > OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>> > OSD_NEARFULL 1 nearfull osd(s)
>> >     osd.63 is near full
>> > POOL_NEARFULL 3 pool(s) nearfull
>> >     pool 'dteam' is nearfull
>> >     pool 'atlas' is nearfull
>> >     pool 'atlas-localgroup' is nearfull
>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive
>> >     pg 13.1ef is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1fa is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1fb is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1fc is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1fd is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1fe is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 13.1ff is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1ec is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1fa is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1fb is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1fc is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1fd is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1fe is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 14.1ff is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1ed is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f0 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f1 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f2 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f3 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f4 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f5 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f6 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f7 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f8 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1f9 is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1fa is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1fb is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1fc is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1fd is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1fe is stuck inactive for 89.322981, current state unknown, last acting []
>> >     pg 15.1ff is stuck inactive for 89.322981, current state unknown, last acting []
>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space
>> >     mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)
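
(Side note on the MON_DISK_BIG warning above: the mon store normally only
shrinks once the PGs are all clean again, so it is more a symptom than the
first thing to fix. If disk space gets tight in the meantime, a manual
compaction can be triggered per mon - a sketch, run against one mon at a time,
and the path shown is just the default data location:)

    # ask one mon to compact its store, then check the on-disk effect
    ceph tell mon.cephs01 compact
    du -sh /var/lib/ceph/mon/ceph-cephs01/store.db
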
>> >
>> > > ceph versions
>> >
>> > {
>> >     "mon": {
>> >         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
>> >     },
>> >     "mgr": {
>> >         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1
>> >     },
>> >     "osd": {
>> >         "ceph version 14.2.10 (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1,
>> >         "ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188,
>> >         "ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18,
>> >         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122
>> >     },
>> >
>> > >>>>>>
>> >
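
(On the mixed OSD versions above: if it helps to tie a version to a specific
daemon or host, the mons keep that in the OSD metadata recorded at each boot,
so it can be queried even while the OSD itself is unreachable - a sketch,
using osd.127 from the log earlier as the example:)

    # show the version and host recorded for a single OSD
    ceph osd metadata 127 | grep -E '"ceph_version"|"hostname"'
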
>> > As a note, the log where the mgr explodes (which precipitated all of
>> > this) definitely shows the problem occurring on the 12th [when 14.2.17
>> > dropped], but things didn't "break" until we tried upgrading OSDs to
>> > 14.2.18...
>> >
>> > Sam
>> >
>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla@xxxxxxxxx> wrote:
>> >>
>> >> Hi Dan:
>> >>
>> >> Thanks for the reply - at present, our mons and mgrs are off [because
>> >> of the unsustainable nature of the filesystem usage]. We'll try putting
>> >> them on again for long enough to get "ceph status" out of them (though
>> >> the mgr was unable to actually talk to anything last time), and reply
>> >> at that point.
>> >>
>> >> (And thanks for the link to the bug tracker - I guess this mismatch of
>> >> expectations is why the devs are so keen to move to containerised
>> >> deployments where there is no co-location of different types of server,
>> >> as it means they don't need to worry as much about the assumptions
>> >> about when it's okay to restart a service on package update.
>> >> Disappointing that it seems stale after 2 years...)
>> >>
>> >> Sam
>> >>
>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>> >>>
>> >>> Hi Sam,
>> >>>
>> >>> The daemons restart (for *some* releases) because of this:
>> >>> https://tracker.ceph.com/issues/21672
>> >>> In short, if the selinux module changes, and if you have selinux
>> >>> enabled, then midway through yum update, there will be a systemctl
>> >>> restart ceph.target issued.
>> >>>
>> >>> For the rest -- I think you should focus on getting the PGs all
>> >>> active+clean as soon as possible, because the degraded and remapped
>> >>> states are what leads to mon / osdmap growth.
>> >>> This kind of scenario is why we wrote this tool:
>> >>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they are
>> >>> currently residing.
>> >>>
>> >>> But there is some clarification needed before you go ahead with that.
>> >>> Could you share `ceph status`, `ceph health detail`?
>> >>>
>> >>> Cheers, Dan
>> >>>
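
(To expand on the upmap-remapped.py suggestion above: under the hood it drives
the plain pg-upmap-items command, so the same thing can be done by hand for a
single PG while evaluating the script - a sketch, using one of the inactive PGs
above with made-up OSD ids; note that upmap needs the cluster's
require-min-compat-client set to luminous or newer:)

    # for pg 13.1ef, wherever CRUSH would currently place it on osd.12,
    # use osd.34 instead (e.g. the OSD the data is actually sitting on)
    ceph osd pg-upmap-items 13.1ef 12 34

    # remove the exception again once the cluster is healthy
    ceph osd rm-pg-upmap-items 13.1ef
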
>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:
>> >>> >
>> >>> > Hi everyone:
>> >>> >
>> >>> > I posted to the list on Friday morning (UK time), but apparently my
>> >>> > email is still in moderation (I have an email from the list bot
>> >>> > telling me that it's held for moderation but no updates).
>> >>> >
>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm
>> >>> > posting again.
>> >>> >
>> >>> > To save retyping the whole thing, I will direct you to a copy of the
>> >>> > email I wrote on Friday:
>> >>> >
>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt
>> >>> >
>> >>> > (Since that was sent, we did successfully add big SSDs to the MON
>> >>> > hosts so they don't fill up their disks with store.dbs.)
>> >>> >
>> >>> > I would appreciate any advice - assuming this also doesn't get stuck
>> >>> > in moderation queues.
>> >>> >
>> >>> > --
>> >>> > Sam Skipsey (he/him, they/them)
>> >>> > _______________________________________________
>> >>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >>
>> >> --
>> >> Sam Skipsey (he/him, they/them)
>> >
>> > --
>> > Sam Skipsey (he/him, they/them)
>
> --
> Sam Skipsey (he/him, they/them)
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx