Re: Advice needed: stuck cluster halfway upgraded, comms issues and MON space usage

What's with the OSDs having loopback addresses? E.g. v2:
127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667

Does `ceph osd dump` show those same loopback addresses for each OSD?
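
Something along these lines should show it quickly - just a sketch using the
standard CLI, nothing cluster-specific assumed:

    # any OSD whose registered addresses in the osdmap are loopback
    ceph osd dump | grep '^osd\.' | grep '127.0.0.1'

    # or look at one of the suspect OSDs directly (osd.127 from your log)
    ceph osd find 127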

This sounds familiar... I'm trying to find the recent ticket.

.. dan


On Mon, Mar 22, 2021, 6:07 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:

> hi Dan:
>
> So, unsetting nodown results in... almost all of the OSDs being marked
> down. (231 down out of 328).
> Checking the actual OSD services, most of them were actually up and active
> on the nodes, even when the mons had marked them down.
> (On a few nodes, the down services corresponded to OSDs that had been
> flapping - but increasing osd_max_markdown locally to keep them up despite
> the previous flapping, and restarting the services... didn't help.)
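>
> (For reference, the local change was along these lines - I'm assuming the
> full option names are osd_max_markdown_count / osd_max_markdown_period,
> and the values here are just an example:
>
>     # /etc/ceph/ceph.conf on the affected node, [osd] section
>     osd_max_markdown_count = 100
>     osd_max_markdown_period = 600
>
>     # then restart the previously-flapping daemon, e.g.
>     systemctl restart ceph-osd@127
> )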
>
> In fact, starting up the few OSD services which had actually stopped
> resulted in a different set of OSDs being marked down, and some others
> coming up.
> We currently have a sort of "rolling OSD outness" passing through the
> cluster - there are always ~230 OSDs marked down now, but which ones they
> are keeps changing (we've had everything from 1 HOST down to 4 HOSTS down
> over the past 14 minutes as things fluctuate).
>
> A log from one of the "down" OSDs [which is actually running, and on the
> same host as OSDs which are marked up] shows this worrying snippet
>
> 2021-03-22 17:01:45.298 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:45.298 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:46.340 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:46.340 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:47.376 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:47.376 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:48.395 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:48.395 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:49.407 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:49.407 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:50.400 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:50.400 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:50.922 7f6c9f088700 -1 --2- 10.1.50.21:0/23673 >> [v2:
> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] conn(0x56010903e400
> 0x56011a71fc00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0
> tx=0)._handle_peer_banner peer [v2:
> 127.0.0.1:6881/17664667,v1:127.0.0.1:6882/17664667] is using msgr V1
> protocol
> 2021-03-22 17:01:50.922 7f6c9f889700 -1 --2- 10.1.50.21:0/23673 >> [v2:
> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] conn(0x5600df434000
> 0x56011718e000 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0
> tx=0)._handle_peer_banner peer [v2:
> 127.0.0.1:6821/13015214,v1:127.0.0.1:6831/13015214] is using msgr V1
> protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:
> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] conn(0x5600f85ed800
> 0x560109df2a00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0
> tx=0)._handle_peer_banner peer [v2:
> 127.0.0.1:6826/11091658,v1:127.0.0.1:6828/11091658] is using msgr V1
> protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:
> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] conn(0x5600f22ea000
> 0x560117182300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0
> tx=0)._handle_peer_banner peer [v2:
> 127.0.0.1:6859/2683393,v1:127.0.0.1:6862/2683393] is using msgr V1
> protocol
> 2021-03-22 17:01:50.922 7f6ca008a700 -1 --2- 10.1.50.21:0/23673 >> [v2:
> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] conn(0x5600df435c00
> 0x560139370300 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0
> tx=0)._handle_peer_banner peer [v2:
> 127.0.0.1:6901/15090566,v1:127.0.0.1:6907/15090566] is using msgr V1
> protocol
> 2021-03-22 17:01:51.377 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:51.377 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:52.370 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:52.370 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:53.377 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:53.377 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:54.385 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:54.385 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:55.385 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:55.385 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:56.362 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:56.362 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
> 2021-03-22 17:01:57.324 7f6c9c883700  1 osd.127 253515 is_healthy false --
> only 0/10 up peers (less than 33%)
> 2021-03-22 17:01:57.324 7f6c9c883700  1 osd.127 253515 not healthy;
> waiting to boot
>
>
>
> Any suggestions?
>
> Sam
>
> P.S. an example ceph status as it is now [with everything now on 14.2.18,
> since we had to restart osds anyway]:
>
>  cluster:
>     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>     health: HEALTH_WARN
>             pauserd,pausewr,noout,nobackfill,norebalance flag(s) set
>             230 osds down
>             4 hosts (80 osds) down
>             Reduced data availability: 2048 pgs inactive
>             8 slow ops, oldest one blocked for 901 sec, mon.cephs01 has
> slow ops
>
>   services:
>     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 2h)
>     mgr: cephs01(active, since 77m)
>     osd: 329 osds: 98 up (since 4s), 328 in (since 4d)
>          flags pauserd,pausewr,noout,nobackfill,norebalance
>
>   data:
>     pools:   3 pools, 2048 pgs
>     objects: 0 objects, 0 B
>     usage:   0 B used, 0 B / 0 B avail
>     pgs:     100.000% pgs unknown
>              2048 unknown
>
>
>
> On Mon, 22 Mar 2021 at 14:57, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
>
>> Hi,
>>
>> I would unset nodown (hiding osd failures) and norecover (blocking PGs
>> from recovering degraded objects), then start bringing the osds up.
>> As soon as you have some osd logs reporting some failures, then share
>> those...
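>>
>> Concretely that's just:
>>
>>    ceph osd unset nodown
>>    ceph osd unset norecover
>>
>> and then start the ceph-osd services host by host (e.g. systemctl start
>> ceph-osd.target on each OSD host).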
>>
>> - Dan
>>
>> On Mon, Mar 22, 2021 at 3:49 PM Sam Skipsey <aoanla@xxxxxxxxx> wrote:
>> >
>> > So, we started the mons and mgr up again; here are the relevant logs,
>> including ceph versions. We've also turned off all of the firewalls on
>> all of the nodes, so we know there can't be network issues [and, indeed,
>> all of our management of the OSDs happens via logins from the service
>> nodes or from one node to another]
>> >
>> > > ceph status
>> >
>> >
>> >   cluster:
>> >     id:     a1148af2-6eaf-4486-a27e-a05a78c2b378
>> >     health: HEALTH_WARN
>> >
>>  pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>> >             1 nearfull osd(s)
>> >             3 pool(s) nearfull
>> >             Reduced data availability: 2048 pgs inactive
>> >             mons cephs01,cephs02,cephs03 are using a lot of disk space
>> >
>> >   services:
>> >     mon: 3 daemons, quorum cephs01,cephs02,cephs03 (age 61s)
>> >     mgr: cephs01(active, since 76s)
>> >     osd: 329 osds: 329 up (since 63s), 328 in (since 4d); 466 remapped
>> pgs
>> >          flags
>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
>> >
>> >   data:
>> >     pools:   3 pools, 2048 pgs
>> >     objects: 0 objects, 0 B
>> >     usage:   0 B used, 0 B / 0 B avail
>> >     pgs:     100.000% pgs unknown
>> >              2048 unknown
>> >
>> >
>> > > ceph health detail
>> >
>> > HEALTH_WARN
>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set;
>> 1 nearfull osd(s); 3 pool(s) nearfull; Reduced data availability: 2048 pgs
>> inactive; mons cephs01,cephs02,cephs03 are using a lot of disk space
>> > OSDMAP_FLAGS
>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>> > OSD_NEARFULL 1 nearfull osd(s)
>> >     osd.63 is near full
>> > POOL_NEARFULL 3 pool(s) nearfull
>> >     pool 'dteam' is nearfull
>> >     pool 'atlas' is nearfull
>> >     pool 'atlas-localgroup' is nearfull
>> > PG_AVAILABILITY Reduced data availability: 2048 pgs inactive
>> >     pg 13.1ef is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f0 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f1 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f2 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f3 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f4 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f5 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f6 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f7 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f8 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1f9 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1fa is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1fb is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1fc is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1fd is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1fe is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 13.1ff is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1ec is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f0 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f1 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f2 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f3 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f4 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f5 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f6 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f7 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f8 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1f9 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1fa is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1fb is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1fc is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1fd is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1fe is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 14.1ff is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1ed is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f0 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f1 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f2 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f3 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f4 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f5 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f6 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f7 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f8 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1f9 is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1fa is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1fb is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1fc is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1fd is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1fe is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> >     pg 15.1ff is stuck inactive for 89.322981, current state unknown,
>> last acting []
>> > MON_DISK_BIG mons cephs01,cephs02,cephs03 are using a lot of disk space
>> >     mon.cephs01 is 96 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.cephs02 is 96 GiB >= mon_data_size_warn (15 GiB)
>> >     mon.cephs03 is 96 GiB >= mon_data_size_warn (15 GiB)
>> >
>> >
>> > > ceph versions
>> >
>> > {
>> >     "mon": {
>> >         "ceph version 14.2.18
>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
>> >     },
>> >     "mgr": {
>> >         "ceph version 14.2.18
>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 1
>> >     },
>> >     "osd": {
>> >         "ceph version 14.2.10
>> (b340acf629a010a74d90da5782a2c5fe0b54ac20) nautilus (stable)": 1,
>> >         "ceph version 14.2.15
>> (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)": 188,
>> >         "ceph version 14.2.16
>> (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)": 18,
>> >         "ceph version 14.2.18
>> (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 122
>> >     },
>> >
>> >
>> > >>>>>>
>> >
>> > As a note, the log where the mgr explodes (which precipitated all of
>> this) definitely shows the problem occurring on the 12th [when 14.2.17
>> dropped], but things didn't "break" until we tried upgrading OSDs to
>> 14.2.18...
>> >
>> >
>> > Sam
>> >
>> >
>> > On Mon, 22 Mar 2021 at 12:20, Sam Skipsey <aoanla@xxxxxxxxx> wrote:
>> >>
>> >> Hi Dan:
>> >>
>> >> Thanks for the reply - at present, our mons and mgrs are off [because
>> of the unsustainable nature of the filesystem usage]. We'll try turning
>> them on again for long enough to get "ceph status" out of them (though
>> last time the mgr was unable to actually talk to anything), and will
>> reply at that point.
>> >>
>> >> (And thanks for the link to the bug tracker - I guess this mismatch of
>> expectations is why the devs are so keen to move to containerised
>> deployments, where there is no co-location of different types of daemon:
>> it means they don't need to worry as much about when it's okay to restart
>> a service on package update. Disappointing that the ticket seems stale
>> after 2 years...)
>> >>
>> >> Sam
>> >>
>> >>
>> >>
>> >> On Mon, 22 Mar 2021 at 12:11, Dan van der Ster <dan@xxxxxxxxxxxxxx>
>> wrote:
>> >>>
>> >>> Hi Sam,
>> >>>
>> >>> The daemons restart (for *some* releases) because of this:
>> >>> https://tracker.ceph.com/issues/21672
>> >>> In short, if the selinux module changes, and if you have selinux
>> >>> enabled, then midway through yum update, there will be a systemctl
>> >>> restart ceph.target issued.
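>> >>>
>> >>> A rough way to check in advance whether a given update will trigger
>> >>> that (just a sketch, not exhaustive):
>> >>>
>> >>>    getenforce                            # anything other than Disabled counts
>> >>>    yum list updates | grep ceph-selinux  # is the selinux package in this update?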
>> >>>
>> >>> For the rest -- I think you should focus on getting the PGs all
>> >>> active+clean as soon as possible, because the degraded and remapped
>> >>> states are what leads to mon / osdmap growth.
>> >>> This kind of scenario is why we wrote this tool:
>> >>>
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>> >>> It will use pg-upmap-items to force the PGs to the OSDs where they are
>> >>> currently residing.
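>> >>>
>> >>> From memory (double-check the README), the script just prints the
>> >>> pg-upmap-items commands, so you can review them before applying:
>> >>>
>> >>>    ./upmap-remapped.py        # prints the ceph osd pg-upmap-items commands
>> >>>    ./upmap-remapped.py | sh   # apply them
>> >>>
>> >>> Note pg-upmap-items requires require-min-compat-client to be at least
>> >>> luminous.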
>> >>>
>> >>> But there is some clarification needed before you go ahead with that.
>> >>> Could you share `ceph status`, `ceph health detail`?
>> >>>
>> >>> Cheers, Dan
>> >>>
>> >>>
>> >>> On Mon, Mar 22, 2021 at 12:05 PM Sam Skipsey <aoanla@xxxxxxxxx>
>> wrote:
>> >>> >
>> >>> > Hi everyone:
>> >>> >
>> >>> > I posted to the list on Friday morning (UK time), but apparently my
>> email
>> >>> > is still in moderation (I have an email from the list bot telling
>> me that
>> >>> > it's held for moderation but no updates).
>> >>> >
>> >>> > Since this is a bit urgent - we have ~3PB of storage offline - I'm
>> posting
>> >>> > again.
>> >>> >
>> >>> > To save retyping the whole thing, I will direct you to a copy of
>> the email
>> >>> > I wrote on Friday:
>> >>> >
>> >>> > http://aoanla.pythonanywhere.com/Logs/EmailToCephUsers.txt
>> >>> >
>> >>> > (Since that was sent, we did successfully add big SSDs to the MON
>> hosts so
>> >>> > they don't fill up their disks with store.db s).
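>> >>> >
>> >>> > (For anyone following along, the store size can be checked with
>> >>> > something like:
>> >>> >
>> >>> >    du -sh /var/lib/ceph/mon/*/store.db
>> >>> >
>> >>> > - assuming the default mon data path.)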
>> >>> >
>> >>> > I would appreciate any advice - assuming this also doesn't get
>> stuck in
>> >>> > moderation queues.
>> >>> >
>> >>> > --
>> >>> > Sam Skipsey (he/him, they/them)
>> >>> > _______________________________________________
>> >>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> >>
>> >>
>> >>
>> >> --
>> >> Sam Skipsey (he/him, they/them)
>> >>
>> >>
>> >
>> >
>> > --
>> > Sam Skipsey (he/him, they/them)
>> >
>> >
>>
>
>
> --
> Sam Skipsey (he/him, they/them)
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


