Re: Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

On 26/09/2018 12:41, Eugen Block wrote:
Hi,

I'm not sure how recovery "still works" with the norecover flag set.
Anyway, I think you should unset the norecover and nobackfill flags: even if not all OSDs come back up, you should allow the cluster to backfill PGs. Unsetting norebalance might also help, but that can be done step by step; first watch whether the cluster gets any better without it.
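Just to spell the suggestion out as commands (this is only a sketch, assuming a reachable cluster and an admin keyring):

```shell
# Unset the recovery-blocking flags one at a time, watching
# 'ceph -s' between steps to see whether the cluster improves.
ceph osd unset norecover
ceph osd unset nobackfill
# Only once the cluster keeps improving:
# ceph osd unset norebalance
```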

The best way to see whether recovery is making progress is to look at the recovering PGs in
    ceph pg dump

and check the object counters: if some of them are actually going down, recovery is working. If they don't move, the PG is not recovering/backfilling.

Haven't found a better way to determine this (yet).
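Not from the thread, just an illustrative sketch of that check: snapshot the degraded-object counters twice and compare the totals. The two here-strings below are made-up stand-ins for the pgid / state / degraded-objects columns you would pull out of a real `ceph pg dump`.

```shell
# Two fake snapshots of (pgid, state, degraded objects), taken a few
# minutes apart; on a real cluster these come from 'ceph pg dump'.
snap1='1.0 active+recovering+degraded 120
1.1 active+clean 0
1.2 active+recovering+degraded 340'
snap2='1.0 active+recovering+degraded 80
1.1 active+clean 0
1.2 active+recovering+degraded 340'

# Sum the degraded-object column; a shrinking total means recovery works.
sum1=$(printf '%s\n' "$snap1" | awk '{s += $3} END {print s}')
sum2=$(printf '%s\n' "$snap2" | awk '{s += $3} END {print s}')

if [ "$sum2" -lt "$sum1" ]; then
    echo "recovery progressing ($sum1 -> $sum2)"
else
    echo "recovery stalled at $sum2 degraded objects"
fi
```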

--WjW

And could you check the plan "peetaur2" suggested on IRC:
https://bpaste.net/show/20581774ff08
Also, Be_El strongly suggests unsetting the nodown parameter.

The suggested config settings look reasonable to me. You should also try raising the timeouts for the MONs and increasing their db cache, as suggested earlier today.

after this point, if an osd is down, it's fine...it'll only prevent access to that specific data (bad for clients, fine for recovery)

I agree with that: the cluster state has to become stable first, then you can look into the OSDs that won't come up.

Regards,
Eugen


Quoting by morphin <morphinwithyou@xxxxxxxxx>:

Hello Eugen. Thank you for your answer. I was losing hope of getting
an answer here.

I have faced losing 2 of 3 mons many times, but I never hit a
problem like this on Luminous.
Recovery is still running and it has been 30 hours. The last state of
my cluster is: https://paste.ubuntu.com/p/rDNHCcNG7P/
We are discussing on IRC whether we should unset the nodown and norecover flags.

I tried unsetting the nodown flag yesterday, and now 15 OSDs won't start
anymore, all with the same error --> : https://paste.ubuntu.com/p/94xpzxTSnr/
I don't know the reason for this, but I saw some commits for the
dump problem. Is this a bug or something else?

And could you check the plan "peetaur2" suggested on IRC:
https://bpaste.net/show/20581774ff08
Also, Be_El strongly suggests unsetting the nodown parameter.
What do you think?
Eugen Block <eblock@xxxxxx> wrote on Wed, 26 Sep 2018 at 12:54:

Hi,

could this be related to the other Mimic upgrade thread [1]? Your
failing MONs sound a bit like the problem described there; eventually
the user reported recovery success. You could try the described steps:

  - disable cephx auth with 'auth_cluster_required = none'
  - set mon_osd_cache_size = 200000 (default: 10)
  - set osd_heartbeat_interval = 30
  - set mon_lease = 75
  - increase rocksdb_cache_size and leveldb_cache_size on the mons
    to be big enough to cache the entire db

I just copied the mentioned steps, so please read the thread before
applying anything.
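For reference only, the list above would look roughly like this as ceph.conf entries. The cache sizes are made-up example values (size them to hold your mon db), and the defaults in the comments are from memory, so verify everything against the linked thread before applying:

```ini
[global]
auth_cluster_required = none    # temporarily disables cephx -- revert after recovery

[mon]
mon_osd_cache_size = 200000     # default: 10
mon_lease = 75                  # default: 5
rocksdb_cache_size = 1073741824 # example: 1 GiB, enough to cache the whole mon db
leveldb_cache_size = 1073741824 # example: 1 GiB

[osd]
osd_heartbeat_interval = 30     # default: 6
```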

Regards,
Eugen

[1]
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030018.html


Quoting by morphin <morphinwithyou@xxxxxxxxx>:

> After trying so many things with lots of help on IRC, my pool
> health is still in ERROR and I think I can't recover from this.
> https://paste.ubuntu.com/p/HbsFnfkYDT/
> In the end, 2 of 3 mons crashed and restarted at the same time, and the pool
> went offline. Recovery has taken more than 12 hours and is way too slow.
> Somehow recovery seems not to be working.
>
> If I can reach my data, I will re-create the pool easily.
> If I run the ceph-objectstore-tool script to regenerate the mon store.db, can I
> access the RBD pool again?
> by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018
> at 20:03:
>>
>> Hi,
>>
>> Cluster is still down :(
>>
>> Up to now we have managed to stabilize the OSDs: 118 of 160 OSDs are
>> stable and the cluster is still in the process of settling. Thanks to
>> Be-El in the ceph IRC channel, who helped a lot to make the
>> flapping OSDs stable.
>>
>> What we have learned so far is that this was caused by the sudden death of
>> 2 of our 3 monitor servers, and that when they come back, if they do not
>> start one by one (each after the previous one has joined the cluster), this
>> can happen: the cluster can become unhealthy and it can take countless
>> hours to come back.
>>
>> Right now here is our status:
>> ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
>> health detail: https://paste.ubuntu.com/p/w4gccnqZjR/
>>
>> Since the OSD disks are NL-SAS, it can take up to 24 hours for the
>> cluster to come back online. What's more, we have been told that we would
>> be extremely lucky if all the data is rescued.
>>
>> Most unhappily, our strategy is just to sit and wait :(. As soon as the
>> peering and activating count drops to 300-500 PGs we will restart the
>> stopped OSDs one by one, and after each OSD we will wait for the cluster to
>> settle down. The amount of data stored in the OSDs is 33TB. Our main
>> concern is to export our RBD pool data to a backup space; then
>> we will start again with a clean one.
>>
>> I hope an expert can confirm our analysis. Any help or advice
>> would be greatly appreciated.
>> by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018
>> at 15:08:
>> >
>> > Reducing the recovery parameter values did not change much.
>> > There are a lot of OSD still marked down.
>> >
>> > I don't know what I need to do after this point.
>> >
>> > [osd]
>> > osd recovery op priority = 63
>> > osd client op priority = 1
>> > osd recovery max active = 1
>> > osd max scrubs = 1
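A side note, not part of the original mail: these priorities run on a 1-63 scale where higher wins, so `osd recovery op priority = 63` with `osd client op priority = 1` maximally favors recovery traffic over client IO. If the goal were the opposite (keep clients responsive while recovery trickles), the usual shape would be roughly the following (defaults in the comments are from memory, so verify against the docs for your release):

```ini
[osd]
osd client op priority = 63    # default: 63 -- favor client IO
osd recovery op priority = 3   # default: 3 -- throttle recovery work
osd recovery max active = 1
osd max scrubs = 1
```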
>> >
>> >
>> > ceph -s
>> >   cluster:
>> >     id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
>> >     health: HEALTH_ERR
>> >             42 osds down
>> >             1 host (6 osds) down
>> >             61/8948582 objects unfound (0.001%)
>> >             Reduced data availability: 3837 pgs inactive, 1822 pgs
>> > down, 1900 pgs peering, 6 pgs stale
>> >             Possible data damage: 18 pgs recovery_unfound
>> >             Degraded data redundancy: 457246/17897164 objects degraded
>> > (2.555%), 213 pgs degraded, 209 pgs undersized
>> >             2554 slow requests are blocked > 32 sec
>> >             3273 slow ops, oldest one blocked for 1453 sec, daemons
>> >
>> [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
>> > have slow ops.
>> >
>> >   services:
>> >     mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
>> >     mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
>> > SRV-SEKUARK4
>> >     osd: 168 osds: 118 up, 160 in
>> >
>> >   data:
>> >     pools:   1 pools, 4096 pgs
>> >     objects: 8.95 M objects, 17 TiB
>> >     usage:   33 TiB used, 553 TiB / 586 TiB avail
>> >     pgs:     93.677% pgs not active
>> >              457246/17897164 objects degraded (2.555%)
>> >              61/8948582 objects unfound (0.001%)
>> >              1676 down
>> >              1372 peering
>> >              528  stale+peering
>> >              164  active+undersized+degraded
>> >              145  stale+down
>> >              73   activating
>> >              40   active+clean
>> >              29   stale+activating
>> >              17 active+recovery_unfound+undersized+degraded
>> >              16   stale+active+clean
>> >              16 stale+active+undersized+degraded
>> >              9    activating+undersized+degraded
>> >              3    active+recovery_wait+degraded
>> >              2    activating+undersized
>> >              2    activating+degraded
>> >              1    creating+down
>> >              1 stale+active+recovery_unfound+undersized+degraded
>> >              1 stale+active+clean+scrubbing+deep
>> >              1 stale+active+recovery_wait+degraded
>> >
>> > ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
>> > ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/
>> > by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018
>> > at 14:32:
>> > >
>> > > The config didn't work; increasing the numbers only caused
>> > > more OSD drops.
>> > >
>> > > ceph -s
>> > >   cluster:
>> > >     id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
>> > >     health: HEALTH_ERR
>> > >             norebalance,norecover flag(s) set
>> > >             1 osds down
>> > >             17/8839434 objects unfound (0.000%)
>> > >             Reduced data availability: 3578 pgs inactive, 861 pgs
>> > > down, 1928 pgs peering, 11 pgs stale
>> > >             Degraded data redundancy: 44853/17678868 objects degraded
>> > > (0.254%), 221 pgs degraded, 20 pgs undersized
>> > >             610 slow requests are blocked > 32 sec
>> > >             3996 stuck requests are blocked > 4096 sec
>> > >             6076 slow ops, oldest one blocked for 4129 sec, daemons
>> > >
>> [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
>> > > have slow ops.
>> > >
>> > >   services:
>> > >     mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
>> > >     mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
>> > >     osd: 168 osds: 128 up, 129 in; 2 remapped pgs
>> > >          flags norebalance,norecover
>> > >
>> > >   data:
>> > >     pools:   1 pools, 4096 pgs
>> > >     objects: 8.84 M objects, 17 TiB
>> > >     usage:   26 TiB used, 450 TiB / 477 TiB avail
>> > >     pgs:     0.024% pgs unknown
>> > >              89.160% pgs not active
>> > >              44853/17678868 objects degraded (0.254%)
>> > >              17/8839434 objects unfound (0.000%)
>> > >              1612 peering
>> > >              720  down
>> > >              583  activating
>> > >              319  stale+peering
>> > >              255  active+clean
>> > >              157  stale+activating
>> > >              108  stale+down
>> > >              95   activating+degraded
>> > >              84   stale+active+clean
>> > >              50 active+recovery_wait+degraded
>> > >              29   creating+down
>> > >              23   stale+activating+degraded
>> > >              18 stale+active+recovery_wait+degraded
>> > >              14 active+undersized+degraded
>> > >              12 active+recovering+degraded
>> > >              4    stale+creating+down
>> > >              3 stale+active+recovering+degraded
>> > >              3 stale+active+undersized+degraded
>> > >              2    stale
>> > >              1 active+recovery_wait+undersized+degraded
>> > >              1 active+clean+scrubbing+deep
>> > >              1    unknown
>> > >              1 active+undersized+degraded+remapped+backfilling
>> > >              1 active+recovering+undersized+degraded
>> > >
>> > > I guess the OSD down/drop issue increases the recovery time, so I
>> > > decided to try decreasing the recovery parameters for less load on the
>> > > cluster.
>> > > I have NVMe and SAS disks, the servers are powerful enough, and the
>> > > network is 4x10Gb.
>> > > I don't think my cluster is in bad shape, because I have datacenter
>> > > redundancy (14 servers + 14 servers). The 7 crashed servers are all in
>> > > datacenter A, and it took only a few minutes to get them back online.
>> > > Also, 2 of them are monitors, so cluster I/O should have been suspended
>> > > and there should be less data difference.
>> > >
>> > > On the other hand, I don't understand the burden of this recovery. I
>> > > have been through many recoveries, but none of them stopped my cluster
>> > > from working. This recovery load is so high that it hasn't let up for
>> > > hours. I wish I could just decrease the recovery speed and continue
>> > > serving my VMs.
>> > > Is the recovery load somehow different in Mimic?
>> > > Luminous was pretty fine indeed.
>> > > by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018
>> > > at 13:57:
>> > > >
>> > > > Thank you for the answer.
>> > > >
>> > > > What do you think of this conf to speed up the recovery?
>> > > >
>> > > > [osd]
>> > > > osd recovery op priority = 63
>> > > > osd client op priority = 1
>> > > > osd recovery max active = 16
>> > > > osd max scrubs = 16
>> > > > The user with the address <admin@xxxxxxxxxxxxxxx> wrote on Tue,
>> > > > 25 Sep 2018 at 13:37:
>> > > > >
>> > > > > Just let it recover.
>> > > > >
>> > > > >   data:
>> > > > >     pools:   1 pools, 4096 pgs
>> > > > >     objects: 8.95 M objects, 17 TiB
>> > > > >     usage:   34 TiB used, 577 TiB / 611 TiB avail
>> > > > >     pgs:     94.873% pgs not active
>> > > > >              48475/17901254 objects degraded (0.271%)
>> > > > >              1/8950627 objects unfound (0.000%)
>> > > > >              2631 peering
>> > > > >              637  activating
>> > > > >              562  down
>> > > > >              159  active+clean
>> > > > >              44 activating+degraded
>> > > > >              30 active+recovery_wait+degraded
>> > > > >              12 activating+undersized+degraded
>> > > > >              10 active+recovering+degraded
>> > > > >              10 active+undersized+degraded
>> > > > >              1 active+clean+scrubbing+deep
>> > > > >
>> > > > > You've got deep-scrubbing PGs, which put considerable IO load on the OSDs.
>> > > > >
>> > > > >
>> > > > > September 25, 2018 1:23 PM, "by morphin"
>> <morphinwithyou@xxxxxxxxx> wrote:
>> > > > >
>> > > > >
>> > > > > > What should I do now?
>> > > > > >
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com










