Re: Fwd: [Ceph-community] After Mimic upgrade OSD's stuck at booting.

Hey, don't lose hope. I just went through two 3-5 day outages after a Mimic upgrade with no data loss. I'd recommend looking through the thread about it to see how close it is to your issue; from my point of view there seem to be some similarities: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029649.html

At a similar point of desperation with my cluster, I would shut all Ceph processes down and bring them up in order. Doing this got my cluster almost healthy a few times, until it fell over again due to mon issues, so solving any mon issues is the first priority. It also seems like you may benefit from setting mon_osd_cache_size to a very large number, if you have enough memory on your mon servers.
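As a minimal sketch of what that could look like (the value below is only an assumed example; mon_osd_cache_size is the number of osdmaps the mon keeps cached, so size it to the RAM actually available on your mon hosts):

[mon]
# assumed example value - tune to the memory available on the mon hosts
mon osd cache size = 200000

Then restart the mons one at a time so quorum is never lost.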

I'll hop on the irc today.

Kevin

On 09/25/2018 05:53 PM, by morphin wrote:
After trying so many things, with a lot of help on IRC, my pool
health is still in ERROR and I think I can't recover from this.
https://paste.ubuntu.com/p/HbsFnfkYDT/
In the end, 2 of the 3 mons crashed and started at the same time, and the pool
went offline. Recovery takes more than 12 hours and is way too slow.
Somehow recovery does not seem to be working.

If I can reach my data, I can re-create the pool easily.
If I run the ceph-objectstore-tool procedure to regenerate the mon store.db, can I
access the RBD pool again?
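For reference, the mon-store rebuild from OSDs looks roughly like the sketch below (based on the Ceph disaster-recovery docs; the scratch directory, OSD data paths, and keyring location are assumptions for this example, the OSDs have to be stopped while their stores are read, and this only rebuilds the cluster maps, it does not touch the data in the pool):

ms=/root/mon-store            # scratch directory for the rebuilt store (assumed path)
mkdir -p $ms
# gather the cluster maps from every OSD on this host
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path $osd --op update-mon-db --mon-store-path $ms
done
# rebuild the mon store.db using a keyring that contains the admin and mon keys
ceph-monstore-tool $ms rebuild -- --keyring /path/to/admin.keyring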
by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 20:03:
Hi,

Cluster is still down :(

Up to now we have managed to stabilize the OSDs. 118 of the 160 OSDs are
stable and the cluster is still in the process of settling. Thanks to
Be-El in the Ceph IRC channel, who helped a lot to make the
flapping OSDs stable.

What we have learned so far is that the cause of this was the unexpected
death of 2 of the 3 monitor servers, and that when they come back, if they
are not started one by one (each one only after the previous has joined the
cluster), this can happen: the cluster can become unhealthy and it can take
countless hours to come back.

Right now here is our status:
ceph -s : https://paste.ubuntu.com/p/6DbgqnGS7t/
health detail: https://paste.ubuntu.com/p/w4gccnqZjR/

Since the OSD disks are NL-SAS, recovery can take up to 24 hours even on an
online cluster. What is more, it has been said that we would be extremely
lucky if all the data is rescued.

Most unhappily, our strategy for now is just to sit and wait :(. As soon as the
peering and activating count drops to 300-500 PGs, we will restart the
stopped OSDs one by one, waiting for the cluster to settle down after each
OSD (a rough sketch of such a loop is below). The amount of data stored on
the OSDs is 33 TB. Our main concern is to export our RBD pool data to a
backup space outside the cluster; then we will start again with a clean one.
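A rough sketch of that restart-and-wait loop (assuming systemd-managed OSDs; the OSD ids and the 500-PG threshold are placeholders for this example):

for id in 12 37 95; do                        # hypothetical ids of the stopped OSDs
    systemctl start ceph-osd@$id
    # wait until the number of peering/activating PGs drops again before starting the next one
    while [ "$(ceph pg dump pgs_brief 2>/dev/null | grep -cE 'peering|activating')" -gt 500 ]; do
        sleep 30
    done
done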

I would like to validate our analysis with an expert. Any help or advice
would be greatly appreciated.
by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 15:08:
Reducing the recovery parameter values did not change much.
There are still a lot of OSDs marked down.

I don't know what I need to do after this point.

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1
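For what it's worth, the throttling values can also be pushed to the running OSDs without a restart (a sketch; injectargs changes are not persistent across daemon restarts):

ceph tell osd.* injectargs '--osd_recovery_max_active=1 --osd_max_scrubs=1'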


ceph -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            42 osds down
            1 host (6 osds) down
            61/8948582 objects unfound (0.001%)
            Reduced data availability: 3837 pgs inactive, 1822 pgs
down, 1900 pgs peering, 6 pgs stale
            Possible data damage: 18 pgs recovery_unfound
            Degraded data redundancy: 457246/17897164 objects degraded
(2.555%), 213 pgs degraded, 209 pgs undersized
            2554 slow requests are blocked > 32 sec
            3273 slow ops, oldest one blocked for 1453 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3,
SRV-SEKUARK4
    osd: 168 osds: 118 up, 160 in

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   33 TiB used, 553 TiB / 586 TiB avail
    pgs:     93.677% pgs not active
             457246/17897164 objects degraded (2.555%)
             61/8948582 objects unfound (0.001%)
             1676 down
             1372 peering
             528  stale+peering
             164  active+undersized+degraded
             145  stale+down
             73   activating
             40   active+clean
             29   stale+activating
             17   active+recovery_unfound+undersized+degraded
             16   stale+active+clean
             16   stale+active+undersized+degraded
             9    activating+undersized+degraded
             3    active+recovery_wait+degraded
             2    activating+undersized
             2    activating+degraded
             1    creating+down
             1    stale+active+recovery_unfound+undersized+degraded
             1    stale+active+clean+scrubbing+deep
             1    stale+active+recovery_wait+degraded

ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/
by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 14:32:
The config didn't work; increasing the numbers just led to more OSD drops.

ceph -s
  cluster:
    id:     89569e73-eb89-41a4-9fc9-d2a5ec5f4106
    health: HEALTH_ERR
            norebalance,norecover flag(s) set
            1 osds down
            17/8839434 objects unfound (0.000%)
            Reduced data availability: 3578 pgs inactive, 861 pgs
down, 1928 pgs peering, 11 pgs stale
            Degraded data redundancy: 44853/17678868 objects degraded
(0.254%), 221 pgs degraded, 20 pgs undersized
            610 slow requests are blocked > 32 sec
            3996 stuck requests are blocked > 4096 sec
            6076 slow ops, oldest one blocked for 4129 sec, daemons
[osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]...
have slow ops.

  services:
    mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
    mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
    osd: 168 osds: 128 up, 129 in; 2 remapped pgs
         flags norebalance,norecover

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.84 M objects, 17 TiB
    usage:   26 TiB used, 450 TiB / 477 TiB avail
    pgs:     0.024% pgs unknown
             89.160% pgs not active
             44853/17678868 objects degraded (0.254%)
             17/8839434 objects unfound (0.000%)
             1612 peering
             720  down
             583  activating
             319  stale+peering
             255  active+clean
             157  stale+activating
             108  stale+down
             95   activating+degraded
             84   stale+active+clean
             50   active+recovery_wait+degraded
             29   creating+down
             23   stale+activating+degraded
             18   stale+active+recovery_wait+degraded
             14   active+undersized+degraded
             12   active+recovering+degraded
             4    stale+creating+down
             3    stale+active+recovering+degraded
             3    stale+active+undersized+degraded
             2    stale
             1    active+recovery_wait+undersized+degraded
             1    active+clean+scrubbing+deep
             1    unknown
             1    active+undersized+degraded+remapped+backfilling
             1    active+recovering+undersized+degraded

I guess the OSD down/drop issue increases the recovery time, so I
decided to try decreasing the recovery parameters to put less load on the
cluster.
I have NVMe and SAS disks, the servers are powerful enough, and the network is 4x10Gb.
I don't think my cluster is in bad shape, because I have datacenter
redundancy (14 servers + 14 servers). The 7 crashed servers are all in
datacenter A, and it took only a few minutes to bring them back online. Also,
2 of them are monitors, so cluster I/O should have been suspended and there
should be less data difference.

On the other hand, I don't understand the burden of this recovery. I have
been through many recoveries, but none of them stopped my cluster from working. This
recovery load is so high that it hasn't stopped for hours. I wish I
could just decrease the recovery speed and continue serving my VMs.
Is the recovery load somehow different in Mimic?
Luminous was pretty fine indeed.
by morphin <morphinwithyou@xxxxxxxxx> wrote on Tue, 25 Sep 2018 at 13:57:
Thank you for the answer.

What do you think of this conf to speed up the recovery?

[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 16
osd max scrubs = 16
The user with the address <admin@xxxxxxxxxxxxxxx> wrote on Tue, 25 Sep 2018 at
13:37:
Just let it recover.

  data:
    pools:   1 pools, 4096 pgs
    objects: 8.95 M objects, 17 TiB
    usage:   34 TiB used, 577 TiB / 611 TiB avail
    pgs:     94.873% pgs not active
             48475/17901254 objects degraded (0.271%)
             1/8950627 objects unfound (0.000%)
             2631 peering
             637  activating
             562  down
             159  active+clean
             44   activating+degraded
             30   active+recovery_wait+degraded
             12   activating+undersized+degraded
             10   active+recovering+degraded
             10   active+undersized+degraded
             1    active+clean+scrubbing+deep

You've got PGs being deep scrubbed, which puts considerable I/O load on the OSDs.
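If the scrub load is a problem while things recover, scrubbing can be paused temporarily with the cluster flags (a sketch; remember to unset them once the cluster is healthy again):

ceph osd set noscrub
ceph osd set nodeep-scrub
# ... later, once recovery has finished:
ceph osd unset noscrub
ceph osd unset nodeep-scrub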


September 25, 2018 1:23 PM, "by morphin" <morphinwithyou@xxxxxxxxx> wrote:


What should I do now?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

