Re: 12.2.8: 1 node comes up (noout set), from a 6 nodes cluster -> I/O stuck (rbd usage)


 



> No, you do not need to set nobackfill and norecover if you only shut
> down one server. The guide you are referencing is about shutting down
> everything.
>
> It will not recover degraded PGs if you shut down one server with noout.

You are right; I must have mixed something up in my memory regarding the recovery I/O and the flags. Thanks for the clarification.

Eugen


Quoting Paul Emmerich <paul.emmerich@xxxxxxxx>:

No, you do not need to set nobackfill and norecover if you only shut
down one server. The guide you are referencing is about shutting down
everything.

It will not recover degraded PGs if you shut down one server with noout.
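
You can check at any time which flags are currently set, for example:

ceph osd dump | grep flags

('ceph -s' also lists any set flags in its health section.)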

Paul
On Fri, Oct 19, 2018 at 11:37 AM Eugen Block <eblock@xxxxxx> wrote:

Hi Denny,

the recommendation for Ceph maintenance [1] is to set three flags if
you need to shut down a node (or the entire cluster):

ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
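
Once the maintenance is done, the flags should be unset again, e.g.:

ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout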

Although the 'noout' flag seems to be enough for many maintenance
tasks, it doesn't prevent the cluster from rebalancing, because some
PGs are degraded and the cluster tries to recover.
Of course there are also scenarios where something could go wrong
during the maintenance and a recovery would be required, but there's
always some kind of tradeoff, I guess.
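
If you want to check whether any recovery kicks in while a node is
down, something like the following should show degraded PGs and any
recovery/backfill activity:

ceph -s
ceph health detail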

So far we have had no problems with maintenance when these flags were
set, and no stuck I/O for the clients either.

> We never saw this in the past, with the same procedure ...

If this has always worked for you and you suddenly encounter these
problems for the first time, there could of course be something else
going on, but I just wanted to point out the importance of those OSD
flags.

Regards,
Eugen

[1] https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/


Quoting Denny Fuchs <linuxmail@xxxxxxxx>:

> Hi,
>
> today we had an issue with our 6-node Ceph cluster.
>
> We had to shut down one node (Ceph-03) to replace a disk (because
> we did not know the slot). We set the noout flag and did a graceful
> shutdown. All was O.K. After the disk was replaced, the node came
> up and our VMs had very high I/O latency.
> We never saw this in the past, with the same procedure ...
>
> * From our logs on Ceph-01:
>
> 2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2
> 10.3.0.3:6789/0 1 : cluster [INF] mon.qh-a07-ceph-osd-03 calling
> monitor election
> ...
> 2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663050 : cluster [INF] mon.qh-a07-ceph-osd-01 is
> new leader, mons
> qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh
>
> * The first OSD came up:
>
> 2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663063 : cluster [WRN] Health check update: 10 osds
> down (OSD_DOWN)
> 2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663064 : cluster [INF] Health check cleared:
> OSD_HOST_DOWN (was: 1 host (11 osds) down)
> 2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663065 : cluster [INF] osd.43 10.3.0.3:6812/7554 boot
>
> * All OSDs were up:
>
> 2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663134 : cluster [INF] Health check cleared:
> OSD_DOWN (was: 1 osds down)
> 2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663135 : cluster [INF] osd.12 10.3.0.3:6820/8537 boot
>
> * These OSDs are a mix of HDD and SSD on different nodes:
>
> 2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663138 : cluster [WRN] Health check update:
> Degraded data redundancy: 84012/4293867 objects degraded (1.957%),
> 1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
> 2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663142 : cluster [WRN] Health check update:
> Degraded data redundancy: 4611/4293867 objects degraded (0.107%),
> 1219 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663144 : cluster [WRN] Health check failed: 1 slow
> requests are blocked > 32 sec. Implicated osds 16 (REQUEST_SLOW)
> 2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663145 : cluster [WRN] Health check update:
> Degraded data redundancy: 4541/4293867 objects degraded (0.106%),
> 1216 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663149 : cluster [WRN] Health check update:
> Degraded data redundancy: 4364/4293867 objects degraded (0.102%),
> 1176 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663150 : cluster [WRN] Health check update: 5 slow
> requests are blocked > 32 sec. Implicated osds 15,25,30,34
> (REQUEST_SLOW)
> 2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663152 : cluster [WRN] Health check update:
> Degraded data redundancy: 4193/4293867 objects degraded (0.098%),
> 1140 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:47.074934 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663153 : cluster [WRN] Health check update: 5 slow
> requests are blocked > 32 sec. Implicated osds 9,15,23,30
> (REQUEST_SLOW)
> 2018-10-18 15:54:52.075274 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663156 : cluster [WRN] Health check update:
> Degraded data redundancy: 4087/4293867 objects degraded (0.095%),
> 1120 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:52.075313 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663157 : cluster [WRN] Health check update: 14 slow
> requests are blocked > 32 sec. Implicated osds 2,13,14,15,16,23
> (REQUEST_SLOW)
> 2018-10-18 15:54:57.075635 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663158 : cluster [WRN] Health check update:
> Degraded data redundancy: 3932/4293867 objects degraded (0.092%),
> 1074 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:54:57.075683 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663159 : cluster [WRN] Health check update: 6 slow
> requests are blocked > 32 sec. Implicated osds 14,15 (REQUEST_SLOW)
> 2018-10-18 15:55:02.076071 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663163 : cluster [WRN] Health check update:
> Degraded data redundancy: 3805/4293867 objects degraded (0.089%),
> 1036 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:02.076138 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663164 : cluster [WRN] Health check update: 4 slow
> requests are blocked > 32 sec. Implicated osds 1,15,19 (REQUEST_SLOW)
> 2018-10-18 15:55:07.076562 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663168 : cluster [WRN] Health check update:
> Degraded data redundancy: 3656/4293867 objects degraded (0.085%),
> 988 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:07.076633 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663169 : cluster [WRN] Health check update: 4 slow
> requests are blocked > 32 sec. Implicated osds 15,22 (REQUEST_SLOW)
> 2018-10-18 15:55:12.077091 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663174 : cluster [WRN] Health check update:
> Degraded data redundancy: 3509/4293867 objects degraded (0.082%),
> 934 pgs degraded (PG_DEGRADED)
> 2018-10-18 15:55:12.077130 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663175 : cluster [WRN] Health check update: 1 slow
> requests are blocked > 32 sec. Implicated osds 15 (REQUEST_SLOW)
> 2018-10-18 15:55:13.317758 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663178 : cluster [INF] Health check cleared:
> REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec. Implicated
> osds 15)
>
> ....
>
> 2018-10-18 15:55:17.077685 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663180 : cluster [WRN] Health check update:
> Degraded data redundancy: 3351/4293867 objects degraded (0.078%),
> 886 pgs degraded, 1 pg undersized (PG_DEGRADED)
> 2018-10-18 15:55:22.078077 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663184 : cluster [WRN] Health check update:
> Degraded data redundancy: 3202/4293867 objects degraded (0.075%),
> 831 pgs degraded, 3 pgs undersized (PG_DEGRADED)
>
> * And then everything went back to normal:
>
> 2018-10-18 16:00:22.876849 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663322 : cluster [WRN] Health check failed: Reduced
> data availability: 4 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
> 2018-10-18 16:00:24.898816 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663326 : cluster [WRN] Health check failed:
> 102307/4289049 objects misplaced (2.385%) (OBJECT_MISPLACED)
> 2018-10-18 16:00:28.995821 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663327 : cluster [INF] Health check cleared:
> PG_AVAILABILITY (was: Reduced data availability: 3 pgs peering)
> 2018-10-18 16:00:31.026616 mon.qh-a07-ceph-osd-01 mon.0
> 10.3.0.1:6789/0 1663330 : cluster [WRN] Health check update:
> 156337/4294191 objects misplaced (3.641%) (OBJECT_MISPLACED)
>
>
> A few minutes later we added the replaced disks again (on ceph-osd-06
> and on ceph-osd-03) without issues. As for the stuck I/O that started
> as soon as the node was back in the cluster (15:53:45 ...): we have no
> clue, and it never happened in the past.
>
> We have per OSD node:
> * 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> * 64GB DDR4 ECC
> * HDD: WD Red 1TB 2.5" with Intel NVMe SSDPEDMD400G4 as journal device
> * SSD: Samsung 950 EVO 512GB
> * 2 x Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 (one for
> SSD, one for HDD)
> * Mellanox 100GB Ceph network (CX-4)
> * 6 Mon (on Ceph nodes)
>
>
> Ceph.conf
>
> [global]
>        osd journal size = 5120
>        osd pool default min size = 2
>        osd pool default size = 3
>
> [osd]
>     keyring = /var/lib/ceph/osd/ceph-$id/keyring
>     osd max backfills = 20
>     osd recovery max active = 8
>
> No special CRUSH rule, just the HDD / SSD one.
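>
> (For reference, the effective values of the [osd] settings above can
> be checked at runtime via the admin socket, e.g. for osd.0:
>
> ceph daemon osd.0 config get osd_max_backfills
> ceph daemon osd.0 config get osd_recovery_max_active
> )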
>
>
> * ceph osd pool get ssd-pool pg_num
> pg_num: 2048
>
> * root@qh-a07-ceph-osd-01:[~]: ceph osd pool get sata-pool pg_num
> pg_num: 1024
>
> * root@qh-a07-ceph-osd-01:[~]: ceph osd pool get db-pool pg_num
> pg_num: 1024
>
> * ceph -s
>
> data:
>     pools:   3 pools, 4096 pgs
>     objects: 1.43M objects, 5.39TiB
>     usage:   15.7TiB used, 33.4TiB / 49.1TiB avail
>     pgs:     4096 active+clean
>
>
> Any suggestions to avoid this stuck I/O next time?
>
> cu denny






--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



