Hi Denny,
the recommendation for Ceph maintenance [1] is to set three flags if you
need to shut down a node (or the entire cluster):
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
Although the 'noout' flag seems to be enough for many maintenance
tasks, it doesn't prevent the cluster from rebalancing: some PGs become
degraded while the node is down, and the cluster tries to recover them.
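A minimal sketch of the full cycle (the unset commands and the flag
check are just the obvious counterparts, not taken from [1]):

ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
# the flags show up in the health section while they are active
ceph -s | grep -i flag
# ... do the maintenance, bring the node back, wait until all OSDs are up ...
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset noout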
Of course there are also scenarios where something could go wrong
during the maintenance and a recovery would be required, but there's
always some kind of tradeoff, I guess.
Until now we had no problems with maintenance when these flags were
set, and also no stuck I/O for the clients.
> We never saw this in the past, with the same procedure ...
If this has always worked for you and you suddenly encounter these
problems for the first time, there could be something else going on, of
course, but I just wanted to point out the importance of those OSD
flags.
Regards,
Eugen
[1] https://ceph.com/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/
Quoting Denny Fuchs <linuxmail@xxxxxxxx>:
Hi,
today we had an issue with our 6 node Ceph cluster.
We had to shut down one node (Ceph-03) to replace a disk (because
we did not know the slot). We set the noout flag and did a graceful
shutdown. All was O.K. After the disk was replaced, the node came
back up and our VMs saw very high I/O latency.
We never saw this in the past, with the same procedure ...
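For clarity, the graceful shutdown amounted to something like this
sketch (ceph-osd.target assumes a systemd-managed deployment; the exact
commands may differ):

ceph osd set noout
# on qh-a07-ceph-osd-03: stop the OSD daemons, then power the node off
systemctl stop ceph-osd.target
poweroff
# after the disk is replaced and the node has rejoined the cluster
ceph osd unset noout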
* From our logs on Ceph-01:
2018-10-18 15:53:45.455743 mon.qh-a07-ceph-osd-03 mon.2
10.3.0.3:6789/0 1 : cluster [INF] mon.qh-a07-ceph-osd-03 calling
monitor election
...
2018-10-18 15:53:45.503818 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663050 : cluster [INF] mon.qh-a07-ceph-osd-01 is
new leader, mons
qh-a07-ceph-osd-01,qh-a07-ceph-osd-02,qh-a07-ceph-osd-03,qh-a07-ceph-osd-04,qh-a07-ceph-osd-05,qh
* First OSD came up:
2018-10-18 15:53:55.207742 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663063 : cluster [WRN] Health check update: 10 osds
down (OSD_DOWN)
2018-10-18 15:53:55.207768 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663064 : cluster [INF] Health check cleared:
OSD_HOST_DOWN (was: 1 host (11 osds) down)
2018-10-18 15:53:55.240079 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663065 : cluster [INF] osd.43 10.3.0.3:6812/7554 boot
* All OSDs were up:
2018-10-18 15:54:25.331692 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663134 : cluster [INF] Health check cleared:
OSD_DOWN (was: 1 osds down)
2018-10-18 15:54:25.360151 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663135 : cluster [INF] osd.12 10.3.0.3:6820/8537 boot
* These OSDs are a mix of HDD and SSD across different nodes:
2018-10-18 15:54:27.073266 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663138 : cluster [WRN] Health check update:
Degraded data redundancy: 84012/4293867 objects degraded (1.957%),
1316 pgs degraded, 487 pgs undersized (PG_DEGRADED)
2018-10-18 15:54:32.073644 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663142 : cluster [WRN] Health check update:
Degraded data redundancy: 4611/4293867 objects degraded (0.107%),
1219 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:36.841189 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663144 : cluster [WRN] Health check failed: 1 slow
requests are blocked > 32 sec. Implicated osds 16 (REQUEST_SLOW)
2018-10-18 15:54:37.074098 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663145 : cluster [WRN] Health check update:
Degraded data redundancy: 4541/4293867 objects degraded (0.106%),
1216 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074510 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663149 : cluster [WRN] Health check update:
Degraded data redundancy: 4364/4293867 objects degraded (0.102%),
1176 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:42.074561 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663150 : cluster [WRN] Health check update: 5 slow
requests are blocked > 32 sec. Implicated osds 15,25,30,34
(REQUEST_SLOW)
2018-10-18 15:54:47.074886 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663152 : cluster [WRN] Health check update:
Degraded data redundancy: 4193/4293867 objects degraded (0.098%),
1140 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:47.074934 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663153 : cluster [WRN] Health check update: 5 slow
requests are blocked > 32 sec. Implicated osds 9,15,23,30
(REQUEST_SLOW)
2018-10-18 15:54:52.075274 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663156 : cluster [WRN] Health check update:
Degraded data redundancy: 4087/4293867 objects degraded (0.095%),
1120 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:52.075313 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663157 : cluster [WRN] Health check update: 14 slow
requests are blocked > 32 sec. Implicated osds 2,13,14,15,16,23
(REQUEST_SLOW)
2018-10-18 15:54:57.075635 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663158 : cluster [WRN] Health check update:
Degraded data redundancy: 3932/4293867 objects degraded (0.092%),
1074 pgs degraded (PG_DEGRADED)
2018-10-18 15:54:57.075683 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663159 : cluster [WRN] Health check update: 6 slow
requests are blocked > 32 sec. Implicated osds 14,15 (REQUEST_SLOW)
2018-10-18 15:55:02.076071 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663163 : cluster [WRN] Health check update:
Degraded data redundancy: 3805/4293867 objects degraded (0.089%),
1036 pgs degraded (PG_DEGRADED)
2018-10-18 15:55:02.076138 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663164 : cluster [WRN] Health check update: 4 slow
requests are blocked > 32 sec. Implicated osds 1,15,19 (REQUEST_SLOW)
2018-10-18 15:55:07.076562 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663168 : cluster [WRN] Health check update:
Degraded data redundancy: 3656/4293867 objects degraded (0.085%),
988 pgs degraded (PG_DEGRADED)
2018-10-18 15:55:07.076633 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663169 : cluster [WRN] Health check update: 4 slow
requests are blocked > 32 sec. Implicated osds 15,22 (REQUEST_SLOW)
2018-10-18 15:55:12.077091 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663174 : cluster [WRN] Health check update:
Degraded data redundancy: 3509/4293867 objects degraded (0.082%),
934 pgs degraded (PG_DEGRADED)
2018-10-18 15:55:12.077130 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663175 : cluster [WRN] Health check update: 1 slow
requests are blocked > 32 sec. Implicated osds 15 (REQUEST_SLOW)
2018-10-18 15:55:13.317758 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663178 : cluster [INF] Health check cleared:
REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec. Implicated
osds 15)
....
2018-10-18 15:55:17.077685 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663180 : cluster [WRN] Health check update:
Degraded data redundancy: 3351/4293867 objects degraded (0.078%),
886 pgs degraded, 1 pg undersized (PG_DEGRADED)
2018-10-18 15:55:22.078077 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663184 : cluster [WRN] Health check update:
Degraded data redundancy: 3202/4293867 objects degraded (0.075%),
831 pgs degraded, 3 pgs undersized (PG_DEGRADED)
* And then, everything went back to normal:
2018-10-18 16:00:22.876849 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663322 : cluster [WRN] Health check failed: Reduced
data availability: 4 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2018-10-18 16:00:24.898816 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663326 : cluster [WRN] Health check failed:
102307/4289049 objects misplaced (2.385%) (OBJECT_MISPLACED)
2018-10-18 16:00:28.995821 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663327 : cluster [INF] Health check cleared:
PG_AVAILABILITY (was: Reduced data availability: 3 pgs peering)
2018-10-18 16:00:31.026616 mon.qh-a07-ceph-osd-01 mon.0
10.3.0.1:6789/0 1663330 : cluster [WRN] Health check update:
156337/4294191 objects misplaced (3.641%) (OBJECT_MISPLACED)
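(For the record: while the REQUEST_SLOW warnings were active, the
implicated OSDs could have been inspected via their admin socket on the
hosting node; osd.15 is only used here as an example because it shows
up repeatedly above.)

ceph health detail
ceph daemon osd.15 dump_ops_in_flight
ceph daemon osd.15 dump_historic_ops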
A few minutes later we added the replaced disks again (on ceph-osd-06
and on ceph-osd-03), without issues. As to why the stuck I/O started as
soon as the node was back in the cluster (15:53:45 ...), we have no
clue ... it has never happened in the past.
We have per OSD node:
* 2 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
* 64GB DDR4 ECC
* HDD: WD Red 1TB 2.5" with Intel NVMe SSDPEDMD400G4 as journal device
* SSD: Samsung 950 EVO 512GB
* 2 x Symbios Logic SAS3224 PCI-Express Fusion-MPT SAS-3 (one for
SSD, one for HDD)
* Mellanox 100Gb Ceph network (CX-4)
* 6 Mon (on Ceph nodes)
Ceph.conf
[global]
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd max backfills = 20
osd recovery max active = 8
No special CRUSH rule, just the HDD / SSD one.
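For reference, the backfill / recovery settings from the [osd] section
above can also be changed at runtime via injectargs; the values below
are only placeholders to show the syntax, not a recommendation:

ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'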
* ceph osd pool get ssd-pool pg_num
pg_num: 2048
* root@qh-a07-ceph-osd-01:[~]: ceph osd pool get sata-pool pg_num
pg_num: 1024
* root@qh-a07-ceph-osd-01:[~]: ceph osd pool get db-pool pg_num
pg_num: 1024
* ceph -s
data:
pools: 3 pools, 4096 pgs
objects: 1.43M objects, 5.39TiB
usage: 15.7TiB used, 33.4TiB / 49.1TiB avail
pgs: 4096 active+clean
Any suggestions to avoid these stuck I/Os next time?
cu denny
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com