Luminous: PGs stuck peering after adding nodes with noin

Hi all!

Last week we ran into a terrible situation after adding 4 new nodes to one of our clusters.
To reduce PG movement, we set the noin flag.
Then we deployed the 4 new nodes, so we added about 30% more OSDs with reweight=0.
After that a huge number of PGs, about 20%, got stuck in the peering or activating state.
Please see the ceph -s output below.
The number of peering and activating PGs decreased very slowly, around 5-10 per minute.
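
For reference, the sequence was roughly the following (the deployment step itself is site-specific, so treat it as a sketch):

    ceph osd set noin     # keep newly booted OSDs from being marked "in" automatically
    # ... deploy the 4 new nodes (site-specific tooling) ...
    ceph -s               # shortly afterwards ~20% of PGs were stuck peering/activating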

Unfortunately, we did not collect any useful logs, and there were no errors in the OSD logs at the default log level.

We also noticed strange CPU utilization on the affected OSDs.
Not all OSDs were affected, roughly 1/3 of the OSDs on every host.
Each affected OSD was using 3.5-4 CPU cores,
and each of its 3 messenger threads was using a full 100% of a core.
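
For reference, the per-thread CPU usage can be seen with something like this (the PID is a placeholder for an affected ceph-osd process):

    top -H -p <pid-of-affected-ceph-osd>   # the three msgr-worker-* (async messenger) threads each sat at ~100%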

We were only able to fix that state by restarting all OSDs in the hdd pool.
After the restart, the OSDs finished peering as expected, within several seconds.
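
Concretely, that was just restarting the ceph-osd units one by one (assuming a systemd-managed deployment):

    systemctl restart ceph-osd@<id>    # repeated for every OSD id in the hdd pool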

Everything looks like PG overdose, but:
- We have about 150 PGs/OSD, while mon_max_pg_per_osd=400 and osd_max_pg_per_osd_hard_ratio=4.0
- The "withhold creation of pg" message is logged at log level 0, and there were no such messages in the OSD logs
- The msgr CPU usage is still unexplained
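
For reference, those numbers can be checked with something like this (the daemon commands have to be run on the host of the given OSD; the OSD id is illustrative):

    ceph osd df                                                      # PGS column, ~150 per OSD here
    ceph daemon osd.<id> config get mon_max_pg_per_osd               # 400
    ceph daemon osd.<id> config get osd_max_pg_per_osd_hard_ratio    # 4.0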

A couple of days later we started adding the next batch of nodes.
This time we added them one by one, again with the noin flag.
After the first node was added to the cluster, we got the same ~20% of PGs stuck inactive.
Again, there were no unusual messages in the logs.
We already knew how to fix it, so we restarted the OSDs.
After peering completed and backfilling stabilized, we continued adding the new OSD nodes to the cluster. While that backfilling was in progress, the inactive-PG issue did not reproduce when adding the next 3 nodes.
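
(For completeness: with noin set, newly booted OSDs stay out until they are marked in explicitly or the flag is unset; a minimal sketch, with illustrative OSD ids:)

    ceph osd in <id> [<id> ...]    # bring the new OSDs in gradually
    ceph osd unset noin            # or simply drop the flag once done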

We carry a number of small patches, and we do not want to file a bug before we are sure that the root cause of this issue is not one of our patches.

So if anybody already knows about this kind of issue, please let me know.

What logging would you suggest enabling to see details about the PG lifecycle in the OSD?
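
For example, we could bump something like the following on a few affected OSDs, but it is not obvious which subsystems and levels actually capture the PG peering state transitions:

    ceph tell osd.<id> injectargs '--debug_osd 20 --debug_ms 1'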

Ceph version 12.2.8.


The state we had the second time, after adding a single OSD node:

   cluster:
     id:     ad99506a-05a5-11e8-975e-74d4351a7990
     health: HEALTH_ERR
             noin flag(s) set
             38625525/570124661 objects misplaced (6.775%)
             Reduced data availability: 4637 pgs inactive, 2483 pgs peering
             Degraded data redundancy: 2875/570124661 objects degraded (0.001%), 2 pgs degraded, 1 pg undersized
             26 slow requests are blocked > 5 sec. Implicated osds 312,792
             4199 stuck requests are blocked > 30 sec. Implicated osds 3,4,7,10,12,13,14,21,27,28,29,31,33,35,39,47,48,51,54,55,57,58,59,63,64,67,69,70,71,72,73,74,75,83,85,86,87,92,94,96,100,102,104,107,113,117,118,119,121,125,126,129,130,131,133,136,138,140,141,145,146,148,153,154,155,156,158,160,162,163,164,165,166,168,176,179,182,183,185,187,188,189,192,194,198,199,200,201,203,205,207,208,209,210,213,215,216,220,221,223,224,226,228,230,232,234,235,238,239,240,242,244,245,246,250,252,253,255,256,257,259,261,263,264,267,271,272,273,275,279,282,284,286,288,289,291,292,293,299,300,307,311,318,319,321,323,324,327,329,330,332,333,334,339,341,342,343,345,346,348,352,354,355,356,360,361,363,365,366,367,369,370,372,378,382,384,393,396,398,401,402,404,405,409,411,412,415,416,418,421,428,429,432,434,435,436,438,441,444,446,447,448,449,451,452,453,456,457,458,460,461,462,464,465,466,467,468,469,471,472,474,478,479,480,481,482,483,485,486,487,489,492,494,498,499,503,504,505,506,507,508,509,510,512,513,515,516,517,520,521,522,523,524,527,528,530,531,533,535,536,538,539,541,542,546,549,550,554,555,559,561,562,563,564,565,566,568,571,573,574,578,581,582,583,588,589,590,592,593,594,595,596,597,598,599,602,604,605,606,607,608,609,610,611,612,613,614,617,618,619,620,621,622,624,627,628,630,632,633,634,636,637,638,639,640,642,643,644,645,646,647,648,650,651,652,656,659,660,661,662,663,666,668,669,671,672,673,674,675,676,678,681,682,683,686,687,691,692,694,695,696,697,699,701,704,705,706,707,708,709,712,714,716,717,718,719,720,722,724,727,729,732,733,736,737,738,739,740,741,742,743,745,746,750,751,752,754,755,756,758,759,760,761,762,763,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,793,794,795,796

   services:
     mon: 3 daemons, quorum BC-SR1-4R9-CEPH-MON1,BC-SR1-4R3-CEPH-MON1,BC-SR1-4R6-CEPH-MON1
     mgr: BC-SR1-4R9-CEPH-MON1(active), standbys: BC-SR1-4R3-CEPH-MON1, BC-SR1-4R6-CEPH-MON1
     osd: 828 osds: 828 up, 798 in; 5355 remapped pgs
          flags noin
     rgw: 187 daemons active

   data:
     pools:   14 pools, 21888 pgs
     objects: 53.44M objects, 741TiB
     usage:   1.04PiB used, 5.55PiB / 6.59PiB avail
     pgs:     21.203% pgs not active
              2875/570124661 objects degraded (0.001%)
              38625525/570124661 objects misplaced (6.775%)
              15382 active+clean
              1847  remapped+peering
              1642  activating+remapped
              1244  active+remapped+backfill_wait
              640   peering
              620   active+remapped+backfilling
              511   activating
              1     active+undersized+degraded+remapped+backfilling
              1     activating+degraded

   io:
     client:   715MiB/s rd, 817MiB/s wr, 5.14kop/s rd, 5.74kop/s wr
     recovery: 10.1GiB/s, 688objects/s






--

Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY