Re: PGs stuck activating after adding new OSDs

Oops, sorry about not including the version. Everything is running 12.2.4 on Ubuntu 16.04.

Below is the output of ceph osd df. The OSDs are pretty full, which is why I'm adding a new OSD node. I did have to bump the nearfull ratio up to 0.90 and reweight a few OSDs to bring them a little closer to the average.

ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
  0   ssd 1.74649  1.00000 1788G 15688M 1773G  0.86 0.01  88
  1   ssd 1.74649  1.00000 1788G 16489M 1772G  0.90 0.01  96
  2   ssd 1.74649  1.00000 1788G 17224M 1771G  0.94 0.01  86
  3   ssd 1.74649  1.00000 1788G 16745M 1772G  0.91 0.01 100
  4   ssd 1.74649  1.00000 1788G 17016M 1771G  0.93 0.01 109
  5   ssd 1.74649  1.00000 1788G 15964M 1772G  0.87 0.01 101
  6   ssd 1.74649  1.00000 1788G 15612M 1773G  0.85 0.01  95
  7   ssd 1.74649  1.00000 1788G 16109M 1772G  0.88 0.01  93
  8   hdd 9.09560  1.00000 9313G  7511G 1802G 80.65 1.21 169
  9   hdd 9.09560  1.00000 9313G  7155G 2158G 76.83 1.16 161
 10   hdd 9.09560  1.00000 9313G  7953G 1360G 85.39 1.28 179
 11   hdd 9.09560  0.95000 9313G  7821G 1492G 83.98 1.26 176
 12   hdd 9.09560  1.00000 9313G  7193G 2120G 77.24 1.16 162
 13   hdd 9.09560  1.00000 9313G  8131G 1182G 87.30 1.31 183
 14   hdd 9.09560  1.00000 9313G  7643G 1670G 82.07 1.23 172
 15   hdd 9.09560  1.00000 9313G  7019G 2294G 75.36 1.13 158
 16   hdd 9.09560  1.00000 9313G  7419G 1894G 79.66 1.20 167
 17   hdd 9.09560  1.00000 9313G  7333G 1980G 78.74 1.18 165
 18   hdd 9.09560  1.00000 9313G  7107G 2206G 76.31 1.15 160
 19   hdd 9.09560  1.00000 9313G  7288G 2025G 78.25 1.18 164
 20   hdd 9.09560  1.00000 9313G  8133G 1180G 87.32 1.31 183
 21   hdd 9.09560  1.00000 9313G  7374G 1939G 79.17 1.19 166
 22   hdd 9.09560  1.00000 9313G  7550G 1763G 81.07 1.22 170
 23   hdd 9.09560  1.00000 9313G  7552G 1761G 81.08 1.22 170
 24   hdd 9.09560  1.00000 9313G  7955G 1358G 85.42 1.28 179
 25   hdd 9.09560  1.00000 9313G  7909G 1404G 84.92 1.28 178
 26   hdd 9.09560  1.00000 9313G  7685G 1628G 82.51 1.24 173
 27   hdd 9.09560  1.00000 9313G  7284G 2029G 78.21 1.18 164
 28   hdd 9.09560  1.00000 9313G  7243G 2070G 77.77 1.17 163
 29   hdd 9.09560  1.00000 9313G  7509G 1804G 80.63 1.21 169
 30   hdd 9.09560  1.00000 9313G  7065G 2248G 75.86 1.14 159
 31   hdd 9.09560  1.00000 9313G  7155G 2158G 76.83 1.16 161
 32   hdd 9.09560  1.00000 9313G  6932G 2381G 74.43 1.12 156
 33   hdd 9.09560  1.00000 9313G  6756G 2557G 72.54 1.09 152
 34   hdd 9.09560  1.00000 9313G  7687G 1626G 82.54 1.24 173
 35   hdd 9.09560  1.00000 9313G  6665G 2648G 71.57 1.08 150
 36   hdd 9.09560  1.00000 9313G  7954G 1359G 85.41 1.28 179
 37   hdd 9.09560  1.00000 9313G  7113G 2199G 76.38 1.15 160
 38   hdd 9.09560  1.00000 9313G  7286G 2027G 78.23 1.18 164
 39   hdd 9.09560  1.00000 9313G  7198G 2115G 77.28 1.16 162
 40   hdd 9.09560  1.00000 9313G  7953G 1360G 85.39 1.28 179
 41   hdd 9.09560  1.00000 9313G  6756G 2557G 72.54 1.09 152
 42   hdd 9.09560  1.00000 9313G  7241G 2072G 77.75 1.17 163
 43   hdd 9.09560  1.00000 9313G  7063G 2250G 75.84 1.14 159
 44   hdd 9.09560  1.00000 9313G  7951G 1362G 85.38 1.28 179
 45   hdd 9.09560  1.00000 9313G  6708G 2605G 72.03 1.08 151
 46   hdd 9.09560  1.00000 9313G  7598G 1715G 81.58 1.23 171
 47   hdd 9.09560  1.00000 9313G  7065G 2248G 75.86 1.14 159
 48   hdd 9.09560  1.00000 9313G  7868G 1445G 84.48 1.27 177
 49   hdd 9.09560  1.00000 9313G  7331G 1982G 78.72 1.18 165
 50   hdd 9.09560  1.00000 9313G  7377G 1936G 79.21 1.19 166
 51   hdd 9.09560  1.00000 9313G  7065G 2248G 75.86 1.14 159
 52   hdd 9.09560  1.00000 9313G  8041G 1272G 86.34 1.30 181
 53   hdd 9.09560  1.00000 9313G  7152G 2161G 76.79 1.15 161
 54   hdd 9.09560  1.00000 9313G  7505G 1808G 80.58 1.21 169
 55   hdd 9.09560  1.00000 9313G  7556G 1757G 81.13 1.22 170
 56   hdd 9.09560  1.00000 9313G  6841G 2472G 73.46 1.10 154
 57   hdd 9.09560  1.00000 9313G  7598G 1715G 81.58 1.23 171
 58   hdd 9.09560  1.00000 9313G  7245G 2068G 77.79 1.17 163
 59   hdd 9.09560  1.00000 9313G  7152G 2161G 76.79 1.15 161
 60   hdd 9.09560  1.00000 9313G  7864G 1449G 84.44 1.27 177
 61   hdd 9.09560  1.00000 9313G  6890G 2423G 73.98 1.11 155
 62   hdd 9.09560  1.00000 9313G  6884G 2429G 73.92 1.11 155
 63   hdd 9.09560  1.00000 9313G  7776G 1537G 83.49 1.26 175
 64   hdd 9.09560  1.00000 9313G  7597G 1716G 81.57 1.23 171
 65   hdd 9.09560  1.00000 9313G  6706G 2607G 72.00 1.08 151
 66   hdd 9.09560  0.95000 9313G  7820G 1493G 83.97 1.26 176
 67   hdd 9.09560  0.95000 9313G  8043G 1270G 86.36 1.30 181
 68   hdd 9.09560  1.00000 9313G  7643G 1670G 82.07 1.23 172
 69   hdd 9.09560  1.00000 9313G  6620G 2693G 71.08 1.07 149
 70   hdd 9.09560  1.00000 9313G  7775G 1538G 83.48 1.26 175
 71   hdd 9.09560  1.00000 9313G  7731G 1581G 83.02 1.25 174
 72   hdd 9.09560  1.00000 9313G  7598G 1715G 81.58 1.23 171
 73   hdd 9.09560  1.00000 9313G  6575G 2738G 70.60 1.06 148
 74   hdd 9.09560  1.00000 9313G  7155G 2158G 76.83 1.16 161
 75   hdd 9.09560  1.00000 9313G  6220G 3093G 66.79 1.00 140
 76   hdd 9.09560  1.00000 9313G  6796G 2517G 72.97 1.10 153
 77   hdd 9.09560  1.00000 9313G  7725G 1587G 82.95 1.25 174
 78   hdd 9.09560  1.00000 9313G  7241G 2072G 77.75 1.17 163
 79   hdd 9.09560  1.00000 9313G  7597G 1716G 81.57 1.23 171
 80   hdd 9.09560  1.00000 9313G  7467G 1846G 80.18 1.21 168
 81   hdd 9.09560  1.00000 9313G  7909G 1404G 84.92 1.28 178
 82   hdd 9.09560  1.00000 9313G  7240G 2073G 77.74 1.17 163
 83   hdd 9.09560  1.00000 9313G  7241G 2072G 77.75 1.17 163
 84   hdd 9.09560  1.00000 9313G  7687G 1626G 82.54 1.24 173
 85   hdd 9.09560  1.00000 9313G  7244G 2069G 77.78 1.17 163
 86   hdd 9.09560  1.00000 9313G  7466G 1847G 80.16 1.21 168
 87   hdd 9.09560  1.00000 9313G  7953G 1360G 85.39 1.28 179
 88   hdd 9.09569  1.00000 9313G   144G 9169G  1.56 0.02   3
 89   hdd 9.09569  1.00000 9313G   241G 9072G  2.59 0.04   5
 90   hdd       0  1.00000 9313G  6975M 9307G  0.07 0.00   0
 91   hdd       0  1.00000 9313G  1854M 9312G  0.02    0   0
 92   hdd       0  1.00000 9313G  1837M 9312G  0.02    0   0
 93   hdd       0  1.00000 9313G  2001M 9312G  0.02    0   0
 94   hdd       0  1.00000 9313G  1829M 9312G  0.02    0   0
 95   hdd       0  1.00000 9313G  1807M 9312G  0.02    0   0
 96   hdd       0  1.00000 9313G  1850M 9312G  0.02    0   0
 97   hdd       0  1.00000 9313G  1311M 9312G  0.01    0   0
 98   hdd       0  1.00000 9313G  1287M 9312G  0.01    0   0
 99   hdd       0  1.00000 9313G  1279M 9312G  0.01    0   0
100   hdd       0  1.00000 9313G  1285M 9312G  0.01    0   0
101   hdd       0  1.00000 9313G  1271M 9312G  0.01    0   0
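For reference, adjustments like the nearfull bump and reweights above can be made with commands along these lines (the OSD ID and weight shown are just examples, not the exact values used):

    # raise the nearfull warning threshold to 0.90
    ceph osd set-nearfull-ratio 0.90
    # temporarily reweight an over-full OSD down a little
    ceph osd reweight 11 0.95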


On Tue, Mar 27, 2018 at 2:29 PM, Peter Linder <peter.linder@xxxxxxxxxxxxxx> wrote:

I've had similar issues, but I think your problem might be something else. Could you send the output of "ceph osd df"?

Other people will probably be interested in what version you are using as well.


On 2018-03-27 at 20:07, Jon Light wrote:
Hi all,

I'm adding a new OSD node with 36 OSDs to my cluster and have run into some problems. Here are some of the details of the cluster:

1 OSD node with 80 OSDs
1 EC pool with k=10, m=3
pg_num 1024
osd failure domain
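(For context, an EC pool matching those details would typically have been set up with something like the following; the profile and pool names here are placeholders, not taken from this cluster:)

    # 10+3 erasure profile on hdd-class OSDs, failure domain = osd
    ceph osd erasure-code-profile set ec-10-3 k=10 m=3 crush-failure-domain=osd crush-device-class=hdd
    # 1024 PGs, using that profile
    ceph osd pool create main-storage 1024 1024 erasure ec-10-3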

I added a second OSD node and started creating OSDs with ceph-deploy, one by one. The first two went in fine, but each subsequent OSD left more and more PGs stuck activating. I've added a total of 14 new OSDs, but had to set 12 of them to a weight of 0 to keep the cluster healthy and usable until this is fixed.
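(Zeroing an OSD as described above is normally done against its CRUSH weight; the OSD ID below is just an example:)

    # take osd.90 out of data placement without removing it
    ceph osd crush reweight osd.90 0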

I have read some things about similar behavior caused by PG overdose protection, but I don't think that's the case here because the failure domain is set to osd. Instead, I think my CRUSH rule needs some attention:

rule main-storage {
        id 1
        type erasure
        min_size 3
        max_size 13                      # k+m = 13 for the 10+3 profile
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd      # only hdd-class OSDs
        step choose indep 0 type osd     # 0 = pick pool-size (13) OSDs
        step emit
}

I don't believe I have modified anything from the automatically generated rule except for the addition of the hdd class.
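(On the PG overdose protection theory mentioned above: in 12.2.x the limit comes from mon_max_pg_per_osd, 200 by default, together with osd_max_pg_per_osd_hard_ratio. A quick way to see the values actually in effect, substituting your own daemon names:)

    # on a monitor host; "mon.a" is a placeholder
    ceph daemon mon.a config get mon_max_pg_per_osd
    # on an OSD host; osd.0 is just an example
    ceph daemon osd.0 config get osd_max_pg_per_osd_hard_ratio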

I have been reading the documentation on CRUSH rules, but am having trouble figuring out if the rule is set up properly. After a few more nodes are added I do want to change the failure domain to host, but osd is sufficient for now.
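(One way to check whether the rule can actually place all 13 chunks is to run it through crushtool offline; the file names are placeholders, and the rule id and num-rep values are taken from the rule above:)

    # dump the in-use CRUSH map and test the rule against it
    ceph osd getcrushmap -o crush.bin
    crushtool -i crush.bin --test --rule 1 --num-rep 13 --show-bad-mappings
    # a human-readable copy, if you want to inspect it
    crushtool -d crush.bin -o crush.txt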

Can anyone help out to see if the rule is causing the problems or if I should be looking at something else?


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
