Slow requests and degraded cluster, but not really?

Hello all,

We have an issue with our Ceph cluster: 'ceph -s' reports several blocked requests, and 'ceph health detail' lists a number of PGs as stuck undersized. However, querying those PGs individually with 'ceph pg <pgid> query' shows that each one is either active+clean or does not currently exist. OSD 32, the only OSD implicated, appears to be working fine, and the cluster is performing as expected with no clients seemingly affected.

Note: we have just upgraded to Luminous. Despite having "mon max pg per osd = 400" set in ceph.conf, we still get the warning "too many PGs per OSD (278 > max 200)".
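
As a rough cross-check of the 278 figure, the arithmetic below uses the counts from the 'ceph -s' output further down in this message. It assumes, without confirmation from the cluster, that every pool has size 3, that each undersized PG currently has 2 of its 3 replicas placed, and that the warning is based on placed PG replicas divided by the number of OSDs. Under those assumptions the result matches the reported value, but treat it as a sketch rather than a statement of how Ceph computes the number.

```shell
# Assumptions (not stated in the post): all pools are size 3 and each
# undersized PG is missing exactly one replica.
pgs=3440          # "24 pools, 3440 pgs"
undersized=279    # 170 active+undersized + 109 active+undersized+degraded
osds=36           # "36 osds: 36 up, 36 in"
size=3            # assumed replica count

placed=$(( pgs * size - undersized ))   # PG replicas currently placed
per_osd=$(( placed / osds ))            # integer division
echo "$per_osd PG replicas per OSD"
```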

To reduce the number of PGs per OSD, I removed several pools that were no longer in use. I assume the PGs that Ceph can no longer find are related to this pool deletion.

Does anyone have any ideas on how to get out of this state?

Details are below, and the full 'ceph health detail' output is attached to this email.

Kind regards,

Ben Morrice

[root@ceph03 ~]# ceph -s
  cluster:
    id:     6c21c4ba-9c4d-46ef-93a3-441b8055cdc6
    health: HEALTH_WARN
            Degraded data redundancy: 443765/14311983 objects degraded (3.101%), 162 pgs degraded, 241 pgs undersized
            75 slow requests are blocked > 32 sec. Implicated osds 32
            too many PGs per OSD (278 > max 200)

  services:
    mon: 5 daemons, quorum bbpocn01,bbpocn02,bbpocn03,bbpocn04,bbpocn07
    mgr: bbpocn03(active, starting)
    osd: 36 osds: 36 up, 36 in
    rgw: 1 daemon active

  data:
    pools:   24 pools, 3440 pgs
    objects: 4.77M objects, 7.69TiB
    usage:   23.1TiB used, 104TiB / 127TiB avail
    pgs:     443765/14311983 objects degraded (3.101%)
             3107 active+clean
             170  active+undersized
             109  active+undersized+degraded
             43   active+recovery_wait+degraded
             10   active+recovering+degraded
             1    active+recovery_wait
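
The per-state counts in the 'pgs:' section can be tallied to compare them with the health summary. The sketch below copies the six state lines from the output above into a variable; the helper name count_flag is made up for illustration.

```shell
# PG state counts copied from the "pgs:" section of 'ceph -s' above.
pg_states='3107 active+clean
170 active+undersized
109 active+undersized+degraded
43 active+recovery_wait+degraded
10 active+recovering+degraded
1 active+recovery_wait'

# Sum the counts of every state whose "+"-separated name contains the flag.
count_flag() {
    printf '%s\n' "$pg_states" | awk -v flag="$1" '{
        n = split($2, parts, "+")
        for (i = 1; i <= n; i++) if (parts[i] == flag) { total += $1; break }
    } END { print total + 0 }'
}

total=$(printf '%s\n' "$pg_states" | awk '{ t += $1 } END { print t }')
degraded=$(count_flag degraded)
undersized=$(count_flag undersized)
echo "total=$total degraded=$degraded undersized=$undersized"
```

The total (3440) and the degraded tally (162) agree with the summary lines. The undersized tally from this snapshot is 279, while the health line reports 241; the output alone does not explain the difference.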

[root@ceph03 ~]# for i in `ceph health detail |grep stuck | awk '{print $2}'`; do echo -n "$i: " ; ceph pg $i query -f plain | cut -d: -f2 | cut -d\" -f2; done
150.270: active+clean
150.2a0: active+clean
150.2b6: active+clean
150.2c2: active+clean
150.2cc: active+clean
150.2d5: active+clean
150.2d6: active+clean
150.2e1: active+clean
150.2ef: active+clean
150.2f5: active+clean
150.2f7: active+clean
150.2fc: active+clean
150.315: active+clean
150.318: active+clean
150.31a: active+clean
150.320: active+clean
150.326: active+clean
150.36e: active+clean
150.380: active+clean
150.389: active+clean
150.3a4: active+clean
150.3ad: active+clean
150.3b4: active+clean
150.3bb: active+clean
150.3ce: active+clean
150.3d0: active+clean
150.3d8: active+clean
150.3e0: active+clean
150.3f6: active+clean
165.24c: Error ENOENT: problem getting command descriptions from pg.165.24c
165.28f: Error ENOENT: problem getting command descriptions from pg.165.28f
165.2b3: Error ENOENT: problem getting command descriptions from pg.165.2b3
165.2b4: Error ENOENT: problem getting command descriptions from pg.165.2b4
165.2d6: Error ENOENT: problem getting command descriptions from pg.165.2d6
165.2f4: Error ENOENT: problem getting command descriptions from pg.165.2f4
165.2fd: Error ENOENT: problem getting command descriptions from pg.165.2fd
165.30f: Error ENOENT: problem getting command descriptions from pg.165.30f
165.322: Error ENOENT: problem getting command descriptions from pg.165.322
165.325: Error ENOENT: problem getting command descriptions from pg.165.325
165.334: Error ENOENT: problem getting command descriptions from pg.165.334
165.36e: Error ENOENT: problem getting command descriptions from pg.165.36e
165.37c: Error ENOENT: problem getting command descriptions from pg.165.37c
165.382: Error ENOENT: problem getting command descriptions from pg.165.382
165.387: Error ENOENT: problem getting command descriptions from pg.165.387
165.3af: Error ENOENT: problem getting command descriptions from pg.165.3af
165.3da: Error ENOENT: problem getting command descriptions from pg.165.3da
165.3e0: Error ENOENT: problem getting command descriptions from pg.165.3e0
165.3e2: Error ENOENT: problem getting command descriptions from pg.165.3e2
165.3e9: Error ENOENT: problem getting command descriptions from pg.165.3e9
165.3fb: Error ENOENT: problem getting command descriptions from pg.165.3fb
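
To make the split between the two pools easier to see, the results of the loop can be grouped by pool ID (the part of the PG ID before the dot). This is a minimal sketch over a few sample lines taken from the output above; summarize_by_pool is a hypothetical helper, and in practice the loop's output would be piped into it.

```shell
# A few "pgid: result" lines copied from the loop output above.
results='150.270: active+clean
150.2a0: active+clean
165.24c: Error ENOENT: problem getting command descriptions from pg.165.24c
165.28f: Error ENOENT: problem getting command descriptions from pg.165.28f
165.2b3: Error ENOENT: problem getting command descriptions from pg.165.2b3'

# Print "<pool> <status> <count>" for each pool/status combination.
summarize_by_pool() {
    awk -F': ' '{
        split($1, id, ".")                       # pgid is <pool>.<pg>
        status = ($2 ~ /^Error ENOENT/) ? "ENOENT" : $2
        count[id[1] " " status]++
    }
    END { for (k in count) print k, count[k] }' | sort
}

summary=$(printf '%s\n' "$results" | summarize_by_pool)
echo "$summary"
```

With the full output above, every PG in pool 150 reports active+clean and every PG in pool 165 returns ENOENT.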

[root@ceph03 ~]# ceph pg 165.24c query
Error ENOENT: problem getting command descriptions from pg.165.24c
[root@ceph03 ~]# ceph pg 165.24c delete
Error ENOENT: problem getting command descriptions from pg.165.24c

--
______________________________________________________________________
Ben Morrice | e: ben.morrice@xxxxxxx | t: +41-21-693-9670
EPFL / BBP
Biotech Campus
Chemin des Mines 9
1202 Geneva
Switzerland

Attached 'ceph health detail' output:

HEALTH_WARN Degraded data redundancy: 443765/14311983 objects degraded (3.101%), 162 pgs degraded, 241 pgs undersized; 75 slow requests are blocked > 32 sec. Implicated osds 32; too many PGs per OSD (278 > max 200)
pg 150.270 is stuck undersized for 1871.987162, current state active+undersized, last acting [17,30]
pg 150.2a0 is stuck undersized for 1871.988539, current state active+undersized, last acting [16,24]
pg 150.2b6 is stuck undersized for 1871.984670, current state active+undersized, last acting [26,28]
pg 150.2c2 is stuck undersized for 1871.985571, current state active+undersized, last acting [10,30]
pg 150.2cc is stuck undersized for 1871.991733, current state active+undersized, last acting [35,23]
pg 150.2d5 is stuck undersized for 1871.992692, current state active+undersized, last acting [15,24]
pg 150.2d6 is stuck undersized for 1871.985410, current state active+undersized, last acting [23,34]
pg 150.2e1 is stuck undersized for 1871.990823, current state active+undersized, last acting [35,13]
pg 150.2ef is stuck undersized for 1871.990259, current state active+undersized, last acting [25,33]
pg 150.2f5 is stuck undersized for 1871.988578, current state active+undersized, last acting [35,11]
pg 150.2f7 is stuck undersized for 1871.989826, current state active+undersized, last acting [19,12]
pg 150.2fc is stuck undersized for 1871.987132, current state active+undersized, last acting [13,25]
pg 150.315 is stuck undersized for 1871.988419, current state active+undersized, last acting [24,12]
pg 150.318 is stuck undersized for 1871.985784, current state active+undersized, last acting [28,23]
pg 150.31a is stuck undersized for 1871.988659, current state active+undersized, last acting [23,30]
pg 150.320 is stuck undersized for 1871.986622, current state active+undersized, last acting [29,24]
pg 150.326 is stuck undersized for 1871.989506, current state active+undersized, last acting [29,10]
pg 150.36e is stuck undersized for 1871.991475, current state active+undersized, last acting [12,20]
pg 150.380 is stuck undersized for 1871.990961, current state active+undersized+degraded, last acting [23,13]
pg 150.389 is stuck undersized for 1871.984920, current state active+undersized+degraded, last acting [26,12]
pg 150.3a4 is stuck undersized for 1871.992132, current state active+undersized, last acting [22,34]
pg 150.3ad is stuck undersized for 1871.991914, current state active+undersized, last acting [15,33]
pg 150.3b4 is stuck undersized for 1871.986881, current state active+undersized, last acting [28,19]
pg 150.3bb is stuck undersized for 1871.987502, current state active+undersized, last acting [19,12]
pg 150.3ce is stuck undersized for 1871.989547, current state active+undersized, last acting [24,9]
pg 150.3d0 is stuck undersized for 1871.988650, current state active+undersized, last acting [15,18]
pg 150.3d8 is stuck undersized for 1871.985067, current state active+undersized, last acting [20,16]
pg 150.3e0 is stuck undersized for 1871.986621, current state active+undersized, last acting [23,10]
pg 150.3f6 is stuck undersized for 1871.986451, current state active+undersized, last acting [13,18]
pg 165.24c is stuck undersized for 1871.989838, current state active+undersized, last acting [31,13]
pg 165.28f is stuck undersized for 1871.987943, current state active+undersized, last acting [28,9]
pg 165.2b3 is stuck undersized for 1871.986314, current state active+undersized, last acting [32,11]
pg 165.2b4 is stuck undersized for 1871.990227, current state active+undersized, last acting [30,12]
pg 165.2d6 is stuck undersized for 1871.987215, current state active+undersized, last acting [32,25]
pg 165.2f4 is stuck undersized for 1871.992309, current state active+undersized, last acting [27,20]
pg 165.2fd is stuck undersized for 1871.992173, current state active+undersized, last acting [18,15]
pg 165.30f is stuck undersized for 1871.988641, current state active+undersized, last acting [24,10]
pg 165.322 is stuck undersized for 1871.992408, current state active+undersized, last acting [20,33]
pg 165.325 is stuck undersized for 1871.991148, current state active+undersized, last acting [24,28]
pg 165.334 is stuck undersized for 1871.989945, current state active+undersized, last acting [34,25]
pg 165.33e is active+undersized+degraded, acting [21,10]
pg 165.36e is stuck undersized for 1871.991843, current state active+undersized, last acting [24,12]
pg 165.37c is stuck undersized for 1871.989140, current state active+undersized, last acting [31,23]
pg 165.382 is stuck undersized for 1871.991045, current state active+undersized, last acting [10,20]
pg 165.387 is stuck undersized for 1871.987867, current state active+undersized, last acting [30,12]
pg 165.3af is stuck undersized for 1871.987671, current state active+undersized, last acting [22,34]
pg 165.3da is stuck undersized for 1871.992028, current state active+undersized, last acting [20,9]
pg 165.3e0 is stuck undersized for 1871.990471, current state active+undersized, last acting [24,13]
pg 165.3e2 is stuck undersized for 1871.990954, current state active+undersized, last acting [28,9]
pg 165.3e9 is stuck undersized for 1871.991001, current state active+undersized, last acting [24,13]
pg 165.3fb is stuck undersized for 1871.992232, current state active+undersized, last acting [25,9]
22 ops are blocked > 2097.15 sec
24 ops are blocked > 1048.58 sec
5 ops are blocked > 524.288 sec
1 ops are blocked > 262.144 sec
22 ops are blocked > 131.072 sec
1 ops are blocked > 32.768 sec
osd.32 has blocked requests > 2097.15 sec
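
Two quick checks can be run against this output: whether the per-bucket blocked op counts add up to the 75 slow requests in the summary, and which listed PGs have the implicated OSD in their acting set. The sketch below works on a small sample copied from the lines above; blocked_total and pgs_on_osd are hypothetical helper names, and the full output would normally be piped in from 'ceph health detail'.

```shell
# Sample lines copied from the 'ceph health detail' output above.
health_detail='pg 165.28f is stuck undersized for 1871.987943, current state active+undersized, last acting [28,9]
pg 165.2b3 is stuck undersized for 1871.986314, current state active+undersized, last acting [32,11]
pg 165.2d6 is stuck undersized for 1871.987215, current state active+undersized, last acting [32,25]
22 ops are blocked > 2097.15 sec
24 ops are blocked > 1048.58 sec
5 ops are blocked > 524.288 sec
1 ops are blocked > 262.144 sec
22 ops are blocked > 131.072 sec
1 ops are blocked > 32.768 sec'

# Add up the op counts across all "ops are blocked" buckets.
blocked_total() {
    awk '/ops are blocked/ { t += $1 } END { print t + 0 }'
}

# Print the IDs of PGs whose acting set contains the given OSD number.
pgs_on_osd() {
    awk -v osd="$1" '$1 == "pg" && /last acting/ {
        set = substr($NF, 2, length($NF) - 2)    # strip the [ and ]
        n = split(set, ids, ",")
        for (i = 1; i <= n; i++) if (ids[i] == osd) { print $2; break }
    }'
}

ops=$(printf '%s\n' "$health_detail" | blocked_total)
on_osd=$(printf '%s\n' "$health_detail" | pgs_on_osd 32)
echo "blocked ops: $ops"
echo "PGs with osd.32 in acting set: $on_osd"
```

In the listing above, the bucket counts sum to 75, and the only PGs with OSD 32 in their acting set are 165.2b3 and 165.2d6, both in the pool whose PGs return ENOENT when queried.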
