Incomplete pgs and no data movement (cluster appears read-only)


 



As per a previous thread, my pg counts are set too high.  I tried adjusting "mon max pg per osd" higher and higher (roughly as sketched after the status output below), which did clear the error; I restarted the monitors and managers each time.  But data simply won't move around the cluster.  If I stop the primary OSD of an incomplete pg, the cluster just shows the affected pgs as active+undersized+degraded:

 

services:

    mon: 3 daemons, quorum mon1,mon2,mon3

    mgr: mon3(active), standbys: mon1, mon2

    osd: 43 osds: 43 up, 43 in

 

data:

    pools:   11 pools, 36896 pgs

    objects: 8148k objects, 10486 GB

    usage:   21532 GB used, 135 TB / 156 TB avail

    pgs:     0.043% pgs unknown

             0.011% pgs not active

             362942/16689272 objects degraded (2.175%)

             34483 active+clean

             2393  active+undersized+degraded

             16    unknown

             3     incomplete

             1     down
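
For reference, this is roughly how the limit was raised each time (the value 1000 is just an example here, and the setting also went into ceph.conf before the restarts):

    # persistent setting in /etc/ceph/ceph.conf on the monitor hosts, under [global]:
    #     mon_max_pg_per_osd = 1000

    # runtime injection:
    ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'

    # then restart monitors and managers on each host:
    systemctl restart ceph-mon.target
    systemctl restart ceph-mgr.target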

 

The 16 unknown pgs are from me trying to set up a new pool, which succeeded, but when I tried to copy an existing pool into it, the command just sat there.  I did this hoping to copy the oversized-pg pools into new pools and then delete the old ones.  I really didn't want to move the data, but the issue needs to be dealt with.
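
The sequence was essentially the usual pool-copy dance (pool names and pg counts below are placeholders, not the real ones):

    # create a replacement pool with a saner pg count
    ceph osd pool create newpool 256 256

    # copy the objects across -- this is the step that just hangs
    rados cppool oldpool newpool

    # afterwards, swap the names and drop the old pool
    # (deletion also needs mon_allow_pool_delete = true)
    ceph osd pool rename oldpool oldpool-old
    ceph osd pool rename newpool oldpool
    ceph osd pool delete oldpool-old oldpool-old --yes-i-really-really-mean-it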

 

If I start the OSD back up, the cluster goes back to:

services:

    mon: 3 daemons, quorum mon1,mon2,mon3

    mgr: mon3(active), standbys: mon1, mon2

    osd: 43 osds: 43 up, 43 in

 

  data:

    pools:   11 pools, 36896 pgs

    objects: 8148k objects, 10486 GB

    usage:   21533 GB used, 135 TB / 156 TB avail

    pgs:     0.041% pgs unknown

             0.014% pgs not active

             36876 active+clean

             16    unknown

             4     incomplete

 

The cluster was upgraded from Hammer 0.94 to Jewel, and then to Luminous 12.2.2 last week, without issues, using the latest ceph-deploy.

 

I guess the issue at the moment is that data is not moving, either for recovery or for new data being written (new writes basically just time out).
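
In case it helps, this is the sort of thing I have been checking to confirm nothing is actually recovering or backfilling (pg 11.720 is one of the incomplete pgs from the health detail further down):

    # list pgs stuck inactive / unclean
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # query one of the incomplete pgs to see what it is waiting on
    ceph pg 11.720 query

    # per-OSD utilisation and pg counts
    ceph osd df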

 

I also adjusted "osd max pg per osd hard ratio" to 5, but that didn't seem to trigger any data movement either (how I applied it is sketched after the log line below).  I did restart the OSDs each time I changed it.  The data just won't finish moving.  "ceph -w" shows this:

2018-01-10 07:49:27.715163 osd.20 [WRN] slow request 960.675164 seconds old, received at 2018-01-10 07:33:27.039907: osd_op(client.3542508.0:4097 14.0 14.50e8d0b0 (undecoded) ondisk+write+known_if_redirected e125984) currently queued_for_pg
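
Roughly how that setting was applied (value 5 as mentioned above; it also went into ceph.conf before the OSD restarts):

    # push the new ratio to all OSDs at runtime
    ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'

    # verify on one OSD after its restart (run locally on that OSD's host)
    ceph daemon osd.20 config show | grep pg_per_osd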

 

Ceph health detail shows this:

HEALTH_ERR Reduced data availability: 20 pgs inactive, 4 pgs incomplete; Degraded data redundancy: 20 pgs unclean; 2 slow requests are blocked > 32 sec; 66 stuck requests are blocked > 4096 sec

PG_AVAILABILITY Reduced data availability: 20 pgs inactive, 4 pgs incomplete

    pg 11.720 is incomplete, acting [21,10]

    pg 11.9ab is incomplete, acting [14,2]

    pg 11.9fb is incomplete, acting [32,43]

    pg 11.c13 is incomplete, acting [42,26]

    pg 14.0 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.1 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.2 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.3 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.4 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.5 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.6 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.7 is stuck inactive for 1046.844458, current state creating+activating, last acting [21,40,5]

    pg 14.8 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.9 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.a is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.b is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.c is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.d is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.e is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.f is stuck inactive for 1046.844458, current state unknown, last acting []

PG_DEGRADED Degraded data redundancy: 20 pgs unclean

    pg 11.720 is stuck unclean since forever, current state incomplete, last acting [21,10]

    pg 11.9ab is stuck unclean since forever, current state incomplete, last acting [14,2]

    pg 11.9fb is stuck unclean since forever, current state incomplete, last acting [32,43]

    pg 11.c13 is stuck unclean since forever, current state incomplete, last acting [42,26]

    pg 14.0 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.1 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.2 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.3 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.4 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.5 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.6 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.7 is stuck unclean for 1046.844458, current state creating+activating, last acting [21,40,5]

    pg 14.8 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.9 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.a is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.b is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.c is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.d is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.e is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.f is stuck unclean for 1046.844458, current state unknown, last acting []

REQUEST_SLOW 2 slow requests are blocked > 32 sec

    2 ops are blocked > 1048.58 sec

    osds 15,20 have blocked requests > 1048.58 sec

REQUEST_STUCK 66 stuck requests are blocked > 4096 sec

    66 ops are blocked > 4194.3 sec

    osds 14,32,42 have stuck requests > 4194.3 sec
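
For the stuck requests, the blocked ops can also be pulled from the admin socket on the affected OSDs' hosts, for example:

    # run on the host carrying osd.14 (same idea for osds 32 and 42)
    ceph daemon osd.14 dump_blocked_ops
    ceph daemon osd.14 dump_ops_in_flight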

 

Any help would be appreciated.  Right now you can read the data, but that's about it; the cluster is effectively not writable.

 

-Brent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
