Incomplete pgs and no data movement (cluster appears read-only)


 



As per a previous thread, my pg counts are set too high.  I tried adjusting "mon max pg per osd" higher and higher (roughly as sketched after the status output below), which did clear the error; I restarted the monitors and managers each time.  But data simply won't move around the cluster.  If I stop the primary OSD of an incomplete pg, the cluster just shows the affected pgs as active+undersized+degraded:

 

services:

    mon: 3 daemons, quorum mon1,mon2,mon3

    mgr: mon3(active), standbys: mon1, mon2

    osd: 43 osds: 43 up, 43 in

 

data:

    pools:   11 pools, 36896 pgs

    objects: 8148k objects, 10486 GB

    usage:   21532 GB used, 135 TB / 156 TB avail

    pgs:     0.043% pgs unknown

             0.011% pgs not active

             362942/16689272 objects degraded (2.175%)

             34483 active+clean

             2393  active+undersized+degraded

             16    unknown

             3     incomplete

             1     down
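
For reference, this is roughly how the limit was raised each time (the value 1000 is just an example here, and the setting also went into ceph.conf before the restarts):

    # persistent setting in /etc/ceph/ceph.conf on the monitor hosts, under [global]:
    #     mon_max_pg_per_osd = 1000

    # runtime injection:
    ceph tell mon.* injectargs '--mon_max_pg_per_osd 1000'

    # then restart monitors and managers on each host:
    systemctl restart ceph-mon.target
    systemctl restart ceph-mgr.target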

 

The 16 unknown pgs are from me trying to set up a new pool, which succeeded, but when I tried to copy an existing pool into it, the command just sat there.  I did this hoping to copy the oversized-pg pools into new pools and then delete the old ones.  I really didn't want to move the data, but the issue needs to be dealt with.
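
The sequence was essentially the usual pool-copy dance (pool names and pg counts below are placeholders, not the real ones):

    # create a replacement pool with a saner pg count
    ceph osd pool create newpool 256 256

    # copy the objects across -- this is the step that just hangs
    rados cppool oldpool newpool

    # afterwards, swap the names and drop the old pool
    # (deletion also needs mon_allow_pool_delete = true)
    ceph osd pool rename oldpool oldpool-old
    ceph osd pool rename newpool oldpool
    ceph osd pool delete oldpool-old oldpool-old --yes-i-really-really-mean-it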

 

If I start the OSD back up, the cluster goes back to:

services:

    mon: 3 daemons, quorum mon1,mon2,mon3

    mgr: mon3(active), standbys: mon1, mon2

    osd: 43 osds: 43 up, 43 in

 

  data:

    pools:   11 pools, 36896 pgs

    objects: 8148k objects, 10486 GB

    usage:   21533 GB used, 135 TB / 156 TB avail

    pgs:     0.041% pgs unknown

             0.014% pgs not active

             36876 active+clean

             16    unknown

             4     incomplete

 

The cluster was upgraded from Hammer 0.94 to Jewel, and then to Luminous 12.2.2 last week, without issues, using the latest ceph-deploy.

 

I guess the issue at the moment is that data is not moving, either for recovery or for new data being written (new writes basically just time out).
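
In case it helps, this is the sort of thing I have been checking to confirm nothing is actually recovering or backfilling (pg 11.720 is one of the incomplete pgs from the health detail further down):

    # list pgs stuck inactive / unclean
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # query one of the incomplete pgs to see what it is waiting on
    ceph pg 11.720 query

    # per-OSD utilisation and pg counts
    ceph osd df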

 

I also adjusted "osd max pg per osd hard ratio" to 5, but that didn't seem to trigger any data movement either (how I applied it is sketched after the log line below).  I did restart the OSDs each time I changed it.  The data just won't finish moving.  "ceph -w" shows this:

2018-01-10 07:49:27.715163 osd.20 [WRN] slow request 960.675164 seconds old, received at 2018-01-10 07:33:27.039907: osd_op(client.3542508.0:4097 14.0 14.50e8d0b0 (undecoded) ondisk+write+known_if_redirected e125984) currently queued_for_pg
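
Roughly how that setting was applied (value 5 as mentioned above; it also went into ceph.conf before the OSD restarts):

    # push the new ratio to all OSDs at runtime
    ceph tell osd.* injectargs '--osd_max_pg_per_osd_hard_ratio 5'

    # verify on one OSD after its restart (run locally on that OSD's host)
    ceph daemon osd.20 config show | grep pg_per_osd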

 

Ceph health detail shows this:

HEALTH_ERR Reduced data availability: 20 pgs inactive, 4 pgs incomplete; Degraded data redundancy: 20 pgs unclean; 2 slow requests are blocked > 32 sec; 66 stuck requests are blocked > 4096 sec

PG_AVAILABILITY Reduced data availability: 20 pgs inactive, 4 pgs incomplete

    pg 11.720 is incomplete, acting [21,10]

    pg 11.9ab is incomplete, acting [14,2]

    pg 11.9fb is incomplete, acting [32,43]

    pg 11.c13 is incomplete, acting [42,26]

    pg 14.0 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.1 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.2 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.3 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.4 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.5 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.6 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.7 is stuck inactive for 1046.844458, current state creating+activating, last acting [21,40,5]

    pg 14.8 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.9 is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.a is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.b is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.c is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.d is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.e is stuck inactive for 1046.844458, current state unknown, last acting []

    pg 14.f is stuck inactive for 1046.844458, current state unknown, last acting []

PG_DEGRADED Degraded data redundancy: 20 pgs unclean

    pg 11.720 is stuck unclean since forever, current state incomplete, last acting [21,10]

    pg 11.9ab is stuck unclean since forever, current state incomplete, last acting [14,2]

    pg 11.9fb is stuck unclean since forever, current state incomplete, last acting [32,43]

    pg 11.c13 is stuck unclean since forever, current state incomplete, last acting [42,26]

    pg 14.0 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.1 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.2 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.3 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.4 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.5 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.6 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.7 is stuck unclean for 1046.844458, current state creating+activating, last acting [21,40,5]

    pg 14.8 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.9 is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.a is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.b is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.c is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.d is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.e is stuck unclean for 1046.844458, current state unknown, last acting []

    pg 14.f is stuck unclean for 1046.844458, current state unknown, last acting []

REQUEST_SLOW 2 slow requests are blocked > 32 sec

    2 ops are blocked > 1048.58 sec

    osds 15,20 have blocked requests > 1048.58 sec

REQUEST_STUCK 66 stuck requests are blocked > 4096 sec

    66 ops are blocked > 4194.3 sec

    osds 14,32,42 have stuck requests > 4194.3 sec
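
For the stuck requests, the blocked ops can also be pulled from the admin socket on the affected OSDs' hosts, for example:

    # run on the host carrying osd.14 (same idea for osds 32 and 42)
    ceph daemon osd.14 dump_blocked_ops
    ceph daemon osd.14 dump_ops_in_flight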

 

Any help would be appreciated.  Right now you can read the data, but that's about it; the cluster is effectively not writable.

 

-Brent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
