Re: Incomplete pgs and no data movement ( cluster appears readonly )


 



I adjusted “osd max pg per osd hard ratio” to 50.0 and left “mon max pg per osd” at 5000 just to see if that would allow data movement.  It did: the new pool I created finished creating and its PGs spread out.  I was then able to copy the data from the existing pool into the new pool and delete the old one.
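For reference, a rough sketch of what those two settings look like in ceph.conf (assuming they are set under [global]; the daemons still need a restart, or injectargs, to pick up changes):

[global]
    mon max pg per osd = 5000
    osd max pg per osd hard ratio = 50.0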

 

Used this process for copying the default pools:

ceph osd pool create .users.email.new 16

rados cppool .users.email .users.email.new

ceph osd pool delete .users.email .users.email --yes-i-really-really-mean-it

ceph osd pool rename .users.email.new .users.email

ceph osd pool application enable .users.email rgw
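Side note: before the delete step, it may be worth comparing object counts between the old and new pool to confirm the copy completed; a rough check would be:

rados df | grep .users.email
# or count objects in each pool directly:
rados -p .users.email ls | wc -l
rados -p .users.email.new ls | wc -l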

 

 

So at this point, I have recreated all the .rgw and .user pools except .rgw.buckets with a pg_num of 16, which significantly reduced the total PG count.  Unfortunately, the incomplete PGs are still there:

 

  cluster:
    health: HEALTH_WARN
            Reduced data availability: 4 pgs inactive, 4 pgs incomplete
            Degraded data redundancy: 4 pgs unclean

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3
    mgr: mon3(active), standbys: mon1, mon2
    osd: 43 osds: 43 up, 43 in

  data:
    pools:   10 pools, 4240 pgs
    objects: 8148k objects, 10486 GB
    usage:   21536 GB used, 135 TB / 156 TB avail
    pgs:     0.094% pgs not active
             4236 active+clean
             4    incomplete

 

The health page is showing blue instead of red on the donut chart; at one point it jumped to green, but it’s back to blue currently.  There are no more blocked/delayed ops either.

 

Thanks for the assistance; it seems the cluster will play nice now.  Any thoughts on the stuck PGs?  I ran a query on PG 11.720 and it shows:

"blocked_by": [
    13,
    27,
    28
],
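For reference, that is output from “ceph pg 11.720 query”; something like this pulls out just the blocked_by section:

ceph pg 11.720 query | grep -A 5 blocked_by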

 

OSD 13 was acting strange, so I wiped it and removed it from the cluster.  This was during the rebuild, so I wasn’t aware of it blocking anything.  Now I am trying to figure out how a removed OSD can still be blocking.  I went through the process to remove it:

ceph osd crush remove osd.13

ceph auth del osd.13

ceph osd rm 13
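To double-check that the removed OSD is really gone from the CRUSH map and the OSD map, both of these should come back empty now:

ceph osd tree | grep -w osd.13
ceph osd dump | grep -w osd.13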

 

I guess since the cluster was a hot mess at that point, it’s possible the removal was borked and therefore the PG is borked.  I am trying to avoid deleting the data, as there is data on the OSDs that are still online.

 

-Brent

 

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Brent Kennedy
Sent: Wednesday, January 10, 2018 12:20 PM
To: 'Janne Johansson' <icepic.dz@xxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Incomplete pgs and no data movement ( cluster appears readonly )

 

I changed “mon max pg per osd” to 5000 because when I set it to zero, which was supposed to disable it, it caused an issue where I couldn’t create any pools; it would say 0 was larger than the minimum.  I imagine that’s a bug: if I wanted it disabled, then it shouldn’t use the calculation.  I then set “osd max pg per osd hard ratio” to 5 after changing “mon max pg per osd” to 5000, figuring 5 * 5000 would cover it.  Perhaps not.  I will adjust it to 30 and restart the OSDs.
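Rough math on that, assuming the effective hard limit is “mon max pg per osd” multiplied by the hard ratio:

5 * 5000 = 25000 PGs per OSD before the hard limit kicks in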

 

-Brent

 

 

 

From: Janne Johansson [mailto:icepic.dz@xxxxxxxxx]
Sent: Wednesday, January 10, 2018 3:00 AM
To: Brent Kennedy <bkennedy@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Incomplete pgs and no data movement ( cluster appears readonly )

 

 

 

2018-01-10 8:51 GMT+01:00 Brent Kennedy <bkennedy@xxxxxxxxxx>:

As per a previous thread, my PG counts are set too high.  I tried adjusting “mon max pg per osd” up higher and higher, which did clear the error (I restarted the monitors and managers each time), but it seems that data simply won’t move around the cluster.  If I stop the primary OSD of an incomplete PG, the cluster just shows the affected PGs as active+undersized+degraded:
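In case it helps, a quick way to list exactly which PGs are stuck is roughly:

ceph health detail | grep incomplete
ceph pg dump_stuck inactive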

 

I also adjusted “osd max pg per osd hard ratio” to 5, but that didn’t seem to trigger any data movement.  I did restart the OSDs each time I changed it.  The data just won’t finish moving.  “ceph -w” shows this:

2018-01-10 07:49:27.715163 osd.20 [WRN] slow request 960.675164 seconds old, received at 2018-01-10 07:33:27.039907: osd_op(client.3542508.0:4097 14.0 14.50e8d0b0 (undecoded) ondisk+write+known_if_redirected e125984) currently queued_for_pg
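One way to dig into what that op is actually waiting on is the admin socket on the node hosting osd.20, for example:

ceph daemon osd.20 dump_ops_in_flight
ceph daemon osd.20 dump_historic_ops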

 

 

Did you bump the ratio so that the PGs-per-OSD max * hard ratio actually became more than the number of PGs per OSD you had?

Last time you mailed, the PGs-per-OSD count was 25xx and the max was 200, which meant the hard ratio would have needed to be far more than 5.0.
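A rough worked check of those numbers, with the hard limit being the max multiplied by the hard ratio:

200 * 5.0  = 1000    (still well below ~2500 PGs per OSD)
200 * 13.0 = 2600    (roughly the smallest ratio that would clear 25xx)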

 

 

--

May the most significant bit of your life be positive.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
