Re: pg's degraded

Maybe delete the pool and start over?
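
For reference, a rough sketch of what that involves (the pool name "mypool" and the PG count are placeholders, and note that the CephFS data/metadata pools can't simply be dropped while an MDS is still using them):

# WARNING: destroys everything in the pool (hypothetical pool name)
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
# recreate it with a PG count appropriate for the cluster size
ceph osd pool create mypool 128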

 

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of JIten Shah
Sent: Thursday, November 20, 2014 5:46 PM
To: Craig Lewis
Cc: ceph-users
Subject: Re: pg's degraded

 

Hi Craig,

 

Recreating the missing PGs fixed it. Thanks for your help.

 

But when I tried to mount the filesystem, it gave me “mount error 5”. I tried restarting the MDS server, but that didn’t help; it is still reported as laggy/unresponsive.

 

BTW, all these machines are VMs.

 

[jshah@Lab-cephmon001 ~]$ ceph health detail
HEALTH_WARN mds cluster is degraded; mds Lab-cephmon001 is laggy
mds cluster is degraded
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 rank 0 is replaying journal
mds.Lab-cephmon001 at 17.147.16.111:6800/3745284 is laggy/unresponsive
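
(For reference, a rough way to keep an eye on this: mount error 5 is EIO, which the CephFS kernel client typically returns while no MDS is active. The mount line below is just an example with placeholder credentials.)

# watch the MDS state until it goes from up:replay to up:active
watch -n 5 ceph mds stat
# once the MDS is active, retry the mount (monitor address and key are placeholders)
sudo mount -t ceph 17.147.16.111:6789:/ /mnt/ceph -o name=admin,secret=<admin-key>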

 

 

—Jiten

 

On Nov 20, 2014, at 4:20 PM, JIten Shah <jshah2005@xxxxxx> wrote:



Ok. Thanks.

 

—Jiten

 

On Nov 20, 2014, at 2:14 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:



If there's no data to lose, tell Ceph to re-create all the missing PGs.

 

ceph pg force_create_pg 2.33

 

Repeat for each of the missing PGs.  If that doesn't do anything, you might need to tell Ceph that you lost the OSDs.  For each OSD you moved, run ceph osd lost <OSDID>, then try the force_create_pg command again.
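
A minimal sketch of that sequence (the OSD ids are placeholders for whichever three were rebuilt; the PG list is pulled from whatever ceph health detail reports as stuck):

# mark the rebuilt OSDs as lost so the PGs stop waiting for them
for id in 0 1 2; do ceph osd lost $id --yes-i-really-mean-it; done
# re-create every PG that health detail reports as stuck
for pg in $(ceph health detail | awk '/is stuck/ {print $2}' | sort -u); do
    ceph pg force_create_pg $pg
done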

 

If that doesn't work, you can keep fighting with it, but it'll be faster to rebuild the cluster.

 

 

 

On Thu, Nov 20, 2014 at 1:45 PM, JIten Shah <jshah2005@xxxxxx> wrote:

Thanks for your help.

 

I was using Puppet to install the OSDs, and it takes a path rather than a device name. Since the path specified was incorrect, it created the OSDs on the root volume.
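
For what it's worth, the distinction at the ceph-disk level looks roughly like this (device and directory names are examples; the Puppet module parameters themselves vary by version):

# backed by a dedicated block device -- data and journal live off the root disk
ceph-disk prepare /dev/sdb
ceph-disk activate /dev/sdb1
# backed by a plain directory -- this ends up on the root volume
# if the path doesn't point at a separately mounted filesystem
ceph-disk prepare /var/lib/ceph/osd/mydir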

 

All 3 of the OSDs were rebuilt at the same time, because the cluster was unused and we had not put any data in it.

 

Is there any way to recover from this, or should I rebuild the cluster altogether?

 

—Jiten

 

On Nov 20, 2014, at 1:40 PM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:



So you have your crushmap set to choose osd instead of choose host?
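
If you want to check, you can decompile the crushmap and look at the replication rule; the relevant line is the chooseleaf step (file names here are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# in crushmap.txt the default rule should contain something like:
#   step chooseleaf firstn 0 type host   (replicas spread across hosts)
# rather than:
#   step chooseleaf firstn 0 type osd    (replicas may land on a single host)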

 

Did you wait for the cluster to recover between each OSD rebuild?  If you rebuilt all 3 OSDs at the same time (or without waiting for a complete recovery between them), that would cause this problem.
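
Something along these lines between rebuilds is usually enough (a crude sketch; the real signal is all PGs reaching active+clean in ceph -s, since HEALTH_OK can be delayed by unrelated warnings):

# after rebuilding one OSD, wait for the cluster to settle before touching the next
while ! ceph health | grep -q HEALTH_OK; do
    sleep 60
done
ceph -s    # confirm all PGs are active+clean before the next rebuild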

 

 

 

On Thu, Nov 20, 2014 at 11:40 AM, JIten Shah <jshah2005@xxxxxx> wrote:

Yes, it was a healthy cluster, and I had to rebuild because the OSDs got accidentally created on the root disk. Out of 4 OSDs I had to rebuild 3 of them.

 

 

[jshah@Lab-cephmon001 ~]$ ceph osd tree

# id    weight    type name                 up/down  reweight
-1      0.5       root default
-2      0.09999       host Lab-cephosd005
4       0.09999           osd.4             up       1
-3      0.09999       host Lab-cephosd001
0       0.09999           osd.0             up       1
-4      0.09999       host Lab-cephosd002
1       0.09999           osd.1             up       1
-5      0.09999       host Lab-cephosd003
2       0.09999           osd.2             up       1
-6      0.09999       host Lab-cephosd004
3       0.09999           osd.3             up       1

 

 

[jshah@Lab-cephmon001 ~]$ ceph pg 2.33 query

Error ENOENT: i don't have pgid 2.33

 

—Jiten

 

 

On Nov 20, 2014, at 11:18 AM, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:



Just to be clear, this is from a cluster that was healthy, had a disk replaced, and hasn't returned to healthy?  It's not a new cluster that has never been healthy, right?

 

Assuming it's an existing cluster, how many OSDs did you replace?  It almost looks like you replaced multiple OSDs at the same time, and lost data because of it.

 

Can you give us the output of `ceph osd tree`, and `ceph pg 2.33 query`?

 

 

On Wed, Nov 19, 2014 at 2:14 PM, JIten Shah <jshah2005@xxxxxx> wrote:

After rebuilding a few OSDs, I see that the PGs are stuck in degraded mode. Some are in the unclean state and others are in the stale state. Somehow the MDS is also degraded. How do I recover the OSDs and the MDS back to healthy? I’ve read through the documentation and searched the web, but no luck so far.

 

pg 2.33 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 0.30 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 1.31 is stuck unclean since forever, current state stale+active+degraded, last acting [2]
pg 2.32 is stuck unclean for 597129.903922, current state stale+active+degraded, last acting [2]
pg 0.2f is stuck unclean for 597129.903951, current state stale+active+degraded, last acting [2]
pg 1.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 2.2d is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [2]
pg 0.2e is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 1.2f is stuck unclean for 597129.904015, current state stale+active+degraded, last acting [2]
pg 2.2c is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [3]
pg 0.2d is stuck stale for 422844.566858, current state stale+active+degraded, last acting [2]
pg 1.2c is stuck stale for 422598.539483, current state stale+active+degraded+remapped, last acting [3]
pg 2.2f is stuck stale for 422598.539488, current state stale+active+degraded+remapped, last acting [3]
pg 0.2c is stuck stale for 422598.539487, current state stale+active+degraded+remapped, last acting [3]
pg 1.2d is stuck stale for 422598.539492, current state stale+active+degraded+remapped, last acting [3]
pg 2.2e is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3]
pg 0.2b is stuck stale for 422598.539491, current state stale+active+degraded+remapped, last acting [3]
pg 1.2a is stuck stale for 422598.539496, current state stale+active+degraded+remapped, last acting [3]
pg 2.29 is stuck stale for 422598.539504, current state stale+active+degraded+remapped, last acting [3]
...
6 ops are blocked > 2097.15 sec
3 ops are blocked > 2097.15 sec on osd.0
2 ops are blocked > 2097.15 sec on osd.2
1 ops are blocked > 2097.15 sec on osd.4
3 osds have slow requests
recovery 40/60 objects degraded (66.667%)
mds cluster is degraded
mds.Lab-cephmon001 at X.X.16.111:6800/3424727 rank 0 is replaying journal

 

—Jiten

 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
