ceph pg query says all of the OSDs are being probed. If those 6 OSDs are staying up, it probably just needs some time; the OSDs need to stay up longer than 15 minutes. If any of them are getting marked down at all, that will cause problems. I'd like to see the past intervals in the recovery state get smaller. All of those entries indicate potential history that needs to be reconciled; if that array is shrinking, then recovery is proceeding.
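If you'd rather not eyeball the whole JSON dump each time, something like this can track the trend (a rough sketch; it assumes jq is installed and that your release nests past_intervals under the recovery_state entries, which varies a bit between versions):

    # Count the past-interval entries; the number should trend downward as peering progresses
    ceph pg 0.37 query | jq '[.recovery_state[] | .past_intervals? // empty] | flatten | length'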
You could try pushing it a bit with a ceph pg scrub 0.37. If that finishes without any improvement, try ceph pg deep-scrub 0.37. Sometimes it helps move things faster, and sometimes it doesn't.
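While the scrub runs, you can keep an eye on the PG's reported state with a simple poll (illustrative; any interval works):

    # Re-list the stuck PG every few seconds to catch state transitions
    watch -n 5 'ceph pg dump_stuck unclean'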
On Wed, Apr 22, 2015 at 11:54 AM, MEGATEL / Rafał Gawron <rafal.gawron@xxxxxxxxxxxxxx> wrote:
All OSDs are working fine now.
ceph osd tree
ID  WEIGHT     TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1  1080.71985 root default
-2   120.07999     host s1
 0    60.03999         osd.0       up  1.00000          1.00000
 1    60.03999         osd.1       up  1.00000          1.00000
-3   120.07999     host s2
 2    60.03999         osd.2       up  1.00000          1.00000
 3    60.03999         osd.3       up  1.00000          1.00000
-4   120.07999     host s3
 4    60.03999         osd.4       up  1.00000          1.00000
 5    60.03999         osd.5       up  1.00000          1.00000
-5   120.07999     host s4
 6    60.03999         osd.6       up  1.00000          1.00000
 7    60.03999         osd.7       up  1.00000          1.00000
-6   120.07999     host s5
 9    60.03999         osd.9       up  1.00000          1.00000
 8    60.03999         osd.8       up  1.00000          1.00000
-7   120.07999     host s6
10    60.03999         osd.10      up  1.00000          1.00000
11    60.03999         osd.11      up  1.00000          1.00000
-8   120.07999     host s7
12    60.03999         osd.12      up  1.00000          1.00000
13    60.03999         osd.13      up  1.00000          1.00000
-9   120.07999     host s8
14    60.03999         osd.14      up  1.00000          1.00000
15    60.03999         osd.15      up  1.00000          1.00000
-10  120.07999     host s9
17    60.03999         osd.17      up  1.00000          1.00000
16    60.03999         osd.16      up  1.00000          1.00000
Earlier I had a power failure and my cluster was down.
After it came back up it was recovering, but now I have:
1 pgs incomplete
1 pgs stuck inactive
1 pgs stuck unclean
The cluster can't recover this PG.
I tried taking some OSDs out and adding them back to my cluster, but the recovery after that didn't rebuild it.
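The usual first step to pin down which PG is incomplete and which OSDs it maps to is something like this (a sketch; 0.37 is the PG id discussed elsewhere in the thread):

    # List unhealthy PGs with their acting OSD sets
    ceph health detail
    # Map one specific PG to the OSDs that serve it
    ceph pg map 0.37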
From: Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx>
Sent: 22 April 2015 20:40
To: MEGATEL / Rafał Gawron
Subject: Re: Odp.: CEPH 1 pgs incomplete

So you have flapping OSDs. None of the 6 OSDs involved in that PG are staying up long enough to complete the recovery.
What's happened is that, because of how quickly the OSDs are coming up and failing, no single OSD has a complete copy of the data. There should be a complete copy across the cluster, but different OSDs hold different chunks of it.
Figure out why those 6 OSDs are failing, and Ceph should recover. Do you see anything interesting in those OSD logs? If not, you might need to increase the logging levels.
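If the default logs don't show anything, injectargs is the usual way to raise the debug level on a live OSD without restarting it (illustrative values; substitute the IDs of the 6 OSDs in that PG's acting set):

    # Raise OSD and messenger debug levels on a running daemon
    ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'
    # Drop them back to the defaults once the failure has been captured
    ceph tell osd.0 injectargs '--debug-osd 0/5 --debug-ms 0/5'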
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com