Re: Troubleshooting Incomplete PGs

According to the Ceph documentation, incomplete means "Ceph detects that a placement group is missing a necessary period of history from its log. If you see this state, report a bug, and try to start any failed OSDs that may contain the needed information."

In the PG query, it lists some OSDs that it's trying to probe:
          "probing_osds": [
                "10",
                "13",
                "15",
                "25"],
          "down_osds_we_would_probe": [],

Is one of those the OSD you replaced?  If so, you might try ceph pg {pg-id} mark_unfound_lost revert|delete.  Be careful: that command will lose data.  It tells Ceph to give up looking for data that it can't find, so you might want to wait a bit and let the backfill finish before resorting to it.
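If you have many incomplete PGs, checking each query by eye gets tedious. Here is a minimal Python sketch that pulls the two lists out of saved pg query JSON. It searches the whole document for the keys rather than assuming where they sit. The function names, the sample JSON, and the replaced_osd value are made up for illustration, not taken from your cluster.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def probed_osds(pg_query):
    """Return (probing, down) as sets of integer OSD ids."""
    probing, down = set(), set()
    for value in find_key(pg_query, "probing_osds"):
        probing.update(int(osd) for osd in value)
    for value in find_key(pg_query, "down_osds_we_would_probe"):
        down.update(int(osd) for osd in value)
    return probing, down


# Hypothetical stand-in for the output of a pg query saved to a file.
sample = json.loads("""
{"recovery_state": [
    {"name": "Started/Primary/Peering",
     "probing_osds": ["10", "13", "15", "25"],
     "down_osds_we_would_probe": []}
]}
""")

probing, down = probed_osds(sample)
replaced_osd = 13  # hypothetical: substitute the id of the OSD you replaced

print(sorted(probing))          # [10, 13, 15, 25]
print(replaced_osd in probing)  # True
```

An empty down set, as in your query, means the PG is not waiting on any OSD that is currently down.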


There's also the possibility that your crushmap is still not correct.  In the history, I can see that you had bad CRUSH maps in the past.  Stuff like
  "recovery_state": [
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2014-10-21 12:18:48.482666",
          "past_intervals": [
                { "first": 4663,
                  "last": 4685,
                  "maybe_went_rw": 1,
                  "up": [],
                  "acting": [
                        10,
                        25,
                        10,
                        -1]},

shows that CRUSH put osd.10 into the same acting set twice, so two replicas of that PG landed on one OSD, which is a sign of a bad crushmap.  You might run through the crushtool testing at http://ceph.com/docs/master/man/8/crushtool/, just to make sure everything is kosher.
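To see how widespread that was, you could scan the past_intervals of each query for acting sets that repeat an OSD. A rough Python sketch follows, assuming the layout shown in the excerpt above (recovery_state entries containing past_intervals). Treating negative ids such as -1 as empty slots is my assumption, and the function name is made up.

```python
from collections import Counter


def intervals_with_duplicate_osds(pg_query):
    """Return (first, last, duplicated_osds) for each past interval whose
    acting set lists the same OSD more than once."""
    bad = []
    for state in pg_query.get("recovery_state", []):
        for interval in state.get("past_intervals", []):
            # Assumption: negative ids (e.g. -1) mark an unfilled slot.
            counts = Counter(osd for osd in interval.get("acting", []) if osd >= 0)
            dupes = sorted(osd for osd, n in counts.items() if n > 1)
            if dupes:
                bad.append((interval["first"], interval["last"], dupes))
    return bad


# The interval quoted above, wrapped in the structure the query uses.
sample = {
    "recovery_state": [
        {"name": "Started/Primary/Peering",
         "enter_time": "2014-10-21 12:18:48.482666",
         "past_intervals": [
             {"first": 4663, "last": 4685, "maybe_went_rw": 1,
              "up": [], "acting": [10, 25, 10, -1]},
         ]},
    ]
}

print(intervals_with_duplicate_osds(sample))  # [(4663, 4685, [10])]
```

This only tells you which intervals were affected in the past. Whether the current map still produces bad mappings is what the crushtool testing is for.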



On Tue, Oct 21, 2014 at 7:04 PM, Chris Kitzmiller <ckitzmiller@xxxxxxxxxxxxx> wrote:
I've gotten myself into the position of having ~100 incomplete PGs. All of my OSDs are up+in (and I've restarted them all one by one).

I was in the process of rebalancing after altering my CRUSH map when I lost an OSD backing disk. I replaced that OSD and it seemed to be backfilling well. During this time I noticed that I had 2 underperforming disks which were holding up the backfilling process. I set them out to try and get everything recovered but *I think* this is what caused some of my PGs to go incomplete. Since then I set those two underperformers back in and they're still backfilling now.

Any help would be appreciated in troubleshooting these PGs. I'm not sure why they're incomplete or what to do about it. A query of one of my incomplete PGs can be found here: http://pastebin.com/raw.php?i=AJ3RMjz6

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
