I think this is the issue. Look at ceph health detail and you will see that 0.21 and the others are orphaned (a quick way to confirm that is sketched right after the output below):
HEALTH_WARN 65 pgs stale; 22 pgs stuck inactive; 65 pgs stuck stale; 22 pgs stuck unclean; too many PGs per OSD (456 > max 300)
pg 0.21 is stuck inactive since forever, current state creating, last acting []
pg 0.7 is stuck inactive since forever, current state creating, last acting []
pg 5.2 is stuck inactive since forever, current state creating, last acting []
pg 1.7 is stuck inactive since forever, current state creating, last acting []
pg 0.34 is stuck inactive since forever, current state creating, last acting []
pg 0.33 is stuck inactive since forever, current state creating, last acting []
pg 5.1 is stuck inactive since forever, current state creating, last acting []
pg 0.1b is stuck inactive since forever, current state creating, last acting []
pg 0.32 is stuck inactive since forever, current state creating, last acting []
pg 1.2 is stuck inactive since forever, current state creating, last acting []
pg 0.31 is stuck inactive since forever, current state creating, last acting []
pg 2.0 is stuck inactive since forever, current state creating, last acting []
pg 5.7 is stuck inactive since forever, current state creating, last acting []
pg 1.0 is stuck inactive since forever, current state creating, last acting []
pg 2.2 is stuck inactive since forever, current state creating, last acting []
pg 0.16 is stuck inactive since forever, current state creating, last acting []
pg 0.15 is stuck inactive since forever, current state creating, last acting []
pg 0.2b is stuck inactive since forever, current state creating, last acting []
pg 0.3f is stuck inactive since forever, current state creating, last acting []
pg 0.27 is stuck inactive since forever, current state creating, last acting []
pg 0.3c is stuck inactive since forever, current state creating, last acting []
pg 0.3a is stuck inactive since forever, current state creating, last acting []
pg 0.21 is stuck unclean since forever, current state creating, last acting []
pg 0.7 is stuck unclean since forever, current state creating, last acting []
pg 5.2 is stuck unclean since forever, current state creating, last acting []
pg 1.7 is stuck unclean since forever, current state creating, last acting []
pg 0.34 is stuck unclean since forever, current state creating, last acting []
pg 0.33 is stuck unclean since forever, current state creating, last acting []
pg 5.1 is stuck unclean since forever, current state creating, last acting []
pg 0.1b is stuck unclean since forever, current state creating, last acting []
pg 0.32 is stuck unclean since forever, current state creating, last acting []
pg 1.2 is stuck unclean since forever, current state creating, last acting []
pg 0.31 is stuck unclean since forever, current state creating, last acting []
pg 2.0 is stuck unclean since forever, current state creating, last acting []
pg 5.7 is stuck unclean since forever, current state creating, last acting []
pg 1.0 is stuck unclean since forever, current state creating, last acting []
pg 2.2 is stuck unclean since forever, current state creating, last acting []
pg 0.16 is stuck unclean since forever, current state creating, last acting []
pg 0.15 is stuck unclean since forever, current state creating, last acting []
pg 0.2b is stuck unclean since forever, current state creating, last acting []
pg 0.3f is stuck unclean since forever, current state creating, last acting []
pg 0.27 is stuck unclean since forever, current state creating, last acting []
pg 0.3c is stuck unclean since forever, current state creating, last acting []
pg 0.3a is stuck unclean since forever, current state creating, last acting []
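To confirm the orphan theory, every PG listed above should show an empty acting set. A minimal sketch, assuming the standard ceph CLI (0.21 is just one example ID taken from the output above):

# ceph pg dump_stuck inactive
# ceph pg map 0.21

dump_stuck inactive prints each stuck PG with its up/acting sets in one pass, and pg map on any single PG should come back with "up [] acting []" if no OSD currently owns it.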
On Sun, Jun 7, 2015 at 8:39 AM, Alex Muntada <alexm@xxxxxxxxx> wrote:
That also happened to us, but after moving the OSDs with blocked requests out of the cluster it eventually went back to HEALTH_OK.
Running ceph health detail should list those OSDs. Do you have any?
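In case it helps, a rough sketch of that procedure, assuming the usual ceph CLI (osd.12 below is only a placeholder ID; substitute whatever health detail actually reports):

# ceph health detail | grep -i blocked
# ceph osd out 12
# ceph -w

The first command filters health detail down to the OSDs reported with blocked/slow requests, marking one of them out lets CRUSH remap its PGs to other OSDs, and ceph -w lets you watch recovery until the cluster reports HEALTH_OK again.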
On 07/06/2015 16:16, "Marek Dohojda" <mdohojda@xxxxxxxxxxxxxxxxxxx> wrote:
Thank you. Unfortunately this won't work, because 0.21 is already being created:
~# ceph pg force_create_pg 0.21
pg 0.21 already creating
I think, and I am guessing here since I don't know the internals that well, that 0.21 started to be created, but since its OSD disappeared it never finished and it keeps trying.
On Sun, Jun 7, 2015 at 12:18 AM, Alex Muntada <alexm@xxxxxxxxx> wrote:
Marek Dohojda:
One of the stuck inactive PGs is 0.21, and here is the output of ceph pg map:
#ceph pg map 0.21
osdmap e579 pg 0.21 (0.21) -> up [] acting []
#ceph pg dump_stuck stale
ok
pg_stat state up up_primary acting acting_primary
0.22 stale+active+clean [5,1,6] 5 [5,1,6] 5
0.1f stale+active+clean [2,0,4] 2 [2,0,4] 2
<redacted for ease of reading>
# ceph osd stat
osdmap e579: 14 osds: 14 up, 14 in
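Since all 14 OSDs are up and in, the empty mapping is not a matter of down OSDs. A quick cross-check sketch, assuming the standard ceph CLI:

# ceph osd tree

osd tree prints the current CRUSH hierarchy, so any OSD that once hosted these PGs but has since been removed from the cluster will simply no longer appear there.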
If I do
#ceph pg 0.21 query
The command freezes and never returns any output.
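With "up [] acting []" there is no OSD for the query to reach, so the hang is to be expected. A hedged workaround sketch: bound the call and read the monitor's own view of the PG instead of asking an OSD:

# timeout 10 ceph pg 0.21 query
# ceph pg dump pgs_brief | grep -w '0\.21'

The timeout just keeps the shell from blocking forever, while pg dump pgs_brief is answered by the monitors and therefore returns the recorded state even when no OSD is acting for the PG.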
I suspect that the problem is that these PGs were created but the OSD that they were initially created under disappeared. So I believe that I should just remove these PGs, but honestly I don’t see how.
Does anybody have any ideas as to what to do next?
ceph pg force_create_pg 0.21
We played with this same scenario last week: we stopped the 3 OSDs holding the replicas of one PG on purpose to find out how it affected the cluster, and we ended up with a stale PG and 400 requests blocked for a long time. After trying several commands to get the cluster back, the one that made the difference was force_create_pg, followed by moving the OSD with blocked requests out of the cluster.
Hope that helps,
Alex
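If it were worth retrying Alex's suggestion across the whole set rather than just 0.21, a loose one-liner sketch (assuming the dump_stuck output format shown earlier in the thread) would be:

# for pg in $(ceph pg dump_stuck inactive 2>/dev/null | awk '/^[0-9]+\./ {print $1}'); do ceph pg force_create_pg "$pg"; done

It simply feeds every stuck PG ID back into force_create_pg; afterwards, ceph health detail shows whether any OSD is still holding blocked requests and needs to be moved out as described above.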