Re: How to fix a Ceph PG in unknown state with no OSDs?

I'm not running the balancer, but I did reweight-by-utilization
a few times recently.

"ceph osd tree" and "ceph -s" say:

    https://gist.github.com/oschulz/36d92af84851ec42e09ce1f3cacbc110
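
If reweight-by-utilization is a suspect, one way to check the mapping
offline is to extract the current osdmap and ask CRUSH directly where
the PG should go; a minimal sketch, using the PG ID 1.721 from the
original report below:

    # dump the current osdmap to a file, then test-map the stuck PG
    ceph osd getmap -o /tmp/osdmap
    osdmaptool /tmp/osdmap --test-map-pg 1.721

If osdmaptool also prints an empty up/acting set, the problem is in
the crush map/osdmap itself rather than in any particular OSD.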



On 14.06.2018 20:23, Gregory Farnum wrote:
Well, if this pg maps to no osds, something has certainly gone wrong with your crush map. What’s the crush rule it’s using, and what’s the output of “ceph osd tree”? Are you running the manager’s balancer module, or something else that might be putting explicit mappings into the osd map and have broken it?
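
Both can be checked with something like the following (the pool name
"cephfs_metadata" is just a placeholder for the actual metadata pool):

    # which crush rule does the pool use, and what does it say?
    ceph osd pool get cephfs_metadata crush_rule
    ceph osd crush rule dump

    # any explicit pg_temp/pg_upmap entries in the osd map?
    ceph osd dump | grep -E 'pg_temp|pg_upmap'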

I’m not certain off-hand about the pg reporting, but I believe if it’s reporting the state as unknown, that means *no* running osd contains any copy of that pg. That’s not something Ceph could do on its own without osd failures. What’s the output of “ceph -s”?

On Thu, Jun 14, 2018 at 2:15 PM Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx> wrote:

    Dear Greg,

    no, it's a very old cluster (in continuous operation since 2013,
    with multiple expansions). It's a production cluster, and
    there's about 300TB of valuable data on it.

    We recently updated to Luminous and added more OSDs (a month
    or so ago), and everything has seemed OK since then. We haven't
    had any disk failures, but we did have trouble with the MDS
    daemons in the last few days, so there were a few reboots.

    Is it somehow possible to find this "lost" PG again? Since
    it's in the metadata pool, large parts of our CephFS directory
    tree are currently unavailable. I turned the MDS daemons off
    for now ...
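
    Would searching for the PG's shards directly on the OSDs be a
    safe approach? A rough sketch, assuming OSD id <N> and the
    default data path (with the OSD stopped first):

        systemctl stop ceph-osd@<N>
        # list all PGs this OSD holds; the stuck PG is in pool 1
        ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<N> \
            --op list-pgs | grep '^1\.'
        systemctl start ceph-osd@<N>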


    Cheers

    Oliver

    On 14.06.2018 19:59, Gregory Farnum wrote:
     > Is this a new cluster? Or did the crush map change somehow
     > recently? One way this might happen is if CRUSH just failed
     > entirely to map a pg, although I think if the pg exists anywhere
     > it should still be getting reported as inactive.
     >
     > On Thu, Jun 14, 2018 at 8:40 AM Oliver Schulz
     > <oliver.schulz@xxxxxxxxxxxxxx> wrote:
     >
     >     Dear all,
     >
     >     I have a serious problem with our Ceph cluster: One of our
     >     PGs somehow ended up in this state (reported by "ceph health
     >     detail"):
     >
     >           pg 1.XXX is stuck inactive for ..., current state
     >           unknown, last acting []
     >
     >     Also, "ceph pg map 1.xxx" reports:
     >
     >           osdmap e525812 pg 1.721 (1.721) -> up [] acting []
     >
     >     I can't use "ceph pg 1.XXX query", it just hangs with no output.
     >
     >     All OSDs are up and in, I have MON quorum, all other PGs
     >     seem to be fine.
     >
     >     How can I diagnose/fix this? Unfortunately, the PG in
     >     question is part of the CephFS metadata pool ...
     >
     >     Any help would be very, very much appreciated!
     >
     >
     >     Cheers,
     >
     >     Oliver
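
For completeness, inactive/unknown PGs can also be listed cluster-wide;
a small sketch (exact output format varies by release):

    # list everything stuck inactive, then filter health detail
    ceph pg dump_stuck inactive
    ceph health detail | grep -i unknown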

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



