The guide on migrating from filestore to bluestore was perfect. I was
able to get that OSD back up and running quickly. Thanks.

As for my PGs, I tried force-create-pg and it said it was working on it
for a while, and I saw some deep scrubs happening, but when they were
done it didn't help the incomplete problem. However, the
ceph-objectstore-tool approach seems to be working. For the people of
the future (which might well be me if I mess things up again), here's
the command I ran (from the node which hosts the OSD):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --pgid 2.0 --op mark-complete --no-mon-config

(A rough sketch of the full stop/mark-complete/restart sequence, and of
rebuilding a dead OSD, is appended after the quoted thread below.)

Thanks for your help, Alfredo & Paul. :-)

--Adam

On 6/27/19 11:05 AM, Alfredo Deza wrote:
> On Thu, Jun 27, 2019 at 10:36 AM ☣Adam <adam@xxxxxxxxx> wrote:
>
>     Well that caused some excitement (either that or the small power
>     disruption did)! One of my OSDs is now down because it keeps
>     crashing due to a failed assert (stacktraces attached; also I'm
>     apparently running mimic, not luminous).
>
>     In the past a failed assert on an OSD has meant removing the
>     disk, wiping it, re-adding it as a new one, and then having ceph
>     rebuild it from other copies of the data.
>
>     I did this all manually in the past, but I'm trying to get more
>     familiar with ceph's commands. Will the following commands do the
>     same?
>
>     ceph-volume lvm zap --destroy --osd-id 11
>     # Presumably that has to be run from the node with OSD 11, not
>     # just any ceph node?
>     # Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap
>
> That looks correct, and yes, you would need to run it on the node
> with OSD 11.
>
>     Do I need to remove the OSD (ceph osd out 11; wait for
>     stabilization; ceph osd purge 11) before I do this, and run
>     "ceph-deploy osd create" afterwards?
>
> I think what you need is essentially the same as the guide for
> migrating from filestore to bluestore:
>
> http://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/
>
>     Thanks,
>     Adam
>
>     On 6/26/19 6:35 AM, Paul Emmerich wrote:
>     > Have you tried: ceph osd force-create-pg <pgid>?
>     >
>     > If that doesn't work: use objectstore-tool on the OSD (while
>     > it's not running) and use it to force mark the PG as complete.
>     > (Don't know the exact command off the top of my head.)
>     >
>     > Caution: these are obviously really dangerous commands
>     >
>     > Paul
>     >
>     > --
>     > Paul Emmerich
>     >
>     > Looking for help with your Ceph cluster? Contact us at
>     > https://croit.io
>     >
>     > croit GmbH
>     > Freseniusstr. 31h
>     > 81247 München
>     > www.croit.io
>     > Tel: +49 89 1896585 90
>     >
>     > On Wed, Jun 26, 2019 at 1:56 AM ☣Adam <adam@xxxxxxxxx> wrote:
>     >
>     >     How can I tell ceph to give up on "incomplete" PGs?
>     >
>     >     I have 12 PGs which are "inactive, incomplete" and won't
>     >     recover. I think this is because in the past I have
>     >     carelessly pulled disks too quickly without letting the
>     >     system recover. I suspect the disks that hold the data for
>     >     these are long gone.
>     >
>     >     Whatever the reason, I want to fix it so I have a clean
>     >     cluster, even if that means losing data.
>     >
>     >     I went through the "troubleshooting pgs" guide[1], which is
>     >     excellent, but it didn't get me to a fix.
>     >
>     >     The output of `ceph pg 2.0 query` includes this:
>     >
>     >         "recovery_state": [
>     >             {
>     >                 "name": "Started/Primary/Peering/Incomplete",
>     >                 "enter_time": "2019-06-25 18:35:20.306634",
>     >                 "comment": "not enough complete instances of this PG"
>     >             },
>     >
>     >     I've already restarted all OSDs in various orders, and I
>     >     changed min_size to 1 to see if that would allow them to
>     >     get fixed, but no such luck. These pools are not erasure
>     >     coded and I'm using the Luminous release.
>     >
>     >     How can I tell ceph to give up on these PGs? There's
>     >     nothing identified as unfound, so mark_unfound_lost doesn't
>     >     help. I feel like `ceph osd lost` might be it, but at this
>     >     point the OSD numbers have been reused for new disks, so
>     >     I'd really like to limit the damage to the 12 PGs which are
>     >     incomplete if possible.
>     >
>     >     Thanks,
>     >     Adam
>     >
>     >     [1]
>     >     http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
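
P.S. for the archives: a minimal sketch of the whole mark-complete
sequence as I understand it. It assumes a systemd deployment (so the
OSD unit is ceph-osd@11); the noout flag and the export step are extra
precautions added here rather than anything the tool requires, and the
export file path is only an example. Untested exactly as written, so
adjust IDs and paths before copying:

# keep the cluster from rebalancing while the OSD is briefly down
ceph osd set noout

# stop the OSD that holds the stuck PG (run on that OSD's host)
systemctl stop ceph-osd@11

# optional safety net: export the PG before modifying it
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --pgid 2.0 --op export --file /root/pg-2.0.export

# force the PG to be treated as complete on this OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
    --pgid 2.0 --op mark-complete --no-mon-config

# bring the OSD back and let the cluster settle
systemctl start ceph-osd@11
ceph osd unset noout
ceph pg 2.0 query   # check that the PG peers and goes active

Once the OSD is back up the PG should peer and go active; if it
doesn't, the exported copy at least gives you something to fall back
on.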
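
P.P.S. for anyone who also needs to rebuild a dead OSD: a rough sketch
of the mark-out-and-replace flow from the bluestore migration guide
Alfredo linked, adapted to the zap-by-osd-id form discussed above.
/dev/sdX is a placeholder for the real device, 11 is the OSD id from
this thread, and the ordering should be double-checked against the
guide before running any of it:

# drain the broken OSD and wait until it is safe to remove
ceph osd out 11
while ! ceph osd safe-to-destroy osd.11 ; do sleep 60 ; done
systemctl stop ceph-osd@11

# wipe the old LVM volumes, then remove the OSD from the cluster
# while keeping its id available for reuse
ceph-volume lvm zap --destroy --osd-id 11
ceph osd destroy 11 --yes-i-really-mean-it

# recreate a bluestore OSD on the replacement device, reusing the id
ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 11

Unlike the purge mentioned earlier in the thread, destroy keeps the
OSD id around so the new disk can take over the same slot.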