Re: Crashing OSDs (suicide timeout, following a single pool)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Nice catch.  That was a copy-paste error.  Sorry

it should have read:

 3. Flush the journal and export the primary version of the PG.  This took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export --file /root/32.10c.b.export

  4. Import the PG into a New / Temporary OSD that is also offline,
   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import --file /root/32.10c.b.export


On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
<brandon.morris.pmp@xxxxxxxxx> wrote:

> The only way that I was able to get back to Health_OK was to export/import.  ***** Please note, any time you use the ceph_objectstore_tool you risk data loss if not done carefully.   Never remove a PG until you have a known good export *****
>
> Here are the steps I used:
>
> 1. set NOOUT, NO BACKFILL
> 2. Stop the OSD's that have the erroring PG
> 3. Flush the journal and export the primary version of the PG.  This took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export --file /root/32.10c.b.export
>
> 4. Import the PG into a New / Temporary OSD that is also offline,
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export --file /root/32.10c.b.export

This should be an import op and presumably to a different data path
and journal path more like the following?

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
--journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
import --file /root/32.10c.b.export

Just trying to clarify for anyone that comes across this thread in the future.

Cheers,
Brad

>
> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case it looks like)
> 6. Start cluster OSD's
> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3 OSD's it is supposed to be on.
>
> This is similar to the recovery process described in this post from 04/09/2015: http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool   Hopefully it works in your case too and you can the cluster back to a state that you can make the CephFS directories smaller.
>
> - Brandon

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux