Re: Lost OSD from PCIe error, recovered, HOW to restore OSD process

On Thu, May 16, 2019 at 3:55 PM Mark Lehrer <lehrer@xxxxxxxxx> wrote:
> Steps 3-6 are to get the drive lvm volume back

How much longer will we have to deal with LVM?  If we can migrate non-LVM drives from earlier versions, how about we give ceph-volume the ability to create non-LVM OSDs directly?

We aren't requiring LVM exclusively; there is, for example, a ZFS plugin already. So I would say that if you want something like partitions, that could be done as a plugin (one that would need to be developed). We are concentrating on LVM because we think that is the way to go.




On Thu, May 16, 2019 at 1:20 PM Tarek Zegar <tzegar@xxxxxxxxxx> wrote:

FYI for anyone interested, below is how to recover from someone removing an NVMe drive (the first two steps show how mine was removed and brought back).
Steps 3-6 are to get the drive's LVM volume back AND get the OSD daemon running for the drive.

1. echo 1 > /sys/block/nvme0n1/device/device/remove
2. echo 1 > /sys/bus/pci/rescan
3. vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
4. ceph auth add osd.122 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-122/keyring
5. ceph-volume lvm activate --all
6. You should see the OSD somewhere in the ceph tree; move it back to the right host (see the sketch below)
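For anyone who wants this as one script, here is a minimal sketch of steps 3-6 using the VG name and OSD ID from this thread. The CRUSH weight and the create-or-move call for step 6 are my assumptions (the weight set to the drive size in TiB); substitute your own values throughout:

#!/bin/bash
# Sketch of recovery steps 3-6. VG, OSD ID, and weight are the values
# from this thread / assumptions; replace them with your cluster's.
set -euo pipefail

VG=ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841   # volume group backing the OSD
OSD_ID=122
WEIGHT=1.8                                     # assumed: drive size in TiB

# 3. Restore the VG metadata from the LVM backup and reactivate it
vgcfgrestore "$VG"
vgchange -ay "$VG"

# 4. Re-register the OSD's cephx key from the keyring still on disk
ceph auth add "osd.$OSD_ID" osd 'allow *' mon 'allow rwx' \
    -i "/var/lib/ceph/osd/ceph-$OSD_ID/keyring"

# 5. Mount the OSD data directory and start the systemd units
ceph-volume lvm activate --all

# 6. Put the OSD back under this host's bucket in the CRUSH tree
ceph osd crush create-or-move "osd.$OSD_ID" "$WEIGHT" host="$(hostname -s)"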

Tarek



Inactive hide details for "Tarek Zegar" ---05/15/2019 10:32:27 AM---TLDR; I activated the drive successfully but the daemon won"Tarek Zegar" ---05/15/2019 10:32:27 AM---TLDR; I activated the drive successfully but the daemon won't start, looks like it's complaining abo

From: "Tarek Zegar" <tzegar@xxxxxxxxxx>
To: Alfredo Deza <adeza@xxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Date: 05/15/2019 10:32 AM
Subject: [EXTERNAL] Re: Lost OSD from PCIe error, recovered, HOW to restore OSD process
Sent by: "ceph-users" <ceph-users-bounces@xxxxxxxxxxxxxx>





TL;DR: I activated the drive successfully, but the daemon won't start; it looks like it's complaining about the mon config, and I don't know why (there is a valid ceph.conf on the host). Thoughts? I feel like it's close. Thank you

I executed the command:
ceph-volume lvm activate --all



It found the drive and activated it:

--> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a
....
--> ceph-volume lvm activate successful for osd ID: 122
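A quick way to confirm the activation actually took (osd.122 as above) is to check ceph-volume and systemd directly:

ceph-volume lvm list               # the OSD and its devices should be listed
systemctl status ceph-osd@122      # did the daemon start and stay running?
journalctl -u ceph-osd@122 -n 50   # recent log lines if it did not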




However, systemd would not start the OSD process for osd.122:
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff1970700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15 14:16:13.862 7ffff116f700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch mon config (--no-mon-config to skip)
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Main process exited, code=exited, status=1/FAILURE
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Service hold-off time over, scheduling restart.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Scheduled restart job, restart counter is at 3.
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit ceph-osd@122.service has been scheduled, as the result for
-- the configured Restart= setting for the unit.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object storage daemon osd.122.
-- Subject: Unit ceph-osd@122.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit ceph-osd@122.service has finished shutting down.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Start request repeated too quickly.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service: Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph object storage daemon osd.122.
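This handle_auth_bad_method message is usually a cephx key mismatch rather than a real method mismatch (method 2 is cephx, which both sides clearly support). A sketch of how to compare the key the monitors hold with the one on disk, assuming the standard keyring path:

ceph auth get osd.122                      # the key the monitors have, if any
cat /var/lib/ceph/osd/ceph-122/keyring     # the key the OSD has on disk

# If the mon entry is missing, re-adding it from the on-disk keyring is
# exactly step 4 of the recovery posted at the top of this thread:
ceph auth add osd.122 osd 'allow *' mon 'allow rwx' \
    -i /var/lib/ceph/osd/ceph-122/keyring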




From: Alfredo Deza <adeza@xxxxxxxxxx>
To: Bob R <bobr@xxxxxxxxxxxxxx>
Cc: Tarek Zegar <tzegar@xxxxxxxxxx>, ceph-users <ceph-users@xxxxxxxxxxxxxx>
Date: 05/15/2019 08:27 AM
Subject: [EXTERNAL] Re: Lost OSD from PCIe error, recovered, HOW to restore OSD process




On Tue, May 14, 2019 at 7:24 PM Bob R <bobr@xxxxxxxxxxxxxx> wrote:
>
> Does 'ceph-volume lvm list' show it? If so you can try to activate it with 'ceph-volume lvm activate 122 74b01ec2--124d--427d--9812--e437f90261d4'

Good suggestion. If `ceph-volume lvm list` can see it, it can probably
activate it again. You can activate it with the OSD ID + OSD FSID, or
do:

ceph-volume lvm activate --all

You didn't say if the OSD wasn't coming up after trying to start it
(the systemd unit should still be there for ID 122), or if you tried
rebooting and that OSD didn't come up.

The systemd unit is tied to both the ID and FSID of the OSD, so it
shouldn't matter if the underlying device changed, since ceph-volume
ensures it is the right one every time it activates.
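For reference, you can see both units from the host; the ceph-volume unit name encodes the ID and FSID pair (the FSID below is the one from the activation output above):

systemctl list-units --all 'ceph-volume@*' 'ceph-osd@*'
# e.g. ceph-volume@lvm-122-a151bea5-d123-45d9-9b08-963a511c042a.service
#      ceph-osd@122.service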
>
> Bob
>
> On Tue, May 14, 2019 at 7:35 AM Tarek Zegar <tzegar@xxxxxxxxxx> wrote:
>>
>> Someone nuked an OSD that had 1-replica PGs. They accidentally did echo 1 > /sys/block/nvme0n1/device/device/remove
>> We got it back doing a echo 1 > /sys/bus/pci/rescan
>> However, it reenumerated as a different drive number (guess we didn't have udev rules)
>> They restored the LVM volume (vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>>
>> lsblk
>> nvme0n2 259:9 0 1.8T 0 disk
>> ceph--8c81b2a3--6c8e--4cae--a3c0--e2d91f82d841-osd--data--74b01ec2--124d--427d--9812--e437f90261d4 253:1 0 1.8T 0 lvm
>>
>> We are stuck here. How do we attach an OSD daemon to the drive? It was OSD.122 previously
>>
>> Thanks
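One way to recover the old OSD ID from the device itself: ceph-volume stores what it needs as LVM tags on the logical volume, so something like this (VG name from the lsblk above) shows which OSD the LV belongs to:

lvs -o lv_name,lv_tags ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
# look for ceph.osd_id=... and ceph.osd_fsid=... among the tags, then:
ceph-volume lvm activate <osd_id> <osd_fsid>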






_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
