Re: OSD failed to load OSD map for epoch

Hi,

did you read this thread [1] reporting a similar issue? It refers to a solution described in [2] but the OP in [1] recreated all OSDs, so it's not clear what the root cause was. Can you start the OSD with more verbose (debug) output and share that? Does your cluster really have only two OSDs? Are you running it with size 2 pools?
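
For the verbose output, something along these lines should be enough to see where startup fails (the debug levels are just a suggestion; adjust as you like):

  # ceph-osd -d --cluster ceph --id 1 --debug-osd 20 --debug-bluestore 20 --debug-bdev 20 2>&1 | tee /tmp/osd.1.log

It might also be worth running a (read-only) bluestore fsck to see whether the damage goes beyond the superblock; the path is assumed to be the usual /var/lib/ceph/osd/ceph-1 after activation:

  # ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1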

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/EUFDKK3HEA5DPTUVJ5LBNQSWAKZH5ZM7/
[2] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036592.html
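
If it does come down to injecting the missing maps by hand, the usual approach (a rough sketch only; the epoch and OSD id are taken from your mail, paths may differ inside the container, and the OSD must be stopped) is to export the map from the monitors and write it into the OSD's store with ceph-objectstore-tool, repeating for each missing epoch up to the monitors' e4378:

  # ceph osd getmap 4372 -o /tmp/osdmap.4372
  # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op set-osdmap --file /tmp/osdmap.4372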


Quoting Johan Hattne <johan@xxxxxxxxx>:

Dear all;

We have a 3-node cluster with two OSDs on separate nodes, each with its wal on NVMe. It's been running fine for quite some time, albeit under very light load. This week, we moved from package-based Octopus to container-based ditto (15.2.13, all on Debian stable). Within a few hours of that change, both OSDs crashed and dmesg filled up with stuff like:

  DMAR: DRHD: handling fault status reg 2
  DMAR: [DMA Read] Request device [06:00.0] PASID ffffffff fault addr ffbc0000 [fault reason 06] PTE Read access is not set

where 06:00.0 is the NVMe with the wal. This happened at the same time on *both* OSD nodes, but I'll worry about why this happened later. I would first like to see if I can get the cluster back up.

From the cephadm shell, I activate OSD 1 and try to start it (I did create a minimal /etc/ceph/ceph.conf with global "fsid" and "mon host" for that purpose; a sketch of it follows the commands below):

  # ceph-volume lvm activate 1 cce125b2-2597-4be9-bd17-23eb059d2778 --no-systemd
  # ceph-osd -d --cluster ceph --id 1
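
For reference, that minimal conf is nothing more than a [global] section along these lines, with the real values in place of the placeholders:

  [global]
          fsid = <cluster fsid>
          mon host = <mon addresses>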

The ceph-osd command gives "osd.1 0 OSD::init() : unable to read osd superblock", and the subsequent output indicates that this is due to checksum errors. So ignore checksum mismatches and try again:

  # CEPH_ARGS="--bluestore-ignore-data-csum" ceph-osd -d --cluster ceph --id 1

which results in "osd.1 0 failed to load OSD map for epoch 4372, got 0 bytes". The monitors are at 4378, as per:

  # ceph osd stat
  2 osds: 0 up (since 47h), 1 in (since 47h); epoch: e4378

Is there any way to get past this? For instance, could I coax the OSDs into epoch 4378? This is the first time I've dealt with a ceph disaster, so there may be all kinds of obvious things I'm missing.

// Best wishes; Johan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


