Re: Power failure recovery woes

Some additional information/questions:

Here is the output of "ceph osd tree"

Some of the "down" OSDs are actually running, but still show as "down". For example, osd.1:

root 30158 8.6 12.7 1542860 781288 ? Ssl 07:47 4:40 /usr/bin/ceph-osd --cluster=ceph -i 0 -f

Is there any way to get the cluster to recognize them as up? osd.1 is the one with the "FAILED assert(last_e.version.version < e.version.version)" errors.
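For reference, the sort of checks that might narrow this down, as a sketch (the OSD ID and log path here are assumptions; adjust for your cluster):

```shell
# An OSD process can be running while the monitors still mark it down,
# e.g. if it crashes on the failing assert before it finishes booting.
ceph osd tree                              # which OSDs do the monitors see as down?
ceph daemon osd.1 status                   # ask the running daemon for its own view
tail -n 50 /var/log/ceph/ceph-osd.1.log    # look for the assert in the startup log
ceph osd in 1                              # once it reports up, mark it back in
```

If the daemon is looping on the PGLog assert, it will never report "up" no matter how the map is nudged, so the log tail is the first thing worth checking.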

Thanks,
             Jeff


# id    weight  type name       up/down reweight
-1      10.22   root default
-2      2.72            host ceph1
0       0.91                    osd.0   up      1
1       0.91                    osd.1   down    0
2       0.9                     osd.2   down    0
-3      1.82            host ceph2
3       0.91                    osd.3   down    0
4       0.91                    osd.4   down    0
-4      2.04            host ceph3
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   down    0
-5      1.82            host ceph4
9       0.91                    osd.9   up      1
10      0.91                    osd.10  down    0
-6      1.82            host ceph5
11      0.91                    osd.11  up      1
12      0.91                    osd.12  up      1

On 2/17/2015 8:28 AM, Jeff wrote:


-------- Original Message --------
Subject: Re:  Power failure recovery woes
Date: 2015-02-17 04:23
From: Udo Lembke <ulembke@xxxxxxxxxxxx>
To: Jeff <jeff@xxxxxxxxxxxxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx

Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the OSD and then start it with
service ceph start osd.2
?
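Spelled out as commands, something like the following (the device name is a placeholder; substitute the actual data partition for osd.2):

```shell
# Check whether the OSD data directory is mounted, and mount it if not.
# /dev/sdb1 is an assumed placeholder for osd.2's data partition.
mountpoint -q /var/lib/ceph/osd/ceph-2 || mount /dev/sdb1 /var/lib/ceph/osd/ceph-2
service ceph start osd.2
```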

Udo

On 17.02.2015 09:54, Jeff wrote:
Hi,

We had a nasty power failure yesterday, and even with UPSes our small
(5-node, 12-OSD) cluster is having problems recovering.

We are running ceph 0.87

3 of our OSDs are down consistently (others stop and are restartable,
but our cluster is so slow that almost everything we do times out).

We are seeing errors like this on the OSDs that never start:

    ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
Operation not permitted
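An EPERM during store conversion can come from the data directory not being mounted, or from wrong ownership/permissions on it; a quick sanity check might look like this (the path is from the error above, the rest is generic):

```shell
# Verify the OSD data directory is mounted, and owned/writable as expected.
mount | grep ceph-2                # is the partition mounted at all?
ls -ld /var/lib/ceph/osd/ceph-2    # check ownership and permissions
# Can the daemon's user actually write there? (hypothetical scratch file)
touch /var/lib/ceph/osd/ceph-2/.rwtest && rm /var/lib/ceph/osd/ceph-2/.rwtest
```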

We are seeing errors like these on the OSDs that run some of the time:

    osd/PGLog.cc: 844: FAILED assert(last_e.version.version <
e.version.version)
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
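The "hit suicide timeout" assert means the OSD's internal heartbeat watchdog killed a worker thread that was stuck too long, which can happen when a recovering cluster is badly overloaded rather than truly broken. One commonly suggested stopgap while recovery churns is to raise the OSD thread timeouts in ceph.conf (a sketch only; the values here are illustrative assumptions, not recommendations):

```
[osd]
osd op thread timeout = 60
osd op thread suicide timeout = 300
```

This only buys slow threads more time before the watchdog fires; it does not address the underlying PGLog assert.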

Does anyone have any suggestions on how to recover our cluster?

Thanks!
          Jeff


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

