Update #3: Our OSDs all crashed at the same time. Logs are all showing
this: http://pastebin.com/ns0McteE


On Mon, Aug 11, 2014 at 6:40 PM, Jacob Godin <jacobgodin at gmail.com> wrote:

> We were able to get the cluster back online. The issue stemmed from the
> MON having a lower epoch than the OSDs.
>
> We used ceph osd thrash to bring the MON's epoch up to be >= that of the
> OSDs, restarted the OSD processes, and they began cooperating again.
>
> After they completed syncing, we're now running into an issue with some
> rbd images. They show up in 'rbd ls', but when trying to do anything with
> them, we see the following:
> rbd: error opening image
> 2014-08-11 21:39:37.146976 7feac4127780 -1 librbd::ImageCtx: error finding header: (6) No such device or address
> IMAGE-XXX: (6) No such device or address
>
>
> On Mon, Aug 11, 2014 at 3:43 PM, Jacob Godin <jacobgodin at gmail.com> wrote:
>
>> Hi there,
>>
>> We're currently having an issue with a Cuttlefish cluster with 3 OSDs
>> and 1 MON. When we tried to restart an OSD, the cluster became
>> unresponsive to 'rbd export'. Here are some sample OSD logs:
>>
>> OSD we restarted - http://pastebin.com/UUuDdS1V
>> Another OSD - http://pastebin.com/f12r4W2s
>>
>> In an attempt to get things back online, we tried restarting the entire
>> cluster. We're now seeing these errors on all three OSDs:
>> 2014-08-11 18:35:24.118737 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42246 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:29.925865 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60606 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>> 2014-08-11 18:35:39.119564 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42253 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:44.926511 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60613 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>> 2014-08-11 18:35:54.120391 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42259 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:59.927252 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60619 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>>
>> ceph health:
>>    health HEALTH_WARN 6 pgs backfill; 6 pgs backfill_toofull; 3 pgs backfilling; 38 pgs degraded; 859 pgs stale; 859 pgs stuck stale; 47 pgs stuck unclean; recovery 60081/1241780 degraded (4.838%); 1 near full osd(s)
>>    monmap e18: 1 mons at {04=10.100.100.1:6789/0}, election epoch 1, quorum 0 04
>>    osdmap e16752: 4 osds: 2 up, 2 in
>>    pgmap v7355946: 2515 pgs: 1647 active+clean, 6 active+remapped+wait_backfill+backfill_toofull, 821 stale+active+clean, 3 active+remapped+backfilling, 38 stale+active+degraded+remapped; 3421 GB data, 4855 GB used, 630 GB / 5485 GB avail; 60081/1241780 degraded (4.838%)
>>    mdsmap e1: 0/0/1 up
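
For context on the "wrong node!" lines above: "claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955" means the daemon now listening on that ip:port has a different nonce (the number after the slash, set when the process starts) than the one recorded in the osdmap the connecting OSD is using. In other words, the peer restarted, and the connecting daemon is still dialing the old incarnation from a stale map. A quick way to check, sketched with the standard CLI (not taken from the thread), is to compare the map's addresses against the log lines:

    # Show the ip:port/nonce the current osdmap advertises for each OSD;
    # if these differ from the addresses in the "wrong node!" messages,
    # the logging daemon is still working from an older map.
    ceph osd dump | grep '^osd'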
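The epoch workaround described in the 6:40 PM message could look roughly like the following minimal sketch; the epoch count (50) is illustrative and the sysvinit-style service command is an assumption for a Cuttlefish-era deployment, neither is from the thread:

    # CAUTION: 'ceph osd thrash' marks random OSDs down/out repeatedly in
    # order to generate new osdmap epochs; only run it deliberately.
    ceph osd dump | head -1      # note the current osdmap epoch
    ceph osd thrash 50           # 50 is illustrative; advance past the OSDs' epoch
    service ceph restart osd     # restart the OSD daemons once the epoch has caught up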
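On the rbd open failure: "(6) No such device or address" (ENXIO) from librbd's "error finding header" means the image's header object could not be located, even though the name still appears in 'rbd ls'. One hedged way to confirm, assuming format 1 images (the Cuttlefish default, whose header object is named "<image>.rbd") in the default 'rbd' pool, with IMAGE-XXX standing in for the real image name:

    # Does the image's header object still exist in the pool?
    rados -p rbd stat IMAGE-XXX.rbd
    # List the header objects of all format 1 images in the pool.
    rados -p rbd ls | grep '\.rbd$'

If the header object is missing but the image's data objects (prefixed rb.0.*) are still present, the data itself may remain recoverable from those objects.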