OSD Issue

We were able to get the cluster back online. The issue stemmed from the MON
having a lower epoch than the OSDs.

We used 'ceph osd thrash' to bring the MON's osdmap epoch up to at least that
of the OSDs, restarted the OSD processes, and they began cooperating again.
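
For reference, a rough sketch of the commands involved (the iteration count is
arbitrary, and the restart line assumes the sysvinit scripts Cuttlefish ships
with):

  # osdmap epoch as the MON sees it; the first line of the dump is "epoch NNNN"
  ceph osd dump | head -1
  # 'ceph osd thrash N' randomly marks OSDs down/out for N iterations,
  # which generates new osdmap epochs as a side effect
  ceph osd thrash 50
  # restart the OSD daemons so they pick up the newer map
  service ceph restart osd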

After they completed syncing, we're now running into an issue with some rbd
images. They show up in 'rbd ls', but when trying to do anything with them,
we see the following:
2014-08-11 21:39:37.146976 7feac4127780 -1 librbd::ImageCtx: error finding header: (6) No such device or address
rbd: error opening image IMAGE-XXX: (6) No such device or address
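
If it helps narrow this down, this is the kind of check we can run against one
of the affected images (a rough sketch, assuming format-1 images in the default
'rbd' pool, which is what Cuttlefish creates unless told otherwise; IMAGE-XXX
stands in for a real image name):

  rbd info IMAGE-XXX
  # a format-1 image keeps its header in an object named "<image>.rbd" in the same pool
  rados -p rbd stat IMAGE-XXX.rbd
  # a healthy format-1 header object should begin with the text
  # "<<< Rados Block Device Image >>>"
  rados -p rbd get IMAGE-XXX.rbd /tmp/IMAGE-XXX.hdr && hexdump -C /tmp/IMAGE-XXX.hdr | head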


On Mon, Aug 11, 2014 at 3:43 PM, Jacob Godin <jacobgodin at gmail.com> wrote:

> Hi there,
>
> Currently having an issue with a Cuttlefish cluster w/ 3 OSDs and 1 MON.
> When trying to restart an OSD, the cluster became unresponsive to 'rbd
> export'. Here are some sample OSD logs:
>
> OSD we restarted - http://pastebin.com/UUuDdS1V
> Another OSD - http://pastebin.com/f12r4W2s
>
> In an attempt to get things back online, we tried restarting the entire
> cluster. We're now seeing these errors over all three OSDs:
> 2014-08-11 18:35:24.118737 7f9dbe3ed700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42246 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6808/1344 not
> 10.100.250.1:6808/12955 - wrong node!
> 2014-08-11 18:35:29.925865 7f9dc23fc700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60606 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6802/5205 not
> 10.100.250.1:6802/12408 - wrong node!
> 2014-08-11 18:35:39.119564 7f9dbe3ed700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42253 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6808/1344 not
> 10.100.250.1:6808/12955 - wrong node!
> 2014-08-11 18:35:44.926511 7f9dc23fc700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60613 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6802/5205 not
> 10.100.250.1:6802/12408 - wrong node!
> 2014-08-11 18:35:54.120391 7f9dbe3ed700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42259 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6808/1344 not
> 10.100.250.1:6808/12955 - wrong node!
> 2014-08-11 18:35:59.927252 7f9dc23fc700  0 -- 10.100.250.1:6806/31838 >>
> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60619 s=1 pgs=0 cs=0
> l=0).connect claims to be 10.100.250.1:6802/5205 not
> 10.100.250.1:6802/12408 - wrong node!
>
> ceph health:
>    health HEALTH_WARN 6 pgs backfill; 6 pgs backfill_toofull; 3 pgs
> backfilling; 38 pgs degraded; 859 pgs stale; 859 pgs stuck stale; 47 pgs
> stuck unclean; recovery 60081/1241780 degraded (4.838%); 1 near full osd(s)
>    monmap e18: 1 mons at {04=10.100.100.1:6789/0}, election epoch 1,
> quorum 0 04
>    osdmap e16752: 4 osds: 2 up, 2 in
>     pgmap v7355946: 2515 pgs: 1647 active+clean, 6
> active+remapped+wait_backfill+backfill_toofull, 821 stale+active+clean, 3
> active+remapped+backfilling, 38 stale+active+degraded+remapped; 3421 GB
> data, 4855 GB used, 630 GB / 5485 GB avail; 60081/1241780 degraded (4.838%)
>    mdsmap e1: 0/0/1 up
>
>
>