Hi there,

We're currently having an issue with a Cuttlefish cluster with 3 OSDs and 1 MON. When we tried to restart an OSD, the cluster became unresponsive to 'rbd export'. Here are some sample OSD logs:

OSD we restarted - http://pastebin.com/UUuDdS1V
Another OSD - http://pastebin.com/f12r4W2s

In an attempt to get things back online, we tried restarting the entire cluster. We're now seeing these errors across all three OSDs:

2014-08-11 18:35:24.118737 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42246 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
2014-08-11 18:35:29.925865 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60606 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
2014-08-11 18:35:39.119564 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42253 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
2014-08-11 18:35:44.926511 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60613 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
2014-08-11 18:35:54.120391 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42259 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
2014-08-11 18:35:59.927252 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60619 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!

ceph health:

health HEALTH_WARN 6 pgs backfill; 6 pgs backfill_toofull; 3 pgs backfilling; 38 pgs degraded; 859 pgs stale; 859 pgs stuck stale; 47 pgs stuck unclean; recovery 60081/1241780 degraded (4.838%); 1 near full osd(s)
monmap e18: 1 mons at {04=10.100.100.1:6789/0}, election epoch 1, quorum 0 04
osdmap e16752: 4 osds: 2 up, 2 in
pgmap v7355946: 2515 pgs: 1647 active+clean, 6 active+remapped+wait_backfill+backfill_toofull, 821 stale+active+clean, 3 active+remapped+backfilling, 38 stale+active+degraded+remapped; 3421 GB data, 4855 GB used, 630 GB / 5485 GB avail; 60081/1241780 degraded (4.838%)
mdsmap e1: 0/0/1 up
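
For reference, the restarts were issued with the standard init script (assuming a sysvinit-managed Cuttlefish install; the OSD id below is just an example), roughly:

    # restart a single OSD daemon on its host (osd id is illustrative)
    service ceph restart osd.2

    # restart every daemon listed in ceph.conf, cluster-wide
    service ceph -a restart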