Update #3: Our OSDs all crashed at the same time. Logs are all showing
this: http://pastebin.com/ns0McteE


On Mon, Aug 11, 2014 at 6:40 PM, Jacob Godin <jacobgodin at gmail.com> wrote:

> We were able to get the cluster back online. The issue stemmed from the
> MON having a lower epoch than the OSDs.
>
> We used ceph osd thrash to bring the MON's epoch up to be >= that of the
> OSDs, restarted the OSD processes, and they began cooperating again.
>
> After they completed syncing, we're now running into an issue with some
> rbd images. They show up in 'rbd ls', but when trying to do anything with
> them, we see the following:
> rbd: error opening image
> 2014-08-11 21:39:37.146976 7feac4127780 -1 librbd::ImageCtx: error finding header: (6) No such device or address
> IMAGE-XXX: (6) No such device or address
>
>
> On Mon, Aug 11, 2014 at 3:43 PM, Jacob Godin <jacobgodin at gmail.com> wrote:
>
>> Hi there,
>>
>> We're currently having an issue with a Cuttlefish cluster with 3 OSDs
>> and 1 MON. When we tried to restart an OSD, the cluster became
>> unresponsive to 'rbd export'. Here are some sample OSD logs:
>>
>> OSD we restarted - http://pastebin.com/UUuDdS1V
>> Another OSD - http://pastebin.com/f12r4W2s
>>
>> In an attempt to get things back online, we tried restarting the entire
>> cluster. We're now seeing these errors on all three OSDs:
>> 2014-08-11 18:35:24.118737 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42246 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:29.925865 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60606 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>> 2014-08-11 18:35:39.119564 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42253 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:44.926511 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60613 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>> 2014-08-11 18:35:54.120391 7f9dbe3ed700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6808/12955 pipe(0x1d07a00 sd=139 :42259 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955 - wrong node!
>> 2014-08-11 18:35:59.927252 7f9dc23fc700 0 -- 10.100.250.1:6806/31838 >> 10.100.250.1:6802/12408 pipe(0x1d07500 sd=140 :60619 s=1 pgs=0 cs=0 l=0).connect claims to be 10.100.250.1:6802/5205 not 10.100.250.1:6802/12408 - wrong node!
>>
>> ceph health:
>>    health HEALTH_WARN 6 pgs backfill; 6 pgs backfill_toofull; 3 pgs backfilling; 38 pgs degraded; 859 pgs stale; 859 pgs stuck stale; 47 pgs stuck unclean; recovery 60081/1241780 degraded (4.838%); 1 near full osd(s)
>>    monmap e18: 1 mons at {04=10.100.100.1:6789/0}, election epoch 1, quorum 0 04
>>    osdmap e16752: 4 osds: 2 up, 2 in
>>    pgmap v7355946: 2515 pgs: 1647 active+clean, 6 active+remapped+wait_backfill+backfill_toofull, 821 stale+active+clean, 3 active+remapped+backfilling, 38 stale+active+degraded+remapped; 3421 GB data, 4855 GB used, 630 GB / 5485 GB avail; 60081/1241780 degraded (4.838%)
>>    mdsmap e1: 0/0/1 up
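
For context on the "wrong node!" lines above: "claims to be 10.100.250.1:6808/1344 not 10.100.250.1:6808/12955" means the daemon now listening on that ip:port has a different nonce (the number after the slash, set when the process starts) than the one recorded in the osdmap the connecting OSD is using. In other words, the peer restarted, and the connecting daemon is still dialing the old incarnation from a stale map. A quick way to check, sketched with the standard CLI (not taken from the thread), is to compare the map's addresses against the log lines:

    # Show the ip:port/nonce the current osdmap advertises for each OSD;
    # if these differ from the addresses in the "wrong node!" messages,
    # the logging daemon is still working from an older map.
    ceph osd dump | grep '^osd'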
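The epoch workaround described in the 6:40 PM message could look roughly like the following minimal sketch; the epoch count (50) is illustrative and the sysvinit-style service command is an assumption for a Cuttlefish-era deployment, neither is from the thread:

    # CAUTION: 'ceph osd thrash' marks random OSDs down/out repeatedly in
    # order to generate new osdmap epochs; only run it deliberately.
    ceph osd dump | head -1      # note the current osdmap epoch
    ceph osd thrash 50           # 50 is illustrative; advance past the OSDs' epoch
    service ceph restart osd     # restart the OSD daemons once the epoch has caught up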
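On the rbd open failure: "(6) No such device or address" (ENXIO) from librbd's "error finding header" means the image's header object could not be located, even though the name still appears in 'rbd ls'. One hedged way to confirm, assuming format 1 images (the Cuttlefish default, whose header object is named "<image>.rbd") in the default 'rbd' pool, with IMAGE-XXX standing in for the real image name:

    # Does the image's header object still exist in the pool?
    rados -p rbd stat IMAGE-XXX.rbd
    # List the header objects of all format 1 images in the pool.
    rados -p rbd ls | grep '\.rbd$'

If the header object is missing but the image's data objects (prefixed rb.0.*) are still present, the data itself may remain recoverable from those objects.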