Re: Upgrade to Infernalis: OSDs crash all the time

Hi again,

I'm starting to feel really unlucky here...

At the moment, the situation is "sort of okay":

                1387 active+clean
                  11 active+clean+inconsistent
                   7 active+recovery_wait+degraded
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill
                   1 active+undersized+degraded+remapped+inconsistent+backfilling

To make sure nothing gets in the way, I have disabled both scrubbing and deep scrubbing for the time being.
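
For reference, that comes down to the usual cluster-wide flags:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

(to be reverted later with "ceph osd unset noscrub" and "ceph osd unset nodeep-scrub").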

However, random OSDs (still on Hammer) keep crashing with the error mentioned earlier (osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)).

It felt like they started crashing when hitting the PG that is currently backfilling, so I set the nobackfill flag.
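
In other words:

    ceph osd set nobackfill

(to be lifted again with "ceph osd unset nobackfill" once things look stable).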

For now, the crashing seems to have stopped. However, the cluster seems slow at the moment when trying to access the given PG via KVM/QEMU (RBD).

Recap:
  • All monitors run Infernalis.
  • One OSD node runs Infernalis.
  • All other OSD nodes run Hammer.
  • One OSD on Infernalis is set to "out" and is stopped (rough commands below the list). This OSD seemed to contain one inconsistent PG.
  • Backfilling started.
  • After hours and hours of backfilling, OSDs started to crash.
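
For completeness: marking that OSD out and stopping it comes down to something like the following, where <id> is a placeholder for the OSD number and the stop command depends on the init system on that node:

    ceph osd out <id>
    systemctl stop ceph-osd@<id>     # or: stop ceph-osd id=<id> (upstart)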

Other than restarting the OSD that is marked "out" and stopped (which I haven't tried yet), I'm quite lost.
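
Bringing it back would presumably be the reverse, something along these lines (again with <id> as a placeholder):

    systemctl start ceph-osd@<id>    # or: start ceph-osd id=<id> (upstart)
    ceph osd in <id>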

Hopefully someone has some pointers for me.

Regards,
Kees

On 20-08-18 13:23, Kees Meijs wrote:
The given PG is back online, phew...

Meanwhile, some OSDs still on Hammer seem to crash with errors like the following:

2018-08-20 13:06:33.819569 7f8962b2f700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f8962b2f700 time 2018-08-20 13:06:33.709922
osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)

Restarting the OSDs seems to work.

K.

On 20-08-18 13:14, Kees Meijs wrote:
Bad news: I've got a PG stuck in down+peering now.

    

