Re: Upgrade to Infernalis: OSDs crash all the time

Hi again,

I'm starting to feel really unlucky here...

At the moment, the situation is "sort of okay":

                1387 active+clean
                  11 active+clean+inconsistent
                   7 active+recovery_wait+degraded
                   1 active+recovery_wait+undersized+degraded+remapped
                   1 active+undersized+degraded+remapped+wait_backfill
                   1 active+undersized+degraded+remapped+inconsistent+backfilling

To make sure nothing gets in the way, I have disabled both scrubbing and deep scrubbing for the time being.
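
For reference, that comes down to the usual cluster-wide flags:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

(to be reverted later with "ceph osd unset noscrub" and "ceph osd unset nodeep-scrub").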

However, random OSDs (still on Hammer) keep crashing with the error mentioned earlier (osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)).

It felt like they started crashing when hitting the PG that is currently backfilling, so I set the nobackfill flag.
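
In other words:

    ceph osd set nobackfill

(to be lifted again with "ceph osd unset nobackfill" once things look stable).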

For now, the crashing seems to have stopped. However, the cluster seems slow at the moment when trying to access the given PG via KVM/QEMU (RBD).

Recap:
  • All monitors run Infernalis.
  • One OSD node runs Infernalis.
  • All other OSD nodes run Hammer.
  • One OSD on Infernalis is set to "out" and is stopped (rough commands below the list). This OSD seemed to contain one inconsistent PG.
  • Backfilling started.
  • After hours and hours of backfilling, OSDs started to crash.
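
For completeness: marking that OSD out and stopping it comes down to something like the following, where <id> is a placeholder for the OSD number and the stop command depends on the init system on that node:

    ceph osd out <id>
    systemctl stop ceph-osd@<id>     # or: stop ceph-osd id=<id> (upstart)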

Other than restarting the OSD that is marked "out" and stopped (which I haven't tried yet), I'm quite lost.
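
Bringing it back would presumably be the reverse, something along these lines (again with <id> as a placeholder):

    systemctl start ceph-osd@<id>    # or: start ceph-osd id=<id> (upstart)
    ceph osd in <id>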

Hopefully someone has some pointers for me.

Regards,
Kees

On 20-08-18 13:23, Kees Meijs wrote:
The given PG is back online, phew...

Meanwhile, some OSDs still on Hammer seem to crash with errors like the following:

2018-08-20 13:06:33.819569 7f8962b2f700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f8962b2f700 time 2018-08-20 13:06:33.709922
osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)

Restarting the OSDs seems to work.

K.

On 20-08-18 13:14, Kees Meijs wrote:
Bad news: I've got a PG stuck in down+peering now.

    

