Hi again,

I'm starting to feel really unlucky here... At the moment, the situation is "sort of okay":

1387 active+clean

To ensure nothing is in the way, I disabled both scrubbing and deep scrubbing for the time being. However, random OSDs (still on Hammer) keep crashing with the error mentioned earlier (osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)). It felt like they started crashing when hitting the PG currently being backfilled, so I set the nobackfill flag as well. For now, the crashing seems to have stopped. However, the cluster seems slow whenever the affected PG is accessed via KVM/QEMU (RBD).
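For reference, the flags were set roughly like this (a sketch using the standard Ceph CLI; "unset" reverts each flag):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd set nobackfill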
Recap: other than restarting the "out" and stopped OSD (which I haven't tried yet), I'm quite lost. Hopefully someone has some pointers for me.
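If it comes to that, I'd bring the OSD back with something along these lines (Hammer-era init scripts; osd.12 is just a placeholder and the exact command depends on the init system):

  service ceph start osd.12    # sysvinit; upstart would be: start ceph-osd id=12
  ceph osd in 12               # mark the OSD "in" again so it receives PGs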
Regards,

On 20-08-18 13:23, Kees Meijs wrote:
> The given PG is back online, phew... Meanwhile, some OSDs still on Hammer seem to crash with errors like:
>
> 2018-08-20 13:06:33.819569 7f8962b2f700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f8962b2f700 time 2018-08-20 13:06:33.709922
> osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)
>
> Restarting the OSDs seems to work.
>
> K.
>
> On 20-08-18 13:14, Kees Meijs wrote:
>> Bad news: I've got a PG stuck in down+peering now.
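(Side note in case someone hits the same thing: a PG stuck like that can be located and inspected with the usual commands, e.g.

  ceph health detail
  ceph pg <pgid> query

where <pgid> is the PG in question.)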