Hi,

we are experiencing massive problems with our Ceph setup. After we started a "pg repair" because of scrub errors, OSDs began to crash, and we have not been able to stop the crashes so far. We are running Ceph 12.2.4; the crashed OSDs include both BlueStore and FileStore OSDs. Our cluster currently looks like this:

# ceph -s
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            1 osds down
            73005/17149710 objects misplaced (0.426%)
            5 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs down
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 611518/17149710 objects degraded (3.566%), 86 pgs degraded, 86 pgs undersized

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head3(active), standbys: head2, head1
    osd: 34 osds: 24 up, 25 in; 18 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5582k objects, 19500 GB
    usage:   62030 GB used, 31426 GB / 93456 GB avail
    pgs:     0.260% pgs not active
             611518/17149710 objects degraded (3.566%)
             73005/17149710 objects misplaced (0.426%)
             670 active+clean
             75  active+undersized+degraded
             8   active+undersized+degraded+remapped+backfill_wait
             8   active+clean+remapped
             2   down
             2   active+undersized+degraded+remapped+backfilling
             2   active+clean+scrubbing+deep
             1   active+undersized+degraded+inconsistent

  io:
    client:   10911 B/s rd, 118 kB/s wr, 0 op/s rd, 54 op/s wr
    recovery: 31575 kB/s, 8 objects/s

# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       124.07297 root default
 -2        29.08960     host ceph1
  0   hdd   3.63620         osd.0      up  1.00000 1.00000
  1   hdd   3.63620         osd.1    down        0 1.00000
  2   hdd   3.63620         osd.2      up  1.00000 1.00000
  3   hdd   3.63620         osd.3      up  1.00000 1.00000
  4   hdd   3.63620         osd.4    down        0 1.00000
  5   hdd   3.63620         osd.5    down        0 1.00000
  6   hdd   3.63620         osd.6      up  1.00000 1.00000
  7   hdd   3.63620         osd.7      up  1.00000 1.00000
 -3         7.27240     host ceph2
 14   hdd   3.63620         osd.14     up  1.00000 1.00000
 15   hdd   3.63620         osd.15     up  1.00000 1.00000
 -4        29.11258     host ceph3
 16   hdd   3.63620         osd.16     up  1.00000 1.00000
 18   hdd   3.63620         osd.18   down        0 1.00000
 19   hdd   3.63620         osd.19   down        0 1.00000
 20   hdd   3.65749         osd.20     up  1.00000 1.00000
 21   hdd   3.63620         osd.21     up  1.00000 1.00000
 22   hdd   3.63620         osd.22     up  1.00000 1.00000
 23   hdd   3.63620         osd.23     up  1.00000 1.00000
 24   hdd   3.63789         osd.24   down        0 1.00000
 -9        29.29919     host ceph4
 17   hdd   3.66240         osd.17     up  1.00000 1.00000
 25   hdd   3.66240         osd.25     up  1.00000 1.00000
 26   hdd   3.66240         osd.26   down        0 1.00000
 27   hdd   3.66240         osd.27     up  1.00000 1.00000
 28   hdd   3.66240         osd.28   down        0 1.00000
 29   hdd   3.66240         osd.29     up  1.00000 1.00000
 30   hdd   3.66240         osd.30     up  1.00000 1.00000
 31   hdd   3.66240         osd.31   down        0 1.00000
-11        29.29919     host ceph5
 32   hdd   3.66240         osd.32     up  1.00000 1.00000
 33   hdd   3.66240         osd.33     up  1.00000 1.00000
 34   hdd   3.66240         osd.34     up  1.00000 1.00000
 35   hdd   3.66240         osd.35     up  1.00000 1.00000
 36   hdd   3.66240         osd.36   down  1.00000 1.00000
 37   hdd   3.66240         osd.37     up  1.00000 1.00000
 38   hdd   3.66240         osd.38     up  1.00000 1.00000
 39   hdd   3.66240         osd.39     up  1.00000 1.00000

The last OSDs that crashed are #28 and #36. Please find the corresponding log files here:

http://af.janno.io/ceph/ceph-osd.28.log.1.gz
http://af.janno.io/ceph/ceph-osd.36.log.1.gz

The backtraces look almost the same for all crashed OSDs. Any help, hint or advice would be greatly appreciated. Please let me know if you need any further information.
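For completeness, the repair mentioned above was started roughly along these lines; <pgid> is only a placeholder for the inconsistent PG reported by "ceph health detail", not the actual id, and the exact invocation may have differed slightly:

# ceph health detail | grep inconsistent
# rados list-inconsistent-obj <pgid> --format=json-pretty
# ceph pg repair <pgid>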
Best Regards
Jan

--
Artfiles New Media GmbH | Zirkusweg 1 | 20359 Hamburg
Tel: 040 - 32 02 72 90 | Fax: 040 - 32 02 72 95
E-Mail: support@xxxxxxxxxxx | Web: http://www.artfiles.de
Managing Directors: Harald Oltmanns | Tim Evers
Registered in the Hamburg Commercial Register - HRB 81478