Hi,

we are experiencing massive problems with our Ceph setup. After we started a "pg repair" because of scrub errors, OSDs began to crash, and we have not been able to stop the crashes so far. We are running Ceph 12.2.4; the crashed OSDs include both BlueStore and FileStore OSDs. Our cluster currently looks like this:

# ceph -s
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            1 osds down
            73005/17149710 objects misplaced (0.426%)
            5 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs down
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 611518/17149710 objects degraded (3.566%), 86 pgs degraded, 86 pgs undersized

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head3(active), standbys: head2, head1
    osd: 34 osds: 24 up, 25 in; 18 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5582k objects, 19500 GB
    usage:   62030 GB used, 31426 GB / 93456 GB avail
    pgs:     0.260% pgs not active
             611518/17149710 objects degraded (3.566%)
             73005/17149710 objects misplaced (0.426%)
             670 active+clean
             75  active+undersized+degraded
             8   active+undersized+degraded+remapped+backfill_wait
             8   active+clean+remapped
             2   down
             2   active+undersized+degraded+remapped+backfilling
             2   active+clean+scrubbing+deep
             1   active+undersized+degraded+inconsistent

  io:
    client:   10911 B/s rd, 118 kB/s wr, 0 op/s rd, 54 op/s wr
    recovery: 31575 kB/s, 8 objects/s

# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       124.07297 root default
 -2        29.08960     host ceph1
  0   hdd   3.63620         osd.0      up  1.00000 1.00000
  1   hdd   3.63620         osd.1    down        0 1.00000
  2   hdd   3.63620         osd.2      up  1.00000 1.00000
  3   hdd   3.63620         osd.3      up  1.00000 1.00000
  4   hdd   3.63620         osd.4    down        0 1.00000
  5   hdd   3.63620         osd.5    down        0 1.00000
  6   hdd   3.63620         osd.6      up  1.00000 1.00000
  7   hdd   3.63620         osd.7      up  1.00000 1.00000
 -3         7.27240     host ceph2
 14   hdd   3.63620         osd.14     up  1.00000 1.00000
 15   hdd   3.63620         osd.15     up  1.00000 1.00000
 -4        29.11258     host ceph3
 16   hdd   3.63620         osd.16     up  1.00000 1.00000
 18   hdd   3.63620         osd.18   down        0 1.00000
 19   hdd   3.63620         osd.19   down        0 1.00000
 20   hdd   3.65749         osd.20     up  1.00000 1.00000
 21   hdd   3.63620         osd.21     up  1.00000 1.00000
 22   hdd   3.63620         osd.22     up  1.00000 1.00000
 23   hdd   3.63620         osd.23     up  1.00000 1.00000
 24   hdd   3.63789         osd.24   down        0 1.00000
 -9        29.29919     host ceph4
 17   hdd   3.66240         osd.17     up  1.00000 1.00000
 25   hdd   3.66240         osd.25     up  1.00000 1.00000
 26   hdd   3.66240         osd.26   down        0 1.00000
 27   hdd   3.66240         osd.27     up  1.00000 1.00000
 28   hdd   3.66240         osd.28   down        0 1.00000
 29   hdd   3.66240         osd.29     up  1.00000 1.00000
 30   hdd   3.66240         osd.30     up  1.00000 1.00000
 31   hdd   3.66240         osd.31   down        0 1.00000
-11        29.29919     host ceph5
 32   hdd   3.66240         osd.32     up  1.00000 1.00000
 33   hdd   3.66240         osd.33     up  1.00000 1.00000
 34   hdd   3.66240         osd.34     up  1.00000 1.00000
 35   hdd   3.66240         osd.35     up  1.00000 1.00000
 36   hdd   3.66240         osd.36   down  1.00000 1.00000
 37   hdd   3.66240         osd.37     up  1.00000 1.00000
 38   hdd   3.66240         osd.38     up  1.00000 1.00000
 39   hdd   3.66240         osd.39     up  1.00000 1.00000

The last OSDs that crashed are #28 and #36. Please find the corresponding log files here:

http://af.janno.io/ceph/ceph-osd.28.log.1.gz
http://af.janno.io/ceph/ceph-osd.36.log.1.gz

The backtraces look almost the same for all crashed OSDs. Any help, hint or advice would be greatly appreciated. Please let me know if you need any further information.
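For completeness, the repair mentioned above was started roughly along these lines; <pgid> is only a placeholder for the inconsistent PG reported by "ceph health detail", not the actual id, and the exact invocation may have differed slightly:

# ceph health detail | grep inconsistent
# rados list-inconsistent-obj <pgid> --format=json-pretty
# ceph pg repair <pgid>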
Best Regards
Jan

--
Artfiles New Media GmbH | Zirkusweg 1 | 20359 Hamburg
Tel: 040 - 32 02 72 90 | Fax: 040 - 32 02 72 95
E-Mail: support@xxxxxxxxxxx | Web: http://www.artfiles.de
Managing Directors: Harald Oltmanns | Tim Evers
Registered in the Hamburg Commercial Register - HRB 81478