Re: Dying OSDs

What distribution and kernel are you running?

I recently found my cluster running the stock 3.10 CentOS kernel when I thought it was running the elrepo kernel. After forcing it to boot the correct kernel, my flapping OSD issue went away.
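
If you're not sure, something like this should tell you (assuming a CentOS 7 box with the elrepo kernel installed as a package):

# uname -r                  # kernel actually running right now
# rpm -qa 'kernel*'         # kernel packages installed
# grubby --default-kernel   # kernel GRUB will boot next time

If uname still reports a 3.10.0-* kernel while grubby points at the elrepo one, the box isn't booting what you think it is.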

On Tue, Apr 10, 2018, 2:18 AM Jan Marquardt <jm@xxxxxxxxxxx> wrote:
Hi,

we are experiencing massive problems with our Ceph setup. After starting
a "repair pg" because of scrub errors, OSDs started to crash, and we have
not been able to stop this so far. We are running Ceph 12.2.4. The crashed
OSDs are both BlueStore and FileStore.
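
For reference, the repair itself was the standard "ceph pg repair"; a typical way to inspect the scrub errors beforehand would be something like:

# ceph health detail                   # lists the inconsistent PG(s)
# rados list-inconsistent-obj <pgid>   # shows the objects behind the scrub errors
# ceph pg repair <pgid>                # triggers the repair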

Our cluster currently looks like this:

# ceph -s
  cluster:
    id:     c59e56df-2043-4c92-9492-25f05f268d9f
    health: HEALTH_ERR
            1 osds down
            73005/17149710 objects misplaced (0.426%)
            5 scrub errors
            Reduced data availability: 2 pgs inactive, 2 pgs down
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 611518/17149710 objects degraded
(3.566%), 86 pgs degraded, 86 pgs undersized

  services:
    mon: 3 daemons, quorum head1,head2,head3
    mgr: head3(active), standbys: head2, head1
    osd: 34 osds: 24 up, 25 in; 18 remapped pgs

  data:
    pools:   1 pools, 768 pgs
    objects: 5582k objects, 19500 GB
    usage:   62030 GB used, 31426 GB / 93456 GB avail
    pgs:     0.260% pgs not active
             611518/17149710 objects degraded (3.566%)
             73005/17149710 objects misplaced (0.426%)
             670 active+clean
             75  active+undersized+degraded
             8   active+undersized+degraded+remapped+backfill_wait
             8   active+clean+remapped
             2   down
             2   active+undersized+degraded+remapped+backfilling
             2   active+clean+scrubbing+deep
             1   active+undersized+degraded+inconsistent

  io:
    client:   10911 B/s rd, 118 kB/s wr, 0 op/s rd, 54 op/s wr
    recovery: 31575 kB/s, 8 objects/s

# ceph osd tree
ID  CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       124.07297 root default
 -2        29.08960     host ceph1
  0   hdd   3.63620         osd.0      up  1.00000 1.00000
  1   hdd   3.63620         osd.1    down        0 1.00000
  2   hdd   3.63620         osd.2      up  1.00000 1.00000
  3   hdd   3.63620         osd.3      up  1.00000 1.00000
  4   hdd   3.63620         osd.4    down        0 1.00000
  5   hdd   3.63620         osd.5    down        0 1.00000
  6   hdd   3.63620         osd.6      up  1.00000 1.00000
  7   hdd   3.63620         osd.7      up  1.00000 1.00000
 -3         7.27240     host ceph2
 14   hdd   3.63620         osd.14     up  1.00000 1.00000
 15   hdd   3.63620         osd.15     up  1.00000 1.00000
 -4        29.11258     host ceph3
 16   hdd   3.63620         osd.16     up  1.00000 1.00000
 18   hdd   3.63620         osd.18   down        0 1.00000
 19   hdd   3.63620         osd.19   down        0 1.00000
 20   hdd   3.65749         osd.20     up  1.00000 1.00000
 21   hdd   3.63620         osd.21     up  1.00000 1.00000
 22   hdd   3.63620         osd.22     up  1.00000 1.00000
 23   hdd   3.63620         osd.23     up  1.00000 1.00000
 24   hdd   3.63789         osd.24   down        0 1.00000
 -9        29.29919     host ceph4
 17   hdd   3.66240         osd.17     up  1.00000 1.00000
 25   hdd   3.66240         osd.25     up  1.00000 1.00000
 26   hdd   3.66240         osd.26   down        0 1.00000
 27   hdd   3.66240         osd.27     up  1.00000 1.00000
 28   hdd   3.66240         osd.28   down        0 1.00000
 29   hdd   3.66240         osd.29     up  1.00000 1.00000
 30   hdd   3.66240         osd.30     up  1.00000 1.00000
 31   hdd   3.66240         osd.31   down        0 1.00000
-11        29.29919     host ceph5
 32   hdd   3.66240         osd.32     up  1.00000 1.00000
 33   hdd   3.66240         osd.33     up  1.00000 1.00000
 34   hdd   3.66240         osd.34     up  1.00000 1.00000
 35   hdd   3.66240         osd.35     up  1.00000 1.00000
 36   hdd   3.66240         osd.36   down  1.00000 1.00000
 37   hdd   3.66240         osd.37     up  1.00000 1.00000
 38   hdd   3.66240         osd.38     up  1.00000 1.00000
 39   hdd   3.66240         osd.39     up  1.00000 1.00000

The last OSDs that crashed are #28 and #36. Please find the
corresponding log files here:

http://af.janno.io/ceph/ceph-osd.28.log.1.gz
http://af.janno.io/ceph/ceph-osd.36.log.1.gz

The backtraces look almost the same for all crashed OSDs.
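In case it helps, the backtrace can be pulled straight out of the rotated
logs with something like (grepping for the usual assert / signal markers):

# zgrep -E -B 5 -A 40 'FAILED assert|Caught signal' ceph-osd.28.log.1.gz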

Any help, hint or advice would really be appreciated. Please let me know
if you need any further information.

Best Regards

Jan

--
Artfiles New Media GmbH | Zirkusweg 1 | 20359 Hamburg
Tel: 040 - 32 02 72 90 | Fax: 040 - 32 02 72 95
E-Mail: support@xxxxxxxxxxx | Web: http://www.artfiles.de
Geschäftsführer: Harald Oltmanns | Tim Evers
Eingetragen im Handelsregister Hamburg - HRB 81478
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
