Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)


 



Hello *,

we have an issue with a Luminous cluster (all 12.2.5, except one on 12.2.7) for RBD (OpenStack) and CephFS. This is the osd tree:

host1:~ # ceph osd tree
ID  CLASS WEIGHT   TYPE NAME         STATUS REWEIGHT PRI-AFF
 -1       22.57602 root default
 -4        1.81998     host host5
 14   hdd  0.90999         osd.14        up  0.84999 0.50000
 15   hdd  0.90999         osd.15        up  0.84999 0.50000
 -2        6.27341     host host1
  1   hdd  0.92429         osd.1         up  1.00000 1.00000
  4   hdd  0.92429         osd.4         up  1.00000 1.00000
  6   hdd  0.92429         osd.6         up  1.00000 1.00000
 13   hdd  0.92429         osd.13        up  1.00000 1.00000
 16   hdd  0.92429         osd.16        up  1.00000 1.00000
 18   hdd  0.92429         osd.18        up  1.00000 1.00000
 10   ssd  0.72769         osd.10        up  1.00000 1.00000
 -3        6.27341     host host2
  2   hdd  0.92429         osd.2         up  1.00000 1.00000
  5   hdd  0.92429         osd.5         up  1.00000 1.00000
  7   hdd  0.92429         osd.7         up  1.00000 1.00000
 12   hdd  0.92429         osd.12        up  1.00000 1.00000
 17   hdd  0.92429         osd.17        up  1.00000 1.00000
 19   hdd  0.92429         osd.19        up  1.00000 1.00000
  9   ssd  0.72769         osd.9         up  1.00000 1.00000
 -5        4.57043     host host3
  0   hdd  0.92429         osd.0         up  1.00000 1.00000
  3   hdd  0.92429         osd.3         up  1.00000 1.00000
  8   hdd  0.92429         osd.8         up  1.00000 1.00000
 11   hdd  0.92429         osd.11        up  1.00000 1.00000
 20   ssd  0.87329         osd.20        up  1.00000       0
-16        3.63879     host host4
 21   hdd  0.90970         osd.21        up  1.00000       0
 22   hdd  0.90970         osd.22        up  1.00000       0
 23   hdd  0.90970         osd.23        up  1.00000       0
 24   hdd  0.90970         osd.24        up  1.00000       0


A couple of weeks ago a new host was added to the cluster (host4), containing four bluestore OSDs (HDD) with block.db on LVM (SSD). All went well and the cluster was in HEALTH_OK state for some time.

Then suddenly we experienced flapping OSDs, first on host3 (MON, MGR, OSD) for a single OSD (osd.20 on SSD). Later host4 (OSD only) started flapping too, this time affecting all four OSDs (osd.21 to osd.24). It took two reboots to bring the node back up.

We found segfaults from safe_timer and were pretty sure that the cluster had been hit by [1], since it all sounded very much like our experience. That's why we upgraded the new host to 12.2.7; we held off on upgrading the other nodes in case other issues came up. Two days later the same host was flapping again, but without a segfault or any other trace of the cause. We started to assume that the segfault could be a result of the flapping, not the cause.

Since it seems impossible to predict the flapping, we don't have debug logs for those OSDs, and the usual logs don't reveal anything extraordinary. The cluster has been healthy again for 5 days now.

Then I found some clients (CephFS mounted for home directories and shared storage for compute nodes) reporting this multiple times:

---cut here---
[Mi Aug 22 10:31:33 2018] libceph: osd21 down
[Mi Aug 22 10:31:33 2018] libceph: osd22 down
[Mi Aug 22 10:31:33 2018] libceph: osd23 down
[Mi Aug 22 10:31:33 2018] libceph: osd24 down
[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x10000 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd21 up
[Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x10000 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd22 up
[Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x10000 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd24 up
[Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x10000 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd23 up
---cut here---

This output repeats about 20 times per OSD (except for osd20, which has only one occurrence). But there's no health warning, no trace of it in the logs, and no flapping (yet?), as if nothing had happened. Since these are the same OSDs that were affected by flapping, there has to be a connection, but I can't seem to find it.
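
To put numbers on "about 20 times per OSD", the libceph messages can be tallied per OSD. The following is only a rough sketch: it assumes the dmesg lines have exactly the format shown in the excerpt above, and the function name count_osd_events is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel-client lines such as
#   "libceph: osd21 down" or "libceph: osd21 weight 0x10000 (in)".
# Assumption: the message format is exactly as in the dmesg excerpt above.
LINE_RE = re.compile(
    r"libceph: osd(\d+) (down|up|weight 0x[0-9a-fA-F]+ \((in|out)\))"
)


def count_osd_events(lines):
    """Return {osd_id: Counter} with counts for down/up/in/out events."""
    counts = {}
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a libceph OSD state message
        osd_id = int(match.group(1))
        # For weight lines, use the "(in)"/"(out)" label as the event name.
        event = match.group(3) or match.group(2)
        counts.setdefault(osd_id, Counter())[event] += 1
    return counts


# Small demonstration on lines copied from the excerpt above.
sample = [
    "[Mi Aug 22 10:31:33 2018] libceph: osd21 down",
    "[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x0 (out)",
    "[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x10000 (in)",
    "[Mi Aug 22 10:31:33 2018] libceph: osd21 up",
]
for osd_id, events in sorted(count_osd_events(sample).items()):
    summary = ", ".join(f"{name}={n}" for name, n in sorted(events.items()))
    print(f"osd.{osd_id}: {summary}")
```

On a client, the same function could be fed the lines of saved dmesg output (for example from a file opened with open()), which makes it easier to compare the counts and timestamps against the cluster log for the same period.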

Why isn't there anything in the logs related to these dmesg events? Why would a client report OSDs down if they haven't been? We checked the disks for errors and searched for network issues, but found no hint of anything going wrong.

Can anyone shed some light on this? Can these client messages somehow affect the OSD/MON communication in such a way that the MON starts reporting OSDs down, too, after which the OSDs report themselves up and the flapping begins?
How can I find the cause of these reports?

If there's any more information I can provide, please let me know.

Any insights are highly appreciated!

Regards,
Eugen

[1] http://tracker.ceph.com/issues/23352



