OSD reboot loop after running out of memory

Hi,

We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the servers ran out of memory for unknown reasons (normally the machine uses about 60 of its 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop; the logs just show systemd seeing the daemon fail and restarting it. Since the out-of-memory incident, we've had 3 OSDs fail this way, each at a separate time. We resorted to wiping the affected OSD and re-adding it to the cluster, but it seems that as soon as all PGs have moved back onto it, the next OSD fails.
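
For reference, this is roughly how we have been pulling the logs and crash reports for the affected daemons (osd.12 and the <fsid>/<crash-id> placeholders are just examples, not our actual IDs):

    # Follow the systemd unit of the containerized OSD
    # (cephadm names the units ceph-<fsid>@osd.<id>.service)
    journalctl -u ceph-<fsid>@osd.12.service -f

    # Or have cephadm fetch the daemon's log directly
    cephadm logs --name osd.12

    # Check whether the failing daemon left any crash reports behind
    ceph crash ls
    ceph crash info <crash-id>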

This is also keeping us from re-deploying RGW, which was affected by the same out-of-memory incident, since cephadm runs a check and won't deploy the service unless the cluster is in HEALTH_OK status.
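
For completeness, the health state that the cephadm check trips over is just what the usual status commands report:

    # Overall state; anything other than HEALTH_OK blocks the RGW redeploy
    ceph status
    # Detailed list of the warnings/errors currently raised
    ceph health detail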

Any help would be greatly appreciated.

Thanks,
Stefan




