Dear all,

we are using a bunch of Kingston DC500M drives in our cluster for an all-flash 6+2 EC pool used as a data pool for RBD images. For quite a while now I have observed that these drives seem to stall for extended periods of time, sometimes to the extent that their OSDs are marked down. Here are the boot events for one day as an example (sorted by OSD, not date stamp):

2022-03-24 05:22:23.713736 mon.ceph-01 mon.0 192.168.32.65:6789/0 200841 : cluster [INF] osd.584 192.168.32.89:6814/4765 boot
2022-03-24 07:11:10.032115 mon.ceph-01 mon.0 192.168.32.65:6789/0 202319 : cluster [INF] osd.584 192.168.32.89:6814/4765 boot
2022-03-24 07:11:08.618319 mon.ceph-01 mon.0 192.168.32.65:6789/0 202315 : cluster [INF] osd.585 192.168.32.89:6810/4767 boot
2022-03-24 12:24:02.790395 mon.ceph-01 mon.0 192.168.32.65:6789/0 206344 : cluster [INF] osd.585 192.168.32.89:6810/4767 boot
2022-03-24 06:55:10.513353 mon.ceph-01 mon.0 192.168.32.65:6789/0 202062 : cluster [INF] osd.594 192.168.32.91:6802/272337 boot
2022-03-24 06:55:10.513303 mon.ceph-01 mon.0 192.168.32.65:6789/0 202061 : cluster [INF] osd.595 192.168.32.91:6804/272338 boot
2022-03-24 20:34:31.991914 mon.ceph-01 mon.0 192.168.32.65:6789/0 218334 : cluster [INF] osd.595 192.168.32.91:6804/272338 boot
2022-03-24 02:15:11.231804 mon.ceph-01 mon.0 192.168.32.65:6789/0 197965 : cluster [INF] osd.596 192.168.32.83:6829/4755 boot
2022-03-24 04:58:24.831549 mon.ceph-01 mon.0 192.168.32.65:6789/0 200555 : cluster [INF] osd.596 192.168.32.83:6829/4755 boot
2022-03-24 03:02:16.971836 mon.ceph-01 mon.0 192.168.32.65:6789/0 199130 : cluster [INF] osd.603 192.168.32.84:6814/4738 boot
2022-03-24 13:56:15.723368 mon.ceph-01 mon.0 192.168.32.65:6789/0 207508 : cluster [INF] osd.604 192.168.32.82:6806/4639 boot
2022-03-24 07:24:42.557331 mon.ceph-01 mon.0 192.168.32.65:6789/0 202530 : cluster [INF] osd.606 192.168.32.84:6831/4605 boot
2022-03-24 01:26:23.313526 mon.ceph-01 mon.0 192.168.32.65:6789/0 197079 : cluster [INF] osd.609 192.168.32.84:6817/4603 boot
2022-03-24 07:24:42.557288 mon.ceph-01 mon.0 192.168.32.65:6789/0 202529 : cluster [INF] osd.609 192.168.32.84:6817/4603 boot
2022-03-24 05:48:09.449210 mon.ceph-01 mon.0 192.168.32.65:6789/0 201169 : cluster [INF] osd.614 192.168.32.85:6826/4777 boot

We have 2 types of drives in this pool, Micron 5200Pro 1.92 TB (1 OSD per disk) and Kingston DC500M 3.84 TB (2 OSDs per disk). The above boot events are exclusively on Kingston drives.

After adding these drives, we didn't have any problems for a year or so. This started only recently, maybe 3-4 months ago. My guess is that it's because these drives are about half full by now and have probably gone through several full disk writes, and that the controller sometimes has problems flushing writes or allocating blocks for new writes.

Is anyone else using these drives? Has anyone else had a similar experience and found a way to solve it?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx