HDD spindown problem


 



Hello,

We have been fighting an HDD spin-down problem on our production Ceph cluster for two weeks now. The problem is not Ceph-related, but I guess the topic is interesting to the list, and to be honest I hope to find a solution here.

We use 6 OSD nodes, each configured as follows:
OS: Suse 12 SP3
Ceph: SES 5.5 (12.2.8)
Server: Supermicro 6048R-E1CR36L
Controller: LSI 3008 (LSI3008-IT)
Disk: 12x Seagate ST8000NM0055-1RM112 8TB (firmware SN05; some still on SN02 and SN04)
NVMe: 1x Intel DC P3700 800GB (holds an 80GB RocksDB and a 2GB WAL for each OSD; only 7 disks are online right now, and up to 9 disks will have their RocksDB/WAL on one NVMe SSD)


Problem:
This Ceph cluster is used for object storage (RadosGW) only, mostly for backups to S3. There is not much activity, and most of it happens at night. We do not want any HDD to spin down, but they do. We tried to disable the spin-down timers with sdparm and with the Seagate tool SeaChest, but "something" re-enables them:


Disable standby on all HDDs:
for i in sd{c..n}; do /root/SeaChestUtilities/Linux/Lin64/SeaChest_PowerControl_191_1183_64 -d /dev/$i --onlySeagate --changePower --disableMode --powerMode standby ; done


Monitor standby timer status:

while true; do for i in sd{c..n}; do echo "$(date) $i $(/root/SeaChestUtilities/Linux/Lin64/SeaChest_PowerControl_191_1183_64 -d /dev/$i --onlySeagate --showEPCSettings -v0 | grep Stand)"; done; sleep 1 ; done

This will show:
Mon Dec 3 10:42:54 CET 2018 sdc Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sdd Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sde Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sdf Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sdg Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sdh Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:54 CET 2018 sdi Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:55 CET 2018 sdj Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:55 CET 2018 sdk Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:55 CET 2018 sdl Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:55 CET 2018 sdm Standby Z 0 9000 65535 120 Y Y
Mon Dec 3 10:42:55 CET 2018 sdn Standby Z 0 9000 65535 120 Y Y


So everything is fine at this point. The standby timer is 0 and disabled (no * shown), the default value is 9000, and the saved timer is 65535 (FFFF hex; we saved this value so the disks get a very long timeout after reboots). But after an unknown amount of time (in this case ~7 minutes) things start to get weird:

Mon Dec 3 10:47:52 CET 2018 sdc Standby Z *3500 9000 65535 120 Y Y
[...]
65535        120           Y Y
Mon Dec 3 10:48:07 CET 2018 sdc Standby Z *3500 9000 65535 120 Y Y
Mon Dec 3 10:48:09 CET 2018 sdc Standby Z *3500 9000 65535 120 Y Y
Mon Dec 3 10:48:12 CET 2018 sdc Standby Z *4500 9000 65535 120 Y Y
Mon Dec 3 10:48:14 CET 2018 sdc Standby Z *4500 9000 65535 120 Y Y
Mon Dec 3 10:48:16 CET 2018 sdc Standby Z *4500 9000 65535 120 Y Y
Mon Dec 3 10:48:19 CET 2018 sdc Standby Z *4500 9000 65535 120 Y Y
Mon Dec 3 10:48:21 CET 2018 sdc Standby Z *4500 9000 65535 120 Y Y
Mon Dec 3 10:48:23 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:26 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:28 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:30 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:32 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:35 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:37 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:40 CET 2018 sdc Standby Z *5500 9000 65535 120 Y Y
Mon Dec 3 10:48:42 CET 2018 sdc Standby Z *6500 9000 65535 120 Y Y
Mon Dec 3 10:48:44 CET 2018 sdc Standby Z *6500 9000 65535 120 Y Y
Mon Dec 3 10:48:47 CET 2018 sdc Standby Z *6500 9000 65535 120 Y Y
Mon Dec 3 10:48:49 CET 2018 sdc Standby Z *6500 9000 65535 120 Y Y
Mon Dec 3 10:48:52 CET 2018 sdc Standby Z *7500 9000 65535 120 Y Y
Mon Dec 3 10:48:52 CET 2018 sde Standby Z *65535 9000 65535 120 Y Y
Mon Dec 3 10:48:54 CET 2018 sdc Standby Z *7500 9000 65535 120 Y Y
Mon Dec 3 10:48:55 CET 2018 sde Standby Z *65535 9000 65535 120 Y Y
Mon Dec 3 10:48:57 CET 2018 sdc Standby Z *7500 9000 65535 120 Y Y
Mon Dec 3 10:48:57 CET 2018 sde Standby Z *65535 9000 65535 120 Y Y
Mon Dec 3 10:48:59 CET 2018 sdc Standby Z *7500 9000 65535 120 Y Y
Mon Dec 3 10:49:00 CET 2018 sde Standby Z *65535 9000 65535 120 Y Y
Mon Dec 3 10:49:02 CET 2018 sdc Standby Z *8500 9000 65535 120 Y Y
Mon Dec 3 10:49:02 CET 2018 sde Standby Z *11500 9000 65535 120 Y Y
Mon Dec 3 10:49:04 CET 2018 sdc Standby Z *8500 9000 65535 120 Y Y
Mon Dec 3 10:49:05 CET 2018 sde Standby Z *11500 9000 65535 120 Y Y
Mon Dec 3 10:49:07 CET 2018 sdc Standby Z *8500 9000 65535 120 Y Y
Mon Dec 3 10:49:07 CET 2018 sde Standby Z *11500 9000 65535 120 Y Y


So "something" starts to re-enable the standby timers with strange values. After the timers have gone up and down and toggled between disabled and enabled for some (unknown) amount of time, they become "stable" at a value of 3000 and stay enabled (*):

Mon Dec 3 10:50:43 CET 2018 sde Standby Z *3000 9000 65535 120 Y Y
Mon Dec 3 10:50:45 CET 2018 sde Standby Z *3000 9000 65535 120 Y Y
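
To make the re-enabled timers easier to spot in a long log, the monitor output can be filtered for entries whose current timer carries the leading "*". This is only a rough sketch: the function name enabled_timers is made up for illustration, and it assumes the exact field layout shown in the log lines of this mail (six date fields, device name, "Standby", "Z", then the current timer value).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read monitor log lines on stdin and print
# "<device> <timer>" for every line whose current standby timer is
# enabled, i.e. marked with a leading '*'.
# Assumed layout (taken from the log lines in this mail):
#   <6 date fields> <dev> Standby Z <current> <default> <saved> <recovery> Y Y
enabled_timers() {
    awk '$8 == "Standby" && $10 ~ /^\*/ { print $7, substr($10, 2) }'
}
```

Piping the monitor loop (or a saved log file) into enabled_timers would then list only the disks that currently have an active standby timer.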


The timer is counted in units of 100 ms, so 3000 x 100 ms = 300 s = 5 minutes. This is exactly what we measured when we started to analyse the issue: the disks powered off (spun down) after 5 minutes.
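
The same conversion as a small shell helper, in case it is useful when reading the other values in the log. The function name timer_to_seconds is made up for illustration; the only assumption is the 100 ms timer unit stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a standby timer value, counted in units
# of 100 ms, to whole seconds (integer division, fractions are dropped).
timer_to_seconds() {
    echo $(( $1 / 10 ))
}
```

For example, timer_to_seconds 3000 gives 300 (5 minutes) and timer_to_seconds 9000 gives 900 (15 minutes, the default value shown in the log).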

We tried adding the following module option:
options mpt3sas allow_drive_spindown=0

It did not help at all.
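
One thing that may be worth ruling out is whether the option actually reached the loaded driver, since module options are only read when the module is loaded. Loaded modules usually expose their parameters under /sys/module/<module>/parameters/. The sketch below reads such a value; the function name module_param and the SYSFS_ROOT override are made up for illustration, and whether the mpt3sas build in use exposes allow_drive_spindown there is an assumption, not something verified here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value a loaded kernel module reports
# for one of its parameters.
# SYSFS_ROOT exists only so the function can be tried against a test
# directory; it defaults to the real /sys.
module_param() {
    local file="${SYSFS_ROOT:-/sys}/module/$1/parameters/$2"
    if [[ -r "$file" ]]; then
        cat "$file"
    else
        # Module not loaded, or it does not expose this parameter.
        echo "not exposed: $file" >&2
        return 1
    fi
}
```

Running module_param mpt3sas allow_drive_spindown on an OSD node would show what the running driver reports, if the file exists at all.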

The only workaround right now is a cron job that runs every 3 minutes and disables standby on all disks.
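
A minimal sketch of what such a workaround could look like, reusing the SeaChest call from above. The script path, the function name disable_standby and the DRY_RUN switch are made up for illustration and are not our exact setup.

```shell
#!/usr/bin/env bash
# Hypothetical helper script, e.g. /usr/local/sbin/disable-standby.sh,
# called from an /etc/cron.d entry such as:
#   */3 * * * * root /usr/local/sbin/disable-standby.sh
# SeaChest binary as used earlier in this mail; override via SEACHEST.
SEACHEST=${SEACHEST:-/root/SeaChestUtilities/Linux/Lin64/SeaChest_PowerControl_191_1183_64}

# Disable the standby power mode on every device name given as argument.
disable_standby() {
    local dev
    for dev in "$@"; do
        # With DRY_RUN set to a non-empty value the command is only printed.
        ${DRY_RUN:+echo} "$SEACHEST" -d "/dev/$dev" --onlySeagate \
            --changePower --disableMode --powerMode standby
    done
}
```

The script would end with a call like disable_standby sd{c..n}; setting DRY_RUN=1 first shows the commands without touching the disks.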


Does anyone know a proper solution?

All the best,
Flo


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
