Hello,

I have a pool of 300+ OSDs that are all the same model (Seagate ST1800MM0129, size 1.64 TiB).
Only one OSD crashes regularly, but I cannot identify a root cause.
Based on the smartctl output, the disk appears to be OK.

# smartctl -a -d megaraid,1 /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.18-2-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               LENOVO-X
Product:              ST1800MM0129
Revision:             L2B6
Compliance:           SPC-4
User Capacity:        1,800,360,124,416 bytes [1.80 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        10500 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500bb7822cf
Serial number:        WBN0QHX80000E852944J
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon May 18 09:19:41 2020 CEST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: HARDWARE IMPENDING FAILURE GENERAL HARD DRIVE FAILURE [asc=5d, ascq=10]

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 68
Power on minutes since format <not available>
Current Drive Temperature:     33 C
Drive Trip Temperature:        65 C

Manufactured in week 31 of year 2018
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  21
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  709
Elements in grown defect list: 18

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3278853896        1         0  3278853897         32      83933.567          19
write:           0        0         0           0          0      24093.894           0
verify: 3080361880        0         0  3080361880          0      12630.494           0

Non-medium error count:      244

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -    3761                 - [-   -    -]
# 2  Background short  Completed                   -    3737                 - [-   -    -]
# 3  Background short  Completed                   -    3713                 - [-   -    -]
# 4  Background short  Completed                   -    3689                 - [-   -    -]
# 5  Background short  Completed                   -    3665                 - [-   -    -]
# 6  Background short  Completed                   -    3641                 - [-   -    -]
# 7  Background short  Completed                   -    3617                 - [-   -    -]
# 8  Background short  Completed                   -    3593                 - [-   -    -]
# 9  Background long   Completed                   -    3569                 - [-   -    -]
#10  Background short  Completed                   -    3545                 - [-   -    -]
#11  Background short  Completed                   -    3521                 - [-   -    -]
#12  Background short  Completed                   -    3497                 - [-   -    -]
#13  Background short  Completed                   -    3473                 - [-   -    -]
#14  Background short  Completed                   -    3449                 - [-   -    -]
#15  Background short  Completed                   -    3425                 - [-   -    -]
#16  Background short  Completed                   -    3401                 - [-   -    -]
#17  Background short  Completed                   -    3377                 - [-   -    -]
#18  Background short  Completed                   -    3353                 - [-   -    -]
#19  Background short  Completed                   -    3329                 - [-   -    -]
#20  Background short  Completed                   -    3305                 - [-   -    -]

Long (extended) Self-test duration: 9459 seconds [157.7 minutes]

I have attached the log of the affected OSD.

Thanks,
Thomas

I have uploaded 1 file belonging to this email:
ceph-osd.92.log.1.gz (578 KB), via WeTransfer: https://we.tl/t-7DzNCDP3iZ