Hi and thanks for reply The OSDs are all healthy, in fact after a ceph pg repair <PG> the ceph health is back to OK and in the OSD log I see <PG> repair ok, 0 fixed The SMART data of the 3 OSDs seems fine OSD.5 # ceph-disk list | grep osd.5 /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2 # smartctl -a /dev/sdd smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST1000DM003-1SB10C Serial Number: Z9A1MA1V LU WWN Device Id: 5 000c50 090c7028b Firmware Version: CC43 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Mon Mar 5 16:17:22 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 109) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x1085) SCT Status supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 082 063 006 Pre-fail Always - 193297722 3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 60 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1451132477 9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13283 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 61 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0 189 High_Fly_Writes 0x003a 086 086 000 Old_age Always - 14 190 Airflow_Temperature_Cel 0x0022 071 055 040 Old_age Always - 29 (Min/Max 23/32) 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 607 194 Temperature_Celsius 0x0022 029 014 000 Old_age Always - 29 (0 14 0 0 0) 195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 193297722 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 13211h+23m+08.363s 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 53042120064 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 170788993187 OSD.4 # ceph-disk list | grep osd.4 /dev/sdc1 ceph data, active, cluster ceph, osd.4, block /dev/sdc2 # smartctl -a /dev/sdc smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST1000DM003-1SB10C Serial Number: Z9A1M1BW LU WWN Device Id: 5 000c50 090c78d27 Firmware Version: CC43 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Mon Mar 5 16:20:46 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 109) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x1085) SCT Status supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 082 063 006 Pre-fail Always - 194906537 3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 64 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 091 060 045 Pre-fail Always - 1485899434 9 Power_On_Hours 0x0032 085 085 000 Old_age Always - 13390 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 65 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0 189 High_Fly_Writes 0x003a 095 095 000 Old_age Always - 5 190 Airflow_Temperature_Cel 0x0022 074 051 040 Old_age Always - 26 (Min/Max 19/29) 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 616 194 Temperature_Celsius 0x0022 026 014 000 Old_age Always - 26 (0 14 0 0 0) 195 Hardware_ECC_Recovered 0x001a 004 001 000 Old_age Always - 194906537 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 13315h+20m+30.974s 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 52137467719 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 177227508503 OSD.8 # ceph-disk list | grep osd.8 /dev/sda1 ceph data, active, cluster ceph, osd.8, block /dev/sda2 # smartctl -a /dev/sda smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.14 (AF) Device Model: ST1000DM003-1SB10C Serial Number: Z9A2BEF2 LU WWN Device Id: 5 000c50 0910f5427 Firmware Version: CC43 User Capacity: 1,000,203,804,160 bytes [1.00 TB] Sector Sizes: 512 bytes logical, 4096 bytes physical Rotation Rate: 7200 rpm Form Factor: 3.5 inches Device is: In smartctl database [for details use: -P show] ATA Version is: ATA8-ACS T13/1699-D revision 4 SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s) Local Time is: Mon Mar 5 16:22:47 2018 CET SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 110) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x1085) SCT Status supported. SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 083 063 006 Pre-fail Always - 224621855 3 Spin_Up_Time 0x0003 097 097 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 275 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 081 060 045 Pre-fail Always - 149383284 9 Power_On_Hours 0x0032 093 093 000 Old_age Always - 6210 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 265 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0 189 High_Fly_Writes 0x003a 098 098 000 Old_age Always - 2 190 Airflow_Temperature_Cel 0x0022 069 058 040 Old_age Always - 31 (Min/Max 21/35) 193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 516 194 Temperature_Celsius 0x0022 031 017 000 Old_age Always - 31 (0 17 0 0 0) 195 Hardware_ECC_Recovered 0x001a 005 001 000 Old_age Always - 224621855 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 6154h+03m+35.126s 241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 24333847321 242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 50261005553 However it's not only these 3 OSD to have PG with errors, these are onlyl the most recent, in the last 3 months I had often OSD_SCRUB_ERRORS in various OSDs, always solved by ceph pg repair <PG>, I don't think it's an hardware issue.
Il 05/03/2018 13:40, Vladimir Prokofev
ha scritto:
--
|
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com