Re: Random health OSD_SCRUB_ERRORS on various OSDs, after pg repair back to HEALTH_OK

Brad Hubbard <bhubbard@xxxxxxxxxx> · Tue, 6 Mar 2018 19:11:11 +1000

debug_osd that is... :)

On Tue, Mar 6, 2018 at 7:10 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

On Tue, Mar 6, 2018 at 5:26 PM, Marco Baldini - H.S. Amiata <mbaldini@xxxxxxxxxxx> wrote:

    Hi
    I monitor dmesg on each of the 3 nodes, and no hardware issue is
    reported. The problem also happens with various different OSDs on
    different nodes, so for me it is clear it's not a hardware problem.

If you have osd_debug set to 25 or greater when you run the deep scrub, you should get more information about the nature of the read error in the ReplicatedBackend::be_deep_scrub() function (assuming this is a replicated pool).

This may create large logs, so make sure they don't exhaust storage.
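As a rough sketch of the procedure suggested above, the following Python builds (but does not run) the commands for one verbose deep scrub and checks how much room is left for logs first. The function names, the 0/0 restore level and the example PG and OSD are illustrative assumptions; the injectargs form mirrors the command used later in this thread, with the option spelled debug_osd as corrected in the follow-up.

```python
import shutil

def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def debug_scrub_plan(pg_id, osd_id, debug_level=25, restore_level="0/0"):
    """Build the command lines for one verbose deep scrub of a single PG.

    Nothing is executed here. The deep scrub runs asynchronously, so the
    last command should only be issued once the scrub has finished.
    """
    target = f"osd.{osd_id}"
    level = f"{debug_level}/{debug_level}"
    return [
        # Raise OSD logging on the primary only, to keep log growth contained.
        ["ceph", "tell", target, "injectargs", "--debug-osd", level],
        # Ask for a deep scrub of the PG under investigation.
        ["ceph", "pg", "deep-scrub", pg_id],
        # Put the log level back (assumed previous value: 0/0).
        ["ceph", "tell", target, "injectargs", "--debug-osd", restore_level],
    ]

if __name__ == "__main__":
    # "." stands in for the OSD log directory (e.g. /var/log/ceph) here.
    print(f"free space: {free_fraction('.'):.0%}")
    for cmd in debug_scrub_plan("13.65", 4):
        print(" ".join(cmd))
```

Restricting the change to the primary OSD of the PG being scrubbed, rather than osd.*, is a design choice to limit how much extra log is written.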

    Thanks for the reply

    On 05/03/2018 21:45, Vladimir Prokofev wrote:

      > always solved by ceph pg repair <PG>
        That doesn't necessarily mean that there's no hardware issue. In
        my case repair also worked fine and returned the cluster to OK
        state every time, but in time the faulty disk failed another
        scrub operation, and this repeated multiple times before we
        replaced that disk.
        One last thing to look into is dmesg on your OSD nodes. If
        there's a hardware read error it will be logged in dmesg.

        2018-03-05 18:26 GMT+03:00 Marco Baldini - H.S. Amiata <mbaldini@xxxxxxxxxxx>:

              Hi and thanks for the reply
              The OSDs are all healthy; in fact, after a ceph pg repair <PG>
              the ceph health is back to OK and in the OSD log I see
              <PG> repair ok, 0 fixed
              The SMART data of the 3 OSDs seems fine
              OSD.5

# ceph-disk list | grep osd.5
 /dev/sdd1 ceph data, active, cluster ceph, osd.5, block /dev/sdd2

# smartctl -a /dev/sdd
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-1SB10C
Serial Number:    Z9A1MA1V
LU WWN Device Id: 5 000c50 090c7028b
Firmware Version: CC43
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar  5 16:17:22 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 109) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       193297722
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       60
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1451132477
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13283
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       61
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   086   086   000    Old_age   Always       -       14
190 Airflow_Temperature_Cel 0x0022   071   055   040    Old_age   Always       -       29 (Min/Max 23/32)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       607
194 Temperature_Celsius     0x0022   029   014   000    Old_age   Always       -       29 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       193297722
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13211h+23m+08.363s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       53042120064
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       170788993187

OSD.4

# ceph-disk list | grep osd.4
 /dev/sdc1 ceph data, active, cluster ceph, osd.4, block /dev/sdc2

# smartctl -a /dev/sdc
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-1SB10C
Serial Number:    Z9A1M1BW
LU WWN Device Id: 5 000c50 090c78d27
Firmware Version: CC43
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar  5 16:20:46 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 109) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   006    Pre-fail  Always       -       194906537
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       64
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1485899434
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13390
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       65
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   095   095   000    Old_age   Always       -       5
190 Airflow_Temperature_Cel 0x0022   074   051   040    Old_age   Always       -       26 (Min/Max 19/29)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       616
194 Temperature_Celsius     0x0022   026   014   000    Old_age   Always       -       26 (0 14 0 0 0)
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       194906537
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       13315h+20m+30.974s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       52137467719
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       177227508503

OSD.8

# ceph-disk list | grep osd.8
 /dev/sda1 ceph data, active, cluster ceph, osd.8, block /dev/sda2

# smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.13-6-pve] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST1000DM003-1SB10C
Serial Number:    Z9A2BEF2
LU WWN Device Id: 5 000c50 0910f5427
Firmware Version: CC43
User Capacity:    1,000,203,804,160 bytes [1.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Mar  5 16:22:47 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 110) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   063   006    Pre-fail  Always       -       224621855
  3 Spin_Up_Time            0x0003   097   097   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       275
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   081   060   045    Pre-fail  Always       -       149383284
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6210
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       265
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   069   058   040    Old_age   Always       -       31 (Min/Max 21/35)
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       516
194 Temperature_Celsius     0x0022   031   017   000    Old_age   Always       -       31 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   005   001   000    Old_age   Always       -       224621855
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6154h+03m+35.126s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       24333847321
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       50261005553

              However, it's not only these 3 OSDs that have PGs with
              errors; these are only the most recent. In the last 3 months
              I have often had OSD_SCRUB_ERRORS on various OSDs, always
              solved by ceph pg repair <PG>. I don't think it's a
              hardware issue.
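To make the SMART check above repeatable across all OSD disks, here is a minimal Python sketch that pulls the attribute table out of `smartctl -a` text and reports any non-zero raw value among a few counters commonly associated with failing media. The choice of watched attributes and the function names are assumptions for illustration; the sample rows are copied from the OSD.5 output above. A clean result does not rule out hardware, as noted elsewhere in this thread.

```python
import re

# Counters whose raw value is normally 0 on a healthy disk (assumed watch list).
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.*)$"
)

def parse_attributes(smartctl_text):
    """Map attribute name -> raw value string for every table row found."""
    attrs = {}
    for line in smartctl_text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = match.group(9).strip()
    return attrs

def suspicious(attrs):
    """Return the watched attributes whose raw value does not start with 0."""
    flagged = {}
    for name in WATCHED:
        fields = attrs.get(name, "").split()
        if fields and fields[0] != "0":
            flagged[name] = attrs[name]
    return flagged

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
195 Hardware_ECC_Recovered  0x001a   004   001   000    Old_age   Always       -       193297722
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SAMPLE)))
```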

                  On 05/03/2018 13:40, Vladimir Prokofev wrote:

                    > candidate had a read error
                      speaks for itself: while scrubbing, it couldn't read the data.
                      I had a similar issue, and it was just an OSD dying: errors and
                      reallocated sectors in SMART, so I just replaced the disk. But in
                      your case it seems that the errors are on different OSDs? Are your
                      OSDs all healthy?
                      You can use this command to see some details:
                       rados list-inconsistent-obj <pg.id> --format=json-pretty

                      pg.id is the PG that's reporting as inconsistent. My guess is
                      that you'll see read errors in this output, with the number of the
                      OSD that encountered the error. After that you have to check that
                      OSD's health: SMART details, etc.
                      It's not always the disk itself that's causing problems. For
                      example, we had read errors because of a faulty backplane
                      interface in a server; changing the chassis resolved the issue.
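The JSON printed by rados list-inconsistent-obj can be reduced to "which OSD reported which error" with a few lines of Python. This is only a sketch: the SAMPLE document is hand-written for illustration (it is not output from the poster's cluster), it keeps just the fields the code reads, and real output carries more fields and may differ between Ceph releases.

```python
import json

def shards_with_errors(report_json):
    """Return (object name, osd id, errors) for every shard that lists an error."""
    report = json.loads(report_json)
    found = []
    for item in report.get("inconsistents", []):
        name = item.get("object", {}).get("name", "?")
        for shard in item.get("shards", []):
            if shard.get("errors"):
                found.append((name, shard.get("osd"), list(shard["errors"])))
    return found

# Hand-written example shaped like the command's output (illustrative only).
SAMPLE = """
{
  "epoch": 16486,
  "inconsistents": [
    {
      "object": {"name": "rbd_data.1526386b8b4567.0000000000001761", "snap": "head"},
      "errors": [],
      "union_shard_errors": ["read_error"],
      "shards": [
        {"osd": 5, "errors": []},
        {"osd": 6, "errors": ["read_error"]}
      ]
    }
  ]
}
"""

if __name__ == "__main__":
    for name, osd, errors in shards_with_errors(SAMPLE):
        print(f"{name}: osd.{osd} -> {', '.join(errors)}")
```

The OSD numbers this produces are the ones whose disks, cabling and dmesg output are worth checking first.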

                        2018-03-05 14:21 GMT+03:00 Marco Baldini - H.S. Amiata <mbaldini@xxxxxxxxxxx>:

                              Hi

                              After some days with debug_osd 5/5 I found [ERR] entries on
                              different days, in different PGs, on different OSDs and on
                              different hosts. This is what I get in the OSD logs:
                              OSD.5 (host 3)
2018-03-01 20:30:02.702269 7fdf4d515700  2 osd.5 pg_epoch: 16486 pg[9.1c( v 16486'51798 (16431'50251,16486'51798] local-lis/les=16474/16475 n=3629 ec=1477/1477 lis/c 16474/16474 les/c/f 16475/16477/0 16474/16474/16474) [5,6] r=0 lpr=16474 crt=16486'51798 lcod 16486'51797 mlcod 16486'51797 active+clean+scrubbing+deep] 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error
2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error

OSD.4 (host 3)
2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.000000000000f8eb:head candidate had a read error
                              OSD.8 (host 2)
2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.00000000000081a1:head candidate had a read error
                              I don't know what this error means, and as always a ceph pg
                              repair fixes it. I don't think this is normal.
                              Ideas?
                              Thanks
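The log lines above carry the PG, the shard and the object name, so they can be collected mechanically. The sketch below, with a regular expression written against exactly the three [ERR] lines quoted in this message, extracts those fields; the function name and the pattern are assumptions and may need adjusting for other Ceph versions.

```python
import re

# Matches only the cluster-log form ("log [ERR] : ..."), so the lower-level
# duplicate of the same message that osd.5 also printed is not counted twice.
ERR_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) .*"
    r"log \[ERR\] : (?P<pg>\d+\.[0-9a-f]+) shard (?P<shard>\d+): "
    r"soid (?P<soid>\S+) candidate had a read error"
)

def read_errors(log_lines):
    """Return one dict per 'candidate had a read error' cluster-log line."""
    events = []
    for line in log_lines:
        match = ERR_LINE.match(line.strip())
        if match:
            event = match.groupdict()
            event["shard"] = int(event["shard"])
            events.append(event)
    return events

LOG_LINES = [
    "2018-03-01 20:30:02.702278 7fdf4d515700 -1 log_channel(cluster) log [ERR] : 9.1c shard 6: soid 9:3b157c56:::rbd_data.1526386b8b4567.0000000000001761:head candidate had a read error",
    "2018-02-28 00:03:33.458558 7f112cf76700 -1 log_channel(cluster) log [ERR] : 13.65 shard 2: soid 13:a719ecdf:::rbd_data.5f65056b8b4567.000000000000f8eb:head candidate had a read error",
    "2018-02-27 23:55:15.100084 7f4dd0816700 -1 log_channel(cluster) log [ERR] : 14.31 shard 1: soid 14:8cc6cd37:::rbd_data.30b15b6b8b4567.00000000000081a1:head candidate had a read error",
]

if __name__ == "__main__":
    for event in read_errors(LOG_LINES):
        print(event["date"], event["pg"], "shard", event["shard"])
```

In all three events the shard number (6, 2, 1) differs from the OSD that wrote the log (osd.5, osd.4, osd.8). If "shard N" refers to osd.N, as these messages are usually read for replicated pools, the failed reads happened on replicas rather than on the OSDs whose logs are quoted here.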

                                  On 28/02/2018 14:48, Marco Baldini - H.S. Amiata wrote:

                                    Hi
                                    I read the bugtracker issue and it seems a lot like my problem,
                                    even though I can't check the reported checksum because I don't
                                    have it in my logs, perhaps because of debug osd = 0/0 in ceph.conf
                                    I just raised the OSD log level
                                    ceph tell osd.* injectargs --debug-osd 5/5
                                    I'll check the OSD logs in the next days...
                                    Thanks

                                    On 28/02/2018 11:59, Paul Emmerich wrote:

                                     Hi,

                                      might be http://tracker.ceph.com/issues/22464

                                      Can you check the OSD log file to see if the reported
                                      checksum is 0x6706be76?

                                      Paul
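A small Python sketch for the check asked for above: it scans log lines for the checksum value and returns the ones that mention it. Since the layout of the relevant log message is not shown in this thread, the code only does a case-insensitive substring match, and the two lines in the example are made up for illustration.

```python
def lines_with_checksum(log_lines, checksum="0x6706be76"):
    """Return the log lines that mention the given checksum (case-insensitive)."""
    needle = checksum.lower()
    return [line.rstrip("\n") for line in log_lines if needle in line.lower()]

if __name__ == "__main__":
    # Hypothetical lines; a real run would iterate over an open OSD log file,
    # e.g. one of the /var/log/ceph/ceph-osd.N.log files.
    example = [
        "some log line mentioning checksum 0x6706BE76\n",
        "another log line without it\n",
    ]
    print(lines_with_checksum(example))
```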

                                            On 28.02.2018 at 11:43, Marco Baldini - H.S. Amiata <mbaldini@xxxxxxxxxxx> wrote:

                                                Hello
                                                I have a little ceph cluster with 3 nodes, each with 3x1TB HDD and
                                                1x240GB SSD. I created this cluster after the Luminous release, so
                                                all OSDs are Bluestore. In my crush map I have two rules, one
                                                targeting the SSDs and one targeting the HDDs. I have 4 pools, one
                                                using the SSD rule and the others using the HDD rule; three pools
                                                are size=3 min_size=2, one is size=2 min_size=1 (this one has
                                                content that it's OK to lose)
                                                In the last 3 months I've been having a strange random problem. I
                                                planned my osd scrubs during the night (osd scrub begin hour = 20,
                                                osd scrub end hour = 7), when the office is closed, so there is low
                                                impact on the users. Some mornings, when I check the cluster
                                                health, I find:

                                                HEALTH_ERR X scrub errors; Possible data damage: Y pgs inconsistent
OSD_SCRUB_ERRORS X scrub errors
PG_DAMAGED Possible data damage: Y pg inconsistent
                                                X and Y are sometimes 1, sometimes 2.
                                                I issue a ceph health detail, check the damaged PGs, and run a
                                                ceph pg repair for the damaged PGs; I get
                                                instructing pg PG on osd.N to repair
                                                The PGs are different, the OSD that has to repair the PG is
                                                different, even the node hosting the OSD is different. I made a
                                                list of all PGs and OSDs. This morning is the most recent case:
                                                > ceph health detail
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 13.65 is active+clean+inconsistent, acting [4,2,6]
pg 14.31 is active+clean+inconsistent, acting [8,3,1]

                                                > ceph pg repair 13.65
instructing pg 13.65 on osd.4 to repair

(node-2)> tail /var/log/ceph/ceph-osd.4.log
2018-02-28 08:38:47.593447 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair starts
2018-02-28 08:39:37.573342 7f112cf76700  0 log_channel(cluster) log [DBG] : 13.65 repair ok, 0 fixed
                                                > ceph pg repair 14.31
instructing pg 14.31 on osd.8 to repair

(node-3)> tail /var/log/ceph/ceph-osd.8.log
2018-02-28 08:52:37.297490 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair starts
2018-02-28 08:53:00.704020 7f4dd0816700  0 log_channel(cluster) log [DBG] : 14.31 repair ok, 0 fixed
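The manual steps in the transcript above (read ceph health detail, pick out the inconsistent PGs, repair each one) can be sketched in Python as follows. The parser is written against the output quoted above; treating the first OSD of the acting set as the one that will be told to repair is an assumption, although it matches the "instructing pg ... on osd.N" lines in this message.

```python
import re

PG_LINE = re.compile(r"^pg (?P<pg>\S+) is (?P<state>\S+), acting \[(?P<acting>[\d,]+)\]")

def inconsistent_pgs(health_detail):
    """Return (pg id, acting set) for every PG whose state includes 'inconsistent'."""
    result = []
    for line in health_detail.splitlines():
        match = PG_LINE.match(line.strip())
        if match and "inconsistent" in match.group("state").split("+"):
            acting = [int(osd) for osd in match.group("acting").split(",")]
            result.append((match.group("pg"), acting))
    return result

HEALTH_DETAIL = """\
HEALTH_ERR 2 scrub errors; Possible data damage: 2 pgs inconsistent
OSD_SCRUB_ERRORS 2 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 13.65 is active+clean+inconsistent, acting [4,2,6]
pg 14.31 is active+clean+inconsistent, acting [8,3,1]
"""

if __name__ == "__main__":
    for pg, acting in inconsistent_pgs(HEALTH_DETAIL):
        # Printed for review only; nothing is executed against the cluster.
        print(f"ceph pg repair {pg}   # expected target: osd.{acting[0]}")
```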

                                                I made a list of when I got OSD_SCRUB_ERRORS, which PG was
                                                affected and which OSD had to repair the PG. Dates are dd/mm/yyyy

                                                21/12/2017   --  pg 14.29 is active+clean+inconsistent, acting [6,2,4]

18/01/2018   --  pg 14.5a is active+clean+inconsistent, acting [6,4,1]

22/01/2018   --  pg 9.3a is active+clean+inconsistent, acting [2,7]

29/01/2018   --  pg 13.3e is active+clean+inconsistent, acting [4,6,1]
                 instructing pg 13.3e on osd.4 to repair

07/02/2018   --  pg 13.7e is active+clean+inconsistent, acting [8,2,5]
                 instructing pg 13.7e on osd.8 to repair

09/02/2018   --  pg 13.30 is active+clean+inconsistent, acting [7,3,2]
                 instructing pg 13.30 on osd.7 to repair

15/02/2018   --  pg 9.35 is active+clean+inconsistent, acting [1,8]
                 instructing pg 9.35 on osd.1 to repair

                 pg 13.3e is active+clean+inconsistent, acting [4,6,1]
                 instructing pg 13.3e on osd.4 to repair

17/02/2018   --  pg 9.2d is active+clean+inconsistent, acting [7,5]
                 instructing pg 9.2d on osd.7 to repair                 

22/02/2018   --  pg 9.24 is active+clean+inconsistent, acting [5,8]
                 instructing pg 9.24 on osd.5 to repair

28/02/2018   --  pg 13.65 is active+clean+inconsistent, acting [4,2,6]
                 instructing pg 13.65 on osd.4 to repair

                 pg 14.31 is active+clean+inconsistent, acting [8,3,1]
                 instructing pg 14.31 on osd.8 to repair
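To see whether any single OSD or host stands out in the list above, the acting sets can be tallied. The sketch below does that in Python; the incident lines are copied from the list, the OSD-to-host mapping is taken from the ceph osd tree output in this message, and the variable and function names are illustrative. Note that membership in an acting set only says an OSD held a copy of the PG, not that its copy was the one that failed to read.

```python
import re
from collections import Counter

INCIDENTS = """\
21/12/2017 pg 14.29 is active+clean+inconsistent, acting [6,2,4]
18/01/2018 pg 14.5a is active+clean+inconsistent, acting [6,4,1]
22/01/2018 pg 9.3a is active+clean+inconsistent, acting [2,7]
29/01/2018 pg 13.3e is active+clean+inconsistent, acting [4,6,1]
07/02/2018 pg 13.7e is active+clean+inconsistent, acting [8,2,5]
09/02/2018 pg 13.30 is active+clean+inconsistent, acting [7,3,2]
15/02/2018 pg 9.35 is active+clean+inconsistent, acting [1,8]
15/02/2018 pg 13.3e is active+clean+inconsistent, acting [4,6,1]
17/02/2018 pg 9.2d is active+clean+inconsistent, acting [7,5]
22/02/2018 pg 9.24 is active+clean+inconsistent, acting [5,8]
28/02/2018 pg 13.65 is active+clean+inconsistent, acting [4,2,6]
28/02/2018 pg 14.31 is active+clean+inconsistent, acting [8,3,1]
"""

# HDD OSDs per host, as listed in the ceph osd tree output.
HOST_OF = {
    0: "pve-hs-main", 1: "pve-hs-main", 2: "pve-hs-main",
    3: "pve-hs-2", 4: "pve-hs-2", 5: "pve-hs-2",
    6: "pve-hs-3", 7: "pve-hs-3", 8: "pve-hs-3",
}

ACTING = re.compile(r"pg (\S+) is \S+, acting \[([\d,]+)\]")

def tally(text):
    """Count how often each OSD and each host appears in an acting set."""
    per_osd, per_host = Counter(), Counter()
    for _pg, acting in ACTING.findall(text):
        for osd in (int(x) for x in acting.split(",")):
            per_osd[osd] += 1
            per_host[HOST_OF[osd]] += 1
    return per_osd, per_host

if __name__ == "__main__":
    per_osd, per_host = tally(INCIDENTS)
    for osd in sorted(HOST_OF):
        print(f"osd.{osd} ({HOST_OF[osd]}): {per_osd[osd]}")
    print(dict(per_host))
```

The counts are spread over eight of the nine HDD OSDs and all three hosts (osd.0 never appears), which fits the observation that no single disk is implicated by this list alone.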

                                                If it can be useful, my ceph.conf is here:
                                                [global]
auth client required = none
auth cluster required = none
auth service required = none
fsid = 24d5d6bc-0943-4345-b44e-46c19099004b
cluster network = 10.10.10.0/24
public network = 10.10.10.0/24
keyring = /etc/pve/priv/$cluster.$name.keyring
mon allow pool delete = true
osd journal size = 5120
osd pool default min size = 2
osd pool default size = 3
bluestore_block_db_size = 64424509440

debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0

[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd max backfills = 1
osd recovery max active = 1

osd scrub begin hour = 20
osd scrub end hour = 7
osd scrub during recovery = false
osd scrub load threshold = 0.3

[client]
rbd cache = true
rbd cache size = 268435456      # 256MB
rbd cache max dirty = 201326592    # 192MB
rbd cache max dirty age = 2
rbd cache target dirty = 33554432    # 32MB
rbd cache writethrough until flush = true

#[mgr]
#debug_mgr = 20

[mon.pve-hs-main]
host = pve-hs-main
mon addr = 10.10.10.251:6789

[mon.pve-hs-2]
host = pve-hs-2
mon addr = 10.10.10.252:6789

[mon.pve-hs-3]
host = pve-hs-3
mon addr = 10.10.10.253:6789
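The scrub window in the [osd] section above (osd scrub begin hour = 20, osd scrub end hour = 7) wraps past midnight. As a minimal sketch of how such a window is commonly evaluated, and assuming Ceph treats the begin hour as inclusive and the end hour as exclusive, the check looks like this in Python. The function name is illustrative and this is not Ceph's own code.

```python
def scrub_hour_allowed(hour, begin=20, end=7):
    """Return True if `hour` (0-23) falls inside the scrub window.

    When begin < end the window lies within one day; otherwise it wraps
    around midnight, as with the 20 -> 7 window used in this cluster.
    """
    if begin < end:
        return begin <= hour < end
    return hour >= begin or hour < end

if __name__ == "__main__":
    allowed = [h for h in range(24) if scrub_hour_allowed(h)]
    print(allowed)  # [0, 1, 2, 3, 4, 5, 6, 20, 21, 22, 23]
```

The scrub errors logged in this thread (20:30, 23:55 and 00:03) all fall inside this window, which is consistent with them being found by the scheduled night-time scrubs.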

                                                My ceph versions:
                                                {
    "mon": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 12
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.2 (215dd7151453fae88e6f968c975b6ce309d42dcf) luminous (stable)": 18
    }
}

                                                My ceph osd tree:
                                                ID CLASS WEIGHT  TYPE NAME            STATUS REWEIGHT PRI-AFF
-1       8.93686 root default
-6       2.94696     host pve-hs-2
 3   hdd 0.90959         osd.3            up  1.00000 1.00000
 4   hdd 0.90959         osd.4            up  1.00000 1.00000
 5   hdd 0.90959         osd.5            up  1.00000 1.00000
10   ssd 0.21819         osd.10           up  1.00000 1.00000
-3       2.86716     host pve-hs-3
 6   hdd 0.85599         osd.6            up  1.00000 1.00000
 7   hdd 0.85599         osd.7            up  1.00000 1.00000
 8   hdd 0.93700         osd.8            up  1.00000 1.00000
11   ssd 0.21819         osd.11           up  1.00000 1.00000
-7       3.12274     host pve-hs-main
 0   hdd 0.96819         osd.0            up  1.00000 1.00000
 1   hdd 0.96819         osd.1            up  1.00000 1.00000
 2   hdd 0.96819         osd.2            up  1.00000 1.00000
 9   ssd 0.21819         osd.9            up  1.00000 1.00000

                                                My pools:
                                                pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~5]
pool 14 'cephnix' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16482 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~227]
pool 17 'cephssd' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 8601 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
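
A PG id has the form <pool id>.<hex>, so the pool list above is enough to tell which pool (and which replica count) an inconsistent PG belongs to. Below is a small sketch under that assumption; the helper names parse_pools and pool_of_pg and the PG id "13.65" are illustrative only, not taken from this cluster's logs.

```python
import re

# Two of the pool lines quoted above, in the same text format.
POOLS = """
pool 9 'cephbackup' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 5665 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~3]
pool 13 'cephwin' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 16454 flags hashpspool stripe_width 0 application rbd
        removed_snaps [1~5]
"""

POOL_RE = re.compile(
    r"^pool (?P<id>\d+) '(?P<name>[^']+)' replicated "
    r"size (?P<size>\d+) min_size (?P<min_size>\d+) crush_rule (?P<rule>\d+)"
)

def parse_pools(detail_text):
    """Return {pool_id: {name, size, min_size, crush_rule}} for replicated pools."""
    pools = {}
    for line in detail_text.splitlines():
        match = POOL_RE.match(line.strip())
        if match:
            pools[int(match["id"])] = {
                "name": match["name"],
                "size": int(match["size"]),
                "min_size": int(match["min_size"]),
                "crush_rule": int(match["rule"]),
            }
    return pools

def pool_of_pg(pgid, pools):
    """Look up the pool name for a PG id such as '13.65' (hypothetical id)."""
    pool_id = int(pgid.split(".")[0])
    return pools[pool_id]["name"]

pools = parse_pools(POOLS)
print(pool_of_pg("13.65", pools), pools[13]["size"])
```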

I can't understand where the problem comes from. I don't think it's hardware: if I had a failed disk, I would expect the problems to always be on the same OSD. Any ideas?

Thanks
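
One way to check the "always the same OSD" reasoning is to tally how often each OSD appears in the acting sets of the PGs that reported scrub errors (the acting set is shown, for example, in `ceph health detail` or `ceph pg map <PG>`). The sketch below uses placeholder PG ids and acting sets, not data from this cluster, and tally_osds is a made-up helper name. A tally like this is only a hint, because every member of the acting set is counted even though only one replica may have returned the bad read.

```python
from collections import Counter

def tally_osds(inconsistent_pgs):
    """Count, per OSD id, how many inconsistent PGs had that OSD in the acting set."""
    counts = Counter()
    for _pgid, acting in inconsistent_pgs:
        # set() guards against an OSD being listed twice for one PG.
        counts.update(set(acting))
    return counts

# Hypothetical placeholder data: (pg id, acting set) pairs collected over time.
sample = [
    ("13.65", [1, 5, 7]),
    ("14.31", [0, 4, 8]),
    ("9.35", [2, 6]),
    ("14.7", [3, 1, 8]),
]

counts = tally_osds(sample)
for osd, n in counts.most_common():
    print(f"osd.{osd}: {n} inconsistent PG(s)")
```

If one OSD clearly dominates the tally over several weeks, that points at its disk; an even spread across OSDs and hosts, as described in this thread, is less consistent with a single failing drive.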

--
Marco Baldini
H.S. Amiata Srl
Ufficio:   0577-779396
Cellulare: 335-8765169
WEB:       www.hsamiata.it
EMAIL:     mbaldini@xxxxxxxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Mit freundlichen Grüßen / Best Regards
Paul Emmerich

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

Geschäftsführer: Martin Verges
Handelsregister: Amtsgericht München
USt-IdNr: DE310638492


-- 
Cheers,
Brad
