Hi,

Thank you for providing me this level of detail. I ended up just failing the drive, since it is still under support and we had in fact gotten emails about the health of this drive in the past. I will, however, use this approach in the future if we have an issue with a PG and it is the first time we have had an issue with the drive, and/or the drive is no longer under support.

Thanks again.

Shain

> On Mar 19, 2017, at 11:19 AM, Mehmet <ceph@xxxxxxxxxx> wrote:
>
> Hi Shain,
>
> What I would do: take osd.32 out.
>
> # systemctl stop ceph-osd@32
> # ceph osd out osd.32
>
> This will cause rebalancing.
>
> To repair/reuse the drive you can do:
>
> # smartctl -t long /dev/sdX
>
> This will start a long self-test on the drive and - I bet - it will abort after a while with something like:
>
> # smartctl -a /dev/sdX
> [...]
> SMART Self-test log
> Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
>      Description                              number   (hours)
> # 1  Background long   Failed in segment -->        -      4378       35494670 [0x3 0x11 0x0]
> [...]
>
> Now mark the segment as defective - my system was Ubuntu:
>
> # apt install sg3-utils/xenial
> # sg_verify --lba=35494670 /dev/sdX1
> # sg_reassign --address=35494670 /dev/sdX
> # sg_reassign --grown /dev/sdX
>
> The next long test should hopefully work fine:
>
> # smartctl -t long /dev/sdX
>
> If not, repeat the above with the newly found defective LBA.
>
> I've done this three times successfully - but not with an error on a primary PG.
>
> After that you can start the OSD with:
>
> # systemctl start ceph-osd@32
> # ceph osd in osd.32
>
> HTH
> - Mehmet
>
> On 2017-03-17 20:08, Shain Miley wrote:
>> Brian,
>>
>> Thank you for the detailed information. I was able to compare the 3 hexdump files and it looks like the primary PG is the odd man out.
>>
>> I stopped the OSD and then attempted to move the object:
>>
>> root@hqosd3:/var/lib/ceph/osd/ceph-32/current/3.2b8_head/DIR_8/DIR_B/DIR_2/DIR_A/DIR_0# mv rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3 /root
>> mv: error reading ‘rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3’: Input/output error
>> mv: failed to extend ‘/root/rb.0.fe307e.238e1f29.00000076024c__head_4650A2B8__3’: Input/output error
>>
>> However, I got a nice Input/output error instead. I assume that this is not what normally happens.
>>
>> Any ideas on how I should proceed at this point? Should I fail out this OSD and replace the drive (I have had no indication, other than the I/O error, that there is an issue with this disk), or is there something I can try first?
>>
>> Thanks again,
>> Shain
>>
>>> On 03/17/2017 11:38 AM, Brian Andrus wrote:
>>> We went through a period of time where we were experiencing these daily...
>>>
>>> cd to the PG directory on each OSD and do a find for "238e1f29.00000076024c" (mentioned in your error message). This will likely return a file that has a slash in the name, something like rbd\udata.238e1f29.00000076024c__head_blah_1f...
>>>
>>> hexdump -C the object (tab-completing the name helps) and pipe the output to a different location. Once you obtain the hexdumps, do a diff or cmp against them and find which one is not like the others.
>>>
>>> If the primary is not the outlier, perform the PG repair without worry. If the primary is the outlier, you will need to stop the OSD, move the object out of place, start it back up, and then it will be okay to issue a PG repair.
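For anyone finding this thread later, here is a minimal sketch of the comparison Brian describes. The host names hqosd1 and hqosd2, and the use of md5sum rather than a full hexdump/diff, are assumptions - adjust the hosts, PG id, and paths to your own cluster:

    #!/bin/sh
    # Print a checksum for each replica of the suspect object; the copy
    # whose checksum differs from the other two is the bad one.
    OBJ="238e1f29.00000076024c"           # object id from the scrub error
    PG="3.2b8"                            # the inconsistent PG
    for host in hqosd1 hqosd2 hqosd3; do  # hosts in the acting set (assumed names)
        echo "== $host =="
        ssh "$host" "find /var/lib/ceph/osd/ceph-*/current/${PG}_head -name '*${OBJ}*' -exec md5sum {} \;"
    done
    # If the odd checksum is NOT on the primary OSD, 'ceph pg repair 3.2b8' is
    # safe. If it IS on the primary, stop that OSD, move the object aside,
    # start the OSD again, and then issue the repair.

Note that a checksum only tells you which copy differs, not where; the hexdump/diff approach shows the differing bytes. And if one copy simply returns a read error (as the copy on osd.32 did here), that by itself identifies the bad copy.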
>>> Other less common inconsistent PGs we see are differing object sizes (easy to detect with a simple listing of file sizes) and differing attributes ("attr -l", but the error logs are usually precise in identifying the problematic PG copy).
>>>
>>>> On Fri, Mar 17, 2017 at 8:16 AM, Shain Miley <smiley@xxxxxxx> wrote:
>>>> Hello,
>>>>
>>>> Ceph status is showing:
>>>>
>>>> 1 pgs inconsistent
>>>> 1 scrub errors
>>>> 1 active+clean+inconsistent
>>>>
>>>> I located the error messages in the logfile after querying the PG in question:
>>>>
>>>> root@hqosd3:/var/log/ceph# zgrep -Hn 'ERR' ceph-osd.32.log.1.gz
>>>> ceph-osd.32.log.1.gz:846:2017-03-17 02:25:20.281608 7f7744d7f700 -1 log_channel(cluster) log [ERR] : 3.2b8 shard 32: soid 3/4650a2b8/rb.0.fe307e.238e1f29.00000076024c/head candidate had a read error, data_digest 0x84c33490 != known data_digest 0x974a24a7 from auth shard 62
>>>> ceph-osd.32.log.1.gz:847:2017-03-17 02:30:40.264219 7f7744d7f700 -1 log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 0 missing, 1 inconsistent objects
>>>> ceph-osd.32.log.1.gz:848:2017-03-17 02:30:40.264307 7f7744d7f700 -1 log_channel(cluster) log [ERR] : 3.2b8 deep-scrub 1 errors
>>>>
>>>> Is this a case where it would be safe to use 'ceph pg repair'? The documentation indicates there are times where running this command is less safe than others, and I would like to be sure before I do so.
>>>>
>>>> Thanks,
>>>> Shain
>>>>
>>>> --
>>>> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649
>>>
>>> --
>>> Brian Andrus | Cloud Systems Engineer | DreamHost
>>> brian.andrus@xxxxxxxxxxxxx | www.dreamhost.com
>>
>> --
>> NPR | Shain Miley | Manager of Infrastructure, Digital Media | smiley@xxxxxxx | 202.513.3649

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
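As a closing note, assuming the cluster is running Jewel or later (an assumption; the release in use is not stated above), the inconsistent shard can also be identified without manual hexdumps. A rough sketch, using the PG id from the log above:

    # Re-run the deep scrub so the inconsistency details are fresh.
    ceph pg deep-scrub 3.2b8
    # Wait for the scrub to finish (watch 'ceph -w' or 'ceph pg 3.2b8 query'),
    # then list the recorded inconsistencies:
    rados list-inconsistent-obj 3.2b8 --format=json-pretty
    # The JSON output records the errors per shard (e.g. a read_error or a
    # data_digest mismatch). If only a non-primary shard is flagged,
    # 'ceph pg repair 3.2b8' is safe; if the primary is flagged, follow the
    # stop-OSD / move-the-object-aside procedure described earlier first.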