Re: [Jewel] Crash Osd with void Hit_set_trim

Hello,

On 24/10/2017 at 07:49, Brad Hubbard wrote:


On Mon, Oct 23, 2017 at 4:51 PM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx> wrote:

Hello,

On 23/10/2017 at 02:05, Brad Hubbard wrote:
2017-10-22 17:32:56.031086 7f3acaff5700  1 osd.14 pg_epoch: 72024 pg[37.1c( v 71593'41657 (60849'38594,71593'41657] local-les=72023 n=13 ec=7037 les/c/f 72023/72023/66447 72022/72022/72022) [14,1,41] r=0 lpr=72022 crt=71593'41657 lcod 0'0 mlcod 0'0 active+clean] hit_set_trim 37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head not found
2017-10-22 17:32:56.033936 7f3acaff5700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&, unsigned int)' thread 7f3acaff5700 time 2017-10-22 17:32:56.031105
osd/ReplicatedPG.cc: 11782: FAILED assert(obc)

It appears to be looking for (and failing to find) a hitset object with a timestamp from August? Does that sound right to you? Of course, it appears an object for that timestamp does not exist.

How is that possible? How can I fix it? I am sure that if I run a lot of reads, other objects like this will crash other OSDs.
(The cluster is OK now; I will probably destroy OSD 14 and recreate it.)
How can I find this object?

You should be able to do a find on the OSDs filestore and grep the output for 'hit_set_37.1c_archive_2017-08-31'. I'd start with the OSDs responsible for pg 37.1c and then move on to the others if it's feasible.
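
The find-and-grep step above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the filestore path /var/lib/ceph/osd/ceph-1/current is the usual default but may differ on your hosts, the helper names are made up for this example, and the idea that on-disk file names may escape "_" as "\u" is an assumption to verify against a real OSD directory before relying on it.

```python
import os
import sys


def unescape_filestore_name(name):
    # Assumption (verify on a real OSD): FileStore may write "_" as "\u"
    # and "/" as "\s" in on-disk file names. Undo that so a plain object
    # name can be compared against the file name.
    return name.replace("\\u", "_").replace("\\s", "/")


def find_hitset_files(root, needle):
    """Return every file under root whose name contains needle,
    either literally or after undoing the assumed escaping."""
    matches = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if needle in fname or needle in unescape_filestore_name(fname):
                matches.append(os.path.join(dirpath, fname))
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical default: the filestore directory of OSD 1.
    root = sys.argv[1] if len(sys.argv) > 1 else "/var/lib/ceph/osd/ceph-1/current"
    for path in find_hitset_files(root, "hit_set_37.1c_archive_2017-08-31"):
        print(path)
```

Run it on each OSD host as a user that can read the OSD data directory, passing that OSD's filestore directory as the first argument. No output means no matching file was found under that directory.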

So with grep, I found OSD.14 (already destroyed and recreated) and OSD.1.

ceph-osd-01: /var/log/ceph/ceph-osd.1.log-20171019.gz:2017-10-18 05:37:52.793802 7f9754ec5700 -1 osd.1 pg_epoch: 71592 pg[37.1c( v 71591'41652 (60849'38594,71591'41652] local-les=71583 n=17 ec=7037 les/c/f 71583/71554/66447 71561/71578/71578) [43,26,13]/[1,41] r=0 lpr=71578 pi=71553-71577/5 luod=71590'41651 bft=13,26,43 crt=71588'41647 lcod 71589'41650 mlcod 0'0 active+undersized+degraded+remapped+wait_backfill] agent_load_hit_sets: could not load hitset 37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31 01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head
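
The same log search can be repeated across all OSD logs, rotated and gzipped ones included, to list every hit set object that has already been reported missing. The sketch below assumes the logs live under /var/log/ceph with the naming seen above and that the two messages quoted in this thread ("hit_set_trim ... not found" and "could not load hitset ...") are the ones of interest; the function names are invented for the example. It can only report objects an OSD has already tripped over, not ones that are missing but not yet touched.

```python
import glob
import gzip
import re

# Matches the object name that follows either message quoted in this thread.
# The name contains spaces, so capture lazily up to the trailing ":head".
PATTERN = re.compile(
    r"(?:hit_set_trim|could not load hitset) (\S+hit_set_.*?:head)"
)


def open_log(path):
    # Rotated logs are gzip-compressed; current logs are plain text.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")


def missing_hitsets(lines):
    """Return the set of hit set object names reported in the given lines."""
    found = set()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            found.add(m.group(1))
    return found


def scan_logs(pattern="/var/log/ceph/ceph-osd.*.log*"):
    """Map each log file matching the glob to the object names found in it."""
    report = {}
    for path in sorted(glob.glob(pattern)):
        with open_log(path) as fh:
            objs = missing_hitsets(fh)
        if objs:
            report[path] = objs
    return report


if __name__ == "__main__":
    for path, objs in scan_logs().items():
        for obj in sorted(objs):
            print(path, obj)
```

Each output line pairs a log file with one object name, which gives a list of PGs and OSDs to inspect before they are hit by client reads.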

May I destroy OSD 1 and recreate it as well to force the data to move? Or should I just reweight the OSD to force the move?

How can I find other objects with the same issue? (Just restart rsync and see?)

Another question: I run a nightly cron job with fstrim on the RBD disks. Could that be the cause of the problem?

Let us know the results.


--
Performance Conseil Informatique
Pascal Pucci
Infrastructure Consultant
pascal.pucci@xxxxxxxxxxxxxxx
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
