Re: [Jewel] Crash Osd with void Hit_set_trim

"pascal.pucci@xxxxxxxxxxxxxxx" <pascal.pucci@xxxxxxxxxxxxxxx> · Tue, 24 Oct 2017 08:52:20 +0200



    Hello,

    
    On 24/10/2017 at 07:49, Brad Hubbard wrote:

    
          On Mon, Oct 23, 2017 at 4:51 PM, pascal.pucci@xxxxxxxxxxxxxxx <pascal.pucci@xxxxxxxxxxxxxxx>
            wrote:

            
                Hello,
                 On 23/10/2017 at 02:05, Brad Hubbard wrote:

                  
                          2017-10-22 17:32:56.031086 7f3acaff5700 
                            1 osd.14 pg_epoch: 72024 pg[37.1c( v
                            71593'41657 (60849'38594,71593'41657]
                            local-les=72023 n=13 ec=7037 les/c/f
                            72023/72023/66447 72022/72022/72022)
                            [14,1,41] r=0 lpr=72022 crt=71593'41657 lcod
                            0'0 mlcod 0'0 active+clean] hit_set_trim
                            37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31
                            01%3a03%3a24.697717Z_2017-08-31
                            01%3a52%3a34.767197Z:head not found

                            2017-10-22 17:32:56.033936 7f3acaff5700 -1
                            osd/ReplicatedPG.cc: In function 'void
                            ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
                            unsigned int)' thread 7f3acaff5700 time
                            2017-10-22 17:32:56.031105

                            osd/ReplicatedPG.cc: 11782: FAILED
                            assert(obc)

                            
                          It appears to be looking for (and failing to
                          find) a hitset object with a timestamp from
                          August? Does that sound right to you? Of
                          course, it appears an object for that
                          timestamp does not exist.

                        
                 How is this possible? How can I fix it? I am sure that
                if I run a lot of reads, other objects like this will
                crash other OSDs.

                (Cluster is OK now, I will probably destroy OSD 14 and
                recreate it).

                How can I find this object?

              
            You should be able to do a find on the OSD's filestore
              and grep the output for
              'hit_set_37.1c_archive_2017-08-31'. I'd start with the
              OSDs responsible for pg 37.1c and
                then move on to the others if it's feasible.
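
            The find-and-grep suggestion could be sketched in Python roughly
            as below. This is only an illustration: the helper name
            find_hitset_files and the filestore path in the usage part are
            assumptions, not something taken from this thread. On-disk file
            names may also escape some characters (underscores, for example),
            so a short fragment such as the date may match more reliably than
            the full object name.

```python
import os
from typing import Iterator


def find_hitset_files(root: str, needle: str) -> Iterator[str]:
    """Yield every path under root whose file name contains needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if needle in name:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Assumed filestore data directory for osd.1; adjust to the local layout.
    root = "/var/lib/ceph/osd/ceph-1/current"
    # Search on the date fragment only, in case the file name is escaped.
    for path in find_hitset_files(root, "2017-08-31"):
        print(path)
```

            Running it once per OSD data directory, starting with the OSDs
            that hold pg 37.1c, mirrors the order suggested above.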
            

    So with grep, I found OSD.14 (already destroyed and recreated) and
    OSD.1.

    
    ceph-osd-01: /var/log/ceph/ceph-osd.1.log-20171019.gz:2017-10-18
    05:37:52.793802 7f9754ec5700 -1 osd.1 pg_epoch: 71592 pg[37.1c( v
    71591'41652 (60849'38594,71591'41652] local-les=71583 n=17 ec=7037
    les/c/f 71583/71554/66447 71561/71578/71578) [43,26,13]/[1,41] r=0
    lpr=71578 pi=71553-71577/5 luod=71590'41651 bft=13,26,43
    crt=71588'41647 lcod 71589'41650 mlcod 0'0
    active+undersized+degraded+remapped+wait_backfill]
    agent_load_hit_sets: could not load hitset
    37:38000000:.ceph-internal::hit_set_37.1c_archive_2017-08-31
    01%3a03%3a24.697717Z_2017-08-31 01%3a52%3a34.767197Z:head
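
    To list every hit set archive that the OSD logs report as missing, the
    logs could be scanned with something like the sketch below. It assumes
    the two message shapes quoted in this thread ("could not load hitset"
    and "hit_set_trim ... not found") and that each message sits on a single
    line in the real log file. The names missing_hitsets and read_log are
    made up for this example.

```python
import gzip
import re
from urllib.parse import unquote

# Object name as printed in the log lines quoted in this thread, e.g.
# hit_set_37.1c_archive_<begin>_<end>:head with ':' shown as %3a.
HITSET_RE = re.compile(
    r"hit_set_(?P<pgid>\d+\.[0-9a-f]+)_archive_"
    r"(?P<begin>\S+ \S+?Z)_(?P<end>\S+ \S+?Z):head"
)
# Assumed markers, taken from the two messages seen in this thread.
MARKERS = ("could not load hitset", "not found")


def missing_hitsets(lines):
    """Return a set of (pgid, begin, end) for hit sets reported missing."""
    found = set()
    for line in lines:
        if not any(marker in line for marker in MARKERS):
            continue
        match = HITSET_RE.search(line)
        if match:
            found.add(
                (match["pgid"], unquote(match["begin"]), unquote(match["end"]))
            )
    return found


def read_log(path):
    """Yield lines from a plain or gzip-compressed log file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as handle:
        yield from handle
```

    For example, missing_hitsets(read_log("/var/log/ceph/ceph-osd.1.log-20171019.gz"))
    would return the distinct pg and time-range pairs found in that file.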

    
    Should I destroy OSD 1 and recreate it as well to force the data to
    move, or just reweight the OSD to force the move?

    
    How can I find other objects with the same issue? (Just restart rsync
    and see?)

    
    Another question: I run a nightly cron job with fstrim on the RBD
    disk. Could that be the cause of the problem?

    
            Let us know the results.
            

    -- 

      
              Performance Conseil Informatique

                Pascal Pucci

                Infrastructure Consultant

                pascal.pucci@xxxxxxxxxxxxxxx

                Mobile : 06 51 47 84 98

                Office : 02 85 52 41 81

                http://www.performance-conseil-informatique.net
              
                
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com