Hello,
Last night, same effect, new freeze... BUT, this morning I may have
found out why! A stupid boy added "vm.vfs_cache_pressure=1" as a tuning
parameter on the first OSD node and forgot to remove it afterwards...
bad boy :)
There is always an explanation; it could not be otherwise. This may
have been fine before the upgrade, but not after... That explains a
lot, e.g. why the load on that node has always been a little higher
than on the others... etc... etc...
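For the record, a minimal way to check and undo it on the node (the
kernel default is 100; the exact file the tuning touched may differ):

  # read the current value
  sysctl vm.vfs_cache_pressure
  # restore the kernel default at runtime
  sysctl -w vm.vfs_cache_pressure=100
  # find the leftover line and remove it so it does not come back at reboot
  grep -r vfs_cache_pressure /etc/sysctl.conf /etc/sysctl.d/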
We will see in two days.
Sorry, sorry, sorry :|
On 09/03/2017 at 13:03, Vincent Godin wrote:
> First of all, don't do a Ceph upgrade while your cluster is in a
> warning or error state. An upgrade must be done from a clean cluster.
Of course.
So, yesterday I tried this for my "unfound PG":
ceph pg 50.2dd mark_unfound_lost revert => MON crash :(
so:
ceph pg 50.2dd mark_unfound_lost delete => OK.
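For anyone following along: before choosing between revert and delete,
it is worth looking at what is actually unfound. A rough sketch, with
my pg id (yours will differ):

  # list the PGs with unfound objects and their counts
  ceph health detail | grep unfound
  # see which OSDs the PG would still like to probe for the lost copies
  ceph pg 50.2dd query | grep -A 10 might_have_unfound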
The cluster was HEALTH_OK => so I finally migrated everything to Jewel
10.2.6. And last night, nothing, everything worked fine (trimfs of the
rbd was disabled).
Maybe next time. It's always after two days. (Scrubbing runs from 22h
to 6h.)
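For reference, that scrub window is pinned on the OSDs; a minimal
ceph.conf sketch, assuming everything else is left at defaults:

  [osd]
  # restrict scrubbing to the 22h-6h window mentioned above
  osd scrub begin hour = 22
  osd scrub end hour = 6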
> Don't stay with a replica count of 2. The majority of problems come
> from that point: just look at the advice given by experienced users
> of this list. You should set a replica count of 3 and min_size to 2.
> This will prevent you from losing data because of a double fault,
> which is frequent.
I have already had some faulty PGs found by the scrubbing process (disk
I/O errors) and had to remove the bad PGs myself. As I understand it,
with 3 replicas the repair would be automatic.
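For the archives, the manual path was roughly this (the pg id is just
an example; repair rewrites the bad copy from the primary, so it is
worth checking which replica is damaged first):

  # find the PGs flagged inconsistent by scrub
  ceph health detail | grep inconsistent
  # ask the primary OSD to repair the PG from a good replica
  ceph pg repair 50.2dd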
OK, I will change it to 3. :)
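Something like this, I suppose (the pool name is just an example; the
change triggers a rebalance, so better done outside busy hours):

  # raise the replica count, then the minimum replicas required for I/O
  ceph osd pool set rbd size 3
  ceph osd pool set rbd min_size 2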
> For your specific problem, I have no idea of the root cause. If you
> have already checked your network (tuning parameters, jumbo frames
> enabled, etc.), the software versions on all the components, and
> your hardware (RAID card, system messages, ...), maybe you should
> just re-install your first OSD server. I had a big problem after an
> upgrade from Hammer to Jewel and nobody seemed to have encountered
> it doing the same operation. All the servers were configured the
> same way, but they did not have the same history. We found that the
> problem came from the different versions we had installed on some
> OSD servers (Giant -> Hammer -> Jewel). OSD servers which had never
> known the Giant version had no problem at all. On the problematic
> servers (in Jewel) we hit bugs which had been corrected years ago,
> in Giant! So we had to isolate those servers and reinstall them
> directly in Jewel: it solved the problem.
OK. I will think about it.
But all the nodes are really the same => checked every node with
rpm -Va => OK. Same tuning everywhere, network checked, OK... It came
just the day after the upgrade :)
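In case it helps someone, a quick sketch of that comparison (the host
names are just examples):

  # verify installed files against the RPM database on each node
  rpm -Va > /tmp/rpm-verify-$(hostname).txt
  # compare the installed package sets of two OSD nodes
  diff <(ssh osd1 'rpm -qa | sort') <(ssh osd2 'rpm -qa | sort')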
Thanks for your advice. We will see tonight. :)
Pascal.