On 09/03/2017 at 13:03, Vincent Godin wrote:
First of all, don't do a Ceph upgrade while your cluster
is in a warning or error state. An upgrade must be done
from a clean cluster.
Of course.
So, yesterday, I tried this for my "unfound" PG:
ceph pg 50.2dd mark_unfound_lost revert => MON crash :(
so:
ceph pg 50.2dd mark_unfound_lost delete => OK
The cluster was then HEALTH_OK, so I finally migrated everything to Jewel
10.2.6. Last night, nothing happened, everything worked fine (the fstrim
of the RBD was disabled).
Maybe it will come back next time; it always happens after two days
(scrubbing runs from 22h to 6h).
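For reference, a minimal sequence to inspect unfound objects before marking them lost, using the same PG id as above, would be something like this (exact output varies by release):

# which PGs report unfound objects, and how many
ceph health detail
# which objects are unfound in this PG, and which OSDs were probed for them
ceph pg 50.2dd query
ceph pg 50.2dd list_missing
# only once the missing copies are confirmed gone:
ceph pg 50.2dd mark_unfound_lost revert    # or delete, as above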
Don't stay with a replica count of 2. The majority of problems come
from that point: just look at the advice given by experienced
users on the list. You should set a replica count of 3 and
min_size to 2. This will prevent you from losing data because
of a double fault, which is frequent.
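For reference, assuming a pool named "rbd" (substitute your actual pool names), the change would look something like:

# raise the replica count and the minimum copies required to serve I/O
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2
# verify
ceph osd pool get rbd size
ceph osd pool get rbd min_size

Note that raising size triggers backfill to create the third copies, so expect some recovery traffic while the cluster catches up.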
I already had some faulty PGs found by the scrubbing process (disk I/O
error) and had to remove the bad PGs myself. As I understand it, with 3
replicas the repair would be automatic (see the repair sketch below).
OK, I will change to 3. :)
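About the inconsistent PGs found by scrubbing: as far as I know the repair still has to be triggered by hand even with 3 replicas; a rough sketch (the PG id here is only a placeholder) would be:

# scrub errors show up here as "inconsistent" PGs
ceph health detail
# ask the primary OSD to repair the flagged PG
ceph pg repair 50.1ab
# re-check afterwards
ceph pg deep-scrub 50.1ab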
For your specific problem, I have no idea of the root cause. If you
have already checked your network (tuning parameters, jumbo frames
enabled, etc.), the software versions on all components, and your
hardware (RAID card, system messages, ...), maybe you should just
reinstall your first OSD server. I had a big problem after an upgrade
from Hammer to Jewel, and nobody else seems to have encountered it
doing the same operation. All servers were configured the same way,
but they did not have the same history. We found that the problem came
from the different versions we had installed on some OSD servers
(Giant -> Hammer -> Jewel). OSD servers which had never run Giant had
no problem at all. On the problematic servers (running Jewel) we hit
bugs that had been fixed years ago, back in Giant! So we had to
isolate those servers and reinstall them directly with Jewel: that
solved the problem.
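In case it helps, a rough outline for draining one OSD before such a reinstall (osd.0 is a placeholder; repeat for each OSD on the host and wait for recovery in between) could be:

# stop placing data on the OSD and let the cluster rebalance
ceph osd out 0
# ... wait until "ceph -s" is back to HEALTH_OK ...
# on the OSD host, stop the daemon (systemd unit naming may differ)
systemctl stop ceph-osd@0
# remove it from the CRUSH map, delete its key, and drop it from the OSD map
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0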
OK. I will think about it.
But all nodes really are the same => I checked every node with
rpm -Va => OK. All the tuning is in place, etc., and the network checks
out... The problem showed up just the day after the upgrade :)
Thanks for your advice. We will see tonight. :)
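One more check that rpm -Va will not catch: the packages can be up to date while a daemon is still running the old binary. Something like this shows what the daemons themselves report (the mon id is assumed to be the short hostname, as ceph-deploy sets it up):

# version actually running inside each OSD
ceph tell osd.* version
# on each monitor host, query the local daemon through its admin socket
ceph daemon mon.$(hostname -s) version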
Pascal.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com