[Jewel] upgrade 10.2.3 => 10.2.5 KO : first OSD server freeze every two days :)

Hello,

No new information. Every second night, OSD server 1 freezes with a load above 500.

It happens every 2 days: sometimes during a scrub, sometimes during fstrim,
sometimes while nothing in particular is running...

But last night, this OSD server did not come back to life after a few
minutes as before... The cluster spent 8 hours without this server and
all its OSDs (12/36).

This morning I restarted it, and now, after some hours:

HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck unclean; 
recovery 304/46002595 objects degraded (0.001%); recovery 11288/46002595 
objects misplaced (0.025%); recovery 3/9779473 unfound (0.000%)
pg 50.2dd is stuck unclean for 23531.224308, current state 
active+recovering+degraded+remapped, last acting [7,28]
pg 50.2dd is active+recovering+degraded+remapped, acting [7,28], 3 unfound
recovery 304/46002595 objects degraded (0.001%)
recovery 11288/46002595 objects misplaced (0.025%)
recovery 3/9779473 unfound (0.000%)
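
With more than one affected PG, the IDs that report unfound objects could be pulled out of this output with a small filter. This is only a sketch: `unfound_pgs` is a hypothetical helper name, and it assumes the `pg <id> is ..., N unfound` line format shown above.

```shell
# Hypothetical helper: print the IDs of PGs that report unfound objects.
# Reads `ceph health detail`-style text on stdin and assumes lines of the
# form "pg <pgid> is <state>, acting [...], N unfound".
unfound_pgs() {
    awk '$1 == "pg" && /unfound/ { print $2 }' | sort -u
}

# Usage on a live cluster (not run here):
#   ceph health detail | unfound_pgs
```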

PG 50.2dd belongs to an RBD pool (XFS filesystems on RBD images) with 2x replication.

So what is the best solution?

#ceph pg 50.2dd mark_unfound_lost delete

or

#ceph pg 50.2dd mark_unfound_lost revert

?
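
Before choosing either one, it may help to look at which objects are unfound and which OSDs have been probed for them. A minimal sketch, assuming the Jewel-era `list_missing` and `query` PG subcommands; `inspect_unfound` and the `CEPH` override variable are hypothetical names used only for this illustration.

```shell
# CEPH is a hypothetical override so the function can be dry-run
# (for example with CEPH=echo); it defaults to the real ceph CLI.
CEPH=${CEPH:-ceph}

# Hypothetical helper: show details about the unfound objects of one PG.
inspect_unfound() {
    pgid=$1
    # List the objects this PG currently considers missing or unfound.
    $CEPH pg "$pgid" list_missing
    # Dump the PG state, including recovery information about which
    # OSDs might still hold the unfound objects.
    $CEPH pg "$pgid" query
}

# Usage on a live cluster (not run here):
#   inspect_unfound 50.2dd
```

If an OSD listed as a possible source is merely down, bringing it back up may let recovery find the objects without marking anything lost.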

Which of the two has more impact on the RBD/XFS filesystem? Is an xfs_repair
required afterwards?

So I will probably try Ceph version 10.2.6 this evening, because I
really found nothing to fix...

Why this freeze? Why does only this OSD server freeze and not the others? Why
every 2 days? It's crazy.

I have already checked everything: disks, network, software. All servers are identical.

(All the issues started the day after the upgrade from 10.2.3 to 10.2.5.)

Thanks for your help.

Regards,

On 02/03/2017 at 15:34, pascal.pucci at pci-conseil.net wrote:
>
> Hello,
>
> So, I may need some advice: one week ago (on 19 Feb), I upgraded
> my stable Ceph Jewel from 10.2.3 to 10.2.5 (yes, it was maybe a bad idea).
>
> I never had a problem with Ceph 10.2.3 since the previous upgrade, on 23
> September.
>
> So since my upgrade (10.2.5), every 2 days, the first OSD server
> totally freezes. The load goes above 500 and comes back down after some
> minutes. I lose all OSDs from this server (12/36) during the issue.
>
>
[...]