Re: Cluster unusable

Hi,


I got a recommendation from Stephan to restart the OSDs one by one.
So I did it. It helped a bit (some IOs completed), but in the end the state was the same as before, and new IOs still hung.
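
Roughly, on each node it was something like the following (a sketch from memory, using the sysvinit ceph script as in the output below; the OSD ids are just the ones hosted on that node):

for id in 0 4; do
    service ceph restart osd.$id        # restart a single OSD daemon
    sleep 30                            # give it time to rejoin and peer
    ceph osd tree | grep "osd.$id "     # check it is reported up again
done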

Loïc, thanks for the advice on bringing osd.0 and osd.4 back into the game.

Actually, this was done simply by restarting ceph on that node:
[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
=== osd.4 ===
osd.4: running {"version":"0.80.7"}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight    type name    up/down    reweight
-1    1.62    root default
-5    1.08        datacenter dc_XAT
-2    0.54            host qvitblhat10
1    0.27                osd.1    up    1    
5    0.27                osd.5    up    1    
-4    0.54            host qvitblhat12
0    0.27                osd.0    up    1    
4    0.27                osd.4    up    1    
-6    0.54        datacenter dc_QVI
-3    0.54            host qvitblhat11
2    0.27                osd.2    up    1    
3    0.27                osd.3    up    1    
[root@qvitblhat06 ~]#

This change made Ceph rebalance the data, and then came the miracle: all PGs ended up active+clean.
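
For anyone hitting the same thing, the rebalance itself can be watched with the usual commands (nothing specific to this setup):

ceph -w              # stream cluster events, including recovery progress
ceph -s              # one-shot status: pg states and degraded object counts
ceph health detail   # lists the pgs behind any remaining warning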

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set
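
That warning is only the scrub flags; once scrubbing is wanted again they can be cleared with the standard commands:

ceph osd unset noscrub        # re-enable regular scrubbing
ceph osd unset nodeep-scrub   # re-enable deep scrubbing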

Well, apart from being happy that the cluster is now healthy, I find it a little bit scary to have to shake it in one direction and then another and hope that it eventually recovers, while in the meantime my users' IOs are stuck...

So, is there a way to understand what happened?
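
If it happens again, I guess the standard things to look at would be pg query and the OSD admin sockets, something like (the pg id and osd id below are just placeholders):

ceph health detail                                                       # which pgs are stuck and where
ceph pg <pgid> query                                                     # peering/recovery state of one stuck pg
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight     # requests currently blocked in osd.0
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops      # recent slow requests on osd.0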

Francois

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
