Re: Cluster unusable

Hi,


I got a recommendation from Stephan to restart the OSDs one by one.
So I did it. It helped a bit (some IOs completed), but in the end the state was the same as before, and new IOs still hung.
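
Roughly, on each node it was something like the following (a sketch from memory, using the sysvinit ceph script as in the output below; the OSD ids are just the ones hosted on that node):

for id in 0 4; do
    service ceph restart osd.$id        # restart a single OSD daemon
    sleep 30                            # give it time to rejoin and peer
    ceph osd tree | grep "osd.$id "     # check it is reported up again
done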

Loïc, thanks for the advice on bringing osd.0 and osd.4 back into the game.

Actually, this was done simply by restarting ceph on that node:
[root@qvitblhat12 ~]# date;service ceph status
Tue Dec 23 14:36:11 UTC 2014
=== osd.0 ===
osd.0: running {"version":"0.80.7"}
=== osd.4 ===
osd.4: running {"version":"0.80.7"}
[root@qvitblhat12 ~]# date;service ceph restart
Tue Dec 23 14:36:17 UTC 2014
=== osd.0 ===
=== osd.0 ===
Stopping Ceph osd.0 on qvitblhat12...kill 4527...kill 4527...done
=== osd.0 ===
create-or-move updating item name 'osd.0' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.0 on qvitblhat12...
Running as unit run-4398.service.
=== osd.4 ===
=== osd.4 ===
Stopping Ceph osd.4 on qvitblhat12...kill 5375...done
=== osd.4 ===
create-or-move updating item name 'osd.4' weight 0.27 at location {host=qvitblhat12,root=default} to crush map
Starting Ceph osd.4 on qvitblhat12...
Running as unit run-4720.service.

[root@qvitblhat06 ~]# ceph osd tree
# id    weight    type name    up/down    reweight
-1    1.62    root default
-5    1.08        datacenter dc_XAT
-2    0.54            host qvitblhat10
1    0.27                osd.1    up    1    
5    0.27                osd.5    up    1    
-4    0.54            host qvitblhat12
0    0.27                osd.0    up    1    
4    0.27                osd.4    up    1    
-6    0.54        datacenter dc_QVI
-3    0.54            host qvitblhat11
2    0.27                osd.2    up    1    
3    0.27                osd.3    up    1    
[root@qvitblhat06 ~]#

This change made Ceph rebalance the data, and then came the miracle: all PGs ended up active+clean.
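
For anyone hitting the same thing, the rebalance itself can be watched with the usual commands (nothing specific to this setup):

ceph -w              # stream cluster events, including recovery progress
ceph -s              # one-shot status: pg states and degraded object counts
ceph health detail   # lists the pgs behind any remaining warning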

[root@qvitblhat06 ~]# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set
noscrub,nodeep-scrub flag(s) set
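
That warning is only the scrub flags; once scrubbing is wanted again they can be cleared with the standard commands:

ceph osd unset noscrub        # re-enable regular scrubbing
ceph osd unset nodeep-scrub   # re-enable deep scrubbing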

Well, apart from being happy that the cluster is now healthy, I find it a little bit scary to have to shake it in one direction and then another and hope that it eventually recovers, while in the meantime my users' IOs are stuck...

So, is there a way to understand what happened?
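
If it happens again, I guess the standard things to look at would be pg query and the OSD admin sockets, something like (the pg id and osd id below are just placeholders):

ceph health detail                                                       # which pgs are stuck and where
ceph pg <pgid> query                                                     # peering/recovery state of one stuck pg
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight     # requests currently blocked in osd.0
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops      # recent slow requests on osd.0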

Francois

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
