Re: cluster unavailable for 20 mins when downed server was reintroduced

On Tue, 15 Aug 2017, Sean Purdy said:
> Luminous 12.1.1 rc1
> 
> Hi,
> 
> 
> I have a three node cluster with 6 OSD and 1 mon per node.
> 
> I had to turn off one node for rack reasons.  While the node was down, the cluster was still running and accepting files via radosgw.  However, when I turned the machine back on, radosgw uploads stopped working and things like "ceph status" started timing out.  It took 20 minutes for "ceph status" to be OK.

Well, I've figured out why "ceph status" was hanging (and possibly radosgw).  It seems that the ceph utility looks at ceph.conf to find a monitor to connect to (or at least that's what strace implied), but our ceph.conf listed only one of the three monitors, and that was the node I turned off.  Updating mon_initial_members and mon_host to include the other two monitors fixed it.
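
For anyone wanting to check their own ceph.conf for the same problem, here is a rough sketch of a sanity check.  It is only an illustration: the helper name listed_monitors, the node names, and the 192.0.2.x addresses are all made up, and it assumes the monitors are listed under [global] as mon_host (or "mon host").  It uses Python's stock configparser, not anything from Ceph itself.

```python
import configparser


def listed_monitors(conf_text):
    """Return the monitor addresses listed under mon_host in [global].

    Hypothetical helper for illustration; not part of any Ceph tooling.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("global"):
        return []
    for key, value in parser.items("global"):
        # Treat "mon host" and "mon_host" as the same option name.
        if key.replace(" ", "_") == "mon_host":
            # Entries may be separated by commas and/or whitespace.
            return [h for h in value.replace(",", " ").split() if h]
    return []


# Made-up hosts and addresses, for illustration only.
example = """
[global]
mon_initial_members = node1, node2, node3
mon_host = 192.0.2.11, 192.0.2.12, 192.0.2.13
"""

monitors = listed_monitors(example)
if len(monitors) < 3:
    print("warning: only %d monitor(s) listed: %s" % (len(monitors), monitors))
```

With only one address in mon_host, as in our original file, the warning fires, which is the situation that left clients with nowhere to go once that node was off.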

TBF, https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/1.3/html/administration_guide/managing_cluster_size does mention you should add your second and third monitors here.  But I hadn't read that, and elsewhere I read that on boot the monitors will discover other monitors, so I thought you didn't need to list them all.  e.g. http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address (which also says clients use ceph.conf to find monitors - I missed that part).

Anyway, I'll do a few more tests with a better ceph.conf.


Sean Purdy
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


