Hello Megan,

On 04/06/15 08:23 -0400, Megan . wrote:
> On Wed, Jun 3, 2015 at 10:31 AM, Megan . <nagemnna@xxxxxxxxx> wrote:
[...]
> FYI - i talked to our network folks and it looks like they were doing some
> testing last night with port failover which may or may not have caused this

Unlikely, unless you were "lucky" enough to reach a different physical
machine behind the network address than you intended, or modclusterd
was fragile enough to break on these intermittent changes (TBH, I am not
exactly sure what you mean by "port failover").

The indicated error:

>> Error: ClientSocket(String): connect() failed: No such file or directory

means that modclusterd on the particular node was not running (by itself,
this is still OK) and could not be started within 5 seconds, which is
what modcluster (ricci's helper, but from the clustermon package) tries
to do if the socket /var/run/clumond.sock (the indication of a running
modclusterd) cannot be reached (for whatever reason, including SELinux,
but that should be OK as well).

So if the problem recurs, definitely check on the troubled node whether:

- the modclusterd service is running and/or is able to start (i.e., to
  provide the /var/run/clumond.sock socket) within 5 seconds or so under
  the typical workload (this may be subtle in a virtualized environment)

- when modclusterd is started, /var/run/clumond.sock exists and has
  the expected properties (file-like socket, expected permissions)

- the SELinux audit log (if SELinux is enabled) contains any reference
  to clumond.sock or modclusterd

> issue.  However, I was able to correct it by fencing the problem nodes.

Provided that those "port failover" shake-ups had settled down by that
time, perhaps modclusterd simply became happy again and stopped failing,
if that was what had been happening previously.

>> Anybody ever seen "Error: ClientSocket(String): connect() failed: No such
>> file or directory" when doing a start all?  Something seems to have
>> broken with our cluster.  Our UAT setup works as expected.  I looked at
>> tcpdumps the best that i could (i'm not a network person though) and i
>> didn't see anything obvious.  I shutdown iptables on all nodes.

FWIW, most if not all packet sniffing tools cannot hook into local
file-like sockets.

>> We are running CentOS 6.6, ccs-0.16.2-75.el6_6.1.x86_64

Good, this excludes all known (and fixed!) bugs preventing modclusterd
from operating (IPv4-only environment, huge cluster.conf).

>> cman-3.0.12.1-68.el6.x86_64.  We have a 12 node cluster in production that
>> allows us to share gfs2 iscsi mounts.  no other services are used.  clvmd
>> -R runs fine at this time.  ccs -h node --sync --activate also runs fine.
>>
>>
>> [root@admin1 ~]# ccs -h admin1-ops --startall
>> Unable to start map1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started cache2-ops
>> Unable to start data1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started map2-ops
>> Unable to start archive1-ops, possibly due to lack of quorum, try
>> --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started data3-ops
>> Started mgmt1-ops
>> Unable to start admin1-ops, possibly due to lack of quorum, try --startall
>> Error: ClientSocket(String): connect() failed: No such file or directory
>> Started data2-ops
>> Started cache1-ops

The out-of-context, hilarious hint (to use --startall when you actually
do) led me to file a bug: <https://bugzilla.redhat.com/1234515>.
Thanks for indirectly pointing this out!

--
Jan
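
P.S. The socket-related part of the checks above (does /var/run/clumond.sock exist, is it really a socket, what are its permissions, can it be connected to within a few seconds) could be sketched in Python roughly as follows. This is only a minimal sketch and not anything shipped with clustermon: the function name check_unix_socket and the keys of the returned report are made up for illustration, and a stream-type socket is assumed. It covers neither the service status nor the SELinux audit check.

```python
import os
import socket
import stat

# Path taken from the message above. The 5-second default mirrors the
# "5 seconds or so" mentioned there; treat it as an assumption.
CLUMOND_SOCK = "/var/run/clumond.sock"


def check_unix_socket(path=CLUMOND_SOCK, timeout=5.0):
    """Hypothetical helper: report whether `path` exists, whether it is
    a socket, its permission bits, and whether connect() succeeds."""
    report = {"exists": False, "is_socket": False, "mode": None,
              "connectable": False, "error": None}
    try:
        st = os.stat(path)
    except OSError as exc:
        # Typically "No such file or directory" when the daemon is down.
        report["error"] = str(exc)
        return report

    report["exists"] = True
    report["is_socket"] = stat.S_ISSOCK(st.st_mode)
    report["mode"] = oct(stat.S_IMODE(st.st_mode))
    if not report["is_socket"]:
        return report

    # Assumption: the daemon listens on a stream-type Unix domain socket.
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(path)
        report["connectable"] = True
    except OSError as exc:
        report["error"] = str(exc)
    finally:
        client.close()
    return report


if __name__ == "__main__":
    for key, value in check_unix_socket().items():
        print("%-12s %s" % (key + ":", value))
```

Run as root on the affected node right after the error shows up; "exists: False" would match the "No such file or directory" case, while "exists: True" with "connectable: False" would rather point at permissions or SELinux.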
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster