Re: [Gluster-users] When will 3.6 be considered stable? (was: Replace brick 3.4.2 with 3.6.2?)

Sorry for the delay... Long day of flights... OK. Here goes my attempt to explain what was happening:

First, my setup. I am using a replica 2 setup with four nodes:

gfsib01a
gfsib01b
gfsib02a
gfsib02b

where 1a/1b and 2a/2b are the replica pairs.

I am using a number of segregated networks.

gfsib01a 10.200.70.1
gfsib01b 10.200.71.1
gfsib02a 10.200.70.2
gfsib02b 10.200.71.2

where 10.200.x.x is my InfiniBand network. Gluster is also connected to my supercomputer nodes on a 10.214.x.x network through the gigabit interface.

Our DNS resolves gfsib01a to the 10.200.x.x network. When our initial system was set up and we were accessing gluster from outside the InfiniBand network (i.e., from a machine with no InfiniBand card, and therefore no access to the 10.200 network), we overrode the DNS entries by placing the following in the /etc/hosts file on that machine:

/etc/hosts [only done on machines without access to the 10.200 IB network]:
10.214.70.1 gfsib01a
10.214.71.1 gfsib01b
10.214.70.2 gfsib02a
10.214.71.2 gfsib02b

This setup was recommended by the Red Hat engineers who came out to demo gluster for us a year or two ago, and is how we were instructed to set up multi-network access with gluster. Basically, it tricked name resolution so that gfsib01a.corvidtec.com resolved to an address that was reachable from a node without access to the 10.200 network.

10.200 traffic would be routed through ib0 on nodes where there was an IB card. 10.214 traffic would be routed through eth0 on nodes where there was no IB card, and hence, no access to the 10.200 network.

This worked for us until we upgraded to 3.6.3. At that point, we ran into issues where some of the nodes would mount /homegfs and some would fail with timeout issues. For those that did mount (430 of the 1500 nodes completed the mount; the rest timed out), /homegfs was accessible. However, when I tried to switch to a user whose home directory was on /homegfs, the login would sit there for roughly 20-30 seconds before completing; something in the ssh connection was taking a very long time. Once you were connected, everything behaved normally without any performance issues.

Now begins my best guess as to what happened, with my fully admitted novice-level understanding of how this works. Let the speculation begin... It looks like something changed in 3.6.3 in the name resolution/IP handling. My best guess is that FUSE needs to "see" all of the nodes in order to write to them. When I mounted gfsib01a using "10.214.70.1:/homegfs /homegfs", it found gfsib01a without any issues. However, it looks like 3.6.3 now hands the 10.200.x.x addresses back to the FUSE mount for the other nodes in the volume (gfsib01b, gfsib02a, gfsib02b), at which point the connections fail because the node has no route to the 10.200 network. I fixed this by adding a route on the nodes so that 10.200 traffic goes out the 10.214 ethernet port, and removing the DNS adjustments in /etc/hosts.
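In case it helps anyone else, the workaround on an ethernet-only client looked roughly like the sketch below. This is only a sketch, not an exact transcript of what I ran; the /16 netmask, the eth0 interface name, and the use of an on-link dev route are assumptions on my part:

```shell
# Sketch of the workaround on a client with no IB card.
# Assumptions: eth0 is the gigabit interface, the IB space is a /16.

# 1. Remove the gfsib0* override entries from /etc/hosts so the
#    hostnames resolve to their real 10.200.x.x addresses again.

# 2. Send 10.200.x.x (IB) traffic out the gigabit interface, so the
#    brick addresses handed back to the FUSE client are reachable:
ip route add 10.200.0.0/16 dev eth0

# 3. Mount as before, pointing at the 10.214.x.x address:
mount -t glusterfs 10.214.70.1:/homegfs /homegfs
```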

Again, I am guessing here, but do you know if the name resolution that is passed back to clients changed in 3.6.3? Did it send back the machine names (gfsib01a, gfsib01b, gfsib02a, gfsib02b) prior to 3.6.3, and does it now send back IP addresses? Or something along those lines?

Once I added the routes and eliminated the "spoofing" in the /etc/hosts file, everything worked fine.
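For anyone wanting to double-check the same thing on their own clients, a couple of quick diagnostics (the specific hostname and address here are just from my setup):

```shell
# Verify the client now resolves the brick hostname to its real
# 10.200.x.x address (i.e., no /etc/hosts override in play):
getent hosts gfsib01b

# Verify the added route actually covers the IB address space:
ip route get 10.200.71.1
```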

On a more positive note, 3.6.3 does seem to be behaving well. The previous heal failures have been cleaned up and it no longer continually shows failed heals. The only thing I have noticed is that I am getting a lot of these in the logs:

[2015-05-06 04:25:15.293175] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.293184] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.293192] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.293200] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375447] D [cli-cmd-volume.c:1825:cli_check_gsync_present] 0-cli: Returning 0
[2015-05-06 04:25:15.375511] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375522] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375538] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375552] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375562] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375572] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375581] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375588] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375597] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375604] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375614] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375622] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375629] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375635] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375642] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375652] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375659] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375667] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375674] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375681] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375688] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375695] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375702] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375708] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375716] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375723] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375730] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375737] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375743] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375751] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375760] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375767] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375775] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375782] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375790] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375803] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375811] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375819] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375826] D [registry.c:408:cli_cmd_register] 0-cli: Returning 0
[2015-05-06 04:25:15.375879] T [cli.c:264:cli_rpc_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375896] T [cli-quotad-client.c:94:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_CONNECT
[2015-05-06 04:25:15.375911] I [socket.c:2353:socket_event_handler] 0-transport: disconnecting now
[2015-05-06 04:25:15.375938] T [cli-quotad-client.c:100:cli_quotad_notify] 0-glusterfs: got RPC_CLNT_DISCONNECT
[2015-05-06 04:25:15.376003] T [rpc-clnt.c:1381:rpc_clnt_record] 0-glusterfs: Auth Info: pid: 0, uid: 0, gid: 0, owner:
[2015-05-06 04:25:15.376036] T [rpc-clnt.c:1238:rpc_clnt_record_build_header] 0-rpc-clnt: Request fraglen 152, payload: 88, rpc hdr: 64
[2015-05-06 04:25:15.376252] T [socket.c:2872:socket_connect] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x1e0)[0x30fb620550] (--> /usr/lib64/glusterfs/3.6.3/rpc-transport/socket.so(+0x72d3)[0x7f95db4232d3] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_submit+0x468)[0x30fbe0efe8] (--> gluster(cli_submit_request+0xdb)[0x40a8fb] (--> gluster(cli_cmd_submit+0x8e)[0x40b6fe] ))))) 0-glusterfs: connect () called on transport already connected
[2015-05-06 04:25:15.376275] T [rpc-clnt.c:1573:rpc_clnt_submit] 0-rpc-clnt: submitted request (XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 27) to rpc-transport (glusterfs)
[2015-05-06 04:25:15.376297] D [rpc-clnt-ping.c:231:rpc_clnt_start_ping] 0-glusterfs: ping timeout is 0, returning
[2015-05-06 04:25:15.381486] T [rpc-clnt.c:660:rpc_clnt_reply_init] 0-glusterfs: received rpc message (RPC XID: 0x1 Program: Gluster CLI, ProgVers: 2, Proc: 27) from rpc-transport (glusterfs)
[2015-05-06 04:25:15.381524] D [cli-rpc-ops.c:6649:gf_cli_status_cbk] 0-cli: Received response to status cmd
[2015-05-06 04:25:15.381712] D [cli-cmd.c:384:cli_cmd_submit] 0-cli: Returning 0
[2015-05-06 04:25:15.381731] D [cli-rpc-ops.c:6912:gf_cli_status_volume] 0-cli: Returning: 0
[2015-05-06 04:25:15.381739] D [cli-cmd-volume.c:1930:cli_cmd_volume_status_cbk] 0-cli: frame->local is not NULL (0x7f95cc0009c0)
[2015-05-06 04:25:15.381764] I [input.c:36:cli_batch] 0-: Exiting with: 0

David


------ Original Message ------
From: "Justin Clift" <justin@xxxxxxxxxxx>
To: "Kingsley Tart - Barritel" <kingsley.tart@xxxxxxxxxxxx>
Cc: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx>
Sent: 5/5/2015 10:11:50 AM
Subject: Re: [Gluster-users] When will 3.6 be considered stable? (was: Replace brick 3.4.2 with 3.6.2?)

On 5 May 2015, at 14:39, Kingsley Tart - Barritel <kingsley.tart@xxxxxxxxxxxx> wrote:
 On Thu, 2015-02-26 at 21:24 +0000, Justin Clift wrote:
When will 3.6 be considered stable? I'm waiting to deploy a cluster into a production environment. I've built a 3.6.2 cluster and have put a live
 copy of the data onto it, which took rsync a solid 2 weeks to do. I
 don't really want to go through that again if I can help it.

We thought it was - including getting it tested fairly intensively at several places beforehand - until bugs started showing up when people deployed it to production.

 We're actively working on a 3.6.3 release, fixing the reported bugs,
 and should have a beta out in the near-ish future.  (3.6.3beta1 came
 out on 11th Feb, we're still working on a few more patches)

 Hi Justin,

apologies for emailing directly, but it seems better than emailing the whole list on this particular issue, especially as we've already spoken
 about it.

 I've seen that 3.6.3 has been out a little while now, but I've been
holding off in case any other issues came to light. How is it all going
 with 3.6.3? Ideally I'd like to upgrade our 3.6.2 setup if 3.6.3 is
 considered good, but I don't know whether it's still early days.

3.6.3 *should* be better than 3.6.2. David Robinson (CC'd) mentioned he
saw a change in client connectivity behaviour though, which is worth
knowing about beforehand. I don't know the details, though David mentioned
he'll send info about it through to the mailing list.

I'd wait until then, read through it when it arrives, and then proceed with
planning and upgrading.

Hope that helps. :)

+ Justin


 --
 Cheers,
 Kingsley.


--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift


_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel



