Re: Change transport-type on volume from tcp to rdma, tcp

I can confirm your summary… Everything looks OK with the TCP transport and is more or less unstable with the RDMA one. But TCP is slower than RDMA…

Attached you will find my volume mount log, all the brick logs, and some information about my vol_shared volume.

Thanks in advance,
Geoffrey

PS: sorry in advance for my slow replies, but I will be on vacation (from this evening) very far from any internet access.

Attachment: vol_shared.tgz
Description: Binary data


------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 22 Jul 2015, at 10:45, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:



On 07/22/2015 01:36 PM, Geoffrey Letessier wrote:
Oops, I forgot to add everyone in CC.

Yes, I guessed.

With the TCP protocol, all my volumes seem OK and, for the moment, I have not noticed any hang.

So if I understand correctly, everything is fine with tcp (no hang, no "transport endpoint is not connected" error), and both happen with rdma. Please correct me if that is not the case.



mount command:
- with RDMA: mount -t glusterfs -o transport=rdma,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /mnt
- with TCP:    mount -t glusterfs -o transport=tcp,direct-io-mode=disable,enable-ino32 ib-storage1:vol_home /mnt
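
For completeness, the rdma mount can also be written as an fstab entry; this is only a sketch (the _netdev option is my addition, to delay the mount until the network is up):

  ib-storage1:vol_home  /mnt  glusterfs  transport=rdma,direct-io-mode=disable,enable-ino32,_netdev  0 0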

volume status:
# gluster volume status all
Status of volume: vol_home
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ib-storage1:/export/brick_home/brick1
/data                                       49159     49165      Y       6547 
Brick ib-storage2:/export/brick_home/brick1
/data                                       49161     49173      Y       24348
Brick ib-storage3:/export/brick_home/brick1
/data                                       49152     49156      Y       5616 
Brick ib-storage4:/export/brick_home/brick1
/data                                       49152     49162      Y       5424 
Brick ib-storage1:/export/brick_home/brick2
/data                                       49160     49166      Y       6548 
Brick ib-storage2:/export/brick_home/brick2
/data                                       49162     49174      Y       24355
Brick ib-storage3:/export/brick_home/brick2
/data                                       49153     49157      Y       5635 
Brick ib-storage4:/export/brick_home/brick2
/data                                       49153     49163      Y       5443 
Self-heal Daemon on localhost               N/A       N/A        Y       6534 
Self-heal Daemon on ib-storage3             N/A       N/A        Y       7656 
Self-heal Daemon on ib-storage2             N/A       N/A        Y       24519
Self-heal Daemon on ib-storage4             N/A       N/A        Y       7288 
 
Task Status of Volume vol_home
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_shared
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ib-storage1:/export/brick_shared/data 49152     49164      Y       6554 
Brick ib-storage2:/export/brick_shared/data 49152     49172      Y       24362
Self-heal Daemon on localhost               N/A       N/A        Y       6534 
Self-heal Daemon on ib-storage3             N/A       N/A        Y       7656 
Self-heal Daemon on ib-storage2             N/A       N/A        Y       24519
Self-heal Daemon on ib-storage4             N/A       N/A        Y       7288 
 
Task Status of Volume vol_shared
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_workdir_amd
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ib-storage1:/export/brick_workdir/bri
ck1/data                                    49191     49192      Y       6555 
Brick ib-storage3:/export/brick_workdir/bri
ck1/data                                    49164     49165      Y       6368 
Brick ib-storage1:/export/brick_workdir/bri
ck2/data                                    49193     49194      Y       6576 
Brick ib-storage3:/export/brick_workdir/bri
ck2/data                                    49166     49167      Y       6387 
 
Task Status of Volume vol_workdir_amd
------------------------------------------------------------------------------
There are no active volume tasks
 
Status of volume: vol_workdir_intel
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ib-storage2:/export/brick_workdir/bri
ck1/data                                    49175     49176      Y       24371
Brick ib-storage2:/export/brick_workdir/bri
ck2/data                                    49177     49178      Y       24372
Brick ib-storage4:/export/brick_workdir/bri
ck1/data                                    49164     49165      Y       5571 
Brick ib-storage4:/export/brick_workdir/bri
ck2/data                                    49166     49167      Y       5590 
 
Task Status of Volume vol_workdir_intel
------------------------------------------------------------------------------
There are no active volume tasks

Concerning the brick logs, do you want the logs for all bricks on every server?
Any errors from the client log and the brick logs, plus any log entries with a message ID between 102000 and 104000 from the same logs.
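
For example, something like this should pull those entries, assuming the default log locations:

  # matches MSGIDs 102000-104999, slightly wider than the requested range
  grep -E 'MSGID: 10[2-4][0-9]{3}' /var/log/glusterfs/*.log /var/log/glusterfs/bricks/*.log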

Rafi KC


Geoffrey

------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 22 Jul 2015, at 10:00, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:



On 07/22/2015 12:55 PM, Geoffrey Letessier wrote:
Concerning the hang, I saw it only once with the TCP protocol; so actually, RDMA seems to be the cause.

If you are mounting a tcp,rdma volume using the tcp protocol, all communication will go through the tcp connection, and rdma will not be involved between client and server.

… And, after a moment (a few minutes after restarting my back-transfer of around 40 TB), my volume fell down (and all my rsync processes with it):
[root@atlas ~]# df -h /mnt
df: « /mnt »: Noeud final de transport n'est pas connecté
df: aucun système de fichiers traité
aka "transport endpoint is not connected »

Can you send me the following details, if possible?
1) the mount command used, 2) volume status, 3) client and brick logs
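
For instance, a rough sketch to collect all of it (log paths assume a default install):

  gluster volume status all > status.txt
  gluster volume info > info.txt
  tar czf gluster-logs.tgz status.txt info.txt /var/log/glusterfs/*.log /var/log/glusterfs/bricks/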

Regards
Rafi KC


Geoffrey


------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 22 Jul 2015, at 09:17, Geoffrey Letessier <geoffrey.letessier@xxxxxxx> wrote:

Hi Rafi,

That is what I do. But I notice this kind of trouble particularly when I mount my volumes manually.

In addition, when I changed my transport-type from tcp or rdma to tcp,rdma, I had to restart my volumes for the change to take effect.
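
A sketch of that kind of change, assuming the standard config.transport volume option (the volume must be stopped first):

  gluster volume stop vol_home
  gluster volume set vol_home config.transport tcp,rdma
  gluster volume start vol_home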

I wonder whether these troubles are due to the RDMA protocol… because everything looks more stable with the TCP one.

Any other idea?
Thanks for your reply, and thanks in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 22 Jul 2015, at 07:33, Mohammed Rafi K C <rkavunga@xxxxxxxxxx> wrote:



On 07/22/2015 04:51 AM, Geoffrey Letessier wrote:
Hi Niels,

Thanks for replying. 

In fact, after checking the logs, I discovered that GlusterFS tried to connect to a brick on a TCP (or RDMA) port allocated to another volume… (a bug?)
For example, here is an extract from my workdir.log file:
[2015-07-21 21:34:01.820188] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-0: connection to 10.0.4.1:49161 failed (Connexion refusée)
[2015-07-21 21:34:01.822563] E [socket.c:2332:socket_connect_finish] 0-vol_workdir_amd-client-2: connection to 10.0.4.1:49162 failed (Connexion refusée)

But those two ports (49161 and 49162) belong only to my vol_home volume, not to vol_workdir_amd.
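
One quick way to check which brick process actually owns those ports (run on ib-storage1; ss comes from iproute2):

  ss -tlnp | grep -E '4916[12]'    # which process is really listening on 49161/49162
  gluster volume status vol_home   # which volume glusterd thinks the ports belong to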

Now, after restarting all the glusterd daemons simultaneously (pdsh -w cl-storage[1-4] service glusterd restart), everything seems to be back to a normal situation (size, write permissions, etc.).

But, a few minutes later, I noticed a strange thing I have been seeing since I upgraded my storage cluster from 3.5.3 to 3.7.2-3: when I try to mount some volumes (particularly my vol_shared volume, a replicated one), my system can hang… And, because I use that volume in my bashrc file for my environment modules, I have to restart the node. The same thing happens if I run df on the mounted volume (when it does not already hang during the mount).

With the tcp transport-type, the situation seems to be more stable.

In addition: if I restart a storage node, I cannot use the Gluster CLI (it hangs as well).

Do you have any idea?

Are you using a bash script to start/mount the volume? If so, add a sleep between volume start and mount, to allow all the processes to start properly, because the RDMA protocol takes some time to initialize its resources.
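
Something along these lines, as a sketch (the 10-second value and the /shared mount point are only placeholders):

  gluster volume start vol_shared
  sleep 10   # give the RDMA transport time to initialize its resources
  mount -t glusterfs -o transport=rdma ib-storage1:vol_shared /shared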

Regards
Rafi KC




Once more, thanks a lot for your help,
Geoffrey

------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 21 Jul 2015, at 23:49, Niels de Vos <ndevos@xxxxxxxxxx> wrote:

On Tue, Jul 21, 2015 at 11:20:20PM +0200, Geoffrey Letessier wrote:
Hello Soumya, Hello everybody,

network.ping-timeout was set to 42 seconds. I set it to 0, but it made
no difference. The problem was that, after re-setting the transport-type
to rdma,tcp, some bricks went down after a few minutes. Despite
restarting the volumes, after a few more minutes some [other/different]
bricks went down again.

I'm not sure if the ping-timeout is handled differently when RDMA is
used. Adding two of the guys who know RDMA well on CC.

Now, after re-creating my volume, the bricks stay alive but, oddly, I am
not able to write to my volume. In addition, I defined a distributed
volume with 2 servers and 4 bricks of 250GB each, and my final volume
seems to be sized at only 500GB… It is astonishing…

As seen further below, the 500GB volume size is caused by two
unreachable bricks. When bricks are not reachable, their size cannot be
detected by the client, and therefore 2x 250 GB is missing.
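
Concretely: 4 bricks x 250 GB = 1 TB expected, but with two bricks unreachable (apparently both on ib-storage1, judging from the log below) the client can only sum 2 x 250 GB = 500 GB, which matches the df output.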

It is unclear to me why writing to a pure distributed volume fails. When
a brick is not reachable and a file should be created there, the file
would normally get created on another brick. When the brick that should
have the file comes online, and a new lookup for the file is done, a
so-called "link file" is created, which points to the file on the other
brick. I guess the failure has to do with the connection issues, and I
would suggest solving those first.

HTH,
Niels


Here you can find some information:
# gluster volume status vol_workdir_amd
Status of volume: vol_workdir_amd
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick ib-storage1:/export/brick_workdir/bri
ck1/data                                    49185     49186      Y       23098
Brick ib-storage3:/export/brick_workdir/bri
ck1/data                                    49158     49159      Y       3886 
Brick ib-storage1:/export/brick_workdir/bri
ck2/data                                    49187     49188      Y       23117
Brick ib-storage3:/export/brick_workdir/bri
ck2/data                                    49160     49161      Y       3905 

# gluster volume info vol_workdir_amd

Volume Name: vol_workdir_amd
Type: Distribute
Volume ID: 087d26ea-c6df-4cbe-94af-ecd87b59aedb
Status: Started
Number of Bricks: 4
Transport-type: tcp,rdma
Bricks:
Brick1: ib-storage1:/export/brick_workdir/brick1/data
Brick2: ib-storage3:/export/brick_workdir/brick1/data
Brick3: ib-storage1:/export/brick_workdir/brick2/data
Brick4: ib-storage3:/export/brick_workdir/brick2/data
Options Reconfigured:
performance.readdir-ahead: on

# pdsh -w storage[1,3] df -h /export/brick_workdir/brick{1,2}
storage3: Filesystem            Size  Used Avail Use% Mounted on
storage3: /dev/mapper/st--block1-blk1--workdir
storage3:                       250G   34M  250G   1% /export/brick_workdir/brick1
storage3: /dev/mapper/st--block2-blk2--workdir
storage3:                       250G   34M  250G   1% /export/brick_workdir/brick2
storage1: Filesystem            Size  Used Avail Use% Mounted on
storage1: /dev/mapper/st--block1-blk1--workdir
storage1:                       250G   33M  250G   1% /export/brick_workdir/brick1
storage1: /dev/mapper/st--block2-blk2--workdir
storage1:                       250G   33M  250G   1% /export/brick_workdir/brick2

# df -h /workdir/
Filesystem            Size  Used Avail Use% Mounted on
localhost:vol_workdir_amd.rdma
                     500G   67M  500G   1% /workdir

# touch /workdir/test
touch: impossible de faire un touch « /workdir/test »: Aucun fichier ou dossier de ce type
aka "No such file or directory"

# tail -30l /var/log/glusterfs/workdir.log 
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:33.927673] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:37.877231] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
[2015-07-21 21:10:37.880556] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
[2015-07-21 21:10:37.914661] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:37.923535] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:41.883925] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
[2015-07-21 21:10:41.887085] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
[2015-07-21 21:10:41.919394] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:41.932622] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:44.682636] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:44.682947] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:44.683240] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:44.683472] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-0
[2015-07-21 21:10:44.683506] W [dht-diskusage.c:48:dht_du_info_cbk] 0-vol_workdir_amd-dht: failed to get disk info from vol_workdir_amd-client-2
[2015-07-21 21:10:44.683532] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:44.683551] W [fuse-bridge.c:1970:fuse_create_cbk] 0-glusterfs-fuse: 18: /test => -1 (Aucun fichier ou dossier de ce type)
[2015-07-21 21:10:44.683619] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:44.683846] W [dht-layout.c:189:dht_layout_search] 0-vol_workdir_amd-dht: no subvolume for hash (value) = 1072520554
[2015-07-21 21:10:45.886807] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-0: changing port to 49173 (from 0)
[2015-07-21 21:10:45.893059] I [rpc-clnt.c:1819:rpc_clnt_reconfig] 0-vol_workdir_amd-client-2: changing port to 49174 (from 0)
[2015-07-21 21:10:45.920434] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-0: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1021 peer:10.0.4.1:49173)
Host Unreachable, Check your connection with IPoIB
[2015-07-21 21:10:45.925292] W [rdma.c:1263:gf_rdma_cm_event_handler] 0-vol_workdir_amd-client-2: cma event RDMA_CM_EVENT_REJECTED, error 8 (me:10.0.4.1:1020 peer:10.0.4.1:49174)
Host Unreachable, Check your connection with IPoIB
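
Given the repeated "Host Unreachable, Check your connection with IPoIB" lines, a first sanity check could be something like (ibstat comes from infiniband-diags):

  ping -c 3 10.0.4.1              # the IPoIB address the client fails to reach
  ibstat | grep -E 'State|Rate'   # the HCA port state should be Active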

I have been using GlusterFS in production for around 3 years without any
blocking problem, but the situation has been terrible for more than 3
weeks… Indeed, our production has been down for roughly 3.5 weeks (with
many different problems, first with GlusterFS v3.5.3 and now with
3.7.2-3), and I need to get it back up…

Thanks in advance,
Geoffrey
------------------------------------------------------
Geoffrey Letessier
Responsable informatique & ingénieur système
UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
Institut de Biologie Physico-Chimique
13, rue Pierre et Marie Curie - 75005 Paris
Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx

On 21 Jul 2015, at 19:36, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:

From the following errors,

[2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
[2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
[2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible

it looks like setting the TCP_USER_TIMEOUT value to 0 on the socket failed with the error (IIUC) "Protocol not available" ("Protocole non disponible").
Could you check whether 'network.ping-timeout' is set to zero for that volume, using 'gluster volume info'? Anyway, from the code it looks like TCP_USER_TIMEOUT can take the value zero; I am not sure why it failed.
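
For example:

  gluster volume info vol_shared | grep ping-timeout   # only shown if the option was explicitly reconfigured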

Niels, any thoughts?

Thanks,
Soumya

On 07/21/2015 08:15 PM, Geoffrey Letessier wrote:
[2015-07-21 14:36:30.495321] I [MSGID: 114020] [client.c:2118:notify] 0-vol_shared-client-0: parent translators are ready, attempting connect on transport
[2015-07-21 14:36:30.498989] W [socket.c:923:__socket_keepalive] 0-socket: failed to set TCP_USER_TIMEOUT 0 on socket 12, Protocole non disponible
[2015-07-21 14:36:30.499004] E [socket.c:3015:socket_connect] 0-vol_shared-client-0: Failed to set keep-alive: Protocole non disponible

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
