Issue in RDMA transport

Hello!

I found some memory corruption in the RDMA transport layer.

Setup is CentOS 5.5, Mellanox OFED 1.5.2 / OpenFabrics OFED 1.5.2,
ConnectX-2 cards, GlusterFS 3.1.2 / Git Master Branch.

The application is ANSYS CFX with transient cases, running on unusual
core counts such as 6 or 12.

The symptom is a failure while the case is being written out. Errors are
recorded in the brick's and the client's logs:

node24:/var/log/glusterfs/home.log
[2011-02-04 15:41:19.688110] W [fuse-bridge.c:1761:fuse_writev_cbk]
glusterfs-fuse: 29810266: WRITE => -1 (Bad address)

server2:/var/log/glusterfs/bricks/brick07.log
[2011-02-04 15:41:19.687733] E [posix.c:2504:posix_writev] home-posix:
write failed: offset 538534184, Bad address

I was able to reproduce the error using a single brick and a single
client. Running server and client on the same system did not trigger the
error; the data must cross the wire for the bug to show up. Switching to
TCP over IPoIB was a successful workaround.

It looks like a pointer in the iovec structure used by writev is
corrupted during transport over RDMA. I imagine debugging this will be
rather hard, but hopefully you'll be able to find the root cause. Feel
free to ask for additional logs or traces; I'll try to provide them.
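
To illustrate why a damaged iovec pointer would produce exactly the
"Bad address" (EFAULT) seen in both logs, here is a minimal sketch. It
is not GlusterFS code: it calls libc's writev directly through ctypes,
and the names IoVec and writev_once, as well as the bogus address 1, are
made up for the illustration. It assumes Linux and a regular temporary
file as the write target.

```python
import ctypes
import ctypes.util
import errno
import os
import tempfile


class IoVec(ctypes.Structure):
    """Mirror of the C struct iovec (hypothetical name for this sketch)."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


# Load libc with errno capture enabled so the failure reason can be read back.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.writev.argtypes = [ctypes.c_int, ctypes.POINTER(IoVec), ctypes.c_int]
libc.writev.restype = ctypes.c_ssize_t


def writev_once(fd, base_addr, length):
    """Issue writev() with a single iovec entry.

    Returns (bytes_written, errno); errno is 0 when the call succeeds.
    """
    vec = (IoVec * 1)(IoVec(base_addr, length))
    ctypes.set_errno(0)
    written = libc.writev(fd, vec, 1)
    err = ctypes.get_errno() if written < 0 else 0
    return written, err


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        payload = ctypes.create_string_buffer(b"payload")
        # Valid pointer: the kernel can copy the 7 bytes out of our buffer.
        good = writev_once(f.fileno(), ctypes.addressof(payload), 7)
        # Bogus pointer (address 1 is assumed to be unmapped): the kernel
        # cannot read the source buffer and rejects the call.
        bad = writev_once(f.fileno(), 1, 7)
    print("valid pointer:", good)
    print("bogus pointer:", bad, os.strerror(bad[1]))
```

If the iov_base that reaches posix_writev on the brick were damaged in a
similar way, the write would fail with EFAULT regardless of the offset
or the state of the file, which would be consistent with the log lines
above.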

Beat

-- 
     \|/                           Beat Rubischon <beat at 0x1b.ch>
   ( 0-0 )                             http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
My experiences, thoughts and dreams: http://www.0x1b.ch/blog/

