Sorry to hound you about this, but it turns out that when an afr volume
fails, everything works fine over tcp, while over ib-verbs the client hangs.
Our ib-verbs driver is the one included in OFED-1.2.5. Is this the
recommended ib library? The error is raised at the transport level as
you can see from the client log below. Let me know if you need any more
detailed information.
Thanks!
Mickey Mazarick wrote:
AFR is being handled on the client... I simplified the specs down to
look exactly like the online example and I'm still seeing the same
result.
This is an infiniband setup so that may be the problem. We want to run
this on a 6 brick 100+ client cluster over infiniband.
Whenever I kill the gluster daemon on RTPST201 it hangs and the client
log says:
2007-11-30 07:55:14 E [unify.c:145:unify_buf_cbk] bricks: afrns
returned 107
2007-11-30 07:55:14 E [unify.c:145:unify_buf_cbk] bricks: afrns
returned 107
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc]
transport/ib-verbs: send work request on `mthca0' returned error
wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaad801000,
wc.byte_len = 0, post->reused = 210
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc]
transport/ib-verbs: send work request on `mthca0' returned error
wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaac2bf000,
wc.byte_len = 0, post->reused = 168
2007-11-30 07:55:34 E [ib-verbs.c:951:ib_verbs_recv_completion_proc]
transport/ib-verbs: ibv_get_cq_event failed, terminating recv thread
2007-11-30 07:55:34 E [ib-verbs.c:1100:ib_verbs_send_completion_proc]
transport/ib-verbs: send work request on `mthca0' returned error
wc.status = 12, wc.vendor_err = 129, post->buf = 0x2aaaabfb9000,
wc.byte_len = 0, post->reused = 230
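
For anyone reading the log above, the numeric codes can be translated into names. The following is a minimal sketch, not part of GlusterFS: the helper name `annotate` and both lookup tables are made up for illustration. The status names are taken from `enum ibv_wc_status` in the libibverbs header, and 107 is assumed to be a Linux errno value. Under those assumptions, `wc.status = 12` is `IBV_WC_RETRY_EXC_ERR` (the transport retry counter was exceeded, which usually means the remote side stopped responding) and `returned 107` is `ENOTCONN`.

```python
import re

# Subset of `enum ibv_wc_status` from libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Linux errno values seen in this thread (assumption: the client runs Linux).
LINUX_ERRNO = {107: "ENOTCONN"}

WC_STATUS_RE = re.compile(r"wc\.status = (\d+)")
RETURNED_RE = re.compile(r"returned (\d+)")


def annotate(entry):
    """Return human-readable notes for one (possibly line-wrapped) log entry."""
    text = " ".join(entry.split())  # undo the mail client's line wrapping
    notes = []
    match = WC_STATUS_RE.search(text)
    if match:
        code = int(match.group(1))
        notes.append("wc.status %d = %s" % (code, IBV_WC_STATUS.get(code, "unknown")))
    match = RETURNED_RE.search(text)
    if match:
        code = int(match.group(1))
        notes.append("errno %d = %s" % (code, LINUX_ERRNO.get(code, "unknown")))
    return notes


if __name__ == "__main__":
    print(annotate("transport/ib-verbs: send work request on `mthca0' returned error\n"
                   "wc.status = 12, wc.vendor_err = 129"))
    print(annotate("[unify.c:145:unify_buf_cbk] bricks: afrns\nreturned 107"))
```

Both readings are consistent with the daemon on RTPST201 having been killed: the unify translator sees the namespace volume as disconnected, and the ib-verbs sends to the dead peer run out of retries.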
Storage Bricks are:
RTPST201,RTPST202
########################Storage Brick vol spec:
volume afrmirror
type storage/posix
option directory /mnt/gluster/afrmirror
end-volume
volume afrns
type storage/posix
option directory /mnt/gluster/afrns
end-volume
volume afr
type storage/posix
option directory /mnt/gluster/afr
end-volume
volume server
type protocol/server
option transport-type ib-verbs/server # For ib-verbs transport
option ib-verbs-work-request-send-size 131072
option ib-verbs-work-request-send-count 64
option ib-verbs-work-request-recv-size 131072
option ib-verbs-work-request-recv-count 64
##auth##
option auth.ip.afrmirror.allow *
option auth.ip.afrns.allow *
option auth.ip.afr.allow *
option auth.ip.main.allow *
option auth.ip.main-ns.allow *
end-volume
#####################Client spec is:
volume afrvol1
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST201
option remote-subvolume afr
end-volume
volume afrmirror1
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST201
option remote-subvolume afrmirror
end-volume
volume afrvol2
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST202
option remote-subvolume afr
end-volume
volume afrmirror2
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST202
option remote-subvolume afrmirror
end-volume
volume afr1
type cluster/afr
subvolumes afrvol1 afrmirror2
end-volume
volume afr2
type cluster/afr
subvolumes afrvol2 afrmirror1
end-volume
volume afrns1
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST201
option remote-subvolume afrns
end-volume
volume afrns2
type protocol/client
option transport-type ib-verbs/client
option remote-host RTPST202
option remote-subvolume afrns
end-volume
volume afrns
type cluster/afr
subvolumes afrns1 afrns2
end-volume
volume bricks
type cluster/unify
option namespace afrns
subvolumes afr1 afr2
option scheduler alu # use the ALU scheduler
option alu.order open-files-usage:disk-usage:read-usage:write-usage
end-volume
Krishna Srinivas wrote:
If you have AFR on the server side and this server goes down, then
all the FDs associated with the files on this server will return an
ENOTCONN error. (Is that how your setup is?) But if you had AFR on the
client side, it would have worked seamlessly. However, this situation
will be handled when we bring out the HA translator.
Krishna
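
Until something like the HA translator exists, an application that can tolerate a pause could handle ENOTCONN itself. The following is only a sketch of that idea: `read_with_reopen` and its parameters are hypothetical names, not a GlusterFS API, and whether a reopened descriptor actually recovers depends on the GlusterFS version and setup (elsewhere in this thread a remount was needed).

```python
import errno
import os
import time


def read_with_reopen(path, offset, size, retries=5, delay=1.0, opener=os.open):
    """Read `size` bytes at `offset`, reopening the file after ENOTCONN.

    Hypothetical helper for illustration; `opener` is injectable so the
    retry logic can be exercised without a real GlusterFS mount.
    """
    last_error = None
    for _ in range(max(1, retries)):
        fd = None
        try:
            fd = opener(path, os.O_RDONLY)
            return os.pread(fd, size, offset)
        except OSError as exc:
            if exc.errno != errno.ENOTCONN:
                raise  # unrelated errors are not retried
            last_error = exc
            time.sleep(delay)  # give the server time to come back
        finally:
            if fd is not None:
                os.close(fd)  # never reuse a descriptor across attempts
    raise last_error
```

The descriptor is closed and reopened on every attempt because, as described above, descriptors tied to the failed server keep returning ENOTCONN.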
On Nov 30, 2007 3:01 AM, Mickey Mazarick <mic@xxxxxxxxxxxxxxxxxx> wrote:
Is this true for files that are currently open? For example, I have a
virtual machine running that has a file open at all times. Errors are
bubbling back to the application layer instead of the client just
waiting. After that I have to unmount/remount the gluster vol. Is there
a way of preventing this?
(This is the latest tla btw)
Thanks!
Anand Avati wrote:
This is possible already, except that the files from the node that is
down will not be accessible for the time the server is down. When the
server is brought back up, the files are made accessible again.
avati
2007/11/30, Mickey Mazarick <mic@xxxxxxxxxxxxxxxxxx>:
Is there currently a way to force a client connection to retry dist io
until a failed resource comes back online?
If a disk in a unified volume drops, I have to remount on all the
clients. Is there a way around this?
I'm using afr/unify on 6 storage bricks and I want to be able to change
a server config setting and restart the server bricks one at a time
without losing the mount point on the clients. Is this currently
possible without doing ip failover?
--
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel
--
It always takes longer than you expect, even when you take into
account Hofstadter's Law.
-- Hofstadter's Law