Re: GlusterFS Volume Failure

The engineering team will need some details:

Gluster version?
OS details for the clients and servers.
Hardware details for the clients and servers.
Client volume file.
Why was 10.1.1.2 already down, and how was it brought down?

Also, this type of question will probably get a better response on the gluster-users list. Could you subscribe there and repost your email with the details I've asked for? You can subscribe to Gluster-users here: http://gluster.org/cgi-bin/mailman/listinfo/gluster-users

Thanks,

Craig

--
Craig Carl
Sales Engineer; Gluster, Inc.


From: "Philippe Muller" <philippe.muller@xxxxxxxxx>
To: gluster-devel@xxxxxxxxxx
Sent: Tuesday, June 1, 2010 1:39:09 AM
Subject: GlusterFS Volume Failure

Hi,

Last night we had trouble with a GlusterFS mount. It is a replicate volume, and the 10.1.1.2 host was already down. Files on the volume were not readable until I manually restarted the GlusterFS instance.
We'd like to understand what happened on this volume, especially the "Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting." message. I can't figure out why the GlusterFS instance couldn't talk to itself.
Please help us.

This log is from 10.1.1.1 itself:

[2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:09:32] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:11:55] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:14:29] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731899: STAT() /masterspool => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731898: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731897: ERR => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:37. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731896: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame OPEN(10) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:858:fuse_fd_cbk] glusterfs-fuse: 7731894: OPEN() /cell/common/bootstrap => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731895: FSTAT() /masterspool/messages => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731893: FSTAT() /cell/common/bootstrap => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800
[2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout = 1800
[2010-06-01 00:16:05] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:56. frame-timeout = 1800
[2010-06-01 00:16:05] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731901: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:16:25] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:46:19. frame-timeout = 1800
[2010-06-01 00:16:25] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731902: ERR => -1 (Transport endpoint is not connected)
[..]
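One way to read the excerpt above is to subtract each "frame sent" time from the time the frame was bailed out. Every bailed frame was outstanding for just over the 1800-second frame-timeout, so the requests sent around 23:45 never received a reply; the ping-timer messages (whose 42 seconds is the client's ping timeout) recur roughly every two and a half minutes in between. The Python sketch below is a hypothetical helper, not part of GlusterFS; it assumes the exact message layout shown in this log and only recognises the call_bail and client_ping_timer_expired lines.

```python
import re
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "2010-06-01 00:15:44".
TS_FMT = "%Y-%m-%d %H:%M:%S"
TS = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"

# call_bail lines record when the frame was sent and the frame-timeout in force.
BAIL_RE = re.compile(
    rf"^\[(?P<now>{TS})\] E \[client-protocol\.c:\d+:call_bail\] "
    rf"(?P<brick>[\w-]+): bailing out frame (?P<op>\w+)\(\d+\) "
    rf"frame sent = (?P<sent>{TS})\. frame-timeout = (?P<timeout>\d+)"
)
# client_ping_timer_expired lines only carry the time the client gave up.
PING_RE = re.compile(
    rf"^\[(?P<now>{TS})\] E \[client-protocol\.c:\d+:client_ping_timer_expired\]"
)


def parse_ts(text):
    return datetime.strptime(text, TS_FMT)


def bailed_frames(lines):
    """Return (operation, seconds outstanding, frame-timeout) per call_bail line."""
    result = []
    for line in lines:
        m = BAIL_RE.match(line)
        if m:
            waited = parse_ts(m["now"]) - parse_ts(m["sent"])
            result.append((m["op"], int(waited.total_seconds()), int(m["timeout"])))
    return result


def ping_gaps(lines):
    """Return the seconds between consecutive ping-timer expirations."""
    stamps = [parse_ts(m["now"]) for m in map(PING_RE.match, lines) if m]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    excerpt = [
        "[2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.",
        "[2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.",
        "[2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.",
        "[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout = 1800",
        "[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800",
    ]
    print(bailed_frames(excerpt))  # [('STAT', 1801, 1800), ('PING', 1809, 1800)]
    print(ping_gaps(excerpt))      # [154, 149]
```

Because the timestamps carry the full date, the subtraction handles the midnight rollover between the "frame sent" and bail-out times without special cases.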

Here is our configuration:

volume posix
    type storage/posix
    option directory /data/sge
end-volume

volume locks
    type features/locks
    subvolumes posix
end-volume

volume brick
    type performance/io-threads
    option thread-count 8
    subvolumes locks
end-volume

volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.brick.allow 10.*.*.*
    subvolumes brick
end-volume

volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume brick-shadow
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.2
    option remote-subvolume brick
end-volume

volume sge-replicate
    type cluster/replicate
    subvolumes brick-qmaster brick-shadow
end-volume
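The volume file above combines a server-side stack (posix, locks and the io-threads volume brick, exported by server) with a client-side stack (the two protocol/client volumes brick-qmaster and brick-shadow under sge-replicate). Listing each client volume's remote-host makes explicit that brick-qmaster connects to 10.1.1.1, so the client on 10.1.1.1 reaches its own server over TCP, which is why the disconnect message names the host's own address. The Python sketch below is a hypothetical parser, not a GlusterFS tool; it only understands the directives that appear in this file (volume, type, option, subvolumes, end-volume).

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    name: str
    type: str = ""
    options: dict = field(default_factory=dict)
    subvolumes: list = field(default_factory=list)


def parse_volfile(text):
    """Parse the 'volume ... end-volume' blocks of a volume file into a dict."""
    volumes = {}
    current = None
    for raw in text.splitlines():
        words = raw.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comment lines
        keyword = words[0]
        if keyword == "volume":
            current = Volume(words[1])
        elif keyword == "end-volume":
            volumes[current.name] = current
            current = None
        elif keyword == "type":
            current.type = words[1]
        elif keyword == "option":
            current.options[words[1]] = " ".join(words[2:])
        elif keyword == "subvolumes":
            current.subvolumes = words[1:]
    return volumes


def client_targets(volumes):
    """Map each protocol/client volume to the remote-host it connects to."""
    return {
        v.name: v.options.get("remote-host")
        for v in volumes.values()
        if v.type == "protocol/client"
    }


if __name__ == "__main__":
    # Client-side part of the configuration above.
    volfile = """
volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume brick-shadow
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.2
    option remote-subvolume brick
end-volume

volume sge-replicate
    type cluster/replicate
    subvolumes brick-qmaster brick-shadow
end-volume
"""
    vols = parse_volfile(volfile)
    print(client_targets(vols))  # {'brick-qmaster': '10.1.1.1', 'brick-shadow': '10.1.1.2'}
    print(vols["sge-replicate"].subvolumes)  # ['brick-qmaster', 'brick-shadow']
    # None of these volumes sets any *timeout option explicitly.
    print(any("timeout" in key for v in vols.values() for key in v.options))  # False
```

Parsing into a dict keyed by volume name keeps the subvolume graph easy to walk, for example from sge-replicate down to the host each replica lives on.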



Philippe Muller

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel