GlusterFS Volume Failure

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Last night, we got some troubles with a GlusterFS mount. It's a replicate volume, and the 10.1.1.2 host was already down. The volume files weren't readable until I manually restarted the GlusterFS instance.
We'd like to understand what happened on this volume. Especially the "Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting." message. I can't figure out why the GlusterFS instance couldn't talk to itself.
Please help us.

This log is from 10.1.1.1 itself :

[2010-06-01 00:01:54] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:04:28] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:06:57] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:09:32] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:11:55] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:14:29] E [client-protocol.c:415:client_ping_timer_expired] brick-qmaster: Server 10.1.1.1:6996 has not responded in the last 42 seconds, disconnecting.
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STAT(0) frame sent = 2010-05-31 23:45:43. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731899: STAT() /masterspool => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731898: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:45:39. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731897: ERR => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:37. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731896: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame OPEN(10) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:858:fuse_fd_cbk] glusterfs-fuse: 7731894: OPEN() /cell/common/bootstrap => -1 (Transport endpoint is not connected)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731895: FSTAT() /masterspool/messages => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame FSTAT(25) frame sent = 2010-05-31 23:45:34. frame-timeout = 1800
[2010-06-01 00:15:44] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731893: FSTAT() /cell/common/bootstrap => -1 (File descriptor in bad state)
[2010-06-01 00:15:44] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:35. frame-timeout = 1800
[2010-06-01 00:15:54] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame PING(5) frame sent = 2010-05-31 23:45:51. frame-timeout = 1800
[2010-06-01 00:16:05] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame LOOKUP(27) frame sent = 2010-05-31 23:45:56. frame-timeout = 1800
[2010-06-01 00:16:05] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 7731901: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-06-01 00:16:25] E [client-protocol.c:313:call_bail] brick-qmaster: bailing out frame STATFS(13) frame sent = 2010-05-31 23:46:19. frame-timeout = 1800
[2010-06-01 00:16:25] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 7731902: ERR => -1 (Transport endpoint is not connected)
[..]

Here is our configuration :

volume posix
    type storage/posix
    option directory /data/sge
end-volume

volume locks
    type features/locks
    subvolumes posix
end-volume

volume brick
    type performance/io-threads
    option thread-count 8
    subvolumes locks
end-volume

volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.brick.allow 10.*.*.*
    subvolumes brick
end-volume

volume brick-qmaster
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.1
    option remote-subvolume brick
end-volume

volume brick-shadow
    type protocol/client
    option transport-type tcp
    option remote-host 10.1.1.2
    option remote-subvolume brick
end-volume

volume sge-replicate
    type cluster/replicate
    subvolumes brick-qmaster brick-shadow
end-volume



Philippe Muller

[Index of Archives]     [Gluster Users]     [Ceph Users]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux