Urban,

Which version of glusterfs are you using? If it is from a TLA checkout, what is the patchset number?

You have a core dump generated from the segfault; can you please get a backtrace from it and paste the output:

    gdb glusterfsd -c core.<pid>    (or: gdb glusterfsd -c core)
    (gdb) bt

Is this easily reproducible? Have you checked with the latest TLA checkout?

thanks,
avati

2007/5/10, Urban Loesch <ul@xxxxxxxx>:
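[For reference, the same backtrace can also be captured non-interactively; a minimal sketch, assuming gdb is installed on the server and the core file is named `core` in the current directory:]

```
# Load the crashed binary together with its core dump, print the
# backtrace of the faulting thread, and exit.
gdb --batch -ex bt glusterfsd core
```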
Hi,

I'm new to this list. First: sorry for my bad English.

I was searching for an easy and transparent cluster filesystem with a failover feature, and I found the GlusterFS project on Wikipedia. It's a nice project and I tried it in my test environment; if it works well, I will use it in production too. A very nice feature for me is the AFR setup, so I can replicate all the data over 2 servers in RAID-1 mode. But it seems that I am doing something wrong, because "glusterfsd" crashes on both nodes. Let me explain from the beginning.

Here's my setup:

Hardware:
- 2 different servers for storage
- 1 server as client

On top of the servers I use a virtual server setup (details: http://linux-vserver.org).

OS: Debian Sarge with a self-compiled 2.6.19.2 kernel (uname -r: 2.6.19.2-vs2.2.0) and the latest stable virtual server patch.
glusterfs-1.3.0-pre3.tar.gz

What I'm trying to do:
- Create an AFR mirror over the 2 servers.
- Mount the volume on server 3 (the client).
- Install the whole virtual server (Apache, MySQL and so on) on the mounted volume.

So I have a fully redundant virtual server mirrored over two bricks.

Here is my current configuration:

- Server config on server 1 (brick):

    ### Export volume "brick" with the contents of "/home/export" directory.
    volume brick
      type storage/posix          # POSIX FS translator
      option directory /gluster   # Export this directory
    end-volume

    ### File Locking
    volume locks
      type features/posix-locks
      subvolumes brick
    end-volume

    ### Add network serving capability to above brick.
    volume server
      type protocol/server
      option transport-type tcp/server   # For TCP/IP transport
      option listen-port 6996            # Default is 6996
      subvolumes locks
      option auth.ip.locks.allow *       # access to "brick" volume
    end-volume

- Server config on server 2 (brick-afr):

    ### Export volume "brick" with the contents of "/home/export" directory.
    volume brick-afr
      type storage/posix              # POSIX FS translator
      option directory /gluster-afr   # Export this directory
    end-volume

    ### File Locking
    volume locks-afr
      type features/posix-locks
      subvolumes brick-afr
    end-volume

    ### Add network serving capability to above brick.
    volume server
      type protocol/server
      option transport-type tcp/server   # For TCP/IP transport
      option listen-port 6996            # Default is 6996
      subvolumes locks-afr
      option auth.ip.locks-afr.allow *   # access to "brick" volume
    end-volume

- Client configuration on server 3:

    ### Add client feature and attach to remote subvolume of server1
    volume brick
      type protocol/client
      option transport-type tcp/client   # for TCP/IP transport
      option remote-host 192.168.0.1     # IP address of the remote brick
      option remote-port 6996            # default server port is 6996
      option remote-subvolume locks      # name of the remote volume
    end-volume

    ### Add client feature and attach to remote subvolume of brick1
    volume brick-afr
      type protocol/client
      option transport-type tcp/client   # for TCP/IP transport
      option remote-host 192.168.0.2     # IP address of the remote brick
      option remote-port 6996            # default server port is 6996
      option remote-subvolume locks-afr  # name of the remote volume
    end-volume

    ### Add AFR feature to brick
    volume afr
      type cluster/afr
      subvolumes brick brick-afr
      option replicate *:2   # All files 2 copies (RAID-1)
    end-volume

----------------------------------------------------------------------

I started the two bricks in debug mode and they start without problems.

- Server 1:

    glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
    ....
    [May 10 11:52:11] [DEBUG/proto-srv.c:2919/init()] protocol/server:protocol/server xlator loaded
    [May 10 11:52:11] [DEBUG/transport.c:83/transport_load()] libglusterfs/transport:attempt to load type tcp/server
    [May 10 11:52:11] [DEBUG/transport.c:88/transport_load()] libglusterfs/transport:attempt to load file /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so

- Server 2:

    glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
    ....
    [May 10 11:51:44] [DEBUG/proto-srv.c:2919/init()] protocol/server:protocol/server xlator loaded
    [May 10 11:51:44] [DEBUG/transport.c:83/transport_load()] libglusterfs/transport:attempt to load type tcp/server
    [May 10 11:51:44] [DEBUG/transport.c:88/transport_load()] libglusterfs/transport:attempt to load file /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so

----------------------------------------------------------------------

So far so good. Then I mounted the volume on server 3 (the client). It mounts without any problems.

    glusterfs --no-daemon --log-file=/dev/stdout --log-level=DEBUG --spec-file=/etc/glusterfs/glusterfs-client.vol /var/lib/vservers/mastersql
    ...
    [May 10 13:59:00] [DEBUG/client-protocol.c:2796/init()] protocol/client:defaulting transport-timeout to 120
    [May 10 13:59:00] [DEBUG/transport.c:83/transport_load()] libglusterfs/transport:attempt to load type tcp/client
    [May 10 13:59:00] [DEBUG/transport.c:88/transport_load()] libglusterfs/transport:attempt to load file /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/client.so
    [May 10 13:59:00] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp: :try_connect: socket fd = 8
    [May 10 13:59:00] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp: :try_connect: finalized on port `1022'
    [May 10 13:59:00] [DEBUG/tcp-client.c:255/tcp_connect()] tcp/client:connect on 8 in progress (non-blocking)
    [May 10 13:59:00] [DEBUG/tcp-client.c:293/tcp_connect()] tcp/client:connection on 8 still in progress - try later

OK. Nice.
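[Editorial aside, not part of the original mail: before mounting, it can help to verify from the client that both bricks' listen port is reachable. A minimal sketch using bash's built-in /dev/tcp pseudo-device, with the addresses and port 6996 taken from the configs above; this only checks the TCP handshake, not the GlusterFS protocol:]

```shell
# Probe each brick server on the configured listen port (6996).
# timeout(1) bounds the connect attempt; /dev/tcp fails if the
# TCP connection cannot be established.
for host in 192.168.0.1 192.168.0.2; do
    if timeout 2 bash -c ">/dev/tcp/$host/6996" 2>/dev/null; then
        echo "$host:6996 reachable"
    else
        echo "$host:6996 NOT reachable"
    fi
done
```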
A short check on the client:

    df -HT
    Filesystem       Type    Size  Used Avail Use% Mounted on
    /dev/sda1        ext3     13G  2.6G  8.9G  23% /
    tmpfs            tmpfs   1.1G     0  1.1G   0% /lib/init/rw
    udev             tmpfs    11M   46k   11M   1% /dev
    tmpfs            tmpfs   1.1G     0  1.1G   0% /dev/shm
    glusterfs:24914  fuse    9.9G  2.5G  6.9G  27% /var/lib/vservers/mastersql

Wow, it works. Now I can add, remove or edit files and directories without problems. The files are written to both bricks without problems. Performance is good too.

But then I tried to start my virtual server (called mastersql). The virtual server does not start, and I get a lot of the following debug output on the client:

    [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp: :try_connect: socket fd = 4
    [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp: :try_connect: finalized on port `1023'
    [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()] tcp/client:connect on 4 in progress (non-blocking)
    [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()] tcp/client:connection on 4 still in progress - try later
    [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
    [May 10 14:04:43] [DEBUG/client-protocol.c:2604/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x8076cf0
    [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp: :try_connect: socket fd = 7
    [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp: :try_connect: finalized on port `1022'
    [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()] tcp/client:connect on 7 in progress (non-blocking)
    [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()] tcp/client:connection on 7 still in progress - try later
    [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()] protocol/client:transport_submit failed
    [May 10 14:04:43] [DEBUG/client-protocol.c:2604/client_protocol_cleanup()] protocol/client:cleaning up state in transport object 0x80762d0

The two
mirror servers crash with the following debug output:

    [May 10 11:54:26] [DEBUG/tcp-server.c:134/tcp_server_notify()] tcp/server:Registering socket (5) for new transport object of 192.168.0.3
    [May 10 11:55:22] [DEBUG/proto-srv.c:2418/mop_setvolume()] server-protocol:mop_setvolume: received port = 1022
    [May 10 11:55:22] [DEBUG/proto-srv.c:2434/mop_setvolume()] server-protocol:mop_setvolume: IP addr = *, received ip addr = 192.168.0.3
    [May 10 11:55:22] [DEBUG/proto-srv.c:2444/mop_setvolume()] server-protocol:mop_setvolume: accepted client from 192.168.0.3
    Trying to set: READ
    Is grantable: READ
    Inserting: READ
    Trying to set: UNLOCK
    Is grantable: UNLOCK
    Conflict with: READ
    Trying to set: WRITE
    Is grantable: WRITE
    Inserting: WRITE
    Trying to set: UNLOCK
    Is grantable: UNLOCK
    Conflict with: WRITE
    Trying to set: WRITE
    Is grantable: WRITE
    Inserting: WRITE
    Trying to set: UNLOCK
    Is grantable: UNLOCK
    Conflict with: WRITE
    Trying to set: WRITE
    Is grantable: WRITE
    Inserting: WRITE
    [May 10 12:00:09] [CRITICAL/common-utils.c:215/gf_print_trace()] debug-backtrace:Got signal (11), printing backtrace
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/libglusterfs.so.0(gf_print_trace+0x2e) [0xb7f53a7e]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:[0xb7f60420]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so [0xb75d1192]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so [0xb75cded7]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/libglusterfs.so.0(transport_notify+0x1d) [0xb7f54ecd]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xe9) [0xb7f55b79]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
    debug-backtrace:/usr/lib/libglusterfs.so.0(poll_iteration+0x1d) [0xb7f54f7d]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:glusterfsd [0x804924e]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8) [0xb7e17ea8]
    [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()] debug-backtrace:glusterfsd [0x8048c51]
    Segmentation fault (core dumped)

It seems that there are some conflicts with "READ, WRITE, UNLOCK", but I'm not an expert on filesystems and locking features.

As you can see, the filesystem is still mounted but no longer connected to the two bricks:

    df -HT
    Filesystem       Type    Size  Used Avail Use% Mounted on
    /dev/sda1        ext3     13G  2.6G  8.9G  23% /
    tmpfs            tmpfs   1.1G     0  1.1G   0% /lib/init/rw
    udev             tmpfs    11M   46k   11M   1% /dev
    tmpfs            tmpfs   1.1G     0  1.1G   0% /dev/shm
    df: `/var/lib/vservers/mastersql': Transport endpoint is not connected

I'm not sure if I'm doing something wrong (configuration) or if it is a bug. Can you experts please help me? If you need any further information, please let me know.

Thanks and regards
Urban

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
http://lists.nongnu.org/mailman/listinfo/gluster-devel
-- Anand V. Avati