Re: AFR setup with Virtual Servers crashes

"Anand Avati" <avati@xxxxxxxxxxxxx> · Thu, 10 May 2007 16:52:34 +0530

Urban,
 this bug has already been fixed in the source repository.
thanks,
avati

2007/5/10, Urban Loesch <ul@xxxxxxxx>:
Hi Avati,

thanks for your fast answer.

I use the version glusterfs-1.3.0-pre3 downloaded from your server
(http://ftp.zresearch.com/pub/gluster/glusterfs/1.3-pre/).
I will try the latest version from TLA this afternoon and let you know
what happens.

Here's the backtrace from the core dump
# gdb glusterfsd -c core.15160
..
Core was generated by `glusterfsd --no-daemon --log-file=/dev/stdout
--log-level=DEBUG'.
Program terminated with signal 11, Segmentation fault.
#0  0xb75d8fd3 in posix_locks_flush () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/features/posix-locks.so
(gdb) bt
#0  0xb75d8fd3 in posix_locks_flush () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/features/posix-locks.so
#1  0xb75d1192 in fop_flush () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
#2  0xb75cded7 in proto_srv_notify () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
#3  0xb7f54ecd in transport_notify (this=0x804b1a0, event=1) at
transport.c:148
#4  0xb7f55b79 in sys_epoll_iteration (ctx=0xbfbc2ff0) at epoll.c:53
#5  0xb7f54f7d in poll_iteration (ctx=0xbfbc2ff0) at transport.c:251
#6  0x0804924e in main ()
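
For longer traces it can help to turn the gdb output into structured data, so the crashing translator is easy to pick out. The following is only a minimal sketch, assuming the frame layout shown in the backtrace above; the names BACKTRACE, FRAME_RE and parse_frames are hypothetical helpers and not part of GlusterFS or gdb.

```python
import re

# Backtrace text copied from the gdb session above, mail line wrapping included.
BACKTRACE = """\
#0  0xb75d8fd3 in posix_locks_flush () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/features/posix-locks.so
#1  0xb75d1192 in fop_flush () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
#2  0xb75cded7 in proto_srv_notify () from
/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
#3  0xb7f54ecd in transport_notify (this=0x804b1a0, event=1) at
transport.c:148
#4  0xb7f55b79 in sys_epoll_iteration (ctx=0xbfbc2ff0) at epoll.c:53
#5  0xb7f54f7d in poll_iteration (ctx=0xbfbc2ff0) at transport.c:251
#6  0x0804924e in main ()
"""

# Assumed frame layout: "#N  ADDR in FUNC (ARGS) [from LIBRARY | at FILE:LINE]".
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+(?P<addr>0x[0-9a-f]+) in (?P<func>\w+) "
    r"\((?P<args>[^)]*)\)(?: (?P<kind>from|at) (?P<where>\S+))?"
)


def parse_frames(text):
    """Return one dict per stack frame, innermost frame first."""
    # Undo the mail wrapping: only a line starting with '#' begins a new frame.
    joined = re.sub(r"\n(?!#)", " ", text.strip())
    frames = []
    for line in joined.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.groupdict())
    return frames


if __name__ == "__main__":
    for frame in parse_frames(BACKTRACE):
        print(frame["index"], frame["func"], frame["where"])
```

Frame 0 is the function that was executing when the signal arrived, so its "where" field names the shared object to look at first.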

Yes, it is reproducible. It happens every time I try to start my
virtual server.

Thanks
Urban

Anand Avati wrote:
> Urban,
> which version of glusterfs are you using? if it is from a TLA checkout,
> what is the patchset number?
>
> you have a core dump generated from the segfault; can you please get a
> backtrace from it? (run gdb glusterfsd -c core.<pid> or gdb glusterfsd -c
> core, type the 'bt' command and paste the output.)
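
The gdb steps requested above can also be run non-interactively, which makes it easier to attach the output to a mail. This is a rough sketch, assuming a gdb build that accepts the -batch and -ex options; gdb_backtrace_command and collect_backtrace are hypothetical helper names, and the binary and core file names are the ones used elsewhere in this thread.

```python
import subprocess


def gdb_backtrace_command(binary, core):
    """Build a gdb command line that prints a backtrace and then exits."""
    # -batch: exit after processing the commands; -ex bt: run the 'bt' command.
    return ["gdb", "-batch", "-ex", "bt", binary, "-c", core]


def collect_backtrace(binary, core):
    """Run gdb and return its combined output (needs gdb and the core file)."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call collect_backtrace() on the crashed host.
    print(" ".join(gdb_backtrace_command("glusterfsd", "core.15160")))
```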
>
> is this easily reproducible? have you checked with the latest TLA
> checkout?
>
> thanks,
> avati
>
> 2007/5/10, Urban Loesch <ul@xxxxxxxx>:
>> Hi,
>>
>> I'm new to this list.
>> First: sorry for my bad English.
>>
>> I was searching for an easy and transparent cluster filesystem with a
>> failover feature, and I found the GlusterFS project on Wikipedia.
>> It's a nice project, and I tried it in my test environment. If it
>> works well, I plan to use it in production too.
>>
>> A very nice feature for me is the AFR setup, which lets me replicate all
>> the data across 2 servers in RAID-1 mode.
>> But it seems that I am doing something wrong, because "glusterfsd"
>> crashes on both nodes.
>> But let me explain from the beginning.
>>
>> Here's my setup:
>> Hardware:
>> 2 different servers for storage
>> 1 server as client
>> On top of the server I use a virtual server setup (details
>> http://linux-vserver.org).
>>
>> OS:
>> Debian Sarge with a self-compiled 2.6.19.2 kernel (uname -r:
>> 2.6.19.2-vs2.2.0) and the latest stable virtual server patch.
>> glusterfs-1.3.0-pre3.tar.gz
>>
>> What I'm trying to do:
>> - Create an AFR mirror over the 2 servers.
>> - Mount the volume on Server 3 (client).
>> - Install the whole virtual server, with Apache, MySQL and so on, on the
>> mounted volume.
>> This gives me a fully redundant virtual server mirrored over two bricks.
>>
>> Here is my current configuration:
>> - Server config on Server 1 (brick)
>>
>> ### Export volume "brick" with the contents of the "/gluster" directory.
>> volume brick
>>   type storage/posix                   # POSIX FS translator
>>   option directory /gluster        # Export this directory
>> end-volume
>>
>> ### File Locking
>> volume locks
>>   type features/posix-locks
>>   subvolumes brick
>> end-volume
>>
>> ### Add network serving capability to above brick.
>> volume server
>>   type protocol/server
>>   option transport-type tcp/server     # For TCP/IP transport
>>   option listen-port 6996              # Default is 6996
>>   subvolumes locks
>>   option auth.ip.locks.allow *         # allow any IP to access the "locks" volume
>> end-volume
>>
>> - Server config on Server 2 (brick-afr)
>> ### Export volume "brick-afr" with the contents of the "/gluster-afr" directory.
>> volume brick-afr
>>   type storage/posix                   # POSIX FS translator
>>   option directory /gluster-afr        # Export this directory
>> end-volume
>>
>> ### File Locking
>> volume locks-afr
>>   type features/posix-locks
>>   subvolumes brick-afr
>> end-volume
>>
>> ### Add network serving capability to above brick.
>> volume server
>>   type protocol/server
>>   option transport-type tcp/server     # For TCP/IP transport
>>   option listen-port 6996              # Default is 6996
>>   subvolumes locks-afr
>>   option auth.ip.locks-afr.allow *     # allow any IP to access the "locks-afr" volume
>> end-volume
>>
>> - Client config on Server 3 (client)
>> ### Add client feature and attach to remote subvolume of server1
>> volume brick
>>   type protocol/client
>>   option transport-type tcp/client     # for TCP/IP transport
>>   option remote-host 192.168.0.1      # IP address of the remote brick
>>   option remote-port 6996              # default server port is 6996
>>   option remote-subvolume locks        # name of the remote volume
>> end-volume
>>
>> ### Add client feature and attach to remote subvolume of server2
>> volume brick-afr
>>   type protocol/client
>>   option transport-type tcp/client     # for TCP/IP transport
>>   option remote-host 192.168.0.2      # IP address of the remote brick
>>   option remote-port 6996              # default server port is 6996
>>   option remote-subvolume locks-afr        # name of the remote volume
>> end-volume
>>
>> ### Add AFR feature to brick
>> volume afr
>>   type cluster/afr
>>   subvolumes brick brick-afr
>>   option replicate *:2                 # All files 2 copies (RAID-1)
>> end-volume
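
One way to rule out a plain naming mistake in specs like these is to check that every remote-subvolume named on the client exists on the server it points to. The sketch below is a deliberately small parser for the "volume ... end-volume" layout quoted above, not GlusterFS's own spec parser. The function names are hypothetical, and the mapping of 192.168.0.1 to Server 1 and 192.168.0.2 to Server 2 is an assumption read off the client spec.

```python
def parse_volfile(text):
    """Parse 'volume ... end-volume' blocks into a dict keyed by volume name."""
    volumes, current = {}, None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        words = line.split()
        if words[0] == "volume":
            current = {"type": None, "options": {}, "subvolumes": []}
            volumes[words[1]] = current
        elif words[0] == "end-volume":
            current = None
        elif current is not None and words[0] == "type":
            current["type"] = words[1]
        elif current is not None and words[0] == "option":
            current["options"][words[1]] = " ".join(words[2:])
        elif current is not None and words[0] == "subvolumes":
            current["subvolumes"] = words[1:]
    return volumes


def unresolved_remotes(client, servers_by_host):
    """List client volumes whose remote-subvolume is missing on the server."""
    problems = []
    for name, vol in client.items():
        if vol["type"] != "protocol/client":
            continue
        host = vol["options"].get("remote-host")
        remote = vol["options"].get("remote-subvolume")
        if remote not in servers_by_host.get(host, {}):
            problems.append((name, host, remote))
    return problems


# Abbreviated copies of the specs quoted above (comments and some options cut).
SERVER1 = """
volume brick
  type storage/posix
end-volume
volume locks
  type features/posix-locks
  subvolumes brick
end-volume
"""
SERVER2 = """
volume brick-afr
  type storage/posix
end-volume
volume locks-afr
  type features/posix-locks
  subvolumes brick-afr
end-volume
"""
CLIENT = """
volume brick
  type protocol/client
  option remote-host 192.168.0.1      # IP address of the remote brick
  option remote-subvolume locks
end-volume
volume brick-afr
  type protocol/client
  option remote-host 192.168.0.2
  option remote-subvolume locks-afr
end-volume
volume afr
  type cluster/afr
  subvolumes brick brick-afr
  option replicate *:2                 # All files 2 copies (RAID-1)
end-volume
"""

# Assumed host-to-spec mapping, taken from the remote-host options above.
SERVERS = {
    "192.168.0.1": parse_volfile(SERVER1),
    "192.168.0.2": parse_volfile(SERVER2),
}

if __name__ == "__main__":
    print(unresolved_remotes(parse_volfile(CLIENT), SERVERS))
```

For the specs in this mail the check finds nothing to report, which is consistent with the mount itself working before the crash.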
>>
>> ----------------------------------------------------------------------------------------------------------------------
>>
>> I started the two bricks in debug mode, and they start without problems.
>>
>> - Server1
>> glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> ....
>> [May 10 11:52:11] [DEBUG/proto-srv.c:2919/init()]
>> protocol/server:protocol/server xlator loaded
>> [May 10 11:52:11] [DEBUG/transport.c:83/transport_load()]
>> libglusterfs/transport:attempt to load type tcp/server
>> [May 10 11:52:11] [DEBUG/transport.c:88/transport_load()]
>> libglusterfs/transport:attempt to load file
>> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so
>>
>> - Server2
>> glusterfsd --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> ....
>> [May 10 11:51:44] [DEBUG/proto-srv.c:2919/init()]
>> protocol/server:protocol/server xlator loaded
>> [May 10 11:51:44] [DEBUG/transport.c:83/transport_load()]
>> libglusterfs/transport:attempt to load type tcp/server
>> [May 10 11:51:44] [DEBUG/transport.c:88/transport_load()]
>> libglusterfs/transport:attempt to load file
>> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/server.so
>> ------------------------------------------------------------------------------------------------------------------------------
>>
>>
>> So far so good.
>>
>> Then I mounted the volume on server 3 (client). It mounts without any
>> problems.
>> glusterfs --no-daemon --log-file=/dev/stdout --log-level=DEBUG
>> --spec-file=/etc/glusterfs/glusterfs-client.vol
>> /var/lib/vservers/mastersql
>> ...
>> [May 10 13:59:00] [DEBUG/client-protocol.c:2796/init()]
>> protocol/client:defaulting transport-timeout to 120
>> [May 10 13:59:00] [DEBUG/transport.c:83/transport_load()]
>> libglusterfs/transport:attempt to load type tcp/client
>> [May 10 13:59:00] [DEBUG/transport.c:88/transport_load()]
>> libglusterfs/transport:attempt to load file
>> /usr/lib/glusterfs/1.3.0-pre3/transport/tcp/client.so
>> [May 10 13:59:00] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> :try_connect: socket fd = 8
>> [May 10 13:59:00] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> :try_connect: finalized on port `1022'
>> [May 10 13:59:00] [DEBUG/tcp-client.c:255/tcp_connect()]
>> tcp/client:connect on 8 in progress (non-blocking)
>> [May 10 13:59:00] [DEBUG/tcp-client.c:293/tcp_connect()]
>> tcp/client:connection on 8 still in progress - try later
>>
>> OK. Nice.
>> A short check on the client:
>> df -HT
>> Filesystem    Type     Size   Used  Avail Use% Mounted on
>> /dev/sda1     ext3      13G   2.6G   8.9G  23% /
>> tmpfs        tmpfs     1.1G      0   1.1G   0% /lib/init/rw
>> udev         tmpfs      11M    46k    11M   1% /dev
>> tmpfs        tmpfs     1.1G      0   1.1G   0% /dev/shm
>> glusterfs:24914
>>               fuse     9.9G   2.5G   6.9G  27%
>> /var/lib/vservers/mastersql
>>
>> Wow, it works. Now I can add, remove or edit files and directories
>> without problems. The files are written to both bricks without
>> problems. Performance is good too.
>>
>> But then I tried to start my virtual server (called mastersql).
>> The virtual server does not start, and I get a lot of the following debug
>> output on the client:
>>
>> [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> :try_connect: socket fd = 4
>> [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> :try_connect: finalized on port `1023'
>> [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()]
>> tcp/client:connect on 4 in progress (non-blocking)
>> [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()]
>> tcp/client:connection on 4 still in progress - try later
>> [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()]
>> protocol/client:transport_submit failed
>> [May 10 14:04:43]
>> [DEBUG/client-protocol.c:2604/client_protocol_cleanup()]
>> protocol/client:cleaning up state in transport object 0x8076cf0
>> [May 10 14:04:43] [DEBUG/tcp-client.c:174/tcp_connect()] transport: tcp:
>> :try_connect: socket fd = 7
>> [May 10 14:04:43] [DEBUG/tcp-client.c:196/tcp_connect()] transport: tcp:
>> :try_connect: finalized on port `1022'
>> [May 10 14:04:43] [DEBUG/tcp-client.c:255/tcp_connect()]
>> tcp/client:connect on 7 in progress (non-blocking)
>> [May 10 14:04:43] [DEBUG/tcp-client.c:293/tcp_connect()]
>> tcp/client:connection on 7 still in progress - try later
>> [May 10 14:04:43] [ERROR/client-protocol.c:204/client_protocol_xfer()]
>> protocol/client:transport_submit failed
>> [May 10 14:04:43]
>> [DEBUG/client-protocol.c:2604/client_protocol_cleanup()]
>> protocol/client:cleaning up state in transport object 0x80762d0
>>
>> The two mirror servers crash with the following debug output:
>>
>> [May 10 11:54:26] [DEBUG/tcp-server.c:134/tcp_server_notify()]
>> tcp/server:Registering socket (5) for new transport object of
>> 192.168.0.3
>> [May 10 11:55:22] [DEBUG/proto-srv.c:2418/mop_setvolume()]
>> server-protocol:mop_setvolume: received port = 1022
>> [May 10 11:55:22] [DEBUG/proto-srv.c:2434/mop_setvolume()]
>> server-protocol:mop_setvolume: IP addr = *, received ip addr =
>> 192.168.0.3
>> [May 10 11:55:22] [DEBUG/proto-srv.c:2444/mop_setvolume()]
>> server-protocol:mop_setvolume: accepted client from 192.168.0.3
>>
>> Trying to set: READ  Is grantable: READ   Inserting: READTrying to set:
>> UNLOCK  Is grantable: UNLOCK  Conflict with: READTrying to set: WRITE
>> Is grantable: WRITE   Inserting: WRITETrying to set: UNLOCK  Is
>> grantable: UNLOCK  Conflict with: WRITETrying to set: WRITE  Is
>> grantable: WRITE   Inserting: WRITETrying to set: UNLOCK  Is grantable:
>> UNLOCK  Conflict with: WRITETrying to set: WRITE  Is grantable: WRITE
>> Inserting: WRITE[May 10 12:00:09]
>> [CRITICAL/common-utils.c:215/gf_print_trace()] debug-backtrace:Got
>> signal (11), printing backtrace
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/libglusterfs.so.0(gf_print_trace+0x2e)
>> [0xb7f53a7e]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:[0xb7f60420]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> [0xb75d1192]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/glusterfs/1.3.0-pre3/xlator/protocol/server.so
>> [0xb75cded7]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/libglusterfs.so.0(transport_notify+0x1d)
>> [0xb7f54ecd]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xe9)
>> [0xb7f55b79]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/usr/lib/libglusterfs.so.0(poll_iteration+0x1d)
>> [0xb7f54f7d]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:glusterfsd [0x804924e]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:/lib/tls/i686/cmov/libc.so.6(__libc_start_main+0xc8)
>> [0xb7e17ea8]
>> [May 10 12:00:09] [CRITICAL/common-utils.c:217/gf_print_trace()]
>> debug-backtrace:glusterfsd [0x8048c51]
>> Segmentation fault (core dumped)
>>
>> It seems that there are some conflicts with "READ, WRITE, UNLOCK". But
>> I'm not an expert on filesystems and locking features.
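
The lock debug output quoted above is easier to read once it is split into individual requests. The following is a minimal sketch that assumes each entry has the three fields seen in the log ("Trying to set", "Is grantable", then either "Inserting" or "Conflict with"). The names TRACE, ENTRY_RE and split_lock_trace are hypothetical, and the sketch makes no claim about what the posix-locks translator means by "Conflict with".

```python
import re

# Lock debug output copied from the server log above, mail wrapping included.
TRACE = """\
Trying to set: READ  Is grantable: READ   Inserting: READTrying to set:
UNLOCK  Is grantable: UNLOCK  Conflict with: READTrying to set: WRITE
Is grantable: WRITE   Inserting: WRITETrying to set: UNLOCK  Is
grantable: UNLOCK  Conflict with: WRITETrying to set: WRITE  Is
grantable: WRITE   Inserting: WRITETrying to set: UNLOCK  Is grantable:
UNLOCK  Conflict with: WRITETrying to set: WRITE  Is grantable: WRITE
Inserting: WRITE
"""

LOCK = r"(READ|WRITE|UNLOCK)"
ENTRY_RE = re.compile(
    r"Trying to set: " + LOCK + r" Is grantable: " + LOCK
    + r" (Inserting|Conflict with): " + LOCK
)


def split_lock_trace(text):
    """Return (requested, grantable, action, lock_type) tuples in log order."""
    # Collapse the mail line breaks and padding into single spaces first.
    normalized = " ".join(text.split())
    return ENTRY_RE.findall(normalized)


if __name__ == "__main__":
    for entry in split_lock_trace(TRACE):
        print(entry)
```

In this log every UNLOCK entry reports the same lock type that was inserted by the request just before it, and the final WRITE is inserted with no UNLOCK logged before the signal 11 message.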
>>
>> As you can see, the filesystem is mounted but not connected to the
>> two bricks:
>> df -HT
>> Filesystem    Type     Size   Used  Avail Use% Mounted on
>> /dev/sda1     ext3      13G   2.6G   8.9G  23% /
>> tmpfs        tmpfs     1.1G      0   1.1G   0% /lib/init/rw
>> udev         tmpfs      11M    46k    11M   1% /dev
>> tmpfs        tmpfs     1.1G      0   1.1G   0% /dev/shm
>> df: `/var/lib/vservers/mastersql': Transport endpoint is not connected
>>
>> I'm not sure if I am doing something wrong (configuration) or if it is a
>> bug!
>> Can you experts please help me?
>>
>> If you need any further information, please let me know.
>>
>> Thanks and regards
>> Urban
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxx
>> http://lists.nongnu.org/mailman/listinfo/gluster-devel
>>
>
>

--
Anand V. Avati