Re: about afr

nicolas prochazka <prochazka.nicolas@xxxxxxxxx> · Tue, 3 Feb 2009 12:48:56 +0100

OK,
so now I know there are a few bugs:

1 - When I stop and restart a server, I get the EBADFD bug.
2 - When I stop a server:
       - with --disable-direct-io-mode: my big image file becomes corrupt (missing data ...)

       - without --disable-direct-io-mode: my process hangs and the CPU load grows a lot (per process)

Any ideas?

Regards,
Nicolas Prochazka

 On Tue, Feb 3, 2009 at 5:42 AM, Raghavendra G <raghavendra@xxxxxxxxxxxxx> wrote:

Hi Nicolas,

On Tue, Feb 3, 2009 at 12:01 AM, nicolas prochazka <prochazka.nicolas@xxxxxxxxx> wrote:

I inspected the log and found something interesting:
everything is OK,
then I stopped 10.98.98.2 and restarted it:

2009-02-02 15:00:32 D [client-protocol.c:6498:notify] brick_10.98.98.2: got GF_EVENT_CHILD_UP

2009-02-02 15:00:32 D [socket.c:924:socket_connect] brick_10.98.98.2: connect () called on transport already connected

2009-02-02 15:00:32 N [client-protocol.c:5786:client_setvolume_cbk] brick_10.98.98.2: connection and handshake succeeded
2009-02-02 15:00:40 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 17399: STATFS
2009-02-02 15:00:40 D [fuse-bridge.c:368:fuse_entry_cbk] glusterfs-fuse: 17400: LOOKUP() / => 1 (1)

2009-02-02 15:00:42 D [client-protocol.c:5854:client_protocol_reconnect] brick_10.98.98.2: breaking reconnect chain

All seems to be OK, but now I have this log
(many times):

2009-02-02 15:07:05 D [client-protocol.c:2799:client_fstat] brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD

Then I stop 10.98.98.1 (I thought that 10.98.98.2 was OK, but the EBADFD suggests it is not!)

This is a known issue in afr for files that remain open across the period in which a server goes down and comes back. Ideally, afr should reissue an open for those files once the server comes back, but currently it does not.
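
To make the failure mode concrete, here is a minimal toy model in Python of the fd bookkeeping described above. It is not GlusterFS code: the names Brick, Client, replicated_fstat and run_scenario are hypothetical, and the model returns an error where the real client in this thread hangs. It only illustrates, under those assumptions, why a brick that is back online can still answer EBADFD for a file opened before it went down, and how reopening on reconnect would avoid that.

```python
import errno

# EBADFD is Linux-specific; fall back to its usual Linux value elsewhere.
EBADFD = getattr(errno, "EBADFD", 77)


class Brick:
    """Toy server: hands out remote fds and forgets them when stopped."""

    def __init__(self):
        self.up = True
        self.open_files = {}  # remote fd -> path
        self._next_fd = 0

    def open(self, path):
        self._next_fd += 1
        self.open_files[self._next_fd] = path
        return self._next_fd

    def stop(self):
        self.up = False
        self.open_files.clear()  # server-side fd state is lost

    def start(self):
        self.up = True


class Client:
    """Toy per-brick client keeping a local fd -> remote fd map."""

    def __init__(self, brick, reopen_on_reconnect):
        self.brick = brick
        self.reopen_on_reconnect = reopen_on_reconnect
        self.paths = {}       # local fd -> path (survives disconnects)
        self.remote_fds = {}  # local fd -> remote fd (lost on disconnect)

    def open(self, local_fd, path):
        self.paths[local_fd] = path
        self.remote_fds[local_fd] = self.brick.open(path)

    def on_child_down(self):
        self.remote_fds.clear()

    def on_child_up(self):
        # The behaviour the reply says is missing: reopen files on reconnect.
        if self.reopen_on_reconnect:
            for local_fd, path in self.paths.items():
                self.remote_fds[local_fd] = self.brick.open(path)

    def fstat(self, local_fd):
        if not self.brick.up or local_fd not in self.remote_fds:
            raise OSError(EBADFD, "failed to get remote fd")
        return self.paths[local_fd]


def replicated_fstat(clients, local_fd):
    """Return the first successful answer; fail only if every replica fails."""
    last_error = None
    for client in clients:
        try:
            return client.fstat(local_fd)
        except OSError as exc:
            last_error = exc
    raise last_error


def run_scenario(reopen_on_reconnect):
    """Replay the reported sequence: restart brick 2, then stop brick 1."""
    b1, b2 = Brick(), Brick()
    c1 = Client(b1, reopen_on_reconnect)
    c2 = Client(b2, reopen_on_reconnect)
    for client in (c1, c2):
        client.open(1, "/big.img")  # file opened while both bricks are up

    b2.stop(); c2.on_child_down()   # stop 10.98.98.2
    b2.start(); c2.on_child_up()    # restart it
    b1.stop(); c1.on_child_down()   # then stop 10.98.98.1

    try:
        replicated_fstat([c1, c2], 1)
        return "ok"
    except OSError as exc:
        return "EBADFD" if exc.errno == EBADFD else "error"
```

Calling run_scenario(False) models the current behaviour, where no replica holds a valid remote fd once the first server is stopped; run_scenario(True) models the reopen-on-reconnect behaviour, where the restarted server can serve the file again.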

2009-02-02 15:10:30 D [page.c:644:ioc_frame_return] io-cache: locked local(0x6309d0)

2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat] brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD

2009-02-02 15:10:30 D [page.c:646:ioc_frame_return] io-cache: unlocked local(0x6309d0)
2009-02-02 15:10:30 D [io-cache.c:798:ioc_need_prune] io-cache: locked table(0x614320)
2009-02-02 15:10:30 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked table(0x614320)

2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat] brick_10.98.98.1: (2148533016): failed to get remote fd. returning EBADFD
2009-02-02 15:10:30 D [io-cache.c:425:ioc_cache_validate_cbk] io-cache: cache for inode(0x7fdce0002780) is invalid. flushing all pages

Now my client has problems with both servers (fd).

So perhaps there is a problem: why is 10.98.98.2 online while the client still reports EBADFD?

Regards,
Nicolas

On Mon, Feb 2, 2009 at 3:30 PM, nicolas prochazka <prochazka.nicolas@xxxxxxxxx> wrote:

Hi again,
last test and last log before I stop:
I made a change: I added option read-subvolume brick_10.98.98.2 to the client conf on 10.98.98.48

and option read-subvolume brick_10.98.98.1 to the client conf on 10.98.98.44

run 10.98.98.1 and 10.98.98.2 as server
run 10.98.98.44 and 10.98.98.48 as client

1 - stop 10.98.98.2
10.98.98.48 keeps running and reads from 10.98.98.1
10.98.98.44 keeps running, on 10.98.98.1

2 - restart 10.98.98.2, wait 5 minutes

3 - stop 10.98.98.1
the processes on 10.98.98.44 / 48 hang

I think the client cannot go back to reading from 10.98.98.2. Is that normal? 10.98.98.2 became ready again after the crash.

Regards, 
Nico

On Mon, Feb 2, 2009 at 2:25 PM, nicolas prochazka <prochazka.nicolas@xxxxxxxxx> wrote:

Hello,
I am still trying to debug my strange blocking problem.
I run the client with logging, but there is far too much of it (100 MB) to send to you, so here is just some info:

Server 10.98.98.1  and 10.98.98.2
client 10.98.98.44  10.98.98.48

Test: all tests are performed with a big file (> 10G). Sometimes the test hangs the process, sometimes the big file becomes corrupted (it seems that some data is missing).

run the whole system: OK
stop 10.98.98.2: client seems OK

run 10.98.98.2: sometimes it blocks
stop 10.98.98.1: client 10.98.98.44 blocks; the last log is:

2009-02-02 13:53:59 D [io-cache.c:798:ioc_need_prune] io-cache: locked table(0x614320)
2009-02-02 13:53:59 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked table(0x614320)

2009-02-02 13:53:59 D [client-protocol.c:1701:client_readv] brick_10.98.98.2: (2148533016): failed to get remote fd, returning EBADFD

and if I restart 10.98.98.1, the client runs again (ls works) and the log shows:

2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40423: STATFS

2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40424: STATFS
2009-02-02 14:03:33 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40425: STATFS

Client 10.98.98.48 does not block.

On Fri, Jan 30, 2009 at 10:14 AM, nicolas prochazka <prochazka.nicolas@xxxxxxxxx> wrote:

Hello,
first of all, thanks a lot for all your work.
Second,
your tests work for me, but when I replace echo or tail with certain types of program that open a file,
qemu for example, there are a lot of problems. The process hangs. I also tried with --disable-direct-io-mode; then the process does not hang, but the file seems to be corrupted.

It's a very strange problem.

Regards, 
Nicolas Prochazka.

2009/1/30 Raghavendra G <raghavendra@xxxxxxxxxxxxx>

nicolas,

I've two servers n1 and n2 which are being afred from client side. I am using the same configuration you finalized on for which you are facing the problem. n1 is the first child of afr.

on n1:

ifconfig eth0 down (eth0 is the interface I am using for communicating with server on n1)

on glusterfs mount:
1. ls (hangs for transport-timeout seconds but completes successfully after timeout)
2. I also had a file opened with tail -f /mnt/glusterfs/file before bringing down eth0 on n1.

3. echo "content" >> /mnt/glusterfs/file, appends to file and I was able to observe the content through tail -f.

on n1:
bring up eth0

on glusterfs mount:
1. ls (completes successfully without any problem).

2. echo "content-2" >> /mnt/glusterfs/file (also appends content-2 to file and shown in the output of tail -f)

From the above tests, it seems the bug is not reproducible in our setup. Is this similar to the procedure you followed to reproduce the bug? I am using glusterfs--mainline--3.0--patch-883.

regards,

On Fri, Jan 30, 2009 at 12:05 AM, Anand Avati <avati@xxxxxxxxxxxxx> wrote:

Raghu/ Krishna,

  can you guys look into this? It seems like a serious flaw..

avati

On Thu, Jan 29, 2009 at 7:13 PM, nicolas prochazka <prochazka.nicolas@xxxxxxxxx> wrote:

> hello again,
> to be more precise,
> now I can do 'ls /glustermountpoint' after the timeout in all cases, that's good,
> but for files which were opened before the crash of the first server, it does not
> work; the process seems to be blocked.
>
> Regards,
> Nicolas.

-- 
Raghavendra G