Raghavendra,
Thanks for looking into this; it's great that you have identified the
bug. I hope you can get a fix out soon, since it's a serious error
that undermines the data integrity and reliability of gluster. From
your message, I can't quite tell whether you agree that there is a
serious problem, or are simply explaining what happens inside gluster
as if everything is working as designed.
I understand what you are saying below, but from the viewpoint of a
client application there absolutely is data corruption. If a program
writes to a file, handles every error the operating system reports to
it, and later reads from the same file only to find that the data it
wrote isn't there, that is fundamentally a data corruption issue with
very serious implications for any program.
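To make the pattern concrete, here is a rough sketch of the kind of
careful writer I mean (the path and record are hypothetical, and this
is not my actual test program); even a program this defensive can end
up missing data in the scenario described below:

/* Sketch only: write a record, check every return value the OS gives us,
 * then read the record back.  Path and record are hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/unify/m/example.log";   /* hypothetical path */
    const char rec[] = "record-1\n";
    char back[sizeof(rec)];

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) { perror("open"); exit(1); }

    /* write, checking for short writes and errors */
    if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1)) {
        perror("write");
        exit(1);
    }

    /* flush to the server; any deferred error should surface here */
    if (fsync(fd) < 0) { perror("fsync"); exit(1); }

    /* read the record back and compare */
    if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); exit(1); }
    if (read(fd, back, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1) ||
        memcmp(back, rec, sizeof(rec) - 1) != 0) {
        fprintf(stderr, "data that was written successfully is not in the file\n");
        exit(2);
    }
    return 0;
}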
Restarting servers is typical behavior for setups of all sizes.
Machines require routine maintenance, and it is guaranteed that
servers will be brought offline for hardware changes and/or operating
system updates.
The documentation is misleading, as it implies that proper handling of
split brain is one of the new features in 2.0. I quote:
"Replicate correctness (includes cases of splitbrain, proper self-
heal fixes etc) Replicate is one of GlusterFS's key features. It was
questioned for its ability to function well in some split brain
situations. There were some corner cases in AFR which needed to be re-
worked. With 2.0.x releases, AFR has proper detection of split-brain
situations and works smoothly in all cases."
The FAQ entry linked below is also misleading and incorrect. The claim
that there is no data corruption when reading a file appears to be an
attempt to soften the actual truth: any open, replicated file that is
being written to during a restart will become corrupt. The statement
"All file systems are vulnerable to such loses" is just another
attempt to make it sound like gluster behaves like all other
filesystems, when in fact that is not even close to the truth.
http://gluster.org/docs/index.php/GlusterFS_Technical_FAQ#What_happens_in_case_of_hardware_or_GlusterFS_crash.3F
I would suggest you change it to something more accurate and specific:
"WARNING: If gluster is used in environments where files are
replicated and written to, you will experience data loss/corruption
when you restart servers. The longer you leave a file open, the
larger your exposure to this issue will be."
If the documentation were a bit more honest about what gluster can and
cannot do, instead of talking it up as the next best thing, it would
go a long way toward helping the project. Right now, after the
problems I've experienced myself, the problems I've seen other people
report on this list, and the lack of responses or the hand-waving that
these issues aren't serious, I have very little trust in this project.
It's really a shame, because it seems like there is so much potential.
Kind Regards,
Brian
On Sep 3, 2009, at 10:28 PM, Raghavendra G wrote:
With write-behind removed from the configuration, success is no longer
returned for writes that fail. The situation above is caused by
replicate being in the setup; if replicate is removed from the test
setup, the issue is not observed. As an example, consider replicate
over 2 nodes and the following sequence of operations (a rough
application's-eye sketch of steps 1-6 is given after this explanation):
1. Start writing on the gluster mount; data is replicated to both nodes.
2. Stop node1. The application still receives success on writes, but
data is written only to node2.
3. Restart node1. The application still receives success on writes,
but data is still not written to node1, since the file is no longer
open on node1. Also note that self-heal will not sync from node2 to
node1, since replicate does not yet support self-heal on open fds.
4. Stop node2. The application receives either ENOTCONN or EBADFD,
depending on the child from which replicate received the last reply
for the write: the subvolume corresponding to node1 returns EBADFD and
that of node2 returns ENOTCONN.
5. The application, sensing that writes are failing, issues an open on
the file. Note that node2 is still down; if it were up, the data on
node1 would be synced from node2.
6. Now writes happen only on node1. Note that the file on node1 was
already missing some writes that happened on node2; now node2 will
miss the future writes, leading to a split-brain situation.
7. Bring node2 back up. If an open happens now, when both nodes are
up, replicate identifies this as a split-brain situation, and manual
intervention is needed to identify the "more" correct version of the
file and remove the other. Replicate then copies the more correct
version of the file to the other node.
Hence the issue here is not that writes are failing. Writes are
happening, but there is a split-brain situation because of the way the
servers have been restarted.
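To put the application's-eye view of steps 1-6 into code, here is a
minimal sketch (illustration only: the node stops and restarts happen
outside the program, and the mount path is just an example):

/* Sketch of what the application observes during steps 1-6 above.
 * The stopping/starting of node1 and node2 happens outside this program;
 * the path is hypothetical. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/mnt/gluster/afr-test";   /* hypothetical mount */
    const char rec[] = "x\n";
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        if (write(fd, rec, sizeof(rec) - 1) == (ssize_t)(sizeof(rec) - 1)) {
            /* Steps 1-3: write() keeps returning success even while node1
             * is down or freshly restarted; the data lands only on node2. */
            continue;
        }
        if (errno == ENOTCONN || errno == EBADFD) {
            /* Step 4: node2 is stopped; which errno we see depends on
             * which replicate child answered the last write. */
            close(fd);
            /* Step 5: reopen while node2 is still down... */
            fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0666);
            if (fd < 0) { perror("open"); return 1; }
            /* Step 6: ...so further writes land only on node1, which is
             * already missing the writes that only reached node2. */
            continue;
        }
        perror("write");
        return 1;
    }
}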
On Wed, Sep 2, 2009 at 8:06 PM, Brian Hirt <bhirt@xxxxxxxxxxxxx> wrote:
On Sep 2, 2009, at 7:12 AM, Vijay Bellur wrote:
Brian Hirt wrote:
The first part of this problem (open files not surviving gluster
restarts) seems like a pretty major design flaw that needs to be
fixed.
Yes, we do know that this is a problem and we have our sights set on
solving this.
That is good to know. Do you know whether this is planned to be
backported into 2.0, or will it be part of 2.1? Is there a bug report
id so we can follow the progress?
The second part (gluster not reporting the error to the writer when
gluster chokes) is a critical problem that needs to be fixed.
This is a bug in the write-behind translator, and bug 242 has been
filed to address it.
A discussion from the mailing list archives which could be of
interest to you for the tail -f problem:
http://gluster.org/pipermail/gluster-users/20090113/001362.html
Is there any additional information I can provide in this bug
report? I have disabled the following section from my test clients
and can confirm that some errors that were not being reported are
now being sent back to the writer program. It's certainly an
improvement over no errors being reported.
volume writebehind
type performance/write-behind
option window-size 1MB
subvolumes distribute
end-volume
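As an aside, my understanding (an assumption on my part about
write-back layers in general, not something specific to gluster's
translator) is that once writes are buffered, errors can be deferred
until fsync() or close(), so a writer has to check those return values
too; close() failures are what show up as FLUSH() ERR in the fuse
client log. A minimal sketch of checking both (hypothetical path):

/* Sketch: with a write-back layer, a successful write() does not mean the
 * data reached the server; errors may only surface at fsync() or close(),
 * so both must be checked.  Path is hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char rec[] = "hello\n";
    int fd = open("/unify/m/example.log", O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd < 0) { perror("open"); exit(1); }

    if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1)) {
        perror("write");   /* immediate error when nothing is buffering writes */
        exit(1);
    }
    if (fsync(fd) < 0) {
        perror("fsync");   /* deferred error from buffered writes */
        exit(1);
    }
    if (close(fd) < 0) {
        perror("close");   /* surfaces as FLUSH() ERR in the fuse client log */
        exit(1);
    }
    return 0;
}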
I've also discovered that this problem is not isolated to the
writebehind module. While some errors are being sent back to the
writer, there is still data corruption in the gluster-created file:
gluster is still reporting success to the writer when writes have
failed. I have a simple program that writes 1, 2, 3, 4 ... N to a
file at the rate of 100 lines per second. Whenever the writer gets an
error back from write(), it waits a second, reopens the file and
continues writing. While this writer is running, I restart the gluster
nodes one by one. Once this is done, I stop the writer and check the
file for corruption.
One interesting observation is that when restarting the gluster
servers, sometimes errno EBADFD is returned and sometimes it's
ENOTCONN. When errno is ENOTCONN (107 on ubuntu 9.04) the file is not
corrupted; when errno is EBADFD (77 on ubuntu 9.04) there is file
corruption. These statements are based on a limited number of test
runs, but they have always held true for me.
Some sample output from these tests:
bhirt@ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 && ./write-numbers /unify/m/test1.2009-09-02
problems writing to fd, reopening logfile (errno = 77) in one second
^C
bhirt@ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
169 <> 480
bhirt@ubuntu:~/gluster-tests$ rm -f /unify/m/test1.2009-09-02 && ./write-numbers /unify/m/test1.2009-09-02
problems writing to fd, reopening logfile (errno = 107) in one second
^C
bhirt@ubuntu:~/gluster-tests$ ./check-numbers /unify/m/test1.2009-09-02
OK
The programs I use to test this are:
bhirt@ubuntu:~/gluster-tests$ cat write-numbers.c check-numbers
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

#define BUFSIZE 65536

/* write 100 entries per second */
#define WRITE_DELAY (1000000 / 100)

int open_testfile(char *testfile)
{
    int fd;

    fd = open(testfile, O_WRONLY | O_CREAT | O_APPEND, 0666);
    if (fd < 0) {
        perror("open");
        exit(2);
    }
    return fd;
}

void usage(char *s)
{
    fprintf(stderr, "\nusage: %s testfile\n\n", s);
}

int main(int argc, char **argv)
{
    char buf[BUFSIZE];
    int logfd;
    int nread;
    int counter = 0;

    if (argc != 2) {
        usage(argv[0]);
        exit(1);
    }

    logfd = open_testfile(argv[1]);

    /* loop endlessly */
    for (;;) {
        snprintf(buf, sizeof(buf), "%d\n", counter);
        nread = strnlen(buf, sizeof(buf));

        /* write data */
        int nwrite = write(logfd, buf, nread);
        if (nwrite == nread) {
            counter++;
            usleep(WRITE_DELAY);
        } else {
            /* restarted gluster nodes give this error in 2.0.6 */
            if (errno == EBADFD || errno == ENOTCONN) {
                /* wait a second before re-opening the file */
                fprintf(stderr, "problems writing to fd, reopening logfile "
                        "(errno = %d) in one second\n", errno);
                sleep(1);
                /* reopen the log file; counter was not incremented, so the
                   failed record is retried on the next iteration */
                logfd = open_testfile(argv[1]);
            } else {
                perror("write");
                exit(2);
            }
        }
    }
}
#!/usr/bin/perl
use strict;
use warnings;
my $i=0;
while (<>) { die "$i <> $_" if $i++ != $_; }
print STDERR "OK\n";
The client log file during one of the tests I ran:
[2009-09-02 09:59:23] E [saved-frames.c:165:saved_frames_unwind] remote1: forced unwinding frame type(1) op(FINODELK)
[2009-09-02 09:59:23] N [client-protocol.c:6246:notify] remote1: disconnected
[2009-09-02 09:59:23] E [socket.c:745:socket_connect_finish] remote1: connection to 10.0.1.31:6996 failed (Connection refused)
[2009-09-02 09:59:26] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:30] E [saved-frames.c:165:saved_frames_unwind] remote2: forced unwinding frame type(1) op(WRITE)
[2009-09-02 09:59:30] W [fuse-bridge.c:1534:fuse_writev_cbk] glusterfs-fuse: 153358: WRITE => -1 (Transport endpoint is not connected)
[2009-09-02 09:59:30] N [client-protocol.c:6246:notify] remote2: disconnected
[2009-09-02 09:59:30] E [socket.c:745:socket_connect_finish] remote2: connection to 10.0.1.32:6996 failed (Connection refused)
[2009-09-02 09:59:33] N [client-protocol.c:5559:client_setvolume_cbk] remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:34] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:37] E [saved-frames.c:165:saved_frames_unwind] remote1: forced unwinding frame type(1) op(FINODELK)
[2009-09-02 09:59:37] W [fuse-bridge.c:1534:fuse_writev_cbk] glusterfs-fuse: 153923: WRITE => -1 (File descriptor in bad state)
[2009-09-02 09:59:37] N [client-protocol.c:6246:notify] remote1: disconnected
[2009-09-02 09:59:40] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:41] N [client-protocol.c:5559:client_setvolume_cbk] remote2: Connected to 10.0.1.32:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:44] N [client-protocol.c:5559:client_setvolume_cbk] remote1: Connected to 10.0.1.31:6996, attached to remote volume 'brick'.
[2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse: 155106: FLUSH() ERR => -1 (File descriptor in bad state)
[2009-09-02 09:59:51] W [fuse-bridge.c:882:fuse_err_cbk] glusterfs-fuse: 155108: FLUSH() ERR => -1 (File descriptor in bad state)
Regards,
Vijay
--
Raghavendra G