I have found that the O_APPEND flag is key to this failure - I had overlooked that flag when reading the strace and trying to cobble up a minimal reproduction. I now have a small pair of python scripts that can reliably reproduce this failure. On Sep 5, 2014, at 11:11 AM, mike <mike@xxxxxxxxxxxxxxxxxxxx> wrote: > Hi all, > > I've been trying to track down a bug that looks like data corruption coming from the Gluster FUSE client. I'm pretty new to Gluster, so perhaps there are better ways to get to the bottom of this. > > I have a distributed, replicated volume that was initialized as 3.5.1 and it has been functioning fine in general. > > Recently we discovered a write corruption issue that is always reproducible by one of our binaries. > > The scenarios is as follows: > > There is one writer appending binary entries that have a sequence number and checksum. Writes are small - 38 bytes each - and there is no fsync while the file is streaming out. > > If this is the only activity, everything is perfectly fine. > > However, if that file is simultaneously opened for reading, in a tail-like fashion, the data written into gluster is corrupt. Reading the file back out of disk reveals a dropped message (a missing sequence number). This happens when the reader and writer are different processes, so this is not a bug in our userspace. > > What is even more interesting it that the first run during the writing reports the sequence number of the failure differently than all subsequent tests. > > It feels like there is some kind of cache coherency issue here, but after putting glusterfs into -L DEBUG and following along, I'm not closer to a understanding the issue. > > stracing glusterfs suggests that the missing records were actually sent to the remote glusterfs daemons. I have appended that strace at the bottom. > > For reference, here is a bit of strace that shows the overall data flow: > > The file is opened for writing like so: > > 32684 open(".../logtest", O_WRONLY|O_CREAT, 0664) = 7 > 32684 fcntl(7, F_SETFD, FD_CLOEXEC) = 0 > 32684 fcntl(7, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE) > 32684 fcntl(7, F_SETFL, O_WRONLY|O_APPEND|O_LARGEFILE) = 0 > 32686 write(7, "...", 38) = 38 > 32686 nanosleep({0, 1000000}, NULL) = 0 > > > The file is opened for reading like so: > > open(".../logtest", O_RDONLY) = 6 > fstat(6, {st_mode=S_IFREG|0644, st_size=912, ...}) = 0 > lseek(6, 0, SEEK_CUR) = 0 > read(6, "..."..., 131072) = 912 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 912) = 912 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 1026) = 1026 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > read(6, "", 131072) = 0 > lseek(6, 0, SEEK_SET) = 0 > read(6, "..."..., 1140) = 1102 > lseek(6, 38, SEEK_CUR) = 1140 > nanosleep({0, 10000000}, NULL) = 0 > read(6, "...", 131072) = 114 > > I have been trying to duplicate the failure with a small python script, but I have as yet been unable to do so. This C++ binary does reproduce it quite reliably, but the build system for this is quite complex and not readily shareable. > > I hope you can give me some advice about where to look next. > > Thanks, > -Mike > > > > Strace from glusterfs client. fd 22 a tcp socket connected to glusterfsd on a remote host. > > > 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0M\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1\n\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\357\306\250@\0\0\0\5\0\0\0\0\273\\^\v8\356\243\372\10\1\30\0 \7", 38}], 4) = 230 > > This is the last record that shows up in the glusterfs file before the "gap" - sequence number 7. > > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0N\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\220", 4}, {"\0\0\0O\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\31\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0", 28}], 3) = 148 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0P\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Q\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\204\236\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1\n\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3 <unfinished ...> > 30058 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0R\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3 <unfinished ...> > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0S\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0T\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > 30049 writev(22, [{"\200\0\1\10", 4}, {"\0\0\0U\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0L\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\357\350\336\20\0\0\0\5\0\0\0\0\345Ob/@P\226\233\10\1\30\0 \10\31D\6#\0\0\0\6\23\221\33-\357\373\321\240\0\0\0\5\0\0\0\0\27$\341,\303\324\25\202\10\1\30\0 \t", 76}], 4) = 268 > 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0V\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\1|\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\360\16\212\230\0\0\0\5\0\0\0\0\4t\22\330\323\r4\10\10\1\30\0 \n", 38}], 4) = 230 > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0W\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0X\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\377\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Y\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\f\0\5\363\227\0\0\0L\0\0ul\0\0'\24\0\0\23\210\0\0\0\f\0\0\0\4\0\0\0\6\0\0\0.\0\0\0~\0\0\17\241\0\0\17\242\0\0\17\243\0\0\17\245\0\0\17\246\0\0\23\210\0\0'\20\0\0'\21\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 116}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0\0\0\0", 44}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0Z\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0L\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0[\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\10\264\205QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\2\0\0\0\2\0\0\0\0\0\0\0\0\0\0\1|\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164 > 30058 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0\\\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0010\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0]\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > > This is the first record that shows up in the file after the gap - sequence number 11. Somehow, records 8,9,10 are getting squashed. > > 30049 writev(22, [{"\200\0\0\342", 4}, {"\0\0\0^\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\r\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0010\0\0\0&\0\0\204\1\0\0\0J\0\0\0\2\0\0\0\31\0\0\0\2glusterfs.write-is-append\0000\0\0\0\0\27\0\0\0\2glusterfs.open-fd-count\0004\0\0\0", 120}, {"\31D\6#\0\0\0\6\23\221\33-\360mc\330\0\0\0\5\0\0\0\0\366\37\221\333\365V\4\223\10\1\30\0 \v", 38}], 4 <unfinished ...> > 30049 writev(22, [{"\200\0\0\200", 4}, {"\0\0\0_\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\20\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10\364-\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\32\0\0\0\1\0\0\0\v\0\0\0\2batch-fsync\0001\0\0\0", 60}], 3) = 132 > 30049 writev(22, [{"\200\0\0\240", 4}, {"\0\0\0`\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\36\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\1\0\0\0\0\0\0\0\0\0\0\1V\0\0\0\0\0\0\0&\0\0\0\0\0\0\0\0\0\0\0\21lab-0-replicate-6\0\0\0\0\0\0\0", 92}], 3) = 164 > 30049 writev(22, [{"\200\0\0\314", 4}, {"\0\0\0a\0\0\0\0\0\0\0\2\0\23\320\5\0\0\1J\0\0\0\"\0\5\363\227\0\0\0\34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\10D\21\206QP\177\0\0\0\0\0\0\0\0\0\0", 68}, {"\300}T\25\223\275Hd\245:%\37UM5\230\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0d\0\0\0\2\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-13\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\33\0\0\0\ftrusted.afr.lab-0-client-12\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0", 136}], 3) = 208 > > _______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://supercolony.gluster.org/mailman/listinfo/gluster-users