Pranith and Vijay,

the problems began when I started to use alfresco.qc2 and other disk images on the gv_pri volume, which is backed by XFS on LVM partitions. I copied these images from another GlusterFS volume (backed by ext4, no LVM) where they work as expected. The VMs run on the same hosts, so the qemu-kvm version is the same.

Here are the details of a brick from the gv_pri (new and problematic) volume:

[root@networker bricks]# xfs_info /glustexp/pri1
meta-data=/dev/mapper/vg_guests-lv_brick1 isize=512    agcount=16, agsize=3194880 blks
         =                       sectsz=4096  attr=2, projid32bit=0
data     =                       bsize=4096   blocks=51118080, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=24960, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

This is a brick partition from the gv_sec (old and working properly) volume:

[root@networker2 bricks]# dumpe2fs -h /dev/sda1
dumpe2fs 1.41.12 (17-May-2010)
Filesystem volume name:   <none>
Last mounted on:          /glustexp/sec2
Filesystem UUID:          87678a0d-aef6-403c-930a-a9b2b4cb7c37
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              9773056
Block count:              39072718
Reserved block count:     1953635
Free blocks:              36406615
Free inodes:              9772982
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1014
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Dec 18 10:03:39 2013
Last mount time:          Thu Jan  9 23:03:24 2014
Last write time:          Thu Jan  9 23:03:24 2014
Mount count:              2
Maximum mount count:      39
Last checked:             Wed Dec 18 10:03:39 2013
Check interval:           15552000 (6 months)
Next check after:         Mon Jun 16 11:03:39 2014
Lifetime writes:          189 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       917534
Default directory hash:   half_md4
Directory Hash Seed:      4891a3c2-8e00-45a3-ac6b-ea96de069b38
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x0015bae4
Journal start:            31215

The block size is the same, 4096 bytes. I did some more investigation and it seems the problem happens only with VM disk images internally formatted with a block size of 1024 bytes. There are no problems with disk images formatted with a block size of 4096 bytes. Anyway, I don't know whether this is a coincidence. Do you think this could be the origin of the problem? If so, how can I solve it?

In the links posted by Vijay someone suggests starting the VMs with cache != none, but that will prevent live migration, AFAIK. Another solution may be to recreate the volume, backing it with XFS partitions formatted with a different block size (smaller? 1024 bytes?). This would be a painful option, but if it solves the problem, I'll go for it.
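For reference, this is how the sector sizes involved can be compared (the mkfs.xfs line below is only an illustration of what a reformat could look like, untested here): with cache=none qemu opens the image with O_DIRECT, and as far as I know XFS rejects direct writes that are not aligned to its sector size, so a brick with sectsz=4096 would return EINVAL ("Invalid argument") for writes that are only 512-byte aligned.

# logical and physical sector size of the device under the brick
blockdev --getss --getpbsz /dev/mapper/vg_guests-lv_brick1

# sector size XFS was formatted with (the sectsz fields above, 4096 on gv_pri)
xfs_info /glustexp/pri1 | grep sectsz

# if recreating the brick filesystem is acceptable (this destroys its data),
# a 512-byte sector size can be requested at mkfs time, provided the device
# reports a 512-byte logical sector size:
# mkfs.xfs -f -i size=512 -s size=512 /dev/mapper/vg_guests-lv_brick1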
Thanks a lot,
Fabio

----- Original Message -----
From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
To: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
Sent: Friday, January 24, 2014 12:52:44
Subject: Re: Replication delay

Fabio,
    It has nothing to do with SELinux, IMO. You were saying the self-heal happens when the VM is paused, which means writes from the self-heal's fd are succeeding. So something happened to the fd that kvm uses to write to that VM's image. I wonder what. When did you start getting this problem? What happened at that time?

Pranith

----- Original Message -----
> From: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> Sent: Friday, January 24, 2014 5:09:25 PM
> Subject: Re: Replication delay
>
> You're right! In the brick log from the first peer (networker, a.k.a. nw1glus.gem.local) I found lots of these errors:
>
> [2014-01-24 11:32:28.482639] E [posix.c:2135:posix_writev] 0-gv_pri-posix: write failed: offset 4812114432, Invalid argument
> [2014-01-24 11:32:28.485334] I [server-rpc-fops.c:1439:server_writev_cbk] 0-gv_pri-server: 31817: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==> (Invalid argument)
> [2014-01-24 11:32:28.483791] E [posix.c:2135:posix_writev] 0-gv_pri-posix: write failed: offset 5562239488, Invalid argument
> [2014-01-24 11:32:28.485416] I [server-rpc-fops.c:1439:server_writev_cbk] 0-gv_pri-server: 31820: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==> (Invalid argument)
> [2014-01-24 11:32:28.484275] E [posix.c:2135:posix_writev] 0-gv_pri-posix: write failed: offset 5757467136, Invalid argument
> [2014-01-24 11:32:28.482841] E [posix.c:2135:posix_writev] 0-gv_pri-posix: write failed: offset 3742501376, Invalid argument
> [2014-01-24 11:32:28.485494] I [server-rpc-fops.c:1439:server_writev_cbk] 0-gv_pri-server: 31822: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==> (Invalid argument)
> [2014-01-24 11:32:28.485534] I [server-rpc-fops.c:1439:server_writev_cbk] 0-gv_pri-server: 31818: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==> (Invalid argument)
> [2014-01-24 11:32:28.530943] E [posix.c:2135:posix_writev] 0-gv_pri-posix: write failed: offset 3156122112, Invalid argument
> [2014-01-24 11:32:28.530997] I [server-rpc-fops.c:1439:server_writev_cbk] 0-gv_pri-server: 31832: WRITEV 0 (f1e928ad-d4dd-49f3-abae-e99cb1f310e1) ==> (Invalid argument)
>
> Then I noticed that the SELinux contexts on the two bricks are different; I don't know if this could be the cause of the errors:
>
> [root@networker gluspri]# ll -Z /glustexp/pri1/brick/
> -rw-------. qemu qemu system_u:object_r:file_t:s0      alfresco.qc2
>
> [root@networker2 ~]# ll -Z /glustexp/pri1/brick/
> -rw-------. qemu qemu unconfined_u:object_r:file_t:s0  alfresco.qc2
>
>
> Fabio
>
> ----- Original Message -----
> > From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > To: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > Sent: Friday, January 24, 2014 12:27:56
> > Subject: Re: Replication delay
> >
> > Fabio,
> > Seems like writes on the first brick of this replica pair are failing from the mount. Could you check both the client and brick logs to see where these failures are coming from?
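If it helps, the offsets of the failed writes quoted above are all multiples of 512 but not of 4096 (each leaves a remainder of 2560), which would fit an O_DIRECT alignment problem rather than SELinux. A quick check, assuming bash:

for off in 4812114432 5562239488 5757467136 3742501376 3156122112; do
    # remainder modulo the 512-byte and 4096-byte boundaries
    echo "$off: mod 512 = $((off % 512)), mod 4096 = $((off % 4096))"
done
# every offset prints: mod 512 = 0, mod 4096 = 2560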
> >
> > Pranith
> > ----- Original Message -----
> > > From: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > Sent: Friday, January 24, 2014 4:50:52 PM
> > > Subject: Re: Replication delay
> > >
> > > Ok, that's the output after the VM has been halted:
> > >
> > > [root@networker ~]# getfattr -d -m. -e hex /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000001390000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > >
> > > [root@networker2 ~]# getfattr -d -m. -e hex /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000001390000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > >
> > >
> > > When "heal info" stops reporting alfresco.qc2 I get:
> > >
> > > [root@networker glusterfs]# getfattr -d -m. -e hex /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000000000000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > >
> > > [root@networker2 ~]# getfattr -d -m. -e hex /glustexp/pri1/brick/alfresco.qc2
> > > getfattr: Removing leading '/' from absolute path names
> > > # file: glustexp/pri1/brick/alfresco.qc2
> > > security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
> > > trusted.afr.gv_pri-client-0=0x000000000000000000000000
> > > trusted.afr.gv_pri-client-1=0x000000000000000000000000
> > > trusted.gfid=0x298c76de7c8643a3909f7ef77dc294fe
> > >
> > >
> > > Fabio
> > >
> > > ----- Original Message -----
> > > > From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > > > To: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > > Sent: Friday, January 24, 2014 11:36:12
> > > > Subject: Re: Replication delay
> > > >
> > > > This time, when you stop the VM, could you get the output of "getfattr -d -m. -e hex <file-path-on-brick>" on both bricks to debug further?
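In case it is useful for reading the xattrs above: as far as I understand, the trusted.afr.gv_pri-client-N value packs three 32-bit counters (pending data, metadata and entry operations) as hex. A rough way to split the value taken from the output above (just my interpretation, not an official tool), assuming bash:

val=000001390000000000000000
echo "data=0x${val:0:8} metadata=0x${val:8:8} entry=0x${val:16:8}"
printf 'pending data operations: %d\n' "0x${val:0:8}"
# -> pending data operations: 313

If that reading is right, both bricks accuse client-0 (the brick on nw1glus) of missing 313 data operations, which matches the failed writes in its brick log quoted earlier.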
> > > >
> > > > Pranith
> > > > ----- Original Message -----
> > > > > From: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > > > To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > > > > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > > > Sent: Friday, January 24, 2014 3:58:38 PM
> > > > > Subject: Re: Replication delay
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > > > > > To: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > > > > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > > > > Sent: Friday, January 24, 2014 11:02:15
> > > > > > Subject: Re: Replication delay
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > > From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
> > > > > > > To: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > > > > > Cc: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > > > > > Sent: Friday, January 24, 2014 3:29:19 PM
> > > > > > > Subject: Re: Replication delay
> > > > > > >
> > > > > > > Hi Fabio,
> > > > > > > This is a known issue that has been addressed on master. It may be backported to 3.5. When a file is undergoing changes, it may appear in 'gluster volume heal <volname> info' output even when it doesn't need any self-heal.
> > > > > > >
> > > > > > > Pranith
> > > > > >
> > > > > > Sorry, I just saw that there is a self-heal happening for 15 minutes when you stop the VMs. How are you checking that the self-heal is happening?
> > > > >
> > > > > When I stop the VM that uses alfresco.qc2, "heal info" still reports alfresco.qc2 as in need of healing for about 15 min. It seems this is a real out-of-sync situation, because if I check the two bricks I get different modification times up until they are healed (i.e. no longer reported by "heal info"). This is the bricks' status for alfresco.qc2 while the VM is halted:
> > > > >
> > > > > [root@networker ~]# ll /glustexp/pri1/brick/
> > > > > total 27769492
> > > > > -rw-------. 2 qemu qemu 8212709376 Jan 24 11:16 alfresco.qc2
> > > > > [...]
> > > > >
> > > > > [root@networker2 ~]# ll /glustexp/pri1/brick/
> > > > > total 27769384
> > > > > -rw-------. 2 qemu qemu 8212709376 Jan 24 11:05 alfresco.qc2
> > > > > [...]
> > > > >
> > > > > Bricks' status AFTER "heal info" no longer reports alfresco.qc2:
> > > > >
> > > > > [root@networker ~]# ll /glustexp/pri1/brick/
> > > > > total 27769492
> > > > > -rw-------. 2 qemu qemu 8212709376 Jan 24 11:05 alfresco.qc2
> > > > >
> > > > > [root@networker2 ~]# ll /glustexp/pri1/brick/
> > > > > total 27769384
> > > > > -rw-------. 2 qemu qemu 8212709376 Jan 24 11:05 alfresco.qc2
> > > > >
> > > > > Thanks for helping!
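A side note on the 15-minute delay, in case it is useful: as far as I know the self-heal daemon only crawls its index periodically (cluster.heal-timeout, 600 seconds by default), so a heal can also be kicked off by hand instead of waiting for the next crawl:

# heal only the entries currently listed by "heal info"
gluster volume heal gv_pri

# or force a crawl of the whole volume
gluster volume heal gv_pri full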
> > > > >
> > > > > Fabio
> > > > > > >
> > > > > > > Pranith
> > > > > > > ----- Original Message -----
> > > > > > > > From: "Fabio Rosati" <fabio.rosati@xxxxxxxxxxxxxxxxx>
> > > > > > > > To: "Gluster-users@xxxxxxxxxxx List" <gluster-users@xxxxxxxxxxx>
> > > > > > > > Sent: Friday, January 24, 2014 3:17:27 PM
> > > > > > > > Subject: Replication delay
> > > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > in a distributed-replicated volume hosting some VM disk images (GlusterFS 3.4.2 on CentOS 6.5, qemu-kvm with glusterfs native support, no fuse mount), I always get the same two files that need healing:
> > > > > > > >
> > > > > > > > [root@networker ~]# gluster volume heal gv_pri info
> > > > > > > > Gathering Heal info on volume gv_pri has been successful
> > > > > > > >
> > > > > > > > Brick nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Number of entries: 2
> > > > > > > > /alfresco.qc2
> > > > > > > > /remlog.qc2
> > > > > > > >
> > > > > > > > Brick nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Number of entries: 2
> > > > > > > > /alfresco.qc2
> > > > > > > > /remlog.qc2
> > > > > > > >
> > > > > > > > Brick nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Number of entries: 0
> > > > > > > >
> > > > > > > > Brick nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Number of entries: 0
> > > > > > > >
> > > > > > > > This is not a split-brain situation (I checked), and if I stop the two VMs that use these images, I get the two files healed/synced in about 15 min. This is too much time, IMHO. In this volume there are other VMs with (smaller) disk images replicated on the same bricks, and they get synced "in real-time".
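For reference, the split-brain check mentioned above can be done with the heal-info subcommands (these should exist on 3.4.x as far as I know; output not shown here):

# entries currently in split-brain, per brick
gluster volume heal gv_pri info split-brain

# entries on which a self-heal attempt failed
gluster volume heal gv_pri info heal-failed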
> > > > > > > > These are the volume's details, the host "networker" is nw1glus.gem.local:
> > > > > > > >
> > > > > > > > [root@networker ~]# gluster volume info gv_pri
> > > > > > > >
> > > > > > > > Volume Name: gv_pri
> > > > > > > > Type: Distributed-Replicate
> > > > > > > > Volume ID: 3d91b91e-4d72-484f-8655-e5ed8d38bb28
> > > > > > > > Status: Started
> > > > > > > > Number of Bricks: 2 x 2 = 4
> > > > > > > > Transport-type: tcp
> > > > > > > > Bricks:
> > > > > > > > Brick1: nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Brick2: nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Brick3: nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Brick4: nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Options Reconfigured:
> > > > > > > > server.allow-insecure: on
> > > > > > > > storage.owner-uid: 107
> > > > > > > > storage.owner-gid: 107
> > > > > > > >
> > > > > > > > [root@networker ~]# gluster volume status gv_pri detail
> > > > > > > >
> > > > > > > > Status of volume: gv_pri
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick                : Brick nw1glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Port                 : 50178
> > > > > > > > Online               : Y
> > > > > > > > Pid                  : 25721
> > > > > > > > File System          : xfs
> > > > > > > > Device               : /dev/mapper/vg_guests-lv_brick1
> > > > > > > > Mount Options        : rw,noatime
> > > > > > > > Inode Size           : 512
> > > > > > > > Disk Space Free      : 168.4GB
> > > > > > > > Total Disk Space     : 194.9GB
> > > > > > > > Inode Count          : 102236160
> > > > > > > > Free Inodes          : 102236130
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick                : Brick nw2glus.gem.local:/glustexp/pri1/brick
> > > > > > > > Port                 : 50178
> > > > > > > > Online               : Y
> > > > > > > > Pid                  : 27832
> > > > > > > > File System          : xfs
> > > > > > > > Device               : /dev/mapper/vg_guests-lv_brick1
> > > > > > > > Mount Options        : rw,noatime
> > > > > > > > Inode Size           : 512
> > > > > > > > Disk Space Free      : 168.4GB
> > > > > > > > Total Disk Space     : 194.9GB
> > > > > > > > Inode Count          : 102236160
> > > > > > > > Free Inodes          : 102236130
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick                : Brick nw3glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Port                 : 50182
> > > > > > > > Online               : Y
> > > > > > > > Pid                  : 14571
> > > > > > > > File System          : xfs
> > > > > > > > Device               : /dev/mapper/vg_guests-lv_brick2
> > > > > > > > Mount Options        : rw,noatime
> > > > > > > > Inode Size           : 512
> > > > > > > > Disk Space Free      : 418.3GB
> > > > > > > > Total Disk Space     : 433.8GB
> > > > > > > > Inode Count          : 227540992
> > > > > > > > Free Inodes          : 227540973
> > > > > > > > ------------------------------------------------------------------------------
> > > > > > > > Brick                : Brick nw4glus.gem.local:/glustexp/pri2/brick
> > > > > > > > Port                 : 50181
> > > > > > > > Online               : Y
> > > > > > > > Pid                  : 21942
> > > > > > > > File System          : xfs
> > > > > > > > Device               : /dev/mapper/vg_guests-lv_brick2
> > > > > > > > Mount Options        : rw,noatime
> > > > > > > > Inode Size           : 512
> > > > > > > > Disk Space Free      : 418.3GB
> > > > > > > > Total Disk Space     : 433.8GB
> > > > > > > > Inode Count          : 227540992
> > > > > > > > Free Inodes          : 227540973
> > > > > > > >
> > > > > > > > fuse-mount of the gv_pri volume:
> > > > > > > >
> > > > > > > > [root@networker ~]# ll -h /mnt/gluspri/
> > > > > > > > total 37G
> > > > > > > > -rw-------. 1 qemu qemu 7.7G Jan 24 10:21 alfresco.qc2
> > > > > > > > -rw-------. 1 qemu qemu 4.2G Jan 24 10:22 check_mk-salmo.qc2
> > > > > > > > -rw-------. 1 qemu qemu  27M Jan 23 16:42 newnxserver.qc2
> > > > > > > > -rw-------. 1 qemu qemu 1.1G Jan 23 13:38 newubutest1.qc2
> > > > > > > > -rw-------. 1 qemu qemu  11G Jan 24 10:17 nxserver.qc2
> > > > > > > > -rw-------. 1 qemu qemu 8.1G Jan 24 10:17 remlog.qc2
> > > > > > > > -rw-------. 1 qemu qemu 5.6G Jan 24 10:19 ubutest1.qc2
> > > > > > > >
> > > > > > > > Do you think this is the expected behaviour, maybe due to caching? What if the most up-to-date node goes down while the VMs are running?
> > > > > > > >
> > > > > > > > Thanks a lot,
> > > > > > > >
> > > > > > > > Fabio Rosati
> > > > > > > >
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users