Hi Krutika, Leo,

Sounds promising. I will test this too, and report back tomorrow (or
maybe sooner, if corruption occurs again).

-- 
Sander

On 27-03-19 10:00, Krutika Dhananjay wrote:
> This is needed to prevent any inconsistencies stemming from buffered
> writes/caching of file data during live VM migration.
> Besides, for Gluster to truly honor direct-io behavior in qemu's
> 'cache=none' mode (which is what oVirt uses), one needs to turn on
> performance.strict-o-direct and disable remote-dio.
>
> -Krutika
>
> On Wed, Mar 27, 2019 at 12:24 PM Leo David <leoalex@xxxxxxxxx> wrote:
>
>     Hi,
>     I can confirm that after setting these two options, I haven't
>     encountered disk corruption anymore.
>     The downside is that, at least for me, it had a pretty big impact
>     on performance: running fio tests inside the VMs, iops really
>     went down.
>
>     On Wed, Mar 27, 2019, 07:03 Krutika Dhananjay <kdhananj@xxxxxxxxxx> wrote:
>
>         Could you enable strict-o-direct and disable remote-dio on the
>         src volume as well, restart the VMs on "old" and retry the
>         migration?
>
>         # gluster volume set <VOLNAME> performance.strict-o-direct on
>         # gluster volume set <VOLNAME> network.remote-dio off
>
>         -Krutika
>
>         On Tue, Mar 26, 2019 at 10:32 PM Sander Hoentjen <sander@xxxxxxxxxxx> wrote:
>
>             On 26-03-19 14:23, Sahina Bose wrote:
>             > +Krutika Dhananjay and gluster ml
>             >
>             > On Tue, Mar 26, 2019 at 6:16 PM Sander Hoentjen <sander@xxxxxxxxxxx> wrote:
>             >> Hello,
>             >>
>             >> tl;dr We have disk corruption when doing live storage
>             >> migration on oVirt 4.2 with gluster 3.12.15. Any idea why?
>             >>
>             >> We have a 3-node oVirt cluster that is both compute and
>             >> gluster-storage. The manager runs on separate hardware. We
>             >> were running out of space on the existing volume, so we
>             >> added another Gluster volume that is bigger, put a storage
>             >> domain on it and then migrated VMs to it with LSM. After
>             >> some time, we noticed that (some of) the migrated VMs had
>             >> corrupted filesystems. After moving everything back with
>             >> export-import to the old domain where possible, and
>             >> recovering from backups where needed, we set off to
>             >> investigate this issue.
>             >>
>             >> We are now at the point where we can reproduce this issue
>             >> within a day. What we have found so far:
>             >> 1) The corruption occurs at the very end of the replication
>             >> step, most probably between START and FINISH of
>             >> diskReplicateFinish, before the START merge step.
>             >> 2) In the corrupted VM, at some place where data should be,
>             >> this data is replaced by zeros. This can be file contents, a
>             >> directory structure, or whatever.
>             >> 3) The source gluster volume has different settings than
>             >> the destination (mostly because the defaults were different
>             >> at creation time):
>             >>
>             >> Setting                        old(src)  new(dst)
>             >> cluster.op-version             30800     30800 (the same)
>             >> cluster.max-op-version        31202     31202 (the same)
>             >> cluster.metadata-self-heal     off       on
>             >> cluster.data-self-heal         off       on
>             >> cluster.entry-self-heal        off       on
>             >> performance.low-prio-threads   16        32
>             >> performance.strict-o-direct    off       on
>             >> network.ping-timeout           42        30
>             >> network.remote-dio             enable    off
>             >> transport.address-family       -         inet
>             >> performance.stat-prefetch      off       on
>             >> features.shard-block-size      512MB     64MB
>             >> cluster.shd-max-threads        1         8
>             >> cluster.shd-wait-qlength       1024      10000
>             >> cluster.locking-scheme         full      granular
>             >> cluster.granular-entry-heal    no        enable
>             >>
>             >> 4) To test, we migrate some VMs back and forth. The
>             >> corruption does not occur every time. So far it has only
>             >> occurred from old to new, but we don't have enough data
>             >> points to be sure about that.
>             >>
>             >> Does anybody have an idea what is causing the corruption?
>             >> Is this the best list to ask, or should I ask on a Gluster
>             >> list? I am not sure whether this is oVirt-specific or
>             >> Gluster-specific, though.
>             > Do you have logs from the old and new gluster volumes? Any
>             > errors in the new volume's fuse mount logs?
>
>             Around the time of corruption I see the message:
>             The message "I [MSGID: 133017] [shard.c:4941:shard_seek]
>             0-ZoneA_Gluster1-shard: seek called on
>             7fabc273-3d8a-4a49-8906-b8ccbea4a49f. [Operation not
>             supported]" repeated 231 times between [2019-03-26
>             13:14:22.297333] and [2019-03-26 13:15:42.912170]
>
>             I also see this message at other times, when I don't see the
>             corruption occur, though.
>
>             -- 
>             Sander
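
For anyone retrying this, it may help to confirm that the two options
actually took effect on a volume before restarting the VMs and
re-attempting the migration. A sketch, using gluster's option query
(<VOLNAME> is a placeholder for the volume name):

# gluster volume get <VOLNAME> performance.strict-o-direct
# gluster volume get <VOLNAME> network.remote-dio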
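
And a minimal sketch of the kind of in-VM fio run Leo mentions for
gauging the iops impact; the job name, file path, size, and tunables
below are illustrative only, not the exact test he ran:

# fio --name=odirect-randwrite --filename=/var/tmp/fio.dat --size=1G \
      --ioengine=libaio --rw=randwrite --bs=4k --direct=1 --iodepth=32 \
      --runtime=60 --time_based --group_reporting
  (parameters above are examples only; --direct=1 bypasses the guest
  page cache, so the run measures the storage path rather than guest
  memory, which is what matters when comparing before/after
  strict-o-direct)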