Hi,
We have observed a data corruption bug in a database created by the
postmap command (BDB file) under the following conditions:
Xen domU guest kernel 3.8 or 3.9 (3.5, 3.10 and 3.11 don't show the
behaviour; 3.6 and 3.7 are unknown)
dom0 Xen 4.2.1 / kernel 3.8 or Xen 4.3.0 / kernel 3.11
The guest has a passed-through block device (phy:/ or file:/)
The filesystem on the passed-through device is ext2/3/4 with a 1k block
size
By examining a strace of the postmap command we produced a short piece
of code (at the bottom) which demonstrates the problem. If this is
executed in a loop such as:
#!/bin/bash
for i in $(seq 1 5) ; do
    mount /dev/xvde1 /mnt
    pushd /mnt > /dev/null
    echo "checksums after mount"
    md5sum testcase.bin
    # run the test program on the first iteration only
    [ "${i}" = "1" ] && ./a.out
    echo "checksums before umount"
    md5sum testcase.bin
    popd > /dev/null
    umount /mnt
done
The output is
checksums after mount
md5sum: testcase.bin: No such file or directory
checksums before umount
719f20c98b69457ce0247d6bf4474cf9 testcase.bin  # the correct checksum for the file
checksums after mount
a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
checksums before umount
a90804e64bcc1c0c98dd2cb23d0e4c10 testcase.bin
checksums after mount
14bb035eca1ec516ce3865700536fc0c testcase.bin
checksums before umount
14bb035eca1ec516ce3865700536fc0c testcase.bin
checksums after mount
124d3d3ea8e421925825ff94a815630b testcase.bin
checksums before umount
124d3d3ea8e421925825ff94a815630b testcase.bin
checksums after mount
7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
checksums before umount
7c05f36ffdd6b8217a27c0bd4d9cb531 testcase.bin
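The md5sums above only show that the contents change across mounts, not
where. One way to narrow this down might be to check each block against
the pattern the test program at the bottom writes (every byte equals its
offset within the block & 0x0f). This is a minimal sketch under that
assumption; check_blocks is a hypothetical helper and not part of the
original test case:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the blocks of `path` that deviate from the
 * pattern written by the test program (byte == offset-in-block & 0x0f).
 * Prints the first mismatch of each bad block.
 * Returns the number of bad blocks, or -1 on an open/read error.
 */
long check_blocks(const char *path, size_t bs)
{
    unsigned char buf[4096];
    long long off = 0;        /* file offset of buf[0] */
    long long last_bad = -1;  /* last block already reported */
    long bad = 0;
    ssize_t n;
    int fd;

    assert(bs > 0);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (ssize_t i = 0; i < n; i++) {
            long long o = off + i;
            long long blk = o / (long long)bs;
            unsigned char want = (unsigned char)((o % (long long)bs) & 0x0f);

            if (buf[i] != want && blk != last_bad) {
                printf("block %lld: first mismatch at offset %lld "
                       "(got 0x%02x, want 0x%02x)\n", blk, o, buf[i], want);
                last_bad = blk;
                bad++;
            }
        }
        off += n;
    }
    close(fd);
    return n < 0 ? -1 : bad;
}
```

Calling check_blocks("testcase.bin", 1024) after each mount, from a small
main, would show whether the damage is confined to the block that was
originally a hole or spreads to its neighbours.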
If we dd out the block device and then loop mount the resulting file,
we do not see this problem, which suggests that communication between
the Xen block back/front ends is OK and that the problem only arises
when the mount takes place. The default libdb behaviour seems to be to
create a database with a block size matching that of the filesystem;
if we override this and set it to 4k we do not see the issue. The same
is observed by changing the bs value in our test program: once bs is
> 3072 we no longer observe the problem. We can also avoid the issue
in our test program by filling in the hole while __testcase.bin is
being generated. A similar test on xfs with a 1k block size did not
demonstrate the problem. If we make a copy (cp) of the file before the
umount, the copied version is and remains correct.
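Since filling in the hole avoids the problem, it may be worth confirming
that the file really is sparse at a given point (for example just before
the umount). This is a rough sketch, assuming Linux semantics where
st_blocks is counted in 512-byte units; looks_sparse is a hypothetical
helper name. It also prints st_blksize for comparison with the
filesystem block size:

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: report size vs. allocated space for `path`.
 * Returns 1 if fewer bytes are allocated than the file size (the file
 * probably contains a hole), 0 if not, -1 if stat() fails.
 * Assumes st_blocks is in 512-byte units, as on Linux.
 */
int looks_sparse(const char *path)
{
    struct stat st;
    long long allocated;

    if (stat(path, &st) != 0)
        return -1;

    allocated = (long long)st.st_blocks * 512;
    printf("%s: size=%lld allocated=%lld st_blksize=%ld\n",
           path, (long long)st.st_size, allocated, (long)st.st_blksize);

    return allocated < (long long)st.st_size;
}
```

With bs = 1024 and the middle block left unwritten, we would expect
looks_sparse("testcase.bin") to return 1 straight after the first
fdatasync, though block accounting differs between filesystems, so treat
the result as a hint only.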
Our searching does not seem to have revealed any similar reports or an
explicitly identified fix that was introduced for 3.10. Our concern
therefore is that this is an unrecognised failure that has been
inadvertently fixed and could equally inadvertently be reintroduced by
some other change. If this problem sounds familiar or there are
suggestions on how to narrow this down further we would greatly
appreciate the advice.
Thanks,
James
#include <string.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>    /* lseek, write, pread, pwrite, fdatasync, close */
#include <sys/file.h>  /* flock */
#include <sys/stat.h>  /* fstat, umask */

int main(int argc, char *argv[])
{
    struct stat *sbuf;
    char *buf, *zero, *null;
    int fd5, fd6, fd7;
    int i;
    int bs = 1024; /* bs <= 3072 shows the corruption */

    buf = malloc(3*bs);
    zero = malloc(3*bs);
    null = malloc(bs);   /* scratch buffer for the preads */
    memset(zero, 0, 3*bs);
    sbuf = malloc(sizeof(struct stat));
    memset(sbuf, 0, sizeof(struct stat));
    for(i = 0; i < 3*bs; i++) {
        buf[i] = i & 0x000f;
    }

    fd5 = open("__testcase.bin", O_RDWR|O_CREAT|O_EXCL, 0644);
    //fcntl(fd5, F_GETFD);
    //fcntl(fd5, F_SETFD, FD_CLOEXEC);
    //stat("__testcase.bin", sbuf);
    fstat(fd5, sbuf);
    /* this only writes the first and last blocks */
    lseek(fd5, 0*bs, SEEK_SET);
    write(fd5, zero, bs);
    //lseek(fd5, 1*bs, SEEK_SET); /* filling in this hole is a fix! */
    //write(fd5, zero, bs);
    lseek(fd5, 2*bs, SEEK_SET);
    write(fd5, zero, bs);
    fdatasync(fd5);
    rename("__testcase.bin", "testcase.bin");
    //stat("testcase.bin", sbuf);

    fd6 = open("testcase.bin", O_RDWR|O_CREAT, 0);
    //fcntl(fd6, F_GETFD);
    //fcntl(fd6, F_SETFD, FD_CLOEXEC);
    //fstat(fd6, sbuf);
    pread(fd6, null, bs, 0);
    //fstat(fd6, sbuf);
    //fcntl(fd6, F_GETFD);
    //fcntl(fd6, F_SETFD, FD_CLOEXEC);
    //fcntl(fd6, F_GETFD);
    //fcntl(fd6, F_SETFD, FD_CLOEXEC);

    fd7 = open("testcase.bin", O_RDWR);
    flock(fd7, LOCK_EX);
    umask(022);
    /* read the hole and the last block, then overwrite all three blocks */
    pread(fd6, null, bs, 1*bs);
    pread(fd6, null, bs, 2*bs);
    pwrite(fd6, buf, bs, 0*bs);
    pwrite(fd6, buf, bs, 1*bs);
    pwrite(fd6, buf, bs, 2*bs);
    fdatasync(fd6);
    fdatasync(fd6);
    close(fd5);
    close(fd6);

    fd5 = open("testcase.bin", O_RDWR, 0);
    //fcntl(fd5, F_GETFD);
    //fcntl(fd5, F_SETFD, FD_CLOEXEC);
    fdatasync(fd5);
    close(fd5);
    close(fd7);

    free(buf);
    free(sbuf);
    free(zero);
    free(null);
    return 0;
}