Ron Mayer wrote:
> Greg Smith wrote:
>> There are some known limitations to Linux fsync that I remain somewhat
>> concerned about, independently of LVM, like "ext3 fsync() only does a
>> journal commit when the inode has changed" (see
>> http://kerneltrap.org/mailarchive/linux-kernel/2008/2/26/990504 ). The
>> way files are preallocated, the PostgreSQL WAL is supposed to function
>> just fine even if you're using fdatasync after WAL writes, which also
>> wouldn't touch the journal (last time I checked fdatasync was
>> implemented as a full fsync on Linux). Since the new ext4 is more
>
> Indeed it does.
>
> I wonder if there should be an optional fsync mode
> in postgres that turns fsync() into
> fchmod (fd, 0644); fchmod (fd, 0664);
> to work around this issue.

Question is... why do you care if the journal is not flushed on fsync?
Only the file data blocks need to be flushed, if the inode is unchanged.

> For example this program below will show one write
> per disk revolution if you leave the fchmod() in there,
> and run many times faster (i.e. lying) if you remove it.
> This with ext3 on a standard IDE drive with the write
> cache enabled, and no LVM or anything between them.
>
> ==========================================================
> /*
> ** based on http://article.gmane.org/gmane.linux.file-systems/21373
> ** http://thread.gmane.org/gmane.linux.kernel/646040
> */
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc,char *argv[]) {
>   if (argc<2) {
>     printf("usage: fs <filename>\n");
>     exit(1);
>   }
>   int fd = open (argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
>   int i;
>   for (i=0;i<100;i++) {
>     char byte = 0;    /* initialized so we never write garbage */
>     pwrite (fd, &byte, 1, 0);
>     /* toggling the mode dirties the inode, forcing a journal commit */
>     fchmod (fd, 0644); fchmod (fd, 0664);
>     fsync (fd);
>   }
>   return 0;
> }
> ==========================================================
>

I ran the program above, w/o the fchmod()s.
$ time ./test2 testfile

real    0m0.056s
user    0m0.001s
sys     0m0.008s

This is with ext3+LVM+raid1+sata disks with hdparm -W1.
With -W0 I get:

$ time ./test2 testfile

real    0m1.014s
user    0m0.000s
sys     0m0.008s

Big difference. The fsync() there does its job.

The same program runs with a x3 slowdown with the fchmod()s, but that's
expected: it's doing twice the writes, and in different places.

.TM.
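
P.S. For anyone who wants a per-call figure rather than a total from
time(1): the numbers above work out to roughly 0.56 ms per iteration
with -W1 and roughly 10 ms per iteration with -W0 (100 iterations each).
Below is a minimal sketch of the same loop with the timing built in, so
the fchmod() and non-fchmod() variants can be compared directly. The
function names time_fsync_loop() and ms_per_revolution() are mine, not
from the original program, and the 7200 rpm figure mentioned afterwards
is only an assumption for illustration; check your own drive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: time for one platter revolution, in milliseconds. */
double ms_per_revolution(double rpm)
{
    return 60000.0 / rpm;
}

/*
 * Hypothetical helper: run `iterations` cycles of a 1-byte pwrite()
 * followed by fsync() and return the elapsed wall-clock seconds,
 * or -1.0 on any error. If use_fchmod is non-zero, the file mode is
 * toggled before each fsync() so that the inode is dirty every time.
 */
double time_fsync_loop(const char *path, int iterations, int use_fchmod)
{
    assert(iterations > 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1.0;

    struct timespec start, end;
    char byte = 'x';
    double result = -1.0;

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        goto out;

    for (int i = 0; i < iterations; i++) {
        if (pwrite(fd, &byte, 1, 0) != 1)
            goto out;
        if (use_fchmod) {
            if (fchmod(fd, 0644) != 0 || fchmod(fd, 0664) != 0)
                goto out;
        }
        if (fsync(fd) != 0)
            goto out;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        goto out;

    result = (double)(end.tv_sec - start.tv_sec)
           + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
out:
    close(fd);
    return result;
}
```

Calling time_fsync_loop("testfile", 100, 0) and then
time_fsync_loop("testfile", 100, 1), and dividing each result by 100,
gives seconds per fsync() for the two cases. If the per-call time with
the write cache off is far below ms_per_revolution() for your drive
(about 8.3 ms at an assumed 7200 rpm), something between fsync() and
the platter is probably still caching.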