Erik Faye-Lund <kusmabite@xxxxxxxxx> writes: > On Fri, Jan 18, 2013 at 6:50 PM, Eric Chamberland > <Eric.Chamberland@xxxxxxxxxxxxxxx> wrote: >> Good idea! >> >> I did a strace and here is the output with the error: >> >> http://www.giref.ulaval.ca/~ericc/strace_git_error.txt >> >> Hope it will be insightful! > > This trace doesn't seem to contain child-processes, but instead having > their stderr inlined into the log. Try using "strace -f" instead... I happen to have access to a lustre FS on the brutus cluster of ETH Zurich, so I figured I could give it a shot. What's odd is that while I cannot reproduce the original problem, there seems to be another issue/bug with utime(): $ strace -f -o ~/gc.trace git gc warning: failed utime() on .git/objects/68/tmp_obj_sCAEVc: Interrupted system call warning: failed utime() on .git/objects/a6/tmp_obj_3cdB2c: Interrupted system call warning: failed utime() on .git/objects/69/tmp_obj_lbU3Xc: Interrupted system call warning: failed utime() on .git/objects/c3/tmp_obj_EU97Wc: Interrupted system call warning: failed utime() on .git/objects/3e/tmp_obj_tb2j3c: Interrupted system call warning: failed utime() on .git/objects/15/tmp_obj_e6zMXc: Interrupted system call warning: failed utime() on .git/objects/54/tmp_obj_ExOJVc: Interrupted system call warning: failed utime() on .git/objects/e3/tmp_obj_GtPw4c: Interrupted system call warning: failed utime() on .git/objects/21/tmp_obj_Xex32c: Interrupted system call warning: failed utime() on .git/objects/1a/tmp_obj_CzwsZc: Interrupted system call warning: failed utime() on .git/objects/18/tmp_obj_o6fp3c: Interrupted system call warning: failed utime() on .git/objects/32/tmp_obj_Ih0G4c: Interrupted system call warning: failed utime() on .git/objects/41/tmp_obj_1RXV1c: Interrupted system call Counting objects: 137744, done. Delta compression using up to 48 threads. Compressing objects: 100% (36510/36510), done. Writing objects: 100% (137744/137744), done. Total 137744 (delta 101591), reused 135512 (delta 99472) The trace is here (2.1MB compressed): http://thomasrast.ch/download/gc.trace.bz2 For the test I used a clone of another git.git I had around. I think the error is from sha1_file.c:2564. While that doesn't look too important (see ca11b212 for context), it does raise the question: what other system calls that we never expect to EINTR can do this on sufficiently arcane system/FS combinations? Peff's test ran without any apparent issue for a few minutes. I also ran an extended version (at the end) that sets alarms, so as to actually get interrupted. That proved more interesting. I had to fix verify() and write_in_full() to account for EINTR in read()/write(), as those seem likely to fail. I also got link() to fail: $ ~/lustre-peff-reproducer unable to create hard link: Interrupted system call unable to open index file: No such file or directory but it took a long time. Unfortunately, when running it with strace I managed to panic the host I ran it on: $ strace -o ~/peff-reproducer.trace ~/lustre-peff-reproducer Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:LustreError: 37417:0:(osc_lock.c:1182:osc_lock_enqueue()) LBUG Message from syslogd@brutus1 at Jan 21 17:09:43 ... kernel:Kernel panic - not syncing: LBUG Yay for now having to explain this to the cluster team. I tried finding a standard that limits the syscalls to which EINTR applies, without too much success. I'm not sure how far I should trust my manpages, but while some of them explicitly list EINTR as a possible error (read, write, etc.) link() does not. (And the linux manpages agree with the POSOIX ones for once.) If somebody finds such a standard, we could of course use it to blame lustre instead :-) In the absence of it, wouldn't we in theory have to write a simple loop-on-EINTR wrapper for *all* syscalls? Of course there's the added problem that when open(O_CREAT|O_EXCL) fails with EINTR, it's hard to tell whether a file that may now exist is indeed yours or some other process's. --- 8< ---- #include <fcntl.h> #include <unistd.h> #include <stdlib.h> #include <stdio.h> #include <string.h> #include <sys/time.h> #include <signal.h> #include <errno.h> struct itimerval itv; static int randomize(unsigned char *buf, int len) { int i; len = rand() % len; for (i = 0; i < len; i++) buf[i] = rand() & 0xff; return len; } static int check_eof(int fd) { int ch; int r = read(fd, &ch, 1); if (r < 0) { perror("read error after expected EOF"); return -1; } if (r > 0) { fprintf(stderr, "extra byte after expected EOF"); return -1; } return 0; } static int verify(int fd, const unsigned char *buf, int len) { while (len) { char to_check[4096]; int got = read(fd, to_check, len < sizeof(to_check) ? len : sizeof(to_check)); if (got < 0 && errno == EINTR) continue; if (got < 0) { perror("unable to read"); return -1; } if (got == 0) { fprintf(stderr, "premature EOF (%d bytes remaining)", len); return -1; } if (memcmp(buf, to_check, got)) { fprintf(stderr, "bytes differ"); return -1; } buf += got; len -= got; } return check_eof(fd); } int write_in_full(int fd, const unsigned char *buf, int len) { while (len) { int r = write(fd, buf, len); if (r < 0 && errno == EINTR) continue; if (r < 0) return -1; buf += r; len -= r; } return 0; } int move_into_place(const char *old, const char *new) { if (link(old, new) < 0) { perror("unable to create hard link"); return 1; } unlink(old); return 0; } void handle_alarm(int signal) { } int main(void) { struct sigaction sa; sa.sa_handler = handle_alarm; sa.sa_flags = SA_RESTART; sigaction(SIGALRM, &sa, NULL); itv.it_interval.tv_sec = 0; itv.it_interval.tv_usec = 10000; itv.it_value.tv_sec = 0; itv.it_value.tv_usec = 100000; setitimer(ITIMER_REAL, &itv, NULL); while (1) { static unsigned char junk[1024*1024]; int len = randomize(junk, sizeof(junk)); int fd; /* clean up from any previous round */ unlink("tmpfile"); unlink("final.idx"); fd = open("tmpfile", O_WRONLY|O_CREAT, 0666); if (fd < 0) { perror("unable to open tmpfile"); return 1; } if (write_in_full(fd, junk, len) < 0 || fsync(fd) < 0 || close(fd) < 0) { perror("unable to write"); return 1; } if (move_into_place("tmpfile", "final.idx") < 0) return 1; fd = open("final.idx", O_RDONLY); if (fd < 0) { perror("unable to open index file"); return 1; } if (verify(fd, junk, len) < 0) return 1; close(fd); } } -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html