On Fri, Mar 4, 2011 at 12:08, Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
> This is a summary of an issue I've been looking at with a very large
> centralized Git repository. It's a repository that gets approximately
> 100 commits per day, almost all to its master branch.
>
> I think I've found why the issue I'm describing happens (not confirmed
> yet); I mainly wanted to write something to the list to have a record
> of this in case anyone runs into it in the future.
>
> Last week we upgraded from Git 1.6.5 to 1.7.2.1 on the server housing
> our repository, and started getting errors like these from developers
> running variants of git-fetch:
>
>   $ git pull --rebase
>   remote: Counting objects: 2, done.
>   remote: Compressing objects: 100% (2/2), done.
>   remote: Total 2 (delta 0), reused 0 (delta 0)
>   remote: aborting due to possible repository corruption on the remote side.
>   error: waitpid for pack-objects failed: No child processes
>   error: git upload-pack: git-pack-objects died with error.
>   fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>   Unpacking objects: 100% (2/2), done.
>   fatal: error in sideband demultiplexer
>
> That error is from
> https://github.com/git/git/commit/b1c71b72815cb82a8bad14020a047320b88a04eb
> by Junio from 2006, we're refusing to send an incomplete pack file on
> failure.
>
> We've also been getting this error from git-fetch directly (from a
> wrapper script):
>
>   # INFO : Checking working directory
>   # ERROR: failed to git fetch --tags from 'origin' errorcode: 128
>   # ERROR: git fetch --tags origin
>   # ERROR: error: waitpid for pack-objects failed: No child processes
>   # ERROR: error: git upload-pack: git-pack-objects died with error.
>   # ERROR: fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>   # ERROR: remote: aborting due to possible repository corruption on the remote side.
>   # ERROR: fatal: error in sideband demultiplexer
>
> And from git-remote-update(1):
>
>   $ git remote update
>   Fetching origin
>   remote: Counting objects: 9, done.
>   remote: Compressing objects: 100% (5/5), done.
>   remote: Total 5 (delta 4), reused 0 (delta 0)
>   error: waitpid for pack-objects failed: No child processes
>   error: git upload-pack: git-pack-objects died with error.
>   fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>   remote: aborting due to possible repository corruption on the remote side.
>   Unpacking objects: 100% (5/5), done.
>   fatal: error in sideband demultiplexer
>   error: Could not fetch origin
>
> All of these except maybe the first one (wasn't able to contact the
> dev in question) come from Git 1.7.2.1 clients talking to the 1.7.2.1
> server.
>
> Anyway, I think this issue is caused by this RHEL bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=166669 ([RHEL3 U5]
> waitpid() returns unexpected ECHILD) which was fixed in this RHEL
> update: http://rhn.redhat.com/errata/RHSA-2006-0144.html
>
> This is our Git server:
>
>   $ cat /etc/redhat-release && uname -r
>   CentOS release 4.1 (Final)
>   2.6.9-11.ELsmp
>
> And if I run:
>
>   wget https://bugzilla.redhat.com/attachment.cgi?id=118759 -O killipf.c &&
>   gcc -O2 -o killipf killipf.c -lpthread &&
>   PASS=0; while ./killipf; do let PASS=++PASS; echo $PASS; done
>
> It'll die within a minute with a message like this:
>
>   PASS : received expected signal 9
>   14605
>
>   child pid:2563
>   waitpid failed!: No child processes
>
> It does *not* die on these machines:
>
>   $ cat /etc/redhat-release && uname -r
>   CentOS release 4.6 (Final)
>   2.6.9-67.0.7.ELsmp
>
>   $ cat /etc/redhat-release && uname -r
>   CentOS release 5.5 (Final)
>   2.6.18-194.el5PAE
>
> Or on my personal Debian box:
>
>   $ cat /etc/debian_version && uname -r
>   wheezy/sid
>   2.6.32-5-amd64
>
> I haven't been able to trigger this issue with Git itself. I tried
> putting a copy of the repository in /tmp, then on one client running
> this in a while loop:
>
>   while true; do
>     head -n 10 /dev/urandom >a_file &&
>     git commit -m"more crap" a_file &&
>     git push
>   done
>
> And on another client running:
>
>   while true; do
>     git pull
>   done
>
> And I never got this waitpid error message. I might have just been
> unlucky, though, or perhaps it wasn't triggered in that case for some
> reason.
>
> Given this information we're going to upgrade CentOS on the relevant
> machine, I'll follow up on the list in a couple of weeks indicating
> whether or not that worked. We have enough users that if I ask people
> to tell me if we get this error and I don't hear anything for two
> weeks I can safely assume it went away.
>
> What we might want to do in Git is to work around this broken waitpid
> behavior (if that's indeed the issue). I haven't dug into what the
> RHEL kernel patch is solving, so I don't know if we can inexpensively
> detect when this is happening and warn users about it.
>
> Then again it would be a lot of work to work around a specific kernel
> bug. What I *mainly* wanted to do was to insert some note of this into
> the Git mailing list archive. Which I've now done.

An update: after upgrading CentOS past a kernel with this bug we
stopped having these errors. We used to have many of them every day,
but haven't had one in the week since the server was upgraded.
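
For anyone who wants to check whether their own server shows the same
symptom, here is a minimal sketch that counts occurrences of the error
signature quoted above in a log file. It assumes stderr from the fetch
wrapper scripts is being collected in a file somewhere; the function
name count_waitpid_errors and the log path in the comment are made up
for illustration and are not part of Git.

```shell
# Count lines carrying the waitpid/ECHILD failure reported in this thread.
# $1 is the path to a log file (hypothetical: wherever fetch stderr is kept).
count_waitpid_errors() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # there are no matches, so "|| true" keeps this usable under "set -e".
    grep -c 'waitpid for pack-objects failed: No child processes' "$1" || true
}

# Example call (the path is an assumption, adjust to your setup):
#   count_waitpid_errors /var/log/git-fetch-wrapper.log
```

Comparing the count for the days before and after a kernel upgrade
gives a rough check. A count of zero only shows the error wasn't hit
in that period, not that the kernel bug is absent.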