Re: git-pack-objects dying with errors due to possible RHEL kernel bug

On Fri, Mar 4, 2011 at 12:08, Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
> This is a summary of an issue I've been looking at with a very large
> centralized Git repository. It's a repository that gets approximately
> 100 commits per day, almost all to its master branch.
>
> I think I've found why the issue I'm describing happens (not confirmed
> yet). I mainly wanted to write something to the list to have a record
> of this in case anyone runs into it in the future.
>
> Last week we upgraded from Git 1.6.5 to 1.7.2.1 on the server housing
> our repository, and started getting errors like these from developers
> running variants of git-fetch:
>
>     $ git pull --rebase
>     remote: Counting objects: 2, done.
>     remote: Compressing objects: 100% (2/2), done.
>     remote: Total 2 (delta 0), reused 0 (delta 0)
>     remote: aborting due to possible repository corruption on the remote side.
>     error: waitpid for pack-objects failed: No child processes
>     error: git upload-pack: git-pack-objects died with error.
>     fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>     Unpacking objects: 100% (2/2), done.
>     fatal: error in sideband demultiplexer
>
> That error is from
> https://github.com/git/git/commit/b1c71b72815cb82a8bad14020a047320b88a04eb
> by Junio from 2006: we refuse to send an incomplete pack file on
> failure.
>
> We've also been getting this error from git-fetch directly (from a
> wrapper script):
>
>     # INFO : Checking working directory
>     # ERROR: failed to git fetch --tags from 'origin' errorcode: 128
>     # ERROR: git fetch --tags origin
>     # ERROR: error: waitpid for pack-objects failed: No child processes
>     # ERROR: error: git upload-pack: git-pack-objects died with error.
>     # ERROR: fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>     # ERROR: remote: aborting due to possible repository corruption on the remote side.
>     # ERROR: fatal: error in sideband demultiplexer
>
> And from git-remote-update(1):
>
>     $ git remote update
>     Fetching origin
>     remote: Counting objects: 9, done.
>     remote: Compressing objects: 100% (5/5), done.
>     remote: Total 5 (delta 4), reused 0 (delta 0)
>     error: waitpid for pack-objects failed: No child processes
>     error: git upload-pack: git-pack-objects died with error.
>     fatal: git upload-pack: aborting due to possible repository corruption on the remote side.
>     remote: aborting due to possible repository corruption on the remote side.
>     Unpacking objects: 100% (5/5), done.
>     fatal: error in sideband demultiplexer
>     error: Could not fetch origin
>
> All of these except maybe the first one (wasn't able to contact the
> dev in question) come from Git 1.7.2.1 clients talking to the 1.7.2.1
> server.
>
> Anyway, I think this issue is caused by this RHEL bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=166669 ([RHEL3 U5]
> waitpid() returns unexpected ECHILD) which was fixed in this RHEL
> update: http://rhn.redhat.com/errata/RHSA-2006-0144.html
>
> This is our Git server:
>
>     $ cat /etc/redhat-release && uname -r
>     CentOS release 4.1 (Final)
>     2.6.9-11.ELsmp
>
> And if I run:
>
>     wget 'https://bugzilla.redhat.com/attachment.cgi?id=118759' -O killipf.c &&
>     gcc -O2 -o killipf killipf.c -lpthread &&
>     PASS=0; while ./killipf; do PASS=$((PASS + 1)); echo $PASS; done
>
> It'll die within a minute with a message like this:
>
>     PASS : received expected signal 9
>     14605
>
>     child pid:2563
>     waitpid failed!: No child processes
>
> It does *not* die on these machines:
>
>     $ cat /etc/redhat-release && uname -r
>     CentOS release 4.6 (Final)
>     2.6.9-67.0.7.ELsmp
>
>     $ cat /etc/redhat-release && uname -r
>     CentOS release 5.5 (Final)
>     2.6.18-194.el5PAE
>
> Or on my personal Debian box:
>
>     $ cat /etc/debian_version && uname -r
>     wheezy/sid
>     2.6.32-5-amd64
>
> I haven't been able to trigger this issue with Git itself. I tried
> putting a copy of the repository in /tmp, then running this in a
> while loop on one client:
>
>     while true; do
>         head -n 10 /dev/urandom >a_file &&
>         git commit -m"more crap" a_file &&
>         git push
>     done
>
> And on another client running:
>
>     while true; do
>         git pull
>     done
>
> I never got this waitpid error message. I might just have been
> unlucky, or perhaps the bug isn't triggered in that case for some
> reason.
>
> Given this information we're going to upgrade CentOS on the relevant
> machine, I'll follow up on the list in a couple of weeks indicating
> whether or not that worked. We have enough users that if I ask people
> to tell me if we get this error and I don't hear anything for two
> weeks I can safely assume it went away.
>
> What we might want to do in Git is to work around this broken waitpid
> behavior (if that's indeed the issue). I haven't dug into what the
> RHEL kernel patch fixes, so I don't know if we can inexpensively
> detect when this is happening and warn users about it.
>
> Then again it would be a lot of work to work around a specific kernel
> bug. What I *mainly* wanted to do was to insert some note of this into
> the Git mailing list archive. Which I've now done.
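
Short of changing Git itself, a client-side wrapper script (like the one
whose output is quoted above) could at least recognize this failure by
its message and point people at the server kernel. The following is only
a rough sketch under that assumption: the function names
has_waitpid_echild_error and fetch_with_hint and the wording of the hint
are made up for illustration, and the only thing taken from the logs
above is the "waitpid for pack-objects failed" error string.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (returns 0) if the given text contains
# the waitpid/ECHILD failure message seen in the logs in this thread.
has_waitpid_echild_error() {
    printf '%s\n' "$1" |
        grep -q 'waitpid for pack-objects failed: No child processes'
}

# Hypothetical wrapper: runs the given command, passes its combined
# output through on stdout, and prints a hint on stderr when the command
# failed and its output carries the known failure signature.
# The exit status of the wrapped command is preserved.
fetch_with_hint() {
    output=$("$@" 2>&1)
    status=$?
    printf '%s\n' "$output"
    if [ "$status" -ne 0 ] && has_waitpid_echild_error "$output"; then
        echo "hint: this looks like waitpid() returning ECHILD on the server;" >&2
        echo "hint: the server kernel may be affected by the RHEL bug." >&2
    fi
    return "$status"
}
```

It would be called as, for example, "fetch_with_hint git fetch --tags origin".
This only labels the failure for the user; it does not work around it.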

An update: after upgrading CentOS past a kernel with this bug we
stopped having these errors. We used to have many of them every day,
but haven't had one in the week since the server was upgraded.
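
For anyone wanting to check a machine with the killipf reproducer from
the quoted message: on an unaffected kernel the original loop never
exits, so a bounded variant is more convenient. This is a minimal
sketch, not something we ran as part of the investigation; the function
name run_until_failure and the pass limit are assumptions of mine.

```shell
#!/bin/sh
# Run a command repeatedly until it fails or the pass limit is reached.
# Usage: run_until_failure MAX_PASSES COMMAND [ARGS...]
# Prints the number of successful passes on stdout. Returns 0 if every
# pass succeeded, 1 if the command failed before the limit was reached.
run_until_failure() {
    max_passes=$1
    shift
    passes=0
    while [ "$passes" -lt "$max_passes" ]; do
        # Send the command's own stdout to stderr so that the pass
        # count is the only thing printed on stdout.
        if ! "$@" >&2; then
            echo "$passes"
            return 1
        fi
        passes=$((passes + 1))
    done
    echo "$passes"
    return 0
}
```

For example, "run_until_failure 100000 ./killipf" stops at the first
failing run or after 100000 clean passes. The limit is arbitrary; the
quoted run on the affected kernel failed after 14605 passes, so it
should be comfortably larger than that.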