Re: [sepia] CentOS builds failing in Shaman since Friday evening

On Mon, Jun 5, 2017 at 2:16 PM, Alfredo Deza <adeza@xxxxxxxxxx> wrote:
> On Sun, Jun 4, 2017 at 5:26 AM, John Spray <jspray@xxxxxxxxxx> wrote:
>> On Sat, Jun 3, 2017 at 9:45 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>> I'm seeing the builds all complete:
>>>
>>> https://shaman.ceph.com/repos/ceph/wip-sage-testing/468be5dab6a2d8421a4fc35744463d47a80f47c2/
>>>
>>> but it won't schedule:
>>>
>>> $ teuthology-suite -s rados -c wip-sage-testing2 --subset 111/444 -p 100
>>> -k distro
>>> 2017-06-03 20:44:27,112.112 INFO:teuthology.suite.run:kernel sha1: distro
>>> 2017-06-03 20:44:27,389.389 INFO:teuthology.suite.run:ceph sha1:
>>> b03f3062d40c35e4898d77604d62e7e7c4e88afd
>>> Traceback (most recent call last):
>>>   File "/home/sage/src/teuthology/virtualenv/bin/teuthology-suite", line 11, in <module>
>>>     load_entry_point('teuthology', 'console_scripts', 'teuthology-suite')()
>>>   File "/home/sage/src/teuthology/scripts/suite.py", line 137, in main
>>>     return teuthology.suite.main(args)
>>>   File "/home/sage/src/teuthology/teuthology/suite/__init__.py", line 86, in main
>>>     run = Run(conf)
>>>   File "/home/sage/src/teuthology/teuthology/suite/run.py", line 46, in __init__
>>>     self.base_config = self.create_initial_config()
>>>   File "/home/sage/src/teuthology/teuthology/suite/run.py", line 92, in create_initial_config
>>>     self.choose_ceph_version(ceph_hash)
>>>   File "/home/sage/src/teuthology/teuthology/suite/run.py", line 185, in choose_ceph_version
>>>     util.schedule_fail(str(exc), self.name)
>>>   File "/home/sage/src/teuthology/teuthology/suite/util.py", line 72, in schedule_fail
>>>     raise ScheduleFailError(message, name)
>>> teuthology.exceptions.ScheduleFailError: Scheduling sage-2017-06-03_20:44:27-rados-wip-sage-testing2-distro-basic-smithi
>>> failed: 'package_manager_version'
>>>
>>> :/
>>
>> Same here.
>
> On Friday we had configuration that had recently been pushed to Jenkins to
> build nfs-ganesha and samba for every commit to the Ceph release branches,
> *including master*.
>
> The effect of that change was not immediately apparent, since it only
> "reacts" to activity on the Ceph repo.
>
> Locally `git log` shows just 19 new commits for June 2nd (that Friday),
> but GitHub shows about 15 merges with a *ton* of commits for master
> (over 100 commits).
>
> This is not usually a problem, but the combinatorial effect meant that
> those ~100 commits were really more like 300+ commits *that appeared
> within minutes of each other*.
>
> Trying to mitigate that problem, I manually changed a slave so it could
> consume more of this "bookkeeping" work from the master Jenkins
> instance. That had the side effect of running up to 10 Ceph builds at
> the same time (we don't allow this) and mixing up the information about
> where builds should go.
>
> Builds follow this path: GitHub -> Jenkins trigger -> Jenkins jobs for
> different distros -> Jenkins asks Shaman which chacra server to push to
> -> binaries are pushed to the selected chacra server.
>
> Since I made this one server do several Ceph builds, the variables
> used to determine "which chacra server should I push my binaries to"
> got polluted. That is why John's build POSTed to the wrong chacra
> server (hence the 404).
>
> On Friday we disabled the nfs-ganesha and samba builds, and we have a
> tracker issue open to address the fact that we are (currently) unable
> to digest several hundred commits at once:
>
>     http://tracker.ceph.com/issues/20095
>
> Apologies for the trouble; this unfortunately means you will need to
> rebuild your branches (if they failed to schedule).

Thanks Alfredo -- we appear to be back in business!

John

>
>>
>> The tip of my branch is 50b0654e, I can see teuthology finding that
>> and going to query shaman at this URL:
>> https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=centos%2F7%2Fx86_64&sha1=50b0654ed19cb083af494878738c7debe80db31e
>>
>> The result has an empty dict for the 'extra' field, where teuthology
>> is expecting to see package_manager_version.
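>>
>> A quick way to see it from the command line (a sketch; jq and the
>> '.[0].extra' path are my assumption about the shape of the search
>> response, which appears to be a list of repo records):
>>
>>   curl -s 'https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=centos%2F7%2Fx86_64&sha1=50b0654ed19cb083af494878738c7debe80db31e' \
>>     | jq '.[0].extra'
>>   {}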
>>
>> That stuff is supposed to be populated by ceph-build/build/build-rpm
>> posting a repo-extra.json file to chacra.ceph.com.
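>>
>> Roughly what that step amounts to, as far as I can tell (a sketch only;
>> the curl flags, the $CHACRA_KEY placeholder and the version value are
>> illustrative rather than the actual ceph-build code --
>> package_manager_version is the one field teuthology needs):
>>
>>   # hypothetical repo-extra.json; the real one is generated by build-rpm
>>   echo '{"package_manager_version": "3.4.3"}' > repo-extra.json
>>   curl -u admin:$CHACRA_KEY -X POST -H 'Content-Type: application/json' \
>>     -d @repo-extra.json \
>>     "https://chacra.ceph.com/repos/ceph/<branch>/<sha1>/centos/7/flavors/default/extra/"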
>>
>> I see my build log here:
>> https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos7,DIST=centos7,MACHINE_SIZE=huge/3848//consoleFull
>>
>> And I see the POST to chacra failing here:
>>
>> """
>> build@2/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/repo-extra.json
>> -u admin:[*******]
>> https://1.chacra.ceph.com/repos/ceph/wip-jcsp-testing-20170604/50b0654ed19cb083af494878738c7debe80db31e/centos/7/flavors/default/extra/
>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>                                  Dload  Upload   Total   Spent    Left  Speed
>>
>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>> 100   505  100    52  100   453    145   1267 --:--:-- --:--:-- --:--:--  1268
>> 404 Not Found
>>
>> The resource could not be found.
>> """
>>
>> So the ceph-build script is succeeding where it should be failing
>> (does curl not return an error or is the script ignoring it?) and
>> something is wrong with chacra.ceph.com that's making it 404 here (I
>> don't know where to begin to debug that).
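>>
>> For what it's worth, plain curl exits 0 even when the server answers
>> 404; it only reflects HTTP errors in its exit status when run with
>> -f/--fail (or if the script inspects the response itself). A quick
>> illustration, with $URL standing in for the failing chacra endpoint:
>>
>>   curl -s  -o /dev/null "$URL"; echo $?   # 0  -- the 404 goes unnoticed
>>   curl -sf -o /dev/null "$URL"; echo $?   # 22 -- curl's code for HTTP errors >= 400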
>>
>> John
>>
>> P.S. Probably a topic for another day, but I didn't love having to
>> traverse several different git repos to try to work out what was
>> happening during a build; wouldn't it be simpler to have a single repo
>> for the build infrastructure?
>>
>>>
>>> On Sat, 3 Jun 2017, Gregory Farnum wrote:
>>>
>>>> Adding sepia list for more infrastructure dev attention. (No idea where that
>>>> problem is coming from.)
>>>>
>>>> On Sat, Jun 3, 2017 at 5:52 AM Nathan Cutler <ncutler@xxxxxxx> wrote:
>>>>       CentOS builds in Shaman started failing with this error:
>>>>
>>>>       {standard input}: Assembler messages:
>>>>       {standard input}:186778: Warning: end of file not at end of a
>>>>       line;
>>>>       newline inserted
>>>>       {standard input}: Error: open CFI at the end of file; missing
>>>>       .cfi_endproc directive
>>>>       c++: internal compiler error: Killed (program cc1plus)
>>>>       Please submit a full bug report,
>>>>       with preprocessed source if appropriate.
>>>>       See <http://bugzilla.redhat.com/bugzilla> for instructions.
>>>>
>>>>       AFAICT the first occurrence was in [1] and the error has been
>>>>       haunting
>>>>       the build queue since then.
>>>>
>>>>       [1] https://shaman.ceph.com/builds/ceph/wip-sage-testing2/f93ad23a8fec219667a03695136842edb0cceace/default/45729/
>>>>
>>>>       Nathan
>>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


