Well, this release went so much better than the last release. Matt has
created the following page:
http://fedoraproject.org/wiki/Infrastructure/F7LessonsLearned
For those who don't remember last year: it didn't go so well. Here are some issues we had this year:
* Spiky load + our iptables implementation brought the static pages down for a few minutes at a time at the beginning of the release (first couple of hours). This also caused some mirrors.fp.o outages. - Minor issue
* Wiki Load - The wiki, as always, causes issues. All in all I think the wiki did OK. We never created any static pages for it (as we did last year), but we did lose app1 three times (it needed to be rebooted). This also brought docs.fp.o down (see below). - Medium issue
* Nagios - Just ended up creating a lot of noise. This was a known issue and is already being worked on :) This stuff will get better as we implement better service dependencies. - Known issue
* Docs - This one lies on me. It went much better than last year (when docs was down for close to 4 days). Unfortunately, unlike most of our apps, docs didn't get set up to load balance properly. As such, all traffic was going to app1, and when moin combined with docs took app1 down, docs just died. We put a temporary site in place, but it was untested and we had some display issues. We'll have this one totally corrected in no time.
* RAM - This was an identified issue, but not one that we had much time to fix. It seems most of the problems we had were RAM related (technically swap related). There's a great deal of performance we can gain in every area by adding RAM. Additionally, some of our boxes are close to being out of warranty and can be replaced with 64-bit boxes.
* Accounts - The accounts stuff died a bit, same issue as the wiki. The account system is in the process of being replaced anyway, and in the meantime it can be installed on more app servers (presently it's only on two).
Stuff that went right:
* Mirrors.fp.o and smolt - Mirrors worked like a champ, and so did smolt. Spread across two boxes, they only required minor tweaking from Matt and really just worked. This is quite an impressive feat, since neither one of these apps has been under this sort of load before and we really didn't know exactly how they would work (though our initial load tests on both suggested they would work fine). The public list was CRITICAL in getting users to the mirrors and it just worked. My hat's off to everyone with this stuff. Both of these apps are written in TurboGears.
* The static pages - Boy, what a difference these made. It was very nice to know we also had them on the mirrors, though we never needed to use them.
* #fedora-admin - For the most part stayed on topic, everyone was helpful and it was great to get all of the smart people together to test, discover what went wrong and work to fix it.
* Torrent - Didn't hear a word about it, and AFAIK it 'just worked'. Amazing.
* Dynamic allocation of systems - We were able to take down some of our app servers and add RAM to them, as well as create a new proxy server, at the temporary cost of test and build servers. Now that release day is done we can just put everything back as it was. We could have done even more to help the wiki if we had more RAM. App1 had 3G of RAM and App2 had 4G (those are our wiki servers). App1 still crashed.
* Fedora 7 Release - We had a few bumps, but I'd easily call this a success. We got people to the mirrors on release day. Much better than last year, and even better still, we did better than Ubuntu did on their last release :)
For next time:
Aside from the wiki upgrade and trying to find more efficiencies there, the big thing we need is more RAM. I'd say 95% of our issues could have been non-issues just with more RAM. Fortunately that's the easy part :) Nagios should be in good shape by next time. Thanks, everyone, for your hard work. We've done a really good job this time out. Anyone have other comments?
-Mike