Re: Outage notes

Stephen John Smoogen <smooge@xxxxxxxxx> · Wed, 12 Jan 2011 14:13:36 -0700

On Wed, Jan 12, 2011 at 08:06, seth vidal <skvidal@xxxxxxxxxxxxxxxxx> wrote:
> Hi Everyone,
>  I took some notes while we were rebooting boxes I wanted to share them
> with everyone for future outages.
>
> Ordering of the bounces:
> 1. xen14: puppet is on there and if that is back up first we have a
> place to stand for pushing out any changes (dns changes for example via
> puppet) - xen14 takes about 4 minutes to restart/POST

Most of the new IBM hardware can take 4-6 minutes to reboot. I don't
know if there are some flags I should have set on it, but it is deadly
slow.

> Overall things to think about for the future:
> 1. dumping a complete virsh list - including how much memory is actually
> being used per vm per server before we start reboots
> 2. checking what disks need fscks because of mounted time and doing
> those earlier or separately.
> 3. verifying that all running vms are:
>   a. intended to be running
>   b. have a config file
>   c. are set to autostart
> 4. verifying that all NOT running vms are:
>   a. intended to be off
>   b. are NOT set to autostart

Looks good. I thought koji2 was running before the reboots, but it may
have been a ghost vm.
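
For what it's worth, checks 3 and 4 in the list above could be scripted
roughly like this. It is only a sketch: it assumes we save the text of
`virsh list --all` from each host before the reboots and that we keep
our own lists of which guests are meant to be up, which have a config
file, and which are set to autostart. The guest names other than koji2
and all the sets in the sample are made up for illustration.

```python
def parse_virsh_list(text):
    """Return {name: state} parsed from the text of `virsh list --all`."""
    states = {}
    for line in text.splitlines():
        # Columns are Id, Name, State; State may contain a space ("shut off").
        parts = line.split(None, 2)
        # Skip blank lines, the dashed separator, and the column header.
        if len(parts) < 3 or parts[0] == "Id":
            continue
        _dom_id, name, state = parts
        states[name] = state.strip()
    return states


def audit(states, intended_on, has_config, autostart, ignore=("Domain-0",)):
    """Return a list of human-readable problems for checks 3 and 4.

    intended_on, has_config and autostart are sets of guest names that we
    maintain ourselves (assumption: they are not read from libvirt here).
    """
    problems = []
    for name, state in sorted(states.items()):
        if name in ignore:
            continue  # the Xen dom0 always shows up as running
        if state == "running":
            if name not in intended_on:
                problems.append(f"{name}: running but not intended to be")
            if name not in has_config:
                problems.append(f"{name}: running without a config file")
            if name not in autostart:
                problems.append(f"{name}: running but not set to autostart")
        else:
            if name in intended_on:
                problems.append(f"{name}: {state} but intended to be running")
            if name in autostart:
                problems.append(f"{name}: {state} but set to autostart")
    return problems


# Hypothetical pre-reboot snapshot; only koji2 is a name from this thread.
SAMPLE = """\
 Id    Name                           State
----------------------------------------------------
 0     Domain-0                       running
 3     puppet1                        running
 5     koji2                          running
 -     test9                          shut off
"""

if __name__ == "__main__":
    found = audit(
        parse_virsh_list(SAMPLE),
        intended_on={"puppet1"},
        has_config={"puppet1", "test9"},
        autostart={"puppet1", "test9"},
    )
    for problem in found:
        print(problem)
```

A guest like koji2 that is running with no config and no autostart
flag would show up three times in the report, which is about what I
would expect a ghost vm to look like.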

> thoughts welcome.
> -sv
>
>
>
>
> _______________________________________________
> infrastructure mailing list
> infrastructure@xxxxxxxxxxxxxxxxxxxxxxx
> https://admin.fedoraproject.org/mailman/listinfo/infrastructure
>

-- 
Stephen J Smoogen.
"The core skill of innovators is error recovery, not failure avoidance."
Randy Nelson, President of Pixar University.
"Let us be kind, one to another, for most of us are fighting a hard
battle." -- Ian MacLaren