MirrorManager outage root cause

At approximately 01:00 UTC today, clients requesting the mirror list
started getting timeouts, then HTTP 503 errors generated by the
MirrorManager mirror list processes.  On the Fedora Infrastructure
application servers, the loads spiked, the out-of-memory killer
started firing, and chaos ensued.

The proximate cause of this failure appears to be invalid data in
the MirrorManager database: specifically, the bandwidth value for
several servers was NULL, which should not be possible.  I say
proximate, not root, because I have not yet been able to reproduce the
failure using the invalid data, though since fixing the data we
have not seen further failures.  Reproducing it remains to be done.
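
A check for this kind of invalid data could be sketched roughly as
below.  This is only an illustration: the table name "host" and the
columns "name" and "bandwidth_int" are hypothetical stand-ins, not the
actual MirrorManager schema, and an in-memory SQLite database stands
in for the production database.

```python
import sqlite3


def find_null_bandwidth(conn):
    """Return the names of hosts whose bandwidth value is NULL."""
    # Table and column names are hypothetical, chosen for illustration.
    rows = conn.execute(
        "SELECT name FROM host WHERE bandwidth_int IS NULL ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def demo():
    """Build a tiny in-memory table and report the rows with NULL bandwidth."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (name TEXT, bandwidth_int INTEGER)")
    conn.executemany(
        "INSERT INTO host VALUES (?, ?)",
        [
            ("mirror-a.example.org", 100),
            ("mirror-b.example.org", None),  # invalid: NULL bandwidth
            ("mirror-c.example.org", None),  # invalid: NULL bandwidth
        ],
    )
    try:
        return find_null_bandwidth(conn)
    finally:
        conn.close()
```

Running a query like this periodically would flag bad rows before
they reach the mirror list processes.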

There are fixes in the MirrorManager 1.4 (unreleased) branch to
prevent this kind of invalid data from being stored, but these were
not present in the 1.3 version currently in production.  Additional
fixes have been put into the MM 1.4 branch tonight to further ensure
this type of invalid data cannot affect the mirrorlist_server process.
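
As a rough illustration of the defensive side of this, the sketch
below coerces a possibly-invalid bandwidth value into a usable number.
It assumes the value is consumed as a positive numeric weight; the
function name safe_bandwidth and the fallback value of 1 are
hypothetical and are not taken from the MM 1.4 code.

```python
DEFAULT_BANDWIDTH = 1  # hypothetical fallback, not a MirrorManager constant


def safe_bandwidth(value, default=DEFAULT_BANDWIDTH):
    """Return value as a positive int, or default if it is unusable.

    Handles None (a NULL from the database), non-numeric strings,
    and zero or negative numbers.
    """
    try:
        bandwidth = int(value)
    except (TypeError, ValueError, OverflowError):
        # TypeError covers None; ValueError covers strings such as "abc"
        # and NaN; OverflowError covers infinite floats.
        return default
    return bandwidth if bandwidth > 0 else default
```

Applying a guard like this where the data is read means a single bad
row degrades to a default rather than propagating into later
calculations.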

Thanks to Stephen Smoogen and Kevin Fenzi for their quick work to
identify the failing systems and minimize the impact to other Fedora
services.

Thanks,
Matt


