Re: [109all] NOC update #2


 



Hi John

As this is quite a political question I will answer it for you.

If the criticisms are narrowly interpreted as being about the front end we are running, then this information implies that they were likely based on an incorrect interpretation of the issues.  If, however, the criticisms are more broadly interpreted as being about our whole approach of taking responsibility for the systems integration rather than outsourcing to an integrated service provider, then this information does not change anything.

Jay

On 19/11/2020, at 9:17 PM, John C Klensin <john-ietf@xxxxxxx> wrote:

Sean,

I think this is obvious but, just to check my understanding:
this implies that all of the attacks on Meetecho for
unreliability of their software during those incidents were
misguided or misdirected.  Correct?

thanks,
  john




--On Thursday, November 19, 2020 01:56 +0000 Sean Croghan
<sean@xxxxxxxxxxxxxxxx> wrote:

As previously reported, we tracked down the cause of the
interruption of the iabopen session to an issue with an
unexpected Azure network interface removal event on network
interfaces provisioned with SR-IOV.  To prevent this happening
again we intended to remove SR-IOV networking entirely.
Unfortunately it now transpires that this change did not get
applied to 2 of the 16 VMs including the application VM for
the Plenary. So to add to the list of reasons to want 2020 to
be over, towards the end of Plenary the same network interface
removal event occurred and triggered an outage long enough to
affect everyone.

I can confirm that the SR-IOV provisioning has now been
removed from all VMs, which we believe eliminates the risk of
the same thing happening again.  We continue to work with
Azure Direct Support to determine the underlying cause of the
removal events.
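
The gap described above (a change that reached only 14 of the 16 VMs) is the kind of thing a small audit script can catch. The following is a minimal sketch, not the NOC's actual tooling. It assumes the NIC inventory is available as a list of JSON-style records shaped like the output of the Azure CLI command `az network nic list`, where SR-IOV ("accelerated networking" in Azure terms) is assumed to appear as the boolean field `enableAcceleratedNetworking`. The NIC names in the sample are hypothetical.

```python
def nics_with_sriov(nics):
    """Return the sorted names of NICs that still have SR-IOV enabled.

    Each record is assumed to be a dict with a "name" key and an
    "enableAcceleratedNetworking" boolean (an assumption about the
    inventory format, modelled on `az network nic list` output).
    """
    return sorted(
        nic.get("name", "<unnamed>")
        for nic in nics
        if nic.get("enableAcceleratedNetworking")
    )


# Hypothetical inventory: two NICs were missed by the change.
sample = [
    {"name": "media-01-nic", "enableAcceleratedNetworking": False},
    {"name": "plenary-app-nic", "enableAcceleratedNetworking": True},
    {"name": "media-02-nic", "enableAcceleratedNetworking": True},
    {"name": "web-01-nic"},  # field absent: treated as disabled
]

leftover = nics_with_sriov(sample)
if leftover:
    print("SR-IOV still enabled on:", ", ".join(leftover))
else:
    print("SR-IOV removed from all NICs")
```

Running a check like this after a fleet-wide change, and treating a non-empty result as a failure, confirms that the change reached every VM.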

Please let me know if you have any questions.

Sean



On Nov 17, 2020, at 4:56 PM, Sean Croghan  wrote:



I have an update for those of you affected by the outage in
yesterday's IABOPEN session. We have isolated this to an
interruption of the virtual machine's network interface. We
have engaged the hardware and network team at Azure to
determine the cause of this event but do not have an
explanation at this time.

I will provide an update when we have received more
information.


For those interested in details:

At 07:56:36 UTC the network interface (eth0) went link down
  and the interface was removed from the VM
At 08:00:28 UTC a new interface was added to the VM
At 08:00:29 UTC the new interface (eth1) went link up
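
As a quick worked check of the timeline above, the sketch below computes how long the VM was without a working interface. The timestamps are copied from the message. The event labels and the function name are illustrative only.

```python
from datetime import datetime, timedelta

# Timestamps from the timeline above (all UTC, same day).
EVENTS = [
    ("07:56:36", "eth0 link down, interface removed"),
    ("08:00:28", "new interface added"),
    ("08:00:29", "eth1 link up"),
]


def outage_duration(events):
    """Return the time between the earliest and latest event."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for stamp, _ in events]
    return max(times) - min(times)


gap = outage_duration(EVENTS)
print(f"Interface unavailable for {int(gap.total_seconds())} s ({gap})")
# prints "Interface unavailable for 233 s (0:03:53)"
```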

Yes, the VM added a new interface. The servers were provisioned
with SR-IOV and we suspect that a migration event occurred
that moved the VM to different hardware causing the NIC driver
to be reloaded. We have found some evidence that would support
our theory that a migration or unscheduled maintenance event
occurred and are working to verify if that happened during
this event. We have removed SR-IOV from the network interfaces
on all servers.

I hope you are having a good and productive week


- The IETF NOC Team

--
109all mailing list
109all@xxxxxxxx
https://www.ietf.org/mailman/listinfo/109all

-- 
Jay Daley
IETF Executive Director
jay@xxxxxxxx

