On 1/27/20 12:04 PM, Allan W. Nielsen wrote:
> On 26.01.2020 16:59, Andrew Lunn wrote:
>> On Sun, Jan 26, 2020 at 02:22:13PM +0100, Horatiu Vultur wrote:
>>> The 01/25/2020 17:35, Andrew Lunn wrote:
>>> > > SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used to start/stop sending
>>> > > MRP_Test frames on the mrp ring ports. This is called only on nodes that have
>>> > > the role Media Redundancy Manager.
>>> >
>>> > How do you handle the 'headless chicken' scenario? User space tells
>>> > the port to start sending MRP_Test frames. It then dies. The hardware
>>> > continues sending these messages, and the neighbours think everything
>>> > is O.K, but in reality the state machine is dead, and when the ring
>>> > breaks, the daemon is not there to fix it?
> I agree, we need to find a solution to this issue.
>
>>> > And it is not just the daemon that could die. The kernel could oops or
>>> > deadlock, etc.
>>> >
>>> > For a robust design, it seems like SWITCHDEV_OBJ_ID_RING_TEST_MRP
>>> > should mean: start sending MRP_Test frames for the next X seconds, and
>>> > then stop. And the request is repeated every X-1 seconds.
> Sounds like a good idea to me.

Indeed. When the request is not renewed in time, it should then do the
same as mentioned below and "... become a 'dumb switch'", except that I
propose to make the fallback configurable: either auto-recovery ('dumb
switch'), or a safe mode that keeps the ports blocked and leaves it to
some higher layer protocol to fix things.

>
>>> I totally missed this case, I will update this as you suggest.
>>
>> What does your hardware actually provide?
>>
>> Given the design of the protocol, if the hardware decides the OS etc
>> is dead, it should stop sending MRP_TEST frames and unblock the ports.
>> It then becomes a 'dumb switch', and for a short time there will be a
>> broadcast storm. Hopefully one of the other nodes will then take over
>> the role and block a port.
> As far as I know, the only feature HW has to prevent this is a
> watch-dog timer. Which will reset the entire system (not a bad idea if
> the kernel has dead-locked).

Indeed. Our designs always have a watchdog. For the reboot after a
watchdog reset I again propose the same 2 bootup options. See also my
reply to Allan's reply to my email of 12:29PM.

Kind regards,
Jürgen

>
> /Allan
>
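
To make the "send for X seconds, renew every X-1 seconds" idea and the two
proposed fallbacks concrete, here is a rough Python model. It is only a
sketch for discussion: MrpTestLease, Fallback and simulate() are made-up
names, not part of switchdev or of any driver, and "ports blocked" is
reduced to a single boolean.

```python
from dataclasses import dataclass
from enum import Enum


class Fallback(Enum):
    # Hypothetical names for the two fallback behaviours discussed above.
    DUMB_SWITCH = "dumb_switch"  # unblock ring ports, forward like a plain switch
    SAFE_MODE = "safe_mode"      # keep ring ports blocked until something recovers


@dataclass
class MrpTestLease:
    """Toy model of a time-limited 'send MRP_Test frames' request."""
    fallback: Fallback
    expires_at: float = 0.0      # no request seen yet

    def renew(self, now: float, duration: float) -> None:
        # The daemon asks for test frames for the next `duration` seconds.
        self.expires_at = now + duration

    def sending(self, now: float) -> bool:
        # Test frames go out only while the last request is still valid.
        return now < self.expires_at

    def ports_blocked(self, now: float) -> bool:
        # Simplification: while the manager is alive the ring port is blocked;
        # once the request has expired, the configured fallback decides.
        if self.sending(now):
            return True
        return self.fallback is Fallback.SAFE_MODE


def simulate(lease: MrpTestLease, x: int, daemon_dies_at: int, end: int):
    """Step one second at a time; the daemon renews every x-1 seconds (x >= 2)."""
    trace = []
    for t in range(end):
        if t < daemon_dies_at and t % (x - 1) == 0:
            lease.renew(t, x)
        trace.append((t, lease.sending(t), lease.ports_blocked(t)))
    return trace
```

With x=3 and a daemon that dies at t=5, the last renewal happens at t=4, so
test frames stop at t=7. From then on the ports are either unblocked
(DUMB_SWITCH) or stay blocked (SAFE_MODE), depending on the configuration.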