Re: [PATCH] arch/sparc: Measure receiver forward progress to avoid send mondo timeout

jane.chu@xxxxxxxxxx · Mon, 10 Jul 2017 10:23:32 -0700

On 07/03/2017 08:14 AM, David Miller wrote:

From: Steven Sistare <steven.sistare@xxxxxxxxxx>
Date: Mon, 3 Jul 2017 09:34:48 -0400

On 7/3/2017 5:50 AM, David Miller wrote:
From: Jane Chu <jane.chu@xxxxxxxxxx>
Date: Wed, 28 Jun 2017 15:02:26 -0600

   static void hypervisor_xcall_deliver(struct trap_per_cpu *tb, int cnt)
   {
-	int retries, this_cpu, prev_sent, i, saw_cpu_error;
+	int retries, this_cpu, prev_sent, i, rem;
+	uint16_t first_cpu = 0xffff;
+	unsigned long xc_rcvd = 0;
+	int usec_wait = cnt * 2;
   	unsigned long status;
+	int ecpuerror_id = 0;
+	int enocpu_id = 0;
   	u16 *cpu_list;
+	uint16_t cpu;

As you can see at the variable declarations around the ones you are
adding, "u16" is the appropriate type to use.  "uint16_t" is not.

So my concern about this patch is that in my mind, getting into a
state where a cpu is looping and doing nothing but handling mondos
is a bug.

That cpu is making no progress in its execution stream, and that's
problematic.

I'd rather we attack the issue that gets into this situation in the
first place.

It's because we don't optimize large amounts of page TLB flushes
properly.

Firstly, we don't have a way to pass the array of pages to flush.
That would cut down the mondos by orders of magnitude.

We also could have a cutoff where we do a full MM flush instead
of flushing individual pages.

I bet if you implemented these two things, it would not only
make the mondo timeouts go away, it will make cpus actually
make forward progress in their instruction stream rather than
looping like crazy processing mondos.

Thanks.
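
For readers following the thread, the two suggestions above (pass the
pages as one array, and fall back to a full MM flush past a cutoff) might
be sketched roughly as below.  This is a user-space illustration only, not
kernel code: plan_flush(), per_page_cross_calls(), the enum values, and the
cutoff parameter are all hypothetical names, and the thread does not
propose a specific cutoff value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcomes of planning a TLB flush; not kernel identifiers. */
enum flush_plan {
	FLUSH_NONE,		/* nothing to flush */
	FLUSH_PAGE_ARRAY,	/* send the whole page list in one cross call */
	FLUSH_WHOLE_MM,		/* too many pages: flush the whole address space */
};

struct flush_request {
	enum flush_plan plan;
	size_t cross_calls;	/* cross calls needed per target cpu */
};

/*
 * Decide how to flush nr_pages pages.  "cutoff" is an assumed tunable:
 * above it, a single full-MM flush replaces the per-page work.
 */
static struct flush_request plan_flush(size_t nr_pages, size_t cutoff)
{
	struct flush_request req = { FLUSH_NONE, 0 };

	assert(cutoff > 0);
	if (nr_pages == 0)
		return req;

	req.plan = nr_pages > cutoff ? FLUSH_WHOLE_MM : FLUSH_PAGE_ARRAY;
	req.cross_calls = 1;	/* either way, one mondo per target cpu */
	return req;
}

/* Baseline for comparison, assuming one cross call per page. */
static size_t per_page_cross_calls(size_t nr_pages)
{
	return nr_pages;
}
```

Under these assumptions, flushing 4096 pages costs one cross call per
target cpu instead of 4096, which is the kind of reduction the message
above is describing.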
There is room for improvement in the TLB flush algorithms, and it is
on our longer term list of things to do, as it will generally improve
performance of demap operations.  However, on another operating
system for sparc, we have a large set of algorithms to use
large pages extensively, batch translation shootdowns, transition
to demap-context and demap-all, and use hardware MMU-group demap
features, and it is still not enough to prevent mondo timeout panic
under stressful conditions on large systems using the "sender counts"
method of judging forward progress.  The "receiver counts" method
has proven to be a robust way of riding out mondo storms into calmer
waters without panicking the system, and greatly reduced the number
of bug reports from users due to mondo timeouts.  This is a valuable
feature that users appreciate.
Ok, please make the variable type changes I suggested and submit
that new version and I will think about this further.
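
As an illustration of the "receiver counts" idea discussed above, the
sketch below restarts the sender's stall clock whenever the receiver is
seen to have handled more mondos, and only gives up after a full budget
with no receiver progress at all.  It is a user-space simulation under
stated assumptions, not the patch itself: struct fake_receiver, tick_fn,
wait_for_delivery(), busy_tick() and stuck_tick() are hypothetical names
invented for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one target cpu; not a kernel structure. */
struct fake_receiver {
	unsigned long handled;	/* mondos this cpu has processed so far */
	bool delivered;		/* has our own mondo been accepted yet? */
};

/* Hypothetical callback that advances the simulation by one poll. */
typedef void (*tick_fn)(struct fake_receiver *rx, int tick);

/*
 * Poll until our mondo is delivered.  Returns the number of polls used,
 * or -1 if the receiver made no progress for "budget" polls in a row.
 */
static int wait_for_delivery(struct fake_receiver *rx, tick_fn tick, int budget)
{
	unsigned long last_handled = rx->handled;
	int stalled = 0;	/* consecutive polls with no receiver progress */
	int total = 0;

	assert(budget > 0);
	if (rx->delivered)
		return 0;

	for (;;) {
		tick(rx, total++);
		if (rx->delivered)
			return total;
		if (rx->handled != last_handled) {
			/* Receiver is busy but alive: restart the stall clock. */
			last_handled = rx->handled;
			stalled = 0;
		} else if (++stalled >= budget) {
			return -1;	/* no progress for a whole budget */
		}
	}
}

/* A receiver in a mondo storm: always progressing, accepts ours on poll 49. */
static void busy_tick(struct fake_receiver *rx, int tick)
{
	rx->handled++;
	if (tick == 49)
		rx->delivered = true;
}

/* A receiver that makes no progress at all. */
static void stuck_tick(struct fake_receiver *rx, int tick)
{
	(void)rx;
	(void)tick;
}
```

With a budget of 10 polls, the busy receiver is waited on for 50 polls and
delivery succeeds, whereas a sender that only counted its own retries would
have given up after 10.  The stuck receiver still times out, so this scheme
does not hide a cpu that is making no progress at all.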

Sorry for the delay in replying.  My colleague saw a mondo timeout while
performing PCIe hotplug tests.  The timeout came from a network thread
sending a mondo to a single cpu to raise a software interrupt.
Occasionally, the target cpu made no progress in 20 msec.  The reason, I
think, is that the target cpu was borrowed by the hypervisor to service
the correctable errors triggered by hotplug.  Since a strand in hypervisor
service cannot respond to a mondo, we need to increase the overall wait
time to a value that the hypervisor and the OS agree on.

I will revise the patch and fix the variable types.

thanks!
-jane
