On Friday 20 May 2016 16:24:53 Sricharan R wrote:
> While using the generic pagetable ops the tlb maintenance
> operation gets completed in the sync callback. So use writel_relaxed
> for all register access and add a mb() at appropriate places.
>
> Signed-off-by: Sricharan R <sricharan@xxxxxxxxxxxxxx>
> ---
>  drivers/iommu/msm_iommu.c         |  24 +++++++--
>  drivers/iommu/msm_iommu_hw-8xxx.h | 109 ++++++++++++++++++++------------------
>  2 files changed, 79 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/iommu/msm_iommu.c b/drivers/iommu/msm_iommu.c
> index 0299a37..dfcaeef 100644
> --- a/drivers/iommu/msm_iommu.c
> +++ b/drivers/iommu/msm_iommu.c
> @@ -124,6 +124,9 @@ static void msm_iommu_reset(void __iomem *base, int ncb)
>  		SET_TLBLKCR(base, ctx, 0);
>  		SET_CONTEXTIDR(base, ctx, 0);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }

Why do the above use the relaxed writes? Do you have any performance
numbers showing that skipping the sync for the reset makes a measurable
difference? How did you prove that skipping the sync before the write
is safe? How about resetting the iommu less often instead?

>  static void __flush_iotlb(void *cookie)
> @@ -141,6 +144,9 @@ static void __flush_iotlb(void *cookie)
>  	list_for_each_entry(master, &iommu->ctx_list, list)
>  		SET_CTX_TLBIALL(iommu->base, master->num, 0);
>
> +	/* To ensure completion of TLBIALL above */
> +	mb();
> +
>  	__disable_clocks(iommu);
>  }
>  fail:

__disable_clocks() takes spinlocks and accesses the hardware, which in
turn will also involve multiple syncs, so this seems like a premature
optimization. Have you tried just leaving the clocks enabled? I'd
suspect you could gain more performance that way.

> @@ -181,7 +187,8 @@ fail:
>
>  static void __flush_iotlb_sync(void *cookie)
>  {
> -	/* To avoid a null function pointer */
> +	/* To ensure completion of the TLBIVA in __flush_iotlb_range */
> +	mb();
>  }

I don't understand the specific race from the comment.
What operation comes after this that relies on __flush_iotlb_range
having completed, and how does an mb() guarantee this?

This seems to be a bug fix that is unrelated to the change to use
writel_relaxed(), so better split it out into a separate patch, with a
longer changeset description. Did you spot this race by running into
incorrect data, or by inspection?

>  static const struct iommu_gather_ops msm_iommu_gather_ops = {
> @@ -235,6 +242,9 @@ static void config_mids(struct msm_iommu_dev *iommu,
>  		/* Set security bit override to be Non-secure */
>  		SET_NSCFG(iommu->base, mid, 3);
>  	}
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>

This looks similar to the msm_iommu_reset() case. Maybe put those two
into one patch, or find a way to run config_mids() less often so it
doesn't show up in the profiles any more. If you hit this often enough
that it causes performance problems, it's more likely that there is a
bug in your code than that changing to relaxed accessors is a good
idea.

>  static void __reset_context(void __iomem *base, int ctx)
> @@ -257,6 +267,9 @@ static void __reset_context(void __iomem *base, int ctx)
>  	SET_TLBFLPTER(base, ctx, 0);
>  	SET_TLBSLPTER(base, ctx, 0);
>  	SET_TLBLKCR(base, ctx, 0);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static void __program_context(void __iomem *base, int ctx,
> @@ -305,6 +318,9 @@ static void __program_context(void __iomem *base, int ctx,
>
>  	/* Enable the MMU */
>  	SET_M(base, ctx, 1);
> +
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  }
>
>  static struct iommu_domain *msm_iommu_domain_alloc(unsigned type)

Maybe one more patch for these? I can see that
__reset_context/__program_context would be run often enough to make a
difference, though maybe not as much as the flushes.
Also, split this in two patches: one that adds a comment describing how
you proved that you don't need barriers before the writes, and another
one adding the barrier afterwards, with a description of the failure
scenario.

> @@ -500,7 +516,8 @@ static phys_addr_t msm_iommu_iova_to_phys(struct iommu_domain *domain,
>  	/* Invalidate context TLB */
>  	SET_CTX_TLBIALL(iommu->base, master->num, 0);
>  	SET_V2PPR(iommu->base, master->num, va & V2Pxx_VA);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	par = GET_PAR(iommu->base, master->num);
>
>  	/* We are dealing with a supersection */

In this case, I'd say it's better to rewrite the function to avoid the
read: iova_to_phys() should be fast, and not require hardware access.
Getting rid of the hardware access by using an in-memory cache for this
should gain more than just removing the barriers, as an MMIO read is
always slow.

> @@ -714,7 +731,8 @@ static int msm_iommu_probe(struct platform_device *pdev)
>  	par = GET_PAR(iommu->base, 0);
>  	SET_V2PCFG(iommu->base, 0, 0);
>  	SET_M(iommu->base, 0, 0);
> -
> +	/* Ensure completion of relaxed writes from the above SET macros */
> +	mb();
>  	if (!par) {

Probe() clearly isn't performance sensitive, so please back this out
and use the non-relaxed accessors.

> diff --git a/drivers/iommu/msm_iommu_hw-8xxx.h b/drivers/iommu/msm_iommu_hw-8xxx.h
> index fc16010..fe2c5ca 100644
> --- a/drivers/iommu/msm_iommu_hw-8xxx.h
> +++ b/drivers/iommu/msm_iommu_hw-8xxx.h
> @@ -24,13 +24,19 @@
>  #define GET_CTX_REG(reg, base, ctx) \
>  	(readl((base) + (reg) + ((ctx) << CTX_SHIFT)))
>
> -#define SET_GLOBAL_REG(reg, base, val)	writel((val), ((base) + (reg)))
> +/*
> + * The writes to the global/context registers needs to be synced only after
> + * all the configuration writes are done. So use relaxed accessors and
> + * a barrier at the end.
> + */
> +#define SET_GLOBAL_REG_RELAXED(reg, base, val) \
> +	writel_relaxed((val), ((base) + (reg)))
>
> -#define SET_CTX_REG(reg, base, ctx, val) \
> -	writel((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))
> +#define SET_CTX_REG_RELAXED(reg, base, ctx, val) \
> +	writel_relaxed((val), ((base) + (reg) + ((ctx) << CTX_SHIFT)))

Please add the relaxed accessors first in a separate patch, and then
change over one user at a time before you remove the non-relaxed ones.
It's hard to believe that all users are actually performance critical,
so start with the ones that gain the most performance.

As commented above, some of the conversions seem particularly fishy,
and it's likely that a slow reset() function should not be fixed by
making reset() faster but by calling it less often.

>  /* Context register setters/getters */
> -#define SET_SCTLR(b, c, v)	SET_CTX_REG(SCTLR, (b), (c), (v))
> -#define SET_ACTLR(b, c, v)	SET_CTX_REG(ACTLR, (b), (c), (v))
> -#define SET_CONTEXTIDR(b, c, v)	SET_CTX_REG(CONTEXTIDR, (b), (c), (v))
> -#define SET_TTBR0(b, c, v)	SET_CTX_REG(TTBR0, (b), (c), (v))
> -#define SET_TTBR1(b, c, v)	SET_CTX_REG(TTBR1, (b), (c), (v))
> +#define SET_SCTLR(b, c, v)	SET_CTX_REG_RELAXED(SCTLR, (b), (c), (v))
> +#define SET_ACTLR(b, c, v)	SET_CTX_REG_RELAXED(ACTLR, (b), (c), (v))
> +#define SET_CONTEXTIDR(b, c, v)	SET_CTX_REG_RELAXED(CONTEXTIDR, (b), (c), (v))
> +#define SET_TTBR0(b, c, v)	SET_CTX_REG_RELAXED(TTBR0, (b), (c), (v))
> +#define SET_TTBR1(b, c, v)	SET_CTX_REG_RELAXED(TTBR1, (b), (c), (v))
> +#define SET_TTBCR(b, c, v)	SET_CTX_REG_RELAXED(TTBCR, (b), (c), (v))
> +#define SET_PAR(b, c, v)	SET_CTX_REG_RELAXED(PAR, (b), (c), (v))

Can you just remove those macros, and open-code the function calls?
This is an unnecessary obfuscation that now also hides the fact that
you are using relaxed accessors.
	Arnd
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html