On 5/1/2019 8:49 AM, Michael Ellerman wrote: > Vakul Garg wrote: >> In function caam_jr_dequeue(), a full memory barrier is used before >> writing response job ring's register to signal removal of the completed >> job. Therefore for writing the register, we do not need another write >> memory barrier. Hence it is removed by replacing the call to wr_reg32() >> with a newly defined function wr_reg32_relaxed(). >> >> Signed-off-by: Vakul Garg <vakul.garg@xxxxxxx> >> --- >> drivers/crypto/caam/jr.c | 2 +- >> drivers/crypto/caam/regs.h | 8 ++++++++ >> 2 files changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/crypto/caam/jr.c b/drivers/crypto/caam/jr.c >> index 4e9b3fca5627..2ce6d7d2ad72 100644 >> --- a/drivers/crypto/caam/jr.c >> +++ b/drivers/crypto/caam/jr.c >> @@ -266,7 +266,7 @@ static void caam_jr_dequeue(unsigned long devarg) >> mb(); >> >> /* set done */ >> - wr_reg32(&jrp->rregs->outring_rmvd, 1); >> + wr_reg32_relaxed(&jrp->rregs->outring_rmvd, 1); >> >> jrp->out_ring_read_index = (jrp->out_ring_read_index + 1) & >> (JOBR_DEPTH - 1); >> diff --git a/drivers/crypto/caam/regs.h b/drivers/crypto/caam/regs.h >> index 3cd0822ea819..9e912c722e33 100644 >> --- a/drivers/crypto/caam/regs.h >> +++ b/drivers/crypto/caam/regs.h >> @@ -96,6 +96,14 @@ cpu_to_caam(16) >> cpu_to_caam(32) >> cpu_to_caam(64) >> >> +static inline void wr_reg32_relaxed(void __iomem *reg, u32 data) >> +{ >> + if (caam_little_end) >> + writel_relaxed(data, reg); >> + else >> + writel_relaxed(cpu_to_be32(data), reg); >> +} When both core (PPC) and crypto engine (caam) are big endian, data ends up being swapped - which is incorrect: writel_relaxed -> writel -> __do_writel -> out_le32 -> swap cpu_to_be32(data) -> data >> + >> static inline void wr_reg32(void __iomem *reg, u32 data) >> { >> if (caam_little_end) > > This crashes on my p5020ds. Did you test on powerpc? > > # first bad commit: [bbfcac5ff5f26aafa51935a62eb86b6eacfe8a49] crypto: caam/jr - Remove extra memory barrier during job ring dequeue Thanks for the report Michael. Any hint what would be the proper approach here - to have relaxed I/O accessors that would work both for ARM and PPC, and avoid ifdeffery etc.? For non-relaxed version, we used iowriteXX and iowriteXXbe - which work fine on ARM and PPC, covering all the endianness combinations (core + crypto engine): static inline void wr_reg32(void __iomem *reg, u32 data) { if (caam_little_end) iowrite32(data, reg); else iowrite32be(data, reg); } Thanks, Horia