Re: [PATCH v2 00/19] mtd: rawnand: cafe: Convert to exec_op() (and more)

On Sun, 10 May 2020 09:35:49 +0200
Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:

> On Sun, 10 May 2020 09:21:08 +0200
> Lubomir Rintel <lkundrak@xxxxx> wrote:
> 
> > On Sun, May 10, 2020 at 08:45:41AM +0200, Boris Brezillon wrote:  
> > > On Sun, 10 May 2020 08:31:05 +0200
> > > Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
> > >     
> > > > On Sat, 9 May 2020 22:28:55 +0200
> > > > Lubomir Rintel <lkundrak@xxxxx> wrote:
> > > >     
> > > > > On Sat, May 09, 2020 at 10:01:02PM +0200, Boris Brezillon wrote:      
> > > > > > On Sat, 9 May 2020 21:34:40 +0200
> > > > > > Lubomir Rintel <lkundrak@xxxxx> wrote:
> > > > > >         
> > > > > > > On Thu, May 07, 2020 at 10:12:57PM +0200, Boris Brezillon wrote:        
> > > > > > > > On Thu, 7 May 2020 15:47:08 +0200
> > > > > > > > Lubomir Rintel <lkundrak@xxxxx> wrote:
> > > > > > > >           
> > > > > > > > > On Wed, May 06, 2020 at 11:35:52PM +0200, Boris Brezillon wrote:          
> > > > > > > > > > On Wed, 6 May 2020 22:36:35 +0200
> > > > > > > > > > Lubomir Rintel <lkundrak@xxxxx> wrote:
> > > > > > > > > >             
> > > > > > > > > > > > We really should mask IRQs (AKA disable IRQs in my naming convention
> > > > > > > > > > > > :-)) here, unless we want to switch to interrupt-based waits (which
> > > > > > > > > > > > would be a good thing when we have DMA or WAIT_RDY involved). Having an
> > > > > > > > > > > > interrupt handler in the current implementation doesn't make any sense
> > > > > > > > > > > > (that's assuming the IRQ_STATUS bits are updated even if the interrupts
> > > > > > > > > > > > are disabled, which I'm not sure is a valid assumption in this case).
> > > > > > > > > > > 
> > > > > > > > > > > I have no idea why the interrupt handler is there. Perhaps some
> > > > > > > > > > > interrupts can't be masked and need an ack or something.            
> > > > > > > > > > 
> > > > > > > > > > Can you try to set NAND_IRQ_MASK to 0x0 and see if that still works.
> > > > > > > > > > Can you also check the number of NAND interrupts when set to 0x0? It's
> > > > > > > > > > hard to tell exactly what caused the interrupt handler to be called
> > > > > > > > > > since this is a shared interrupt.            
> > > > > > > > > 
> > > > > > > > > When it's set to 0, I get an interrupt with CAFE_NAND_IRQ=0x40000000
> > > > > > > > > (CAFE_NAND_IRQ_FLASH_RDY) right off the bat. That doesn't happen with
> > > > > > > > > a mask of 0xffffffff.
> > > > > > > > > 
> > > > > > > > > When changing the handler to always ack CAFE_NAND_IRQ_FLASH_RDY I've
> > > > > > > > > also seen CAFE_NAND_IRQ=0x80000000 (CAFE_NAND_IRQ_CMD_DONE) suggesting
> > > > > > > > > that other interrupts aren't masked either.
> > > > > > > > > 
> > > > > > > > > It seems that ones do indeed mask interrupts, but some just can't be
> > > > > > > > > masked (CAFE_NAND_IRQ_CMD_DONE or CAFE_NAND_IRQ_DMA_DONE), perhaps
> > > > > > > > > due to hardware bugs.
> > > > > > > > >           
> > > > > > > > 
> > > > > > > > I pushed a new version with some interrupt-related changes [1].
> > > > > > > > 
> > > > > > > > [1]https://github.com/bbrezillon/linux/commits/nand/cafe-nand-exec-op-debug          
> > > > > > > 
> > > > > > > Works with one fix:
> > > > > > > 
> > > > > > > diff --git a/drivers/mtd/nand/raw/cafe_nand.c b/drivers/mtd/nand/raw/cafe_nand.c
> > > > > > > index 591d79730961..e37737b7b089 100644
> > > > > > > --- a/drivers/mtd/nand/raw/cafe_nand.c
> > > > > > > +++ b/drivers/mtd/nand/raw/cafe_nand.c
> > > > > > > @@ -801,6 +801,7 @@ static int cafe_nand_probe(struct pci_dev *pdev,
> > > > > > >         if (!cafe)
> > > > > > >                 return  -ENOMEM;
> > > > > > >  
> > > > > > > +       init_completion(&cafe->complete);        
> > > > > > 
> > > > > > Oops, indeed.
> > > > > >         
> > > > > > >         mtd = nand_to_mtd(&cafe->nand);
> > > > > > >         mtd->dev.parent = &pdev->dev;
> > > > > > >         nand_set_controller_data(&cafe->nand, cafe);
> > > > > > > 
> > > > > > > However, the JFFS2 mount takes about twice as long as it did with
> > > > > > > the polling version:
> > > > > > 
> > > > > > Yes, that's not surprising. At the same time, using atomic-polling for
> > > > > > something that's expected to take hundreds of microseconds is not that
> > > > > > great. That means your CPU is not doing anything useful while you wait
> > > > > > for the read/write/erase operation to finish.        
> > > > > 
> > > > > Yes. But this really is too much of a slowdown:
> > > > > 
> > > > >   bash-5.0# time dd count=65536 bs=2k if=/dev/mtd0 of=/dev/null
> > > > >   65536+0 records in
> > > > >   65536+0 records out
> > > > >   
> > > > >   real    0m20.191s
> > > > >   user    0m0.346s
> > > > >   sys     0m10.366s
> > > > > 
> > > > > vs (previously):
> > > > >   
> > > > >   bash-5.0# time dd count=65536 bs=2k if=/dev/mtd0 of=/dev/null
> > > > >   65536+0 records in
> > > > >   65536+0 records out
> > > > >   
> > > > >   real    0m7.629s
> > > > >   user    0m0.010s
> > > > >   sys     0m7.500s
> > > > >   bash-5.0#      
> > > > 
> > > > Almost a factor of 3. I was definitely not expecting interrupt-based
> > > > waits to have such a huge impact on performance.
> > > >     
> > > > > 
> > > > > Note that your CPU can't be doing anything useful before the program and
> > > > > its data are loaded from the storage :)
> > > > 
> > > > Well, that's only true at mount time (and if you delay the mount after
> > > > the boot, your CPU might already have other things to do), but any
> > > > erase/write operations are likely to monopolize your CPU for no good
> > > > reason.
> > > >     
> > > > > 
> > > > > I suppose that if someone really prefers to avoid hogging the CPU at
> > > > > this cost, then it makes sense to add a knob (a module parameter or
> > > > > something) that would enable the interrupt-driven operation, but
> > > > > keep polling as a default.      
> > > > 
> > > > Let's not add more module params than we already have, it just
> > > > confuses users and deciding how to wait on HW events doesn't sound
> > > > like something they should be able to choose anyway (just like passing
> > > > the timing params, this should be calculated by the driver). Oh well,
> > > > I'll drop the patch adding interrupt-based waits. Having the driver
> > > > converted to exec_op() is more than enough :-).    
> > > 
> > > Just pushed a new version. If it works for you I'll send a v3.    
> > 
> > Thank you. That's b6b10b45dd9 in nand/cafe-nand-exec-op-debug of
> > https://github.com/bbrezillon/linux/ I suppose?
> > 
> > Without the readl_poll_timeout() -> readl_poll_timeout_atomic() change
> > it's still very slow.  
> 
> Should be fixed now.
> 
> > 
> > Also, commit f89355b6b6 ("mtd: rawnand: cafe: Return IRQ_HANDLED when
> > appropriate") looks somewhat suspicious to me. Previously it wrote the
> > pending interrupt bits back into CAFE_NAND_IRQ, now you're masking them
> > out in CAFE_NAND_IRQ_MASK (which already should be 0xffffffff) at this
> > point. Why?  
> 
> If interrupts are masked we don't need to clear them. We only clear
> them before executing an operation to start from a fresh state.
> 
> > I thought the write back to CAFE_NAND_IRQ serves to ack the
> > interrupts that came up but we don't handle elsewhere because we weren't
> > expecting them.  
> 
> If we reach the handler and all our irqs are masked, that means the irq
> was not for us, which is possible since the irq line is shared. We
> really should return IRQ_NONE in that case, and clearing pending
> interrupts is useless, since they are masked anyway. Since we read
> the interrupt status from exec_op(), I thought it'd be better to never
> clear any interrupt bits instead of clearing all bits but the CMD_DONE,
> DMA_DONE and FLASH_RDY.
> 
> > 
> > As you correctly pointed out; the source of the interrupts I'm seeing
> > could be something else than the CAFE chip -- the camera or the MMC
> > card. I'm not sure though; camera is certainly off and there shouldn't
> > be much going on about the MMC card. I'm testing with a init=/bin/bash
> > installation off a SD-card currently. I guess I can try switching to the
> > USB flash stick and disable the camera and MMC altogether.  
> 
> Okay, if that happens that would be a HW bug (or an interrupt coming
> from somewhere else, maybe PCI errors?). Can you print the values of
> CAFE_GLOBAL_IRQ and CAFE_GLOBAL_IRQ_MASK in your irq handler?

If you think that's less risky, I can drop "mtd: rawnand: cafe: Return
IRQ_HANDLED when appropriate" and go for your initial fix (avoid
clearing the FLASH_RDY interrupt). It just feels like the current
implementation is papering over a bug.
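
To make the behaviour under discussion concrete, here is a minimal userspace
sketch of the handler logic, not the actual driver code. It assumes, as
observed earlier in this thread, that a set bit in the mask register masks the
interrupt, and it reuses the two bit values reported above. The struct, the
enum and model_irq_handler() are hypothetical names made up for illustration.

```c
#include <stdint.h>

/* Bit values as reported earlier in this thread. */
#define CAFE_NAND_IRQ_CMD_DONE  0x80000000u
#define CAFE_NAND_IRQ_FLASH_RDY 0x40000000u

/* Userspace stand-ins for the kernel's IRQ_NONE / IRQ_HANDLED. */
enum model_irqreturn { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Hypothetical model of the two registers, not the driver's real state. */
struct model_regs {
	uint32_t irq;      /* pending status bits */
	uint32_t irq_mask; /* assumption: a set bit masks that interrupt */
};

static enum model_irqreturn model_irq_handler(struct model_regs *regs)
{
	/* Only unmasked, pending bits can have raised the shared line. */
	uint32_t pending = regs->irq & ~regs->irq_mask;

	/* Everything is masked: the interrupt was for another device. */
	if (!pending)
		return MODEL_IRQ_NONE;

	/*
	 * Mask what fired instead of clearing it, so the status bits stay
	 * readable by the code that polls them later.
	 */
	regs->irq_mask |= pending;
	return MODEL_IRQ_HANDLED;
}
```

With the mask at 0xffffffff the model returns MODEL_IRQ_NONE whatever the
status bits are, which is the shared-line case. With FLASH_RDY unmasked and
pending it returns MODEL_IRQ_HANDLED once, masks that bit, and leaves the
status bit set.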

______________________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/


