2009/11/13 Dan Williams <dan.j.williams@xxxxxxxxx>:
> Hi Hank,
>
> Thanks for testing.
>
> On Tue, Nov 10, 2009 at 4:44 AM, hank peng <pengxihan@xxxxxxxxx> wrote:
>> CPU is MPC8548, kernel version is 2.6.31.5, CONFIG_FSL_DMA and
>> CONFIG_ASYNC_TX_DMA options are all enabled.
>> #mdadm -C /dev/md0 --assume-clean -l5 -n3 /dev/sd{a,b,c}
>> #dd if=/dev/zero of=/dev/md0 bs=1M count=1000
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> MPC85xx CDS
>> Modules linked in:
>> NIP: c01c45d8 LR: c01c4d48 CTR: 00000000
>> REGS: c2dd5c80 TRAP: 0700 Not tainted (2.6.31.5)
>> MSR: 00029000 <EE,ME,CE> CR: 22004028 XER: 00000000
>> TASK = e820a580[3804] 'md0_raid5' THREAD: c2dd4000
>> GPR00: 00000001 c2dd5d30 e820a580 c2fb1088 00000001 00000000 00000002 00001000
>> GPR08: 00000001 c0485a20 00000000 ef8092f8 22002024 55555555 c2d67870 c0282d2c
>> GPR16: 00001000 e8355c00 c2eff964 00000000 00000000 00000019 01000040 c2dd5e00
>> GPR24: c2dd5dfc 00000001 c2dd5dc0 c099c420 00000000 c2d67838 00000002 c2dd5d58
>> NIP [c01c45d8] async_tx_quiesce+0x28/0x74
> [..]
>> I checked the kernel source code, and found that this OOPS was caused
>> by the following BUG_ON code:
>> It is in crypto/async_tx/async_tx.c:
>> void async_tx_quiesce(struct dma_async_tx_descriptor **tx)
>> {
>>         if (*tx) {
>>                 /* if ack is already set then we cannot be sure
>>                  * we are referring to the correct operation
>>                  */
>>                 BUG_ON(async_tx_test_ack(*tx));
>>                 /* OOPS occurred */
>
> Yes, this looks like a manifestation of the issue I brought up in my
> review of the driver [1]. The talitos_prep_dma_xor routine is always
> acknowledging its descriptors, which it should not because that is the
> responsibility of the client of the api. When the raid code tries to
> attach a memcpy that depends on the xor it sees that it needs to
> switch from talitos to fsldma (or software if fsldma is turned
> off).
> Since talitos does not have the DMA_INTERRUPT capability to
> trigger the channel switch we need to perform a synchronous wait for
> the xor to complete before submitting the memcpy. When the ack bit is
> already set the xor descriptor might be recycled by the dma device driver
> while we are waiting for it, hence the BUG_ON().
>
Thanks for the reply, Dan.
I forgot to mention that when this OOPS happened, I had not applied the
talitos XOR patch. I had only enabled the async_xx API and FSL_DMA, so I
think the XOR was done by the CPU and the memcpy was done by DMA through
the async_xx API.
Another interesting thing: I tried the latest stable kernel, 2.6.31.6,
and the problem did not occur. After I applied the talitos XOR patch, it
was OK too. I checked the related source code and it seems that nothing
changed, which leaves me very confused.
I have been testing the latest series of kernels with the XOR patch on an
MPC8548 board, and I hope the Freescale guys can also help.

> --
> Dan
>
> See the final comment:
> [1]: http://marc.info/?l=linux-raid&m=125685641412112&w=2
>

--
The simplest is not all best but the best is surely the simplest!
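
The ack handshake discussed in this thread can be sketched in user space.
What follows is a minimal model, not kernel code: every model_* name, the
struct layout, and the flag value are hypothetical, and the synchronous wait
is replaced by simply marking the descriptor complete. It assumes the rule
described above, namely that the client owns the ack and the driver may
recycle a descriptor only once it is both complete and acked.
model_quiesce() returns false at the point where the kernel routine would
hit BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a DMA descriptor; not the kernel's struct. */
enum { MODEL_CTRL_ACK = 1 << 0 };

struct model_desc {
	unsigned int flags;	/* MODEL_CTRL_ACK once the client is done with it */
	bool complete;		/* set when the "hardware" finishes the operation */
	bool in_use;		/* cleared when the driver recycles the descriptor */
};

static bool model_test_ack(const struct model_desc *d)
{
	return (d->flags & MODEL_CTRL_ACK) != 0;
}

/* Driver side: only a completed AND acked descriptor may be reused. */
static bool model_driver_try_recycle(struct model_desc *d)
{
	if (d->complete && model_test_ack(d)) {
		d->in_use = false;
		return true;
	}
	return false;
}

/*
 * Client side, following the shape of the quoted async_tx_quiesce().
 * Returns false where the kernel would BUG_ON: if the ack is already
 * set, the driver may have recycled the descriptor, so waiting on it
 * could mean waiting on some unrelated operation.
 */
static bool model_quiesce(struct model_desc **tx)
{
	if (*tx == NULL)
		return true;
	if (model_test_ack(*tx))
		return false;
	(*tx)->complete = true;			/* stand-in for the synchronous wait */
	(*tx)->flags |= MODEL_CTRL_ACK;		/* client acks only after the wait */
	*tx = NULL;
	return true;
}

/* Walks through the well-behaved case and the pre-acked case. */
static int model_demo(void)
{
	struct model_desc good = { 0, false, true };
	struct model_desc bad = { MODEL_CTRL_ACK, false, true };
	struct model_desc *tx = &good;

	assert(!model_driver_try_recycle(&good));	/* unacked: driver must keep it */
	assert(model_quiesce(&tx) && tx == NULL);
	assert(model_driver_try_recycle(&good));	/* acked and complete: reusable */

	tx = &bad;					/* driver acked on the client's behalf */
	assert(!model_quiesce(&tx));			/* the kernel would BUG_ON here */
	return 0;
}
```

Calling model_demo() exercises both paths. The second case corresponds to a
prep routine that acknowledges its own descriptors before the client has had
a chance to wait on them.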