Hi Ming,

On 02.07.2020 13:48, Ming Lei wrote:
> On Thu, Jul 02, 2020 at 12:19:08PM +0200, Marek Szyprowski wrote:
>> On 02.07.2020 11:23, Ming Lei wrote:
>>> On Thu, Jul 02, 2020 at 10:04:38AM +0200, Marek Szyprowski wrote:
>>>> On 02.07.2020 03:22, Ming Lei wrote:
>>>>> On Wed, Jul 01, 2020 at 04:16:32PM +0200, Marek Szyprowski wrote:
>>>>>> On 01.07.2020 15:45, Ming Lei wrote:
>>>>>>> On Wed, Jul 01, 2020 at 03:01:03PM +0200, Marek Szyprowski wrote:
>>>>>>>> On 29.06.2020 11:47, Ming Lei wrote:
>>>>>>>>> It is natural to release driver tag when this request is completed by
>>>>>>>>> LLD or device since its purpose is for LLD use.
>>>>>>>>>
>>>>>>>>> One big benefit is that the released tag can be re-used quicker since
>>>>>>>>> bio_endio() may take too long.
>>>>>>>>>
>>>>>>>>> Meantime we don't need to release driver tag for flush request.
>>>>>>>>>
>>>>>>>>> Cc: Christoph Hellwig <hch@xxxxxx>
>>>>>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
>>>>>>>> This patch landed recently in linux-next as commit 36a3df5a4574. Sadly
>>>>>>>> it causes a regression on one of my test systems (ARM 32bit, Samsung
>>>>>>>> Exynos5422 SoC based Odroid XU3 board with eMMC). The system boots fine
>>>>>>>> and then after a few seconds every executed command hangs. No
>>>>>>>> panic/ops/any other message. I will try to provide more information asap
>>>>>>>> I find something to share. Simple reverting it in linux-next is not
>>>>>>>> possible due to dependencies.
>>>>>>> What is the exact eMMC's driver code(include the host driver)?
>>>>>> dwmmc-exynos (drivers/mmc/host/dw_mmc-exynos.c)
>>>>> Hi,
>>>>>
>>>>> Just take a quick look at mmc code, there are only two req->tag
>>>>> consumers:
>>>>>
>>>>> 1) cqhci_tag
>>>>> cqhci_tag
>>>>> 	cqhci_request
>>>>> 		host->cqe_ops->cqe_request
>>>>> 			mmc_cqe_start_req
>>>>> 	cqhci_timeout
>>>>>
>>>>> 2) mmc_hsq_request
>>>>> mmc_hsq_request
>>>>> 	host->cqe_ops->cqe_request
>>>>> 		mmc_cqe_start_req
>>>>>
>>>>> mmc_cqe_start_req() is called before issuing this request to hardware,
>>>>> so completion won't happen when the tag is used in mmc_cqe_start_req().
>>>>>
>>>>> cqhci_timeout() may race with normal completion, however looks the
>>>>> following code can handle the race correctly:
>>>>>
>>>>> 	spin_lock_irqsave(&cq_host->lock, flags);
>>>>> 	timed_out = slot->mrq == mrq;
>>>>>
>>>>> So still no idea why the commit causes the trouble for mmc.
>>>>>
>>>>> Do you know it is cqhci or mmc_hsh which works for dw_mmc-exynos?
>>>>> And can you apply the following patch and see if warning can be
>>>>> triggered?
>>>>>
>>>>> diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
>>>>> index 75934f3c117e..2cb49ecfbf34 100644
>>>>> --- a/drivers/mmc/host/cqhci.c
>>>>> +++ b/drivers/mmc/host/cqhci.c
>>>>> @@ -612,6 +612,7 @@ static int cqhci_request(struct mmc_host *mmc, struct mmc_request *mrq)
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>>
>>>>> +	WARN_ON_ONCE(cq_host->slot[tag].mrq);
>>>>>  	cq_host->slot[tag].mrq = mrq;
>>>>>  	cq_host->slot[tag].flags = 0;
>>>>>
>>>>> diff --git a/drivers/mmc/host/mmc_hsq.c b/drivers/mmc/host/mmc_hsq.c
>>>>> index a5e05ed0fda3..11a4c1f3a970 100644
>>>>> --- a/drivers/mmc/host/mmc_hsq.c
>>>>> +++ b/drivers/mmc/host/mmc_hsq.c
>>>>> @@ -227,6 +227,7 @@ static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq)
>>>>>  		return -EBUSY;
>>>>>  	}
>>>>>
>>>>> +	WARN_ON_ONCE(hsq->slot[tag].mrq);
>>>>>  	hsq->slot[tag].mrq = mrq;
>>>>>
>>>>>  	/*
>>>> None of the above is even compiled for my system (I'm using
>>>> arm/exynos_defconfig), so this must be something else.
>>> Hello Marek,
>>>
>>> Or can you boot the system with one workable disk(usb, nand, ...)?
>>> then run some IO test on this eMMC, and collect debugfs log via the following
>>> command after the hang is triggered:
>>>
>>> (cd /sys/kernel/debug/block/$MMC && find . -type f -exec grep -aH . {} \;)
>>>
>>> $MMC is this mmc disk name.
>>
>> I hope it helps.
> It does help, :-)
>
> Thanks for collecting the log, now I understood the reason: flush
> request's driver tag is leaked in case that request isn't done via
> blk_mq_complete_request(), such as freed via blk_mq_end_request()
> directly.
>
> Please try the following patch, which should have been one two-line
> change if the driver tag cleanup patch isn't merged.

Yes, this fixes the issue on my test system! :)

Feel free to add:

Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
Tested-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>

to the final patch.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland