On 02.07.2020 11:23, Ming Lei wrote: > On Thu, Jul 02, 2020 at 10:04:38AM +0200, Marek Szyprowski wrote: >> On 02.07.2020 03:22, Ming Lei wrote: >>> On Wed, Jul 01, 2020 at 04:16:32PM +0200, Marek Szyprowski wrote: >>>> On 01.07.2020 15:45, Ming Lei wrote: >>>>> On Wed, Jul 01, 2020 at 03:01:03PM +0200, Marek Szyprowski wrote: >>>>>> On 29.06.2020 11:47, Ming Lei wrote: >>>>>>> It is natural to release driver tag when this request is completed by >>>>>>> LLD or device since its purpose is for LLD use. >>>>>>> >>>>>>> One big benefit is that the released tag can be re-used quicker since >>>>>>> bio_endio() may take too long. >>>>>>> >>>>>>> Meantime we don't need to release driver tag for flush request. >>>>>>> >>>>>>> Cc: Christoph Hellwig <hch@xxxxxx> >>>>>>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx> >>>>>> This patch landed recently in linux-next as commit 36a3df5a4574. Sadly >>>>>> it causes a regression on one of my test systems (ARM 32bit, Samsung >>>>>> Exynos5422 SoC based Odroid XU3 board with eMMC). The system boots fine >>>>>> and then after a few seconds every executed command hangs. No >>>>>> panic/ops/any other message. I will try to provide more information asap >>>>>> I find something to share. Simple reverting it in linux-next is not >>>>>> possible due to dependencies. >>>>> What is the exact eMMC's driver code(include the host driver)? >>>> dwmmc-exynos (drivers/mmc/host/dw_mmc-exynos.c) >>> Hi, >>> >>> Just take a quick look at mmc code, there are only two req->tag >>> consumers: >>> >>> 1) cqhci_tag >>> cqhci_tag >>> cqhci_request >>> host->cqe_ops->cqe_request >>> mmc_cqe_start_req >>> cqhci_timeout >>> >>> 2) mmc_hsq_request >>> mmc_hsq_request >>> host->cqe_ops->cqe_request >>> mmc_cqe_start_req >>> >>> mmc_cqe_start_req() is called before issuing this request to hardware, >>> so completion won't happen when the tag is used in mmc_cqe_start_req(). >>> >>> cqhci_timeout() may race with normal completion, however looks the >>> following code can handle the race correctly: >>> >>> spin_lock_irqsave(&cq_host->lock, flags); >>> timed_out = slot->mrq == mrq; >>> >>> So still no idea why the commit causes the trouble for mmc. >>> >>> Do you know it is cqhci or mmc_hsh which works for dw_mmc-exynos? >>> And can you apply the following patch and see if warning can be >>> triggered? >>> >>> diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c >>> index 75934f3c117e..2cb49ecfbf34 100644 >>> --- a/drivers/mmc/host/cqhci.c >>> +++ b/drivers/mmc/host/cqhci.c >>> @@ -612,6 +612,7 @@ static int cqhci_request(struct mmc_host *mmc, struct mmc_request *mrq) >>> goto out_unlock; >>> } >>> >>> + WARN_ON_ONCE(cq_host->slot[tag].mrq); >>> cq_host->slot[tag].mrq = mrq; >>> cq_host->slot[tag].flags = 0; >>> >>> diff --git a/drivers/mmc/host/mmc_hsq.c b/drivers/mmc/host/mmc_hsq.c >>> index a5e05ed0fda3..11a4c1f3a970 100644 >>> --- a/drivers/mmc/host/mmc_hsq.c >>> +++ b/drivers/mmc/host/mmc_hsq.c >>> @@ -227,6 +227,7 @@ static int mmc_hsq_request(struct mmc_host *mmc, struct mmc_request *mrq) >>> return -EBUSY; >>> } >>> >>> + WARN_ON_ONCE(hsq->slot[tag].mrq); >>> hsq->slot[tag].mrq = mrq; >>> >>> /* >> None of the above is even compiled for my system (I'm using >> arm/exynos_defconfig), so this must be something else. > Hello Marek, > > Or can you boot the system with one workable disk(usb, nand, ...)? > then run some IO test on this eMMC, and collect debugfs log via the following > command after the hang is triggered: > > (cd /sys/kernel/debug/block/$MMC && find . -type f -exec grep -aH . {} \;) > > $MMC is this mmc disk name. I have no physical access to this system now, but I managed to run a screen session just after the boot to have more than one shell. Here is the dump when the other shell locks: # (cd /sys/kernel/debug/block/mmcblk0 && find . -type f -exec grep -aH . {} \;) ./hctx0/cpu7/completed:372 0 ./hctx0/cpu7/merged:0 ./hctx0/cpu7/dispatched:372 0 ./hctx0/cpu6/completed:385 0 ./hctx0/cpu6/merged:0 ./hctx0/cpu6/dispatched:385 0 ./hctx0/cpu5/completed:525 0 ./hctx0/cpu5/merged:0 ./hctx0/cpu5/dispatched:526 0 ./hctx0/cpu4/completed:707 0 ./hctx0/cpu4/merged:0 ./hctx0/cpu4/dispatched:708 0 ./hctx0/cpu3/completed:97 1 ./hctx0/cpu3/merged:0 ./hctx0/cpu3/dispatched:97 2 ./hctx0/cpu2/completed:180 0 ./hctx0/cpu2/merged:0 ./hctx0/cpu2/dispatched:181 0 ./hctx0/cpu1/completed:257 0 ./hctx0/cpu1/merged:0 ./hctx0/cpu1/dispatched:257 2 ./hctx0/cpu0/completed:45 0 ./hctx0/cpu0/merged:0 ./hctx0/cpu0/dispatched:45 0 ./hctx0/type:default ./hctx0/dispatch_busy:0 ./hctx0/active:0 ./hctx0/run:2126 ./hctx0/queued:2575 ./hctx0/dispatched: 0 6 ./hctx0/dispatched: 1 2050 ./hctx0/dispatched: 2 150 ./hctx0/dispatched: 4 11 ./hctx0/dispatched: 8 1 ./hctx0/dispatched: 16 1 ./hctx0/dispatched: 32+ 1 ./hctx0/io_poll:considered=0 ./hctx0/io_poll:invoked=0 ./hctx0/io_poll:success=0 ./hctx0/sched_tags_bitmap:00000000: 0020 0000 1000 0000 0000 7000 0800 0000 ./hctx0/sched_tags:nr_tags=128 ./hctx0/sched_tags:nr_reserved_tags=0 ./hctx0/sched_tags:active_queues=0 ./hctx0/sched_tags:bitmap_tags: ./hctx0/sched_tags:depth=128 ./hctx0/sched_tags:busy=6 ./hctx0/sched_tags:cleared=40 ./hctx0/sched_tags:bits_per_word=32 ./hctx0/sched_tags:map_nr=4 ./hctx0/sched_tags:alloc_hint={126, 78, 98, 35, 12, 83, 11, 21} ./hctx0/sched_tags:wake_batch=8 ./hctx0/sched_tags:wake_index=0 ./hctx0/sched_tags:ws_active=0 ./hctx0/sched_tags:ws={ ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/sched_tags:} ./hctx0/sched_tags:round_robin=0 ./hctx0/sched_tags:min_shallow_depth=4294967295 ./hctx0/tags_bitmap:00000000: ffff ffff ffff ffff ./hctx0/tags:nr_tags=64 ./hctx0/tags:nr_reserved_tags=0 ./hctx0/tags:active_queues=0 ./hctx0/tags:bitmap_tags: ./hctx0/tags:depth=64 ./hctx0/tags:busy=64 ./hctx0/tags:cleared=0 ./hctx0/tags:bits_per_word=16 ./hctx0/tags:map_nr=4 ./hctx0/tags:alloc_hint={14, 0, 0, 0, 0, 0, 13, 22} ./hctx0/tags:wake_batch=8 ./hctx0/tags:wake_index=0 ./hctx0/tags:ws_active=0 ./hctx0/tags:ws={ ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags: {.wait_cnt=8, .wait=inactive}, ./hctx0/tags:} ./hctx0/tags:round_robin=0 ./hctx0/tags:min_shallow_depth=4294967295 ./hctx0/ctx_map:00000000: 00 ./hctx0/dispatch:35e4bf48 {.op=FLUSH, .cmd_flags=PREFLUSH, .rq_flags=FLUSH_SEQ, .state=idle, .tag=-1, .internal_tag=84} ./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|BLOCKING ./hctx0/state:SCHED_RESTART ./sched/starved:0 ./sched/batching:1 ./sched/write_fifo_list:ab51b2cc {.op=WRITE, .cmd_flags=META|PRIO|BACKGROUND, .rq_flags=ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=85} ./sched/write_fifo_list:ac854e48 {.op=WRITE, .cmd_flags=META|PRIO|BACKGROUND, .rq_flags=ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=86} ./sched/write_fifo_list:b12e714a {.op=WRITE, .cmd_flags=, .rq_flags=ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=36} ./sched/write_fifo_list:99383a87 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=ELVPRIV|IO_STAT|HASHED, .state=idle, .tag=-1, .internal_tag=99} ./sched/read_fifo_list:29365a9b {.op=READ, .cmd_flags=FAILFAST_DEV|FAILFAST_TRAag=-1, .internal_tag=13}AHEAD, .rq_flags=ELVPRIV|IO_STAT|HASHED, .state=idle, .ta ./write_hints:hint0: 0 ./write_hints:hint1: 0 ./write_hints:hint2: 0 ./write_hints:hint3: 0 ./write_hints:hint4: 0 ./state:SAME_COMP|NONROT|IO_STAT|DISCARD|SECERASE|INIT_DONE|WC|FUA|REGISTERED ./pm_only:0 ./poll_stat:read (512 Bytes): samples=0 ./poll_stat:write (512 Bytes): samples=0 ./poll_stat:read (1024 Bytes): samples=0 ./poll_stat:write (1024 Bytes): samples=0 ./poll_stat:read (2048 Bytes): samples=0 ./poll_stat:write (2048 Bytes): samples=0 ./poll_stat:read (4096 Bytes): samples=0 ./poll_stat:write (4096 Bytes): samples=0 ./poll_stat:read (8192 Bytes): samples=0 ./poll_stat:write (8192 Bytes): samples=0 ./poll_stat:read (16384 Bytes): samples=0 ./poll_stat:write (16384 Bytes): samples=0 ./poll_stat:read (32768 Bytes): samples=0 ./poll_stat:write (32768 Bytes): samples=0 ./poll_stat:read (65536 Bytes): samples=0 ./poll_stat:write (65536 Bytes): samples=0 root@target:~# I hope it helps. Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland