Gabriel Krisman Bertazi <krisman@xxxxxxxxxxxxxxxxxx> writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do nothing more than
> write to sysfs. but I just reproduced it with the script below.
>
> Note that this is ppc64le. I don't have a x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck reproducing it?  We were initially reproducing with a
proprietary stress test, but I tried a generated fio jobfile together
with the SMT script I shared earlier, and I could reproduce the crash
consistently in less than 10 minutes of execution.  This was still
ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy IO load combined
with CPU addition/removal.

Let me share my progress from the last couple of days in the hope that
it rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in
the freeze/unfreeze mechanism.  If a request were escaping and going
through the block layer during a freeze, we'd see freeze_depth >= 1.
Before that, I had also tried keeping the q_usage_counter in atomic
mode, in case of a bug in the percpu refcount.  No luck, the BUG_ON was
still hit.

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least the last time I
reproduced it, it was a READ that came from the stress test task
through submit_bio.  So nothing remarkable about it either, as far as I
can see.
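
To make the freeze_depth argument above concrete, here is a minimal
user-space sketch of the reasoning.  It is a toy model, not kernel code:
every name in it (toy_queue, toy_freeze, toy_enter, and so on) is
hypothetical, and the kernel's actual mechanism, built around
q_usage_counter, is replaced by a plain integer check.  It only
illustrates why seeing depth 0 at the crash site argues against a
request slipping through during a freeze.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model. Every name below is hypothetical, not a kernel API. */
struct toy_queue {
	int freeze_depth;	/* outstanding freeze calls */
	int in_flight;		/* requests that entered and have not completed */
};

/* Start a freeze: from now on toy_enter() refuses new requests. */
static void toy_freeze(struct toy_queue *q)
{
	q->freeze_depth++;
}

/* Drop one freeze level; requests are accepted again at depth 0. */
static void toy_unfreeze(struct toy_queue *q)
{
	assert(q->freeze_depth > 0);
	q->freeze_depth--;
}

/* Normal submission path: a request gets in only if the queue is not frozen. */
static bool toy_enter(struct toy_queue *q)
{
	if (q->freeze_depth > 0)
		return false;
	q->in_flight++;
	return true;
}

/* Complete one previously entered request. */
static void toy_complete(struct toy_queue *q)
{
	assert(q->in_flight > 0);
	q->in_flight--;
}

/*
 * The check made at the crash site: was the queue frozen when the
 * request was dispatched?  A true result would point at the freeze
 * mechanism; a false result (depth 0) points elsewhere.
 */
static bool toy_dispatched_while_frozen(const struct toy_queue *q)
{
	return q->freeze_depth >= 1;
}
```

In this model a request that went through toy_enter() is always
dispatched with depth 0, so the only way to observe depth >= 1 at
dispatch is for a request to bypass the entry check.  That is the case
the observation above rules out.
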

I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where
the ctx got remapped to another hctx.  I'm still learning my way
through the cpumap code, so I'm not sure it's a real possibility, but
I'm not convinced it isn't.  Some preliminary tests don't suggest
that's what is happening here, but I want to spend a little more time
on this theory (maybe for lack of better ideas :)

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since the
CPU is dead, no more requests would be coming through its ctx.  I think
we could force a queue run in blk_mq_queue_reinit_notify, before
remapping, which would cause the hctx to fetch the remaining requests
from that dead ctx (since it's not unmapped yet).  This way, we could
maintain a single hotplug notification hook and simplify the hotplug
path.  I haven't written code for it yet, but I'll see if I can come up
with something and send it to the list.

-- 
Gabriel Krisman Bertazi