On Fri, Dec 01, 2023 at 04:43:40PM -0400, Jason Gunthorpe wrote: > On Fri, Dec 01, 2023 at 11:58:07AM -0800, Nicolin Chen wrote: > > > It seems there is not a simple way to realize this error back to > > > userspace since we can't block the global command queue and we proceed > > > to complete commands that the real HW would not have completed. > > > > > > To actually emulate this the gerror handler would have to capture all > > > the necessary registers, return them back to the thread doing > > > invalidate_user and all of that would have to return back to userspace > > > to go into the virtual version of all the same registers. > > > > > > Yes, it can be synchronous it seems, but we don't have any > > > infrastructure in the driver to do this. > > > > > > Given this is pretty niche maybe we just don't support error > > > forwarding and simply ensure it could be added to the uapi later? > > > > If arm_smmu_cmdq_issue_cmdlist in arm_smmu_cache_invalidate_user > > fails with ETIMEOUT, we polls the CONS register to get the error > > code. This can cover CERROR_ABT and CERROR_ATC_INV. > > Why is timeout linked to these two? Or rather, it doesn't have to be > linked like that. Any gerror is effectively synchronous because it > halts the queue and allows SW time to inspect which command failed and > record the gerror flags. So each and every command can get an error > indication. > > Restarting the queue is done by putting sync in there to effectively > nop the failed command and we hope for the best and let it rip. I see that SMMU driver only restarts the queue when dealing with CERROR_ILL. So only CERROR_ABT or CERROR_ATC_INV would result in -ETIMEOUT. > > As you remarked that we can't block the global CMDQ, so we have > > to let a real CERROR_ILL go. Yet, we can make sure commands to > > be fully sanitized before being issued, as we should immediately > > reject faulty commands anyway, for errors such as unsupported op > > codes, unzero-ed reserved fields, and unlinked vSIDs. This can > > at least largely reduce the probability of a real CERROR_ILL. > > I'm more a little more concerend with ATC_INV as a malfunctioning > device can trigger this.. How about making sure that the invalidate handler always issues one CMD_ATC_INV at a time, so each arm_smmu_cmdq_issue_cmdlist() call has a chance to timeout? Then, we can simply know which one in the user array fails. > > So, combining these two, we can still have a basic synchronous > > way by returning an errno to the invalidate ioctl? I see Kevin > > replied something similar too. > > It isn't enough information, you don't know which gerror bits to set > and you don't know what cons index to stick to indicate the error > triggering command with just a simple errno. > > It does need to return a bunch of data to get it all right. The array structure returns req_num to indicate the index. This works, even if the command consumption stops in the middle: * @req_num: Input the number of cache invalidation requests in the array. * Output the number of requests successfully handled by kernel. So we only need an error code of CERROR_ABT/ILL/ATC_INV. Or am I missing some point here? > Though again, there is no driver infrastructure to do all this and it > doesn't seem that important so maybe we can just ensure there is a > future extension possiblity and have userspace understand an errno > means to generate CERROR_ILL on the last command in the batch? Yea, it certainly can be a plan. Thanks Nicolin