On 2024/12/13 20:49, Jason Gunthorpe wrote:
> On Fri, Dec 13, 2024 at 05:37:58PM +0800, Junxian Huang wrote:
>>> But your reset flow partially disassociates the device, when the
>>> userspace goes back to sleep, or rearms the CQ, it should get a hard
>>> fail and do a full cleanup without relying on flushing.
>>
>> Not sure if I got your point, when you said "the userspace goes back to sleep",
>> did you mean the ibv_get_async_event() api? Are you suggesting that userspace
>> should call ibv_get_async_event() to monitor async events, and when it gets a
>> fatal event, it should stop polling CQs and clean up everything instead of
>> still waiting for the remaining CQEs?
>
> Yes, it should do that as well. This is what the device fatal event is
> for.
>
> I'm also saying that any kernel system calls, like sleeping for CQ
> events should start failing too.
>
> Jason

Thanks. I took a cursory look at some open-source userspace projects:
UCX and SPDK handle the device fatal event properly by doing cleanup,
but Ceph doesn't seem to have any special handling beyond logging.

Junxian
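
For reference, here is a rough sketch of the userspace behaviour discussed
above: on a device fatal event, or when a CQ wait/rearm call starts failing,
stop polling and clean up instead of waiting for the remaining CQEs. This is
not taken from UCX, SPDK or Ceph. All names prefixed with sketch_ are
hypothetical, and the enum is a local stand-in so the sketch builds without
libibverbs. In a real program the event would come from
ibv_get_async_event(), the fatal case would be IBV_EVENT_DEVICE_FATAL, and
the return codes would come from ibv_get_cq_event() or ibv_req_notify_cq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local stand-in for the async event type (assumption for this sketch).
 * With libibverbs this would be ibv_async_event.event_type. */
enum sketch_event {
    SKETCH_EVENT_PORT_ERR,     /* some non-fatal event */
    SKETCH_EVENT_DEVICE_FATAL  /* stands in for IBV_EVENT_DEVICE_FATAL */
};

/* Hypothetical per-device application state. */
struct sketch_dev {
    bool polling;         /* application is still polling its CQs */
    bool torn_down;       /* resources have been released */
    int  outstanding_wrs; /* work requests posted but not yet completed */
    int  failed_wrs;      /* work requests failed locally during teardown */
};

/* Full cleanup that does not rely on flush CQEs arriving: every
 * outstanding work request is failed locally, then polling stops. */
static void sketch_teardown(struct sketch_dev *dev)
{
    assert(dev != NULL);
    dev->failed_wrs += dev->outstanding_wrs;
    dev->outstanding_wrs = 0;
    dev->polling = false;
    dev->torn_down = true;
}

/* Handle one async event. Returns true if the application should keep
 * running, false once the device has been torn down. */
static bool sketch_on_async_event(struct sketch_dev *dev, enum sketch_event ev)
{
    if (ev == SKETCH_EVENT_DEVICE_FATAL) {
        sketch_teardown(dev);
        return false;
    }
    return true; /* non-fatal events are left to the caller */
}

/* Handle the return code of a CQ wait or rearm call. A nonzero code is
 * treated as a hard failure and triggers the same cleanup. */
static bool sketch_on_cq_call_result(struct sketch_dev *dev, int rc)
{
    if (rc != 0) {
        sketch_teardown(dev);
        return false;
    }
    return true;
}
```

The point of routing both paths into the same teardown is that the
application reaches a clean state whether it notices the reset through the
async event or through a failing system call, whichever happens first.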