Hi,

When we create 2 QPs, each with a separate context, the QPs are naturally assigned to different bfregs on different UAR pages. When we create 2 QPs within the same context, the QPs are assigned to different bfregs but on the same UAR page. The first 2 QPs are assigned to low_lat_uuars, so no locks are taken while writing to the different bfregs. However, the Mellanox PRM states that doorbells to the same UAR page must be serialized.

I see the serialization effect in the attached graph when I post 1 WQE per ibv_post_send (multi-threaded, 1 thread per QP, using MOFED). But I fail to understand how this serialization is enforced in the 1-context case. The only synchronization mechanism I see is the sfence barrier. The sfence is imperative in the multiple-threads-per-QP case, where the order of doorbells needs to be preserved. But how does this sfence synchronize writes to different bfregs of the same UAR?

Since the message size is 2 bytes, each of the 2 QPs' MMIO writes is only 64 bytes. My understanding is that the size of a write-combining buffer is 64 bytes. How many WC buffers are there per UAR page?

Here is the doorbell-ringing code from MOFED-4.1:

    case MLX5_DB_METHOD_DEDIC_BF:
        /* The QP has dedicated blue-flame */

        /*
         * Make sure that descriptors are written before
         * updating doorbell record and ringing the doorbell
         */
        wmb();
        qp->gen_data.db[MLX5_SND_DBR] = htonl(curr_post);

        /* This wc_wmb ensures ordering between DB record and BF copy */
        wc_wmb();
        if (size <= bf->buf_size / 64)
            mlx5_bf_copy(bf->reg + bf->offset, seg, size * 64, qp);
        else
            mlx5_write_db(bf->reg + bf->offset, seg);

        /*
         * use wc_wmb to ensure write combining buffers are flushed out
         * of the running CPU. This must be carried inside the spinlock.
         * Otherwise, there is a potential race. In the race, CPU A
         * writes doorbell 1, which is waiting in the WC buffer. CPU B
         * writes doorbell 2, and its write is flushed earlier. Since
         * the wc_wmb is CPU local, this will result in the HCA seeing
         * doorbell 2, followed by doorbell 1.
         */
        wc_wmb();
        bf->offset ^= bf->buf_size;
        break;

Thanks,
Rohit
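P.S. To check my reading of the buffer arithmetic in the snippet, here is a minimal user-space sketch of just the size check and the offset toggle. It is not the MOFED code: struct toy_bf, toy_ring() and WQEBB_BYTES are hypothetical names, the "register" is ordinary memory rather than a mapped UAR page, all barriers are left out, and I am assuming that the mlx5_write_db() path amounts to a single 8-byte write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes per unit of `size`, matching the `size * 64` in the snippet. */
#define WQEBB_BYTES 64u

/* Hypothetical stand-in for the driver's bfreg state. */
struct toy_bf {
    uint8_t *reg;       /* base of the bfreg (plain memory in this sketch) */
    unsigned offset;    /* byte offset of the half to write next */
    unsigned buf_size;  /* bytes in each half of the bfreg */
};

enum toy_path { TOY_BF_COPY, TOY_DB_ONLY };

/*
 * Mimics only the branch and the toggle from the snippet:
 *  - if the WQE fits in one half, copy the whole WQE (BlueFlame path);
 *  - otherwise write only the first 8 bytes (assumed doorbell path);
 *  - then flip to the other half with an XOR.
 * No wmb()/wc_wmb() here, so this says nothing about ordering.
 */
static enum toy_path toy_ring(struct toy_bf *bf, const void *seg,
                              unsigned size, unsigned *used_offset)
{
    enum toy_path path;

    assert(bf->offset == 0 || bf->offset == bf->buf_size);
    *used_offset = bf->offset;

    if (size <= bf->buf_size / WQEBB_BYTES) {
        memcpy(bf->reg + bf->offset, seg, (size_t)size * WQEBB_BYTES);
        path = TOY_BF_COPY;
    } else {
        memcpy(bf->reg + bf->offset, seg, 8);
        path = TOY_DB_ONLY;
    }

    bf->offset ^= bf->buf_size;  /* 0 -> buf_size -> 0 -> ... */
    return path;
}
```

With buf_size = 256 and offset starting at 0, two consecutive 1-unit posts on the same QP land at offsets 0 and 256 and each copies exactly 64 bytes, which is where my "only 64 bytes per MMIO write" figure comes from. The sketch covers a single bfreg only; it does not model how writes from two QPs on different bfregs of the same UAR page get serialized, which is the part I am asking about.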
Attachment: sharedUAR.png (PNG image)