Why is wc_wmb() not sufficient? It flushes all WC buffers in the CPU core
(around 10 per core).

On Sun, Mar 11, 2018 at 9:07 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
> Hi,
>
> When we create 2 QPs, each with a separate context, the QPs are
> naturally assigned to different bfregs on different UAR pages. When we
> create 2 QPs within the same context, the QPs are assigned to
> different bfregs but on the same UAR page. The first 2 QPs are
> assigned to low_lat_uuars, so no locks are taken while writing to
> the different bfregs. However, the Mellanox PRM states that doorbells
> to the same UAR page must be serialized. I see the serialization
> effect in the attached graph when I post 1 WQE per ibv_post_send
> (multi-threaded, 1 thread per QP, using MOFED). But I'm failing to
> understand how this serialization is enforced in the 1-context case:
>
> The only synchronization mechanism I see is the sfence barrier. The
> sfence is imperative in the multiple-threads-per-QP case, when the
> order of doorbells needs to be preserved. But how does this sfence
> synchronize writes to different bfregs of the same UAR? Since the
> message size is 2 bytes, each of the 2 QPs' MMIO writes is only 64
> bytes. My understanding is that the size of a write-combining buffer
> is 64 bytes. How many WC buffers are there per UAR page?
>
> Here is the doorbell-ringing code from MOFED-4.1:
>
> case MLX5_DB_METHOD_DEDIC_BF:
>         /* The QP has a dedicated blue-flame register */
>
>         /*
>          * Make sure that descriptors are written before
>          * updating the doorbell record and ringing the doorbell
>          */
>         wmb();
>         qp->gen_data.db[MLX5_SND_DBR] = htonl(curr_post);
>
>         /* This wc_wmb ensures ordering between DB record and BF copy */
>         wc_wmb();
>         if (size <= bf->buf_size / 64)
>                 mlx5_bf_copy(bf->reg + bf->offset, seg,
>                              size * 64, qp);
>         else
>                 mlx5_write_db(bf->reg + bf->offset, seg);
>         /*
>          * Use wc_wmb to ensure the write-combining buffers are flushed
>          * out of the running CPU. This must be carried out inside the
>          * spinlock. Otherwise, there is a potential race: CPU A writes
>          * doorbell 1, which is waiting in the WC buffer; CPU B writes
>          * doorbell 2, and its write is flushed earlier. Since the
>          * wc_wmb is CPU-local, this will result in the HCA seeing
>          * doorbell 2, followed by doorbell 1.
>          */
>         wc_wmb();
>         bf->offset ^= bf->buf_size;
>         break;
>
> Thanks,
> Rohit
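
For reference, my reading of the libmlx5/MOFED headers is that wc_wmb() on
x86-64 is a single sfence (paraphrased below; the exact macro may differ
between releases). An sfence drains every write-combining buffer of the
issuing core, so every pending 64-byte doorbell that core has buffered goes
out before the barrier completes, not just the one for this bfreg:

    /*
     * Sketch of the x86-64 case, paraphrased from the libmlx5 headers;
     * the exact definition may differ between releases. sfence forces
     * all of the issuing core's write-combining buffers to be flushed
     * before any later store becomes globally visible.
     */
    #if defined(__x86_64__)
    #define wc_wmb() asm volatile("sfence" ::: "memory")
    #endif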