Why is wc_wmb() not sufficient? It flushes all WC buffers in the CPU core
(around 10 per core).

On Sun, Mar 11, 2018 at 9:07 PM, Rohit Zambre <rzambre@xxxxxxx> wrote:
> Hi,
>
> When we create 2 QPs, each with a separate context, the QPs are
> naturally assigned to different bfregs on different UAR pages. When we
> create 2 QPs within the same context, the QPs are assigned to
> different bfregs but on the same UAR page. The first 2 QPs are
> assigned to low_lat_uuars, so no locks are taken while writing to
> the different bfregs. However, the Mellanox PRM states that doorbells
> to the same UAR page must be serialized. I see the serialization
> effect in the attached graph when I post 1 WQE per ibv_post_send
> (multi-threaded, 1 thread per QP, using MOFED). But I'm failing to
> understand how this serialization is enforced in the 1-context case:
>
> The only synchronization mechanism I see is the sfence barrier. The
> sfence is imperative in the multiple-threads-per-QP case, when the
> order of doorbells needs to be preserved. But how does this sfence
> synchronize writes to different bfregs of the same UAR? Since the
> message size is 2 bytes, each of the 2 QPs' MMIO writes is only 64
> bytes. My understanding is that the size of a write-combining buffer
> is 64 bytes. How many WC buffers are there per UAR page?
>
> Here is the doorbell-ringing code from MOFED-4.1:
>
> case MLX5_DB_METHOD_DEDIC_BF:
>         /* The QP has a dedicated blue-flame register */
>
>         /*
>          * Make sure that descriptors are written before
>          * updating the doorbell record and ringing the doorbell
>          */
>         wmb();
>         qp->gen_data.db[MLX5_SND_DBR] = htonl(curr_post);
>
>         /* This wc_wmb ensures ordering between DB record and BF copy */
>         wc_wmb();
>         if (size <= bf->buf_size / 64)
>                 mlx5_bf_copy(bf->reg + bf->offset, seg,
>                              size * 64, qp);
>         else
>                 mlx5_write_db(bf->reg + bf->offset, seg);
>         /*
>          * Use wc_wmb to ensure the write-combining buffers are flushed
>          * out of the running CPU. This must be carried out inside the
>          * spinlock. Otherwise, there is a potential race: CPU A writes
>          * doorbell 1, which is waiting in the WC buffer; CPU B writes
>          * doorbell 2, and its write is flushed earlier. Since the
>          * wc_wmb is CPU-local, this will result in the HCA seeing
>          * doorbell 2, followed by doorbell 1.
>          */
>         wc_wmb();
>         bf->offset ^= bf->buf_size;
>         break;
>
> Thanks,
> Rohit
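
For reference, my reading of the libmlx5/MOFED headers is that wc_wmb() on
x86-64 is a single sfence (paraphrased below; the exact macro may differ
between releases). An sfence drains every write-combining buffer of the
issuing core, so every pending 64-byte doorbell that core has buffered goes
out before the barrier completes, not just the one for this bfreg:

    /*
     * Sketch of the x86-64 case, paraphrased from the libmlx5 headers;
     * the exact definition may differ between releases. sfence forces
     * all of the issuing core's write-combining buffers to be flushed
     * before any later store becomes globally visible.
     */
    #if defined(__x86_64__)
    #define wc_wmb() asm volatile("sfence" ::: "memory")
    #endif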