What synchronizes MMIO writes on a shared UAR?

Hi,

When we create two QPs, each with a separate context, the QPs are
naturally assigned to different bfregs on different UAR pages. When we
create two QPs within the same context, the QPs are assigned to
different bfregs, but on the same UAR page. The first two QPs are
assigned to low_lat_uuars, so no lock is taken while writing to the
different bfregs. However, the Mellanox PRM states that doorbells to
the same UAR page must be serialized. The attached graph shows the
serialization effect when I post one WQE per ibv_post_send call
(multi-threaded, one thread per QP, using MOFED). But I'm failing to
understand how this serialization is enforced in the single-context
case.
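
For concreteness, here is a minimal sketch of the single-context
setup (error handling omitted; the device index, CQ depth, and QP
caps are placeholders, not the values from my actual test):

    /* Sketch: two RC QPs created from one ibv_context. This is the
     * case where both QPs get bfregs on the same UAR page. */
    #include <infiniband/verbs.h>

    static struct ibv_qp *make_qp(struct ibv_pd *pd, struct ibv_cq *cq)
    {
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,
        };
        return ibv_create_qp(pd, &attr);
    }

    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx  = ibv_open_device(devs[0]); /* one context */
        struct ibv_pd      *pd   = ibv_alloc_pd(ctx);
        struct ibv_cq      *cq1  = ibv_create_cq(ctx, 64, NULL, NULL, 0);
        struct ibv_cq      *cq2  = ibv_create_cq(ctx, 64, NULL, NULL, 0);

        /* Same context, so the two QPs land on different bfregs of
         * the same UAR page; each thread later posts to its own QP. */
        struct ibv_qp *qp1 = make_qp(pd, cq1);
        struct ibv_qp *qp2 = make_qp(pd, cq2);

        (void)qp1; (void)qp2;
        return 0;
    }

In the two-context case I simply call ibv_open_device() twice and
repeat the above once per context.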

The only synchronization mechanism I see is the sfence barrier. The
sfence is clearly needed in the multiple-threads-per-QP case, where
the order of doorbells must be preserved. But how does this sfence
synchronize writes to different bfregs of the same UAR? Since the
message size is 2 bytes, each QP's MMIO write is only 64 bytes. My
understanding is that a write-combining buffer is 64 bytes. How many
WC buffers are there per UAR page?
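
For reference, my reading is that wc_wmb() reduces to an sfence on
x86-64; a hedged sketch of the definition, since the exact form
varies across libmlx5 versions:

    /* Hedged sketch of the x86-64 barrier the code below relies on.
     * sfence drains this CPU's write-combining buffers, so the
     * 64-byte BF copy becomes visible to the device before any later
     * stores from this CPU. It says nothing about stores issued by
     * another CPU to a different bfreg of the same page. */
    #define wc_wmb() asm volatile("sfence" ::: "memory")

That last point is exactly what puzzles me: sfence is CPU-local, so I
don't see what orders two threads' BF copies to neighboring bfregs on
the same page.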

Here is the doorbell-ringing code from MOFED 4.1:

    case MLX5_DB_METHOD_DEDIC_BF:
        /* The QP has dedicated blue-flame */

        /*
         * Make sure that descriptors are written before
         * updating doorbell record and ringing the doorbell
         */
        wmb();
        qp->gen_data.db[MLX5_SND_DBR] = htonl(curr_post);

        /* This wc_wmb ensures ordering between DB record and BF copy */
        wc_wmb();
        if (size <= bf->buf_size / 64)
            mlx5_bf_copy(bf->reg + bf->offset, seg,
                     size * 64, qp);
        else
            mlx5_write_db(bf->reg + bf->offset, seg);
        /*
         * use wc_wmb to ensure write combining buffers are flushed out
         * of the running CPU. This must be carried inside the spinlock.
         * Otherwise, there is a potential race. In the race, CPU A
         * writes doorbell 1, which is waiting in the WC buffer. CPU B
         * writes doorbell 2, and its write is flushed earlier. Since
         * the wc_wmb is CPU local, this will result in the HCA seeing
         * doorbell 2, followed by doorbell 1.
         */
        wc_wmb();
        bf->offset ^= bf->buf_size;
        break;
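
For completeness, the mlx5_bf_copy() used above is essentially a
64-byte-chunked copy into the BlueFlame area; a simplified sketch
(the real code uses a non-temporal SSE copy macro and wraps the
source pointer at the end of the send queue, both elided here, and
bf_copy_sketch is my own name, not the library's):

    #include <stdint.h>

    /* Simplified sketch of the BF copy: each iteration writes one
     * 64-byte chunk, i.e. exactly one write-combining buffer's
     * worth, into the UAR's blue-flame register area. */
    static void bf_copy_sketch(volatile uint64_t *dst, const uint64_t *src,
                               unsigned bytecnt)
    {
        while (bytecnt > 0) {
            for (int i = 0; i < 8; i++)
                dst[i] = src[i];
            dst += 8;
            src += 8;
            bytecnt -= 64;
        }
    }

Since my messages are 2 bytes, size is 1 here, so each post is a
single 64-byte chunk: every doorbell is exactly one WC buffer's worth
of MMIO.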

Thanks,
Rohit

[Attachment: sharedUAR.png (PNG image)]

