Re: [LSF/MM/BPF TOPIC] Improving Zoned Storage Support

On 1/17/24 11:22 AM, Bart Van Assche wrote:
> On 1/17/24 09:48, Jens Axboe wrote:
>>> When posting this patch series, please include performance results
>>> (IOPS) for a zoned null_blk device instance. mq-deadline doesn't support
>>> more than 200 K IOPS, which is less than what UFS devices support. I
>>> hope that this performance bottleneck will be solved with the new
>>> approach.
>>
>> Not really zone related, but I was very aware of the single lock
>> limitations when I ported deadline to blk-mq. Was always hoping that
>> someone would actually take the time to make it more efficient, but so
>> far that hasn't happened. Or maybe it'll be a case of "just do it
>> yourself, Jens" at some point...
> 
> Hi Jens,
> 
> I think it is something fundamental rather than something that can be
> fixed. The I/O scheduling algorithms in mq-deadline and BFQ require
> knowledge of all pending I/O requests. This implies that data structures
> must be maintained that are shared across all CPU cores. Making these
> thread-safe implies having synchronization mechanisms that are used
> across all CPU cores. I think this is where the (about) 200 K IOPS
> bottleneck comes from.

Has any analysis been done on where the limitation comes from? For
kicks, I ran an IOPS benchmark on a smaller AMD box. It has 4 fast
drives, and if I use mq-deadline on those 4 drives I can get 13.5M IOPS
using just 4 threads and only 2 cores. That's vastly more than 200K; in
fact it's ~3.3M per drive. At the same time it's vastly slower than the
~5M per drive that they will do without a scheduler.

Taking a quick look at what slows it down, it's a mix of not being able
to use completion side batching (which in turn brings in TSC reading
as the highest cycle user...) and some general deadline overhead. In
order:

+    3.32%  io_uring  [kernel.kallsyms]  [k] __dd_dispatch_request
+    2.71%  io_uring  [kernel.kallsyms]  [k] dd_insert_requests
+    1.21%  io_uring  [kernel.kallsyms]  [k] dd_dispatch_request

with the rest being noise. Biggest one is dd_has_work(), which seems
like it would be trivially fixable by just having a shared flag if ANY
of the priorities had work.
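
Roughly what I have in mind, as a sketch only (the "queued" counter is a
made-up field, not something mq-deadline has today): bump one shared
atomic on insert, drop it on dispatch, and dd_has_work() becomes a single
read instead of a walk over every priority level:

/*
 * Sketch only: "queued" is a hypothetical atomic_t in struct deadline_data,
 * incremented when a request is inserted and decremented when it's
 * dispatched. With that in place, dd_has_work() doesn't need to look at
 * the per-priority FIFO/sort lists at all.
 */
static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;

	return atomic_read(&dd->queued) != 0;
}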

Granted, this test case is single threaded as far as each device is
concerned, which is obviously the best case. That leads me to believe
that it may indeed be locking that's the main issue here, which is what
I suspected from the get-go. And while yes, this is a lot of shared data,
there's absolutely ZERO reason why we would end up with a hard limit of
~200K IOPS even while maintaining the behavior it has now.

So let's try a single device, single thread:

IOPS=5.10M, BW=2.49GiB/s, IOS/call=32/31

That's device limits, using mq-deadline. Now let's try and have 4
threads banging on it, pinned to the same two cores:

IOPS=3.90M, BW=1903MiB/s, IOS/call=32/31

Certainly slower. Now let's try and have the scheduler place the same 4
threads where it sees fit:

IOPS=1.56M, BW=759MiB/s, IOS/call=32/31

Yikes! That's still substantially more than 200K IOPS even with heavy
contention. Let's take a look at the profile:

-   70.63%  io_uring  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
   - submitter_uring_fn
      - entry_SYSCALL_64
      - do_syscall_64
         - __se_sys_io_uring_enter
            - 70.62% io_submit_sqes
                 blk_finish_plug
                 __blk_flush_plug
               - blk_mq_flush_plug_list
                  - 69.65% blk_mq_run_hw_queue
                       blk_mq_sched_dispatch_requests
                     - __blk_mq_sched_dispatch_requests
                        + 60.61% dd_dispatch_request
                        + 8.98% blk_mq_dispatch_rq_list
                  + 0.98% dd_insert_requests

which is exactly as expected: we're spending 70% of the CPU cycles
banging on dd->lock.

Let's run the same thing again, but just do single requests at a time:

IOPS=1.10M, BW=535MiB/s, IOS/call=1/0

Worse again, but still a far cry from 200K IOPS. Contention is basically
the same, but now we're not able to amortize the other submission side
costs.
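
For reference, the batched runs above get their amortization from plugged
submission, roughly like this on the submitter side (a sketch, not the
actual benchmark code): everything queued under the plug reaches
dd_insert_requests() as one list, so the scheduler lock is taken once per
batch instead of once per request.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch: with a plug active, the bios below are flushed to the scheduler
 * as a single list, so dd_insert_requests() takes dd->lock once for the
 * whole batch. In the IOS/call=1/0 run there is no such batching, hence
 * one lock round-trip per request.
 */
static void submit_batch(struct bio **bios, int nr)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);
	for (i = 0; i < nr; i++)
		submit_bio(bios[i]);
	blk_finish_plug(&plug);
}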

What I'm getting at is that it's a trap to just say "oh IO schedulers
can't scale beyond low IOPS" without even looking into where those
limits may be coming from. I'm willing to bet that increasing the
current limit for multi-threaded workloads would not be that difficult,
and it would probably 5x the performance potential of such setups.
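
As one purely hypothetical direction (names made up, not a patch): stage
inserts on per-CPU lists and only splice them into the sorted structures
under dd->lock when dispatch runs. Submitters then mostly touch a local
lock, and the global lock is taken once per drain rather than once per
submitter.

/*
 * Hypothetical sketch only: the pcpu_insert field and dd_sort_insert()
 * helper are made up, not part of mq-deadline. Inserts go to a per-CPU
 * staging list; dispatch splices all of them in one go under dd->lock.
 */
struct dd_pcpu_insert {
	spinlock_t lock;
	struct list_head list;
};

static void dd_insert_request_pcpu(struct deadline_data *dd,
				   struct request *rq)
{
	struct dd_pcpu_insert *pi = raw_cpu_ptr(dd->pcpu_insert);

	spin_lock(&pi->lock);			/* local, rarely contended */
	list_add_tail(&rq->queuelist, &pi->list);
	spin_unlock(&pi->lock);
}

/* Called from the dispatch path, with dd->lock already held. */
static void dd_drain_pcpu_inserts(struct deadline_data *dd)
{
	int cpu;

	lockdep_assert_held(&dd->lock);
	for_each_possible_cpu(cpu) {
		struct dd_pcpu_insert *pi = per_cpu_ptr(dd->pcpu_insert, cpu);
		struct request *rq, *next;

		spin_lock(&pi->lock);
		list_for_each_entry_safe(rq, next, &pi->list, queuelist) {
			list_del_init(&rq->queuelist);
			dd_sort_insert(dd, rq);	/* made-up helper */
		}
		spin_unlock(&pi->lock);
	}
}

Anything that needs stronger ordering (front inserts, flushes, zoned
writes) would still have to go through dd->lock directly, so this only
sketches the common path.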

Do we care? Maybe not, if we accept that an IO scheduler is just for
"slower devices". But let's not go around spouting some 200K number as
if it's gospel, when it depends on so many factors like IO workload,
system used, etc.

> Additionally, the faster storage devices become, the larger the relative
> overhead of an I/O scheduler is (assuming that I/O schedulers won't
> become significantly faster).

First part is definitely true; the second assumption, I think, is an "I
just give up without even looking at why" kind of attitude.

> A fundamental limitation of I/O schedulers is that multiple commands
> must be submitted simultaneously to the storage device to achieve good
> performance. However, if the queue depth is larger than one then the
> device has some control over the order in which commands are executed.

This isn't new, that's been known and understood for decades.

> Because of all the above reasons I'm recommending my colleagues to move
> I/O prioritization into the storage device and to evolve towards a
> future for solid storage devices without I/O schedulers. I/O schedulers
> probably will remain important for rotating magnetic media.

While I don't agree with a lot of your stipulations above, this is a
recommendation I've been giving for a long time as well. Mostly because
it means less cruft for us to maintain in software, while knowing full
well that we're then at the mercy of hardware implementations which may
all behave differently, and that historically we have not had good luck
punting these problems to hardware and getting the desired outcome.

-- 
Jens Axboe