Re: [PATCH V5 00/25] mmc: mmc: Add Software Command Queuing

On 28/10/16 14:48, Ritesh Harjani wrote:
> Hi Adrian,
> 
> 
> Thanks for the re-base. I am trying to apply this patch series and validate it.
> I am seeing some tuning-related errors; they could be due to my setup/device.
> I will be working on this.
>
>
> Meanwhile, the perf readout below shows some discrepancy (it could be that I
> am missing something).
> In your last test both sequential and random read give almost the same
> throughput, and even the percentage increase is similar (legacy vs. SW cmdq).
> Could it be due to the benchmark itself?

Normally sequential reading would result in large transfers due to
readahead.  But the test uses Direct I/O (the -I option) to prevent VFS
caching from distorting the results, so the sequential read case ends up
being much like the random read case anyway.
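
For reference, iozone's -I option opens the test files with O_DIRECT.  A
minimal sketch of what that implies (illustrative only, not iozone's actual
code):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
            void *buf;
            /* O_DIRECT bypasses the page cache, so the kernel does no
             * readahead and every read() goes straight to the device. */
            int fd = open("/mnt/mmc/iozone1.tmp", O_RDONLY | O_DIRECT);

            if (fd < 0)
                    return 1;
            /* O_DIRECT requires the buffer, offset and length to be
             * aligned, typically to the logical block size. */
            if (posix_memalign(&buf, 4096, 4096))
                    return 1;
            /* 4k at a time: "sequential" in name, but each read is a
             * separate small device transfer, much like random I/O. */
            while (read(fd, buf, 4096) > 0)
                    ;
            free(buf);
            close(fd);
            return 0;
    }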

> 
> Do you think we should get some other benchmarks tested as well?
> (maybe tio?)
> 
> 
> Also, could you please add some analysis of where we should expect to see
> score improvements with SW CMDQ, and where we may see some decreases due
> to SW CMDQ?

I would expect random reads to improve, but there is the overhead of the
extra cmdq commands, so a small decrease in performance would be expected
otherwise.

> You did mention in the original cover letter that we should mostly see
> improvement in random read scores, but below we are seeing similar or higher
> improvement in sequential reads (4 threads), which is a bit surprising.

As explained above: the test is called "sequential", but the limited record
size combined with Direct I/O makes it more like random I/O, especially when
it is mixed from multiple threads.

> 
> 
> 
> 
> On 10/24/2016 2:07 PM, Adrian Hunter wrote:
>> Hi
>>
>> Here is an updated version of the Software Command Queuing patches,
>> re-based on next, with patches 1-5 dropped because they have been applied
>> and two fixes rolled in (refer to "Changes in V5" below).
>>
>> Performance results:
>>
>> Results can vary from run to run, but here are some results showing 1, 2 or 4
>> processes with 4k and 32k record sizes.  Each line shows legacy throughput,
>> then SW CMDQ throughput, then the percentage change.  They show up to 40%
>> improvement in read performance when there are multiple processes.
>>
>> iozone -s 8192k -r 4k -i 0 -i 1 -i 2 -i 8 -I -t 1 -F /mnt/mmc/iozone1.tmp
>>
>>     Children see throughput for 1 initial writers  =   27909.87 kB/sec    24204.14 kB/sec   -13.28 %
>>     Children see throughput for 1 rewriters        =   28839.28 kB/sec    25531.92 kB/sec   -11.47 %
>>     Children see throughput for 1 readers          =   25889.65 kB/sec    24883.23 kB/sec    -3.89 %
>>     Children see throughput for 1 re-readers       =   25558.23 kB/sec    24679.89 kB/sec    -3.44 %
>>     Children see throughput for 1 random readers   =   25571.48 kB/sec    24689.52 kB/sec    -3.45 %
>>     Children see throughput for 1 mixed workload   =   25758.59 kB/sec    24487.52 kB/sec    -4.93 %
>>     Children see throughput for 1 random writers   =   24787.51 kB/sec    19368.99 kB/sec   -21.86 %
>>
>> iozone -s 8192k -r 32k -i 0 -i 1 -i 2 -i 8 -I -t 1 -F /mnt/mmc/iozone1.tmp
>>
>>     Children see throughput for 1 initial writers  =   91344.61 kB/sec   102008.56 kB/sec    11.67 %
>>     Children see throughput for 1 rewriters        =   87932.36 kB/sec    96630.44 kB/sec     9.89 %
>>     Children see throughput for 1 readers          =  134879.82 kB/sec   110292.79 kB/sec   -18.23 %
>>     Children see throughput for 1 re-readers       =  147632.13 kB/sec   109053.33 kB/sec   -26.13 %
>>     Children see throughput for 1 random readers   =   93547.37 kB/sec   112225.50 kB/sec    19.97 %
>>     Children see throughput for 1 mixed workload   =   93560.04 kB/sec   110515.21 kB/sec    18.12 %
>>     Children see throughput for 1 random writers   =   92841.84 kB/sec    81153.81 kB/sec   -12.59 %
>>
>> iozone -s 8192k -r 4k -i 0 -i 1 -i 2 -i 8 -I -t 2 -F /mnt/mmc/iozone1.tmp
>> /mnt/mmc/iozone2.tmp
>>
>>     Children see throughput for 2 initial writers  =   31145.43 kB/sec    33771.25 kB/sec     8.43 %
>>     Children see throughput for 2 rewriters        =   30592.57 kB/sec    35916.46 kB/sec    17.40 %
>>     Children see throughput for 2 readers          =   31669.83 kB/sec    37460.13 kB/sec    18.28 %
>>     Children see throughput for 2 re-readers       =   32079.94 kB/sec    37373.33 kB/sec    16.50 %
>>     Children see throughput for 2 random readers   =   27731.19 kB/sec    37601.65 kB/sec    35.59 %
>>     Children see throughput for 2 mixed workload   =   13927.50 kB/sec    14617.06 kB/sec     4.95 %
>>     Children see throughput for 2 random writers   =   31250.00 kB/sec    33106.72 kB/sec     5.94 %
>>
>> iozone -s 8192k -r 32k -i 0 -i 1 -i 2 -i 8 -I -t 2 -F /mnt/mmc/iozone1.tmp
>> /mnt/mmc/iozone2.tmp
>>
>>     Children see throughput for 2 initial writers  =  123255.84 kB/sec   131252.22 kB/sec     6.49 %
>>     Children see throughput for 2 rewriters        =  115234.91 kB/sec   107225.74 kB/sec    -6.95 %
>>     Children see throughput for 2 readers          =  128921.86 kB/sec   148562.71 kB/sec    15.23 %
>>     Children see throughput for 2 re-readers       =  127815.24 kB/sec   149304.32 kB/sec    16.81 %
>>     Children see throughput for 2 random readers   =  125600.46 kB/sec   148406.56 kB/sec    18.16 %
>>     Children see throughput for 2 mixed workload   =   44006.94 kB/sec    50937.36 kB/sec    15.75 %
>>     Children see throughput for 2 random writers   =  120623.95 kB/sec   103969.05 kB/sec   -13.81 %
>>
>> iozone -s 8192k -r 4k -i 0 -i 1 -i 2 -i 8 -I -t 4 -F /mnt/mmc/iozone1.tmp
>> /mnt/mmc/iozone2.tmp /mnt/mmc/iozone3.tmp /mnt/mmc/iozone4.tmp
>>
>>     Children see throughput for 4 initial writers  =   24100.96 kB/sec    33336.58 kB/sec    38.32 %
>>     Children see throughput for 4 rewriters        =   31650.20 kB/sec    33091.53 kB/sec     4.55 %
>>     Children see throughput for 4 readers          =   33276.92 kB/sec    41799.89 kB/sec    25.61 %
>>     Children see throughput for 4 re-readers       =   31786.96 kB/sec    41501.74 kB/sec    30.56 %
>>     Children see throughput for 4 random readers   =   31991.65 kB/sec    40973.93 kB/sec    28.08 %
>>     Children see throughput for 4 mixed workload   =   15804.80 kB/sec    13581.32 kB/sec   -14.07 %
>>     Children see throughput for 4 random writers   =   31231.42 kB/sec    34537.03 kB/sec    10.58 %
>>
>> iozone -s 8192k -r 32k -i 0 -i 1 -i 2 -i 8 -I -t 4 -F /mnt/mmc/iozone1.tmp
>> /mnt/mmc/iozone2.tmp /mnt/mmc/iozone3.tmp /mnt/mmc/iozone4.tmp
>>
>>     Children see throughput for 4 initial writers  =  116567.42 kB/sec   119280.35 kB/sec     2.33 %
>>     Children see throughput for 4 rewriters        =  115010.96 kB/sec   120864.34 kB/sec     5.09 %
>>     Children see throughput for 4 readers          =  130700.29 kB/sec   177834.21 kB/sec    36.06 %
> Do you think sequential read should improve more than random read? It should
> mostly benefit random reads, right? Any idea why it's behaving differently
> here?

Explained above.

> 
> 
>>     Children see throughput for 4 re-readers       =  125392.58 kB/sec   175975.28 kB/sec    40.34 %
>>     Children see throughput for 4 random readers   =  132194.57 kB/sec   176630.46 kB/sec    33.61 %
>>     Children see throughput for 4 mixed workload   =   56464.98 kB/sec    54140.61 kB/sec    -4.12 %
>>     Children see throughput for 4 random writers   =  109128.36 kB/sec    85359.80 kB/sec   -21.78 %
> Similarly, we don't expect random write scores to decrease. Do you know why
> this could be the case here?

There is a lot of variation from one run to another.

> 
> 
>>
>>
>> The current block driver supports 2 requests on the go at a time. Patches
>> 1 - 8 make preparations for an arbitrary sized queue. Patches 9 - 12
>> introduce Command Queue definitions and helpers.  Patches 13 - 19
>> complete the job of making the block driver use a queue.  Patches 20 - 23
>> finally add Software Command Queuing, and 24 - 25 enable it for Intel eMMC
>> controllers. Most of the Software Command Queuing functionality is added
>> in patch 22.
>>
>> As noted below, the patches can also be found here:
>>
>>     http://git.infradead.org/users/ahunter/linux-sdhci.git/shortlog/refs/heads/swcmdq
>>
>>
>> Changes in V5:
>>
>>   Patches 1-5 dropped because they have been applied.
>>
>>   Re-based on next.
>>
>>   Fixed use of blk_end_request_cur() when it should have been
>>   blk_end_request_all() to error out requests during error recovery.
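>>
>>   A sketch of the distinction (illustrative, not the patch code):
>>   blk_end_request_cur() completes only the current chunk of a request,
>>   whereas blk_end_request_all() completes the whole request, which is
>>   what is needed when erroring out queued requests:
>>
>>     /* Illustrative: fail an entire request during error recovery.
>>      * blk_end_request_cur(req, -EIO) would complete only the current
>>      * segment and leave the remainder of the request pending. */
>>     blk_end_request_all(req, -EIO);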
>>
>>   Fixed unpaired retune_hold / retune_release in the error recovery path.
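>>
>>   The rule being that every mmc_retune_hold() must be balanced by exactly
>>   one mmc_retune_release() on every path; roughly (illustrative only,
>>   recover() is a made-up stand-in for the recovery step):
>>
>>     mmc_retune_hold(host);        /* defer re-tuning during recovery */
>>     err = recover(host);          /* hypothetical recovery step */
>>     mmc_retune_release(host);     /* must run on every exit path */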
>>
>> Changes in V4:
>>
>>   Re-based on next + v4.8-rc2 + "block: Fix secure erase" patch
>>
>> Changes in V3:
>>
>>   Patches 1-25 dropped because they have been applied.
>>
>>   Re-based on next.
>>
>>   mmc: queue: Allocate queue of size qdepth
>>     Free queue during cleanup
>>
>>   mmc: mmc: Add Command Queue definitions
>>     Add cmdq_en to mmc-dev-attrs.txt documentation
>>
>>   mmc: queue: Share mmc request array between partitions
>>     New patch
>>
>> Changes in V2:
>>
>>   Added 5 patches already sent here:
>>
>>     http://marc.info/?l=linux-mmc&m=146712062816835
>>
>>   Added 3 more new patches:
>>
>>     mmc: sdhci-pci: Do not runtime suspend at the end of sdhci_pci_probe()
>>     mmc: sdhci: Avoid STOP cmd triggering warning in sdhci_send_command()
>>     mmc: sdhci: sdhci_execute_tuning() must delete timer
>>
>>   Carried forward the V2 fix to:
>>
>>     mmc: mmc_test: Disable Command Queue while mmc_test is used
>>
>>   Also reset the cmd circuit for data timeout if it is processing the data
>>   cmd, in patch:
>>
>>     mmc: sdhci: Do not reset cmd or data circuits that are in use
>>
>> There wasn't much comment on the RFC so there have been few changes.
>> Venu Byravarasu commented that it may be more efficient to use Software
>> Command Queuing only when there is more than 1 request queued - it isn't
>> obvious how well that would work in practice, but it could be added later
>> if it could be shown to be beneficial.
>>
>> Original Cover Letter:
>>
>> Chuanxiao Dong sent some patches last year relating to eMMC 5.1 Software
>> Command Queuing.  He did not follow up, but I have contacted him and he says
>> it is OK if I take over upstreaming the patches.
>>
>> eMMC Command Queuing is a feature added in version 5.1.  The card maintains
>> a queue of up to 32 data transfers.  Commands CMD44/CMD45 are sent to queue
>> up transfers in advance, and then one of the transfers is selected to
>> "execute" by CMD46/CMD47, at which point data transfer actually begins.
>>
>> The advantage of command queuing is that the card can prepare for transfers
>> in advance.  That makes a big difference in the case of random reads because
>> the card can start reading into its cache in advance.
>>
>> A v5.1 host controller can manage the command queue itself, but it is also
>> possible for software to manage the queue using a non-v5.1 host controller -
>> that is what Software Command Queuing is.
>>
>> Refer to the JEDEC (http://www.jedec.org/) eMMC v5.1 Specification for more
>> information about Command Queuing.
>>
>> While these patches are heavily based on Dong's patches, there are some
>> changes:
>>
>> SDHCI has been amended to support commands during transfer. That is a
>> generic change added in patches 1 - 5. [Those patches have now been applied]
>> In principle, that would also support SDIO's CMD52 during data transfer.
>>
>> The original approach added multiple commands into the same request for
>> sending CMD44, CMD45 and CMD13. That is not strictly necessary and has
>> been omitted for now.
>>
>> The original approach also called blk_end_request() from the mrq->done()
>> function, which means the upper layers learnt of completed requests
>> slightly earlier. That is not strictly related to Software Command Queuing
>> and is something that could potentially be done for all data requests.
>> That has been omitted for now.
>>
>> The current block driver supports 2 requests on the go at a time. Patches
>> 1 - 8 make preparations for an arbitrary sized queue. Patches 9 - 12
>> introduce Command Queue definitions and helpers.  Patches 13 - 19
>> complete the job of making the block driver use a queue.  Patches 20 - 23
>> finally add Software Command Queuing, and 24 - 25 enable it for Intel eMMC
>> controllers. Most of the Software Command Queuing functionality is added
>> in patch 22.
>>
>> The patches can also be found here:
>>
>>     http://git.infradead.org/users/ahunter/linux-sdhci.git/shortlog/refs/heads/swcmdq
>>
>>
>> The patches have only had basic testing so far.  Ad-hoc testing shows a
>> degradation in sequential read performance of about 10%, but an increase in
>> throughput of about 90% for a mixed workload of multiple processes.  The
>> reduction in sequential performance is due to the need to read the Queue
>> Status register between each transfer.
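>>
>> As a rough feel for that cost (numbers purely illustrative): at ~150 MB/s
>> a 512 kB sequential transfer takes ~3.4 ms, so an extra status command
>> costing a few hundred microseconds between transfers adds on the order
>> of 10%.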
>>
>> These patches should not conflict with Hardware Command Queuing, which
>> handles the queue in a completely different way and thus does not need
>> to share code with Software Command Queuing.  The exceptions are the
>> Command Queue definitions and the queue allocation, which should be
>> reusable.
>>
>>
>> Adrian Hunter (25):
>>       mmc: queue: Fix queue thread wake-up
>>       mmc: queue: Factor out mmc_queue_alloc_bounce_bufs()
>>       mmc: queue: Factor out mmc_queue_alloc_bounce_sgs()
>>       mmc: queue: Factor out mmc_queue_alloc_sgs()
>>       mmc: queue: Factor out mmc_queue_reqs_free_bufs()
>>       mmc: queue: Introduce queue depth
>>       mmc: queue: Use queue depth to allocate and free
>>       mmc: queue: Allocate queue of size qdepth
>>       mmc: mmc: Add Command Queue definitions
>>       mmc: mmc: Add functions to enable / disable the Command Queue
>>       mmc: mmc_test: Disable Command Queue while mmc_test is used
>>       mmc: block: Disable Command Queue while RPMB is used
>>       mmc: core: Do not prepare a new request twice
>>       mmc: core: Export mmc_retune_hold() and mmc_retune_release()
>>       mmc: block: Factor out mmc_blk_requeue()
>>       mmc: block: Fix 4K native sector check
>>       mmc: block: Use local var for mqrq_cur
>>       mmc: block: Pass mqrq to mmc_blk_prep_packed_list()
>>       mmc: block: Introduce queue semantics
>>       mmc: queue: Share mmc request array between partitions
>>       mmc: queue: Add a function to control wake-up on new requests
>>       mmc: block: Add Software Command Queuing
>>       mmc: mmc: Enable Software Command Queuing
>>       mmc: sdhci-pci: Enable Software Command Queuing for some Intel controllers
>>       mmc: sdhci-acpi: Enable Software Command Queuing for some Intel controllers
>>
>>  Documentation/mmc/mmc-dev-attrs.txt |   1 +
>>  drivers/mmc/card/block.c            | 738 +++++++++++++++++++++++++++++++++---
>>  drivers/mmc/card/mmc_test.c         |  13 +
>>  drivers/mmc/card/queue.c            | 332 ++++++++++------
>>  drivers/mmc/card/queue.h            |  35 +-
>>  drivers/mmc/core/core.c             |  18 +-
>>  drivers/mmc/core/host.c             |   2 +
>>  drivers/mmc/core/host.h             |   2 -
>>  drivers/mmc/core/mmc.c              |  43 ++-
>>  drivers/mmc/core/mmc_ops.c          |  27 ++
>>  drivers/mmc/host/sdhci-acpi.c       |   2 +-
>>  drivers/mmc/host/sdhci-pci-core.c   |   2 +-
>>  include/linux/mmc/card.h            |   9 +
>>  include/linux/mmc/core.h            |   5 +
>>  include/linux/mmc/host.h            |   4 +-
>>  include/linux/mmc/mmc.h             |  17 +
>>  16 files changed, 1046 insertions(+), 204 deletions(-)
>>
>>
>> Regards
>> Adrian
>>
> 
