On Sep 28, 2011, at 4:16 PM, Praveen G K wrote:

> On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij <linus.walleij@xxxxxxxxxx> wrote:
>>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K <praveen.gk@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I am working on the block driver module of the eMMC driver (SDIO 3.0 controller). I am seeing very low write speed for eMMC transfers. On further debugging, I observed that every 63rd and 64th transfer takes a long time.
>>>>>>>>>>
>>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>>
>>>>>>>>> Does this mean that, theoretically, I should be able to achieve higher speeds if I am not using Linux?
>>>>>>>>
>>>>>>>> In theory, in a fairy-tale world, maybe; in reality, not really. In R/W performance measurements we have done, eMMC performance in products users would buy falls well, well short of any theoretical numbers.
>>>>>>>> We believe that, in theory, the eMMC interface should be able to support up to 100MB/s, but in reality, on real customer platforms, write bandwidths (for example) barely approach 20MB/s, regardless of whether it is a Microsoft Windows environment or Android (the Linux OS environment we care about). So maybe it is software implementation issues across multiple OSes preventing higher eMMC performance numbers (hence the reason I sometimes ask basic coding questions about the MMC subsystem on the list: the code isn't the easiest to follow). However, one need look no further than what Apple has done with the iPad 2 to see that eMMC probably just is not a good solution to use in the first place. We have measured Apple's iPad 2 write performance, *as a user would actually see it*, at double what we see with products using eMMC solutions. The big difference? Apple doesn't use eMMC at all in the iPad 2.
>>>>>>>
>>>>>>> Thanks for all the clarification. The problem is that I am seeing write speeds of about 5MB/s on a SanDisk eMMC product, and I can clearly see the time lost when measured between sending a command and receiving a data IRQ. I am not sure what kind of issue this is. 5MB/s feels really slow, but can the internal housekeeping of the card take so much time?
>>>>>>
>>>>>> Have you tried to trace through all the structs used for an MMC operation?!
>>>>>> Good gravy, there are request, mmc_queue, mmc_card, mmc_host, mmc_blk_request, mmc_request, multiple mmc_command structs, and the multiple scatterlists that these other structs use... I've been playing around with caching some things to try to improve performance, and it blows me away how many variables and pointers I have to keep track of for one operation going to an LBA on an MMC. I keep wondering whether more of the 'struct request' could have been used and a third of these structures eliminated. Another thing I wonder is how much of this infrastructure is really needed; when I do ask a "what is this for?" question on the list and no one responds, I wonder whether anyone else understands if it's needed either.
>>>>>
>>>>> I know I am not using the scatterlists, since the scatterlists are aggregated into a 64k bounce buffer. Regarding the different structs, I am just taking them at face value, assuming everything works "well". But my concern is why it takes such a long time (250 ms) to return a transfer-complete interrupt in occasional cases. During this time, the kernel is just waiting for the txfer_complete interrupt. That's it.
>>>>
>>>> I think one fundamental problem with the execution of MMC commands is that even though the MMC subsystem has its own structures and its own DMA/host controller, the OS's block subsystem and the MMC subsystem do not really run independently of each other; each is still tied to the other's fate, holding up performance of the kernel in general.
>>>>
>>>> In particular, I have found in 2.6.36+ kernels that the sooner you can retire the 'struct request *req' (i.e., using __blk_end_request()) relative to when the mmc_wait_for_req() call is made, the higher the performance you are going to get out of the OS in terms of reads/writes using an MMC. mmc_wait_for_req() is a blocking call, so that OS 'struct request req' will just sit around and do nothing until mmc_wait_for_req() is done. I have been able to do some caching of some commands, calling __blk_end_request() before mmc_wait_for_req(), and have gotten much higher performance in a few experiments (but the work certainly is not ready for prime time).
>>>>
>>>> Now, in the 3.0 kernel I know mmc_wait_for_req() has changed, and the goal was to make that function a bit more non-blocking, but I have not played with it much because my current focus is on existing products, and no handheld product uses a 3.0 kernel yet (that I am aware of, at least). However, I still see the fundamental problem: the MMC stack, which was probably written with the intent of being independent of the OS block subsystem (struct request and other stuff), really isn't independent of it, and the two will cause holdups between one another, thereby dragging down read/write performance of the MMC.
>>>>
>>>> The other fundamental problem is the writes themselves. Way, WAY more writes occur on a handheld system in an end user's hands than reads. A fundamental principle of computing says to make the common case fast, so the focus should be on how to complete a write operation the fastest way possible.
>>>
>>> Thanks for the detailed explanation.
>>> Please let me know if there is a fundamental issue with the way I am inserting the high-resolution timers.
>>> In the block.c file, I am timing the transfers as follows:
>>>
>>> Start timer
>>> mmc_queue_bounce_pre()
>>> mmc_wait_for_req()
>>> mmc_queue_bounce_post()
>>> End timer
>>>
>>> So I don't really have to worry about blk_end_request, right? Like you said, wait_for_req is a blocking wait. I don't see what is wrong with it being a blocking wait, because until you get the data-transfer-complete IRQ, there is no point in going ahead. blk_end_request comes into the picture only later, when all the data has been transferred to the card.
>>
>> Yes, that is correct.
>>
>> But if you can do some cache trickery or queue tricks, you can delay when you actually have to write to the MMC, so then __blk_end_request() and retiring the 'struct request *req' becomes the time sink. That is a reason mmc_wait_for_req() got some work done on it in the 3.0 kernel. The OS does not have to wait for the host controller to complete the operation (i.e., block on mmc_wait_for_req()) if there is no immediate dependency on that data; making it wait in that case is kind of dumb. This is why this can be a problem and a time sink. It's no different from out-of-order execution in CPUs.
>
> Thanks, I'll look into the 3.0 code to see what the changes are and whether they can improve the speed. Thanks for your suggestions.
>
>>> My line of thought is that the card is taking a lot of time for its internal housekeeping.
>>
>> Each 'write' to solid-state/NAND/flash requires an erase operation first, so yes, there is more housekeeping going on than a simple 'write'.
>>
>>> But I want to be absolutely sure of my analysis before I pass that judgement.
>>>
>>> I have also used another Toshiba card that gives me about 12MB/s write speed for the same code, but I am worried about whether I am masking some issue by blaming it on the card. What if the Toshiba card can ideally give a throughput of more than 12MB/s?
>>
>> No clue... you'd have to talk to Toshiba.
>>
>>> Or could there be an issue where the IRQ handler (sdhci_irq) is called with some kind of delay, and is there a possibility that we are not capturing the transfer-complete interrupt immediately?
>>>
>>>>>>> I mean, for the usual transfers it takes about 3ms to transfer 64kB of data, but for the 63rd and 64th transfers it takes 250 ms. The thing is, this is not on a file system. I am measuring the speed using a basic "dd" command to write directly to the block device.
>>>>>>>
>>>>>>>>> So, is this a software issue? Or is there a way to increase the size of the bounce buffers to 4MB?
>>>>>>>>>>
>>>>>>>>>> Yours,
>>>>>>>>>> Linus Walleij

some questions: does using a bounce buffer make things faster? I think you are using SDMA. I am wondering if there is a way to increase the transfer size.
Is there some magic number inside the mmc code that can be increased?

Philip

>>
>> --
>> J (James/Jay) Freyensee
>> Storage Technology Group
>> Intel Corporation
>>

--
To unsubscribe from this list: send the line "unsubscribe linux-mmc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html