On Wed, Sep 28, 2011 at 5:57 PM, Philip Rakity <prakity@xxxxxxxxxxx> wrote:
>
>
> On Sep 28, 2011, at 4:16 PM, Praveen G K wrote:
>
>> On Wed, Sep 28, 2011 at 3:59 PM, J Freyensee
>> <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>> On 09/28/2011 03:24 PM, Praveen G K wrote:
>>>>
>>>> On Wed, Sep 28, 2011 at 2:34 PM, J Freyensee
>>>> <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On 09/28/2011 02:03 PM, Praveen G K wrote:
>>>>>>
>>>>>> On Wed, Sep 28, 2011 at 2:01 PM, J Freyensee
>>>>>> <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 09/28/2011 01:34 PM, Praveen G K wrote:
>>>>>>>>
>>>>>>>> On Wed, Sep 28, 2011 at 12:59 PM, J Freyensee
>>>>>>>> <james_p_freyensee@xxxxxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On 09/28/2011 12:06 PM, Praveen G K wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 27, 2011 at 10:42 PM, Linus Walleij
>>>>>>>>>> <linus.walleij@xxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 23, 2011 at 7:05 AM, Praveen G K
>>>>>>>>>>> <praveen.gk@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am working on the block driver module of the eMMC driver
>>>>>>>>>>>> (SDIO 3.0 controller). I am seeing very low write speeds for
>>>>>>>>>>>> eMMC transfers. On further debugging, I observed that every
>>>>>>>>>>>> 63rd and 64th transfer takes a long time.
>>>>>>>>>>>
>>>>>>>>>>> Are you not just seeing the card-internal garbage collection?
>>>>>>>>>>> http://lwn.net/Articles/428584/
>>>>>>>>>>
>>>>>>>>>> Does this mean that, theoretically, I should be able to achieve
>>>>>>>>>> higher speeds if I am not using Linux?
>>>>>>>>>
>>>>>>>>> In theory, in a fairy-tale world, maybe; in reality, not really.
>>>>>>>>> In R/W performance measurements we have done, eMMC performance in
>>>>>>>>> products users would buy falls well, well short of any theoretical
>>>>>>>>> numbers. We believe that, in theory, the eMMC interface should be
>>>>>>>>> able to support up to 100 MB/s, but in reality, on real customer
>>>>>>>>> platforms, write bandwidths (for example) barely approach 20 MB/s,
>>>>>>>>> regardless of whether it's a Microsoft Windows environment or
>>>>>>>>> Android (the Linux OS environment we care about). So maybe it is
>>>>>>>>> software implementation issues in multiple OSs that prevent higher
>>>>>>>>> eMMC performance numbers (hence why I sometimes ask basic coding
>>>>>>>>> questions about the MMC subsystem; the code isn't the easiest to
>>>>>>>>> follow). However, one need look no further than what Apple has
>>>>>>>>> done with the iPad2 to see that eMMC probably just is not a good
>>>>>>>>> solution to use in the first place. We have measured Apple's iPad2
>>>>>>>>> write performance, in terms of *WHAT A USER WOULD SEE*, at double
>>>>>>>>> what we see with products using eMMC solutions. The big
>>>>>>>>> difference? Apple doesn't use eMMC at all for the iPad2.
>>>>>>>>
>>>>>>>> Thanks for all the clarification. The problem is I am seeing write
>>>>>>>> speeds of about 5 MB/s on a SanDisk eMMC product, and I can clearly
>>>>>>>> see the time lost when measuring between sending a command and
>>>>>>>> receiving a data IRQ. I am not sure what kind of an issue this is.
>>>>>>>> 5 MB/s feels really slow, but can the internal housekeeping of the
>>>>>>>> card take so much time?
>>>>>>>
>>>>>>> Have you tried to trace through all the structs used for an MMC
>>>>>>> operation??!
>>>>>>> Good gravy, there are request, mmc_queue, mmc_card, mmc_host,
>>>>>>> mmc_blk_request, mmc_request, multiple mmc_commands, and multiple
>>>>>>> scatterlists that these other structs use... I've been playing
>>>>>>> around with trying to cache some things to improve performance, and
>>>>>>> it blows me away how many variables and pointers I have to keep
>>>>>>> track of for one operation going to an LBA on an MMC. I keep
>>>>>>> wondering if more of the 'struct request' could have been used and a
>>>>>>> third of these structures eliminated. Another thing I wonder is how
>>>>>>> much of this infrastructure is really needed; when I ask a "what is
>>>>>>> this for?" question on the list and no one responds, I wonder
>>>>>>> whether anyone else understands if it's needed either.
>>>>>>
>>>>>> I know I am not using the scatterlists, since the scatterlists are
>>>>>> aggregated into a 64 kB bounce buffer. Regarding the different
>>>>>> structs, I am just taking them at face value, assuming everything
>>>>>> works "well". But my concern is why it occasionally takes such a long
>>>>>> time (250 ms) to return a transfer-complete interrupt. During this
>>>>>> time, the kernel is just waiting for the txfer_complete interrupt.
>>>>>> That's it.
>>>>>
>>>>> I think one fundamental problem with the execution of MMC commands is
>>>>> that even though the MMC has its own structures and its own DMA/host
>>>>> controller, the OS's block subsystem and MMC subsystem do not really
>>>>> run independently of each other; each is still tied to the other's
>>>>> fate, holding up performance of the kernel in general.
>>>>>
>>>>> In particular, I have found in the 2.6.36+ kernels that the sooner you
>>>>> can retire the 'struct request *req' (i.e., using __blk_end_request())
>>>>> relative to when the mmc_wait_for_req() call is made, the higher the
>>>>> performance you are going to get out of the OS in terms of
>>>>> reads/writes using an MMC. mmc_wait_for_req() is a blocking call, so
>>>>> that OS 'struct request *req' will just sit around and do nothing
>>>>> until mmc_wait_for_req() is done. I have been able to do some caching
>>>>> of some commands, calling __blk_end_request() before
>>>>> mmc_wait_for_req(), and have gotten much higher performance in a few
>>>>> experiments (but the work certainly is not ready for prime time).
>>>>>
>>>>> Now, in the 3.0 kernel, I know mmc_wait_for_req() has changed and the
>>>>> goal was to make that function a bit more non-blocking, but I have not
>>>>> played with it much because my current focus is on existing products,
>>>>> and no handheld product uses a 3.0 kernel yet (that I am aware of, at
>>>>> least). However, I still see the fundamental problem: the MMC stack,
>>>>> which was probably written with the intent of being independent of the
>>>>> OS block subsystem (struct request and other stuff), really isn't
>>>>> independent of it, and the two will cause holdups between one another,
>>>>> thereby dragging down read/write performance of the MMC.
>>>>>
>>>>> The other fundamental problem is the writes themselves. Way, WAY more
>>>>> writes occur on a handheld system in an end-user's hands than reads.
>>>>> A fundamental computer principle says "make the common case fast", so
>>>>> the focus should be on how to complete a write operation the fastest
>>>>> way possible.
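>>>>>
>>>>> A rough sketch of the experiment, for the curious. This is
>>>>> illustrative only: the caching is hand-waved, the function name is
>>>>> made up, and the stock block.c retires the request *after*
>>>>> mmc_wait_for_req() returns, not before:
>>>>>
>>>>>     /* Sketch: retire the block-layer request before the blocking
>>>>>      * MMC wait, so the block subsystem can move on while the host
>>>>>      * controller finishes the transfer.  Caveat: the request is
>>>>>      * completed with status 0 up front, so a later transfer error
>>>>>      * can no longer be propagated back to the block layer. */
>>>>>     static void mmc_blk_issue_write_early_retire(struct mmc_queue *mq,
>>>>>                                                  struct request *req)
>>>>>     {
>>>>>             struct mmc_blk_data *md = mq->data;
>>>>>             struct mmc_card *card = md->queue.card;
>>>>>             struct mmc_blk_request brq;
>>>>>             unsigned int bytes = blk_rq_bytes(req);
>>>>>
>>>>>             /* ... set up brq from req exactly as block.c does ... */
>>>>>
>>>>>             /* Retire the OS request early (normally done last). */
>>>>>             spin_lock_irq(&md->lock);
>>>>>             __blk_end_request(req, 0, bytes);
>>>>>             spin_unlock_irq(&md->lock);
>>>>>
>>>>>             /* Blocking transfer; the block layer is no longer
>>>>>              * waiting on it. */
>>>>>             mmc_wait_for_req(card->host, &brq.mrq);
>>>>>     }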
>>>>
>>>> Thanks for the detailed explanation.
>>>> Please let me know if there is a fundamental issue with the way I am
>>>> inserting the high-res timers. In the block.c file, I am timing the
>>>> transfers as follows (I have put a concrete sketch of this
>>>> instrumentation at the end of this mail):
>>>>
>>>> 1. Start timer
>>>> 2. mmc_queue_bounce_pre()
>>>> 3. mmc_wait_for_req()
>>>> 4. mmc_queue_bounce_post()
>>>> 5. End timer
>>>>
>>>> So I don't really have to worry about blk_end_request(), right? Like
>>>> you said, mmc_wait_for_req() is a blocking wait. I don't see what is
>>>> wrong with that being a blocking wait, because until you get the
>>>> data-transfer-complete IRQ, there is no point in going ahead. The
>>>> blk_end_request() comes into the picture only later, when all the data
>>>> has been transferred to the card.
>>>
>>> Yes, that is correct.
>>>
>>> But if you can do some cache trickery or queue tricks, you can delay
>>> when you actually have to write to the MMC, and then
>>> __blk_end_request() and retiring the 'struct request *req' become the
>>> real synchronization point. That is a reason why mmc_wait_for_req() got
>>> some work done on it in the 3.0 kernel. The OS does not have to wait
>>> for the host controller to complete the operation (i.e., block on
>>> mmc_wait_for_req()) if there is no immediate dependency on that data;
>>> waiting anyway is kind of dumb. This is why this can be a problem and a
>>> time sink. It's no different from out-of-order execution in CPUs.
>>
>> Thanks, I'll look into the 3.0 code to see what the changes are and
>> whether they can improve the speed. Thanks for your suggestions.
>>
>>>> My line of thought is that the card is taking a lot of time for its
>>>> internal housekeeping.
>>>
>>> Each 'write' to solid-state/NAND flash requires an erase operation
>>> first, so yes, there is more housekeeping going on than a simple
>>> 'write'.
>>>
>>>> But I want to be absolutely sure of my analysis before I pass that
>>>> judgement.
>>>>
>>>> I have also used another Toshiba card that gives me about 12 MB/s
>>>> write speed for the same code, but what worries me is whether I am
>>>> masking some issue by blaming it on the card. What if the Toshiba card
>>>> can ideally give a throughput of more than 12 MB/s?
>>>
>>> No clue... you'd have to talk to Toshiba.
>>>
>>>> Or could there be an issue where the IRQ handler (sdhci_irq) is called
>>>> with some kind of delay, so that we are not capturing the
>>>> transfer-complete interrupt immediately?
>>>>
>>>>>>>> I mean, for the usual transfers it takes about 3 ms to transfer
>>>>>>>> 64 kB of data, but for the 63rd and 64th transfers, it takes
>>>>>>>> 250 ms. The thing is, this is not on a file system. I am measuring
>>>>>>>> the speed using a basic "dd" command to write directly to the block
>>>>>>>> device.
>>>>>>>>
>>>>>>>>>> So, is this a software issue? Or is there a way to increase the
>>>>>>>>>> size of the bounce buffers to 4 MB?
>>>>>>>>>>>
>>>>>>>>>>> Yours,
>>>>>>>>>>> Linus Walleij
>
> Some questions:
>
> Does using a bounce buffer make things faster?
>
> I think you are using SDMA. I am wondering if there is a way to
> increase the xfer size. Is there some magic number inside the mmc code
> that can be increased?

The bounce buffer increases the speed, but it is limited to 64 kB. I
don't know why it is limited to that number, though.

> Philip
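
On the magic number: as far as I can tell (so treat this as my reading
of the source, not gospel), the 64 kB comes from a constant in
drivers/mmc/card/queue.c, and the effective size is further clamped by
what the host controller reports. Roughly:

    /* drivers/mmc/card/queue.c (as I read it), used when
     * CONFIG_MMC_BLOCK_BOUNCE is enabled: */
    #define MMC_QUEUE_BOUNCESZ      65536

    unsigned int bouncesz = MMC_QUEUE_BOUNCESZ;

    /* The bounce buffer can never exceed what the host says it can
     * handle in one request/segment. */
    if (bouncesz > host->max_req_size)
            bouncesz = host->max_req_size;
    if (bouncesz > host->max_seg_size)
            bouncesz = host->max_seg_size;
    if (bouncesz > (host->max_blk_count * 512))
            bouncesz = host->max_blk_count * 512;

    /* bouncesz bytes are then kmalloc'd, so a 4 MB bounce buffer would
     * also need physically contiguous memory that large. */
    mq->bounce_buf = kmalloc(bouncesz, GFP_KERNEL);

So simply raising MMC_QUEUE_BOUNCESZ may not buy anything if the host
driver caps max_req_size lower, and a multi-megabyte kmalloc() of
physically contiguous memory is not guaranteed to succeed anyway.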
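
And for reference, the timing instrumentation I mentioned above is
essentially the following (a sketch; the variable names are mine, and
in my tree it sits in the request-issue path of block.c):

    /* Time the bounce-buffer copy plus the blocking transfer using
     * the high-resolution ktime clock. */
    ktime_t start, end;

    start = ktime_get();
    mmc_queue_bounce_pre(mq);
    mmc_wait_for_req(card->host, &brq.mrq);
    mmc_queue_bounce_post(mq);
    end = ktime_get();

    pr_info("mmc: xfer took %lld us\n",
            (long long)ktime_us_delta(end, start));

On the slow transfers, the extra 250 ms shows up inside
mmc_wait_for_req(), i.e., between the command being sent and the
transfer-complete IRQ arriving.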