On Mon, Aug 27, 2012 at 11:58 PM,  <merez@xxxxxxxxxxxxxx> wrote:
>
> On Fri, July 27, 2012 2:07 am, S, Venkatraman wrote:
>> On Fri, Jul 27, 2012 at 12:24 AM,  <merez@xxxxxxxxxxxxxx> wrote:
>>>
>>> On Thu, July 26, 2012 8:28 am, S, Venkatraman wrote:
>>>> On Tue, Jul 24, 2012 at 2:14 PM,  <merez@xxxxxxxxxxxxxx> wrote:
>>>>> On Mon, July 23, 2012 5:22 am, S, Venkatraman wrote:
>>>>>> On Mon, Jul 23, 2012 at 5:13 PM,  <merez@xxxxxxxxxxxxxx> wrote:
>>>>>>> On Wed, July 18, 2012 12:26 am, Chris Ball wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> [removing Jens and the documentation list, since now we're talking
>>>>>>>> about the MMC side only]
>>>>>>>>
>>>>>>>> On Wed, Jul 18 2012, merez@xxxxxxxxxxxxxx wrote:
>>>>>>>>> Is there anything else that holds this patch from being pushed to
>>>>>>>>> mmc-next?
>>>>>>>>
>>>>>>>> Yes, I'm still uncomfortable with the write packing patchsets for a
>>>>>>>> couple of reasons, and I suspect that the sum of those reasons means
>>>>>>>> that we should probably plan on holding off merging them until after
>>>>>>>> 3.6.
>>>>>>>>
>>>>>>>> Here are the open issues; please correct any misunderstandings:
>>>>>>>>
>>>>>>>> With Seungwon's patchset ("Support packed write command"):
>>>>>>>>
>>>>>>>> * I still don't have a good set of representative benchmarks showing
>>>>>>>>   what kind of performance changes come with this patchset.  It seems
>>>>>>>>   like we've had a small amount of testing on one controller/eMMC
>>>>>>>>   part combo from Seungwon, and an entirely different test from Maya,
>>>>>>>>   and the results aren't documented fully anywhere to the level of
>>>>>>>>   describing what the hardware was, what the test was, and what the
>>>>>>>>   results were before and after the patchset.
>>>>>>>
>>>>>>> Currently, there is only one card vendor that supports packed
>>>>>>> commands. Following are our sequential write (LMDD) test results on
>>>>>>> two of our targets (in MB/s):
>>>>>>>
>>>>>>>                           No packing    Packing
>>>>>>>   Target 1 (SDR 50MHz)        15          25
>>>>>>>   Target 2 (DDR 50MHz)        20          30
>>>>>>>
>>>>>>>> With the reads-during-writes regression:
>>>>>>>>
>>>>>>>> * Venkat still has open questions about the nature of the read
>>>>>>>>   regression, and thinks we should understand it with blktrace before
>>>>>>>>   trying to fix it.  Maya has a theory about writes overwhelming
>>>>>>>>   reads, but Venkat doesn't understand why this would explain the
>>>>>>>>   observed bandwidth drop.
>>>>>>>
>>>>>>> The degradation of reads due to writes is not a new behavior and
>>>>>>> exists also without the write packing feature (which only increases
>>>>>>> the degradation). Our investigation of this phenomenon led us to the
>>>>>>> conclusion that a new scheduling policy should be used for mobile
>>>>>>> devices, but this is not related to the current discussion of the
>>>>>>> write packing feature.
>>>>>>> The write packing feature increases the degradation of reads due to
>>>>>>> writes since it allows the MMC to fetch many write requests in a row,
>>>>>>> instead of fetching only one at a time. Therefore some of the read
>>>>>>> requests will have to wait for the completion of more write requests
>>>>>>> before they can be issued.
>>>>>>
>>>>>> I am a bit puzzled by this claim. One thing I checked carefully when
>>>>>> reviewing the write packing patches from SJeon was that the code
>>>>>> didn't plough through a mixed list of reads and writes and select
>>>>>> only the writes.
>>>>>> This section of the code in "mmc_blk_prep_packed_list()", from the v8
>>>>>> patchset..
>>>>>>
>>>>>> <Quote>
>>>>>> +	if (rq_data_dir(cur) != rq_data_dir(next)) {
>>>>>> +		put_back = 1;
>>>>>> +		break;
>>>>>> +	}
>>>>>> </Quote>
>>>>>>
>>>>>> means that once a read is encountered in the middle of write packing,
>>>>>> the packing is stopped at that point and the pack is executed. Then
>>>>>> the next blk_fetch_request should get the read and continue as before.
>>>>>>
>>>>>> IOW, the ordering of reads and writes is _not_ altered when using
>>>>>> packed commands.
>>>>>> For example, if there were 5 write requests, followed by 1 read,
>>>>>> followed by 5 more write requests in the request_queue, the first 5
>>>>>> writes will be executed as one "packed command", then the read will
>>>>>> be executed, and then the remaining 5 writes will be executed as one
>>>>>> "packed command". So the read does not have to wait any longer than
>>>>>> it waited before the packing feature.
>>>>>
>>>>> Let me try to better explain with your example.
>>>>> Without packing, the MMC layer will fetch 2 write requests and wait
>>>>> for the first write request's completion before fetching another
>>>>> write request. During this time the read request could be inserted
>>>>> into the CFQ, and since it has higher priority than the async writes
>>>>> it will be dispatched in the next fetch. So the result would be 2
>>>>> write requests followed by one read request, and the read would have
>>>>> to wait for the completion of only 2 write requests.
>>>>> With packing, all 5 write requests will be fetched in a row, and then
>>>>> the read will arrive and be dispatched in the next fetch. The read
>>>>> will then have to wait for the completion of 5 write requests.
>>>>>
>>>>> A few more clarifications:
>>>>> Due to the plug-list mechanism in the block layer, applications can
>>>>> "aggregate" several requests to be inserted into the scheduler before
>>>>> waking the MMC queue thread.
>>>>> This leads to a situation where there are several write requests in
>>>>> the CFQ queue when the MMC layer starts to do the fetches.
>>>>>
>>>>> If the read was inserted while we were building the packed command,
>>>>> I agree that we should have seen less effect on the read performance.
>>>>> However, the write packing statistics show that in most of the cases
>>>>> the packing stopped due to an empty queue, meaning that the read was
>>>>> inserted into the CFQ after all the pending write requests were
>>>>> fetched and packed.
>>>>>
>>>>> Following is an example of the write packing statistics for a
>>>>> READ/WRITE parallel scenario:
>>>>>
>>>>> write packing statistics:
>>>>> Packed 1 reqs  - 448 times
>>>>> Packed 2 reqs  - 38 times
>>>>> Packed 3 reqs  - 23 times
>>>>> Packed 4 reqs  - 30 times
>>>>> Packed 5 reqs  - 14 times
>>>>> Packed 6 reqs  - 8 times
>>>>> Packed 7 reqs  - 4 times
>>>>> Packed 8 reqs  - 1 time
>>>>> Packed 10 reqs - 1 time
>>>>> Packed 34 reqs - 1 time
>>>>>
>>>>> Stopped packing due to the following reasons:
>>>>> 2 times:   wrong data direction (meaning a READ was fetched and
>>>>>            stopped the packing)
>>>>> 1 time:    flush or discard
>>>>> 565 times: empty queue (meaning blk_fetch_request returned NULL)
>>>>>
>>>>>> And I requested blktrace output to confirm that this is indeed the
>>>>>> behaviour.
>>>>>
>>>>> The trace logs show that in the no-packing case a maximum of 3-4
>>>>> write requests are issued before a read request, while with packing
>>>>> there are also cases of 6 and 7 requests dispatched before a read
>>>>> request.
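
For readers following the thread, here is a minimal userspace sketch of
the direction-change rule quoted above (all names are invented for
illustration; this is not the patchset code). It shows that with 5
writes, then 1 read, then 5 more writes already sitting in the queue,
packing preserves the dispatch order:

    #include <stdio.h>

    /* Toy stand-in for a queued block request; not the kernel's
     * struct request. dir: 0 = READ, 1 = WRITE. */
    struct toy_req {
        int id;
        int dir;
    };

    /* Pack consecutive same-direction requests; a direction change
     * stops the pack (the differing request is "put back" and starts
     * the next one), mimicking the quoted
     * rq_data_dir(cur) != rq_data_dir(next) check. */
    static void pack_and_dispatch(const struct toy_req *q, int n)
    {
        int i = 0;
        while (i < n) {
            int start = i, dir = q[i].dir;
            while (i < n && q[i].dir == dir)  /* stop on direction change */
                i++;
            printf("packed %-5s x%d (ids %d..%d)\n",
                   dir ? "WRITE" : "READ",
                   i - start, q[start].id, q[i - 1].id);
        }
    }

    int main(void)
    {
        /* Venkat's example: 5 writes, 1 read, 5 more writes. */
        struct toy_req q[11];
        for (int i = 0; i < 11; i++) {
            q[i].id = i;
            q[i].dir = (i == 5) ? 0 : 1;
        }
        pack_and_dispatch(q, 11);  /* WRITE x5, READ x1, WRITE x5 */
        return 0;
    }

The disagreement above is therefore about timing, not ordering: without
packing the read is often inserted into CFQ after only two writes have
been fetched, while with packing all five writes are fetched (and must
complete) before the read is dispatched at all.
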
>>>>>
>>>>> I'm waiting for approval to share the block trace logs.
>>>>> Since this is a simple test to run, you can collect the trace logs
>>>>> and let us know if you reach other conclusions.
>>>>>
>>>> Thanks for the brief. I don't have an eMMC4.5 device with me yet, so
>>>> I can't reproduce the result.
>>>
>>> I sent the trace logs of both packing and non-packing. Please let me
>>> know if you have additional questions after reviewing them.
>>>
>>>> The problem you describe is most likely applicable to any block
>>>> device driver with a large queue depth (any queue depth > 1).
>>>> I'll check to see what knobs in block affect the result.
>>>> Speaking of it, what is the host controller you use to test this?
>>>
>>> The controller I use is msm_sdcc.
>>>
>>>> I was wondering if host->max_seg_size is taken into account while the
>>>> packed command is in use. If not, shouldn't it be? It could act as a
>>>> better throttle for "packing density".
>>>
>>> The max segments value (which is calculated from host->max_seg_size)
>>> is taken into account when preparing the packed list (so that the
>>> whole pack won't exceed the max number of segments).
>>> I'm not sure I understand how host->max_seg_size can be used as a
>>> throttle for "packing density". Can you please explain?
>>>
>> Ok - I overlooked that max_segments is indeed used to limit the number
>> of requests that are packed. (And this corresponds to max_seg_size,
>> which is what I intended.)
>> I should be getting my eMMC4.5 test gear in a couple of days - I'll
>> run it through on some hosts and can either provide more feedback or
>> Ack this patch.
>>
>> Regards,
>> Venkat.
>
> Hi Venkat,
>
> Do you have additional questions/comments?
>
None. I am just running some stress tests on the BKOPS patches right
now, and after tomorrow I'll start testing the packed command.
Will all the patches apply on top of the current mmc-next? If not, it
would be great if you could send an updated version.

Thanks,
   Venkat.
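
To close the loop on the max_seg_size exchange above, here is a similar
sketch of the throttling effect (again with invented names; the real
check lives in mmc_blk_prep_packed_list(), which derives max_segments
from host->max_seg_size). Packing stops when the next request would push
the pack past the host's segment budget, so the segment limit directly
caps packing density:

    #include <stdio.h>

    /* Toy request: nr_segs is the number of scatter-gather segments
     * the request maps to; dir: 0 = READ, 1 = WRITE. */
    struct toy_req {
        int nr_segs;
        int dir;
    };

    /* Count how many queued requests fit in one packed command: stop
     * on a direction change, or when the running segment total would
     * exceed the host's limit. */
    static int build_pack(const struct toy_req *q, int n, int max_segments)
    {
        int packed = 0, segs = 0;

        for (int i = 0; i < n; i++) {
            if (q[i].dir != q[0].dir)                /* direction change */
                break;
            if (segs + q[i].nr_segs > max_segments)  /* budget exceeded */
                break;
            segs += q[i].nr_segs;
            packed++;
        }
        return packed;  /* requests in this packed command */
    }

    int main(void)
    {
        struct toy_req q[] = {
            {32, 1}, {32, 1}, {32, 1}, {32, 1}, {32, 1},
        };
        /* With a budget of 128 segments, only 4 of the 5 writes fit. */
        printf("packed %d requests\n", build_pack(q, 5, 128));
        return 0;
    }

A smaller max_seg_size (hence a smaller max_segments) shrinks the pack,
which is presumably the throttle Venkat was asking about.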