On 2012-04-09 9:35 PM, Arnd Bergmann wrote:
> On Monday 09 April 2012, Minchan Kim wrote:
>> On 2012-04-07 1:16 AM, Arnd Bergmann wrote:
>>
>>> larger chunks would generally be helpful; in order to guarantee that
>>> the drive doesn't do any garbage collection, we would have to do all writes
>>
>> And we should guarantee that, to avoid unnecessary swapout or even OOM killing.
>>
>>> in aligned chunks. It would probably be enough to do this in 8kb or
>>> 16kb units for most devices over the next few years, but implementing it
>>> for 64kb should be the same amount of work and will get us a little bit
>>> further.
>>
>> I understand from your statement that writing 64K is best.
>> What about 8K and 16K? Could you elaborate on the relation between
>> 8K, 16K and 64K?
>
> From my measurements, there are three sizes that are relevant here:
>
> 1. The underlying page size of the flash: This used to be less than 4kb,
> which is fine when paging out 4kb mmu pages, as long as the partition is
> aligned. Today, most devices use 8kb pages and the number is increasing
> over time, meaning we will see more 16kb page devices in the future and
> presumably larger sizes after that. Writes that are not naturally aligned
> multiples of the page size tend to be a significant problem for the
> controller to deal with: in order to guarantee that a 4kb write makes it
> into permanent storage, the device has to write 8kb, and the next 4kb
> write has to go into another 8kb page, because each page can only be
> written once before the block is erased. At a later point, all the partial
> pages get rewritten into a new erase block, a process that can take
> hundreds of milliseconds and that we absolutely want to prevent from
> happening, as it can block all other I/O to the device. Writing all
> (flash) pages in an erase block sequentially usually avoids this, as
> long as you don't write to too many different erase blocks at the same time.
> Note that the page size depends on how the controller combines different
> planes and channels.
>
> 2. The super-page size of the flash: When you have multiple channels
> between the controller and the individual flash chips, you can write
> multiple pages simultaneously, which means that e.g. sending 32kb of
> data to the device takes roughly the same amount of time as writing a
> single 8kb page. Writing less than the super-page size when there is
> more data waiting to get written out is a waste of time, although the
> effect is much less drastic than with writing data that is not aligned
> to pages, because it does not require garbage collection.
>
> 3. Optimum write size: While writing larger amounts of data in a single
> request is usually faster than writing less, almost all devices
> I've seen have a sharp cut-off where increasing the size of the write
> does not actually help any more because of a bottleneck somewhere
> in the stack. Writing more than 64kb almost never improves performance
> and sometimes reduces performance.

For our understanding, you mean we should prefer aligned writes as
follows, if possible?

  "NAND internal page-size write (8K, 16K)"
    < "super-page-size write (32K), which exploits the parallelism of
       the available channels and planes"
    < "some sequential big write (64K)"

> From the measurements I've done, a typical profile could look like:
>
> Size    Throughput
> 1KB     200KB/s
> 2KB     450KB/s
> 4KB     1MB/s
> 8KB     4MB/s   <== page size
> 16KB    8MB/s
> 32KB    16MB/s  <== superpage size
> 64KB    18MB/s  <== optimum size
> 128KB   17MB/s
> ...
> 8MB     18MB/s  <== erase block size
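If I read your numbers correctly, the decision for a batch of pending
swap-out pages could look roughly like the sketch below. This is only a
userspace illustration: the constants mirror your example profile
(8K page, 32K superpage, 64K optimum), and pick_write_size is a made-up
name, not an existing kernel interface.

#include <stdio.h>

#define MMU_PAGE   (4 * 1024)   /* one swapped-out anon page */
#define FLASH_PAGE (8 * 1024)   /* NAND internal page size (assumed) */
#define SUPERPAGE  (32 * 1024)  /* channels x planes x FLASH_PAGE (assumed) */
#define OPTIMUM    (64 * 1024)  /* point where throughput flattens (assumed) */

/* Round the pending byte count down to the largest useful aligned size. */
static unsigned int pick_write_size(unsigned int pending)
{
	if (pending >= OPTIMUM)
		return OPTIMUM;                     /* best case: one 64K request */
	if (pending >= SUPERPAGE)
		return pending & ~(SUPERPAGE - 1);  /* keep all channels busy */
	if (pending >= FLASH_PAGE)
		return pending & ~(FLASH_PAGE - 1); /* at least avoid partial flash pages */
	return 0;                                   /* wait for more pages, or pad */
}

int main(void)
{
	unsigned int pages[] = { 3, 9, 17, 40 };
	unsigned int i;

	for (i = 0; i < sizeof(pages) / sizeof(pages[0]); i++) {
		unsigned int bytes = pages[i] * MMU_PAGE;

		printf("%2u pages pending -> write %u KB\n",
		       pages[i], pick_write_size(bytes) / 1024);
	}
	return 0;
}

So with 9 pending 4K pages we would write 32K now and keep the remaining
4K queued, unless the deadline I mention below expires first.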
>>> I'm not sure what we would do when there are less than 64kb available
>>> for pageout on the inactive list. The two choices I can think of are
>>> either not writing anything, or wasting the swap slots and filling
>>
>> Not writing anything will cause many unnecessary pages to be swapped
>> out by the next priority of scanning, and we can't guarantee how long
>> we would have to wait to queue up 64KB of anon pages. It might take
>> longer than the GC time, so we need some deadline.
>>
>>> up the data with zeroes.
>>
>> Zero padding would be a good solution, but I have a concern about WAP,
>> so we need a smart policy.
>>
>> To be honest, I think swap-out is normally an asynchronous operation,
>> so it should not affect system latency as much as swap read, which is
>> a synchronous operation. So if the system is under low memory pressure,
>> we can queue swap-out pages up to 64KB and then batch-write them out
>> into an empty cluster. If we don't have any empty cluster under low
>> memory pressure, we should write them out into a partial cluster.
>> Maybe that doesn't affect system latency severely under low memory
>> pressure.
>
> The main thing that can affect system latency is garbage collection
> that blocks any other reads or writes for an extended amount of time.
> If we can avoid that, we've got the 95% solution.

I see.

> Note that eMMC-4.5 provides a high-priority interrupt mechanism that
> lets us interrupt a write that has hit the garbage collection
> path, so we can send a more important read request to the device.
> This will not work on other devices though, and the patches for this
> are still under discussion.

Nice feature, but I don't think the swap system needs to consider it.
It should be handled by the I/O subsystem, e.g. the I/O scheduler.

>> If system memory pressure is high (and it should not be frequent),
>> swap-out bandwidth becomes more important. So we can reserve some
>> clusters for it, and I think we can use the page padding you mentioned
>> in this case to reduce latency, if we can queue up to 64KB within a
>> threshold time.
>>
>> Swap read is also important. We have to investigate the fragmentation
>> of swap slots, because we disable swap readahead on non-rotational
>> devices. That can leave lots of holes in swap clusters and make it
>> hard to find an empty cluster. So it might be better to enable swap
>> readahead on non-rotational devices, too.
>
> Yes, reading in up to 64kb or at least a superpage would also help here,
> although there is no problem reading in a single cpu page: it will still
> take no more time than reading in a superpage.
>
>>>>> 2) Make variable sized swap clusters. Right now, the swap space is
>>>>> organized in clusters of 256 pages (1MB), which is less than the typical
>>>>> erase block size of 4 or 8 MB. We should try to make the swap cluster
>>>>> aligned to erase blocks and have the size match, to avoid garbage collection
>>>>> in the drive. The cluster size would typically be set by mkswap as a new
>>>>> option and interpreted at swapon time.
>>>>
>>>> If we can find such big contiguous swap slots easily, it would be good.
>>>> But I am not sure how often we can get such big slots. And maybe we
>>>> have to improve the search method for finding such a big empty cluster.
>>>
>>> As long as there are clusters available, we should try to find them. When
>>> free space is too fragmented to find any unused cluster, we can pick one
>>> that has very little data in it, so that we reduce the time it takes to
>>> GC that erase block in the drive. While we could theoretically do active
>>> garbage collection of swap data in the kernel, it won't get more efficient
>>> than the GC inside of the drive. If we do this, it unfortunately means that
>>> we can't just send a discard for the entire erase block.
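Just to check that I understand the search you describe, here is a rough
userspace sketch; SLOTS_PER_CLUSTER, struct swap_area and
pick_swap_cluster are invented for illustration and do not correspond to
the kernel's actual swap_info_struct layout.

#include <stddef.h>

#define SLOTS_PER_CLUSTER 2048  /* assumed: 8MB erase block / 4KB swap slot */

struct swap_area {
	const unsigned char *used;  /* one byte per slot: 0 = free, 1 = in use */
	size_t nr_clusters;
};

/* Return the index of the chosen cluster, or -1 if every slot is in use. */
long pick_swap_cluster(const struct swap_area *si)
{
	long best = -1;
	size_t best_used = SLOTS_PER_CLUSTER;
	size_t c, i;

	for (c = 0; c < si->nr_clusters; c++) {
		const unsigned char *slot = si->used + c * SLOTS_PER_CLUSTER;
		size_t used = 0;

		for (i = 0; i < SLOTS_PER_CLUSTER; i++)
			used += slot[i] ? 1 : 0;

		if (used == 0)
			return (long)c;     /* empty cluster: no GC cost at all */

		if (used < best_used) {     /* remember the least-populated one */
			best_used = used;
			best = (long)c;
		}
	}
	return best;
}

A linear scan like this is obviously too slow for a large swap area, so
the real search method would need something like per-cluster usage
counters, which is the improvement I had in mind above.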
>> We might need some compaction during idle time, but the WAP concern is
>> raised again. :(
>
> Sorry for my ignorance, but what does WAP stand for?

I should have used a more general term. I meant write amplification, but
WAF (Write Amplification Factor) is more popular. :(

> 	Arnd

-- 
Kind regards,
Minchan Kim