From: "Hatayama, Daisuke/?? ??" <d.hatayama@xxxxxxxxxxxxxx> Subject: Re: [PATCH] makedumpfile: --split: assign fair I/O workloads for each process Date: Tue, 25 Mar 2014 14:52:36 +0900 > > > (2014/03/25 10:14), Atsushi Kumagai wrote: >>> From: HATAYAMA Daisuke <d.hatayama at jp.fujitsu.com> >>> >>> When --split option is specified, fair I/O workloads should be >>> assigned for each process to maximize amount of performance >>> optimization by parallel processing. >>> >>> However, the current implementation of setup_splitting() in cyclic >>> mode doesn't care about filtering at all; I/O workloads for each >>> process could be biased easily. >>> >>> This patch deals with the issue by implementing the fair I/O workload >>> assignment as setup_splitting_cyclic(). >>> >>> Note: If --split is specified in cyclic mode, we do filtering three >>> times: in get_dumpable_pages_cyclic(), in setup_splitting_cyclic() and >>> in writeout_dumpfile(). Filtering takes about 10 minutes on system >>> with huge memory according to the benchmark on the past, so it might >>> be necessary to optimize filtering or setup_filtering_cyclic(). >> >> Sorry, I lost the result of that benchmark, could you give me the URL? >> I'd like to confirm that the advantage of fair I/O will exceed the >> 10 minutes disadvantage. >> > > Here are two benchmarks by Jingbai Ma and myself. > > http://lists.infradead.org/pipermail/kexec/2013-March/008515.html > http://lists.infradead.org/pipermail/kexec/2013-March/008517.html > > > Note that Jingbai Ma's results are sum of get_dumpable_cyclic() and writeout_dumpfile(), so apparently it looks twice larger than mine, but actually they show almost same performance. > > In summary, each result shows about 40 seconds per 1TiB. So, most of systems is not affected very much. On 12TiB memory, which is the current maximum memory size of Fujitsu system, we needs 480 seconds == 8 minutes more. But this is stable in the sense that time never become long suddenly in some rare worst case, so it seems to me optimistic in this sense. > > The other ideas to deal with the issue are: > > - paralellize the counting up processes. But it might be difficult to paralellize the 2nd pass, which seems inherently serial processing. > > - Insead of doing the 2nd pass, make the terminating proces join to still running process. But it might be combersome to implement this not using pthread. > I noticed that it's able to create a table of dumpable pages with a relatively small amount of memory by manging a memory as blocks. This is just kind of a page table management. For example, define a block 1 GiB boundary region and assume a system with 64 TiB physical memory (which is current maximum on x86_64). Then, 64 TiB / 1 GiB = 64 Ki blocks A table we consdier here have the number of dumpable pages for each 1 GiB boundary in each entry of 8 bytes. So, total size of the table is: 8 B * 64 Ki blocks = 512 KiB Counting up dumpable pages in each GiB boundary can be done by 1 pass only; get_dumpable_cyclic() does that too. Then, it's assign amount of I/O to each process fairly enough. The difference is at most 1 GiB. If disk speed is 100 MiB/sec, 1 GiB corresponds to about 10 seconds only. If you think 512 KiB not small enough, it would be able to increase block size a little more. If choosing 8 GiB block, table size is 64 KiB only, and the 8 GiB data corresponds to about 80 seconds on typical disks. How do you think this? Thanks. HATAYAMA, Daisuke