Hi Phillip,

Gentle ping.

Regards,
Xiongwei

On Fri, Jul 15, 2022 at 9:45 AM Xiongwei Song <sxwjean@xxxxxxxxx> wrote:
>
> Hi Phillip,
>
> Sorry for providing my test info so late.
>
> On Fri, Jun 10, 2022 at 3:42 PM Phillip Lougher <phillip@xxxxxxxxxxxxxxx> wrote:
> >
> > On 09/06/2022 15:46, Xiongwei Song wrote:
> > > This version is bad for my test. I ran the test below
> > > "for cnt in $(seq 0 9); do echo 3 > /proc/sys/vm/drop_caches; echo
> > > "Loop ${cnt}:"; time -v find /software/test[0-9][0-9] | xargs -P 24 -i
> > > cat {} > /dev/null 2>/dev/null; echo ""; done"
> > > in 90 partitions.
> > >
> > > With 9eec1d897139 reverted:
> > > 1:06.18 (1m + 6.18s)
> > > 1:05.65
> > > 1:06.34
> > > 1:06.88
> > > 1:06.52
> > > 1:06.78
> > > 1:06.61
> > > 1:06.99
> > > 1:06.60
> > > 1:06.79
> > >
> > > With this version:
> > > 2:36.85 (2m + 36.85s)
> > > 2:28.89
> > > 1:43.46
> > > 1:41.50
> > > 1:42.75
> > > 1:43.46
> > > 1:43.67
> > > 1:44.41
> > > 1:44.91
> > > 1:45.44
> > >
> > > Any thoughts?
> >
> > Thank you for your latest test results; they tend to imply that the
> > latest version of the patch hasn't improved performance in your
> > use-case.
> >
> > One thing which is becoming clear here is that the devil is in the
> > detail, and your results, being summaries, are not capturing enough
> > detail to understand what is happening. They show something is wrong,
> > but don't give any guidance as to what is happening.
> >
> > I think it will be difficult to capture more detail from your test
> > case. But detail can be captured from summaries, by varying the input
> > and extrapolating from the results.
> >
> > By that I mean: have you tried changing anything, and observed any
> > changed results?
> >
> > For instance, have you tried any of the following?
> >
> > 1. Changing the parallelism of your test from 24 read threads.
> >    Do 1, 2, 4, etc. parallel read threads change the observed
> >    behaviour? In other words, is the slow-down observed across all
> >    degrees of parallelism, or is there a critical point?
>
> Please see the test results below, which are from my colleague Xiaohong Qi:
>
> I tested file sizes from 256KB to 5120KB with 1, 2, 4, 8, 16, 24 and 32
> threads (each run ten times, taking the average). The read performance
> is shown below. The difference in read performance between the 4.18
> kernel and 5.10 (with the squashfs_readahead() patch v7) seems to be
> caused by files smaller than 256KB.
>
>                    T1        T2        T4        T8        T16       T24       T32
>
> All file sizes
> 4.18           136.8642  100.479    96.5523   96.1569   96.204    96.0587   96.0519
> 5.10-v7        138.474   103.1351   99.9192   99.7091   99.7894  100.2034  100.4447
> Delta            1.6098    2.6561    3.3669    3.5522    3.5854    4.1447    4.3928
>
> Fsize < 256KB
> 4.18            21.7949   14.6959   11.639    10.5154   10.14     10.1092   10.1425
> 5.10-v7         23.8629   16.2483   13.1475   12.3697   12.1985   12.8799   13.3292
> Delta            2.068     1.5524    1.5085    1.8543    2.0585    2.7707    3.1867
>
> 256KB < Fsize < 512KB
> 4.18            11.8042    7.9228    7.6891    7.7924    7.8181    7.8548    7.8496
> 5.10-v7         12.0505    8.2506    8.1557    8.156     8.16      8.1577    8.1611
> Delta            0.2463    0.3278    0.4666    0.3636    0.3419    0.3029    0.3115
>
> 512KB < Fsize < 1024KB
> 4.18             7.7806    5.5496    5.496     5.4912    5.4897    5.4883    5.6602
> 5.10-v7          8.1283    5.8784    5.8486    5.8505    5.8523    5.8511    5.856
> Delta            0.3477    0.3288    0.3526    0.3593    0.3626    0.3628    0.1958
>
> 1024KB < Fsize < 1536KB
> 4.18            10.2686    7.5294    7.5012    7.4902    7.4855    7.4858    7.4851
> 5.10-v7         10.5289    7.8486    7.8502    7.8477    7.849     7.8482    7.8542
> Delta            0.2603    0.3192    0.349     0.3575    0.3635    0.3624    0.3691
>
> 1536KB < Fsize < 2048KB
> 4.18             5.6439    4.0588    3.9974    3.9946    3.9949    3.9942    3.9925
> 5.10-v7          6.2263    4.6009    4.6062    4.6069    4.6078    4.6074    4.6099
> Delta            0.5824    0.5421    0.6088    0.6123    0.6129    0.6132    0.6174
>
> 2048KB < Fsize < 5120KB
> 4.18            34.9166   28.7944   28.7355   28.7192   28.7046   28.6976   28.69
> 5.10-v7         33.8689   27.9726   27.9747   27.9801   27.9849   27.9855   27.9915
> Delta           -1.0477   -0.8218   -0.7608   -0.7391   -0.7197   -0.7121   -0.6985
>
> Fsize > 5120KB
> 4.18            45.6575   33.8609   33.7512   33.7349   33.7196   33.7166   33.708
> 5.10-v7         45.3494   34.0473   34.0443   34.0692   34.0635   34.0622   34.0599
> Delta           -0.3081    0.1864    0.2931    0.3343    0.3439    0.3456    0.3519
>
> (T1 means testing with 1 thread; file size unit: KB; time unit: seconds;
> 5.10-v7 means we backported the squashfs_readahead() v7 patchset onto
> Linux 5.10.)
>
> The commands used to test are like:
> echo 3 > /proc/sys/vm/drop_caches; sleep 3; time -v find /test/ -type
> f -size -256k | xargs -P 32 -i cat {} > /dev/null 2>/dev/null
> echo 3 > /proc/sys/vm/drop_caches; sleep 3; time -v find /test/ -type
> f -size +256k -size -512k | xargs -P 32 -i cat {} > /dev/null
> 2>/dev/null
>
> >
> > 2. Do the Squashfs parallelism options in the kernel configuration
> >    change the behaviour? Knowing if the number of "decompressors"
> >    available changes the difference in performance could be important.
>
> In our environment, CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU is enabled.
> There are 12 CPUs on our board. We also tried enabling
> CONFIG_SQUASHFS_DECOMP_MULTI and reading files with 2/4/6/8/12/16/24/32
> threads; the performance was not improved and was even a bit worse.
>
> >
> > 3. Are your Squashfs filesystems built using fragments, or without
> >    fragments? Rebuilding the filesystems without fragments, and
> >    observing any different performance, would help to pinpoint
> >    where the issue lies.
>
> We didn't use the "-no-fragments" option when building the squashfs
> image. The steps to build the squashfs partition are:
> a. mksquashfs /lib64/ test.squash
> b. lvcreate -L 24M /dev/vg0 -n test -y
> c. dd if=/root/test.squash of=/dev/vg0/test
> d. mount -t squashfs /dev/vg0/test xxx
>
> When using "-no-fragments", the performance is much worse than with
> fragments. As you can see, the test files are from /lib64, and most of
> them are small files.
>
> >
> > 4. What is the block size used in your Squashfs filesystems? Have
> >    you tried changing the block size, and seen what effect it has
> >    on the difference in performance between the patches?
>
> We set CONFIG_SQUASHFS_4K_DEVBLK_SIZE to "y", so the device block size
> should be 4K. We didn't try other block sizes because we have identical
> squashfs configs on 4.18 and 5.10.
>
> >
> > 5. You don't mention where your Squashfs filesystems are stored.
> >    Is this slow media or fast media?
>
> Please see the info for the disk we are testing on:
> """
> $ hdparm -I /dev/sda1
>
> /dev/sda1:
> SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00
> 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> ATA device, with non-removable media
> Standards:
> Likely used: 1
> Configuration:
> Logical         max     current
> cylinders       0       0
> heads           0       0
> sectors/track   0       0
> --
> Logical/Physical Sector size: 512 bytes
> device size with M = 1024*1024: 0 MBytes
> device size with M = 1000*1000: 0 MBytes
> cache/buffer size = unknown
> Capabilities:
> IORDY not likely
> Cannot perform double-word IO
> R/W multiple sector transfer: not supported
> DMA: not supported
> PIO: pio0
> """
>
> >    Have you tried moving the Squashfs filesystems onto different
> >    media and observed any difference in performance between the
> >    patches?
>
> Sorry, I still didn't get a chance to test on other media.
>
> >
> > The fact of the matter is there are many overlapping factors which
> > affect the performance of Squashfs filesystems (like any reasonably
> > complex code), and the cause may lie elsewhere. It can only take a
> > small change somewhere to have a dramatic effect on performance.
>
> We found that performance improved when we reran our test after
> remaking the partitions with my steps in item 3 above. The following
> data shows the elapsed times of squashfs_readahead() when reading files
> before remaking the partitions (i.e. after the test command had already
> been run many times) and after remaking them. I captured the data below
> with ftrace:
>
> For a 14K file:
> Before partitions remade    After partitions remade:
> 4352.306 us                 3943.846 us
> 4321.176 us                 3929.255 us
>
> For a 1.8M file:
> Before partitions remade    After partitions remade:
> 17446.73 us                 16506.58 us
> 17446.73 us                 16201.32 us
> 18465.38 us                 17548.96 us
> 12269.78 us                 11939.09 us
> 9627.990 us                  9167.052 us
>
> As you can see, the elapsed times of squashfs_readahead() are
> significantly reduced on freshly made partitions. We hit the same
> problem on Linux 4.18.
>
> By the way, I think the test results I previously sent out in the v5
> thread are also related to whether the partitions had been remade:
> https://lore.kernel.org/lkml/20220606150305.1883410-1-hsinyi@xxxxxxxxxxxx/T/#m5f3f8386eb8b72a1f63b60be37ea2cc6d03c5f84
>
> >
> > This is particularly the case with embedded systems, which may be
> > short on CPU performance, short on RAM, and have low performance
> > media, and be effectively operating on the "edge". It can only take
> > a small change, an update for instance, to change from performing
> > well to badly.
>
> I checked the CPU usage; it's not over 11%. The RAM is also sufficient:
>               total        used        free      shared  buff/cache   available
> Mem:       15837684      531420    11051344      262080     4254920    14858224
> Swap:             0
>
> Regards,
> Xiongwei
>
> >
> > I speak from experience, having spent over ten years in embedded
> > Linux as a senior engineer and then as a consultant. I have my own
> > horror tales as a consultant, dealing with systems pushed beyond the
> > edge (with hacks), and the customer insisting they didn't do anything
> > to cause the system to finally break.
> >
> > Maybe it is off topic here.
> > But I remember one instance where a customer had a system out in the
> > field, which "inexplicably" started to lock up every 6 months or so.
> > This system had regular updates "over the air", and I discovered the
> > "lock up" only started happening after the latest update. It turns out
> > the new version of the application had grown a new feature which
> > needed more RAM than normal. This feature wasn't used very often, but
> > if it coincided with an infrequent "house-keeping" background task,
> > the system ran out of memory and locked up (they had disabled the OOM
> > killer). This was so rare it might only coincide after six months. No
> > bug, but a slow growth in working-set RAM over a number of versions.
> >
> > In other words, we may be looking at a knock-on side effect of
> > readahead, which is either caused by issues elsewhere or is causing
> > issues elsewhere.
> >
> > Dealing with it in isolation, as a bug in the readahead code, is going
> > to get us nowhere, looking for something that isn't there.
> >
> > I'm not saying that this is the case here. But the more detail you can
> > provide, and the more test variants you can try, the easier it will be
> > to determine what the problem is.
> >
> > Thanks
> >
> > Phillip
> >
> > >
> > > Regards,
> > > Xiongwei
> > >
> > > On Mon, Jun 6, 2022 at 11:03 PM Hsin-Yi Wang <hsinyi@xxxxxxxxxxxx> wrote:
> > >>
> > >> Implement readahead callback for squashfs. It will read datablocks
> > >> which cover pages in readahead request. For a few cases it will
> > >> not mark page as uptodate, including:
> > >> - file end is 0.
> > >> - zero filled blocks.
> > >> - current batch of pages isn't in the same datablock.
> > >> - decompressor error.
> > >> Otherwise pages will be marked as uptodate. The unhandled pages will be
> > >> updated by readpage later.
> > >>
> > >> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > >> Signed-off-by: Hsin-Yi Wang <hsinyi@xxxxxxxxxxxx>
> > >> Reported-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > >> Reported-by: Phillip Lougher <phillip@xxxxxxxxxxxxxxx>
> > >> Reported-by: Xiongwei Song <Xiongwei.Song@xxxxxxxxxxxxx>
> > >> Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
> > >> Reported-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > >> ---
> > >> v4->v5:
> > >> - Handle short file cases reported by Marek and Matthew.
> > >> - Fix checkpatch error reported by Andrew.
> > >>
> > >> v4: https://lore.kernel.org/lkml/20220601103922.1338320-4-hsinyi@xxxxxxxxxxxx/
> > >> v3: https://lore.kernel.org/lkml/20220523065909.883444-4-hsinyi@xxxxxxxxxxxx/
> > >> v2: https://lore.kernel.org/lkml/20220517082650.2005840-4-hsinyi@xxxxxxxxxxxx/
> > >> v1: https://lore.kernel.org/lkml/20220516105100.1412740-3-hsinyi@xxxxxxxxxxxx/
> > >> ---
> > >>  fs/squashfs/file.c | 124 ++++++++++++++++++++++++++++++++++++++++++++-
> > >>  1 file changed, 123 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
> > >> index a8e495d8eb86..fbd096cd15f4 100644
> > >> --- a/fs/squashfs/file.c
> > >> +++ b/fs/squashfs/file.c
> > >> @@ -39,6 +39,7 @@
> > >>  #include "squashfs_fs_sb.h"
> > >>  #include "squashfs_fs_i.h"
> > >>  #include "squashfs.h"
> > >> +#include "page_actor.h"
> > >>
> > >>  /*
> > >>   * Locate cache slot in range [offset, index] for specified inode. If
> > >> @@ -495,7 +496,128 @@ static int squashfs_read_folio(struct file *file, struct folio *folio)
> > >>         return 0;
> > >>  }
> > >>
> > >> +static void squashfs_readahead(struct readahead_control *ractl)
> > >> +{
> > >> +       struct inode *inode = ractl->mapping->host;
> > >> +       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
> > >> +       size_t mask = (1UL << msblk->block_log) - 1;
> > >> +       unsigned short shift = msblk->block_log - PAGE_SHIFT;
> > >> +       loff_t start = readahead_pos(ractl) & ~mask;
> > >> +       size_t len = readahead_length(ractl) + readahead_pos(ractl) - start;
> > >> +       struct squashfs_page_actor *actor;
> > >> +       unsigned int nr_pages = 0;
> > >> +       struct page **pages;
> > >> +       int i, file_end = i_size_read(inode) >> msblk->block_log;
> > >> +       unsigned int max_pages = 1UL << shift;
> > >> +
> > >> +       readahead_expand(ractl, start, (len | mask) + 1);
> > >> +
> > >> +       if (file_end == 0)
> > >> +               return;
> > >> +
> > >> +       pages = kmalloc_array(max_pages, sizeof(void *), GFP_KERNEL);
> > >> +       if (!pages)
> > >> +               return;
> > >> +
> > >> +       actor = squashfs_page_actor_init_special(pages, max_pages, 0);
> > >> +       if (!actor)
> > >> +               goto out;
> > >> +
> > >> +       for (;;) {
> > >> +               pgoff_t index;
> > >> +               int res, bsize;
> > >> +               u64 block = 0;
> > >> +               unsigned int expected;
> > >> +
> > >> +               nr_pages = __readahead_batch(ractl, pages, max_pages);
> > >> +               if (!nr_pages)
> > >> +                       break;
> > >> +
> > >> +               if (readahead_pos(ractl) >= i_size_read(inode))
> > >> +                       goto skip_pages;
> > >> +
> > >> +               index = pages[0]->index >> shift;
> > >> +               if ((pages[nr_pages - 1]->index >> shift) != index)
> > >> +                       goto skip_pages;
> > >> +
> > >> +               expected = index == file_end ?
> > >> +                          (i_size_read(inode) & (msblk->block_size - 1)) :
> > >> +                          msblk->block_size;
> > >> +
> > >> +               bsize = read_blocklist(inode, index, &block);
> > >> +               if (bsize == 0)
> > >> +                       goto skip_pages;
> > >> +
> > >> +               if (nr_pages < max_pages) {
> > >> +                       struct squashfs_cache_entry *buffer;
> > >> +                       unsigned int block_mask = max_pages - 1;
> > >> +                       int offset = pages[0]->index - (pages[0]->index & ~block_mask);
> > >> +
> > >> +                       buffer = squashfs_get_datablock(inode->i_sb, block,
> > >> +                                                       bsize);
> > >> +                       if (buffer->error) {
> > >> +                               squashfs_cache_put(buffer);
> > >> +                               goto skip_pages;
> > >> +                       }
> > >> +
> > >> +                       expected -= offset * PAGE_SIZE;
> > >> +                       for (i = 0; i < nr_pages && expected > 0; i++,
> > >> +                                       expected -= PAGE_SIZE, offset++) {
> > >> +                               int avail = min_t(int, expected, PAGE_SIZE);
> > >> +
> > >> +                               squashfs_fill_page(pages[i], buffer,
> > >> +                                               offset * PAGE_SIZE, avail);
> > >> +                               unlock_page(pages[i]);
> > >> +                       }
> > >> +
> > >> +                       squashfs_cache_put(buffer);
> > >> +                       continue;
> > >> +               }
> > >> +
> > >> +               res = squashfs_read_data(inode->i_sb, block, bsize, NULL,
> > >> +                                        actor);
> > >> +
> > >> +               if (res == expected) {
> > >> +                       int bytes;
> > >> +
> > >> +                       /* Last page may have trailing bytes not filled */
> > >> +                       bytes = res % PAGE_SIZE;
> > >> +                       if (bytes) {
> > >> +                               void *pageaddr;
> > >> +
> > >> +                               pageaddr = kmap_atomic(pages[nr_pages - 1]);
> > >> +                               memset(pageaddr + bytes, 0, PAGE_SIZE - bytes);
> > >> +                               kunmap_atomic(pageaddr);
> > >> +                       }
> > >> +
> > >> +                       for (i = 0; i < nr_pages; i++) {
> > >> +                               flush_dcache_page(pages[i]);
> > >> +                               SetPageUptodate(pages[i]);
> > >> +                       }
> > >> +               }
> > >> +
> > >> +               for (i = 0; i < nr_pages; i++) {
> > >> +                       unlock_page(pages[i]);
> > >> +                       put_page(pages[i]);
> > >> +               }
> > >> +       }
> > >> +
> > >> +       kfree(actor);
> > >> +       kfree(pages);
> > >> +       return;
> > >> +
> > >> +skip_pages:
> > >> +       for (i = 0; i < nr_pages; i++) {
> > >> +               unlock_page(pages[i]);
> > >> +               put_page(pages[i]);
> > >> +       }
> > >> +
> > >> +       kfree(actor);
> > >> +out:
> > >> +       kfree(pages);
> > >> +}
> > >>
> > >>  const struct address_space_operations squashfs_aops = {
> > >> -       .read_folio = squashfs_read_folio
> > >> +       .read_folio = squashfs_read_folio,
> > >> +       .readahead = squashfs_readahead
> > >>  };
> > >> --
> > >> 2.36.1.255.ge46751e96f-goog
> > >>
> > >>
> >
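
For reference, the per-call squashfs_readahead() elapsed times quoted earlier
in the thread were captured with ftrace. Below is a minimal sketch of one way
to take that kind of measurement with the function_graph tracer; the mount
point /test and the file name are assumptions for illustration, since the
thread does not give the exact commands that were used:

  cd /sys/kernel/tracing                         # tracefs mount point
  echo 0 > tracing_on                            # stop tracing while configuring
  echo squashfs_readahead > set_ftrace_filter    # trace only this function
  echo function_graph > current_tracer           # record per-call duration
  echo 3 > /proc/sys/vm/drop_caches              # drop the page cache first
  echo 1 > tracing_on
  cat /test/some-file > /dev/null                # trigger readahead on a squashfs file
  echo 0 > tracing_on
  grep squashfs_readahead trace                  # per-call elapsed times in microseconds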