Hi Phillip,

Gentle ping.

Regards,
Xiongwei

On Fri, Jul 15, 2022 at 9:45 AM Xiongwei Song <sxwjean@xxxxxxxxx> wrote:
>
> Hi Phillip,
>
> Sorry for providing my test info so late.
>
> On Fri, Jun 10, 2022 at 3:42 PM Phillip Lougher <phillip@xxxxxxxxxxxxxxx> wrote:
> >
> > On 09/06/2022 15:46, Xiongwei Song wrote:
> > > This version is bad for my test. I ran the test below
> > > "for cnt in $(seq 0 9); do echo 3 > /proc/sys/vm/drop_caches; echo
> > > "Loop ${cnt}:"; time -v find /software/test[0-9][0-9] | xargs -P 24 -i
> > > cat {} > /dev/null 2>/dev/null; echo ""; done"
> > > in 90 partitions.
> > >
> > > With 9eec1d897139 reverted:
> > > 1:06.18 (1m + 6.18s)
> > > 1:05.65
> > > 1:06.34
> > > 1:06.88
> > > 1:06.52
> > > 1:06.78
> > > 1:06.61
> > > 1:06.99
> > > 1:06.60
> > > 1:06.79
> > >
> > > With this version:
> > > 2:36.85 (2m + 36.85s)
> > > 2:28.89
> > > 1:43.46
> > > 1:41.50
> > > 1:42.75
> > > 1:43.46
> > > 1:43.67
> > > 1:44.41
> > > 1:44.91
> > > 1:45.44
> > >
> > > Any thoughts?
> >
> > Thank you for your latest test results; they tend to imply that the
> > latest version of the patch hasn't improved performance in your
> > use-case.
> >
> > One thing which is becoming clear here is that the devil is in the
> > detail, and your results, being summaries, are not capturing enough
> > detail to understand what is happening. They show something is wrong,
> > but don't give any guidance as to what is happening.
> >
> > I think it will be difficult to capture more detail from your test
> > case. But detail can be captured from summaries, by varying the input
> > and extrapolating from the results.
> >
> > By that I mean: have you tried changing anything, and observed any
> > changed results?
> >
> > For instance, have you tried any of the following?
> >
> > 1. Changing the parallelism of your test from 24 read threads.
> >    Do 1, 2, 4, etc. parallel read threads change the observed
> >    behaviour? In other words, is the slow-down observed across all
> >    degrees of parallelism, or is there a critical point?
>
> Please see the test results below, which are from my colleague Xiaohong Qi:
>
> I tested file sizes from 256KB to 5120KB with 1, 2, 4, 8, 16, 24 and 32
> threads (each run ten times, taking the average). The read performance
> is shown below. The difference in read performance between the 4.18
> kernel and 5.10 (with the squashfs_readahead() patch v7) seems to be
> caused by files smaller than 256KB.
>
>                    T1        T2        T4        T8        T16       T24       T32
>
> All file sizes
> 4.18           136.8642  100.479    96.5523   96.1569   96.204    96.0587   96.0519
> 5.10-v7        138.474   103.1351   99.9192   99.7091   99.7894  100.2034  100.4447
> Delta            1.6098    2.6561    3.3669    3.5522    3.5854    4.1447    4.3928
>
> Fsize < 256KB
> 4.18            21.7949   14.6959   11.639    10.5154   10.14     10.1092   10.1425
> 5.10-v7         23.8629   16.2483   13.1475   12.3697   12.1985   12.8799   13.3292
> Delta            2.068     1.5524    1.5085    1.8543    2.0585    2.7707    3.1867
>
> 256KB < Fsize < 512KB
> 4.18            11.8042    7.9228    7.6891    7.7924    7.8181    7.8548    7.8496
> 5.10-v7         12.0505    8.2506    8.1557    8.156     8.16      8.1577    8.1611
> Delta            0.2463    0.3278    0.4666    0.3636    0.3419    0.3029    0.3115
>
> 512KB < Fsize < 1024KB
> 4.18             7.7806    5.5496    5.496     5.4912    5.4897    5.4883    5.6602
> 5.10-v7          8.1283    5.8784    5.8486    5.8505    5.8523    5.8511    5.856
> Delta            0.3477    0.3288    0.3526    0.3593    0.3626    0.3628    0.1958
>
> 1024KB < Fsize < 1536KB
> 4.18            10.2686    7.5294    7.5012    7.4902    7.4855    7.4858    7.4851
> 5.10-v7         10.5289    7.8486    7.8502    7.8477    7.849     7.8482    7.8542
> Delta            0.2603    0.3192    0.349     0.3575    0.3635    0.3624    0.3691
>
> 1536KB < Fsize < 2048KB
> 4.18             5.6439    4.0588    3.9974    3.9946    3.9949    3.9942    3.9925
> 5.10-v7          6.2263    4.6009    4.6062    4.6069    4.6078    4.6074    4.6099
> Delta            0.5824    0.5421    0.6088    0.6123    0.6129    0.6132    0.6174
>
> 2048KB < Fsize < 5120KB
> 4.18            34.9166   28.7944   28.7355   28.7192   28.7046   28.6976   28.69
> 5.10-v7         33.8689   27.9726   27.9747   27.9801   27.9849   27.9855   27.9915
> Delta           -1.0477   -0.8218   -0.7608   -0.7391   -0.7197   -0.7121   -0.6985
>
> Fsize > 5120KB
> 4.18            45.6575   33.8609   33.7512   33.7349   33.7196   33.7166   33.708
> 5.10-v7         45.3494   34.0473   34.0443   34.0692   34.0635   34.0622   34.0599
> Delta           -0.3081    0.1864    0.2931    0.3343    0.3439    0.3456    0.3519
>
> (T1 means testing with 1 thread; file size unit: KB; time unit: seconds;
> 5.10-v7 means we backported the squashfs_readahead() v7 patchset onto
> Linux 5.10.)
>
> The commands used to test are like:
> echo 3 > /proc/sys/vm/drop_caches; sleep 3; time -v find /test/ -type
> f -size -256k | xargs -P 32 -i cat {} > /dev/null 2>/dev/null
> echo 3 > /proc/sys/vm/drop_caches; sleep 3; time -v find /test/ -type
> f -size +256k -size -512k | xargs -P 32 -i cat {} > /dev/null
> 2>/dev/null
>
> >
> > 2. Do the Squashfs parallelism options in the kernel configuration
> >    change the behaviour? Knowing if the number of "decompressors"
> >    available changes the difference in performance could be important.
>
> In our environment, CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU is enabled.
> There are 12 CPUs on our board. We also tried enabling
> CONFIG_SQUASHFS_DECOMP_MULTI and reading files with 2/4/6/8/12/16/24/32
> threads; the performance was not improved and was even a bit worse.
>
> >
> > 3. Are your Squashfs filesystems built using fragments, or without
> >    fragments? Rebuilding the filesystems without fragments, and
> >    observing any different performance, would help to pinpoint
> >    where the issue lies.
>
> We didn't use the "-no-fragments" option when building the squashfs
> image. The steps to build the squashfs partition are:
> a. mksquashfs /lib64/ test.squash
> b. lvcreate -L 24M /dev/vg0 -n test -y
> c. dd if=/root/test.squash of=/dev/vg0/test
> d. mount -t squashfs /dev/vg0/test xxx
>
> When using "-no-fragments", the performance is much worse than with
> fragments. As you can see, the test files are from /lib64, and most of
> them are small files.
>
> >
> > 4. What is the block size used in your Squashfs filesystems? Have
> >    you tried changing the block size, and seen what effect it has
> >    on the difference in performance between the patches?
>
> We set CONFIG_SQUASHFS_4K_DEVBLK_SIZE to "y", so the device block size
> should be 4K. We didn't try other block sizes because we have identical
> squashfs configs on 4.18 and 5.10.
>
> >
> > 5. You don't mention where your Squashfs filesystems are stored.
> >    Is this slow media or fast media?
>
> Please see the info for the disk we are testing on:
> """
> $ hdparm -I /dev/sda1
>
> /dev/sda1:
> SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00
> 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
> ATA device, with non-removable media
> Standards:
> Likely used: 1
> Configuration:
> Logical         max     current
> cylinders       0       0
> heads           0       0
> sectors/track   0       0
> --
> Logical/Physical Sector size: 512 bytes
> device size with M = 1024*1024: 0 MBytes
> device size with M = 1000*1000: 0 MBytes
> cache/buffer size = unknown
> Capabilities:
> IORDY not likely
> Cannot perform double-word IO
> R/W multiple sector transfer: not supported
> DMA: not supported
> PIO: pio0
> """
>
> >    Have you tried moving the Squashfs filesystems onto different
> >    media and observed any difference in performance between the
> >    patches?
>
> Sorry, I still didn't get a chance to test on other media.
>
> >
> > The fact of the matter is there are many overlapping factors which
> > affect the performance of Squashfs filesystems (like any reasonably
> > complex code), and the cause may lie elsewhere. It can only take a
> > small change somewhere to have a dramatic effect on performance.
>
> We found that performance improved when we reran our test after
> remaking the partitions with my steps in item 3 above. The following
> data shows the elapsed times of squashfs_readahead() when reading files
> before remaking the partitions (i.e. after the test command had already
> been run many times) and after remaking them. I captured the data below
> with ftrace:
>
> For a 14K file:
> Before partitions remade    After partitions remade:
> 4352.306 us                 3943.846 us
> 4321.176 us                 3929.255 us
>
> For a 1.8M file:
> Before partitions remade    After partitions remade:
> 17446.73 us                 16506.58 us
> 17446.73 us                 16201.32 us
> 18465.38 us                 17548.96 us
> 12269.78 us                 11939.09 us
> 9627.990 us                  9167.052 us
>
> As you can see, the elapsed times of squashfs_readahead() are
> significantly reduced on freshly made partitions. We hit the same
> problem on Linux 4.18.
>
> By the way, I think the test results I previously sent out in the v5
> thread are also related to whether the partitions had been remade:
> https://lore.kernel.org/lkml/20220606150305.1883410-1-hsinyi@xxxxxxxxxxxx/T/#m5f3f8386eb8b72a1f63b60be37ea2cc6d03c5f84
>
> >
> > This is particularly the case with embedded systems, which may be
> > short on CPU performance, short on RAM, and have low performance
> > media, and be effectively operating on the "edge". It can only take
> > a small change, an update for instance, to change from performing
> > well to badly.
>
> I checked the CPU usage; it's not over 11%. The RAM is also sufficient:
>               total        used        free      shared  buff/cache   available
> Mem:       15837684      531420    11051344      262080     4254920    14858224
> Swap:             0
>
> Regards,
> Xiongwei
>
> >
> > I speak from experience, having spent over ten years in embedded
> > Linux as a senior engineer and then as a consultant. I have my own
> > horror tales as a consultant, dealing with systems pushed beyond the
> > edge (with hacks), and the customer insisting they didn't do anything
> > to cause the system to finally break.
> >
> > Maybe it is off topic here.
> > But I remember one instance where a customer had a system out in the
> > field, which "inexplicably" started to lock up every 6 months or so.
> > This system had regular updates "over the air", and I discovered the
> > "lock up" only started happening after the latest update. It turns out
> > the new version of the application had grown a new feature which
> > needed more RAM than normal. This feature wasn't used very often, but
> > if it coincided with an infrequent "house-keeping" background task,
> > the system ran out of memory and locked up (they had disabled the OOM
> > killer). This was so rare it might only coincide after six months. No
> > bug, but a slow growth in working-set RAM over a number of versions.
> >
> > In other words, we may be looking at a knock-on side effect of
> > readahead, which is either caused by issues elsewhere or is causing
> > issues elsewhere.
> >
> > Dealing with it in isolation, as a bug in the readahead code, is going
> > to get us nowhere, looking for something that isn't there.
> >
> > I'm not saying that this is the case here. But the more detail you can
> > provide, and the more test variants you can try, the easier it will be
> > to determine what the problem is.
> >
> > Thanks
> >
> > Phillip
> >
> > >
> > > Regards,
> > > Xiongwei
> > >
> > > On Mon, Jun 6, 2022 at 11:03 PM Hsin-Yi Wang <hsinyi@xxxxxxxxxxxx> wrote:
> > >>
> > >> Implement readahead callback for squashfs. It will read datablocks
> > >> which cover pages in readahead request. For a few cases it will
> > >> not mark page as uptodate, including:
> > >> - file end is 0.
> > >> - zero filled blocks.
> > >> - current batch of pages isn't in the same datablock.
> > >> - decompressor error.
> > >> Otherwise pages will be marked as uptodate. The unhandled pages will be
> > >> updated by readpage later.
> > >>
> > >> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > >> Signed-off-by: Hsin-Yi Wang <hsinyi@xxxxxxxxxxxx>
> > >> Reported-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> > >> Reported-by: Phillip Lougher <phillip@xxxxxxxxxxxxxxx>
> > >> Reported-by: Xiongwei Song <Xiongwei.Song@xxxxxxxxxxxxx>
> > >> Reported-by: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
> > >> Reported-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > >> ---
> > >> v4->v5:
> > >> - Handle short file cases reported by Marek and Matthew.
> > >> - Fix checkpatch error reported by Andrew.
> > >>
> > >> v4: https://lore.kernel.org/lkml/20220601103922.1338320-4-hsinyi@xxxxxxxxxxxx/
> > >> v3: https://lore.kernel.org/lkml/20220523065909.883444-4-hsinyi@xxxxxxxxxxxx/
> > >> v2: https://lore.kernel.org/lkml/20220517082650.2005840-4-hsinyi@xxxxxxxxxxxx/
> > >> v1: https://lore.kernel.org/lkml/20220516105100.1412740-3-hsinyi@xxxxxxxxxxxx/
> > >> ---
> > >>  fs/squashfs/file.c | 124 ++++++++++++++++++++++++++++++++++++++++++++-
> > >>  1 file changed, 123 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/fs/squashfs/file.c b/fs/squashfs/file.c
> > >> index a8e495d8eb86..fbd096cd15f4 100644
> > >> --- a/fs/squashfs/file.c
> > >> +++ b/fs/squashfs/file.c
> > >> @@ -39,6 +39,7 @@
> > >>  #include "squashfs_fs_sb.h"
> > >>  #include "squashfs_fs_i.h"
> > >>  #include "squashfs.h"
> > >> +#include "page_actor.h"
> > >>
> > >>  /*
> > >>   * Locate cache slot in range [offset, index] for specified inode. If
> > >> @@ -495,7 +496,128 @@ static int squashfs_read_folio(struct file *file, struct folio *folio)
> > >>         return 0;
> > >>  }
> > >>
> > >> +static void squashfs_readahead(struct readahead_control *ractl)
> > >> +{
> > >> +       struct inode *inode = ractl->mapping->host;
> > >> +       struct squashfs_sb_info *msblk = inode->i_sb->s_fs_info;
> > >> +       size_t mask = (1UL << msblk->block_log) - 1;
> > >> +       unsigned short shift = msblk->block_log - PAGE_SHIFT;
> > >> +       loff_t start = readahead_pos(ractl) & ~mask;
> > >> +       size_t len = readahead_length(ractl) + readahead_pos(ractl) - start;
> > >> +       struct squashfs_page_actor *actor;
> > >> +       unsigned int nr_pages = 0;
> > >> +       struct page **pages;
> > >> +       int i, file_end = i_size_read(inode) >> msblk->block_log;
> > >> +       unsigned int max_pages = 1UL << shift;
> > >> +
> > >> +       readahead_expand(ractl, start, (len | mask) + 1);
> > >> +
> > >> +       if (file_end == 0)
> > >> +               return;
> > >> +
> > >> +       pages = kmalloc_array(max_pages, sizeof(void *), GFP_KERNEL);
> > >> +       if (!pages)
> > >> +               return;
> > >> +
> > >> +       actor = squashfs_page_actor_init_special(pages, max_pages, 0);
> > >> +       if (!actor)
> > >> +               goto out;
> > >> +
> > >> +       for (;;) {
> > >> +               pgoff_t index;
> > >> +               int res, bsize;
> > >> +               u64 block = 0;
> > >> +               unsigned int expected;
> > >> +
> > >> +               nr_pages = __readahead_batch(ractl, pages, max_pages);
> > >> +               if (!nr_pages)
> > >> +                       break;
> > >> +
> > >> +               if (readahead_pos(ractl) >= i_size_read(inode))
> > >> +                       goto skip_pages;
> > >> +
> > >> +               index = pages[0]->index >> shift;
> > >> +               if ((pages[nr_pages - 1]->index >> shift) != index)
> > >> +                       goto skip_pages;
> > >> +
> > >> +               expected = index == file_end ?
> > >> +                          (i_size_read(inode) & (msblk->block_size - 1)) :
> > >> +                          msblk->block_size;
> > >> +
> > >> +               bsize = read_blocklist(inode, index, &block);
> > >> +               if (bsize == 0)
> > >> +                       goto skip_pages;
> > >> +
> > >> +               if (nr_pages < max_pages) {
> > >> +                       struct squashfs_cache_entry *buffer;
> > >> +                       unsigned int block_mask = max_pages - 1;
> > >> +                       int offset = pages[0]->index - (pages[0]->index & ~block_mask);
> > >> +
> > >> +                       buffer = squashfs_get_datablock(inode->i_sb, block,
> > >> +                                                       bsize);
> > >> +                       if (buffer->error) {
> > >> +                               squashfs_cache_put(buffer);
> > >> +                               goto skip_pages;
> > >> +                       }
> > >> +
> > >> +                       expected -= offset * PAGE_SIZE;
> > >> +                       for (i = 0; i < nr_pages && expected > 0; i++,
> > >> +                                       expected -= PAGE_SIZE, offset++) {
> > >> +                               int avail = min_t(int, expected, PAGE_SIZE);
> > >> +
> > >> +                               squashfs_fill_page(pages[i], buffer,
> > >> +                                               offset * PAGE_SIZE, avail);
> > >> +                               unlock_page(pages[i]);
> > >> +                       }
> > >> +
> > >> +                       squashfs_cache_put(buffer);
> > >> +                       continue;
> > >> +               }
> > >> +
> > >> +               res = squashfs_read_data(inode->i_sb, block, bsize, NULL,
> > >> +                                        actor);
> > >> +
> > >> +               if (res == expected) {
> > >> +                       int bytes;
> > >> +
> > >> +                       /* Last page may have trailing bytes not filled */
> > >> +                       bytes = res % PAGE_SIZE;
> > >> +                       if (bytes) {
> > >> +                               void *pageaddr;
> > >> +
> > >> +                               pageaddr = kmap_atomic(pages[nr_pages - 1]);
> > >> +                               memset(pageaddr + bytes, 0, PAGE_SIZE - bytes);
> > >> +                               kunmap_atomic(pageaddr);
> > >> +                       }
> > >> +
> > >> +                       for (i = 0; i < nr_pages; i++) {
> > >> +                               flush_dcache_page(pages[i]);
> > >> +                               SetPageUptodate(pages[i]);
> > >> +                       }
> > >> +               }
> > >> +
> > >> +               for (i = 0; i < nr_pages; i++) {
> > >> +                       unlock_page(pages[i]);
> > >> +                       put_page(pages[i]);
> > >> +               }
> > >> +       }
> > >> +
> > >> +       kfree(actor);
> > >> +       kfree(pages);
> > >> +       return;
> > >> +
> > >> +skip_pages:
> > >> +       for (i = 0; i < nr_pages; i++) {
> > >> +               unlock_page(pages[i]);
> > >> +               put_page(pages[i]);
> > >> +       }
> > >> +
> > >> +       kfree(actor);
> > >> +out:
> > >> +       kfree(pages);
> > >> +}
> > >>
> > >>  const struct address_space_operations squashfs_aops = {
> > >> -       .read_folio = squashfs_read_folio
> > >> +       .read_folio = squashfs_read_folio,
> > >> +       .readahead = squashfs_readahead
> > >>  };
> > >> --
> > >> 2.36.1.255.ge46751e96f-goog
> > >>
> > >>
> >
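
For reference, the per-call squashfs_readahead() elapsed times quoted earlier
in the thread were captured with ftrace. Below is a minimal sketch of one way
to take that kind of measurement with the function_graph tracer; the mount
point /test and the file name are assumptions for illustration, since the
thread does not give the exact commands that were used:

  cd /sys/kernel/tracing                         # tracefs mount point
  echo 0 > tracing_on                            # stop tracing while configuring
  echo squashfs_readahead > set_ftrace_filter    # trace only this function
  echo function_graph > current_tracer           # record per-call duration
  echo 3 > /proc/sys/vm/drop_caches              # drop the page cache first
  echo 1 > tracing_on
  cat /test/some-file > /dev/null                # trigger readahead on a squashfs file
  echo 0 > tracing_on
  grep squashfs_readahead trace                  # per-call elapsed times in microseconds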