Re: [Summit topic] Documentation (translations, FAQ updates, new user-focused, general improvements, etc.)


On Fri, Oct 22, 2021 at 04:31:46PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I'd very much support this living in-tree just as the po/* directory
> already does. I.e. periodically pulled down.

Just a bit of a tangent here, since weblate was mentioned earlier.

I'd caution a bit against pulling the history generated by weblate
directly. It's pretty sub-optimal from a Git perspective: you have a
bunch of big .po files and then a ton of little commits changing one or
a handful of lines.

So the "logical" size of the repository (the sum of the actual object
sizes) ends up growing quite a bit. Deltas can help with the on-disk
size, but:

  - lots of operations scale with the logical size. The client-side
    index-pack of a clone, for instance, but also everyday stuff like
    "git log -S".

  - empirically we don't do a great job of finding good deltas for this
    kind of history. See below for some numbers.

For instance, take https://github.com/phpmyadmin/phpmyadmin, a
repository which uses weblate (I don't mean to pick on them; it's just a
repo whose weblate-related packing I've looked into before). A fresh
clone is 1.3GB. If you do an aggressive repack, you can get it down to
about 550MB. But there's still tons of logical data. Running:

  git cat-file --batch-all-objects --batch-check='%(objectsize) %(objectsize:disk)' |
  perl -alne '
    $logical += $F[0]; $disk += $F[1];
    END { print "$logical / $disk = " . $logical / $disk }
  '

shows that there's over 70GB of logical data. It gets an impressive
156:1 compression ratio (for comparison, "normal" repos like linux.git
and git.git are around 40-60x in my experience).
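As a rough sketch of how on-disk sizes like the ones above might be measured: the helper name measure_repack and the --window/--depth values below are assumptions for illustration, not necessarily what produced the 550MB figure. The "size-pack" line of count-objects is the packed on-disk size.

```shell
# Hypothetical helper: show pack size before and after a from-scratch repack.
# $1 = path to the repository to measure.
measure_repack() {
  git -C "$1" count-objects -vH   # "size-pack" is the on-disk packed size
  # -a -d: put everything into one pack and drop the redundant ones
  # -f:    recompute deltas instead of reusing existing ones
  # The window/depth values are illustrative only.
  git -C "$1" repack -a -d -f --window=250 --depth=50
  git -C "$1" count-objects -vH
}
```

Something like "measure_repack /path/to/phpmyadmin" would print the two reports back to back. Larger window values tend to find more deltas at the cost of a much slower repack.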

If you split it up by directory, like this:

  git rev-list --objects --all --no-object-names -- po |
  git cat-file --batch-check='%(objectsize)' |
  perl -lne '$total += $_; END { print $total }'

you'll see that po/ accounts for almost 60GB of that logical size.
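The same measurement can be looped over every top-level directory to see how po/ compares with the rest. This is only a sketch: the function name logical_size_by_dir is made up here, it assumes directory names without unusual characters, and it needs a Git new enough to have --no-object-names. As with the command above, each total also counts the commits and root trees that touch the path, so the per-directory numbers overlap and will not add up to the whole.

```shell
# Hypothetical helper: logical (uncompressed) bytes per top-level directory.
# $1 = path to the repository. Output: "<bytes><TAB><dir>", largest first.
logical_size_by_dir() {
  git -C "$1" ls-tree -d --name-only HEAD |
  while IFS= read -r dir; do
    # Sum the uncompressed size of every object reachable via this path.
    size=$(git -C "$1" rev-list --objects --all --no-object-names -- "$dir" |
           git -C "$1" cat-file --batch-check='%(objectsize)' |
           awk '{ total += $1 } END { print total + 0 }')
    printf '%s\t%s\n' "$size" "$dir"
  done | sort -rn
}
```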

We face some of that in our current po/, too. They're big files, and
that's the nature of the problem space. But our current ones tend to be
edited by taking a pass over the whole file, rather than the one-liners
that a web-based workflow encourages.

To be clear, I'm not arguing against weblate in general. It's cool that
it makes it easier for people to contribute to translations. But I think
it has an outsized impact on size and performance compared to the rest
of the repository. That's a big price to pay for carrying the history
in-tree.

Obviously one option there is to squash the po/ history before pulling
it in. The weblate commit messages themselves aren't that useful. I'm
not actually sure if jnavila's work so far has been using weblate. The
commits in his git-html-l10n are much coarser than what I see in
phpmyadmin, for example (so maybe he's doing similar squashing already).
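One way the squashing option could look, as a minimal sketch rather than a description of any existing workflow: fetch the translation repository, take only its current po/ tree, and record that as a single commit, so none of the small intermediate commits enter the history. The function name, its arguments, and the commit message are hypothetical, and --no-overlay needs Git 2.22 or later.

```shell
# Hypothetical helper: import po/ from another repository as one commit.
# Run inside the destination repository.
# $1 = URL or path of the translation repository, $2 = branch name there.
import_po_squashed() {
  # Fetch the branch tip; its history is never merged.
  git fetch --quiet "$1" "$2" &&
  # Make po/ in the index and worktree match the fetched tree exactly
  # (--no-overlay also removes files that no longer exist there).
  git checkout --no-overlay FETCH_HEAD -- po &&
  git commit --quiet -m "l10n: import po/ snapshot from $2"
}
```

The fetched objects stay in the local object store until they are pruned, but they are unreachable from any ref, so they would not be pushed or cloned onward.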

-Peff


