more: Limited line buffer length results in corrupted UTF-8 text

Hi,

I've attached a simple patch for a bug I found in more, which was
also filed in the Debian BTS.

Note that the patch enlarges the line buffer so that lines of multibyte
UTF-8 characters no longer overflow it.  It does not fix the additional
issue of output corruption when the buffer does overflow; that needs
fixing separately, and I have no patch for it.

An alternative approach which would avoid *all* overflow would be to
dynamically allocate the buffer with a minimum size of 4× the number of
columns (since a UTF-8 sequence is at most 4 bytes).  Personally I'd go
with a minimum of 6×, since the original UTF-8 encoding scheme allows
sequences of up to 6 bytes; the current standard restricts them to 4,
but that limit might be raised in the future.
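
A minimal sketch of the dynamic sizing idea, assuming one character per
column plus a terminating NUL.  The names MAX_BYTES_PER_CHAR,
line_buf_size() and alloc_line_buf() are hypothetical and do not exist
in more.c; combining characters, which can put several code points in
one column, are not accounted for.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed worst case per character: 6 bytes as in the original UTF-8
 * design.  Use 4 to follow the current standard instead. */
#define MAX_BYTES_PER_CHAR 6

/* Bytes needed for a line of `columns` characters plus a NUL
 * terminator.  Returns 0 if the multiplication would overflow. */
static size_t line_buf_size(size_t columns)
{
    if (columns > (SIZE_MAX - 1) / MAX_BYTES_PER_CHAR)
        return 0;
    return columns * MAX_BYTES_PER_CHAR + 1;
}

/* Allocate a line buffer for a terminal of `columns` columns.
 * On success the size is stored in *size_out (if non-NULL).
 * Returns NULL on overflow or allocation failure. */
static char *alloc_line_buf(size_t columns, size_t *size_out)
{
    size_t size = line_buf_size(columns);
    char *buf;

    if (size == 0)
        return NULL;
    buf = malloc(size);
    if (buf != NULL && size_out != NULL)
        *size_out = size;
    return buf;
}
```

The buffer would have to be reallocated whenever the terminal is
resized, which a static LINSIZ buffer avoids.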


Regards,
Roger

-- 
  .''`.  Roger Leigh
 : :' :  Debian GNU/Linux             http://people.debian.org/~rleigh/
 `. `'   Printing on GNU/Linux?       http://gutenprint.sourceforge.net/
   `-    GPG Public Key: 0x25BFB848   Please GPG sign your mail.
--- Begin Message ---
Package: util-linux
Version: 2.16.1-4
Severity: important
Tags: patch

Attached is a file which may be used to demonstrate the problem.

With a terminal of standard 80 column width, more displays the text
correctly.  The longest line (line 11: 91 characters, 91×3=273 bytes)
is correctly folded over two lines.

─────────────────┼──────────┼──────────┼─────────────┼─────────────┼───────────────────────
                                                                                   col 91 ↑
⇒

─────────────────┼──────────┼──────────┼─────────────┼─────────────┼────────────
───────────
                                                                        col 80 ↑

Now, resize the terminal width to over 85 columns, and one sees this:

─────────────────┼──────────┼──────────┼─────────────┼─────────────┼─────────────────
��─────
                                                                             col 85 ↑

There is a newline inserted after 85 chars, and the first byte of the
following UTF-8 3-byte code is lost (replaced by \n?) leading to
corruption since the following two bytes are now invalid UTF-8.

Why is this happening?

I believe it's partly down to
  #define LINSIZ  256
in text-utils/more.c.  All the UTF-8 characters on that line are 3-byte
codes, and 256/3 is 85 with 1 byte left over, so the buffer fills up
after 85 characters plus the first byte of the 86th.  But there's a bug
somewhere else in the code as well, since it's not only flushing the
buffer, it's corrupting it.
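
The arithmetic can be checked with a small sketch.  The helpers
whole_chars() and leftover_bytes() are hypothetical names for
illustration, and the sketch assumes every character on the line has
the same byte width and that the whole buffer is available for text.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole fixed-width characters that fit in a buffer of
 * bufsize bytes. */
static size_t whole_chars(size_t bufsize, size_t bytes_per_char)
{
    assert(bytes_per_char > 0);
    return bufsize / bytes_per_char;
}

/* Bytes left over after the last whole character.  A non-zero result
 * means the next character straddles the end of the buffer. */
static size_t leftover_bytes(size_t bufsize, size_t bytes_per_char)
{
    assert(bytes_per_char > 0);
    return bufsize % bytes_per_char;
}
```

With bufsize 256 and 3-byte characters this gives 85 whole characters
and 1 leftover byte, matching the break after column 85 seen above.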

Partial solution: 256 bytes for the line buffer is way too small.  I'd
suggest that for a modern system using UTF-8 1024 bytes would be a
more sensible default, since this would allow use of at least 256 columns
of 4-byte UTF-8 codes.  4096 bytes would be even safer, and since it's
for a single static buffer, the increased overhead is minimal.  I've
built with the following patch and it does prevent the corruption.

There's still the matter of corruption in the case of overflow, which
would still need addressing; the increased buffer size just hides it
rather than fixing it.  When the buffer fills, more should probably
only flush up to the end of the last valid UTF-8 sequence.
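
A minimal sketch of that idea, which is not part of the patch below and
not taken from more.c: utf8_complete_prefix() is a hypothetical helper
that returns how many bytes of a buffer can be flushed without cutting a
UTF-8 sequence in half.  It assumes the data is otherwise well-formed
and only inspects the tail of the buffer.

```c
#include <stddef.h>

/* Return the length of the longest prefix of buf[0..len) that does not
 * end in the middle of a UTF-8 sequence.  Malformed tails (stray
 * continuation bytes, invalid lead bytes) are left untouched. */
static size_t utf8_complete_prefix(const unsigned char *buf, size_t len)
{
    size_t i = len;
    size_t need, have;
    unsigned char lead;

    /* Step back over at most 3 trailing continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return len;             /* no lead byte in the buffer */

    lead = buf[i - 1];
    if (lead < 0x80)
        need = 1;               /* ASCII */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;
    else if ((lead & 0xF0) == 0xE0)
        need = 3;
    else if ((lead & 0xF8) == 0xF0)
        need = 4;
    else
        return len;             /* not a valid lead byte */

    have = len - (i - 1);       /* bytes present from the lead byte on */
    return (have >= need) ? len : i - 1;
}
```

For a 256-byte buffer filled with 3-byte characters this returns 255, so
the dangling first byte of the 86th character would be held back and
emitted together with its two continuation bytes on the next flush.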

diff -urN util-linux-2.16.1.orig/text-utils/more.c util-linux-2.16.1/text-utils/more.c
--- util-linux-2.16.1.orig/text-utils/more.c	2009-07-04 00:20:07.000000000 +0100
+++ util-linux-2.16.1/text-utils/more.c	2009-10-27 11:11:32.046127972 +0000
@@ -107,7 +107,7 @@
 FILE *checkf (char *, int *);
 
 #define TBUFSIZ	1024
-#define LINSIZ	256
+#define LINSIZ	4096
 #define ctrl(letter)	(letter & 077)
 #define RUBOUT	'\177'
 #define ESC	'\033'


Regards,
Roger

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (550, 'unstable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.30-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages util-linux depends on:
ii  dpkg                   1.15.4.1          Debian package management system
ii  initscripts            2.87dsf-8         scripts for initializing and shutt
ii  install-info           4.13a.dfsg.1-5    Manage installed documentation in 
ii  libblkid1              2.16.1-4          block device id library
ii  libc6                  2.10.1-2          GNU C Library: Shared libraries
ii  libncurses5            5.7+20090803-2    shared libraries for terminal hand
ii  libselinux1            2.0.88-1          SELinux runtime shared libraries
ii  libslang2              2.2.1-1           The S-Lang programming library - r
ii  libuuid1               2.16.1-4          Universally Unique ID library
ii  lsb-base               3.2-23            Linux Standard Base 3.2 init scrip
ii  tzdata                 2009o-2           time zone and daylight-saving time
ii  zlib1g                 1:1.2.3.3.dfsg-15 compression library - runtime

util-linux recommends no packages.

Versions of packages util-linux suggests:
ii  console-tools              1:0.2.3dbs-66 Linux console and font utilities
ii  dosfstools                 3.0.6-1       utilities for making and checking 
ii  util-linux-locales         2.16.1-4      Locales files for util-linux

-- no debconf information
psql (8.5devel, server 8.4.1)
WARNING: psql version 8.5, server version 8.4.
         Some psql features might not work.
Type "help" for help.

rleigh=# \pset pager off
Pager usage is off.
rleigh=# \l
                                     List of databases
      Name       │  Owner   │ Encoding │  Collation  │    Ctype    │   Access privileges   
─────────────────┼──────────┼──────────┼─────────────┼─────────────┼───────────────────────
 merkelpb        │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 postgres        │ postgres │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 projectb        │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 rleigh          │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 rleigh-amarok   │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 sbuild-packages │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 scratch         │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 scratch2        │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 template0       │ postgres │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │ =c/postgres          ↵
                 │          │          │             │             │ postgres=CTc/postgres 
 template1       │ postgres │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │ =c/postgres          ↵
                 │          │          │             │             │ postgres=CTc/postgres 
 test            │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 test2           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 test3           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 test4           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 test5           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 testp           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
 vtest           │ rleigh   │ UTF8     │ en_GB.UTF-8 │ en_GB.UTF-8 │                       
(17 rows)

--- End Message ---
