On Wed, Aug 22, 2018 at 12:50:12PM +0200, Simon Kobyda wrote:
> On Tue, 2018-08-21 at 11:46 +0100, Daniel P. Berrangé wrote:
> > On Tue, Aug 21, 2018 at 12:27:34PM +0200, Michal Privoznik wrote:
> > > On 08/21/2018 11:18 AM, Simon Kobyda wrote:
> > > > On Thu, 2018-08-16 at 12:28 +0100, Daniel P. Berrangé wrote:
> > > > > On Thu, Aug 16, 2018 at 12:56:24PM +0200, Simon Kobyda wrote:
> > > > > >
> > > > >
> > > > > After asking around I have found the right solution that we
> > > > > need to use for measuring string width. mbstowcs()/wcswidth()
> > > > > will get the answer wrong wrt zero-width characters, combining
> > > > > characters, non-printable characters, etc. We need to use the
> > > > > libunistring library:
> > > > >
> > > > > https://www.gnu.org/software/libunistring/manual/libunistring.html#uniwidth_002eh
> > > > >
> > > >
> > > > I've tried what you've suggested, but it seems that it doesn't
> > > > work well with all unicode characters. I'm looking into the code
> > > > of the library, and each uN_strwidth function calls uN_width,
> > > > which in turn calls uc_width to calculate the width of each
> > > > character. And if we look into the code of uc_width here:
> > > >
> > > > http://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/uniwidth/width.c;h=269cfc77f50a3b9802e5fb5620ff8bcf95e05e40;hb=HEAD#l415
> > > >
> > > > it seems that this library only handles certain unicode ranges,
> > > > e.g. hangul characters, angle brackets, CJK characters... But it
> > > > doesn't cover all multiple-width characters. Example: if I pass
> > > > it any emoji (e.g. 🙉, 🦀, 🏙), it returns a width of 1 column
> > > > for each character, even though these characters have a width
> > > > of 2 columns on a terminal.
> > > >
> > > > BTW, it seems the unistring library imports those functions from
> > > > gnulib.
> > >
> > > I guess the only option then is to try smartcols [1]. If it is good
> > > for util-linux it's going to be good for us too. Although, I'd
> > > prefer to have our own wrappers over their API.
> > >
> > > https://github.com/karelzak/util-linux/tree/master/libsmartcols
> >
> > The util-linux code uses mbstowcs / wcwidth to convert the
> > characters and count their width, sort of like the original
> > version of this patch. They have further code that decides to convert
> > certain unicode characters into "\xNN" escaped sequences, which
> > avoids the problems I raised wrt non-printable strings.
> >
> > https://github.com/karelzak/util-linux/blob/master/lib/mbsalign.c
> >
> > So we could pull that helper API into our code, since it's LGPL
> > licensed. I'm unclear if this correctly handles all the cases or not
> > though, as there are no unit tests for it in util-linux AFAICT.
> >
> > Really the only way for us to be sure is to provide a unit test which
> > stresses the code with a variety of unicode input strings.
>
> About unit tests. Right now I've got tests for non-printable,
> zero-width and combining characters, and for opposite (right-to-left)
> writing. Has anybody got any idea what else could be problematic with
> mbstowcs()/wcswidth(), and should therefore be tested?

I think that sounds reasonable enough for now - passing such tests
would already be massively better than the code that exists today
with strlen()

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
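
For reference, the mbstowcs()/wcswidth() approach discussed in this
thread could be sketched roughly as follows. This is only an
illustration under stated assumptions: term_width() is a hypothetical
name, not an existing libvirt or util-linux function, and it relies on
the POSIX behaviour of mbstowcs() returning the required length when
the destination is NULL. It inherits whatever width data the C library
ships, so it does not by itself settle the corner cases raised above.

```c
#define _XOPEN_SOURCE 700   /* exposes wcswidth() in <wchar.h> */
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper (the name is an assumption for illustration).
 * Returns the number of terminal columns needed to print the
 * multibyte string `s` in the current locale, or -1 if the string
 * cannot be decoded or contains a non-printable character.
 */
static int
term_width(const char *s)
{
    /* POSIX: with a NULL destination, return the wide-char count. */
    size_t n = mbstowcs(NULL, s, 0);
    wchar_t *w;
    int cols;

    if (n == (size_t)-1)
        return -1;              /* invalid multibyte sequence */

    w = malloc((n + 1) * sizeof(*w));
    if (!w)
        return -1;

    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return -1;
    }

    cols = wcswidth(w, n);      /* -1 if any character is non-printable */
    free(w);
    return cols;
}
```

The result depends on the active locale, so a caller would need to run
setlocale(LC_ALL, "") first for non-ASCII input to be decoded. On a -1
return, a caller could escape the offending characters or fall back to
strlen(), whichever fits the output format.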