Re: [PATCH] Strip control codes in virBufferEscapeString

Daniel Veillard <veillard@xxxxxxxxxx> · Tue, 31 Mar 2015 09:38:34 +0800

On Mon, Mar 30, 2015 at 10:56:11AM -0600, Eric Blake wrote:
> On 03/30/2015 09:50 AM, Daniel Veillard wrote:
> 
> >> NACK.  Stripping control codes from a volume name represents the wrong
> >> name.  We need to escape the problematic bytes, rather than strip them.
> > 
> >   you can't escape them with a CharRef for sure
> > 
> > http://www.w3.org/TR/REC-xml/#wf-Legalchar
> > Characters referred to using character references must match the
> > production for Char.
> > 
> >   That time Ján  is right :-)
> 
> Ouch.  Then how do we represent the name of a storage volume, when the
> file system allows arbitrary bytes including control characters, in the
> volume name, but where we are restricted to only using valid XML?  Do we
> just silently ignore such files as impossible volumes that libvirt
> cannot manage?  (I'd rather omit such a volume from the list in the
> pool, than silently munge its name into something incorrect)

 If such an invalid CharRef were to hit libxml2 you would get a
parser error and no result, so you can safely assume nobody has
ever had a working setup with one. You can then push an additional
patch doing a libvirt-level escaping of only those problematic characters
prior to the encoding in the XML, and unescape them when reading
from the XML back into libvirt internals. This should not affect any
deployed instance, since such names would have been unparseable anyway.
I would suggest using the same charref-style escaping, applied before
passing the string to XML, e.g.

real path:       /foo\3bar
libvirt encoded: /foo&#3;bar
XML encoded:     /foo&amp;#3;bar
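
The XML half of that example can be checked with any conforming parser.
A minimal sketch, assuming Python's stdlib ElementTree (expat) as a
stand-in for libxml2; the <name> element and the parse_text helper are
made up for illustration:

```python
import xml.etree.ElementTree as ET

def parse_text(xml_snippet):
    """Return the element text, or None if the parser rejects the document."""
    try:
        return ET.fromstring(xml_snippet).text
    except ET.ParseError:
        return None

# A character reference to U+0003 does not match Char, so parsing fails.
rejected = parse_text("<name>/foo&#3;bar</name>")

# The libvirt-encoded form, once XML-escaped, is plain text to the parser.
accepted = parse_text("<name>/foo&amp;#3;bar</name>")
```

Here rejected is None, while accepted holds the libvirt-encoded string
/foo&#3;bar, still waiting for the libvirt-level decoding step.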

You also need to catch & and give it special status, otherwise a real
name containing the literal text &#3; would be decoded wrongly:

real path:       /foo&bar
libvirt encoded: /foo&#38;bar
XML encoded:     /foo&amp;#38;bar

After libvirt parses the XML you end up with /foo&#3;bar,
and each time you see &#numericsequence; you translate that back to
the equivalent UTF-8 character.
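
The libvirt-level encode and decode steps described above could look
roughly like this. It is only a sketch: the function names are
hypothetical, it works on Python str rather than raw bytes, and it uses
the decimal form only. The real patch would live in C next to
virBufferEscapeString.

```python
import re

# Characters given the treatment: 0x01-0x08, 0x0B-0x1F, plus '&' itself.
_NEEDS_ESCAPE = re.compile(r"[\x01-\x08\x0b-\x1f&]")
# Decimal numeric sequences produced by libvirt_encode().
_ESCAPED = re.compile(r"&#([0-9]+);")

def libvirt_encode(name: str) -> str:
    """Replace each problematic character with &#<decimal>; (hypothetical helper)."""
    return _NEEDS_ESCAPE.sub(lambda m: "&#%d;" % ord(m.group()), name)

def libvirt_decode(encoded: str) -> str:
    """Translate every &#<decimal>; back to its character, in a single pass."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1))), encoded)
```

Because every & in the real name is itself escaped, decoding in one
left-to-right pass is unambiguous: a file literally named /foo&#3;bar
encodes to /foo&#38;#3;bar and decodes back to the same name.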

Char  ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

As a first approach, I would suggest just detecting bytes
0x1-0x8 and 0xB-0x1F and giving them this treatment; the probability of
hitting surrogates in UTF-8 filenames seems low enough that the patch
should work in general.
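
For reference, the Char production and the first-approach byte test can
be written down directly. This is an illustrative sketch with
hypothetical function names, not code from libvirt or libxml2:

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def needs_treatment(byte: int) -> bool:
    """First-approach test from this mail: bytes 1-8 and 0xB-0x1F."""
    return 1 <= byte <= 8 or 0xB <= byte <= 0x1F

# Every low code point that Char forbids (NUL excluded, since it cannot
# appear in a file name anyway).
illegal_low = [cp for cp in range(1, 0x20) if not is_xml_char(cp)]
```

The simple range covers every entry of illegal_low. It also catches 0xD,
which Char allows; escaping it anyway is harmless.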

Whether to use /foo&#3;bar or /foo&#0x3;bar is a matter of taste;
you only need to handle one, IMHO.

Add a small regression test with all the low control characters and
& used in the path, and I think you're covered.
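
Such a regression test might be shaped like the sketch below. It assumes
Python's stdlib ElementTree in place of libxml2 and a hypothetical
<name> element; the compact encode/decode helpers are restated only so
the block runs on its own.

```python
import re
import xml.etree.ElementTree as ET

def encode(name: str) -> str:
    """libvirt-level escaping of 0x01-0x08, 0x0B-0x1F and '&' (hypothetical)."""
    return re.sub(r"[\x01-\x08\x0b-\x1f&]",
                  lambda m: "&#%d;" % ord(m.group()), name)

def decode(encoded: str) -> str:
    """Inverse of encode(): turn &#<decimal>; back into the character."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), encoded)

def roundtrip_through_xml(name: str) -> str:
    """Encode, serialize as element text, parse again, then decode."""
    elem = ET.Element("name")
    elem.text = encode(name)
    xml_bytes = ET.tostring(elem)      # the serializer adds the &amp; layer
    return decode(ET.fromstring(xml_bytes).text)

# A path containing every low control character plus a literal '&'.
path = "/vol" + "".join(chr(c) for c in range(1, 0x20)) + "&end"
```

The test then only has to assert that roundtrip_through_xml(path)
returns path unchanged, and likewise for a name such as /foo&#3;bar that
already looks like an escape sequence.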

Sounds too late for 1.2.14 though,

Daniel

-- 
Daniel Veillard      | Open Source and Standards, Red Hat
veillard@xxxxxxxxxx  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | virtualization library  http://libvirt.org/
