[RFC] Replace the flex-based scanner with an re2c [1] based lexer

Hello everyone,

  sorry for the crosspost, but the recent discussion of
'[RFC] Replace the flex-based scanner with an re2c [1] based lexer'
revealed one big issue. While developing the patch for that RFC we dropped
--enable-multibyte-support and the interaction between the engine and
ext/mbstring using declare(encoding=..). Neither of the two is documented
anywhere, and none of the core developers knows how it works, what it is
supposed to do, or how to test it.

Since we do not want to drop this feature, we need some test code, ideally
in the form of .phpt files. You can find information on how to write tests here:
http://qa.php.net/write-test.php and
http://talks.somabo.de/200703_montreal_need_for_testing.pdf

If you are interested in taking this further, you are of course more than
welcome to help in any other way. Apart from the proposal below, there
is also my blog entry to help you get started:
http://blog.somabo.de/2008/02/php-on-re2c.html

thanks
marcus


Sunday, March 2, 2008, 11:21:34 PM, you wrote:

> RFC: REPLACE THE FLEX-BASED SCANNER WITH AN RE2C [1] BASED LEXER

> Situation:
> The current flex-based lexer depends on an outdated and unsupported flex
> version. Alternatives include either updating to a newer version of flex or
> using re2c, which we already use for a variety of things (serializing, pdo sql
> scanning, date/time parsing). While moving towards a newer flex version would
> be much easier, switching to re2c promises a much faster lexer. Even
> without any re2c-specific optimizations we already get around a 20% scanner
> performance increase, and running the tests shows an overall speedup of 2%.
> It is arguable whether this is enough, but re2c has more advantages. First of all,
> re2c allows one to scan any type of input (ASCII, UTF-8, UTF-16, UTF-32).
> Secondly, it allows for better integration with Lemon [2], which would be the
> next step. And thirdly we can switch to a reentrant scanner.

> Current state:
> Flex has been fully replaced by re2c in Zend. We have also switched to an
> mmap-based lexer approach for now. However, we had to drop multibyte support
> as well as the encoding declare. The current state can be checked out from
> Scott's subversion repository [3] and you can follow the development on his
> Trac setup [4]. To build PHP with re2c, you need to grab re2c from its
> SourceForge subversion repository [5]. You can also check out the changes
> in a patch created on Sunday, 2nd March against a PHP checkout from
> 14th February [6].
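
For readers unfamiliar with the mmap-based approach mentioned above, here is
a minimal sketch in C of the general idea: map the source file read-only and
let the scanner walk the mapping between a cursor and a limit pointer, the
same convention re2c uses for YYCURSOR and YYLIMIT. This is not the code from
the patch. count_identifiers() and scan_file() are hypothetical names, and the
hand-written loop only stands in for a generated scanner.

```c
#include <assert.h>
#include <ctype.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical stand-in for a generated scanner: counts runs matching
 * [A-Za-z_][A-Za-z0-9_]* in the half-open byte range [cursor, limit). */
static size_t count_identifiers(const unsigned char *cursor,
                                const unsigned char *limit)
{
    size_t n = 0;

    assert(cursor <= limit);
    while (cursor < limit) {
        if (isalpha(*cursor) || *cursor == '_') {
            n++;
            /* Consume the rest of the identifier in place, no copying. */
            while (cursor < limit && (isalnum(*cursor) || *cursor == '_'))
                cursor++;
        } else {
            cursor++;
        }
    }
    return n;
}

/* Map the file at path read-only and scan the mapping directly.
 * Returns the identifier count, or -1 on any error. */
static long scan_file(const char *path)
{
    struct stat st;
    long result = -1;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0) {
        if (st.st_size == 0) {
            result = 0; /* mmap() rejects a zero-length mapping */
        } else {
            void *buf = mmap(NULL, (size_t)st.st_size, PROT_READ,
                             MAP_PRIVATE, fd, 0);
            if (buf != MAP_FAILED) {
                const unsigned char *start = buf;
                result = (long)count_identifiers(start, start + st.st_size);
                munmap(buf, (size_t)st.st_size);
            }
        }
    }
    close(fd);
    return result;
}
```

Calling scan_file() with the path of a script returns the number of
identifier-like runs in it. The point of the design is that the scanner never
refills a buffer: the whole file is addressable up front, so the limit check
is a single pointer comparison.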

> Further steps:
> Commit this to PHP 5.3. Synch to HEAD. Add pecl/intl to 5.3. Discuss/recreate
> multibyte support with libintl.

> Future steps:
> Replace bison with lemon in PHP 5.4 or HEAD.

> Time Frame:
> Commit to 5.3 between the 5th and the 15th of March. Synch to HEAD a couple
> of days later. Moving pecl/intl to ext (depends on the 5.3 RMs' decision).
> After that is done, decide about multibyte support. Along with the commit to
> the 5.3 branch there will be a new re2c version available.


> Marcus Boerger
> Nuno Lopes
> Scott MacVicar


> [1] http://re2c.org/
> [2] http://www.hwaci.com/sw/lemon/
> [3] svn://whisky.macvicar.net/php-re2c
> [4] http://trac.macvicar.net/php-re2c/
> [5] https://re2c.svn.sourceforge.net/svnroot/re2c/trunk/re2c
> [6] http://php.net/~helly/php-re2c-20080302.diff.txt



-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

