On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt <tijnema@xxxxxxxxx> wrote:
> It is possible to write a whole parser as a single regex, being it terribly
> long and complex.

While regular expressions are often used in the lexer--the part that scans the input stream and breaks it up into meaningful tokens like { keyword: "function" }, { operator: "+" }, and { identifier: "$foo" } that form the building blocks of the language--they aren't combined into a single expression. Instead, a lexer generator is used to build a state machine that switches which expressions are active based on the previous tokens and the surrounding context. Each expression recognizes a different type of token, and often these aren't even regular expressions.

The second stage--combining tokens according to the rules of the grammar--is more complex and beyond the power of regular expressions. There are plenty of books on the subject, and tools [1] to build the pieces, such as Lex, Yacc, Flex, and Bison. Someone asked this very question on Stack Overflow [2] a few years ago. And I'm sure if you look you can find someone's master's thesis proving that regular expressions cannot handle a context-free grammar.

And finally I leave you with Jeff Atwood's article about (not) parsing HTML with regex. [3]

Peace,
David

[1] http://dinosaur.compilertools.net/
[2] http://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers
[3] http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
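
P.S. Here's a rough sketch of that first (lexing) stage in Python, in case it helps make the idea concrete. The token names and patterns are purely illustrative--not taken from any real PHP lexer--and a real lexer generator produces something far more sophisticated:

```python
import re

# Each token type gets its own regular expression. The combined pattern
# tries the alternatives in order at the current position; this is the
# "many small expressions" approach, not one giant regex for the grammar.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bfunction\b"),
    ("IDENTIFIER", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SKIP",       r"\s+"),  # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (type, text) pairs for each token in the input string."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}: {source[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("function $foo + $bar")))
# [('KEYWORD', 'function'), ('IDENTIFIER', '$foo'),
#  ('OPERATOR', '+'), ('IDENTIFIER', '$bar')]
```

Note that each pattern only recognizes a single flat token; turning the resulting token stream into a parse tree is the second stage, which is where regular expressions run out of steam.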