On Wed, May 29, 2013 at 10:20 AM, Matijn Woudt <tijnema@xxxxxxxxx> wrote:
> It is possible to write a whole parser as a single regex, being it terribly
> long and complex.

While regular expressions are often used in the lexer--the part that scans the input stream and breaks it up into meaningful tokens like { keyword: "function" }, { operator: "+" }, and { identifier: "$foo" } that form the building blocks of the language--they aren't combined into a single expression. Instead, a lexer generator is used to build a state machine that switches which expressions are active based on the previous tokens and the surrounding context. Each expression recognizes a different type of token, and often these aren't even regular expressions.

The second stage--combining tokens according to the rules of the grammar--is more complex and beyond the power of regular expressions. There are plenty of books on the subject, and tools [1] to build the pieces, such as Lex, Yacc, Flex, and Bison. Someone asked this very question on Stack Overflow [2] a few years ago. And I'm sure if you look you can find someone's master's thesis proving that regular expressions cannot handle a context-free grammar.

And finally I leave you with Jeff Atwood's article about (not) parsing HTML with regex. [3]

Peace,
David

[1] http://dinosaur.compilertools.net/
[2] http://stackoverflow.com/questions/3487089/are-regular-expressions-used-to-build-parsers
[3] http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
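
P.S. Here's a rough sketch of that first (lexing) stage in Python, in case it helps make the idea concrete. The token names and patterns are purely illustrative--not taken from any real PHP lexer--and a real lexer generator produces something far more sophisticated:

```python
import re

# Each token type gets its own regular expression. The combined pattern
# tries the alternatives in order at the current position; this is the
# "many small expressions" approach, not one giant regex for the grammar.
TOKEN_SPEC = [
    ("KEYWORD",    r"\bfunction\b"),
    ("IDENTIFIER", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("SKIP",       r"\s+"),  # whitespace: matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (type, text) pairs for each token in the input string."""
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at {pos}: {source[pos]!r}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("function $foo + $bar")))
# [('KEYWORD', 'function'), ('IDENTIFIER', '$foo'),
#  ('OPERATOR', '+'), ('IDENTIFIER', '$bar')]
```

Note that each pattern only recognizes a single flat token; turning the resulting token stream into a parse tree is the second stage, which is where regular expressions run out of steam.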