Re: [OT] searching for a regular expression to match strings

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thursday, Jan 29th 2009 at 13:10 -0000, quoth Konstantin Svist:

=>Christoph H?ger wrote:
=>> Hi,
=>>
=>> anyone knows about a (high-performance) regular expression to match
=>> java-like Strings?
=>> (e.g. "Hi, World \n this is a \"-quoted string.\n")
=>>
=>> I have tested 
=>>
=>> ((\\.)|[^"\\])*
=>>
=>> which basically does what I want (although capturing too much escape
=>> sequences). 
=>>
=>> The problem is: I've tried jakarta's regexp and java's implementation
=>> and both run into Stack Overflows for input strings with more than 500
=>> characters. That is definitely not acceptable as a hard limit for
=>> tokenizing source code.
=>>
=>> Any suggestions?
=>>   
=>
=>IIRC from taking a compiler course, tokenizing is usually done with a
=>lexer like Lex, Flex, JLex, Ragel, and so on (in conjunction with Yacc
=>or Bison).
=>They're highly tuned for the task - for example, Ragel creates a state
=>machine (decision trees), and only needs to scan each character of the
=>source text once. This is as high performance as you can expect to get
=>(maybe you can squeeze a cycle or two more by coding in assembly, of
=>course). See http://www.complang.org/ragel/
=>I'm not sure exactly how regular expressions work, but I suspect they're
=>not nearly this robust.

Don't jump to any conclusions. Regex's break up into the deterministic 
(DFA) and the non-deterministic (NFA) variety. Tools like lex/flex are 
totally deterministic and they do in fact produce a state machine as 
output.

The regex engine that's in things like perl/python/egrep/emacs/etc allow 
for lots of extended operators to be used including things like 
look-ahead.

You're trying to say something like

"([^"\\]|\\.)*"

but I'm not sure it's correct because the * (i.e., closure operator) is 
greedy. So you might be better off using a non-greedy closure operator, a 
la, 

"([^"\\]|\\.)*?"

BTW, NOT TRIED. If it doesn't work I won't even feel bad. My point is that 
it's eminently doable and that if it's correct then it's certainly robust.

-- 
Time flies like the wind. Fruit flies like a banana. Stranger things have  .0.
happened but none stranger than this. Does your driver's license say Organ ..0
Donor?Black holes are where God divided by zero. Listen to me! We are all- 000
individuals! What if this weren't a hypothetical question?
steveo at syslang.net

-- 
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list
Guidelines: http://fedoraproject.org/wiki/Communicate/MailingListGuidelines
[Index of Archives]     [Older Fedora Users]     [Fedora Announce]     [Fedora Package Announce]     [EPEL Announce]     [Fedora Magazine]     [Fedora News]     [Fedora Summer Coding]     [Fedora Laptop]     [Fedora Cloud]     [Fedora Advisory Board]     [Fedora Education]     [Fedora Security]     [Fedora Scitech]     [Fedora Robotics]     [Fedora Maintainers]     [Fedora Infrastructure]     [Fedora Websites]     [Anaconda Devel]     [Fedora Devel Java]     [Fedora Legacy]     [Fedora Desktop]     [Fedora Fonts]     [ATA RAID]     [Fedora Marketing]     [Fedora Management Tools]     [Fedora Mentors]     [SSH]     [Fedora Package Review]     [Fedora R Devel]     [Fedora PHP Devel]     [Kickstart]     [Fedora Music]     [Fedora Packaging]     [Centos]     [Fedora SELinux]     [Fedora Legal]     [Fedora Kernel]     [Fedora OCaml]     [Coolkey]     [Virtualization Tools]     [ET Management Tools]     [Yum Users]     [Tux]     [Yosemite News]     [Gnome Users]     [KDE Users]     [Fedora Art]     [Fedora Docs]     [Asterisk PBX]     [Fedora Sparc]     [Fedora Universal Network Connector]     [Libvirt Users]     [Fedora ARM]

  Powered by Linux