Re: Parsing images

David Tulloh <david@xxxxxxxxxxxx> · Fri, 12 May 2006 23:47:42 +1000

Robert Cummings wrote:
> On Thu, 2006-05-11 at 13:48, tedd wrote:
> 
>>At 12:11 PM -0400 5/11/06, Robert Cummings wrote:
>>
>>>On Thu, 2006-05-11 at 11:47, tedd wrote:
>>>
>>>> At 9:28 AM +0300 5/11/06, Dotan Cohen wrote:
>>>> >Hey all, is it possible to parse captchas in PHP? I'm not asking how
>>>> >to do it, nor have I any need, it's just something that I was
>>>> >discussing with a friend. My stand was that ImageMagick could crack
>>>> >them. She says no way. What are your opinions?
>>>> >
>>>> >Thanks.
>>>> >
>>>> >Dotan Cohen
>>>> >http://what-is-what.com
>>>>
>>>> Of course -- it's trivial.
>>>>
>>>> All images can be broken down into signals and analyzed as such. If
>>>> you have any coherent data, it will show up. If it has to conform to
>>>> glyphs, it most certainly can be identified.
>>>>
>>>> You want something that's not trivial, take a look at medical imaging
>>>> and analysis thereof.
>>>
>>>Extracting passcodes from captcha text is not what I'd call trivial.
>>>It's one thing to pull trends out of an image, it's quite another to
>>>know that a curvy line is the morphed vertical base of the capital
>>>letter T. Similarly, knowing that the intensity of red in an area is
>>>related to the existence of some radioactive tracer agent isn't quite
>>>the same as knowing that the curvy letter T might be red, yellow, green,
>>>yellow blended to green, etc. The human eye and brain are amazing
>>>accomplishments, and while someday we may match their ability in code, I
>>>don't think it's this year.
>>
>>We've been doing edge detection, noise suppression, data analysis, 
>>and OCR for over 30 years. While it may not be obvious, it's still 
>>trivial in the overall scheme of things. The bleeding edge is far 
>>beyond this technology.
> 
> 
> Edge detection, noise suppression, and data analysis don't quite equate
> to recognition. Also 30 years of OCR still requires that the sample be
> good quality and conform to fairly detectable patterns. If this is so
> trivial, I await the release of your captcha parser. The spammers would
> probably pay you millions for it. Where exactly is this bleeding edge,
> and where can I read more about it? I think you're quite wholeheartedly
> being naive about the complexity of visual recognition. Prove me wrong.

I also agree that most are breakable.  I've done a small amount of
character recognition processing, and most of the captchas I've seen
could be broken.  The ones that look hard, such as the bugs.php.net
captcha, I end up getting wrong about a third of the time.

There is a substantial difference between standard OCR and captcha
breaking.  With OCR you need to get it right 99% of the time.  With a
captcha, getting it right one time in 1000 is enough: a script that
keeps retrying can still get into a website several times a second.
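
To put rough numbers on that, here is a minimal sketch in Python.  The
1-in-1000 success rate is the figure above; the attempt rate of 3000
guesses a second is purely an assumed number for illustration, and both
function names are made up for this example.

```python
import math


def expected_successes_per_second(success_rate: float,
                                  attempts_per_second: float) -> float:
    """Average number of correct captcha guesses per second."""
    return success_rate * attempts_per_second


def attempts_for_confidence(success_rate: float, confidence: float) -> int:
    """Smallest number of independent attempts n such that the chance
    of at least one success, 1 - (1 - p)^n, reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - success_rate))


# Assumed figures: 1 correct guess in 1000, 3000 guesses a second.
rate = expected_successes_per_second(0.001, 3000)
needed = attempts_for_confidence(0.001, 0.5)
```

With those assumed figures the script averages about three successful
logins a second, and it needs roughly 700 tries for an even chance of
at least one hit.  An OCR package with that accuracy would be useless,
but as a captcha breaker it is perfectly adequate.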

Really though, there are easier ways to do it.  My favorite story was a
small free porn site that required you to enter a captcha to get in.
Its operators were taking the captchas they needed to break elsewhere
and getting horny teenagers to do the recognition phase for them.

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php