Binary Data / Strings

Background:

I'm using cURL to snarf down a web page and examine an image in that
page -- where I would like to be able to use
http://php.net/imagecolorat on the image.

There are some wrinkles, however, best explained by a slimmed-down
sample program:
<?php
function foo(){
  global $curl;
  if (!isset($curl)) $curl = curl_init();

  //Fetch HTML
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); //Return the body as a string
  curl_setopt($curl, CURLOPT_COOKIEJAR, 'cookies');
  curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookies');
  curl_setopt($curl, CURLOPT_URL, 'http://example.com');
  $html = curl_exec($curl);

  //Fetch image:
  preg_match('/<img src="([^"]*)">/', $html, $image_url);
  $image_url = $image_url[1];
  curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1); //Getting binary data
  curl_setopt($curl, CURLOPT_URL, $image_url);
  $image_string = curl_exec($curl);
  curl_setopt($curl, CURLOPT_BINARYTRANSFER, 0); //Set it BACK to text!

  //SOMETIMES that URL sends me this for an image:
  $bad_data = '<META HTTP-EQUIV=Refresh URL="0;http://example.com";>';
  if (stristr($image_string, $bad_data)){
    //start all over again:
    return foo();
  }

  //Use GD to get image:
  $image = imagecreatefromstring($image_string);

  //Begin analysis
  //irrelevant to the problem, deleted.
  $result = 'foo';

  return $result;
}

//Assume  the image changes on every page hit, and we call foo() a LOT
for ($i = 0; $i < 100000; $i++){
  echo foo();
  sleep(mt_rand(1, 5)); //Don't kill their server
}

?>

NOW, for the problem[s].

#1.
If I don't use BINARYTRANSFER, then imagecreatefromstring segfaults,
pretty much every time.  Well usually, anyway.

Presumably, that's because cURL/PHP are pretending the string is
null-terminated when it's not, and then handing a corrupted image
string to GD, and that's bad.  Or, perhaps, without BINARYTRANSFER,
some sort of CRLF correction is corrupting the binary data.  I dunno,
really.  I just figured I got binary data coming in, and I must want
BINARYTRANSFER, based on what I can find documented.

So, assuming BINARYTRANSFER means what I think it means, I need that.

I've put in a bug report here, and pajoye is being VERY helpful in
hopefully getting the segfault turned into an E_ERROR instead:

http://bugs.php.net/bug.php?id=37005

So this one will probably get resolved, eventually.

But I'm hoping for a pointer to a longer explanation of what
BINARYTRANSFER actually does, as I've only found rather circular/brief
definitions so far on php.net and I'm not finding anything on the
libcurl page here:
http://curl.haxx.se/libcurl/c/curl_easy_setopt.html

A quick Google also yielded only the barest circular definition:
CURLOPT_BINARYTRANSFER
TRUE to return the raw output when CURLOPT_RETURNTRANSFER is used.

I mean, yeah, guys, I know what the words BINARY and "raw output"
mean, and the docs pretty much tell me they are synonyms...  That's
not real useful, eh? :-) :-) :-)

#2.
Once I start using BINARYTRANSFER, however...

I *still* get segfaults sometimes, even when everything else seems to
be okay.
This is happening in *all* of these versions, each compiled as CGI and
run from the command line:
PHP 5.0.4
PHP 5.1.2
PHP 5.1.2RC3

So, perhaps my use of BINARYTRANSFER is completely wrong, and merely
masks the real problem a little bit?

The segfault DOES happen at different points in the different versions
of PHP: 5.0.4 segfaults within the call to imagecreatefromstring(),
while 5.1.2RC3 segfaults at some later point.

#3.
It seems like once I set BINARYTRANSFER to 1, setting it back to 0 is
not taking effect...

I say this because after a recursive call to foo() to start over, I
get $html filled with data such as:

<html><head>...</head><body>...</body></html>ZZZZZZZ...ZZZZ?more
garbage data

I.E., it seems like curl and/or PHP are ignoring the null terminator,
and using some other indicator to define the end of a string.

As additional evidence, I get messages such as:

Run-time warning. String is not zero-terminated (   ) (source:
/php-5.1.2/Zend/zend_variables.h:45) in /script.php:128
/php-5.1.2/Zend/zend-hash.c(754) : ht=0x8381124 is being cleaned

Now, I dunno what all that is supposed to mean, but I'm pretty sure
it's a sign of things going drastically wrong with a string being
treated as binary data when it's not or vice versa...

Is it not possible to switch CURLOPT_BINARYTRANSFER back to 0 ?

Or is 0 treated as TRUE in cURL and I need FALSE?  Surely not, right,
since PHP handles that internally...

#4.
The complaint about a string not being zero-terminated is happening on
the line such as:
if (stristr($image_string, $bad_data)){

stristr is supposed to be "binary-safe"

My assumption, then, was that I could search inside of a binary data
string (a valid image) for a particular pattern (the HTML they send
out instead of a JPEG sometimes) to detect when they've done that...

So, apparently, "binary-safe" doesn't mean what I think it means...
Or I've found another bug in PHP? Unlikely.

What does binary-safe actually MEAN anyway?
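
To make my assumption concrete, here is a minimal sketch of what I
take "binary-safe" to mean: the function works from the string's
stored length, so embedded NUL bytes don't cut the search short.  The
sample bytes are made up for illustration, not real image data, and
whether this is all the docs mean by the term is exactly my question:

```php
<?php
// Made-up stand-in for a fetched response: a few binary bytes with NULs,
// followed by the HTML marker I am trying to detect.
$bad_data = '<META HTTP-EQUIV=Refresh';
$image_string = "\xFF\xD8\x00\x00junk\x00" . $bad_data . "\x00tail";

// strlen() reports the stored byte count, NUL bytes included.
$length = strlen($image_string);

// stristr() returns the rest of the string from the match, or false.
// The search is case-insensitive and should look past the NUL bytes.
$found = stristr($image_string, 'http-equiv=refresh') !== false;

// stripos() gives the byte offset of the same case-insensitive match.
$offset = stripos($image_string, $bad_data);
```

If my understanding is right, the marker is found even though three
NUL bytes come before it.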

#5.
Is there some way to distinguish between a binary string and a
"normal" string?
I.E.:
$image_string = file_get_contents('image.jpg');
if (is_binary_string($image_string))

There does not seem to be an "is_binary_string" function, just
"is_string"...

How can I check?
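
Lacking such a function, one workaround I could imagine is looking at
the first few bytes for a known image signature before handing the
data to GD.  This is only a sketch: guess_image_type() is a
hypothetical helper of my own, not a PHP built-in, and a matching
signature does not prove the rest of the data is a valid image:

```php
<?php
// Hypothetical helper (not a PHP built-in): guess the image type from the
// leading "magic" bytes, or return false if none of them match.
function guess_image_type($data)
{
    $signatures = array(
        'jpeg' => "\xFF\xD8\xFF",
        'png'  => "\x89PNG\r\n\x1A\n",
        'gif'  => 'GIF8',
    );
    foreach ($signatures as $type => $magic) {
        // strncmp() compares only the first strlen($magic) bytes.
        if (strncmp($data, $magic, strlen($magic)) === 0) {
            return $type;
        }
    }
    return false;
}
```

In foo(), a false result would be my cue to start over instead of
calling imagecreatefromstring() on the data.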


Some things I have considered:

If I abandon the COOKIEJAR/COOKIEFILE, I can manage the cookies
myself and, hopefully, detect some headers that the server sends out
before this goofy META tag to refresh the whole page when I've asked
for a JPEG.
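
A rough sketch of the header idea, assuming the server sends an
honest Content-Type (which I have not verified).
is_image_content_type() is a hypothetical helper of my own; the value
passed in would come from curl_getinfo($curl, CURLINFO_CONTENT_TYPE)
after curl_exec():

```php
<?php
// Hypothetical helper: decide from a Content-Type header value whether the
// response claims to be an image.
function is_image_content_type($content_type)
{
    if (!is_string($content_type)) {
        return false; // no Content-Type header was reported
    }
    // Case-insensitive check that the value starts with "image/".
    return strncasecmp(ltrim($content_type), 'image/', 6) === 0;
}
```

This only tells me what the server claims to be sending; a META
refresh page served with an image Content-Type would still slip
through.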

I may be wasting my time with this BINARYTRANSFER because it's not
what I think it is.

[The following will make sense only to a select few readers...]
Given the PHP / GD double-free bug and the solutions of bundling
and/or upgrading to the latest GD and PHP which have functions
specifically so that PHP can free the RAM instead of GD doing it, what
combination of PHP / GD versions and bundle/separate is *most* stable
for CGI/CLI usage?

Does CGI versus CLI have any real effect on GD?...  Maybe I've missed
a whole thread of research here.  Never remember why CGI/CLI are
different, though I've re-read the page a lot.  Always seems a whole
lot of nothing to my usage habits, as I recall, but maybe I'm missing
an implication in my reading.

I did my initial tests with this script using local image files
instead of curl and the real data, as I didn't want to pound the
server with development mistakes (infinite loops etc).  The original
data was acquired with basic PHP file operations from that server,
though, so I thought I was using exemplary sample data.  So only when
I try to use cURL to get my image does everything fall apart.

I can save the cURL fetched data using file_put_contents, and, sure
enough, I can crash the script using those local files, when I don't
use BINARYTRANSFER.
Saved files from cURL using BINARYTRANSFER do not seem to crash, or at
least not nearly as often.
The main exception is that if I save their "META Refresh" response as
a JPEG file and try imagecreatefromstring(file_get_contents()) on
that, I can segfault.

I guess I'm at the point now where I have so MANY theories about what
to try next to tackle this, that I just don't know what to do.

Any collective wisdom from the list on which road to take would be
most welcome.

Thanks for reading.  Sorry it got so long, but I'm really thrashing
around here, trying to make heads or tails of any of this.

-- 
Like Music?
http://l-i-e.com/artists.htm

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

