Re: SQUID store_url_rewrite

Amos Jeffries <squid3@xxxxxxxxxxxxx> · Wed, 01 Jun 2011 12:39:11 +1200

On Tue, 31 May 2011 20:47:13 +0300, Ghassan Gharabli wrote:
I'm sorry again about the last email, but I also have something to ask
about:

(m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)

Now I'm talking about this element, ([\w\d]{2,4}), which seems to match
.ex, .ext or .exte, for example .mp3.

I understand that \w matches an alphanumeric character, including "_",
the same as [A-Za-z0-9_] in ASCII.

So I know it matches digits, letters and the underscore, which is
correct here. The thing that confuses me is that we have also used \d,
which matches a digit, the same as [0-9] in ASCII. So we have covered
0-9 twice! Any comment about it?

No idea. As you say, it seems to be redundant.
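
A quick way to confirm the redundancy is to compare the two character classes one character at a time. This is only a small sketch; the helper name same_ascii_class is made up for the illustration and is not part of any script in this thread.

```perl
use strict;
use warnings;

# Returns 1 if both patterns accept exactly the same single ASCII
# characters, 0 otherwise.
sub same_ascii_class {
    my ($re_a, $re_b) = @_;
    for my $code (0 .. 127) {
        my $ch   = chr $code;
        my $in_a = ($ch =~ $re_a) ? 1 : 0;
        my $in_b = ($ch =~ $re_b) ? 1 : 0;
        return 0 if $in_a != $in_b;
    }
    return 1;
}

# [\w\d] against plain \w, each anchored to a single character.
print same_ascii_class(qr/\A[\w\d]\z/, qr/\A\w\z/) ? "same\n" : "different\n";
```

Since every digit is already a word character, ([\w\d]{2,4}) could be written as (\w{2,4}) without changing what it matches.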

I'm also seeing these URLs again:

#generic http://variable.domain.com/path/filename."ex", "ext" or "exte"
#http://cdn1-28.projectplaylist.com
#http://s1sdlod041.bcst.cdn.s1s.yimg.com

^ means that we match at the beginning of a line or string.
In m/^http:\/\/ ... we used (.*?) at the start, which seems to match
anything!

Yes.

If we look at this URL:
#http://s1sdlod041.bcst.cdn.s1s.yimg.com

If I'm correct, (.*?) matches "s1sdlod041", and then the second
element, (\.[^\.\-]*?\..*?), starts at the "." after "s1sdlod041", so
now we have "http://s1sdlod041.". But I want to know about
"[^\.\-]*?\..*?": inside the [] we used ^ for \. and \-, so is that
because we are also looking for dashes or dots? After that we used "*"
for anything, and then a question mark "?". Something else that
confuses me is "\.." or "\..*?".

(.*?) is meant to match the whole "s1sdlod041.bcst.cdn.s1s", but it can
just as well match "evil.com/?url=http://blah". Then...

Maybe a bug: this should probably be ([\w\-\.]*?) to rule out that
second case.

(\.[^\.\-]*?\..*?) matches ".yimg.com" or ".yimg.com/blah/blah". Then...

Maybe a bug: this should probably be (\.[^\.\-]*?\.[\w]*?) to rule out
that second case and make the next bit match the whole path instead of
just the filename.

\/ matches a "/". Then...

([^\?\&\=]*) matches "filename" or nothing. Then...

\. matches a ".". Then...

([\w\d]{2,4}) matches 2 to 4 alphanumeric characters (or underscores). Then...

\?? matches a '?' or nothing. Then...

.*$ matches anything else.

Maybe a bug: these last two should probably be (\?.*)?$ to avoid a
lot more evilness.
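
To see the effect of the three suggested changes together, the sketch below runs the posted pattern and a tightened one over two sample URLs. The sample URLs and the helper captures are made up for the illustration, and the first group is written with a "*?" quantifier, which is an assumption about what was intended.

```perl
use strict;
use warnings;

# The pattern as posted in the thread.
my $posted = qr/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/;

# The same pattern with the three "maybe a bug" changes applied.
# Assumption: the first group carries a "*?" quantifier.
my $tightened = qr/^http:\/\/([\w\-\.]*?)(\.[^\.\-]*?\.[\w]*?)\/([^\?\&\=]*)\.([\w\d]{2,4})(\?.*)?$/;

# Made-up sample URLs, not taken from real traffic.
my $good = 'http://s1sdlod041.bcst.cdn.s1s.yimg.com/path/file.mp3';
my $evil = 'http://evil.com/?url=http://cdn.example.com/a/b.mp3';

# Returns the capture groups (undef replaced by ''), or an empty list
# when the pattern does not match.
sub captures {
    my ($re, $url) = @_;
    my @groups = ($url =~ $re);
    return map { defined($_) ? $_ : '' } @groups;
}

for my $url ($good, $evil) {
    for my $case (['posted', $posted], ['tightened', $tightened]) {
        my ($name, $re) = @$case;
        my @g = captures($re, $url);
        print "$name: ", (@g ? join(' | ', @g) : 'no match'), "\n";
    }
}
```

On these two samples the tightened pattern splits the host into "s1sdlod041.bcst.cdn.s1s" and ".yimg.com" and rejects the evil.com URL, while the posted pattern accepts both. Note that [^\.\-] still admits "/" and "?", so this is a sketch of the direction rather than a complete fix.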

Another question, about ([^\?\&\=]*): I think this one is for
folders, or what?

I saw the slash \/ before it, so it seems to catch
/?url=blah&C=blah2, with the "*" matching "blah" and "blah2".

But if you don't mind, please explain or illustrate
(\.[^\.\-]*?\..*?) a bit more.

see above.

Explain it your way, as I'm sure you are a good teacher, hehehe.

Please explain the whole match to me:

(m/^http:\/\/(.*?)(\.[^\.\-]*?\..*?)\/([^\?\&\=]*)\.([\w\d]{2,4})\??.*$/)

above.

I was eager to ask you all these questions from the start, but I was
afraid you would not help anyway.

What I have been working on so far is the FileHippo domain:

http://fs34.filehippo.com/6574/058e5771e07c467cb38d70ab6fbed3c0/Opera_1150b1_int_Setup.exe

In this case we have to change the URL into
"cdn.filehippo.com/6574/Opera_1150b1_int_Setup.exe", because we removed
the hashed folder!

It's okay, I have the script for it:

			#cdn, variable 1st path
} elsif (($u =~ /filehippo/) &&
(m/^http:\/\/(.*?)\.(.*?)\/(.*?)\/(.*)\.([a-z0-9]{3,4})(\?.*)?/)) {
	@y = ($1,$2,$4,$5);
	$y[0] =~ s/[a-z0-9]{2,5}/cdn./;
	print $x . "http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3] . "\n";

and it's working 100%. I can get it from the cache too. What if I want
to add wlxrs.com, as in ($u =~ /filehippo|wlxrs/)?

Does that match this URL?

http://css.wlxrs.com/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/OOB_30_IllustratedKai/15.40.1211/img/Kai_Sunny_thumbnail.jpg

I don't think so, as it has "!". Where should I add this to match a
folder like

"/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/"

It will. The "([^\?\&\=]*)" pattern does not prevent '!' or any other 
valid weird characters.
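
The posted filehippo branch can be checked against the wlxrs URL in the same way. Below is a self-contained sketch of that branch with the /filehippo|wlxrs/ alternation from the question. The sub name rewrite_cdn is made up, and the $x prefix is left out because the thread does not show what it holds.

```perl
use strict;
use warnings;

# Self-contained version of the posted branch, with wlxrs added to the
# site check. Returns the rewritten URL, or nothing when the branch
# does not apply.
sub rewrite_cdn {
    my ($u) = @_;
    return unless $u =~ /filehippo|wlxrs/;
    return unless $u =~ m/^http:\/\/(.*?)\.(.*?)\/(.*?)\/(.*)\.([a-z0-9]{3,4})(\?.*)?/;
    my @y = ($1, $2, $4, $5);        # $3, the first folder, is dropped
    $y[0] =~ s/[a-z0-9]{2,5}/cdn./;  # e.g. "css" becomes "cdn."
    return "http://" . $y[0] . $y[1] . "/" . $y[2] . "." . $y[3];
}

print rewrite_cdn('http://css.wlxrs.com/HGjlAVvMlW6-1!iEEpuBkgo2TZKpU8RH!W4mH-UPgteZ8OD6Oxte!sCQWfQ1OB7A6B-NZoBS1jrItq7zq!v10A/OOB_30_IllustratedKai/15.40.1211/img/Kai_Sunny_thumbnail.jpg'), "\n";
# prints http://cdn.wlxrs.com/OOB_30_IllustratedKai/15.40.1211/img/Kai_Sunny_thumbnail.jpg
```

The "!" characters do not get in the way, because (.*?) accepts them too. One thing worth checking: as written, the branch always drops the first folder. For the filehippo URL above that is "6574", so the hashed folder stays in the result rather than being removed.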

Sometimes the CDN folder is the 1st folder, or the 2nd or 3rd;
it depends on the website.

Yes. This comes back to knowing the fine details of what the individual
website or CDN does. The changes have to be customised to individual
sites. If they change anything you have to alter the patterns.

Can you point me to where I should edit this script to handle
WLXRS.COM?

The second maybe-bug I pointed out before, when fixed should make $3 
have the whole file path for you to play with.
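
As a rough sketch of what "playing with" $3 could look like: once $3 holds the whole path, a per-site rule can pick which folder to drop. Everything site-specific here is an assumption. The %drop_folder table and the store_url sub are hypothetical, the first group is written with a "*?" quantifier, the filehippo entry follows the goal stated earlier (drop the hashed second folder), and the wlxrs entry is a guess that the first folder is the variable one.

```perl
use strict;
use warnings;

# Generic pattern from this thread with the suggested changes applied,
# so that $3 holds the whole path without the extension.
# Assumption: the first group carries a "*?" quantifier.
my $generic = qr/^http:\/\/([\w\-\.]*?)(\.[^\.\-]*?\.[\w]*?)\/([^\?\&\=]*)\.([\w\d]{2,4})(\?.*)?$/;

# Hypothetical per-site table: index (0-based) of the folder to drop.
# filehippo: the hashed second folder, as described earlier in the thread.
# wlxrs: a guess that the first folder is the variable one.
my %drop_folder = (
    filehippo => 1,
    wlxrs     => 0,
);

# Returns a normalised URL for known sites, or the URL unchanged.
sub store_url {
    my ($url) = @_;
    return $url unless $url =~ $generic;
    my ($domain, $path, $ext) = ($2, $3, $4);

    for my $site (sort keys %drop_folder) {
        next unless $domain =~ /\.\Q$site\E\./;
        my @folders = split m{/}, $path;
        my $index = $drop_folder{$site};
        # Never drop the last element, which is the file name.
        splice(@folders, $index, 1) if $index < $#folders;
        return "http://cdn$domain/" . join('/', @folders) . ".$ext";
    }
    return $url;
}

print store_url('http://fs34.filehippo.com/6574/058e5771e07c467cb38d70ab6fbed3c0/Opera_1150b1_int_Setup.exe'), "\n";
# prints http://cdn.filehippo.com/6574/Opera_1150b1_int_Setup.exe
```

This sketch discards any query string and replaces the first host label with "cdn" for every listed site; both choices would need checking against how each site actually serves its files.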

Amos