On Mar 13, 2013 7:06 PM, "David Robley" <robleyd@xxxxxxxxxxx> wrote:
>
> "Dale H. Cook" wrote:
>
> > At 05:04 PM 3/13/2013, Dan McCullough wrote:
> >
> >> Web bots can ignore the robots.txt file, most scrapers would.
> >
> > and at 05:06 PM 3/13/2013, Marc Guay wrote:
> >
> >> These don't sound like robots that would respect a txt file to me.
> >
> > Dan and Marc are correct. Although I used the terms "spiders" and
> > "pirates," I believe that the correct term, as employed by Dan, is
> > "scrapers," and that term might be applied to either the robot or the
> > site which displays its results. One blogger has called scrapers "the
> > arterial plaque of the Internet." I need to implement a solution that
> > allows humans to access my files but prevents scrapers from accessing
> > them. I will undoubtedly have to implement some type of
> > challenge-and-response in the system (such as a captcha), but as long as
> > those files are stored below the web root, a scraper that has a valid URL
> > can probably grab them. That is part of what the "public" in public_html
> > implies.
> >
> > One of the reasons why this irks me is that the scrapers are all
> > commercial sites, but they haven't offered me a piece of the action for
> > the use of my files. My domain is an entirely non-commercial domain, and
> > I provide free hosting for other non-commercial genealogical works,
> > primarily pages that are part of the USGenWeb Project, which is perhaps
> > the largest of all non-commercial genealogical projects.
>
> readfile() is probably where you want to start, in conjunction with a
> captcha or similar.
>
> --
> Cheers
> David Robley
>
> Catholic (n.) A cat with a drinking problem.

If the files are delivered via the web, by PHP or some other means, even if
located outside the webroot, they'd still be scrapeable.
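Roughly what David is suggesting, as a sketch only: keep the files above the
webroot and hand them out through a small PHP gate that calls readfile() once
some human check has been passed. The directory and the
$_SESSION['captcha_passed'] flag below are illustrative names, not anything
from Dale's actual setup:

<?php
// download.php - serves a file stored outside public_html, but only after
// the visitor has passed some human check (e.g. a captcha page elsewhere
// that sets $_SESSION['captcha_passed'] - an illustrative name only).
session_start();

if (empty($_SESSION['captcha_passed'])) {
    header('HTTP/1.1 403 Forbidden');
    exit('Please complete the captcha first.');
}

// Base directory sits above the webroot; basename() strips any ../ tricks.
$base = '/home/example/private_files/';
$name = isset($_GET['file']) ? basename($_GET['file']) : '';
$path = realpath($base . $name);

if ($path === false || strpos($path, $base) !== 0 || !is_file($path)) {
    header('HTTP/1.1 404 Not Found');
    exit('No such file.');
}

header('Content-Type: application/octet-stream');
header('Content-Length: ' . filesize($path));
header('Content-Disposition: attachment; filename="' . $name . '"');
readfile($path);

But, as above, that only raises the bar: a scraper that solves the captcha
once and keeps the session cookie can still fetch every URL it knows about,
so this limits casual scraping rather than preventing it.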