Ok, so what I proposed should work fairly well. To keep spiders off, add a
robots.txt file to the webserver to block them, but hopefully this process is
hidden behind a password or session anyway, to prevent unnecessary generation.
This note from Google might help:
https://support.google.com/webmasters/answer/156449?hl=en

(Rough sketches of the robots.txt rule, the caching wrapper and the cleanup
cron job are at the end of this message.)

On Tue, Mar 4, 2014 at 2:46 PM, George Wilson <rmwscbt@xxxxxxxxx> wrote:

> On Tue, Mar 4, 2014 at 11:01 AM, Bastien Koert <phpster@xxxxxxxxx> wrote:
>
> > Use the filesystem and store each PDF on the file system.
> >
> > You can use a cron job to delete the files older than X.
>
> Thanks for the suggestion - sounds like that might be the simplest approach.
>
> > Is each PDF substantially different, or just some pertinent details like
> > customer name, address, etc.? Could you template that so that you are
> > generating a minimal number of PDFs?
>
> The generation is handled by a 3rd-party black-box application. Each
> document is substantially different from the others - they are chemical
> safety data sheets and hence are specific to the particular chemicals they
> represent. (If you are curious/interested, this page from Dow Chemical has
> a brief explanation: http://www.dow.com/productsafety/safety/sds.htm)
>
> > How often does a customer come in and request a new PDF?
>
> This is a new system, so it is kind of hard to say. For some chemicals we
> might anticipate requests several times a week (perhaps several times a
> day); for others, very rarely - as little as once ever. One thing we had
> considered is creating an SDS hit tracker which could scale the relative
> importance of a particular file, so the cron job could take that into
> consideration when it comes around. I am not really sure we would see a
> substantial benefit over a simple find-based cron job, though. In any case,
> customers are not the PM's concern; it is the web spiders.
>
> > On Tue, Mar 4, 2014 at 1:09 PM, George Wilson <rmwscbt@xxxxxxxxx> wrote:
> >
> >> Greetings all,
> >> I hope this is not tiptoeing off topic, but I am working on solving a
> >> problem at work right now and was hoping someone here might have some
> >> experience/insight.
> >>
> >> My company has a new proprietary server which generates PDF chemical
> >> safety files via a REST API and returns them to the user. My project
> >> manager wants a layer of separation between the website (and hence the
> >> user) and the document server, so I wrote an intermediary script which
> >> accepts a request from the website and attempts to grab a PDF from the
> >> document server via PHP's cURL functions. That appears to be working
> >> well.
> >>
> >> Here is the issue I am trying to solve:
> >>
> >> We must assume a total of 1.4 million possible documents which may be
> >> generated by this system - each initiated directly from our website.
> >> Each document is estimated to be about a megabyte in size, and
> >> generating each one takes at least a few seconds.
> >>
> >> We are interested in setting up some kind of document caching system
> >> (either a home-brewed PHP-based system, or a system that generates the
> >> files, saves them and periodically deletes them). My project manager is
> >> concerned about web crawlers kicking off the generation of these files,
> >> so we are considering strategies to avoid blowing out our server
> >> resources.
> >>
> >> Does anyone have any suggestions, or have you dealt with this problem
> >> before?
> >>
> >> Thank you in advance
> >
> > --
> >
> > Bastien
> >
> > Cat, the other other white meat
>

--
Bastien

Cat, the other other white meat
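
P.S. Here are the rough sketches mentioned above. First, the robots.txt rule.
This assumes the PDFs are requested under a path like /sds/ on the public
site, so adjust the path to wherever the intermediary script actually lives;
and keep in mind that robots.txt only keeps out well-behaved crawlers, so the
session/password check remains the real protection.

User-agent: *
Disallow: /sds/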
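
Next, a minimal sketch of the cache-then-fetch wrapper: serve from a cache
directory when a fresh copy exists, and only hit the document server over
cURL on a miss. The document-server URL, the cache path, the "doc" parameter
and the 30-day lifetime are placeholders I made up for illustration, not your
real values.

<?php
// Hypothetical cache-then-fetch wrapper; all paths and URLs are placeholders.
// The session/password check discussed above would go before any of this.
$docId = isset($_GET['doc']) ? $_GET['doc'] : '';
$docId = preg_replace('/[^A-Za-z0-9_-]/', '', $docId); // keep the id filesystem-safe

if ($docId === '') {
    http_response_code(400);
    exit('Missing document id');
}

$cacheDir  = '/var/cache/sds';
$cacheFile = $cacheDir . '/' . $docId . '.pdf';
$maxAge    = 30 * 24 * 3600; // 30 days, adjust to taste

if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0750, true);
}

// Serve straight from disk if we already have a fresh copy
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    header('Content-Type: application/pdf');
    readfile($cacheFile);
    exit;
}

// Cache miss: ask the (placeholder) document server to generate the PDF
$ch = curl_init('http://docserver.internal/api/sds/' . rawurlencode($docId));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // generation can take a few seconds
$pdf    = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($pdf === false || $status !== 200) {
    http_response_code(502);
    exit('Could not retrieve document');
}

file_put_contents($cacheFile, $pdf, LOCK_EX);
header('Content-Type: application/pdf');
echo $pdf;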
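
Finally, the cleanup side, meant to be run from cron. A simple find one-liner
does the same thing (something along the lines of
find /var/cache/sds -name '*.pdf' -mtime +30 -delete), but a small PHP script
leaves room to bolt on the hit-tracker idea later. The directory and the
cutoff are again placeholders.

<?php
// Hypothetical cache sweeper; run it from cron, e.g.:
//   0 3 * * * /usr/bin/php /path/to/clean_sds_cache.php
$cacheDir = '/var/cache/sds';
$maxAge   = 30 * 24 * 3600; // remove PDFs untouched for 30 days

$files = glob($cacheDir . '/*.pdf');
if ($files) {
    foreach ($files as $file) {
        if (time() - filemtime($file) > $maxAge) {
            unlink($file);
        }
    }
}

If the wrapper above also touch()es the cached file on every hit, this sweep
effectively becomes a least-recently-used expiry, which should get you most
of what the hit tracker would.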