Ok, so what I proposed should work fairly well. To keep spiders off, add a
robots.txt file to the webserver to block them, but hopefully this process is
hidden behind a password or session anyway, to prevent unnecessary generation.
This note from Google might help:
https://support.google.com/webmasters/answer/156449?hl=en

(Rough sketches of the robots.txt rule, the caching wrapper and the cleanup
cron job are at the end of this message.)

On Tue, Mar 4, 2014 at 2:46 PM, George Wilson <rmwscbt@xxxxxxxxx> wrote:

> On Tue, Mar 4, 2014 at 11:01 AM, Bastien Koert <phpster@xxxxxxxxx> wrote:
>
> > Use the filesystem and store each PDF on the file system.
> >
> > You can use a cron job to delete the files older than X.
>
> Thanks for the suggestion - sounds like that might be the simplest approach.
>
> > Is each PDF substantially different, or just some pertinent details like
> > customer name, address, etc.? Could you template that so that you are
> > generating a minimal number of PDFs?
>
> The generation is handled by a 3rd-party black-box application. Each
> document is substantially different from the others - they are chemical
> safety data sheets and hence are specific to the particular chemicals they
> represent. (If you are curious/interested, this page from Dow Chemical has
> a brief explanation: http://www.dow.com/productsafety/safety/sds.htm)
>
> > How often does a customer come in and request a new PDF?
>
> This is a new system, so it is kind of hard to say. For some chemicals we
> might anticipate requests several times a week (perhaps several times a
> day); for others, very rarely - as little as once ever. One thing we had
> considered is creating an SDS hit tracker which could scale the relative
> importance of a particular file, so the cron job could take that into
> consideration when it comes around. I am not really sure we would see a
> substantial benefit over a simple find-based cron job, though. In any case,
> customers are not the PM's concern; it is the web spiders.
>
> > On Tue, Mar 4, 2014 at 1:09 PM, George Wilson <rmwscbt@xxxxxxxxx> wrote:
> >
> >> Greetings all,
> >> I hope this is not tiptoeing off topic, but I am working on solving a
> >> problem at work right now and was hoping someone here might have some
> >> experience/insight.
> >>
> >> My company has a new proprietary server which generates PDF chemical
> >> safety files via a REST API and returns them to the user. My project
> >> manager wants a layer of separation between the website (and hence the
> >> user) and the document server, so I wrote an intermediary script which
> >> accepts a request from the website and attempts to grab a PDF from the
> >> document server via PHP's cURL functions. That appears to be working
> >> well.
> >>
> >> Here is the issue I am trying to solve:
> >>
> >> We must assume a total of 1.4 million possible documents which may be
> >> generated by this system - each initiated directly from our website.
> >> Each document is estimated to be about a megabyte in size, and
> >> generating each one takes at least a few seconds.
> >>
> >> We are interested in setting up some kind of document caching system
> >> (either a home-brewed PHP-based system, or a system that generates the
> >> files, saves them and periodically deletes them). My project manager is
> >> concerned about web crawlers kicking off the generation of these files,
> >> so we are considering strategies to avoid blowing out our server
> >> resources.
> >>
> >> Does anyone have any suggestions, or have you dealt with this problem
> >> before?
> >>
> >> Thank you in advance
> >
> > --
> >
> > Bastien
> >
> > Cat, the other other white meat
>

--
Bastien

Cat, the other other white meat
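
P.S. Here are the rough sketches mentioned above. First, the robots.txt rule.
This assumes the PDFs are requested under a path like /sds/ on the public
site, so adjust the path to wherever the intermediary script actually lives;
and keep in mind that robots.txt only keeps out well-behaved crawlers, so the
session/password check remains the real protection.

User-agent: *
Disallow: /sds/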
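
Next, a minimal sketch of the cache-then-fetch wrapper: serve from a cache
directory when a fresh copy exists, and only hit the document server over
cURL on a miss. The document-server URL, the cache path, the "doc" parameter
and the 30-day lifetime are placeholders I made up for illustration, not your
real values.

<?php
// Hypothetical cache-then-fetch wrapper; all paths and URLs are placeholders.
// The session/password check discussed above would go before any of this.
$docId = isset($_GET['doc']) ? $_GET['doc'] : '';
$docId = preg_replace('/[^A-Za-z0-9_-]/', '', $docId); // keep the id filesystem-safe

if ($docId === '') {
    http_response_code(400);
    exit('Missing document id');
}

$cacheDir  = '/var/cache/sds';
$cacheFile = $cacheDir . '/' . $docId . '.pdf';
$maxAge    = 30 * 24 * 3600; // 30 days, adjust to taste

if (!is_dir($cacheDir)) {
    mkdir($cacheDir, 0750, true);
}

// Serve straight from disk if we already have a fresh copy
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    header('Content-Type: application/pdf');
    readfile($cacheFile);
    exit;
}

// Cache miss: ask the (placeholder) document server to generate the PDF
$ch = curl_init('http://docserver.internal/api/sds/' . rawurlencode($docId));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 60); // generation can take a few seconds
$pdf    = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($pdf === false || $status !== 200) {
    http_response_code(502);
    exit('Could not retrieve document');
}

file_put_contents($cacheFile, $pdf, LOCK_EX);
header('Content-Type: application/pdf');
echo $pdf;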
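
Finally, the cleanup side, meant to be run from cron. A simple find one-liner
does the same thing (something along the lines of
find /var/cache/sds -name '*.pdf' -mtime +30 -delete), but a small PHP script
leaves room to bolt on the hit-tracker idea later. The directory and the
cutoff are again placeholders.

<?php
// Hypothetical cache sweeper; run it from cron, e.g.:
//   0 3 * * * /usr/bin/php /path/to/clean_sds_cache.php
$cacheDir = '/var/cache/sds';
$maxAge   = 30 * 24 * 3600; // remove PDFs untouched for 30 days

$files = glob($cacheDir . '/*.pdf');
if ($files) {
    foreach ($files as $file) {
        if (time() - filemtime($file) > $maxAge) {
            unlink($file);
        }
    }
}

If the wrapper above also touch()es the cached file on every hit, this sweep
effectively becomes a least-recently-used expiry, which should get you most
of what the hit tracker would.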