Re: Large(ish) scale pdf file cacheing

On Tue, Mar 4, 2014 at 11:01 AM, Bastien Koert <phpster@xxxxxxxxx> wrote:

> Use the filesystem and store each pdf on the file system
>
> you can use a cron to delete the files older than x
>
Thanks for the suggestion- sounds like that might be the simplest approach.
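For what it's worth, roughly what I am picturing on our end- just a
sketch, and the cache directory, TTL and document server URL below are
placeholders rather than our real values:

<?php
// Serve a cached SDS if we have a fresh copy, otherwise ask the
// document server to generate one and keep the result on disk.
define('CACHE_DIR', '/var/cache/sds');
define('CACHE_TTL', 7 * 86400);   // cron/find can purge anything older
define('DOC_SERVER', 'https://docserver.example.local/api/sds/');

function get_sds($productId)
{
    $cached = CACHE_DIR . '/' . basename($productId) . '.pdf';

    // Already generated recently? Serve straight from disk.
    if (is_file($cached) && (time() - filemtime($cached)) < CACHE_TTL) {
        return $cached;
    }

    // Otherwise hit the document server (generation takes a few seconds).
    $ch = curl_init(DOC_SERVER . rawurlencode($productId));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    $pdf = curl_exec($ch);
    $ok  = ($pdf !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200);
    curl_close($ch);

    if ($ok) {
        file_put_contents($cached, $pdf, LOCK_EX);
        return $cached;
    }
    return false;   // let the caller decide how to report the failure
}

The cron side would then just be a find over the cache directory that
deletes anything older than the TTL.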


> Is each PDF substantially different or just some pertinent details like
> customer name, address etc? Could you template that so that you are
> generating a minimal number of PDFs?
>
The generation is handled by a third-party black-box application. Each
document is substantially different from the others- they are chemical
safety data sheets and hence are specific to the particular chemicals they
represent. (If you are curious/interested, this page from Dow Chemical has
a brief explanation:
http://www.dow.com/productsafety/safety/sds.htm)


> How often does a customer come in and request a new PDF?
>
This is a new system so it is kind of hard to say. For some chemicals we
might anticipate requests several times a week (perhaps several times a
day); for others, very rarely- as little as once ever. One thing we had
considered is creating an SDS hit tracker which could weight the relative
importance of a particular file, so that when the cron job comes around it
could take that into consideration. I am not really sure we would see a
substantial benefit over just a simple find-based cron job. Customers are
not the PM's concern; it is the web spiders.
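
To make the hit tracker idea concrete- again only a sketch, with the
threshold numbers invented for illustration:

<?php
// Per-file hit counter stored next to the cached PDF, so the cleanup
// job can give frequently requested sheets a longer lifetime.
define('CACHE_DIR', '/var/cache/sds');

function record_hit($pdfPath)
{
    $counter = $pdfPath . '.hits';
    $hits = is_file($counter) ? (int) file_get_contents($counter) : 0;
    file_put_contents($counter, $hits + 1, LOCK_EX);
}

// Run from cron: rarely requested sheets expire after a week,
// popular ones (10+ hits) get a month before regeneration.
function purge_cache()
{
    foreach (glob(CACHE_DIR . '/*.pdf') as $pdf) {
        $counter = $pdf . '.hits';
        $hits = is_file($counter) ? (int) file_get_contents($counter) : 0;
        $ttl  = ($hits >= 10) ? 30 * 86400 : 7 * 86400;
        if (time() - filemtime($pdf) > $ttl) {
            unlink($pdf);
            if (is_file($counter)) {
                unlink($counter);
            }
        }
    }
}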


>
>
> On Tue, Mar 4, 2014 at 1:09 PM, George Wilson <rmwscbt@xxxxxxxxx> wrote:
>
>> Greetings all,
>> I hope this is not tip toeing off topic but I am working on solving a
>> problem at work right now and was hoping someone here might have some
>> experience/insight.
>>
>> My company has a new proprietary server which generates pdf chemical
>> safety
>> files via a rest API and returns them to the user. My project manager
>> wants
>> a layer of separation between the website(and hence user) and the document
>> server so I wrote an intermediary script which accepts a request from the
>> website and attempts to grab a pdf from the document server via the php
>> curl system. That appears to be working well.
>>
>> Here is the issue I am trying to solve:
>>
>> We must assume a total of 1.4 million possible documents which may be
>> generated by this system- each initiated directly from our website. Each
>> document is estimated to be about a megabyte in size. Generating each one
>> takes at least a few seconds.
>>
>> We are interested in setting up some kind of document caching system
>> (either a home brewed php based system or a system that generates the
>> files, saves them and periodically deletes them). My project manager is
>> concerned about web crawlers kicking off the generation of these files and
>> so we are considering strategies to avoid blowing out our server
>> resources.
>>
>> Does anyone have any suggestions or have you dealt with this problem
>> before?
>>
>> Thank you in advance
>>
>
>
>
> --
>
> Bastien
>
> Cat, the other other white meat
>
