Greetings all, I hope this isn't tiptoeing off topic, but I am working on a problem at work right now and was hoping someone here might have some experience or insight.

My company has a new proprietary server which generates PDF chemical safety documents via a REST API and returns them to the user. My project manager wants a layer of separation between the website (and hence the user) and the document server, so I wrote an intermediary script which accepts a request from the website and fetches the PDF from the document server via PHP's cURL extension. That part appears to be working well.

Here is the issue I am trying to solve: we have to assume a total of 1.4 million possible documents which may be generated by this system, each initiated directly from our website. Each document is estimated to be about a megabyte in size, and generating one takes at least a few seconds. We are interested in setting up some kind of document caching system, either a home-brewed PHP-based cache (a rough sketch of what I am picturing is at the end of this post) or a system that generates the files, saves them, and periodically deletes them. My project manager is also concerned about web crawlers kicking off the generation of these files, so we are considering strategies to avoid blowing out our server resources.

Does anyone have any suggestions, or have you dealt with this problem before? Thank you in advance.
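
For reference, here is a minimal sketch of the kind of home-brewed file cache in front of the cURL fetch that I am picturing. The paths, TTL, document-server URL, and parameter names are placeholders rather than our real setup:

```php
<?php
// Sketch of a file-based cache sitting between the website and the document server.
// CACHE_DIR, CACHE_TTL, and DOC_SERVER are placeholder values, not our real config.
const CACHE_DIR  = '/var/cache/safety-docs';
const CACHE_TTL  = 7 * 86400; // keep generated PDFs for a week
const DOC_SERVER = 'https://docserver.example/api/documents/';

// The website passes a document ID; validate it strictly so a crawler or a
// malformed request cannot trigger arbitrary generation.
$docId = $_GET['doc'] ?? '';
if (!preg_match('/^[A-Za-z0-9_-]{1,64}$/', $docId)) {
    http_response_code(400);
    exit('Invalid document id');
}

$cacheFile = CACHE_DIR . '/' . $docId . '.pdf';

// Serve from the cache if this document was generated recently.
if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < CACHE_TTL) {
    header('Content-Type: application/pdf');
    header('Content-Length: ' . filesize($cacheFile));
    readfile($cacheFile);
    exit;
}

// Otherwise ask the document server to generate it (the slow, expensive part).
$ch = curl_init(DOC_SERVER . rawurlencode($docId));
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_TIMEOUT        => 60,
]);
$pdf    = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($pdf === false || $status !== 200) {
    http_response_code(502);
    exit('Document server error');
}

// Cache the result for next time, then return it to the user.
file_put_contents($cacheFile, $pdf, LOCK_EX);
header('Content-Type: application/pdf');
header('Content-Length: ' . strlen($pdf));
echo $pdf;
```

The periodic cleanup could then be a cron job that deletes anything in the cache directory older than the TTL, which is roughly the "generate, save, periodically delete" approach I described above.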