Hi. I'd like some advice on storing blobs (millions of them), ranging from very small to large (> 1GB). I know about bytea versus lo, and I have probably read most of what's out there about them :)

Up front, I have to state that I'm not keen on lo, because of security considerations. We store blobs in many different schemas, and users can access some schemas and not others. Since the lo table is unique for the whole database, it would let users see blobs from any schema, as I understand it. Right? OTOH, lo has random access, which I also need... (I'm also not a fan of lo needing triggers for lifetime management, nor of the requirement to have an explicit transaction to use lo.)

Here are the main requirements I need to fulfil:

1) Store literally millions of rows, one "blob" per row (scientific data).
2) Most "blobs" are smallish, below a few KBs. bytea is perfect for that.
3) Yet many blobs are in the dozens of MBs. bytea is still OK.
4) But some blobs exceed the 1GB bytea limit. I found at least a dozen just in our test data, so clients will have them.
5) For accessing the larger blobs, the API I must implement reads contiguous chunks of the blobs. Thus I need random access.

I'm porting the backend of that API from Oracle to PostgreSQL. In Oracle we used RAW for smaller values, and SecureFile blobs for larger ones. Oracle LOBs are similar to lo, I guess: they provide random access and perform well at large sizes, but are slower than bytea for smaller ones. And they're a PITA to deal with: locators, broken pre-fetching, etc. PostgreSQL bytea is much better and simpler, except that it's limited to 1GB...

bytea also has no direct random access, except via substr[ing], but how efficient and "random access" is that? For example, SQLite stores large values in overflow pages, and has to pointer-chase the chain of those pages for large blobs, so its substr() "random access" is NOT O(1) and involves way too much I/O, i.e. it's O(N). Do TOAST'ed large values (text or bytea) fare better in terms of substr() efficiency, closer to O(log N) and reading only what's actually requested by the substr() (modulo the partial ~2KB TOAST chunks at either end of the range)? (See the example below for the kind of access pattern I mean.)

Some posts on the internet also peg lo as faster than bytea above 20MB, for example. Is that even true? One post I read (from Daniel Verite, I think) kind of hinted that the lo table is not that different from TOAST tables, with 2K chunks, so I don't see why lo would be any faster than a TOAST'ed bytea. Any insights someone with knowledge of the internals can share on this?

At this point, I have not yet redone the extensive benchmarking we did a few years ago between Oracle and PostgreSQL, which compared lo versus bytea back then. And benchmarks are hard to do without easy access to different kinds of servers with different storage backends, or to cloud-hosted PG. It's too easy to pick up bias from a local setup, leading to a design that would not perform optimally elsewhere. That's why I'm asking a more open-ended question of the experts on this list.

Because of #4 above, I need to either use lo (but see above, notably the security concern), or roll my own "TOAST^2" (TOAST squared), as I call it, where I manually "shard" / chunk large blobs into an auxiliary table, itself TOAST'ed of course, with some chunk-size threshold (e.g. 1, 4, or 16 MB perhaps). The latter keeps the blobs in the schema (good for security), lifetime is managed by FKs (as usual), and the sharding limits the ill effects of "emulating" random access with substr() if necessary. (A rough sketch of that layout follows below.)
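To be concrete about the access pattern, here is roughly what the API needs, on a hypothetical table (sample / payload / sample_id are placeholder names, not our actual schema). From what I've read, the column would have to use EXTERNAL storage (out-of-line but uncompressed) for substring() to be able to fetch only the TOAST chunks covering the requested slice; please correct me if that assumption is wrong:

CREATE TABLE sample (
    sample_id bigint PRIMARY KEY,
    payload   bytea NOT NULL
);

-- Keep large values out-of-line and uncompressed, so that (hopefully)
-- substring() only reads the TOAST chunks it actually needs:
ALTER TABLE sample ALTER COLUMN payload SET STORAGE EXTERNAL;

-- Typical random access: 64 KB at byte offset 100,000,000 (substring is 1-based):
SELECT substring(payload FROM 100000001 FOR 65536)
FROM sample
WHERE sample_id = 42;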
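And to make the "TOAST^2" idea more concrete, here is a minimal sketch of the layout I have in mind (all names and the fixed 1 MB chunk size are placeholders, not a finished design):

-- One row per blob, in the project's schema; chunk rows are TOAST'ed as usual.
CREATE TABLE blob (
    blob_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    length  bigint NOT NULL                 -- total size in bytes
);

CREATE TABLE blob_chunk (
    blob_id bigint  NOT NULL REFERENCES blob ON DELETE CASCADE,
    seq     integer NOT NULL,               -- 0-based chunk index
    data    bytea   NOT NULL,               -- at most 1 MB (1048576 bytes) each
    PRIMARY KEY (blob_id, seq)
);
ALTER TABLE blob_chunk ALTER COLUMN data SET STORAGE EXTERNAL;

-- Random access: read p_len bytes at 0-based offset p_off, touching only
-- the chunk rows that overlap the requested range.
CREATE FUNCTION blob_read(p_id bigint, p_off bigint, p_len int)
RETURNS bytea LANGUAGE sql STABLE AS $$
  SELECT string_agg(
           substring(data
                     FROM (greatest(p_off - seq::bigint * 1048576, 0) + 1)::int
                     FOR  (least(p_off + p_len - seq::bigint * 1048576, 1048576)
                           - greatest(p_off - seq::bigint * 1048576, 0))::int),
           ''::bytea ORDER BY seq)
  FROM blob_chunk
  WHERE blob_id = p_id
    AND seq BETWEEN p_off / 1048576
                AND (p_off + p_len - 1) / 1048576;
$$;

With 1 MB chunks, a 64 KB read at an arbitrary offset touches at most two chunk rows, lifetime follows the FK, and everything stays in the project's schema.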
I've already done things like this in SQLite land (which also has a 1GB limit for its text and blob types). So is this a terrible idea? What alternatives do I have? I'd really appreciate some expert advice on the above, before I go too far down the rabbit hole.

Thanks, --DD

PS: Another drawback of lo is that, because it's a single table, it is still subject to the 32 TB limit on a relation. The 4 TB limit per large object is way more than we need, but the 32 TB limit may actually be an issue for our larger clients, which have thousands of projects, each with upwards of a few million of those blobs. With bytea, the values are stored in per-project schemas (a design constraint), so that limit applies per project, which will be more than enough. For the sum of all projects in a single lo table, maybe not: with a real client case of 3K projects, that leaves an average of only ~10 GB of lo's per project (i.e. per schema), which could very well be problematic...
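For what it's worth, the back-of-the-envelope arithmetic behind that last figure, assuming the 32 TB were spread evenly over ~3K project schemas:

SELECT pg_size_pretty(32::numeric * 1024 * 1024 * 1024 * 1024 / 3000);
-- ~ 11 GB of lo storage per project, on average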