Dear fellow list members,
I'm in the process of implementing a file storage system based on PostgreSQL and streaming replication. There will likely be many similar files stored. I would like to implement block-level deduplication: each file consists of a series of blocks, and each unique block is stored only once (one block may be reused by multiple files). This will be part of a larger application, i.e. the same database will also be used for other purposes.
Here is the basic idea for storing individual blocks:
create extension if not exists pgcrypto; -- for digest()

create table block(
    id uuid not null primary key,
    block bytea not null,  -- raw block contents
    hs256 bytea not null   -- SHA-256 digest of block, maintained by the trigger below
);
create unique index uidx_block_hs256 on block(hs256);
create or replace function trg_biu_block() returns trigger language plpgsql as
$function$
begin
    new.hs256 = digest(new.block, 'sha256');
    return new;  -- a BEFORE trigger must return the (possibly modified) row
end;
$function$;
create trigger trg_biu_block before insert or update on block
    for each row execute procedure trg_biu_block();
This is just for storing the blocks. I'm going to put this "block" table into a separate tablespace. File operations will be at least 95% read and at most 5% write. (Streaming replication will hopefully allow near-horizontal scaling for read operations.) Most of the files will be several megabytes in size (images), and some of them will be 100MB or more (videos). Total capacity is in the 10TB range. Storage will be SSD (possibly fiber-attached or local RAID; we are not sure yet).
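For completeness, the file-to-block mapping I have in mind is roughly the following (just a sketch; table and column names are placeholders, not the final schema):

create table file(
    id uuid not null primary key,
    name text not null
);
create table file_block(
    file_id uuid not null references file(id),
    idx int not null,  -- position of the block within the file
    block_id uuid not null references block(id),
    primary key (file_id, idx)
);

A block write would then probably be an "insert into block ... on conflict (hs256) do nothing", followed by a lookup of the existing block's id when the insert was skipped, but I have not settled on the exact insert path yet.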
I do not want to use PostgreSQL large objects, because they do not offer block-level deduplication.
Here are some things that I need help with:
1. What should be the maximum size of a block? I was trying to work out the optimal value. The default BLCKSZ is 8192 bytes, and AFAIK PostgreSQL does not allow a row to occupy multiple blocks (oversized values are moved out to TOAST instead). I don't know enough to calculate the optimal "block size" (the maximum number of bytes stored in a single row in the block.block field), but I suspect that it should be 4K or something similar. I think it should be as large as possible without the value being pushed out to TOAST. My thinking is this: most files will be at least 1MB in size, so most "block" rows will reach the maximum tuple size, and it would be practical for one row in the "block" table to occupy almost one PostgreSQL block. (There is a small probe sketched after this list that I plan to use to see where TOAST kicks in.)
2. I'm not sure whether it would be beneficial to increase BLCKSZ. I will be able to test the speed of my (not yet finished) implementation with different BLCKSZ values, but that alone won't help me make the decision: AFAIK BLCKSZ is fixed at compile time for the whole PostgreSQL instance, so it affects all other tables in the database, and it would be hard to tell how changing it would affect the system as a whole.
3. In the above example I used SHA-256 (pgcrypto), because apparently it is very well optimized for 64-bit machines and has a practically zero chance of collision. I think SHA-512 would be overkill. But I'm not sure this is the best choice; maybe somebody with more experience can suggest a better hash function. (A crude timing comparison is sketched after this list.)
4. The hs256 value will always be a non-null, fixed 32-byte binary value, but the query planner will probably not know anything about that. I was also thinking about bit(256), but I don't see an easy way to convert the bytea digest into bit(256); a simple type cast won't work here. Maybe using bytea here is perfectly fine, and an index on the hs256 bytea column is as effective as it can be.
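To attack question 1 empirically, this is the rough, throwaway probe I have in mind (it relies on pgcrypto; the table name, payload size and row count are arbitrary placeholders, and it is a sketch rather than production code):

show block_size;  -- compiled-in BLCKSZ of the test instance

create table block_probe(
    id uuid not null primary key,
    block bytea not null
);

-- ~4 KB of incompressible data per row (gen_random_bytes() returns at most 1024 bytes per call)
insert into block_probe (id, block)
select gen_random_uuid(),
       gen_random_bytes(1024) || gen_random_bytes(1024) ||
       gen_random_bytes(1024) || gen_random_bytes(1024)
from generate_series(1, 1000);

-- how much stayed in the heap vs. how much was pushed out to TOAST
select pg_size_pretty(pg_relation_size(oid))           as heap_size,
       pg_size_pretty(pg_relation_size(reltoastrelid)) as toast_size
from pg_class
where relname = 'block_probe';

For question 3 I only have a crude timing comparison in mind (run with \timing in psql, repeated a few times on the same probe data), nothing rigorous:

select count(digest(block, 'sha256')) from block_probe;
select count(digest(block, 'sha512')) from block_probe;
select count(digest(block, 'sha1'))   from block_probe;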
I'm not looking for a definitive answer, just trying to get some hints from more experienced users before I fill up the drives with terabytes of data.
Thank you,
Laszlo