arnaud gaboury <arnaud.gaboury@xxxxxxxxx> wrote:

> On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@xxxxxxxxxxx> wrote:
>> arnaud gaboury <arnaud.gaboury@xxxxxxxxx> wrote:
>>
>>> I plan to set up bcache on a btrfs SSD/HD caching/backing device. The
>>> box will be a server, but not a production one.
>>>
>>> I know this scheme is not recommended and can be a cause of filesystem
>>> corruption. But as I like living on the edge, I want to give it a try,
>>> with a tight backup policy.
>>>
>>> Any advice? Any settings worth knowing about to avoid future drama?
>>
>> See my recommendations about partitioning, which I just posted here:
>> http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865
>>
>> Take care of the warnings about trimming/enabling discard. If you want
>> to use it, I'd at least test it before trusting your kitty to it. ;-)
>>
>> BTW: I'm planning a similar system, though the plans may well be 1 or 2
>> years in the future, and that system is planned to be based on
>> bcache/btrfs. It should become a production container-VM host. We'll
>> see. The fallback plan is to use a battery-backed RAID controller with
>> CacheCade (SSDs as volume cache).
>
> Thank you.
> Here is what I did to prevent any drama:

Due to the bugs mentioned here: to prevent drama, better don't use discard,
or at least do your tests with appropriate backups in place, and run your
own performance tests as well.

> the caching device, an SSD:
> - GPT partitions.
>
> -------------------------------------------------------
> # gdisk /dev/sdd
> Command: p
> Disk /dev/sdb: 224674128 sectors, 107.1 GiB
> Logical sector size: 512 bytes
> Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
> Partition table holds up to 128 entries
> First usable sector is 34, last usable sector is 224674094
> Partitions will be aligned on 2048-sector boundaries
> Total free space is 2014 sectors (1007.0 KiB)
>
> Number  Start (sector)    End (sector)  Size       Code  Name
>    1            2048       167774207    80.0 GiB   8300  poppy-root
>    2       167774208       224674094    27.1 GiB   8300  poppy-cache

This matches my setup (though I used another layout). My drive reports the
full 128GB instead of only 100GB, so I partitioned only 100GB and trimmed
the whole drive before partitioning.

> Then:
>
> # make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C /dev/sdb2 --discard

Take your safety measures with "discard", see other threads/posts and above.

> Now, when referring to your post, I have no idea about:
> Fourth - wear-levelling reservation.

Wear-levelling means that the drive tries to distribute writes evenly across
the flash cells. It uses an internal mapping to dynamically remap logical
blocks to internal block addresses. Because flash cells cannot be
overwritten in place, they have to be erased first and then written with the
new data. The erasing itself is slow. This also implies that to modify a
single logical sector of a cell, the drive has to read, modify, erase, and
then rewrite the whole cell. This is clearly even slower. It is also known
as "write amplification"; that is more or less the same effect.

To compensate for that (performance-wise), the drive reads a block, modifies
the data, and writes it to a fresh (already erased) cell. This is known as
the read-modify-write cycle. Here the internal remapping comes into play:
erasing the old cell is deferred into the background and will be done when
the drive is idle. If you do a lot of writes, this accumulates. The drive
needs a reservation area for discardable cells (thus, the option is usually
called "discard" in filesystems).
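If you want to verify discard support before relying on it, something like
the following should do. This is only a sketch: I'm assuming here that
/dev/sdb really is the SSD (adjust to your device), and blkdiscard throws
away everything on the drive, so only run it on the still empty, not yet
partitioned disk:

# lsblk --discard /dev/sdb
# hdparm -I /dev/sdb | grep -i trim
# blkdiscard /dev/sdb

lsblk shows non-zero DISC-GRAN/DISC-MAX columns if the kernel can send
discards to the device, hdparm shows whether the drive itself advertises
TRIM, and blkdiscard trims the whole unpartitioned drive in one go, which is
the "trim before partitioning" step I mentioned above.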
A flash cell also has a limited lifetime in terms of how often it can be
erased and rewritten. To compensate for that, SSD firmwares implement "wear
levelling", which means the firmware will try to remap writes to a flash
cell that has a low erase counter. If your system informs the drive which
cells no longer hold data (the "discard" option), the drive can do a much
better job at it because it does not have to rely on the reservation pool
alone. I artificially grow this pool by partitioning only about 80% of the
native flash size, which means the pool has 20% of the whole capacity
(usually, this is around 7% by default for most manufacturers). (100GB vs.
128GB ~ 20%, 120GB vs. 128GB ~ 7%)

Now bcache's bucket size comes into play. Cells are much bigger than logical
sectors: usually 512k, and if your drive is RAID-striped internally (bigger
drives usually are), you have integer multiples of that, like 2M for a 500GB
drive, because 4 cells make up one block [1]. This is called an erase block
because it can only be erased/written all or nothing. So you want to avoid
the read-modify-write cycle as much as possible. This is what bcache uses
its buckets for: it tries to fill and discard complete buckets in one go,
caching as much as possible in RAM first before pushing it out to the drive.
If you write a complete block the size of the drive's erase block, the drive
doesn't have to read-modify first, it just writes. This is clearly a
performance benefit.

As a logical consequence, you can improve long-term performance and lifetime
by using discard and a reservation pool, and you can improve direct
performance by using an optimal bucket size in bcache. If you optimize Ext4
for an SSD, there are similar approaches to make best use of the erase block
size (I'll append a quick example at the bottom of this mail).

> Shall I change my partition table for /dev/sdd2 and leave some space?

No, but maybe on sdb2 because that is your SSD, right?

> Below are some infos about the SSD:
[...]

Since sdb is your SSD, the above recommendations apply to sdb, not sdd. If
you repartition, you may want to trim the whole drive first for the reasons
mentioned above.

[1]: If your drive has high write rates, it is probably using striped cells,
because writing flash is a slow process, much slower than reading (25% or
less).

--
Replies to list only preferred.
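Regarding the Ext4 example mentioned above, a rough sketch only: the numbers
assume a 2M erase block and the usual 4k filesystem block size, and
/dev/sdb3 is just a made-up partition standing in for an Ext4 filesystem on
the SSD. 2M / 4k = 512 blocks, so you can pass that as the stride and stripe
width hints at mkfs time so the allocator tries to keep allocations aligned
to erase block boundaries:

# mkfs.ext4 -b 4096 -E stride=512,stripe-width=512 /dev/sdb3

Whether this actually gains anything depends on the drive, so as with
discard: benchmark it before trusting it.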