Kai Krakow <hurikhan77@xxxxxxxxx> wrote:

BTW: When I wrote "cell" below, I didn't refer to the single cell which
stores only one, two, or three bits in a flash block. I referred to the
complete block of cells making up one native block that can be erased as
a whole.

> arnaud gaboury <arnaud.gaboury@xxxxxxxxx> wrote:
>
>> On Wed, Apr 8, 2015 at 9:02 PM, Kai Krakow <kai@xxxxxxxxxxx> wrote:
>>> arnaud gaboury <arnaud.gaboury@xxxxxxxxx> wrote:
>>>
>>>> I plan to set up Bcache on Btrfs with an SSD/HD caching/backing
>>>> device. The box will be a server, but not a production one.
>>>>
>>>> I know this scheme is not recommended and can be a cause of
>>>> filesystem corruption. But as I like living on the edge, I want to
>>>> give it a try, with a tight backup policy.
>>>>
>>>> Any advice? Any settings that would be worth changing to avoid
>>>> future drama?
>>>
>>> See my recommendations about partitioning which I just posted here:
>>> http://permalink.gmane.org/gmane.linux.kernel.bcache.devel/2865
>>>
>>> Take care of the warnings about trimming/enabling discard. I'd at
>>> least test it if you want to use it before trusting your kitty to
>>> it. ;-)
>>>
>>> BTW: I'm planning a similar system, though the plans may well be 1
>>> or 2 years in the future, and the system is planned to be based on
>>> bcache/btrfs. It should become a production container-VM host.
>>> We'll see. The fallback plan is to use a battery-backed RAID
>>> controller with CacheCade (SSDs as volume cache).
>>
>> Thank you.
>> Here is what I did to prevent any drama:
>
> Due to the bugs mentioned here, to prevent drama, better don't use
> discard, or at least do your tests with appropriate backups. Also do
> your performance tests.
>
>> The caching device, an SSD:
>> - GPT partitions.
>>
>> -------------------------------------------------------
>> # gdisk /dev/sdd
>> Command: p
>> Disk /dev/sdb: 224674128 sectors, 107.1 GiB
>> Logical sector size: 512 bytes
>> Disk identifier (GUID): EAAC52BC-8236-483F-9875-744AF7031E72
>> Partition table holds up to 128 entries
>> First usable sector is 34, last usable sector is 224674094
>> Partitions will be aligned on 2048-sector boundaries
>> Total free space is 2014 sectors (1007.0 KiB)
>>
>> Number  Start (sector)    End (sector)  Size       Code  Name
>>    1            2048       167774207    80.0 GiB   8300  poppy-root
>>    2       167774208       224674094    27.1 GiB   8300  poppy-cache
>
> This matches my setup (though I used another layout). My drive reports
> the full 128GB instead of only 100GB. So I partitioned only 100GB and
> trimmed the whole drive before partitioning.
>
>> Then:
>>
>> # make-bcache --wipe-bcache --block 4k --bucket 2M -B /dev/sdd4 -C
>> /dev/sdb2 --discard
>
> Take your safety measures with "discard", see other threads/posts and
> above.
>
>> Now, when referring to your post, I have no idea about:
>> Fourth - wear-levelling reservation.
>
> Wear-levelling means that the drive tries to distribute writes evenly
> across the flash cells. It uses an internal mapping to dynamically
> remap logical blocks to internal block addresses. Because flash cells
> cannot be overwritten in place, they have to be erased first, then
> written with the new data. The erasing itself is slow. This also
> implies that to modify a single logical sector of a cell, the drive
> has to read, modify, erase, then write a cell. This is clearly even
> slower. It is also known as "write amplification"; this is more or
> less the same effect.
>
> To compensate for that (performance), the drive reads a block,
> modifies the data, and writes it to a fresh (already erased) cell.
> This is known as a read-modify-write cycle. Here the internal
> remapping comes into play. The erasing of the old cell is deferred
> into the background. It will be done when the drive is idle. If you do
> a lot of writes, this accumulates. The drive needs a reservation area
> for discardable cells (thus, the option is usually called "discard" in
> filesystems).
>
> A flash cell also has a limited lifetime, i.e. a limited number of
> times it can be erased and rewritten. To compensate for that, SSD
> firmwares implement "wear levelling", which means the firmware will
> try to remap to a flash cell that has a low erase counter. If your
> system informs the drive which cells are no longer holding data (the
> "discard" option), the drive can do a much better job at it because it
> does not have to rely on the reservation pool alone. I artificially
> grow this pool by only partitioning about 80% of the native flash
> size, which means the pool has 20% of the whole capacity (usually,
> this is around 7% by default for most manufacturers). (100GB vs. 128GB
> ~ 20%, 120GB vs. 128GB ~ 7%)
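
To make the over-provisioning part concrete, this is roughly the command
sequence (only a sketch; the device name is taken from the example above,
double-check it, and keep in mind that blkdiscard irreversibly throws away
everything on the device):

  # lsblk --discard /dev/sdb  (non-zero DISC-GRAN/DISC-MAX means the drive accepts TRIM)
  # blkdiscard /dev/sdb       (trim the complete, still unpartitioned drive)
  # gdisk /dev/sdb            (then create partitions covering only ~80% of the capacity)

Whatever you leave unpartitioned after a full trim stays known-free to the
firmware and effectively enlarges its wear-levelling reserve.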
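
Once a filesystem is in use on top, there are two common ways to pass the
"discard" information down to the drive (shown here for a plain ext4/btrfs
mount; the device and mount point are only placeholders, and with bcache in
between test this carefully first because of the warnings above):

  # mount -o discard /dev/sdXn /mnt  (send a discard for every block that gets freed)
  # fstrim -v /mnt                   (or batch-trim all free space, e.g. from a weekly timer)

Both only tell the drive which blocks no longer hold data; whether the
continuous or the batched variant behaves better depends on the firmware.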

> Now, bcache's bucket size comes into play. Because cells are much
> bigger than logical sectors (usually 512k; if your drive is internally
> RAID-striped, and bigger drives usually are, you have integer
> multiples of that, like 2M for a 500GB drive because 4 cells make up
> one block [1]; this is called an erase block because it can only be
> erased/written all or nothing), you want to avoid the read-modify-
> write cycle as much as possible. This is what bcache uses its buckets
> for. It tries to fill and discard complete buckets in one go, caching
> as much as possible in RAM first before pushing it out to the drive.
> If you write a complete block the size of the drive's erase block, it
> doesn't have to read and modify first, it just writes. This is clearly
> a performance benefit.
>
> As a logical consequence, you can improve long-term performance and
> lifetime by using discard and a reservation pool, and you can improve
> direct performance by using an optimal bucket size in bcache. When
> optimizing Ext4 for SSDs, there are similar approaches to make best
> use of the erase block size.
>
>> Shall I change my partition table for /dev/sdd2 and leave some space?
>
> No, but maybe on sdb2 because that is your SSD, right?
>
>> Below is some info about the SSD:
> [...]
>
> Since sdb is your SSD, the above recommendations apply to sdb, not
> sdd. If you repartition, you may want to trim your whole drive first
> for the above-mentioned reasons.
>
>
> [1]: If your drive has high write rates, it is probably using striped
> cells, because writing flash is a slow process, much slower than
> reading (25% or less).

--
Replies to list only preferred.
--
To unsubscribe from this list: send the line "unsubscribe linux-bcache"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html