On 1-4-2017 21:59, Wido den Hollander wrote:
>
>> Op 31 maart 2017 om 19:15 schreef Willem Jan Withagen <wjw@xxxxxxxxxxx>:
>>
>>
>> On 31-3-2017 17:32, Wido den Hollander wrote:
>>> Hi Willem Jan,
>>>
>>>> Op 30 maart 2017 om 13:56 schreef Willem Jan Withagen
>>>> <wjw@xxxxxxxxxxx>:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> I'm pleased to announce that my efforts to port to FreeBSD have
>>>> resulted in a ceph-devel port commit in the ports tree.
>>>>
>>>> https://www.freshports.org/net/ceph-devel/
>>>>
>>>
>>> Awesome work! I don't touch FreeBSD that much, but I can imagine that
>>> people want this.
>>>
>>> Out of curiosity, does this run on ZFS under FreeBSD? Or what
>>> Filesystem would you use behind FileStore with this? Or does
>>> BlueStore work?
>>
>> Since I'm a huge ZFS fan, that is what I run it on.
>
> Cool! The ZIL, ARC and L2ARC can actually make that very fast.

Interesting! Right, the ZIL is magic, and more or less equal to the
journal now used with OSDs, for exactly the same reason. The sad thing
is that a write is now journaled three times: once by Ceph and twice by
ZFS, which means that the bandwidth used to the SSDs is double what it
could be. I have had some discussion about this, but disabling the Ceph
journal is not just a matter of setting an option. I would like to test
the performance of an OSD with just the ZFS journal, but I expect that
the OSD journal is rather firmly integrated.

The really nice thing is that one does not need to worry about caching
for OSD performance: that is fully covered by ZFS, both by the ARC and
the L2ARC. And the ZIL and L2ARC can be constructed in all the shapes
and forms that ZFS vdevs can be built in. So for the ZIL you would
build an SSD mirror: double the write speed, but still redundant. For
the L2ARC I would concatenate two SSDs to get the read bandwidth (see
the sketch at the end of this mail). And contrary to some of the other
caches, ZFS does not return errors if the L2ARC devices go down (note
that data errors are detected by checksumming), so that again is one
less thing to worry about.

> CRC and Compression from ZFS are also very nice.

I did not want to go into too much detail, but this is a large part of
the reason. I tried compression a bit, but it costs quite a bit of
performance at the Ceph end, perhaps because the write to the journal
is synced and thus has to wait on both the compression and the synced
write. ZFS also brings snapshots without much hassle, but I have not
yet looked into if and how btrfs snapshots are used.

Another challenge is Ceph deep scrubbing: checking for corruption
within files. ZFS is able to detect corruption all by itself due to
extensive checksumming, and with something much stronger than crc32
(just putting on my fireproof suit). So I'm not certain that deep-scrub
would become obsolete, but I think the frequency could perhaps go down,
and/or it could be triggered by ZFS errors after scrubbing a pool.
Something that has much less impact on performance.

In some of the talks I give, I always try to explain to people that
RAID and RAID controllers are the current dinosaurs of IT.

>> To be honest I have not tested on UFS, but I would expect that the
>> xattrs are not long enough.
>>
>> BlueStore is not (yet) available because there is a different AIO
>> implementation on FreeBSD. But Sage thinks it is very doable to glue
>> in posix AIO. And one of my port reviewers has offered to look at it.
>> So it could be that BlueStore will be available in the foreseeable
>> future.
>>
>> --WjW
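
For reference, the ZIL/L2ARC layout I mean above is roughly the
following sketch (pool and device names are purely illustrative, not my
actual setup):

    # mirrored SLOG (ZIL) on two SSDs, so sync writes stay redundant
    zpool add osdpool log mirror ada1 ada2

    # two SSDs as L2ARC cache devices; ZFS stripes reads over them and
    # simply stops using them if they fail
    zpool add osdpool cache ada3 ada4

    # one dataset per OSD; checksumming is on by default, compression
    # is optional
    zfs create -o compression=lz4 -o atime=off osdpool/osd.0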