Re: Analysing ceph performance with SSD journal, 10GbE NIC and 2 replicas - Hammer release

On 9-1-2017 23:58, Brian Andrus wrote:
> Sorry for spam... I meant D_SYNC.

That term does not turn up anything in Google...
So I expect it has to be O_DSYNC.
(https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/)

Now you tell me there are SSDs that behave correctly with O_SYNC but not
with O_DSYNC... That makes no sense to me. That distinction is a typical
OS-level speed trade-off: O_DSYNC only guarantees that the file data
(and the metadata needed to read it back) is on stable storage, while
O_SYNC also flushes the remaining file metadata, at the cost of a bit
less consistent FS when you take the faster option. At the device level
both still require the data itself to be durably written.
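
For what it is worth, the linked post measures this with fio doing 4k
direct, synchronous writes at queue depth 1. Below is a rough C sketch
of the same kind of test; the device path, block size and iteration
count are placeholder choices of mine, and writing to a raw device will
of course destroy whatever is on it.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* "/dev/sdX" is a placeholder: point this at a scratch device. */
    const char *dev = argc > 1 ? argv[1] : "/dev/sdX";
    const size_t bs = 4096;       /* 4k blocks, like the fio test    */
    const int iterations = 1000;

    /* O_DSYNC: every write() returns only after the data (and the
       metadata needed to read it back) is on stable storage.
       O_DIRECT bypasses the page cache, as the fio test does.       */
    int fd = open(dev, O_WRONLY | O_DSYNC | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, bs) != 0) { close(fd); return 1; }
    memset(buf, 0xA5, bs);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        if (pwrite(fd, buf, bs, 0) != (ssize_t)bs) {
            perror("pwrite");
            break;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d x %zu-byte O_DSYNC writes in %.3f s (%.0f IOPS)\n",
           iterations, bs, secs, iterations / secs);

    free(buf);
    close(fd);
    return 0;
}

A drive with real power-loss protection will sustain thousands of these
per second; a consumer drive that honours the sync will drop to a few
hundred or less, and one that stays fast under this load is usually the
one that is lying.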

Either a device actually writes its data persistently (either into the
flash cells, or into RAM protected by a supercapacitor), or it does not.
I cannot think of a third option. Maybe my EE background is getting in
the way here. And I know it is rather hard to write correct SSD
firmware; I have seen lots of firmware upgrades that fix serious corner
cases.

Now the second thing is how badly a drive lies when it is told that the
requested write must be synchronised: an OK should only be returned once
the data is on stable storage and cannot be lost.

If there is any possibility that a sync write to a drive is not
persistent, then that is a serious breach of the sync-write contract,
and there will always be situations in which such drives lose data.
Once the writing process believes the data is on stable storage, it
deletes that data from the journal; if the drive then drops the write,
the data is gone from both places and is permanently lost on that OSD.
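
To make the ordering concrete, here is a minimal sketch of the generic
write-ahead-journal pattern. This is not Ceph's actual journal code;
flush_to_data_store() and journal_trim() are made-up names for
illustration only.

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

/* Stub: a real object store would free the journal entry here. */
static void journal_trim(long entry_id)
{
    printf("journal entry %ld discarded\n", entry_id);
}

/* Write one journaled entry out to the data SSD, then drop it from
   the journal. Returns 0 on success. */
int flush_to_data_store(int data_fd, long entry_id,
                        const void *buf, size_t len, off_t off)
{
    if (pwrite(data_fd, buf, len, off) != (ssize_t)len)
        return -1;

    /* fdatasync() returning 0 is the drive's promise that the data
       is on stable storage; only then is the journal copy given up. */
    if (fdatasync(data_fd) != 0)
        return -1;

    journal_trim(entry_id);

    /* If the SSD only held the write in volatile cache and lied about
       the sync, a power cut now loses the data on BOTH the data SSD
       and the journal: permanently gone on this OSD, recoverable only
       from a replica on another OSD. */
    return 0;
}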

Now with Ceph you get a second (or even a third) chance, because the
data is stored multiple times, and you can go to another OSD and try to
get it back.

--WjW

> 
> On Mon, Jan 9, 2017 at 2:56 PM, Brian Andrus <brian.andrus@xxxxxxxxxxxxx> wrote:
> 
>     Hi Willem, the SSDs are probably fine for backing OSDs, it's the
>     O_DSYNC writes they tend to lie about.
> 
>     They may have a failure rate higher than enterprise-grade SSDs, but
>     are otherwise suitable for use as OSDs if journals are placed elsewhere.
> 
>     On Mon, Jan 9, 2017 at 2:39 PM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> 
>         On 9-1-2017 18:46, Oliver Humpage wrote:
>         >
>         >> Why would you still be using journals when running fully OSDs on
>         >> SSDs?
>         >
>         > In our case, we use cheaper large SSDs for the data (Samsung 850 Pro
>         > 2TB), whose performance is excellent in the cluster, but as has been
>         > pointed out in this thread can lose data if power is suddenly
>         > removed.
>         >
>         > We therefore put journals onto SM863 SSDs (1 journal SSD per 3 OSD
>         > SSDs), which are enterprise quality and have power outage protection.
>         > This seems to balance speed, capacity, reliability and budget fairly
>         > well.
> 
>         This would make me feel very uncomfortable.....
> 
>         So you have a reliable journal, so up to there things do work:
>           once the data is in the journal it is safe.
> 
>         But then you asynchronously transfer the data to a data disk,
>         and that is an SSD that lies to you? It will tell you that the
>         data is written, but if you pull the power, it turns out that
>         the data is not really stored.
> 
>         And then the only way to get the data consistent again is to
>         (deep-)scrub.
> 
>         Not a very appealing outlook?
> 
>         --WjW
> 
> 
>         _______________________________________________
>         ceph-users mailing list
>         ceph-users@xxxxxxxxxxxxxx
>         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
>     -- 
>     Brian Andrus
>     Cloud Systems Engineer
>     DreamHost, LLC
> 
> 
> 
> 
> -- 
> Brian Andrus
> Cloud Systems Engineer
> DreamHost, LLC

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


