Re: SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug

On 11/23/2015 10:42 AM, Eneko Lacunza wrote:
Hi Mart,

El 23/11/15 a las 10:29, Mart van Santen escribió:


On 11/22/2015 10:01 PM, Robert LeBlanc wrote:
There have been numerous reports on the mailing list of the Samsung EVOs and
Pros failing far before their expected wear. This is most likely due
to the 'uncommon' workload of Ceph: the controllers of those drives
are not really designed to handle the continuous direct sync writes
that Ceph does. Because of this they can fail without warning
(controller failure rather than MLC failure).

I'm new to the mailing list and am currently scanning the archive, and I'm getting a sense of the quality of the Samsung Evo disks. If I understand correctly, the advice is at least to put DC-grade journals in front of them to save them a bit from failure, for example Intel 750's.
I don't think Intel 750's are DC grade. I don't have any of them though.

OK thx. Interesting, as the 750's are more expensive here than some Intel S3xxx (per $).


However, is there any experience of when the Evo's fail in the Ceph scenario? For example, if wear leveling according to SMART is at about 40%, is it time to replace your disks? Or is it just random? Actually we are mostly using Crucial drives (m550, mx200's), and there is not a lot about them on the list. Do other people use them, and what is their experience so far? I expect about the same quality as the Samsung Evo's, but I'm not sure if that is the correct conclusion.
My experience with the Samsung 840 Pro is that they can't be used for Ceph at all. As for the Crucial M550, they are slow and have little endurance for Ceph use, but I have used them and they seemed reliable during their warranty lifetime (we retired them for performance reasons).

They are indeed not the fastest in the world, but our cluster isn't that heavily used (max ~4000 IOPS for the whole cluster, ~50 1TB disks currently). Disk IO is currently between 5% and 20% usage.
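
For reference, that Disk IO figure is essentially what iostat reports as %util. A minimal sketch of how such a 'busy' percentage can be derived from /proc/diskstats (this is not our actual monitoring code, and 'sda' is just a placeholder device name):

#!/usr/bin/env python3
# Sketch only: compute a disk 'busy' percentage the same way iostat's %util does.
import time

def io_ticks(device):
    # Field 13 of /proc/diskstats is the total milliseconds spent doing I/O.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError("device %s not found" % device)

def busy_percent(device, interval=1.0):
    # Time spent doing I/O during the sample window, divided by wall-clock time.
    before = io_ticks(device)
    time.sleep(interval)
    after = io_ticks(device)
    return 100.0 * (after - before) / (interval * 1000.0)

print("sda busy: %.1f%%" % busy_percent("sda"))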



About SSD failure in general, do they normally fail hard, or do they just get unbearably slow? We do measure/graph the disks' 'busy' percentage, and use that as an indicator of whether a disk is getting slow. Is this a sensible approach?

Just don't do it. Use DC SSDs, like Intel S3xxx or Samsung DC Pro, or something like that. You will save a lot of time and effort, and possibly also money.

It's clear to me that the Intel SSDs are the way to go. However, I still want to estimate at what pace we should replace them in the future, so it still makes sense for us to have some way of predicting when it's time to speed up replacements.
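
To make that concrete, here is a minimal sketch of the kind of back-of-the-envelope estimate I have in mind: read the normalised SMART wear value with smartctl and extrapolate linearly, assuming the write load stays roughly constant. The attribute name is vendor-specific (e.g. Wear_Leveling_Count on Samsung, Media_Wearout_Indicator on Intel), and the threshold of 40 is just the figure from my earlier question, not a recommendation:

#!/usr/bin/env python3
# Sketch only: linearly project when an SSD will hit a wear threshold.
# Needs root (or sudo) to run smartctl against the device.
import subprocess

def wear_value(device, attr="Wear_Leveling_Count"):
    # Column 4 (VALUE) of 'smartctl -A' is the normalised attribute: 100 = new.
    out = subprocess.check_output(["smartctl", "-A", device]).decode()
    for line in out.splitlines():
        if attr in line:
            return int(line.split()[3])
    raise ValueError("attribute %s not found on %s" % (attr, device))

def days_until_threshold(device, days_in_service, threshold=40):
    remaining = wear_value(device)      # e.g. 85 means roughly 15% worn
    worn = 100 - remaining
    if worn <= 0:
        return float("inf")             # no measurable wear yet
    wear_per_day = worn / float(days_in_service)
    return (remaining - threshold) / wear_per_day

# Example: a journal SSD that has been in service for 180 days.
print("days until threshold: %.0f" % days_until_threshold("/dev/sda", 180))

Of course this only covers plain wear; it says nothing about the sudden controller failures reported for the EVOs.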


Regards,

Mart van Santen


Cheers
Eneko
-- 
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
      943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es



-- 
Mart van Santen
Greenhost
E: mart@xxxxxxxxxxxx
T: +31 20 4890444
W: https://greenhost.nl

A PGP signature can be attached to this e-mail,
you need PGP software to verify it. 
My public key is available in keyserver(s)
see: http://tinyurl.com/openpgp-manual

PGP Fingerprint: CA85 EB11 2B70 042D AF66  B29A 6437 01A1 10A3 D3A5


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
