Re: Ceph weird "corruption" but no corruption and performance = abysmal.

Hello,

On Thu, 21 Apr 2016 15:35:52 +0300 Florian Rommel wrote:

> Ok, weird problem(s), if you want to call it that..
> 
> So I run a 10 OSD Ceph cluster on 4 hosts with SSDs (Intel DC3700) as
> journals.
>
That's a small number of OSDs in total and per host (at replication 3, at
best the sustained performance of about 3 HDDs); this is relevant later.
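(Rough math: 10 OSDs at replication 3 is at best 10/3 ≈ 3.3 HDDs' worth of
sustained writes; at ~100MB/s per HDD that's a ceiling on the order of
330MB/s, before seeks and metadata overhead eat into it.)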
 
> I have a lot of mixed workloads running and the linux machines seem to
> get somehow corrupted in a weird way and the performance kind of sucks.
> First off: All hosts are running Openstack with KVM + libvirt to connect
> and boot the RBD volumes. Ceph -v : ceph version 0.94.6
> 
0.94.6 is the enemy, it was always the enemy. ^o^
But since you have no cache-tier, you should be fine.

> —————— Problem 1: Corruption:
> 
> Next, whenever I run fsck.ext4 -nvf /dev/vda1 on one of the guests I get
> this:
> e2fsck 1.42.9 (4-Feb-2014)
> Warning!  /dev/vda1 is mounted.
> Warning: skipping journal recovery because doing a read-only filesystem
> check.
> Pass 1: Checking inodes, blocks, and sizes
> Deleted inode 1647 has zero dtime.  Fix? no
> 
> Inodes that were part of a corrupted orphan linked list found.  Fix? no
> 
> Inode 133469 was part of the orphaned inode list.  IGNORED.
> Inode 133485 was part of the orphaned inode list.  IGNORED.
> Inode 133490 was part of the orphaned inode list.  IGNORED.
> Inode 133492 was part of the orphaned inode list.  IGNORED.
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (8866035, counted=8865735).
> Fix? no
> 
> Inode bitmap differences:  -1647 -133469 -133485 -133490 -133492
> Fix? no
> 
> Free inodes count wrong (2508840, counted=2509091).
> Fix? no
> 
> 
> cloudimg-rootfs: ********** WARNING: Filesystem still has errors
> **********
> 
> 
>       112600 inodes used (4.30%, out of 2621440)
>           70 non-contiguous files (0.1%)
>           77 non-contiguous directories (0.1%)
>              # of inodes with ind/dind/tind blocks: 0/0/0
>              Extent depth histogram: 104372/41
>      1619469 blocks used (15.44%, out of 10485504)
>            0 bad blocks
>            2 large files
> 
>        89034 regular files
>        14945 directories
>           55 character device files
>           25 block device files
>            1 fifo
>           16 links
>         8265 symbolic links (7832 fast symbolic links)
>           10 sockets
> ------------
>       112351 files
> 
> 
> So I mount the disk via RBD on a host directly with rbd map
> and when I do a fsck.ext4 -nfv /dev/rbd0p1
> I get
> 
> fsck.ext4 /dev/rbd0p1
> e2fsck 1.42.11 (09-Jul-2014)
> cloudimg-rootfs: clean, 112600/2621440 files, 1619469/10485504 blocks
> 
> 
> So which one do I trust??? 

The one that isn't "live", aka the 2nd one.
There are too many cache and other layers in front of a mounted filesystem
that can confuse things.

You could boot that VM from a recovery/install "CD" and then do a fsck on
the "disk" in question. It should come up clean as well.

> I have had corrupted files on some of the
> images, but I attributed this to a migration from qcow2 to RAW ->
> ceph.
> 
That should show up consistently in both of your fsck cases.

> Any help is really appreciated
> 
> —————— Problem 2: Performance
> I would assume that even with the Intel DC SSDs as journals, 
                      ^^^^^^^^^
Even with? You meant to say "because using"?
Also which model is that? 
The 100GB for example is limited to 200MB/s writes.

> I would get
> decent performance out of the system. But currently I max this one out
> at 200MB/s write while read is full 10Gbit/s
>
Mixing apples (MByte/s) and oranges (Gbit/s) isn't helping. ^_-
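For reference: 10Gbit/s / 8 = 1.25GByte/s, so "full 10Gbit/s" reads are
roughly 1250MB/s against your 200MB/s writes.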

> I have 10 SATA drives behind the SSDs: 2 SSDs with 3 SATA drives each
> and 2 SSDs with 2 SATA drives each.
> 
I wouldn't expect much more than that sustained with your setup; in fact
I'd have expected what you quote below, something below 150MB/s.
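Back-of-the-envelope, assuming these are the 100GB DC S3700s at ~200MB/s
each: your 4 journal SSDs can ingest about 800MB/s combined, and with
replication 3 every client write hits 3 journals, so ~800/3 ≈ 265MB/s of
client throughput is the best case from the journal side alone.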
 
> fio is also giving terrible results, it's like it cranks up the IO to
> about 5000 then dwindles down.. looks almost like it's waiting to flush
> the SSDs out.. or the IO
> 
Ceph doesn't "flush" the journal, i.e. reads from it, ever in normal
operations. 
The only time it does that is during crash recovery.

Exact fio command line, please.
And some results; I would expect your tuning to induce more (uneven)
latencies as well.
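
For comparison, something along these lines would do (file name, size and
runtime are just examples):

fio --name=seqwrite --ioengine=libaio --direct=1 --rw=write --bs=4M \
    --size=4G --iodepth=16 --runtime=120 --filename=/mnt/test/fiofile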

But what you're seeing is exactly what is bound to happen eventually:
things are fast while they go to the journal and RAM on the storage
node, then slow down to actual HDD speeds once things back up enough.

> The only changes I made to the base config are rbd cache = true and
> the following lines:
> 
> ceph tell osd.* injectargs '--filestore_wbthrottle_enable=false'
> ceph tell osd.* injectargs '--filestore_queue_max_bytes=1048576000'
> ceph tell osd.* injectargs '--filestore_queue_committing_max_ops=5000'
> ceph tell osd.* injectargs '--filestore_queue_committing_max_bytes=1048576000'
> ceph tell osd.* injectargs '--filestore_queue_max_ops=200'
> ceph tell osd.* injectargs '--journal_max_write_entries=1000'
> ceph tell osd.* injectargs '--journal_queue_max_ops=3000'
> 
> That's the only way I reached 200-250MB/s.. otherwise it's more like
> 115MB/s, also waiting for flush after a wave..
> 
Precisely.
If you already figured that out and set those parameters, you should know
that there isn't a way to improve things further.
Eventually all your in-flight data needs to be written to the HDDs and
about 100MB/s per HDD is a mighty good result.
Remember that this isn't one long sequential write; it is 4MB objects with
all the associated FS journal, Ceph metadata and leveldb updates, and thus
seeks.
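
If you want the values you injected to survive OSD restarts, they have
ceph.conf equivalents along these lines (same option names; spaces or
underscores both work):

[osd]
filestore wbthrottle enable = false
filestore queue max bytes = 1048576000
journal max write entries = 1000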

> Can anyone give me a fairly decent idea on how to tune this properly?
There are more knobs like "filestore min sync interval", but you're very
much at the end stop here.
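For completeness, a sketch of those (the values are examples, not
recommendations):

ceph tell osd.* injectargs '--filestore_min_sync_interval=0.01'
ceph tell osd.* injectargs '--filestore_max_sync_interval=10'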

Christian
> Also, could this modification have something to do with the corruption?
> 
> Thanks again for any help :)
> 
> //Florian


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



