Re: OOM's on the Ceph client machine

Hi Ted:

I'd like to follow a similar setup to yours.  At this stage I'm running
a very recent version.  I've tried btrfs, but the ceph mount freezes as
soon as I run any heavy benchmark or high-IOPS workload.  I suspect it
has to do with syncing, so I'm now trying ext4 without the journal and
hoping for better results.

1. Have you tried btrfs with all other configs kept the same?
2. Did you use mkfs.ext4 -O ^has_journal to disable the journal?  (Sorry,
you would know best, since you are the ext4 man!)  See the commands after
this list for what I run on my end.
3. Did changing the journal size in ceph.conf change any results,
e.g., from 20MB to 1000MB?
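
For reference, here is roughly what I do for question 2; /dev/sdb3 is
just a placeholder device name, so adjust it to your layout:

# create an ext4 filesystem with no journal (placeholder device name)
mkfs.ext4 -O ^has_journal /dev/sdb3
# or drop the journal from an existing ext4 filesystem (must be unmounted)
tune2fs -O ^has_journal /dev/sdb3
# confirm that has_journal is gone from the feature list
dumpe2fs -h /dev/sdb3 | grep -i features

For question 3, the knob I've been changing is 'osd journal size' (in MB)
under [osd] in ceph.conf -- if I have the option name right.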

Have you had any success with other benchmarks?  If your time allows,
I'd really like to know how the 'fio' benchmark behaves for you:
run ./fio example_file

where the content of example_file is, e.g.:

[global]
bs=4k
ioengine=libaio
iodepth=1
size=1g
direct=1
runtime=30
filename=/media/cephmount/afile

[seq-read]
  rw=read
  stonewall
[rand-read]
  rw=randread
  stonewall
[seq-write]
  rw=write
  stonewall
[rand-write]
  rw=randwrite
  stonewall
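
In case it helps, this is roughly how I run it on my side; the monitor
address and mount point below are placeholders for my own setup, so
substitute yours:

# mount the Ceph filesystem with the kernel client (placeholder monitor address)
mount -t ceph 1.2.3.4:6789:/ /media/cephmount
# run the four jobs above; each 'stonewall' makes a job wait for the
# previous one to finish, so the phases don't overlap
./fio example_file

With direct=1 the jobs should bypass the client page cache, so the
numbers ought to be closer to what actually crosses the wire.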

thanks a lot.

On Wed, Oct 13, 2010 at 1:31 PM, Theodore Ts'o <tytso@xxxxxxx> wrote:
> Hi there,
>
> I've recently been playing with Ceph on an evaluation basis, and found
> that I was able to fairly reliably induce an OOM kill on the Ceph
> client machine by using FFSB with the following configuration file (see
> attached, below).
>
> I am using Ceph v0.21.3 plus a few commits that were on the testing
> branch as of late September (commit ID 569d96b).  The Ceph cluster
> contains 10 commodity servers with 5 disks configured for Ceph object
> storage on each server (plus a separate spindle for the journal files),
> so there are 5 instances of cosd on each OSD server.  The disks are
> formatted using ext4 in no-journal mode.  I am using 3 servers for the
> MDS and monitoring daemons, with both colocated on those 3 servers.
> The machines all have gigabit ethernet
> cards.
>
> I've been running the client on a separate machine, and this is the
> machine which has been dying with an OOM.
>
> Any help, suggestions, or "hey stupid!  You screwed up XXXX in your
> ceph.conf file" would be gratefully accepted.
>
> Thanks,
>
>                                     - Ted
>
> P.S.  In case people are curious, here are the results of the "boxacle"
> (http://btrfs.boxacle.net) FFSB workloads that I ran.  The results are
> fairly stable, except that the 8 thread random_write workload is a
> little hard to reproduce because it very often OOM's.  I've never gotten
> a 32 thread random_write measurement at all, since it reliably OOM's on
> my client machine.
>
> Do these results look reasonable to you?  I confess I'm a little
> disappointed with the sequential and random read numbers in particular.
> And given 10 servers and fifty spindles, even the large_file_create
> numbers seem surprisingly slow.
>
> (Also, given that we are using gigabit ethernet in this evaluation
> cluster, the 1GB/sec seems ridiculously high, which suggests to me that
> the fsync request wasn't honored -- FFSB includes the fsync time when
> calculating write bandwidth -- and it may explain why we are OOM'ing in
> the random_write workload.)
>
>                    1 thread           8 threads            32 threads
> large_file_create   101 MB/sec         102 MB/sec           101 MB/sec
> sequential_reads     35 MB/sec         113 MB/sec           114 MB/sec
> random_reads          1.48 MB/sec        5.44 MB/sec        11.7 MB/sec
> random_writes      923 MB/sec           1.09 GB/sec             (*)
>
> For comparison, here are the FFSB numbers on a single local ext4 disk
> with no journal:
>
>                    1 thread           8 threads            32 threads
> large_file_create   75.5 MB/sec        72.2 MB/sec          74.2 MB/sec
> sequential_reads    77.2 MB/sec        69.2 MB/sec          70.3 MB/sec
> random_reads        734 K/sec          537 K/sec            537 K/sec
> random_writes       44.5 MB/sec        41.5 MB/sec          41.6 MB/sec
>
> It's very possible that I may have done something wrong, so I've
> enclosed the ceph.conf file I used for doing this test run....  please
> let me know if there's something I've screwed up.
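
A minimal sanity check of the fsync theory above, assuming the Ceph
filesystem is still mounted at /mnt/ffsb1 as in the FFSB profile below:
dd with conv=fsync includes the final fsync in its reported throughput,
so a rate far above gigabit line rate (~110 MB/s) would mean the data is
not actually on the OSDs when fsync returns.

# write 1 GB to the Ceph mount; conv=fsync makes dd fsync the file before
# printing its transfer rate, so the fsync latency is included
dd if=/dev/zero of=/mnt/ffsb1/fsync-test bs=4M count=256 conv=fsync
rm /mnt/ffsb1/fsync-test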
>
> ---------------------------- random_write.32.ffsb
> # Large file random writes.
> # 1024 files, 100MB per file.
>
> time=300  # 5 min
> alignio=1
>
> [filesystem0]
>        location=/mnt/ffsb1
>        num_files=1024
>        min_filesize=104857600  # 100 MB
>        max_filesize=104857600
>        reuse=1
> [end0]
>
> [threadgroup0]
>        num_threads=32
>
>        write_random=1
>        write_weight=1
>
>        write_size=5242880  # 5 MB
>        write_blocksize=4096
>
>        [stats]
>                enable_stats=1
>                enable_range=1
>
>                msec_range    0.00      0.01
>                msec_range    0.01      0.02
>                msec_range    0.02      0.05
>                msec_range    0.05      0.10
>                msec_range    0.10      0.20
>                msec_range    0.20      0.50
>                msec_range    0.50      1.00
>                msec_range    1.00      2.00
>                msec_range    2.00      5.00
>                msec_range    5.00     10.00
>                msec_range   10.00     20.00
>                msec_range   20.00     50.00
>                msec_range   50.00    100.00
>                msec_range  100.00    200.00
>                msec_range  200.00    500.00
>                msec_range  500.00   1000.00
>                msec_range 1000.00   2000.00
>                msec_range 2000.00   5000.00
>                msec_range 5000.00  10000.00
>        [end]
> [end0]
> ------------------------------------------------ My ceph.conf file
>
> ;
> ; This is the test ceph configuration file
> ;
> ; [tytso:20101007.0813EDT]
> ;
> ; This file defines cluster membership, the various locations
> ; that Ceph stores data, and any other runtime options.
> ;
> ; If a 'host' is defined for a daemon, the start/stop script will
> ; verify that it matches the hostname (or else ignore it).  If it is
> ; not defined, it is assumed that the daemon is intended to start on
> ; the current host (e.g., in a setup with a startup.conf on each
> ; node).
>
> ; global
> [global]
>        user = root
>        pid file = /disk/sda3/tmp/ceph/$name.pid
>        logger dir = /disk/sda3/tmp/ceph
>        log dir = /disk/sda3/tmp/ceph
>        chdir = /disk/sda3
>
> ; monitors
> ;  You need at least one.  You need at least three if you want to
> ;  tolerate any node failures.  Always create an odd number.
> [mon]
>        mon data = /disk/sda3/cephmon/data/mon$id
>
>        ; logging, for debugging monitor crashes, in order of
>        ; their likelihood of being helpful :)
>        ;debug ms = 1
>        ;debug mon = 20
>        ;debug paxos = 20
>        ;debug auth = 20
>
> [mon0]
>        host = mach1
>        mon addr = 1.2.3.4:6789
>
> [mon1]
>        host = mach2
>        mon addr = 1.2.3.5:6789
>
> [mon2]
>        host = mach3
>        mon addr = 1.2.3.6:6789
>
> ; mds
> ;  You need at least one.  Define two to get a standby.
> [mds]
>        ; where the mds keeps its secret encryption keys
>        keyring = /data/keyring.$name
>
>        ; mds logging to debug issues.
>        ;debug ms = 1
>        ;debug mds = 20
>
> [mds.alpha]
>        host = mach2
>
> [mds.beta]
>        host = mach3
>
> [mds.gamma]
>        host = mach1
>
> ; osd
> ;  You need at least one.  Two if you want data to be replicated.
> ;  Define as many as you like.
> [osd]
>        ; osd logging to debug osd issues, in order of likelihood of being
>        ; helpful
>        ;debug ms = 1
>        ;debug osd = 20
>        ;debug filestore = 20
>        ;debug journal = 20
>
> [osd0]
>        host = mach10
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd1]
>        host = mach11
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd2]
>        host = mach12
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd3]
>        host = mach13
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd4]
>        host = mach14
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd5]
>        host = mach15
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd6]
>        host = mach16
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd7]
>        host = mach17
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd8]
>        host = mach18
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd9]
>        host = mach19
>        osd data = /disk/sdb3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdb3
>
> [osd10]
>        host = mach10
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd11]
>        host = mach11
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd12]
>        host = mach12
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd13]
>        host = mach13
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd14]
>        host = mach14
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd15]
>        host = mach15
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd16]
>        host = mach16
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd17]
>        host = mach17
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd18]
>        host = mach18
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd19]
>        host = mach19
>        osd data = /disk/sdd3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdd3
>
> [osd20]
>        host = mach10
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd21]
>        host = mach11
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd22]
>        host = mach12
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd23]
>        host = mach13
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd24]
>        host = mach14
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd25]
>        host = mach15
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd26]
>        host = mach16
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd27]
>        host = mach17
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd28]
>        host = mach18
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd29]
>        host = mach19
>        osd data = /disk/sde3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sde3
>
> [osd30]
>        host = mach10
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd31]
>        host = mach11
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd32]
>        host = mach12
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd33]
>        host = mach13
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd34]
>        host = mach14
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd35]
>        host = mach15
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd36]
>        host = mach16
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd37]
>        host = mach17
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd38]
>        host = mach18
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd39]
>        host = mach19
>        osd data = /disk/sdf3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdf3
>
> [osd40]
>        host = mach10
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd41]
>        host = mach11
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd42]
>        host = mach12
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd43]
>        host = mach13
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd44]
>        host = mach14
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd45]
>        host = mach15
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd46]
>        host = mach16
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd47]
>        host = mach17
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd48]
>        host = mach18
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
> [osd49]
>        host = mach19
>        osd data = /disk/sdg3/cephdata
>        osd journal = /disk/sdc3/cephjnl.sdg3
>
>
>

