On 15/05/14 09:11, Stefan Priebe - Profihost AG wrote:
> On 15.05.2014 00:26, Josef Johansson wrote:
>> Hi,
>>
>> So, apparently tmpfs does not support non-root xattrs due to a possible
>> DoS vector. This is the configuration set for enabling it, as far as I
>> can see:
>>
>> CONFIG_TMPFS=y
>> CONFIG_TMPFS_POSIX_ACL=y
>> CONFIG_TMPFS_XATTR=y
>>
>> Anyone know a way around it? Saw that there's a patch for enabling it,
>> but recompiling my kernel is out of reach right now ;)
> I would create an empty file in tmpfs and then format that file as a
> block device.

How do you mean exactly? Creating with dd and mounting with losetup?

Cheers,
Josef
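
For anyone trying the same thing, a minimal sketch of the file-on-tmpfs loop
device approach Stefan describes could look like the following. Directory,
size and mount point are the ones from the log further down; truncate and
losetup -f are my own substitutions and any equivalent (such as the dd seek
trick in the log) works just as well:

---
# sparse 6 GiB backing file on tmpfs; only blocks actually written consume RAM
mkdir -p /dev/shm/test-osd
truncate -s 6G /dev/shm/test-osd/img
# let losetup pick the first free loop device and attach the file to it
LOOP=$(losetup -f)
losetup "$LOOP" /dev/shm/test-osd/img
# XFS supports user xattrs unconditionally, which is what tmpfs is missing
mkfs.xfs -f "$LOOP"
mkdir -p /var/lib/ceph/osd/ceph-50
mount -t xfs "$LOOP" /var/lib/ceph/osd/ceph-50
---
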
>> Created the OSD with the following:
>>
>> root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
>> root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
>> root@osd1:/# mkfs.xfs /dev/loop0
>> root@osd1:/# ceph osd create
>> 50
>> root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
>> root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
>> 2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>> 2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
>> 2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>> 2014-05-15 00:20:29.807237 7f40063bb780 -1 filestore(/var/lib/ceph/osd/ceph-50) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
>> 2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store /var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid c51a2683-55dc-4634-9d9d-f0fec9a6f389
>> 2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file: /var/lib/ceph/osd/ceph-50/keyring: can't open /var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
>> 2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring /var/lib/ceph/osd/ceph-50/keyring
>> root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey --osd-journal=/dev/sdc7 --mkjournal
>> 2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>> 2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open: aio not supported without directio; disabling aio
>> 2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 != superblock's -1
>> 2014-05-15 00:20:51.129845 7ff813ba4780 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument
>>
>> Cheers,
>> Josef
>>
>> Christian Balzer wrote 2014-05-14 14:33:
>>> Hello!
>>>
>>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> I missed this thread, haven't been reading the list that well the last
>>>> weeks.
>>>>
>>>> You already know my setup, since we discussed it in an earlier thread. I
>>>> don't have a fast backing store, but I see the slow IOPS when doing
>>>> randwrite inside the VM, with rbd cache. Still running dumpling here
>>>> though.
>>>>
>>> Nods, I do recall that thread.
>>>
>>>> A thought struck me that I could test with a pool that consists of OSDs
>>>> that have tmpfs-based disks; I think I have a bit more latency than your
>>>> IPoIB, but I've pushed 100k IOPS with the same network devices before.
>>>> This would verify whether the problem is with the journal disks. I'll
>>>> also try to run the journal devices in tmpfs as well, as that would test
>>>> purely Ceph itself.
>>>>
>>> That would be interesting indeed.
>>> Given what I've seen (with the journal at 20% utilization and the actual
>>> filestore at around 5%) I'd expect Ceph to be the culprit.
>>>
>>>> I'll get back to you with the results, hopefully I'll manage to get them
>>>> done during this night.
>>>>
>>> Looking forward to that. ^^
>>>
>>>
>>> Christian
>>>> Cheers,
>>>> Josef
>>>>
>>>> On 13/05/14 11:03, Christian Balzer wrote:
>>>>> I'm clearly talking to myself, but whatever.
>>>>>
>>>>> For Greg, I've played with all the pertinent journal and filestore
>>>>> options and TCP nodelay, no changes at all.
>>>>>
>>>>> Is there anybody on this ML who's running a Ceph cluster with a fast
>>>>> network and FAST filestore, so like me with a big HW cache in front of
>>>>> RAIDs/JBODs or using SSDs for final storage?
>>>>>
>>>>> If so, what results do you get out of the fio statement below per OSD?
>>>>> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
>>>>> which is of course vastly faster than the normal individual HDDs could
>>>>> do.
>>>>>
>>>>> So I'm wondering if I'm hitting some inherent limitation of how fast a
>>>>> single OSD (as in the software) can handle IOPS, given that everything
>>>>> else has been ruled out from where I stand.
>>>>>
>>>>> This would also explain why none of the option changes or the use of
>>>>> RBD caching has any measurable effect in the test case below.
>>>>> As in, a slow OSD aka single HDD with journal on the same disk would
>>>>> clearly benefit from even the small 32MB standard RBD cache, while in
>>>>> my test case the only time the caching becomes noticeable is if I
>>>>> increase the cache size to something larger than the test data size.
>>>>> ^o^
>>>>>
>>>>> On the other hand, if people here regularly get thousands or tens of
>>>>> thousands of IOPS per OSD with the appropriate HW, I'm stumped.
>>>>>
>>>>> Christian
>>>>>
>>>>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>>>>
>>>>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>>>>
>>>>>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>>>>>> throughput on the RAID array with OSD access patterns, and that's
>>>>>>> applying back pressure on the journal.
>>>>>>>
>>>>>> In the "a picture is worth a thousand words" tradition, I give you
>>>>>> this iostat -x output taken during a fio run:
>>>>>>
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>           50.82    0.00   19.43    0.17    0.00   29.58
>>>>>>
>>>>>> Device: rrqm/s wrqm/s   r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
>>>>>> sda       0.00  51.50  0.00 1633.50    0.00  7460.00     9.13     0.18  0.11    0.00    0.11  0.01  1.40
>>>>>> sdb       0.00   0.00  0.00 1240.50    0.00  5244.00     8.45     0.30  0.25    0.00    0.25  0.02  2.00
>>>>>> sdc       0.00   5.00  0.00 2468.50    0.00 13419.00    10.87     0.24  0.10    0.00    0.10  0.09 22.00
>>>>>> sdd       0.00   6.50  0.00 1913.00    0.00 10313.00    10.78     0.20  0.10    0.00    0.10  0.09 16.60
>>>>>>
>>>>>> The %user CPU utilization is pretty much entirely the 2 OSD processes;
>>>>>> note the nearly complete absence of iowait.
>>>>>>
>>>>>> sda and sdb are the OSD RAIDs, sdc and sdd are the journal SSDs.
>>>>>> Look at these numbers, the lack of queues, the low wait and service
>>>>>> times (this is in ms) plus overall utilization.
>>>>>>
>>>>>> The only conclusion I can draw from these numbers and the network
>>>>>> results below is that the latency happens within the OSD processes.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Christian
>>>>>>> When I suggested other tests, I meant with and without Ceph. One
>>>>>>> particular one is OSD bench. That should be interesting to try at a
>>>>>>> variety of block sizes. You could also try running RADOS bench and
>>>>>>> smalliobench at a few different sizes.
>>>>>>> -Greg
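
In case someone wants to reproduce the runs Greg suggests here, they might
look roughly like the following. OSD id, pool name, PG count and sizes are
placeholders I picked, and the argument handling of "ceph tell osd.N bench"
can differ between releases, so treat this as a sketch rather than a recipe:

---
# per-OSD backend write benchmark: 1 GiB total in 4 KiB writes
# (without arguments it defaults to much larger writes)
ceph tell osd.0 bench 1073741824 4096
# cluster-level small writes: 60 seconds of 4 KiB objects, 32 in flight
ceph osd pool create benchpool 128
rados bench -p benchpool 60 write -b 4096 -t 32
---
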
>>>>>>>
>>>>>>> On Wednesday, May 7, 2014, Alexandre DERUMIER <aderumier@odiso.com> wrote:
>>>>>>>
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>> Have you tried without RAID6, to have more OSDs?
>>>>>>>> (how many disks do you have behind the RAID6?)
>>>>>>>>
>>>>>>>> Also, I know that direct IOs can be quite slow with Ceph,
>>>>>>>> maybe you can try without --direct=1
>>>>>>>>
>>>>>>>> and also enable rbd_cache
>>>>>>>>
>>>>>>>> ceph.conf
>>>>>>>> [client]
>>>>>>>> rbd cache = true
>>>>>>>>
>>>>>>>>
>>>>>>>> ----- Original Message -----
>>>>>>>>
>>>>>>>> From: "Christian Balzer" <chibi@gol.com>
>>>>>>>> To: "Gregory Farnum" <greg@inktank.com>, ceph-users@lists.ceph.com
>>>>>>>> Sent: Thursday, 8 May 2014 04:49:16
>>>>>>>> Subject: Re: Slow IOPS on RBD compared to journal and backing devices
>>>>>>>>
>>>>>>>> On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
>>>>>>>>
>>>>>>>>> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer <chibi@gol.com> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
>>>>>>>>>> journals are on (separate) DC 3700s, the actual OSDs are RAID6
>>>>>>>>>> behind an Areca 1882 with 4GB of cache.
>>>>>>>>>>
>>>>>>>>>> Running this fio:
>>>>>>>>>>
>>>>>>>>>> fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k --iodepth=128
>>>>>>>>>>
>>>>>>>>>> results in:
>>>>>>>>>>
>>>>>>>>>> 30k IOPS on the journal SSD (as expected)
>>>>>>>>>> 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
>>>>>>>>>> 3200 IOPS from a VM using userspace RBD
>>>>>>>>>> 2900 IOPS from a host kernelspace mounted RBD
>>>>>>>>>>
>>>>>>>>>> When running the fio from the VM RBD the utilization of the
>>>>>>>>>> journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
>>>>>>>>>> (1500 IOPS after some obvious merging).
>>>>>>>>>> The OSD processes are quite busy, reading well over 200% on atop,
>>>>>>>>>> but the system is not CPU or otherwise resource starved at that
>>>>>>>>>> moment.
>>>>>>>>>>
>>>>>>>>>> Running multiple instances of this test from several VMs on
>>>>>>>>>> different hosts changes nothing, as in the aggregated IOPS for
>>>>>>>>>> the whole cluster will still be around 3200 IOPS.
>>>>>>>>>>
>>>>>>>>>> Now clearly RBD has to deal with latency here, but the network is
>>>>>>>>>> IPoIB with the associated low latency and the journal SSDs are
>>>>>>>>>> the (consistently) fastest ones around.
>>>>>>>>>>
>>>>>>>>>> I guess what I am wondering about is if this is normal and to be
>>>>>>>>>> expected, or if not, where all that potential performance got lost.
>>>>>>>>> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
>>>>>>>> Yes, but going down to 32 doesn't change things one iota.
>>>>>>>> Also note the multiple instances I mention up there, so that would
>>>>>>>> be 256 IOs at a time, coming from different hosts over different
>>>>>>>> links, and nothing changes.
>>>>>>>>
>>>>>>>>> that's about 40ms of latency per op (for userspace RBD), which
>>>>>>>>> seems awfully long. You should check what your client-side objecter
>>>>>>>>> settings are; it might be limiting you to fewer outstanding ops
>>>>>>>>> than that.
>>>>>>>> Googling for client-side objecter gives a few hits on ceph-devel and
>>>>>>>> bugs, and nothing at all as far as configuration options are
>>>>>>>> concerned. Care to enlighten me where one can find those?
>>>>>>>>
>>>>>>>> Also note the kernelspace (3.13 if it matters) speed, which is very
>>>>>>>> much in the same (junior league) ballpark.
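
For anyone else hunting for the client-side objecter settings Greg refers to:
as far as I can tell they are the objecter inflight throttles, which would
live in the same [client] section Alexandre shows above. The names and
defaults below are from memory rather than from this thread, so verify them
against your release before relying on them:

---
[client]
rbd cache = true
# upper bound on ops/bytes the client keeps in flight towards the OSDs;
# only worth raising if this throttle is actually being hit
objecter inflight ops = 1024
objecter inflight op bytes = 104857600
---
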
>>>>>>>>> If it's available to you, testing with Firefly or even master would
>>>>>>>>> be interesting -- there's some performance work that should reduce
>>>>>>>>> latencies.
>>>>>>>>>
>>>>>>>> Not an option, this is going into production next week.
>>>>>>>>
>>>>>>>>> But a well-tuned (or even default-tuned, I thought) Ceph cluster
>>>>>>>>> certainly doesn't require 40ms/op, so you should probably run a
>>>>>>>>> wider array of experiments to try and figure out where it's coming
>>>>>>>>> from.
>>>>>>>> I think we can rule out the network, NPtcp gives me:
>>>>>>>> ---
>>>>>>>> 56: 4096 bytes 1546 times --> 979.22 Mbps in 31.91 usec
>>>>>>>> ---
>>>>>>>>
>>>>>>>> For comparison, at about 512KB it reaches maximum throughput and
>>>>>>>> still isn't that laggy:
>>>>>>>> ---
>>>>>>>> 98: 524288 bytes 121 times --> 9700.57 Mbps in 412.35 usec
>>>>>>>> ---
>>>>>>>>
>>>>>>>> So with the network performing as well as my lengthy experience with
>>>>>>>> IPoIB led me to believe, what else is there to look at?
>>>>>>>> The storage nodes perform just as expected, as indicated by the local
>>>>>>>> fio tests.
>>>>>>>>
>>>>>>>> That pretty much leaves only Ceph/RBD to look at, and I'm not really
>>>>>>>> sure what experiments I should run on that. ^o^
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Christian
>>>>>>>>
>>>>>>>>> -Greg
>>>>>>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>>> chibi@gol.com           Global OnLine Japan/Fusion Communications
>>>>>>>> http://www.gol.com/
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
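
Regarding the question above of what experiments are left to run against
Ceph/RBD itself: the daemon admin sockets expose per-stage latency counters
and the slowest recent ops, which is usually the quickest way to see where
time goes inside an OSD. A sketch, assuming the default socket paths and a
release that already ships dump_historic_ops:

---
# internal counters of one OSD (journal, filestore and op latencies)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
# the slowest recent ops with per-stage timestamps
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
# the same works for librbd clients if an admin socket is configured, e.g.
#   [client]
#   admin socket = /var/run/ceph/$name.$pid.asok
---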