Is there a way to repair placement groups? [Offtopic - ZFS]


 



On 5/28/14 09:45 , Dimitri Maziuk wrote:
> On 05/28/2014 09:32 AM, Christian Balzer wrote:
>> I was about to write something similar yesterday, but work interfered. ^o^
>>
>> For bandwidth a RAID(Z*/6, don't even think about RAID5 or equivalent) is
>> indeed very nice, but for IOPS it will be worse than a RAID10.
>>
>> Of course a controller with a large writeback cache can pretty alleviate
>> or at least hide those issues up to a point. ^.^
> Also, all benchmarks suck(tm). Are you comparing the exact same workload
> on the exact same disks on the exact same controller etc. Sure you can
> have a software raid 6 that's faster than hardware raid 10 -- it may
> take some work but it should be perfectly doable.
>
>

Agreed.  I rate all benchmark tools on the "least useless" scale.


In that case, I was using the same single server, with different zpool
configurations and tunables.  Each disk was a single-disk RAID0 volume
on a controller with battery-backed write cache.  All tests included a
mirrored ZIL on SSD and an L2ARC on SSD.  This was a while ago, so
those SSDs would've been the Intel X25-E 64GB.
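
To give a concrete picture, the zpool layout I ended up with (more on
that below) looked roughly like this sketch.  The pool and device
names are made up, and the exact split of the SSDs into ZIL mirror and
L2ARC is from memory:

    # 5 spinners in RAIDZ, mirrored ZIL (slog) on SSD, L2ARC on SSD
    zpool create tank raidz sdb sdc sdd sde sdf \
        log mirror sdg sdh \
        cache sdi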

The PostgreSQL benchmark tool (pgbench) runs a modified TPC-B
benchmark.  TPC-B is more IOPS-constrained than throughput-constrained,
but it has a component of both.  It doesn't really match my access
patterns, but it's Close Enough (tm) that I'm not forced to do a better
job.  I care about overall latency and IOPS in a many-user scenario,
not single-thread performance.
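
If anyone wants to reproduce that style of test, a pgbench run along
these lines is what I mean.  The scale factor, client count, and run
length here are illustrative, not my exact numbers:

    # build the TPC-B-style tables; -s controls the data set size
    pgbench -i -s 1000 pgbench_db

    # 10 minute run with many concurrent clients, per-statement latency report
    pgbench -c 64 -j 8 -T 600 -r pgbench_db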

I was tuning ZFS parameters for my database server.  It wasn't meant to
be definitive, just something to quickly narrow down values for real
production tests.  In the end, the parameters that gave the best
pgbench score also gave the best-performing production database server,
despite the difference in access patterns.  I was surprised too.  It's
probably because, in the end, I didn't have to change much.

I ended up with 4 optimizations, in order of effectiveness:

 1. Adding an SSD ZIL and L2ARC
 2. Using a 5-disk RAIDZ over a 4-disk RAID10 (I had 8 drive bays; with
    the 3 SSDs, that left 5 bays for spinners)
 3. Adjusting the ZFS recordsize to match PostgreSQL's 8k block size
    (see the sketch after this list)
 4. Enabling compression.  This tied for effectiveness with #3, but the
    combination with #3 was not statistically significant.  I would
    re-test it under Ceph.
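
For #3 and #4, the ZFS side is just two properties on the dataset that
holds the database.  A minimal sketch, assuming a hypothetical
tank/postgres dataset (lz4 is an example; use whatever compression your
ZFS build offers):

    # match ZFS records to PostgreSQL's 8k pages
    zfs set recordsize=8k tank/postgres

    # enable lightweight compression on the same dataset
    zfs set compression=lz4 tank/postgres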

Everything else I tested was either counter-productive or statistically 
insignificant.


Bringing it back to Ceph, I plan to re-test all of them on Ceph on ZFS.
Prior to testing, my hypothesis is that I'll end up with #1, #2, and #4.

My initial Ceph cluster uses the same chassis as my database server.  If
I willfully ignore some things, the PostgreSQL benchmark sounds like a
reasonable first-order approximation for my Ceph nodes.

Using rados bench, my benchmarks on XFS told me to skip the SSD
journals and put more spinners in.  That benchmarked really well for my
mostly-read workload.  It proved to be a disaster in production when I
started expanding the cluster.  My benchmarks only had the cluster 10%
full, and there wasn't enough volume to actually stress things
properly.  Production load indicates that I need the SSD journals.  I'm
in the process of adjusting the older nodes.
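
For what it's worth, the rados bench runs I'm talking about were
basically the following.  Pool name, run length, and thread count are
examples, not my exact settings:

    # write phase; keep the objects around so we can read them back
    rados bench -p testpool 300 write -t 16 --no-cleanup

    # sequential read phase against the objects written above
    rados bench -p testpool 300 seq -t 16

    # (remember to delete the benchmark objects from the pool afterwards)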


The reason my PostgreSQL benchmark was successful, and my Ceph
benchmark failed so miserably?  For PostgreSQL, I had enough production
experience to know that the benchmark was somewhat reasonable, and I
had a way to test the results in production.  I had neither of those
things when I was running my Ceph benchmarks.  Which mostly boils down
to hubris: "I'm good enough that I don't need those things anymore."


Hence my assertion to re-test things you think you know.  :-)



-- 

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis at centraldesktop.com

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/>  | Twitter 
<http://www.twitter.com/centraldesktop>  | Facebook 
<http://www.facebook.com/CentralDesktop>  | LinkedIn 
<http://www.linkedin.com/groups?gid=147417>  | Blog 
<http://cdblog.centraldesktop.com/>
