Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)

Lionel Bouton <lionel-subscription@xxxxxxxxxxx> · Sat, 13 Feb 2016 19:08:02 +0100

Hi,

On 13/02/2016 15:52, Christian Balzer wrote:
> [..]
>
> Hum that's surprisingly long. How much data (size and nb of files) do
> you have on this OSD, which FS do you use, what are the mount options,
> what is the hardware and the kind of access ?
>
> I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> 7 disk RAID6 per OSD. 
> Nothing aside from noatime for mount options and EXT4.

Thanks for the reminder. That said, the combination of 7-disk RAID6 and
EXT4 is new to me and may not be innocent.

>  
> 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> than 700k files per OSD.

That's nearly 3x more than my example OSD but it doesn't explain the
more than 10x difference in startup time (especially considering BTRFS
OSDs are slow to start up and my example was with dropped caches, unlike
your case). Your average file size is similar so it's not that either.
Unless you have a more general, system-wide performance problem which
impacts everything including the OSD init, there are 3 main components
involved here:
- Ceph OSD init code,
- ext4 filesystem,
- HW RAID6 block device.

So either:
- the OSD init code doesn't scale past ~500k objects per OSD,
- your ext4 filesystem is slow for the kind of access used during init
(inherently or due to fragmentation, you might want to use filefrag on a
random sample of files from the PG, omap and meta directories),
- your RAID6 array is slow for the kind of access used during init,
- any combination of the above.
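
The filefrag check suggested in the list above could be scripted roughly
as follows. This is only a sketch under assumptions: the OSD path
/var/lib/ceph/osd/ceph-0/current and the sample size of 200 are
placeholders, the function names are made up for illustration, and the
parsing relies on the summary line filefrag prints for each file
("<file>: N extents found").

```python
import os
import random
import re
import subprocess

# Assumed filestore data directory; adjust to the OSD being examined.
OSD_DIR = "/var/lib/ceph/osd/ceph-0/current"

# Summary line printed by filefrag, e.g. "/path/file: 12 extents found".
EXTENTS_RE = re.compile(r"^(.*): (\d+) extents? found$")


def parse_filefrag(output):
    """Map each file named in filefrag summary lines to its extent count."""
    counts = {}
    for line in output.splitlines():
        match = EXTENTS_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def sample_files(root, k, seed=None):
    """Return up to k files picked at random from the tree under root."""
    paths = []
    for dirpath, _dirs, files in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in files)
    paths.sort()  # stable order so a fixed seed gives a repeatable sample
    rng = random.Random(seed)
    return rng.sample(paths, min(k, len(paths)))


def fragmentation_report(root, k=200):
    """Run filefrag on a random sample and return {path: extent count}."""
    files = sample_files(root, k)
    if not files:
        return {}
    result = subprocess.run(["filefrag"] + files,
                            capture_output=True, text=True, check=False)
    return parse_filefrag(result.stdout)


if __name__ == "__main__":
    report = fragmentation_report(OSD_DIR)
    if report:
        average = sum(report.values()) / len(report)
        print(f"{len(report)} files sampled, {average:.1f} extents on average")
        worst = sorted(report.items(), key=lambda kv: kv[1], reverse=True)
        for path, extents in worst[:10]:
            print(f"{extents:6d}  {path}")
```

Walking a tree with 700k files takes a while, so it may be worth pointing
the script at a few PG directories rather than the whole OSD.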

I believe it's possible but doubtful that the OSD code wouldn't scale at
this level (this does not feel like an abnormally high number of objects
to me). Ceph devs will know better.
ext4 could be a problem as it's not the most common choice for OSDs
(from what I read here XFS is usually preferred over it) and it forces
Ceph to use omap to store data which would be stored in extended
attributes otherwise (which probably isn't without performance problems).
RAID5/6 on HW might have performance problems. The usual ones happen on
writes, and OSD init is probably read-intensive (or maybe not: you should
check the kind of access happening during the OSD init to avoid any
surprise). With HW cards it's difficult to know for sure which
performance limitations they introduce: the only sure way is testing the
actual access patterns.
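
One way to check whether OSD init is read- or write-heavy is to compare
the kernel's per-device counters in /proc/diskstats before and after a
restart. The sketch below assumes the OSD sits on a block device called
"sdb" and that the restart is triggered by hand during a 60 second
window; both are placeholders, and the helper names are invented for
this example.

```python
import time


def parse_diskstats(text, device):
    """Return (reads, sectors_read, writes, sectors_written) for device.

    Columns follow the kernel's /proc/diskstats layout: major, minor,
    name, reads completed, reads merged, sectors read, ms reading,
    writes completed, writes merged, sectors written, ...
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[2] == device:
            return (int(fields[3]), int(fields[5]),
                    int(fields[7]), int(fields[9]))
    raise KeyError(device)


def io_delta(before, after):
    """Counter increase between two snapshots."""
    return tuple(a - b for a, b in zip(after, before))


def read_share(delta):
    """Fraction of completed I/O requests that were reads."""
    reads, _sectors_read, writes, _sectors_written = delta
    total = reads + writes
    return reads / total if total else 0.0


def snapshot(device):
    with open("/proc/diskstats") as stats:
        return parse_diskstats(stats.read(), device)


if __name__ == "__main__":
    DEVICE = "sdb"    # assumed device backing the OSD
    INTERVAL = 60     # assumed observation window in seconds
    start = snapshot(DEVICE)
    time.sleep(INTERVAL)  # restart the OSD while this sleeps
    delta = io_delta(start, snapshot(DEVICE))
    # /proc/diskstats counts sectors in 512-byte units.
    print(f"reads: {delta[0]} ({delta[1] * 512 / 2**20:.1f} MiB)")
    print(f"writes: {delta[2]} ({delta[3] * 512 / 2**20:.1f} MiB)")
    print(f"read share of requests: {read_share(delta):.0%}")
```

Other activity on the same device during the window is counted too, so
this is best run on an otherwise idle OSD.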

So I would probably try to reproduce the problem after replacing one of
the RAID6-based OSDs with as many OSDs as there are devices in its array.
Then, if it solves the problem and you haven't already done so, you might
want to explore Areca tuning, specifically for RAID6 if you must have it.

>
> And kindly take note that my test cluster has less than 120k objects and
> thus 15k files per OSD and I still was able to reproduce this behaviour (in
> spirit at least).

I assume the test cluster uses ext4 and RAID6 arrays too: it would be a
perfect testing environment for defragmentation/switch to XFS/switch to
single drive OSDs then.

>
>> The only time I saw OSDs take several minutes to reach the point where
>> they fully rejoin is with BTRFS with default options/config.
>>
> There isn't a pole long enough I would touch BTRFS with for production,
> especially in conjunction with Ceph.

That's a matter of experience and environment but I can understand: we
invested more than a week of testing/development to reach a point where
BTRFS was performing better than XFS in our use case. Not everyone can
dedicate as much time just to select a filesystem and support it. There
might be use cases where it's not even possible to use it (I'm not sure
how it would perform if you only stored small objects, for example).

BTRFS has been invaluable though: it detected and helped fix corruption
generated by faulty RAID controllers (by forcing Ceph to use other
replicas when repairing). I wouldn't let precious data live on anything
other than checksumming filesystems now (the probability of
undetectable disk corruption is too high for our use case). We have
30 BTRFS OSDs in production (and many BTRFS filesystems on other
systems) and we've never had any problem with them. These filesystems
even survived several bad datacenter equipment failures (faulty backup
generator control system and UPS blowing up during periodic testing).
That said, I'm subscribed to linux-btrfs, I was one of the SATA controller
driver maintainers long ago so I know my way around kernel code, I
hand-pick the kernel versions going to production and we have custom
tools and maintenance procedures for the BTRFS OSDs. So I have the means
and experience which make this choice comfortable for me and my team: I
wouldn't blindly recommend BTRFS to anyone else (not yet).

Anyway, it's possible ext4 is a problem but it seems to me less likely
than the HW RAID6. In my experience RAID controllers with cache aren't
really worth it with Ceph. Most of the time they perform well because of
BBWC/FBWC, but when you get into a situation where you must
repair/backfill because you lost an OSD or added a new one, the HW cache
is completely overwhelmed (what good can 4GB do when you must backfill
1TB or even catch up with tens of GB of writes?). It's so bad that when
we add an OSD the first thing we do now is selectively disable the HW
cache for its device to avoid slowing all the other OSDs connected to the
same controller.
Using RAID6 for OSDs can minimize the backfills by avoiding losing OSDs
but probably won't avoid them totally (most people have to increase
storage eventually). In some cases it might be worth it (very large
installations where the number of OSDs may become a problem) but we
aren't there yet. You also probably have to test these arrays
extensively: how much IO can you get from them in various access
patterns, including when they are doing internal maintenance, running
with one or two devices missing and rebuilding one or two replaced
devices? So we will delay any kind of RAID below OSDs as long as we can.
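
To put the cache figures above in perspective, here is the arithmetic as
a tiny script. The 4GB cache and 1TB backfill come from the paragraph
above; the 50GB catch-up figure is only an assumed stand-in for "tens of
GB", and the function name is made up for the example.

```python
def cache_coverage(cache_gb, workload_gb):
    """Fraction of a sustained write workload that fits in the cache."""
    return min(1.0, cache_gb / workload_gb)


if __name__ == "__main__":
    CACHE_GB = 4         # controller cache size used in the text above
    BACKFILL_GB = 1024   # 1TB backfill, counted as 1024GB
    CATCH_UP_GB = 50     # assumed value for "tens of GB" of writes
    for label, size in (("backfill", BACKFILL_GB), ("catch-up", CATCH_UP_GB)):
        share = cache_coverage(CACHE_GB, size)
        print(f"{label}: cache holds {share:.1%} of {size}GB")
```

With these numbers the cache covers well under 1% of the backfill and
less than 10% of the catch-up writes, so almost all of the traffic has to
go to the disks at their native speed.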

Best regards,

Lionel