Hi,

# First a short description of our Ceph setup

You can skip to the next section ("Main questions") to save time and come back to this one if you need more context.

We are currently moving some of our VMs away from DRBD-based storage backed by RAID arrays to Ceph. Our focus is on resiliency and capacity (one VM was outgrowing the largest RAID10 we had), not on maximum performance (at least not yet).

Our Ceph OSDs are fairly unbalanced: 2 of them are on 2 historic hosts, each with 4 disks in a hardware RAID10 configuration and no room left in the chassis for new disks. The 12 other OSDs are on 2 new systems with 6 disk drives each, one OSD per drive (CPU and RAM configurations are nearly identical on the 4 hosts).

All hosts run VMs too; we took some precautions to avoid too much interference: each host has CPU and RAM to spare for the OSDs. CPU usage shows occasional bursts, but as we only have one or two VMs on each host they can't starve the OSDs, which have between 2 and 8 full cores (4 to 16 hardware threads) available to them depending on the current load. We have at least 4GB of free RAM per OSD on each host at all times (including room for at least a 4GB OS cache).

To sum up, we have a total of 14 OSDs; the 2 large ones on RAID10 are clearly our current bottleneck. That said, until we have additional hardware they allow us to maintain availability even if 2 servers are down (default crushmap, with pools configured with 3 replicas on 3 different hosts) and performance is acceptable (backfilling/scrubbing/... of pgs required some tuning though, and I'm eagerly waiting for 0.80.7 to begin testing the new IO priority tunables).

Everything is based on SATA/SAS 7200rpm disk drives behind P410 RAID controllers (HP ProLiant systems) with battery-backed cache memory to help with write bursts.

The OSDs are a mix of:
- Btrfs on 3.17.0 kernels on individual disks, 450GB used of 2TB (3.17.0 fixes a filesystem lockup we had with earlier kernels, manifesting itself during concurrent accesses to several Btrfs filesystems according to recent lkml posts),
- Btrfs on 3.12.21 kernels on the 2 systems with RAID10, 1.5TB used of 3TB (no lockup on these yet, but they will migrate to 3.17.0 when we have enough experience with it),
- XFS for a minority of individual disks (with a dedicated partition for the journal). Most of them have the same history (all were created at the same time); only two were created later (following Btrfs corruption and/or conversion to XFS) and are left out when comparing behaviours.

All Btrfs volumes use these mount options:
rw,noatime,nodiratime,compress=lzo,space_cache,autodefrag,recovery

All OSDs use a 5GB journal.

We are slowly adding monitoring to the setup to see what the benefits of Btrfs are in our case (ceph osd perf, kernel iowait per device, OSD CPU usage, ...). One long-term objective is to slowly raise performance, both by migrating to/adding more suitable hardware and by tuning the software side. Detailed monitoring should help us study the behaviour of isolated OSDs with different settings, and warn us early if one of them causes performance problems so we can take it out with next to no impact on the whole storage network (we are strong believers in slow, incremental and continuous change, and distributed storage with redundancy makes that easy to implement).
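To give an idea of the kind of data we're after, a minimal sketch of such a collection loop (the log path and the 60s interval are arbitrary examples for illustration, not a description of our actual scripts):

  #!/bin/sh
  # Minimal sketch: append a timestamped "ceph osd perf" snapshot
  # (per-OSD commit/apply latencies) to a daily plain-text log.
  mkdir -p /var/log/ceph-perf
  while true; do
      {
          date '+%F %T'
          ceph osd perf
      } >> /var/log/ceph-perf/$(date +%F).log
      sleep 60
  done

Nothing fancy, but it gives a per-OSD latency history to correlate with the rest.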
# Main questions

The system works well, but I just realised, when restarting one of the 2 large Btrfs OSDs, that it was very slow to rejoin the cluster ("ceph osd set noout" was used for the restart). I stopped the OSD init after 5 minutes to investigate what was going on and didn't find any obvious problem (filesystem sane, no swapping, no CPU hogs, concurrent IO not able to starve the system by itself, ...).

The next restarts took between 43s (nearly no concurrent disk access and warm caches, after an earlier restart without unmounting the filesystem) and 3m57s (one VM still on DRBD doing ~30 IO/s on the same volume and cold caches, after a fresh filesystem mount). It seems that the startup time is getting longer on the 2 large Btrfs filesystems (the other one gives similar results: 3m48s on the first try, for example). I noticed it was a bit slow a week ago, but not as much (there was ~half as much data on them at the time). OSDs on individual disks don't exhibit this problem (with warm caches, init finishes in ~4s on the small Btrfs volumes and ~3s on the XFS volumes), but they are on dedicated disks with less data.

With warm caches, most of the time is spent between the
  "osd.<n> <osdmap> load_pgs"
  "osd.<n> <osdmap> load_pgs opened <m> pgs"
log lines in /var/log/ceph/ceph-osd.<n>.log (m is ~650 for both OSDs). So it seems most of the time is spent opening pgs.

What could explain such long startup times? Is the OSD init doing a lot of random disk accesses? Does it depend on the volume of data or on the history of the OSD (fragmentation?)? Maybe Btrfs on 3.12.21 has known performance problems or suboptimal autodefrag (on 3.17.0, with 1/3 of the data and a similar history of disk accesses, we see 1/10 of the init time when the disks are idle in both cases)?

# Snap destroy errors at init on Btrfs

During each init of a Btrfs-backed OSD we get this kind of error in the ceph-osd logs (they always come in pairs like this, at the very beginning of the phase where the OSD opens the pgs):

2014-10-13 23:54:44.143039 7fd4267fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.143087 7fd4267fc700 -1 filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap 'snap_21161231' got (2) No such file or directory
2014-10-13 23:54:44.266149 7fd4267fc700 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-1) destroy_checkpoint: ioctl SNAP_DESTROY got (2) No such file or directory
2014-10-13 23:54:44.266189 7fd4267fc700 -1 filestore(/var/lib/ceph/osd/ceph-1) unable to destroy snap 'snap_21161268' got (2) No such file or directory

I suppose this is harmless (at least these OSDs don't show any other error/warning and have been restarted, and their filesystems remounted, on numerous occasions), but I'd like to be sure: is it?

Best regards,

Lionel Bouton
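PS: for anyone wanting to compare numbers, here is a rough sketch of how the load_pgs duration can be extracted from the OSD log (not necessarily the exact commands I used; it assumes the timestamped log format shown above, that both lines fall on the same day, and uses osd.1 as an example):

  # Take the last "load_pgs" / "load_pgs opened <m> pgs" pair and print
  # the elapsed time between their timestamps (2nd field of each line).
  grep load_pgs /var/log/ceph/ceph-osd.1.log | tail -n 2 |
      awk '{ split($2, t, ":"); s[NR] = t[1]*3600 + t[2]*60 + t[3] }
           END { printf "%.1f seconds\n", s[2] - s[1] }'

And to see what the OSD actually has on disk when it logs those snap destroy errors, the FileStore snapshots can be listed directly (again, ceph-1 is just an example):

  # List the snap_<n> subvolumes present at the top of the OSD data dir.
  btrfs subvolume list /var/lib/ceph/osd/ceph-1 | grep snap_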