Hi Christian,

Sure, ZFS is way more mature than Btrfs, but what is the status of ZFS on
Linux? I use ZFS on FreeBSD for backup purposes (72 TB, 12 disks in 2
RaidZ2 vdevs) and it works great, but it's something I would be afraid to
do on Linux.

--
Thomas Lemarchand
Cloud Solutions SAS - Information Systems Manager

On Sun, 2014-12-28 at 14:59 +0900, Christian Balzer wrote:
> Hello Jiri,
>
> On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:
>
> > Hi Christian.
> >
> > Thank you for your comments again. Very helpful.
> >
> > I will try to fix the current pool and see how it goes. It's good to
> > learn some troubleshooting skills.
> >
> Indeed, knowing what to do when things break is where it's at.
>
> > Regarding BTRFS vs XFS, I am not sure if the documentation is old. My
> > decision was based on this:
> >
> > http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> >
> It's dated for sure, and a bit of wishful thinking on behalf of the Ceph
> developers, who understandably didn't want to re-invent the wheel inside
> Ceph when the underlying file system could provide it (checksums,
> snapshots, etc.).
>
> ZFS has all the features BTRFS is aspiring to (and much better tested),
> and if kept below 80% utilization it doesn't fragment itself to death.
>
> At the end of that page they mention deduplication, which of course (as
> I wrote recently in the "use ZFS for OSDs" thread) is unlikely to do
> anything worthwhile at all.
>
> Simply put, some things _need_ to be done in Ceph to work properly and
> can't be delegated to the underlying FS or other storage backend.
>
> Christian
>
> > Note
> >
> > We currently recommend XFS for production deployments. We recommend
> > btrfs for testing, development, and any non-critical deployments. We
> > believe that btrfs has the correct feature set and roadmap to serve
> > Ceph in the long-term, but XFS and ext4 provide the necessary
> > stability for today's deployments. btrfs development is proceeding
> > rapidly: users should be comfortable installing the latest released
> > upstream kernels and be able to track development activity for
> > critical bug fixes.
> >
> > Thanks
> > Jiri
> >
> > On 28/12/2014 16:01, Christian Balzer wrote:
> > > Hello,
> > >
> > > On Sun, 28 Dec 2014 11:58:59 +1100 jirik@xxxxxxxxxx wrote:
> > >
> > >> Hi Christian.
> > >>
> > >> Thank you for your suggestions.
> > >>
> > >> I will set the "osd pool default size" to 2 as you recommended. As
> > >> mentioned, the documentation is talking about OSDs, not nodes, so
> > >> that must have confused me.
> > >>
> > > Note that changing this will only affect new pools, of course. So to
> > > sort out your current state, either start over with this value set
> > > before creating/starting anything, or reduce the current size
> > > (ceph osd pool set <poolname> size).
> > >
> > > Have a look at the crushmap example, or even better your own current
> > > one, and you will see that by default the host is the failure
> > > domain. Which of course makes a lot of sense.
> > >
> > >> Regarding BTRFS, I thought that btrfs is the better option for the
> > >> future, providing more features. I know that XFS might be more
> > >> stable, but again my impression was that btrfs is the focus for
> > >> future development. Is that correct?
> > >>
> > > I'm not a developer, but if you scour the ML archives you will find
> > > a number of threads about BTRFS (and ZFS).
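[For reference, the two changes Christian describes above would look
roughly like this. This is only a sketch, assuming the target replication
size is 2 and the pool in question is the default "rbd" pool from this
thread:

    # ceph.conf, [global] section; affects only pools created afterwards
    osd pool default size = 2

    # reduce the size of the already existing pool
    ceph osd pool set rbd size 2

The config setting does not touch existing pools, which is why the second
command (or recreating the pool) is still needed.]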
> > > The biggest issues with BTRFS are not just stability but also the
> > > fact that it degrades rather quickly (fragmentation) due to its COW
> > > nature and having less smarts than ZFS in that area.
> > > So development on the Ceph side is not the issue per se.
> > >
> > > IMHO BTRFS looks more and more stillborn, and with regard to Ceph,
> > > ZFS might become the better choice (in the future), with KV store
> > > backends being an alternative for some use cases (also far from
> > > production ready at this time).
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > >> You are right about the round-up. I forgot about that.
> > >>
> > >> Thanks for your help. Much appreciated.
> > >> Jiri
> > >>
> > >> ----- Reply message -----
> > >> From: "Christian Balzer" <chibi@xxxxxxx>
> > >> To: <ceph-users@xxxxxxxx>
> > >> Cc: "Jiri Kanicky" <jirik@xxxxxxxxxx>
> > >> Subject: HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded;
> > >> 133 pgs stuck unclean; 29 pgs stuck undersized;
> > >> Date: Sun, Dec 28, 2014 03:29
> > >>
> > >> Hello,
> > >>
> > >> On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I just built my Ceph cluster but am having problems with the
> > >>> health of the cluster.
> > >>>
> > >> You're not telling us the version, but it's clearly 0.87 or beyond.
> > >>
> > >>> Here are a few details:
> > >>> - I followed the ceph documentation.
> > >> Outdated, unfortunately.
> > >>
> > >>> - I used the btrfs filesystem for all OSDs.
> > >> Big mistake number 1, do some research (google, ML archives).
> > >> Though not related to your problems.
> > >>
> > >>> - I did not set "osd pool default size = 2" as I thought that if I
> > >>> have 2 nodes + 4 OSDs, I can leave the default of 3. I am not sure
> > >>> if this was right.
> > >> Big mistake, assumption number 2: the replication size under the
> > >> default CRUSH rule is determined by hosts. So that's your main
> > >> issue here. Either set it to 2 or use 3 hosts.
> > >>
> > >>> - I noticed that the default pools "data,metadata" were not
> > >>> created. Only the "rbd" pool was created.
> > >> See outdated docs above. The majority of use cases are with RBD, so
> > >> since Giant the cephfs pools are not created by default.
> > >>
> > >>> - As it was complaining that the pg_num is too low, I increased
> > >>> the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd
> > >>> pg_num 133 > pgp_num 64".
> > >>>
> > >> Re-read the (in this case correct) documentation.
> > >> It clearly states to round up to the nearest power of 2, in your
> > >> case 256.
> > >>
> > >> Regards.
> > >>
> > >> Christian
> > >>
> > >>> Would you give me a hint where I have made the mistake? (I can
> > >>> remove the OSDs and start over if needed.)
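[For reference, the pg_num correction Christian points out above would
look roughly like this. A sketch only, assuming the existing rbd pool is
kept and grown rather than recreated; pg_num has to be raised first,
because pgp_num may not exceed pg_num:

    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256

Note that pg_num can only ever be increased on a pool, so rounding up to
256 is the only direction available here anyway.]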
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> > >>> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> > >>> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd
> > >>> pg_num 133 > pgp_num 64
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph status
> > >>>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> > >>>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded;
> > >>>             133 pgs stuck unclean; 29 pgs stuck undersized;
> > >>>             29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
> > >>>      monmap e1: 2 mons at
> > >>>             {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0},
> > >>>             election epoch 8, quorum 0,1 ceph1,ceph2
> > >>>      osdmap e42: 4 osds: 4 up, 4 in
> > >>>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
> > >>>             11704 kB used, 11154 GB / 11158 GB avail
> > >>>                   29 active+undersized+degraded
> > >>>                  104 active+remapped
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> > >>> # id    weight  type name       up/down reweight
> > >>> -1      10.88   root default
> > >>> -2      5.44            host ceph1
> > >>> 0       2.72                    osd.0   up      1
> > >>> 1       2.72                    osd.1   up      1
> > >>> -3      5.44            host ceph2
> > >>> 2       2.72                    osd.2   up      1
> > >>> 3       2.72                    osd.3   up      1
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> > >>> 0 rbd,
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> > >>> [global]
> > >>> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> > >>> public_network = 192.168.30.0/24
> > >>> cluster_network = 10.1.1.0/24
> > >>> mon_initial_members = ceph1, ceph2
> > >>> mon_host = 192.168.30.21,192.168.30.22
> > >>> auth_cluster_required = cephx
> > >>> auth_service_required = cephx
> > >>> auth_client_required = cephx
> > >>> filestore_xattr_use_omap = true
> > >>>
> > >>> Thank you
> > >>> Jiri
> > >>
> > >
> >
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
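[As background to the "host is the failure domain" point made in this
thread: the default replicated CRUSH rule of that era looked roughly like
the sketch below. The exact rule name and the min_size/max_size values
are assumptions and can differ between releases:

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # place each replica on a distinct host
            step chooseleaf firstn 0 type host
            step emit
    }

The "step chooseleaf firstn 0 type host" line is what maps each replica
to a different host, so with a pool size of 3 and only two hosts the
third replica can never be placed, which is exactly why the PGs in the
output above stay undersized and degraded, and why setting the pool size
to 2 or adding a third host resolves it.]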