Hi Christian,

Sure, ZFS is way more mature than Btrfs, but what is the status of ZFS on
Linux? I use ZFS on FreeBSD for backup purposes (72 TB, 12 disks in 2
RaidZ2 vdevs) and it works great, but it's something I would be afraid to
do on Linux.

--
Thomas Lemarchand
Cloud Solutions SAS - Information Systems Manager

On Sun, 2014-12-28 at 14:59 +0900, Christian Balzer wrote:
> Hello Jiri,
>
> On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:
>
> > Hi Christian.
> >
> > Thank you for your comments again. Very helpful.
> >
> > I will try to fix the current pool and see how it goes. It's good to
> > learn some troubleshooting skills.
> >
> Indeed, knowing what to do when things break is where it's at.
>
> > Regarding BTRFS vs XFS, I am not sure if the documentation is old. My
> > decision was based on this:
> >
> > http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/
> >
> It's dated for sure, and a bit of wishful thinking on behalf of the Ceph
> developers, who understandably didn't want to re-invent the wheel inside
> Ceph when the underlying file system could provide it (checksums,
> snapshots, etc.).
>
> ZFS has all the features BTRFS is aspiring to (and much better tested),
> and if kept below 80% utilization it doesn't fragment itself to death.
>
> At the end of that page they mention deduplication, which of course (as
> I wrote recently in the "use ZFS for OSDs" thread) is unlikely to do
> anything worthwhile at all.
>
> Simply put, some things _need_ to be done in Ceph to work properly and
> can't be delegated to the underlying FS or other storage backend.
>
> Christian
>
> > Note
> >
> > We currently recommend XFS for production deployments. We recommend
> > btrfs for testing, development, and any non-critical deployments. We
> > believe that btrfs has the correct feature set and roadmap to serve
> > Ceph in the long-term, but XFS and ext4 provide the necessary
> > stability for today's deployments. btrfs development is proceeding
> > rapidly: users should be comfortable installing the latest released
> > upstream kernels and be able to track development activity for
> > critical bug fixes.
> >
> > Thanks
> > Jiri
> >
> > On 28/12/2014 16:01, Christian Balzer wrote:
> > > Hello,
> > >
> > > On Sun, 28 Dec 2014 11:58:59 +1100 jirik@xxxxxxxxxx wrote:
> > >
> > >> Hi Christian.
> > >>
> > >> Thank you for your suggestions.
> > >>
> > >> I will set the "osd pool default size" to 2 as you recommended. As
> > >> mentioned, the documentation is talking about OSDs, not nodes, so
> > >> that must have confused me.
> > >>
> > > Note that changing this will only affect new pools, of course. So to
> > > sort out your current state, either start over with this value set
> > > before creating/starting anything, or reduce the current size
> > > (ceph osd pool set <poolname> size).
> > >
> > > Have a look at the crushmap example, or even better your own current
> > > one, and you will see that by default the host is the failure
> > > domain. Which of course makes a lot of sense.
> > >
> > >> Regarding BTRFS, I thought that btrfs is the better option for the
> > >> future, providing more features. I know that XFS might be more
> > >> stable, but again my impression was that btrfs is the focus for
> > >> future development. Is that correct?
> > >>
> > > I'm not a developer, but if you scour the ML archives you will find
> > > a number of threads about BTRFS (and ZFS).
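[For reference, the two changes Christian describes above would look
roughly like this. This is only a sketch, assuming the target replication
size is 2 and the pool in question is the default "rbd" pool from this
thread:

    # ceph.conf, [global] section; affects only pools created afterwards
    osd pool default size = 2

    # reduce the size of the already existing pool
    ceph osd pool set rbd size 2

The config setting does not touch existing pools, which is why the second
command (or recreating the pool) is still needed.]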
> > > The biggest issues with BTRFS are not just stability but also the
> > > fact that it degrades rather quickly (fragmentation) due to its COW
> > > nature and having less smarts than ZFS in that area.
> > > So development on the Ceph side is not the issue per se.
> > >
> > > IMHO BTRFS looks more and more stillborn, and with regard to Ceph,
> > > ZFS might become the better choice (in the future), with KV store
> > > backends being an alternative for some use cases (also far from
> > > production ready at this time).
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > >> You are right about the round-up. I forgot about that.
> > >>
> > >> Thanks for your help. Much appreciated.
> > >> Jiri
> > >>
> > >> ----- Reply message -----
> > >> From: "Christian Balzer" <chibi@xxxxxxx>
> > >> To: <ceph-users@xxxxxxxx>
> > >> Cc: "Jiri Kanicky" <jirik@xxxxxxxxxx>
> > >> Subject: HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded;
> > >> 133 pgs stuck unclean; 29 pgs stuck undersized;
> > >> Date: Sun, Dec 28, 2014 03:29
> > >>
> > >> Hello,
> > >>
> > >> On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I just built my Ceph cluster but am having problems with the
> > >>> health of the cluster.
> > >>>
> > >> You're not telling us the version, but it's clearly 0.87 or beyond.
> > >>
> > >>> Here are a few details:
> > >>> - I followed the ceph documentation.
> > >> Outdated, unfortunately.
> > >>
> > >>> - I used the btrfs filesystem for all OSDs.
> > >> Big mistake number 1, do some research (google, ML archives).
> > >> Though not related to your problems.
> > >>
> > >>> - I did not set "osd pool default size = 2" as I thought that if I
> > >>> have 2 nodes + 4 OSDs, I can leave the default of 3. I am not sure
> > >>> if this was right.
> > >> Big mistake, assumption number 2: the replication size under the
> > >> default CRUSH rule is determined by hosts. So that's your main
> > >> issue here. Either set it to 2 or use 3 hosts.
> > >>
> > >>> - I noticed that the default pools "data,metadata" were not
> > >>> created. Only the "rbd" pool was created.
> > >> See outdated docs above. The majority of use cases are with RBD, so
> > >> since Giant the cephfs pools are not created by default.
> > >>
> > >>> - As it was complaining that the pg_num is too low, I increased
> > >>> the pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd
> > >>> pg_num 133 > pgp_num 64".
> > >>>
> > >> Re-read the (in this case correct) documentation.
> > >> It clearly states to round up to the nearest power of 2, in your
> > >> case 256.
> > >>
> > >> Regards.
> > >>
> > >> Christian
> > >>
> > >>> Would you give me a hint where I have made the mistake? (I can
> > >>> remove the OSDs and start over if needed.)
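[For reference, the pg_num correction Christian points out above would
look roughly like this. A sketch only, assuming the existing rbd pool is
kept and grown rather than recreated; pg_num has to be raised first,
because pgp_num may not exceed pg_num:

    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256

Note that pg_num can only ever be increased on a pool, so rounding up to
256 is the only direction available here anyway.]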
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph health
> > >>> HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
> > >>> unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd
> > >>> pg_num 133 > pgp_num 64
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph status
> > >>>     cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> > >>>      health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded;
> > >>>             133 pgs stuck unclean; 29 pgs stuck undersized;
> > >>>             29 pgs undersized; pool rbd pg_num 133 > pgp_num 64
> > >>>      monmap e1: 2 mons at
> > >>>             {ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0},
> > >>>             election epoch 8, quorum 0,1 ceph1,ceph2
> > >>>      osdmap e42: 4 osds: 4 up, 4 in
> > >>>       pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
> > >>>             11704 kB used, 11154 GB / 11158 GB avail
> > >>>                   29 active+undersized+degraded
> > >>>                  104 active+remapped
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
> > >>> # id    weight  type name       up/down reweight
> > >>> -1      10.88   root default
> > >>> -2      5.44            host ceph1
> > >>> 0       2.72                    osd.0   up      1
> > >>> 1       2.72                    osd.1   up      1
> > >>> -3      5.44            host ceph2
> > >>> 2       2.72                    osd.2   up      1
> > >>> 3       2.72                    osd.3   up      1
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
> > >>> 0 rbd,
> > >>>
> > >>> cephadmin@ceph1:/etc/ceph$ cat ceph.conf
> > >>> [global]
> > >>> fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
> > >>> public_network = 192.168.30.0/24
> > >>> cluster_network = 10.1.1.0/24
> > >>> mon_initial_members = ceph1, ceph2
> > >>> mon_host = 192.168.30.21,192.168.30.22
> > >>> auth_cluster_required = cephx
> > >>> auth_service_required = cephx
> > >>> auth_client_required = cephx
> > >>> filestore_xattr_use_omap = true
> > >>>
> > >>> Thank you
> > >>> Jiri
> > >>
> > >
> >
>
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
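[As background to the "host is the failure domain" point made in this
thread: the default replicated CRUSH rule of that era looked roughly like
the sketch below. The exact rule name and the min_size/max_size values
are assumptions and can differ between releases:

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # place each replica on a distinct host
            step chooseleaf firstn 0 type host
            step emit
    }

The "step chooseleaf firstn 0 type host" line is what maps each replica
to a different host, so with a pool size of 3 and only two hosts the
third replica can never be placed, which is exactly why the PGs in the
output above stay undersized and degraded, and why setting the pool size
to 2 or adding a third host resolves it.]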