==================================================================
#fedora-classroom: Infrastructure Private Cloud Class (2014-03-25)
==================================================================

Meeting started by nirik at 18:00:03 UTC. The full logs are available at
http://meetbot.fedoraproject.org/fedora-classroom/2014-03-25/infrastructure-private-cloud-class.2014-03-25-18.00.log.html

Meeting summary
---------------
* intro (nirik, 18:00:03)

* History/current setup (nirik, 18:02:51)
  * LINK: https://fed-cloud02.cloud.fedoraproject.org/dashboard/
    (nirik, 18:10:48)
  * LINK: http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/README
    has a bunch of cloud-specific info. (nirik, 18:11:35)
  * IDEA: add openid support to openstack's horizon dashboard
    (threebean, 18:13:19)
  * LINK: https://wiki.openstack.org/wiki/Nova_openid_service
    (danofsatx-work, 18:17:06)

* Upcoming plans / TODO (nirik, 18:26:33)
  * LINK: https://en.wikipedia.org/wiki/OpenStack#Components
    (threebean, 18:27:00)

* Open Questions (nirik, 18:40:57)
  * LINK: https://fedoraproject.org/wiki/Infrastructure_private_cloud
    (nirik, 18:41:36)

Meeting ended at 19:02:55 UTC.

People Present (lines said)
---------------------------
* nirik (143)
* mirek-hm (29)
* danofsatx-work (21)
* threebean (19)
* tflink (16)
* smooge (5)
* zodbot (3)
* abadger1999 (3)
* webpigeon (3)
* jsmith (3)
* jamielinux (1)
* blob (1)
* relrod (1)
* janeznemanic (1)

18:00:03 <nirik> #startmeeting Infrastructure Private Cloud Class (2014-03-25)
18:00:03 <zodbot> Meeting started Tue Mar 25 18:00:03 2014 UTC. The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:00:03 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
18:00:03 <nirik> #meetingname infrastructure-private-cloud-class
18:00:03 <nirik> #topic intro
18:00:03 <zodbot> The meeting name has been set to 'infrastructure-private-cloud-class'
18:00:22 <nirik> hey everyone. who is around for a bit of talking about fedora's infrastructure private cloud?
18:00:26 * threebean is here
18:00:32 <smooge> hello
18:00:38 <danofsatx-work> aqui! aqui!
18:00:44 * mirek-hm is here
18:00:45 * tflink is here
18:00:53 * blob is here
18:00:57 * relrod here
18:01:01 * webpigeon is here
18:01:11 * jamielinux is here out of curiosity
18:01:53 * abadger1999 is here
18:01:57 <janeznemanic> hi
18:02:24 <nirik> cool. ;) ok, I thought I would give a bit of background/history first, then a bit about ansible integration, then talk about plans...
18:02:51 <nirik> #topic History/current setup
18:02:58 <threebean> that sounds good. can we interrupt you with questions? or should we hold them for later?
18:03:10 <nirik> threebean: please do... questions are good. ;)
18:03:16 <threebean> cool, cool.
18:03:20 * jsmith is late
18:03:23 <nirik> so, our current setup is an openstack folsom cloud
18:03:31 <nirik> It was mostly manually installed by me.
18:03:38 <nirik> Ie, I installed rpms, ran setup, etc...
18:03:55 <nirik> there are currently 7 nodes in use
18:04:07 <jsmith> Nodes = servers?
18:04:08 <nirik> 1 (fed-cloud02.cloud) is the 'head node'
18:04:15 <nirik> 6 are compute nodes
18:04:20 <nirik> jsmith: yeah, physical boxes.
18:04:37 <jsmith> Perfect, just wanted to make sure I was getting the nomenclature right
18:04:53 <nirik> The compute nodes only run openstack-nova-compute, and one of them also runs cinder (the storage service; will get to that in a few)
18:05:04 <nirik> the head node runs all the other stuff.
18:05:18 <nirik> and acts as the gateway to all the other things.
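
A quick way to see the head-node/compute-node split described above is to list the registered nova services from the head node. A minimal sketch for a Folsom-era install, assuming root access on fed-cloud02:

    # List every registered nova service and which host it runs on:
    # compute nodes should only show nova-compute, while the head node
    # also carries nova-network, nova-scheduler, and the other control services.
    nova-manage service list

    # Summarize the non-nova pieces (keystone, glance, cinder, swift,
    # horizon) running on this box:
    openstack-status
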
18:05:34 <nirik> it runs network, the db (mysqld), amqp, etc, etc.
18:05:48 <nirik> It also runs cinder for storage.
18:05:50 <mirek-hm> head node is fed-cloud02.cloud.fedoraproject.org BTW
18:06:29 <nirik> when you fire off an instance openstack looks around and schedules it on a compute node.
18:06:34 <tflink> outside of cinder, is there any shared storage?
18:07:07 <nirik> when you ask for a persistent volume, it allocates it from one of the cinder servers... it makes an lv in a pool and shares it via iscsi
18:07:14 <nirik> tflink: nope. Not currently.
18:07:45 <mirek-hm> aha, those are the VG cinder-volumes in lvs output
18:07:46 <nirik> all storage is either cinder from 02 or cinder from 08... or each compute node has a small amount of storage it uses to cache images locally.
18:07:56 <nirik> mirek-hm: yeah.
18:08:04 <tflink> is that a requirement of newer openstack? when I set up my dev/standalone system, I got the impression that something like gluster was highly recommended for more than 1 node
18:08:05 <nirik> lets see... what else.
18:08:24 <nirik> tflink: we tried gluster, but it was really really really really slow at the time.
18:08:42 <nirik> oh, networking:
18:08:44 <danofsatx-work> no swift for image service?
18:09:01 <mirek-hm> nirik: you forgot one exception: Copr-be has one storage volume allocated on one compute node (800 GB)
18:09:02 <nirik> we have a number of tenants (projects). Each one gets its own vlan.
18:09:18 <nirik> mirek-hm: thats from the 08 compute node via cinder there. ;)
18:09:44 <nirik> we have a /24 network for external ips. Each instance gets one by default and can also allocate a static one.
18:10:08 <nirik> danofsatx-work: yeah, we do have swift.
18:10:17 <nirik> it uses storage on fed-cloud02.
18:10:34 <nirik> There's a dashboard available for web (horizon):
18:10:48 <nirik> https://fed-cloud02.cloud.fedoraproject.org/dashboard/
18:11:04 <nirik> The main way we interact tho is via ansible on lockbox01.
18:11:35 <nirik> http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/README has a bunch of cloud-specific info.
18:11:45 <nirik> basically we use the ansible ec2 module to spin things up, etc.
18:11:48 <webpigeon> does the web interface work for apprentices?
18:12:04 <danofsatx-work> fi-apprentice group can't login to dashboard :(
18:12:10 <nirik> webpigeon: it doesn't. It also doesn't interact with our other authentication at all. ;( it's all manual; people have to be added by hand.
18:12:10 <mirek-hm> webpigeon: no
18:12:25 <nirik> I keep hoping someday they will add openid support.
18:12:32 * webpigeon wonders if he can get access to have a look :)
18:12:47 <danofsatx-work> I thought I saw that mentioned in the Havana release notes.....
18:12:49 <nirik> sure, we could probably do that. :)
18:12:53 <mirek-hm> you either have to have an account there, or you can log in as root on fed-cloud02; the admin password is stored in the keystonerc file
18:12:55 <nirik> danofsatx-work: oh? excellent.
18:13:19 <threebean> #idea add openid support to openstack's horizon dashboard
18:13:20 <nirik> ok, lets see what else on current setup... oh, we have some playbooks that make transient instances...
18:13:39 <nirik> so it can spin up one and then ansible configures it all for testing something.
18:13:52 <nirik> we also have persistent instances where it's for a specific use.
18:14:31 <nirik> we picked the ec2 interface over the nova one because we wanted to be portable to euca or amazon... we could possibly revisit that, but it seems to do most of what we need.
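
The actual playbooks aren't pasted in the meeting, but because everything goes through the cloud's EC2-compatible endpoint, the same kind of spin-up can be sketched with plain euca2ools. A hypothetical example; the endpoint URL, credentials, image id, instance id, and keypair name below are placeholders, not the real infrastructure values:

    # Point euca2ools at the cloud's EC2 endpoint (placeholder values).
    export EC2_URL=https://fed-cloud02.cloud.fedoraproject.org:8773/services/Cloud
    export EC2_ACCESS_KEY=myaccesskey
    export EC2_SECRET_KEY=mysecretkey

    # Boot a transient instance from an uploaded image, with a keypair
    # so ansible can ssh in and configure it afterwards.
    euca-run-instances ami-00000001 -k my-keypair -t m1.small

    # Check on it, and clean up when the test is done.
    euca-describe-instances
    euca-terminate-instances i-00000001
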
18:15:06 <nirik> oh, we also hacked an nginx proxy onto fed-cloud02 to proxy https for us. By default folsom had its ec2 interface as plain http
18:15:42 <nirik> also, one compute node has been down with a bad disk, but should be ready to re-add now. I will likely do that in the next few days.
18:16:09 <nirik> So, any questions on the current setup? or shall I talk plans and then open up to more questions?
18:16:31 <threebean> so, last week the cloud blew up because... one of the compute nodes got the ip of the router?
18:16:35 <threebean> where did you look to figure that out?
18:16:38 <mirek-hm> if you are logged in to fed-cloud02 and want to check the status of services you may run openstack-status, which should give output like this: http://fpaste.org/88542/57713621/
18:17:06 <nirik> threebean: yes. I looked on fed-cloud02 with nova-manage at the ips... or perhaps I saw the .254 one in the dashboard first
18:17:06 <danofsatx-work> https://wiki.openstack.org/wiki/Nova_openid_service
18:18:03 <mirek-hm> openstack services are (re)started by the /etc/init.d/openstack-* init scripts
18:18:06 <nirik> nice!
18:18:13 <nirik> danofsatx-work: thanks for the link/info.
18:18:36 <danofsatx-work> np ;) - I'm teaching an OpenStack class at uni right now
18:18:45 <threebean> cool. what group(s) of people are allowed to login to fed-cloud02 (and the others)? its all locally managed?
18:19:14 <nirik> threebean: yes, it's all local. sysadmin-main should have their keys there for root (there aren't any local users I don't think). Also a few other folks like mirek-hm
18:19:36 <nirik> the cloud network is completely isolated from our other stuff and rh's internal networks.
18:19:50 <nirik> so if you need to talk to it you have to go via the external ips...
18:20:21 <mirek-hm> nirik: btw I will remove seth's ssh key from authorized_keys
18:20:27 <nirik> mirek-hm: ok. ;(
18:20:44 <tflink> does that have anything to do with the network goofiness where instances can't always talk to each other on non-public IPs?
18:20:44 <nirik> on to a bit of plans?
18:20:50 <threebean> one more Q
18:20:53 <nirik> tflink: nope, thats an openstack issue. ;)
18:20:56 <threebean> how close to full capacity are we on the compute nodes?
18:20:58 <tflink> ok, wasn't sure
18:21:07 <nirik> it's due to the way that they set up firewalls. ;(
18:21:26 <nirik> instances on different compute nodes can talk to each other fine, but if they happen to be on the same one they cannot.
18:21:34 <nirik> I don't know if thats fixed in newer openstack.
18:21:49 <nirik> threebean: hard to say. ;) There's not any easy "show me my capacity"
18:22:03 <threebean> heh, ok.
18:22:11 <nirik> each compute node reports on itself...
18:22:20 <threebean> perhaps we need a scripts/cloud-info script in our ansible repo.
18:22:20 <nirik> so fed-cloud02 (which is also a compute node) says:
18:22:27 <nirik> 2014-03-25 18:21:44 30752 AUDIT nova.compute.resource_tracker [-] Free ram (MB): 10732
18:22:27 <nirik> 2014-03-25 18:21:44 30752 AUDIT nova.compute.resource_tracker [-] Free disk (GB): 951
18:22:28 <nirik> 2014-03-25 18:21:44 30752 AUDIT nova.compute.resource_tracker [-] Free VCPUS: -4
18:22:57 <mirek-hm> also, routing in the fedora cloud is a little bit tricky, as external IPs do not work from inside the cloud. Therefore if you communicate from one cloud machine to another cloud machine you must use the internal IP.
18:23:10 * nirik nods.
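
threebean's scripts/cloud-info idea could start out very small, since each compute node already logs its free resources. A rough sketch, assuming a hypothetical node list and root ssh access; it only pulls the resource_tracker AUDIT lines like those pasted above:

    #!/bin/bash
    # cloud-info: crude per-node capacity report built from the
    # nova.compute.resource_tracker AUDIT lines in compute.log.
    for node in fed-cloud02 fed-cloud03 fed-cloud04; do    # hypothetical list
        echo "== ${node} =="
        ssh "root@${node}.cloud.fedoraproject.org" \
            "grep resource_tracker /var/log/nova/compute.log | tail -n 3"
    done
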
18:23:26 <tflink> mirek-hm: that's not always
18:23:27 <tflink> true
18:23:32 <nirik> also, you automatically get an external ip with each instance, but it's dynamically assigned.
18:23:49 <tflink> but I've had trouble figuring out a way to always let instances talk with each other
18:23:49 <nirik> so if you want a static/known external one you have to assign it, and then you have 2 of them. ;(
18:24:16 <nirik> and you cannot give up the dynamic one.
18:24:33 <mirek-hm> nirik: what command did you use to get that resource information?
18:25:00 <nirik> threebean: there is a newer project called ceilometer or something that is supposed to give you usage info, but thats not available in the old version we have.
18:25:10 <tflink> a way to consistently let instances talk to each other over the network, rather. it's always been a one-off "does this work with instances X and Y"
18:25:10 <nirik> mirek-hm: tail /var/log/nova/compute.log
18:25:24 <danofsatx-work> I thought that was heat, not ceilometer.....
18:25:28 * danofsatx-work checks real quick
18:25:35 <mirek-hm> it is ceilometer
18:25:48 <mirek-hm> heat is for configuration
18:25:48 <tflink> heat is coordination and setup of instances, no?
18:26:24 <danofsatx-work> ok, yeah.... It's the Telemetry service.
18:26:33 <nirik> #topic Upcoming plans / TODO
18:26:48 <danofsatx-work> openstack isn't real good about actually giving the name of each service, you have to dig for it.
18:27:00 <threebean> https://en.wikipedia.org/wiki/OpenStack#Components
18:27:16 <nirik> so, we have a few new machines (3) coming in soon. My plan is to try and set those up as a separate cloud, ideally via ansible
18:27:32 <nirik> so we can easily rebuild it from the ground up, unlike the current one.
18:27:42 * danofsatx-work likes that plan
18:27:53 <nirik> also, if we can do openid that would be lovely.
18:27:54 <danofsatx-work> makes for a cleaner, easier transition
18:28:16 <nirik> once its all working, we could then migrate things over until all is moved
18:28:33 <mirek-hm> yes, because if somebody or something breaks the current cloud, we have to manually rebuild it, which would take ages.
18:28:34 <nirik> (and move compute nodes as needed)
18:28:58 <nirik> sadly, I haven't been able to find a good ansible recipe for setting up a cloud, but I can look some more.
18:29:18 <nirik> it would be nice to just have it in our regular ansible repo with everything else.
18:29:18 <mirek-hm> can we use TripleO for installation?
18:29:33 <nirik> mirek-hm: not sure. I have heard of that, but wasn't clear what it did.
18:29:40 <mirek-hm> I will ask colleagues if that can be scripted
18:30:11 <mirek-hm> nirik: it is installation of OpenStack using OpenStack which uses OpenStack :)
18:30:19 <nirik> cool. That would be nice. The way I read it is that it needs a cloud installed already?
18:30:34 <danofsatx-work> what about RH's own product, packstack?
18:30:38 <mirek-hm> you first start with a disk image, which you boot; then it installs the undercloud and then it installs the normal cloud
18:30:59 <nirik> triple-o also looks a bit experimental.
18:31:03 <nirik> danofsatx-work: thats another option yeah.
18:31:15 <nirik> it's puppet/chef, but if we have to...
18:31:47 <mirek-hm> TripleO at devconf: http://www.youtube.com/watch?v=Qcpe2gjdRz0&list=PLjT7F8YwQhr928YsRxmOs8hUyX_KG-S0T
18:32:51 <nirik> Also on upcoming plans... in Q4 we are supposed to get some dedicated storage for the cloud...
18:33:23 <threebean> for the new 3-node cloud.. do you expect we would move to a more modern openstack release?
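
If the packstack route were chosen instead of (or alongside) hand-written ansible, the usual flow is to generate an answer file, keep it in version control, and replay it to rebuild the cloud from scratch. A sketch with a hypothetical file name:

    # Generate a template answer file describing the whole deployment.
    packstack --gen-answer-file=fed-cloud-answers.txt

    # Edit the file (node IPs, passwords, which services go where),
    # commit it somewhere safe, then apply it to build or rebuild the cloud:
    packstack --answer-file=fed-cloud-answers.txt
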
18:33:31 <mirek-hm> nirik: how big?
18:33:40 <threebean> looks like icehouse is supposed to be released in ~2 weeks.
18:33:52 <nirik> threebean: definitely.
18:34:11 <nirik> mirek-hm: unknown yet, as much as we can get for the money. ;) Ideally something thats 2 units in an HA setup...
18:35:15 <nirik> also in Q3 we are getting some more compute nodes.
18:36:00 <abadger1999> nirik: What are our plans for what we're going to use the cloud for? (like -- are we going to move more and more services over to it?)
18:36:23 <nirik> abadger1999: currently, I think it's a good fit for some things and not for others.
18:36:31 <nirik> devel and test instances -> definitely
18:36:34 <nirik> coprs -> yep
18:37:04 <nirik> our main applications -> not sure... I like the control we have currently with them
18:37:10 <abadger1999> <nod>
18:37:13 <mirek-hm> nirik: extrapolated data says that copr will run out of disk space around Christmas, so Q3 is really the last date
18:37:32 <nirik> mirek-hm: well, can you split into 2 volumes?
18:37:49 <nirik> all the other compute nodes have similar space... so we can fire up cinder on them and add more 800GB volumes.
18:38:05 <mirek-hm> nirik: probably, if I can concatenate them using lvm
18:38:32 <danofsatx-work> is the old cloud hardware being rolled into the new instance, or decommissioned?
18:38:33 <nirik> not sure if that will work, but we can see...
18:38:40 <nirik> danofsatx-work: rolled into the new one.
18:38:51 <nirik> it's got a few more years on its warranty
18:40:11 <nirik> mirek-hm: we could also try and move the storage to Q3 and move the new compute nodes to Q4. that might be best
18:40:13 <nirik> (if we can do it)
18:40:57 <nirik> #topic Open Questions
18:41:22 <nirik> So, we have a wiki page for our cloud, but it's pretty out of date. We should likely revamp it
18:41:36 <nirik> https://fedoraproject.org/wiki/Infrastructure_private_cloud
18:41:47 <nirik> That has some of the older long-term plans, etc.
18:42:10 <mirek-hm> will we add some arm64 machines to the cloud as they arrive?
18:42:51 <nirik> I'd like to add one there, yeah...
18:42:52 <tflink> I assume that the plan going forward is to retain network isolation from the rest of infra?
18:42:57 <nirik> tflink: yep.
18:43:21 <nirik> that has some downsides...
18:43:39 <nirik> but overall I think it's very good given the carefree nature of instances.
18:43:52 <tflink> yeah, I think it's a positive thing overall
18:43:53 <nirik> A few other thoughts to toss out for comment:
18:44:07 <tflink> just not for all my possible use cases :)
18:44:22 <nirik> 1. Should we continue to auto-assign an external ip on the new cloud? or require you to get one if you need it?
18:44:29 <nirik> it has advantages/disadvantages
18:44:55 <tflink> how many instances are using non-floating IPs right now?
18:45:04 <mirek-hm> can I reach the wild internet when I have an internal IP only (e.g. via NAT)?
18:45:35 <nirik> mirek-hm: not sure actually.
18:45:38 <danofsatx-work> EC2 only assigns (static) public ip's when asked for, and it's the only IP you get
18:45:53 <nirik> danofsatx-work: we have it set to autoassign one.
18:45:59 <nirik> (in the current cloud)
18:46:16 <nirik> auto_assign_floating_ip = True
18:46:25 <nirik> tflink: I'm not sure how to tell. ;(
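
For question 1, the behaviour being discussed is controlled by that one nova.conf flag; if it were turned off, getting an external address would become an explicit two-step request. A sketch using standard nova CLI commands, with a hypothetical instance name and address:

    # /etc/nova/nova.conf on the head node currently has:
    #   auto_assign_floating_ip = True
    # With it set to False, an instance stays on its internal IP until
    # someone allocates and attaches an external one by hand:
    nova floating-ip-create                            # grab an address from the /24 pool
    nova add-floating-ip my-instance 209.132.184.99    # hypothetical instance and IP
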
18:46:26 <mirek-hm> danofsatx-work: EC2 instances have both an internal and an external IP
18:47:04 <danofsatx-work> I understand, I'm trying to figure out how to translate my thoughts into coherent IRC language ;)
18:48:16 <danofsatx-work> ok, scratch my last.... it's not making sense anymore in my head :(
18:48:39 <nirik> there's 101 external ips being used right now.
18:49:20 <threebean> heh, how did you figure that out? :P
18:49:41 <nirik> and 34 instances seem to have 2 external ips.
18:50:07 <nirik> nova-manage floating list | grep 209 | awk '{print $3 " " $2 }' | grep -v None | wc -l
18:50:22 <threebean> great, thanks.
18:50:23 <nirik> ugly, but should be right.
18:50:24 <tflink> ah, not as many as I suspected
18:50:47 <nirik> so, I guess we should see if instances with no external ip can reach out at all. I'm not sure they can.
18:50:55 <nirik> but we can test that on the new cloud.
18:51:32 <danofsatx-work> unless there's a router set up to route the internal traffic out, no, they can't get out.
18:51:45 <nirik> question 2. We have been very lax about maintenance (mostly because this current cloud is so fragile). would folks be ok with a more proactive approach on the new cloud? ie, more frequent reboots/maint windows?
18:52:14 <nirik> danofsatx-work: all the instances would hit the head node, but I don't know if nova-network will nat them or not.
18:52:19 <danofsatx-work> I would be, but I come from a different background than the rest of y'all
18:52:55 <threebean> yeah, that would be fine. I worry that I'm responsible for wasted resources somewhere, some node I forgot about.
18:53:48 <threebean> erm, I misinterpreted the question.. for some reason I read it as being more proactive about cleanup.
18:53:48 <nirik> last time we rebooted stuff we were able to actually suspend instances... and mostly they came back ok
18:53:56 <nirik> that too. :)
18:54:16 <nirik> more reporting would be nice... like a weekly snapshot of 'here's all running instances'
18:54:24 <nirik> sometimes it's hard to tell what an instance was for tho
18:54:52 <mirek-hm> nirik: +1 to keep auto_assign_floating_ip = True; +1 to planned outages on the new Fedora Cloud
18:54:57 <threebean> oo, send some numbers to collectd?
18:55:53 <tflink> re: more proactive maintenance - I'd like to see coordination with standard backup times. ie - run shortly after backup runs so that any possible data loss would be minimized
18:55:57 <nirik> yeah, that might be nice. However, no vpn, so not sure it could talk to log02.
18:56:00 <smooge> well getting that data is hard because of the nat
18:56:12 <nirik> or it could run its own I guess.
18:56:24 <smooge> I would say we might want to look at having a system on that network which could do that for us
18:56:40 <smooge> the various things that we would like but can't because of separate networks
18:56:52 <nirik> actually, it could be a cloud instance even. ;)
18:57:20 <mirek-hm> March 2014: Active Instances: 66  Active RAM: 352GB  This Month's VCPU-Hours: 40433.81  This Month's GB-Hours: 2067464.96
18:57:29 <smooge> well it might be nice if the box was able to run when the cloud wasn't
18:57:42 <nirik> smooge: details, details.
18:57:49 <nirik> mirek-hm: thats only one tenant tho right?
18:58:23 <mirek-hm> nirik: that is from the dashboard overview of the admin user, so I would say that it counts everything
18:58:24 <nirik> tflink: agreed.
18:58:46 <nirik> mirek-hm: I think it's only the project it has active at the time... if you change that the numbers change right?
18:59:12 <mirek-hm> yes
18:59:24 <nirik> so we would need to sum all those.
19:00:58 <nirik> ok, any other questions or comments?
19:01:02 <nirik> or shall we wrap up?
19:01:13 * nirik is happy to answer anything outside the meeting, etc.
19:01:31 <danofsatx-work> any room for an apprentice on the cloud team?
19:02:03 <nirik> sure. always room for help... perhaps assistance setting up the new cloud?
19:02:31 <danofsatx-work> yeah, I can do that... I need to learn ansible, and I am building clouds at school and work anyhow ;)
19:02:39 <nirik> excellent. ;)
19:02:50 <nirik> ok, thanks for coming everyone... lets continue over in #fedora-admin...
19:02:55 <nirik> #endmeeting
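
On summing the per-project numbers discussed above: the dashboard overview is per-tenant, but the nova client can produce an all-tenants usage report when run with admin credentials. A sketch; the keystonerc path is an assumption, and the usage-list subcommand may require a newer client than the Folsom-era install ships with:

    # Source the admin credentials kept on fed-cloud02 (path assumed),
    # then ask for usage across every tenant in one go:
    source /root/keystonerc
    nova usage-list    # one row per tenant: servers, RAM-hours, CPU-hours, disk-hours
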