Minutes/Summary from Sign-vault01 outage Retrospective - 2011-03-22 at 20UTC

Kevin Fenzi <kevin@xxxxxxxxx> · Tue, 22 Mar 2011 15:10:58 -0600

=================================================================
#fedora-meeting: Infrastructure outage retrospective (2011-03-22)
=================================================================

Meeting started by nirik at 20:00:03 UTC. The full logs are available at
http://meetbot.fedoraproject.org/fedora-meeting/2011-03-22/infrastructure-retrospective.2011-03-22-20.00.log.html

Meeting summary
---------------
* Timeline/Recap  (nirik, 20:03:42)

* setup of sensitive boxes  (nirik, 20:13:27)
  * AGREED: backup nssdb to a non phx2 host. (may need another layer of
    encryption or not)  (nirik, 20:34:47)
  * AGREED: will check with mitr on sensitivity and what needs to be
    backed up.  (nirik, 20:35:02)

* updates for sensitive boxes  (nirik, 20:44:26)

* Misc items  (nirik, 21:03:38)

Meeting ended at 21:10:10 UTC.
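
A minimal sketch of the agreed backup step, assuming the nss db files sit
in a single directory and that a second gpg layer is wanted (the meeting
left that open pending mitr's input). The function names, the state file,
and the directory layout are hypothetical; none of this is sigul's own
tooling. It only detects whether the files changed since the last backup
and builds the gpg command for a symmetrically encrypted copy. Copying the
result to the non-phx2 host stays a manual step in the "add a new key" SOP.

```python
import hashlib
import json
from pathlib import Path


def digest_tree(db_dir):
    """Return {relative path: SHA-256 hex digest} for every file under db_dir."""
    root = Path(db_dir)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def backup_needed(db_dir, state_file):
    """True if the files differ from the digests recorded at the last backup."""
    state = Path(state_file)
    previous = json.loads(state.read_text()) if state.exists() else None
    return digest_tree(db_dir) != previous


def record_backup(db_dir, state_file):
    """Store the current digests once a backup has been made and checked."""
    Path(state_file).write_text(json.dumps(digest_tree(db_dir), sort_keys=True))


def gpg_encrypt_command(archive, encrypted):
    """Build (not run) a gpg command that adds a symmetric AES256 layer.

    gpg prompts for the passphrase itself, so it never appears in argv.
    """
    return ["gpg", "--symmetric", "--cipher-algo", "AES256",
            "--output", str(encrypted), str(archive)]
```

The state file has to live outside the db directory, otherwise it would be
hashed along with the databases and every run would report a change.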

Action Items
------------

Action Items, by person
-----------------------
* **UNASSIGNED**
  * (none)

People Present (lines said)
---------------------------
* nirik (114)
* skvidal (95)
* Oxf13 (82)
* smooge (41)
* dgilmore (26)
* abadger1999 (10)
* zodbot (6)
* jsmith (4)
* seanjon (3)
--
20:00:03 <nirik> #startmeeting Infrastructure outage retrospective (2011-03-22)
20:00:03 <zodbot> Meeting started Tue Mar 22 20:00:03 2011 UTC.  The chair is nirik. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:03 <zodbot> Useful Commands: #action #agreed #halp #info #idea #link #topic.
20:00:04 <nirik> #meetingname infrastructure-retrospective
20:00:04 <zodbot> The meeting name has been set to 'infrastructure-retrospective'
20:00:27 <nirik> Greeting everyone. We are going to be having a retrospective/lessons learned/brainstorming session here.
20:00:40 <jsmith> w00t!
20:00:57 <nirik> There was a hardware failure friday on our signing server, and we want to figure out better ways to mitigate any risks from such.
20:01:03 <nirik> who all is around? :)
20:01:09 * dgilmore is
20:01:19 <abadger1999> Hola
20:01:31 * skvidal is here
20:01:41 <nirik> Oxf13: you around?
20:01:53 * Oxf13 
20:02:04 * jsmith is here
20:02:16 <Oxf13> funny enough, I wanted to move the meeting to avoid lunch, and I haven't grabbed lunch yet :)
20:02:24 <nirik> smooge: you around?
20:02:51 <jsmith> Oxf13: That's OK... it's 4:00pm here and I still haven't had (a proper) lunch
20:03:24 * nirik had some crackers and cheese. Lunch of champions. ;)
20:03:31 <nirik> anyhow, I guess lets get started?
20:03:42 <nirik> #topic Timeline/Recap
20:03:53 <smooge> here
20:03:59 <nirik> hey smooge.
20:04:24 <nirik> So, we have a small number of sensitive servers that don't follow our normal updates policy.
20:04:52 <nirik> last friday we determined it might be a good idea to update them and reboot them into new kernels, etc.
20:05:16 <smooge> yes.
20:05:17 <nirik> One of those is the sign-vault01 server. We applied updates to it and rebooted.
20:05:28 <nirik> it had a hardware failure and didn't come back up.
20:05:36 <dgilmore> eventually the plan for the sign-vault is to have it off of the network entirely
20:05:50 <nirik> we then took 2 approaches:
20:06:00 <nirik> a) move drives to spare hardware and bring it back up.
20:06:07 <nirik> b) get a new instance setup as a virtual.
20:06:17 <nirik> a managed to complete first and we were back up.
20:06:32 <nirik> (I would like to thank everyone who worked on getting it back up and working)
20:06:53 <nirik> So, I think thats the recap/timeline... anyone have anything I missed in there?
20:07:01 * dgilmore thinks b is bad
20:07:21 <nirik> it could be. ;)
20:07:33 <Oxf13> I have something
20:07:33 <nirik> #topic setup of sensitive boxes
20:07:36 <smooge> well at the time there wasn't any spare hardware
20:07:37 <nirik> #undo
20:07:37 <zodbot> Removing item from minutes: <MeetBot.items.Topic object at 0x34e5ddd0>
20:07:49 <dgilmore> Oxf13: shoot
20:07:50 <Oxf13> We did actually get B mostly up.  A virt host was created, and I got a copy of the sensitive data onto it
20:07:59 <Oxf13> the data isn't in a production directory though.
20:07:59 <smooge> I would like to thank RHIT for finding the spare hardware that they got out of somewhere for us
20:08:18 <nirik> smooge: good point. RHIT found us spare HW for that.
20:08:29 <nirik> Oxf13: ok.
20:08:51 <smooge> they also dropped another problem and moved that hardware for us.
20:09:15 <nirik> Many Kudos for their assistance.
20:09:39 <nirik> anyone have anything more on background?
20:09:43 <smooge> yes
20:09:52 <nirik> smooge: go ahead
20:10:02 <smooge> or is it a lesson learned as the data was not backed up anywhere
20:10:26 <nirik> yeah... backups is on the agenda. ;)
20:10:35 <Oxf13> it is good background data
20:10:43 <Oxf13> at least we never got a yes/no on whether or not the backup ever completed
20:10:50 <smooge> there is no backup on it
20:11:01 <nirik> :( ok, good to know...
20:11:19 <smooge> it was not put into backups because people were worried about mixing of sensitive data with regular backups
20:11:41 <nirik> ok, shall we move on and discuss backups?
20:11:46 <skvidal> yah
20:11:53 <nirik> #topic Backups of sensitive boxes
20:11:54 <skvidal> and the previous topic tooo
20:12:03 <skvidal> 'setup'
20:12:34 <abadger1999> <nod>
20:12:47 <nirik> why don't we save setup for later and hash out some of the things that might be easier/simpler? or do we want to look at setup first to know the answer to others?
20:13:10 <Oxf13> maybe I could talk about the original plan for these boxes?
20:13:19 <nirik> ok.
20:13:20 <nirik> #undo
20:13:20 <zodbot> Removing item from minutes: <MeetBot.items.Topic object at 0x28a7d8d0>
20:13:24 <skvidal> I think we need to look at the original assumptions
20:13:27 <nirik> #topic setup of sensitive boxes
20:13:40 <skvidal> which are that these box are being treated in a way which ensures no one pays attention to them
20:13:40 <nirik> I'd like to note: https://fedoraproject.org/wiki/User:Mitr
20:13:51 <Oxf13> My original plan was that bridge would be connected to the network, and allow ssh and sigul connections
20:13:52 <skvidal> I'd like to note that is not in an obvious place :)
20:13:53 <nirik> has background info on the signing servers.
20:14:00 <Oxf13> I wanted to limit the attack vectors though
20:14:05 <dgilmore> originally the box was intended to be connected to the bridge via a crossover cable
20:14:05 <nirik> skvidal: agreed.
20:14:08 <Oxf13> so I wanted puppet off, and backups off.
20:14:12 <dgilmore> and have no network connection at all
20:14:23 <Oxf13> the vault, which has the sensitive data, was to be only crossover connected to bridge
20:14:30 <skvidal> which never happened, right?
20:14:37 <Oxf13> any admin work was going to require a serial connection
20:14:51 <nirik> "Non-network connection (USB/serial) does not provide enough infrastructure (packeting, checksumming, retransmissions, debugging tools) that I'd rather not reimplement now; this can be replaced later if necessary."
20:14:56 <Oxf13> and maybe even a on-site call to hook up the serial as to not have it sitting there waiting for somebody to bang on the serial port
20:15:20 <Oxf13> skvidal: that never happened.  The server remained connected to the network and allowed ssh in
20:15:35 <Oxf13> We had some stability issues that required frequent restarts of the vault and bridge processes
20:15:52 <Oxf13> and we were not comfortable severely limiting our access to the machine.
20:16:07 * nirik nods.
20:16:14 <Oxf13> also, these were hand installed and setup systems that got turned into production systems
20:16:17 <Oxf13> which was my fault.
20:16:50 <Oxf13> I believe we were under time pressure to use it for whatever Fedora release needed to be signed
20:17:07 <Oxf13> and did not take the time to use puppet to rebuild the boxes in an automated way
20:17:12 <smooge> 13 beta I think?
20:17:30 <Oxf13> As for backups, the plan was mostly hand-wavy
20:17:56 <Oxf13> "we" felt that too many people had access to the backup storage and could grab the nss dbs and brute force them at their leisure
20:18:04 <Oxf13> so we did not hook it into the backup system
20:18:13 <skvidal> who is 'we'?
20:18:15 <Oxf13> but we also failed to create and follow an alternative backup plan
20:18:34 <Oxf13> skvidal: Mostly me, and I believe mmcgrath and notting were in the conversation
20:18:44 <Oxf13> our goal was to limit the number of ways the nss dbs could be accessed
20:18:55 <Oxf13> number of ways and number of people.
20:19:25 <nirik> the nss dbs are still passphrase protected (but of course could be pounded on with brute force given access to them)
20:20:11 <nirik> so, at the very least we need some backup plan.
20:21:00 <nirik> do we wish to pursue the orig plan for serial access only, etc? or something else?
20:21:02 <Oxf13> nirik: yeah, and with Amazon cloud, the amount of time / $$ it takes to brute force got significantly smaller IIRC
20:21:02 <dgilmore> the backups should only need updating when we add a new key
20:21:30 <Oxf13> dgilmore: I'd like to verify with mitr where data about what user has access to which key and the passphrases for the users are stored.
20:21:49 <Oxf13> dgilmore: while we can re-add users and such, it'd be a hassle, so that data should be backed up on change too
20:21:50 * dgilmore would nearly be ok with someone putting an encrypted usb key in and backing up to that
20:22:01 <nirik> the backup01 box is pretty limited / also a sensitive box. Perhaps we could store a backup there?
20:22:12 <dgilmore> Oxf13: true but that doesnt change often
20:22:16 <skvidal> just to be clear
20:22:26 <skvidal> the only confidential data are the keys, correct?
20:22:43 <Oxf13> skvidal: user passphrases should be treated as confidential too IMHO
20:23:01 <Oxf13> each user has their own unique passphrase to access the system, along with an ssl cert
20:23:03 <dgilmore> skvidal: there is the keys and users passwd to access the keys
20:23:09 <Oxf13> (well the ssl cert being the FAS cert)
20:23:12 <skvidal> but we don't need to back up those
20:23:16 <skvidal> they don't matter
20:23:20 <skvidal> the keys for signing matter
20:23:23 <Oxf13> skvidal: no, they can be re-created.
20:23:29 <skvidal> the ssl certs and passphrases can be nuked from orbit
20:23:29 <skvidal> right
20:23:31 <skvidal> so for BACKUPS
20:23:32 <Oxf13> the keys are the critical thing we can't lose.
20:23:36 <skvidal> rightr
20:23:36 <skvidal> so
20:23:44 <skvidal> why don't we just back those up to some place
20:23:47 <skvidal> double-encrypt
20:23:49 <skvidal> treble, if you'd like
20:23:53 <skvidal> it doesn't really matter
20:24:03 <skvidal> we don't need to automatically restore this info
20:24:17 <jsmith> Encrypt on a USB key, tape the USB key to the back of the server
20:24:19 <skvidal> but EVERYTHING else on the box needs to be automatically re-provisionable
20:24:22 <skvidal> jsmith: nah
20:24:28 <skvidal> jsmith: requires someone on sight
20:24:31 <skvidal> onsite, even
20:24:37 <skvidal> my point is just this
20:24:46 <skvidal> the only piece that we need to backup is trivially small
20:24:50 <skvidal> and hell if we PRINT THEM OUT
20:24:53 <skvidal> we can get away with it
20:25:04 <nirik> yes, it's a small amount of data.
20:25:05 <Oxf13> right, it is trivially small
20:25:10 <smooge> personally a gpg encrypted file of the backup files should probably be enough.. make the passphrase 64+ characters and it will still take an IPv6 full amount of computers to crack it.
20:25:15 <Oxf13> the question is how paranoid do we want to be.
20:25:31 <skvidal> well let's put it this way
20:25:32 <Oxf13> mitr may say that the encryption on the nss db is enough
20:25:35 <skvidal> none of the attacks we've faced
20:25:44 <Oxf13> and that we should be "safe" having it out there
20:25:44 <skvidal> have been as a result of someone brute-forcing anything
20:26:02 <skvidal> I've got no issue with keeping these outside of the normal backup routines
20:26:07 <skvidal> but they need to be backed up SOMEWHERE
20:26:10 <skvidal> and that location needs to be:
20:26:12 <skvidal> raptor proof
20:26:14 <skvidal> documented
20:26:16 <skvidal> duplicated
20:26:25 <nirik> right.
20:26:25 <skvidal> and afaict it is NONE of those right now
20:26:27 <skvidal> right?
20:26:34 <nirik> it's not in existence right now. ;)
20:26:39 <skvidal> exactly
20:26:49 <Oxf13> what about this
20:27:09 <Oxf13> what about a cron job on the vault that will check to see if the dbs have changed, and if so, scp them off to some other host
20:27:15 <skvidal> no
20:27:17 <Oxf13> that other host /is/ part of the backup process
20:27:22 <skvidal> the keys change 2 times a year, right?
20:27:39 <Oxf13> skvidal: and at odd times for EPEL
20:27:41 <dgilmore> skvidal: roughly
20:27:45 <nirik> we should be able to add a 'backup the nss dbs' to 'add a new key SOP'
20:27:49 <skvidal> nirik: +1
20:27:51 <skvidal> yes
20:27:53 <skvidal> and moreover
20:27:54 <dgilmore> epel gets a new key with new rhel
20:27:55 <Oxf13> nirik: ok, fine by me.
20:28:01 <skvidal> if we do not have a releng person who can do this
20:28:03 <skvidal> then we're SCREWED
20:28:33 <nirik> so, proposal: gpg encrypt the needed files for another layer, and back them up on backup01 and/or another non phx2 host?
20:29:00 <skvidal> nirik: back them up to some place inside RHT if we want several more layers of obfuscation
20:29:08 <dgilmore> i would say on a non phx2 box
20:29:15 <dgilmore> maybe our dr box
20:29:17 <Oxf13> I vote non phx2
20:29:42 * nirik is fine with that... passphrase in the private puppet repo?
20:29:52 <Oxf13> I kinda like the idea of somewhere within RHT, but maybe that's too complicated or too politically sensitive.
20:29:52 <skvidal> nirik: no
20:30:12 <smooge> dgilmore, ok 1 we currently do not have a dr box inside a place I would "trust" with them
20:30:12 <Oxf13> I'd like to get mitrs opinion on whether or not we need a second layer of encryption or not
20:30:20 <nirik> ok. but it does us no good if a raptor proof amount of people don't have it. ;)
20:30:27 <skvidal> nirik: sure you can
20:30:42 <skvidal> nirik: give the passphrase to mark cox
20:30:46 <nirik> Oxf13: ok.
20:30:46 <skvidal> or red hat infosec
20:31:07 <skvidal> b/c if rh decides to screw fedora in some way  the gpg keys will be the least of our problems :)
20:31:38 <nirik> sure...
20:31:54 <nirik> that problem goes away if we don't need to do another layer on them.
20:32:37 <nirik> so: Check with mitr on what level of paranoid we need for the nssdb. Either encrypt again or not, and backup to non phx2 host.
20:32:55 <nirik> anything else on backups? or is that good for those?
20:33:17 <nirik> the bridge also has a nssdb, should that get the same treatment?
20:33:22 <Oxf13> we need to also have a periodic test if we can deploy a new host via puppet and restore from backup
20:33:39 <Oxf13> nirik: I'm not sure what all is in the bridge db, and how sensitive it is.
20:33:56 <Oxf13> it may just have the mappings of users to rights and user passphrases.
20:34:00 * nirik gets ready to make with action items if no one objects.
20:34:01 <Oxf13> I don't think it needs to be backed up
20:34:11 <Oxf13> so a mitr question
20:34:19 <nirik> ok
20:34:26 <smooge> There is a problem with the setup assumptions.
20:34:47 <nirik> #agreed backup nssdb to a non phx2 host. (may need another layer of encryption or not)
20:34:51 <smooge> s/is a/are a couple/
20:35:02 <nirik> #agreed will check with mitr on sensitivity and what needs to be backed up.
20:35:06 <Oxf13> yeah, somewhat difficult to get the nss dbs off the box if it's off the network.
20:35:13 <Oxf13> well, actually
20:35:24 <nirik> it can't be totally off the net.
20:35:28 <nirik> but it could not allow anything in.
20:35:30 <Oxf13> no, still difficult, but not impossible.
20:35:50 <Oxf13> nirik: right, it at least has to have a network connection to bridge
20:35:51 <nirik> smooge: go ahead.
20:36:25 <smooge> 1) The sign-bridge is a virtual system. We can't do a crossover cable to it. So we need to get another hardware for it if we want that.
20:36:57 <nirik> so, we could say: new setup is no incoming connections allowed to the box. Access is via serial for adding new keys or restarting things or applying updates or doing a backup of nssdb.
20:37:36 <Oxf13> nirik: new key addition is done through the sigul client, no need to log into vault for it
20:37:37 <smooge> 2) New hardware is needed anyway as the box's current warranty is almost over.
20:37:56 <smooge> 3) The box also did not have a good warranty on it. It is next business day RMA only.
20:37:59 <nirik> smooge: well, the serial/crossover is not implemented, so I don't think we need to worry about it now.
20:38:20 <smooge> nirik, however the assumption has been that they could go to it anytime when it was implemented.
20:38:36 <nirik> yeah, not sure if it's on the roadmap or off.
20:38:37 <smooge> they can't.. without more resources.
20:39:19 <dgilmore> nirik: id still like to do it when we are confident that the vault will just run
20:39:21 <nirik> thats another mitr question I guess, but I don't think we can plan for it now without a roadmap.
20:39:39 <nirik> smooge: is there replacement hardware available for the box?
20:40:06 <smooge> I budgeted for one next quarter. But I didn't for second hardware :/
20:40:30 <smooge> so currently if the box has issues it can be 24-72 hours before it is fixed.
20:40:46 <smooge> Since it goes off contract in May I am not sure its worth fixing it.
20:41:28 <smooge> eg by the time I get it through the system...
20:41:34 <nirik> yeah.
20:42:47 <nirik> ok, so should we leave it as setup currently and revisit when new hardware arrives?
20:43:07 <nirik> or should we move to the "no incoming connections allowed to the box. Access is via serial for adding new keys or restarting things or applying updates or doing a backup of nssdb." plan
20:43:18 <smooge> well I think we don't have much choice beyond setting up some sort of gpg encrypt backup of the databases.
20:43:21 <nirik> or does anyone have another plan to toss out there. ;)
20:43:24 <skvidal> I think I am in favor of leaving it as is
20:43:27 <Oxf13> I'd vote for leaving it as is, except for adding the backup SOP
20:43:30 <skvidal> and NOT making those changes when hw arrives
20:43:37 <skvidal> meaning - this is how far it goes
20:43:40 <skvidal> and no further
20:43:50 <skvidal> and we all just take the pills which keep us from being this paranoid
20:44:09 <nirik> ok, that goes to the next topic...
20:44:16 <nirik> #topic backups for sensitive boxes
20:44:16 <Oxf13> heh
20:44:21 <nirik> #undo
20:44:21 <zodbot> Removing item from minutes: <MeetBot.items.Topic object at 0x2ac21e10>
20:44:26 <nirik> #topic updates for sensitive boxes
20:44:29 <nirik> updates. ;)
20:44:41 <Oxf13> skvidal: well, I plan on washing my hands of the whole thing in about 9 months so....
20:44:42 <nirik> I think it's a bad idea to have these out of our regular backup cycle.
20:44:56 * nirik sighs
20:44:59 <skvidal> Oxf13: it's like the hotel california
20:45:01 <nirik> s /backup/updates/
20:45:03 <skvidal> Oxf13: :)
20:46:06 <nirik> so, proposal: we apply updates to sensitive boxes at the same time as others (taking into account freezes, etc).
20:46:13 <smooge> nirik "have these out of our".. what is these?
20:46:21 <skvidal> nirik: which means we enable funcd on those boxes?
20:46:52 <nirik> sign-bridge01, sign-vault01, backup1 (I think thats all... are there others that don't run func or the like)
20:46:58 <Oxf13> ugh, paranoia setting in again.
20:47:24 <smooge> backup02 sort of fits into that.
20:47:48 <smooge> though func runs on it.. and puppet so never mind
20:48:12 <nirik> I don't think we should do func unless we also are doing puppet, etc... which I don't think we want.
20:48:22 <skvidal> don't we?
20:48:28 * dgilmore thinks we should only ever install security updates
20:48:57 <nirik> well, I guess the idea is that these are 'off the grid' so compromise in puppet/fas wouldn't also get them.
20:49:20 <Oxf13> right, that was the paranoia
20:49:42 <nirik> dgilmore: we could remove a ton of packages on them too I think... they could be much more minimal. (something to do with new one)
20:49:46 <Oxf13> no attack vectors from automation such as puppet or func (or backups)
20:50:00 <dgilmore> nirik: right
20:50:10 <dgilmore> we should only have installed what we need
20:50:46 <nirik> so, I'd say no func, no fasClient (local users?), no puppet. However, we should apply updates as part of our other updates flow... not put it off.
20:51:09 <dgilmore> nirik: backup01 has local users
20:51:30 <nirik> I think we should have sign* do that too... it's using fas currently I am pretty sure.
20:51:42 <dgilmore> nirik: right
20:51:52 <Oxf13> what is sign* ?
20:51:52 <nirik> does anyone object too strongly to that level of paranoia?
20:52:02 <smooge> well if we are going that route.. we might as well make this a home brewed ARM PCI card. I am not sure where the last stop of the reasonable train is.
20:52:03 <dgilmore> i agree with no fas, no func, no puppet
20:52:06 <nirik> shorthand for sign-bridge01 and sign-vault01
20:52:08 <skvidal> nirik: I think it's a waste of time
20:52:17 <Oxf13> oh ok.
20:52:23 <skvidal> nirik: I think we're chasing down paths that will ultimately get us right back to where we were on friday
20:53:08 <nirik> counterproposal? :) func and puppet and fas like any other machines?
20:53:10 <smooge> I do. I think that at a certain point we might as well go back to hand signing it from a cdrom in Gafton's cubicle because we are now assuming a lot of technical details that can be completely circumvented by physical.
20:54:41 <Oxf13> whatever we do, it has to be better than having the keys just sitting in somebody's homedir on say releng2....
20:54:48 <dgilmore> i think we need to make sure we have the backups so should anything happen, we can rebuild
20:55:00 <Oxf13> "The Incident" is what started some of this paranoia.
20:55:08 <dgilmore> Oxf13: right now the epel key is sitting on releng01
20:55:13 <dgilmore> relepel01
20:55:21 <dgilmore> the one for el4 and 5
20:55:22 <smooge> 1) I think the current level with a set of backup hardware and a set of backups that are encrypted is what we need to deal with.
20:55:29 <dgilmore> i really should import that into sigul
20:56:05 <smooge> after that we are assuming a lot of physical controls that don't exist.
20:56:09 <skvidal> the incident has nothing to do with this
20:56:20 <Oxf13> skvidal: I beg to differ.
20:56:25 <nirik> well, currently we are not doing func or puppet on them, and fasClient only when someone fixes it to run. ;)
20:56:40 * nirik sees we are getting near an hour now.
20:56:45 <Oxf13> here is my input.
20:57:18 <Oxf13> If we're going to turn on puppet/func/fasclient, then I really don't see the point in being extra paranoid about our backups, and we just turn on the backup stuff too and have those dbs sit with the rest of the backup data.
20:57:29 <Oxf13> (provided they aren't unlocked on the filesystem while the daemon is running)
20:58:25 <nirik> so, perhaps input from mitr would be helpful for us to decide A or B? (since they are kinda wildly different sides. ;)
20:58:54 <smooge> yes.
20:59:16 <nirik> ok, lets gather that and revisit in the regular infra/rel-eng channels for further decision?
20:59:34 <Oxf13> data points for right now
20:59:49 <Oxf13> sign-vault02 exists and has a backup of the vault data as of Friday
21:00:07 <Oxf13> that should remain in place until we have an agreed upon backup solution going
21:00:13 * nirik nods.
21:00:28 <abadger1999> is sign-vault02 fs encrypted?
21:00:47 <Oxf13> I don't believe so
21:00:53 <nirik> also, another datapoint: replacement hardware is available and sign-vault01 drives are going to be moved back to it later today.
21:00:56 <nirik> abadger1999: nope.
21:01:00 <Oxf13> I discussed fs encryption with mitr, and he had the opinion that it was a waste
21:01:01 <abadger1999> k
21:01:07 <Oxf13> given that the dbs were already encrypted
21:01:21 <seanjon> smooge: ping
21:01:25 <abadger1999> A shutdown fs encrypted host would make a decently secure warm backup.
21:01:28 <smooge> hi seanjon
21:01:30 <skvidal> abadger1999: +1
21:01:34 <seanjon> smooge: lets push the hardware swap 5pm
21:01:41 <smooge> 5pm your time?
21:01:43 <seanjon> yea
21:01:47 <skvidal> abadger1999: it would mean getting through the fs encryption and then through the nss db encryption
21:01:49 <nirik> abadger1999: yeah, could be.
21:01:54 <abadger1999> I think skvidal suggested something similar in email.
21:01:57 <nirik> yep.
21:02:13 <smooge> seanjon, I think that will be 00:00 UTC
21:02:25 <Oxf13> abadger1999: it'd make the backup process a bit more complicated, and require yet another passphrase to be shared or stored somewhere.
21:02:31 <smooge> you are UTC-7 I believe
21:02:41 <nirik> ok, I had 2 more small items I wanted to note, shall we do them real quick, then call it a meeting and try and ponder longer term decisions?
21:02:48 <skvidal> Oxf13: backup process MORE complicated?
21:02:49 <smooge> seanjon, ok will move it to 00:00 UTC
21:02:51 <skvidal> you just snapshot the lvm
21:03:19 <skvidal> s/lvm/lv/
21:03:38 <nirik> #topic Misc items
21:04:04 <nirik> I'd like to suggest we not do updates on friday's... especially on hardware that only has next business day response. ;)
21:04:42 <Oxf13> skvidal: er, you'd have to boot the spare system, unlock it, then copy over the changed files, and shut the backup system down again
21:04:50 <skvidal> Oxf13: no
21:04:56 <nirik> Also, on this outage we didn't have any outage announcement, but should we have? it didn't cause any disruption to end users or developers... just need to consider when we notify about outages and how. (revisit that process)
21:04:58 <smooge> yes. sorry I wasn't watching the date when I ok'd doing updates
21:04:59 <skvidal> Oxf13: you run the primary on an lv
21:05:04 <skvidal> snapshot and dd to a file
21:05:16 <skvidal> run the warm-backup off of the dd
21:05:37 <nirik> smooge: no worries. I think it was me that was saying we should just do them then...
21:06:09 <smooge> I think we both did. I have been working so many weekends I didn't think.
21:06:18 <nirik> yeah, easy to do.
21:06:40 <skvidal> to be fair
21:06:48 <skvidal> this had NOTHING to do with the updates
21:06:51 <skvidal> the updates were fine
21:06:55 <Oxf13> skvidal: I guess I was confused as to how the backed up data would get onto the shut down encrypted other system.
21:06:56 <skvidal> the kernel was fine
21:07:15 <skvidal> the only issue here is that the  hw pooped on itself
21:07:20 <skvidal> and was unrecoverable
21:07:28 <skvidal> let's not make this into something it is not
21:07:28 <nirik> yeah
21:07:44 <skvidal> it has NOTHING to do with updates nor, even, with rebooting
21:07:56 <skvidal> if we had opted to NOT reboot these boxes
21:08:03 <skvidal> it may well have failed in an identical way
21:08:06 <abadger1999> <nod>
21:08:18 <nirik> true.
21:08:27 <smooge> or worse... failed in ways we didn't see
21:08:36 <skvidal> post hoc ergo propter hoc
21:08:37 * abadger1999 thinks we should also consider doing periodic reboots of the box.
21:08:43 <skvidal> abadger1999: +100000
21:08:47 <skvidal> abadger1999: rebooting ALL OF OUR BOXES
21:08:49 <skvidal> every 150 days
21:08:51 <abadger1999> <nod>
21:08:52 <skvidal> no matter way
21:08:54 <smooge> just not on Friday.
21:08:56 <skvidal> err s/way/what/
21:09:09 <nirik> surely. entropymonkey. :)
21:09:19 <Oxf13> drifting off topic
21:09:33 <nirik> anyhow, shall we close up and take our thoughts to the list/infra&rel-eng meetings?
21:09:44 <skvidal> sure
21:09:50 <abadger1999> <nod>
21:10:02 <nirik> thanks for all the info and brainstorming everyone!
21:10:10 <nirik> #endmeeting
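
The periodic-reboot idea raised under Misc items could be checked with
something like the sketch below. It assumes a Linux host exposing
/proc/uptime; the 150-day limit and the "not on Friday" rule are the figures
floated in the discussion, not an adopted policy, and the function names are
made up for illustration.

```python
from datetime import date, timedelta

# Limit suggested in the meeting (assumption: not an agreed policy).
MAX_UPTIME = timedelta(days=150)
FRIDAY = 4  # date.weekday() counts Monday as 0


def read_uptime(path="/proc/uptime"):
    """Return the host's uptime as a timedelta (Linux only)."""
    with open(path) as handle:
        seconds = float(handle.read().split()[0])
    return timedelta(seconds=seconds)


def reboot_due(uptime, limit=MAX_UPTIME):
    """True once the box has been up for at least the limit."""
    return uptime >= limit


def should_reboot_today(uptime, today, limit=MAX_UPTIME):
    """Due for a reboot, and today is not a Friday."""
    return reboot_due(uptime, limit) and today.weekday() != FRIDAY


if __name__ == "__main__":
    print(should_reboot_today(read_uptime(), date.today()))
```

Keeping the decision in pure functions means the rule can be tested without
touching a real host; only read_uptime depends on the machine.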