This is a transcript. For the video, see LINK.

Michael Meyers: [00:00:00] Hello and welcome to our Tag1 Team Talk on automating infrastructure with EKS and Pulumi to deploy new enterprise web applications in minutes. I'm Michael Meyers, managing director of Tag1 Consulting, and I'm joined today by two special guests: Jeff Sheltren, the CIO at Tag1, and Travis Whitehead, one of our senior infrastructure engineers. Today's talk is broken down into two segments. In the first segment we talked about the business problems, goals, and solutions. This is part two. We're really going to dig into the technology and how we automate and deploy infrastructure and provision all of these applications.

[00:00:37] But before we do jump into that technology, just real quick: what we're talking about is a problem that I think most large enterprises struggle with, and we're going to be giving you insight into a real-world project that we're working on for one of the top Fortune 500 companies and the work that we're doing for them to achieve this holy grail. Business users are often dependent upon technology resources to launch and deploy new applications. These technology resources are scarce and really busy, and what businesses want is to empower their business users to be in control of their technology, not dependent upon technology resources, to be able to deploy these applications on their own and free up their technology resources to work on other areas that drive a lot of value for the company.

[00:01:33] So Jeff, can you give us a sense: I'm an end user at a business group in a large Fortune 500 company. I need to get a site up really quickly for, you know, an important event that's coming up or for a new application our department needs. How does that work from my perspective as an end user?

[00:01:53] Jeff Sheltren: Sure. So, I mean, that's one of the main pain points we're trying to work around. Currently, it's a very painful and manual process. You're going to have to find your own resources to build a website or find someone internal to do it for you, and just go through manually provisioning servers, deploying the site, et cetera. What we're aiming to do is just make that as simple and push-button as possible. So basically, as an end user, you would go to our signup site, type in some basic information about your group, what your expected usage of the site would be, the host name or domain name that you want, and submit that in a form. And from there on, once someone gives it the stamp of approval, our goal is to have this completely automated, where it just spits out a site that has, you know, your standard dev, test, and production environments, where we're able to scale it up and down as needed. We can do some cool things like integrate it into our central Solr search service.

[00:03:02] These are all kind of options that would be available in the web form when submitting that request.

[00:03:08] Michael Meyers: Yeah. We've created these really powerful application templates or distributions for this organization that have enabled them to jumpstart and accelerate, you know, the release of new projects, right?

[00:03:22] Out of the box, you get integrations into all their internal systems, third-party systems, all the basic functionality, and the look and feel of the site. And we'll talk about this more later. So you can pick the template that best matches your needs, enter your information, and if you require approval, you know, you have to wait and get it.

[00:03:40] And then this spins up this application with your custom domain, instantly. Sounds pretty awesome. How does this actually work? Travis, what's under the hood here? What are the key technologies that you guys are using to drive this?

[00:03:55] Travis Whitehead: Yeah. So I think the key difference between, you know, the traditional slow approach of going to every team and requesting your database, requesting your files, requesting your servers,

[00:04:05] and what we're doing here is that we're building this on the cloud. This whole system is running on EKS on AWS, and we're leveraging infrastructure-as-code tools, mainly Pulumi in this case, to make all of this automation repeatable and just represented as code. And it means that when you stand up a new site,

[00:04:23] you're just running the same program with a couple of different inputs that have changed across the sites. And what you were talking about earlier, with the same kind of needs across all these sites: the same themes, the same integrations with the company's internal services, the same accessibility requirements.

[00:04:41] Instead of going with maybe the more old-school approach of having different Drupal code bases and different Git repos with different modules enabled and different settings baked in, we're actually just using one single code base for one Drupal distribution that has all these bells and whistles baked in, and building a Docker image out of that. And all the differences between these sites, even though they're running on the same code base, are just parameterized as environment variables.

[00:05:06] And so when a business user goes to that onboarding form that Jeff mentioned earlier, they can put in, you know, specific things like: we want this DNS name, we want this LDAP group to have access to the admin role or the content role. That all just pipes into Pulumi inputs, and then that gets piped into the Docker container as environment variables.

[00:05:26] So for us, it's a lot less maintenance overhead to just have one site that we're maintaining. So when we're cutting a release, it's on one repo, and we're not having like 12 different repos or 12 different sites.
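Editor's sketch: the flow Travis describes (form fields become per-site inputs, which become container environment variables) can be modelled in a few lines of plain Python. This is not the team's code and does not use the Pulumi SDK. The field names, the camelCase config keys, and the `DRUPAL_` prefix are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRequest:
    """Fields a business user might submit in the onboarding form (hypothetical names)."""
    site_name: str
    dns_name: str
    admin_ldap_group: str
    content_ldap_group: str


def to_stack_config(req: SiteRequest) -> dict:
    """Turn a form submission into the per-site inputs that differ between sites."""
    return {
        "siteName": req.site_name,
        "dnsName": req.dns_name,
        "adminLdapGroup": req.admin_ldap_group,
        "contentLdapGroup": req.content_ldap_group,
    }


def to_container_env(config: dict, prefix: str = "DRUPAL_") -> dict:
    """Map camelCase config keys to prefixed UPPER_SNAKE_CASE environment variables."""
    env = {}
    for key, value in config.items():
        # Insert "_" before each capital letter, then upper-case: dnsName -> DNS_NAME
        snake = "".join("_" + ch if ch.isupper() else ch for ch in key).upper()
        env[prefix + snake] = value
    return env
```

The point of the design is that the single Docker image never changes between sites. Only the dictionary returned by `to_container_env` does.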

[00:05:40] Michael Meyers: So the application is contained in these Docker images.

[00:05:44] Travis, could you tell me a little bit more about Pulumi and how are you using that on the infrastructure side of things?

[00:05:51] Travis Whitehead: Yeah. So the application is on the Docker image level, but to run the full infrastructure, we need a lot of other pieces too that are going to exist in AWS, and also services within the company, like the company has internal services that are going to integrate into it.

[00:06:05] And so if we want to do this all programmatically and automatically, we need to have a program that will say, okay, spin us up a database, spin us up some file storage, spin us up a Kubernetes cluster, and then deploy onto that. So that's pretty much where Pulumi comes in. We kind of have different layers. With Pulumi, you can manage all the resources in one big program and one big stack, but it's more effective for us to kind of split those out into layers.

[00:06:32] So we have one Pulumi program for global or shared resources, which are going to be things that we're only going to provision once for a single environment or on a single AWS account. That might be something like your EFS volume for Drupal's files, or your RDS database, because in this case we're having all the sites work with one RDS instance. And then above that, we want to deploy an EKS cluster.

[00:06:58] So we'll have an EKS Pulumi program that, when we run it, puts in your auto-scaling group. It sets up all the EKS bits, the mount points of EFS, everything you need to just have one single Kubernetes cluster that you can deploy to. And the reason we split that out into a separate program is because if, let's say, we wanted to do some maintenance on the cluster, we might want to stand up a second cluster, or maybe a third cluster, and migrate all of the apps over there live,

[00:07:24] and then tear down the old cluster, instead of doing maintenance on the same cluster where everything's running. By having it in a separate program, we can just stand up and tear down an arbitrary number of EKS clusters based on whatever we're trying to do. And then the final slice is the app slice, the program you run that represents a single application.

[00:07:43] And we'll say we want that application to get deployed to this cluster. We want to go request a certificate to get signed by the company's keys. We want some kind of load balancer integration. So they're all kind of different slices. And by having them in separate slices, we can repeat them as many times as we need, based on whatever we're trying to do.

[00:08:04] Does that make sense?
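Editor's sketch: one way to picture the three layers (shared resources, clusters, apps) and the cluster-swap maintenance Travis mentions is the toy model below. It is plain Python rather than Pulumi code, and the class, method, and cluster names are hypothetical. It only illustrates why keeping clusters in their own layer lets you add a second cluster, move apps across, and tear down the first without touching the shared database or file storage.

```python
class Environment:
    """Toy model of one environment split into shared, cluster, and app layers."""

    def __init__(self):
        # Shared layer: provisioned once per environment (e.g. RDS and EFS).
        self.shared = {"database": "rds", "files": "efs"}
        # Cluster layer: any number of clusters can exist side by side.
        self.clusters = set()
        # App layer: each app is deployed onto exactly one cluster.
        self.apps = {}

    def create_cluster(self, name):
        self.clusters.add(name)

    def deploy_app(self, app, cluster):
        if cluster not in self.clusters:
            raise ValueError(f"unknown cluster: {cluster}")
        self.apps[app] = cluster

    def migrate(self, source, target):
        """Redeploy every app on `source` onto `target`; return the apps moved."""
        moved = [app for app, cluster in list(self.apps.items()) if cluster == source]
        for app in moved:
            self.deploy_app(app, target)
        return moved

    def destroy_cluster(self, name):
        if name in self.apps.values():
            raise RuntimeError(f"cluster {name} still hosts apps")
        self.clusters.remove(name)
```

A maintenance run would then be: `create_cluster("green")`, `migrate("blue", "green")`, `destroy_cluster("blue")`, with `shared` unchanged throughout.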

[00:08:06] Michael Meyers: Totally. How do Pulumi stacks play into this?

[00:08:10] Travis Whitehead: Yeah. So the Pulumi program is kind of your code definition of saying: when I run this, I want to have these resources in this kind of state. A stack is kind of a representation of: if I run a Pulumi program once and I had 10 resources created, the stack is going to be that group of resources.

[00:08:28] So in the case of our Pulumi program for deploying an app, every app we deploy is going to have a separate Pulumi stack. So we group together: this is our Kubernetes deployment, this is our load balancer, this is our certificate. All of those are in little stack groups based on each app.

[00:08:45] And that's really helpful because it means if we need to target a Pulumi run against a specific set of resources, like we only want to update this app in this environment, or we only want to run against this entire environment on an AWS account, or we only want to run against this cluster, then by having those resources grouped into separate stacks, we can ensure that we're not going to accidentally change something that we don't intend to.
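Editor's sketch: the isolation that stacks provide can be shown with a toy state store. This is a plain-Python stand-in and not Pulumi's backend. The `Backend` class, the `app-env` naming convention, and the resource names are assumptions for illustration only.

```python
class Backend:
    """Toy stand-in for a state backend: one isolated resource list per stack."""

    def __init__(self):
        self.stacks = {}

    def up(self, stack, resources):
        # Only the named stack's state is written; other stacks are never touched.
        self.stacks[stack] = list(resources)

    def resources(self, stack):
        return list(self.stacks.get(stack, []))


def app_stack_name(app, env):
    """Hypothetical naming convention: one stack per app per environment."""
    return f"{app}-{env}"


def app_resources(app, image_tag):
    """The group of resources that belongs to a single app's stack."""
    return [
        f"deployment/{app}:{image_tag}",
        f"load-balancer/{app}",
        f"certificate/{app}",
    ]
```

Updating `events-dev` to a new image tag rewrites only that stack's entry, which is the guarantee Travis describes: a run scoped to one stack cannot change another app's resources.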

[00:09:09] Michael Meyers: Cool.

[00:09:10] And you mentioned, when we were talking about the end-user process, I think Jeff mentioned these variables. How do the variables, the parameterization, the configuration play into the Pulumi side of things?

[00:09:28] Travis Whitehead: Yeah. So every Pulumi stack has a concept of configuration inputs.

[00:09:33] So, if we're going to deploy one application versus another one, we do need a way to specify what parameters are going to be different between those two. And so it's really nice with this onboarding form, for the inputs that don't require a human to actually look at or think about, like, this is just an identifier for the group that has admin access.

[00:09:53] For example, we can pipe that directly into the new Pulumi stack's config. And then when we run that program, the Pulumi program is going to propagate those configs down into all the places they need to be. Maybe those configurations will go into environment variables in the Kubernetes pod that's running.

[00:10:10] Maybe they're going to feed into a config map somewhere. Maybe they're secret values and they're going to go into a Kubernetes secret or a secret in AWS Secrets Manager. But it's really simple for us because all we have to do is just open up this YAML file of the Pulumi stack config.

[00:10:25] You know, that's assuming a human is editing it. In reality, it's actually just getting piped in there programmatically. So it's really, like, as long as we make everything parameterized in that way, where every possible value that might need to be different is already a config input, then we don't need to change anything below that.
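Editor's sketch: the propagation step, where each config input ends up either as a plain setting or as a secret, might look roughly like this. It is a plain-Python model, not the Pulumi config API. The required keys and the two destination names are assumptions, and the real program may route values to more places (config maps, AWS Secrets Manager) than shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigValue:
    value: str
    secret: bool = False  # secrets must never land in plain environment variables


# Hypothetical list of inputs every site must provide.
REQUIRED_KEYS = ("dnsName", "adminLdapGroup")


def route_config(config):
    """Validate the stack config, then split it by destination."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError("missing config inputs: " + ", ".join(missing))

    routed = {"env": {}, "secret": {}}
    for key, item in config.items():
        destination = "secret" if item.secret else "env"
        routed[destination][key] = item.value
    return routed
```

Because everything that varies between sites is a named input, adding a site means adding a config, not editing the code below it.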

[00:10:44] Michael Meyers: That's awesome. Designed from the ground up to be, you know, very flexible and customizable by the end user, without having to understand anything that's going on under the hood, which sounds like it's pretty complex. So, you know, I bet a lot of people are wondering what Pulumi is and why you guys chose it over, say, Terraform. Could you give us just a sense of what Pulumi is?

[00:11:11] And

[00:11:13] Travis Whitehead: Yeah, Pulumi is just an infrastructure-as-code tool. You can apply it to a lot of different things. It supports multi-cloud. It can handle cloud resources or things on a much smaller scale than that. But basically all it is is a concept of: you have resources that you want to manage, and it takes a CRUD approach, where you have a create operation, a replace operation, an update operation, and a delete operation.

[00:11:39] And then people have implemented providers for stuff like AWS, maybe MySQL resources, maybe Kubernetes resources, anything under the sun that you might want to manage with this tool. People implement providers that kind of follow that pattern. So it's really comparable to Terraform. There's a few things that, you know, we kind of favor in Pulumi over Terraform. So with Terraform or even CloudFormation, you're going to be learning a domain-specific language specific to that tool. But Pulumi is really nice because you can work in a couple of different languages like Python, Node.js, or even TypeScript, if you want that strong typing.

[00:12:14] You know, we're all capable of learning specific languages or specific tools, but it's really nice to stick with the languages that we know and love. I really like how Pulumi does secrets handling. Like, Jeff, have you ever run a Helm chart where you're figuring out, how do I put my secrets in these Helm values into Git?
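Editor's sketch: the create / update / replace / delete pattern that providers follow comes down to comparing the state you have with the state you want. The planner below is a plain-Python illustration of that idea, not Pulumi's engine or provider interface. The `immutable` parameter, marking properties that force a replacement when changed, is an assumption added to show where "replace" differs from "update".

```python
def plan(current, desired, immutable=frozenset()):
    """Compare current and desired resources and list the operations needed.

    Both arguments map a resource name to a dict of properties. A change to any
    property named in `immutable` yields "replace" instead of "update".
    """
    operations = []
    for name, props in desired.items():
        old = current.get(name)
        if old is None:
            operations.append(("create", name))
        elif old != props:
            changed = {k for k in old.keys() | props.keys() if old.get(k) != props.get(k)}
            kind = "replace" if changed & set(immutable) else "update"
            operations.append((kind, name))
    for name in current:
        if name not in desired:
            operations.append(("delete", name))
    return operations
```

A provider for AWS, MySQL, Kubernetes, or an internal company service then only has to implement what each of those four operations means for its own resource types.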

[00:12:31] Jeff Sheltren: Yeah. That's not fun.

[00:12:35] Travis Whitehead: Yeah. I mean, it's what we ended up doing. Like, you can use helm-secrets, which wraps SOPS, or, since we're already using Ansible half the time, we're just throwing in Ansible-vaulted files for completely different tools. So, yeah, that's one of the things I like about Pulumi: it does have a little bit better secrets management built in. It's a similar idea to Ansible, where you have a password; it's like a symmetric key.

[00:12:57] But we can have config values in our Pulumi stack config and just say, this is a secret. Pulumi will encrypt it with our password automatically. And then at runtime, it knows where to find the password, and the secrets are decrypted during runtime. And even with the stack state that's getting stored somewhere on your Pulumi backend,

[00:13:16] the secrets will be encrypted at rest there too. And I think with Terraform, you might get secrets encrypted in the state, depending on the backend you're using, but it's not a guarantee. So depending on how you're storing your Terraform state, you might have to treat that state in itself as a secret.

[00:13:35] So there's little things that I think make Pulumi a little bit more convenient, but really you can, you can do the same thing with Terraform.

[00:13:41] Michael Meyers: Yep. I mean, a big thing, you know, we do our work with this organization under a confidentiality agreement. I don't want to go into too much detail and tip who they might be, but they have a lot of experience in Pulumi, and so that was a big reason for them wanting to use it as well. It certainly has many advantages as a tool. And we, as a company, have done a tremendous amount with Terraform. But in this case, for the reasons that you mentioned and their particular unique expertise, it seemed like the perfect choice.

[00:14:11] Travis Whitehead: Yeah. I mean, a big part of why we went with Pulumi in this case is just because, you know, we're not using purely AWS. We're also integrating with the company's internal services, you know, the way they manage certificates, their load balancers. And because it is just a generic pattern, companies can implement providers for their own internal services.

[00:14:33] And so that was appealing to us because in this case they were standardizing on Pulumi for managing those types of integrations, and so it looked perfect for our use case. And the funny thing is, you don't even have to write a provider for Pulumi. If your company already has a Terraform provider, or they've standardized on Terraform, but your team, for whatever reason, wants to use Pulumi or you're considering migrating in that direction, the Pulumi community has a tool called Pulumi Terraform Bridge.

[00:15:02] And it'll just take a Terraform provider and spit out a Pulumi provider. And even just reading the code, you know, as I'm writing this Pulumi stuff, it doesn't look like messy translated code to me. It's perfectly readable and perfectly usable. So that's super pleasant.

[00:15:19] Michael Meyers: Wow. So you can bring that in with really good translation. That's awesome. You talked about a lot of things that you like about Pulumi. Are there aspects that you wish would change or function differently, that frustrate you? You know, what are some of the downsides of Pulumi?

[00:15:41] Travis Whitehead: Yeah, that's a great question.

[00:15:42] I wouldn't say I consider this a flaw, because I think it's a sensible way of handling the situation, but it still does catch you off guard when you're learning it. Sometimes, when you're still writing your program and there's still bugs and stuff like that, maybe you didn't get everything quite right.

[00:15:57] So you deploy some Kubernetes resources and your deployment is hanging because it's failing to find this or that. And Pulumi will just sit there and wait for the Kubernetes API to come back and say, this isn't going to happen, you know, we waited five minutes, we're going to time out now. And being humans, we get really impatient and we're like, I don't want to sit and wait for this.

[00:16:18] I want to see the error output. So you Ctrl-C, and you've inadvertently canceled your Pulumi run. And because the Pulumi stack keeps track of the state of all those resources, and keeps track of the pending operations that are happening, Pulumi's now like, okay, well, I told them to deploy that pod.

[00:16:35] I don't know if it actually came back ready. Because Pulumi really likes to track readiness, like when a resource is fully initialized. So then the next time you go back to run pulumi up, it's going to say, okay, wait, you just canceled earlier in the middle of two or three pending operations.

[00:16:51] So now you're going to export the stack state, go through this big old JSON file, find the pending transactions, and you can basically mark, as a human, I verified they've completed, or the transaction didn't complete successfully. And then you'd re-import that state back into the Pulumi backend.

[00:17:08] And Pulumi knows how it wants to proceed. And once you've learned how to do that, and you've learned how to verify what did and didn't happen behind the scenes, it's a pretty easy situation to resolve, but it can be tempting to just be like, I'm going to manually delete these resources on Kubernetes and I'm going to manually remove them from the stack.

[00:17:31] But then you realize that there's like five other resources, depending on the one you just deleted. And then you get yourself into this mess where you're touching things that only a robot should be touching. So it's, you know, it's a sensible way of handling that situation because Pulumi does need to resolve a state, but it can get messy.
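Editor's sketch: the export, edit, and re-import routine Travis describes can be scripted once you have verified by hand what actually happened to each resource. The helper below assumes the exported JSON keeps interrupted work in a list at `deployment.pending_operations`, with each entry carrying the resource's URN. That layout is an assumption about the state file and may differ between Pulumi versions, so inspect your own export first. The surrounding commands (`pulumi stack export --file state.json` before, `pulumi stack import --file state.json` after) are run separately.

```python
import json


def clear_pending_operations(state):
    """Remove pending operations from an exported stack state (modified in place).

    Returns the URNs of the operations removed so a human can double-check
    each one against the real cluster before re-importing the state.
    """
    deployment = state.get("deployment", {})
    pending = deployment.get("pending_operations") or []
    urns = [op.get("resource", {}).get("urn", "<unknown>") for op in pending]
    if pending:
        deployment["pending_operations"] = []
    return urns


def clear_pending_in_file(path):
    """Load an exported state file, clear its pending operations, and write it back."""
    with open(path, encoding="utf-8") as handle:
        state = json.load(handle)
    urns = clear_pending_operations(state)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    return urns
```

This only edits the record of what Pulumi was in the middle of doing. It does not delete or create anything in Kubernetes or AWS, which is why the manual verification step still matters.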

[00:17:51] Michael Meyers: What are some of the bigger challenges that you guys have faced while building the system, whether that's organizational, in integrating with and automating these systems, or, you know, the technology?

[00:18:06] Travis Whitehead: I think some of the more fun challenges just stem from working within a company that has integrated with AWS in, maybe not less standard, but their own specific ways.

[00:18:19] So it's not like we, you know, as Tag1 just went and spun up an AWS account and we're starting with a blank slate. There's a lot of specific rules and policies and configurations that are already in place. And we've got to do a lot of documentation, deep diving to navigate that, like how the roles and permissions are set up.

[00:18:38] The process for requesting connections back to internal services. Because we already know AWS really well, but those are the things that are kind of new going down this road, and, you know, just diving through the docs, discovering links that don't point to where they used to, all that kind of fun stuff.

[00:18:55] Michael Meyers: There's a lot to navigate in a big organization.

[00:18:58] Jeff Sheltren: I would just add to that, this organization has kind of allowed people to deploy stuff to AWS for a few years now, but it hasn't really been organized. They're just starting to organize their cloud strategy and come up with all these policies about what can be deployed today with AWS and how, and what the lockdown restrictions are. And being one of the first teams to kind of work through that process

[00:19:28] and help as it's being defined, and kind of, you know, work around the internal processes and try to correlate that with what we would normally do on AWS, has been a bit of a struggle. Definitely a learning experience. And, you know, I love that we're doing this in Pulumi and making it just available and reusable for other groups within the company.

[00:19:51] Like it can be a learning tool for those that are new to AWS within the company.

[00:19:58] Travis Whitehead: Yeah. And we can totally tell that this is new within the company too, because of the other teams using Pulumi and writing these SDKs for working with their internal services. You know, they just landed the 1.0 stable release pretty recently. So for a while, we were surfing the bleeding edge, you know, finding bugs, reporting them, working closely with them to get stuff doing what it's supposed to do. So it's totally new all around. That's a lot of what makes it fun.

[00:20:27] Michael Meyers: Awesome. It's exciting to be part of that initial Pulumi release.

[00:20:31] I think one of the things I love about this organization is, you know, they're insanely large, they're one of the largest companies in the world, and yet they do a pretty good job of sharing their technology across groups and departments and organizations. And to be able to be a part of, you know, the tip of the spear on this, and how so many other groups and departments are going to leverage and work with this

[00:20:53] in the future, is really rewarding. Like, it's pretty amazing to know that not only are we helping, you know, this specific project, but it's going to echo throughout the organization over time. Any standout moments? You know, you mentioned being part of the Pulumi stuff and the release. You know, what's next?

[00:21:13] You know, we're in the early stages of this application, and we want to build on this, right? This is sort of the initial release. What is coming up in the future? Can you give us any sense of the roadmap, some of the things that you guys will be working on in the future?

[00:21:33] Travis Whitehead: I mean, I would say we're still a little bit in the early stages. You know, this is all happening in dev environments. But I really think that the path forward is to just keep pushing forward with the idea that we want to automate things that a human would otherwise have to be doing, and really create this whole platform that can run and keep running with minimal human maintenance input.

[00:21:54] So, you know, stuff like being able to scale up based on increased resource consumption, you know, really robust monitoring and alerting for when things do go in directions we don't expect them to, and probably increased flexibility in terms of what end users want in a website. Maybe we'll discover that there are needs that folks have that we don't yet support.

[00:22:17] I think we'll have a better idea as we have more users telling us what it is they need.

[00:22:23] Michael Meyers: Awesome.

[00:22:26] Jeff Sheltren: Yeah, I mean, I agree with what Travis said. There's definitely a lot of excitement internally about it. I mean, currently the work we're doing at this company is, a metaphor I heard was, equivalent to creating bespoke suits for people. That's the websites we create: super high performance, very large websites, you know, across the entire company.

[00:22:49] And to try to take the knowledge that we as Tag1 have about performance and scale and Drupal and all that stuff and make it into this kind of turnkey solution that smaller groups can use, I think is really cool. So I think we're going to learn some things along the way, because it's definitely a much different audience than we typically interact with.

[00:23:07] But I love the idea of giving these smaller business groups the ability to just have all this technology that we've developed for these other large websites, letting them have those tools and connect to all their internal company tools. It's really cool. I think, you know, one thing I would love to see is,

[00:23:25] even if someone wants to kind of deploy their own Drupal site onto an EKS cluster on AWS, they could just take our Pulumi scripts and pipelines and run with it. Like, they could easily deploy a very large scale Drupal site to suit their needs and be able to customize it that way.

[00:23:46] Travis Whitehead: And even just taking a step away from Drupal.

[00:23:49] I think a lot of the work we're doing in just managing EKS clusters, those are all patterns that could be reused across the company for folks looking to do something similar, but maybe different in certain ways.

[00:24:01] Michael Meyers: We talked about a lot of technologies and tools. And clearly you guys have learned a lot through this and other projects that are similar.

[00:24:09] Are there any resources that have really helped you? You know, do you have a favorite podcast or book or website that you guys read that you just want to say, hey, check this out?

[00:24:25] Travis Whitehead: I don't think there's a lot I've had to lean on other than just community upstream documentation. You know, the Pulumi API reference docs are really great. And when they fail you, you can just read the source code on GitHub. And that's the beauty of open source.

[00:24:42] Michael Meyers: What I heard, Travis, is that you're going to write some blog posts, because there's a huge gap in the ecosystem of information about Pulumi.

[00:24:52] Travis Whitehead: We'll see about that. I don't know.

[00:24:55] Michael Meyers: Well, thank you guys so much for giving us some insight into this project. This is really exciting. I'm glad that we can share some of this with our listeners. Remember to check out part one, where we get more into the business problem and the business solution.

[00:25:10] We'll put some links into the show notes below the video, so you can check out some of the things we talked about. If you like this talk, please remember to upvote, share, and subscribe. You can check out our past talks at tag1.com/tagteamtalks. And as always we would love your input and feedback on this show, as well as your ideas for future upcoming topics.

[00:25:34] When we hear from you guys, it means so much, and it's so rewarding to be doing this when you give us your input and feedback. You can reach us at tag1teamtalks@tag1.com. And again, a huge thank you to Jeff and Travis for joining us today, and to all of our listeners. We'll see you soon.

[00:25:55] Jeff Sheltren: Thanks.