This is a transcript. For the video, see How to load test with Goose - part 2: Running a Gaggle.

Michael Meyers: [00:00:00] Hello, and welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting. Today, we're going to be doing a distributed load testing how-to: a deep dive into running a Gaggle with Tag1's open source Goose load testing framework. Our goal is to prove to you that Goose is both the most scalable load testing framework currently available, and the easiest to scale. We're going to show you how to run a distributed load test yourself, and we're going to provide you with lots of code and examples to make it easy and possible for you to do this on your own. I'm Michael Meyers, the managing director at Tag1, and joining me today is a star-studded cast.

[00:00:38] We have Jeremy Andrews, the founder and CEO of Tag1, who's also the original creator of Goose. Fabian Franz, our VP of Technology, who's made major contributions to Goose, especially around performance and scalability. And Narayan Newton, our CTO who has set up and put together all the infrastructure that we're going to be using to run these load tests.

[00:01:03] Jeremy, why don't you take it away? Give us an overview of what we're going to be covering and let's jump into it.

[00:01:10] Jeremy Andrews: Yeah. So last time we explored setting up a load test from a single server and confirmed that Goose makes great use of that server. It leverages all the CPUs and ultimately pushes as much traffic as it can until the uplink slows it down.

[00:01:27] So today what we're going to do is use a feature of Goose called a Gaggle, which is a distributed load test. If you're familiar with Locust, it's similar to a swarm. The way this works with Goose is that you have a Manager process that you kick off, and you say, I want to simulate 20,000 users, and I'm expecting 10 Workers to generate this load.

[00:01:49] The Manager process prepares things, and all the Workers then connect in through a TCP port, and it sends each of them a batch of users to run. The Manager coordinates the start, so each of the Workers starts at the same time. They then send their statistics back to the Manager so that you can actually see what happened in the end.
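For reference, the Manager/Worker setup described above maps onto Goose's Gaggle run-time options. A minimal sketch, assuming a load test binary (here called ./goose-loadtest) built from one of Goose's examples; the host names are placeholders, and the flag names follow Goose's Gaggle-era CLI:

```shell
# On the Manager: expect 10 Workers, coordinate 20,000 users total.
./goose-loadtest --manager --expect-workers 10 --users 20000 \
  --host https://example.com

# On each Worker: connect back to the Manager, which assigns this
# Worker its batch of users and coordinates the synchronized start.
./goose-loadtest --worker --manager-host goose-manager.example.com \
  --manager-port 5115
```

The Manager blocks until all expected Workers have connected, then starts them together and aggregates their statistics.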

[00:02:11] What this nicely solves is: if your uplink can only handle so much traffic, or if you want traffic coming from multiple regions around the world, you can let Goose manage that for you across all of these different servers. So today Narayan has set up a pretty cool test where we're going to be spinning up a lot of Workers.

[00:02:28] He can talk about how many. Each one is not going to be working too hard; they'll run maybe a thousand users per server, which means each will be at least 50% idle, and it won't be maxing out the uplink on any given server. But in spite of that, we're going to show that, working together in a Gaggle, we can generate a huge amount of load.

[00:02:45] So now Narayan, if you can talk about what you've set up here.

[00:02:48] Narayan Newton: Sure. So what I built today is basically a simplistic Terraform tree. What is interesting about this is that we wanted to distribute the load between different regions, and for those people who have used Terraform in the past, that can be slightly odd, in that you can only set one region for each AWS provider that Terraform uses to spin things up.

[00:03:12] So how we've done this is we've defined multiple providers, one for each region, and a module that spins up our region Workers. We basically initialize multiple versions of the module, passing each a different region. So in the default test, we spin up 10 Worker nodes in various regions: the Western part of the United States, the Eastern part of the United States, Ireland, Frankfurt, India, and Japan.

[00:03:38] It's the load testing truss, which is what we decided to call it. With how the test currently works, it's a little limited, because once you start it, you can't really interact with the Workers themselves. They start up, they pull down Goose, and they run the test. The next revision of this would be something that has a clustering agent between the Workers, so that you can actually interact with the Workers after they start. It gets very annoying to have to run Terraform to stand up these VMs all over the world, and then when you want to make a change to them, you have to destroy all of them and relaunch them. Which isn't terrible, but as a testing sequence it adds a lot of time, just because it takes time to destroy and recreate these VMs. So the next revision of this would be something other than Goose creating a cluster of these VMs. How it currently works is that we're using Fedora CoreOS so that we have a consistent base at each location.
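The per-region provider pattern described above can be sketched in Terraform like this; the aliases, regions, counts, and module name are illustrative, not the repository's actual code:

```hcl
# One aliased AWS provider per region; each module instantiation is
# pinned to a region by passing it a different provider.
provider "aws" {
  alias  = "us_west"
  region = "us-west-2"
}

provider "aws" {
  alias  = "eu_central"
  region = "eu-central-1"
}

module "us_west_workers" {
  source       = "./region-worker"
  worker_count = 2
  providers = {
    aws = aws.us_west
  }
}

module "eu_central_workers" {
  source       = "./region-worker"
  worker_count = 2
  providers = {
    aws = aws.eu_central
  }
}
```

Repeating the module block once per region, each with its own provider alias, is what spreads the Workers across the world from a single `terraform apply`.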

[00:04:41] And so I can send it a single file for initialization. Then Fedora CoreOS pulls down a container that has the Goose load test, and a container that has a logging agent, so that we can monitor the Workers and send all the logs from the Goose agents back to a central location.

[00:05:02] Fabian Franz: I had a quick question. So, Narayan, the basic setup is that we have EC2 instances on AWS, and then we run containers on them, like a normal Kubernetes setup, or how is it working?

[00:05:17] Narayan Newton: It's using Docker. So that is the big thing that I want to improve, and I almost got there before today. What would be nicer is if we could use one of the IoT distributions, or Kubernetes-at-the-edge distributions, to run a very slim version of Kubernetes on each Worker node, so that we get a few things.

[00:05:37] One is cluster access, so we can actually interact with the cluster, spread load, and run multiple instances of Goose. It would be interesting to pack multiple instances of Goose onto things like the higher-end nodes, and also to be able to actually edit the cluster after it's up, and not have to destroy it and recreate it each time.

[00:05:56] The other thing is to get containerd and not Docker, just because there are some issues that you can hit with that. As it stands right now, CoreOS ships with Docker running, and that's how you interact with it for the most part, via systemctl and Docker. You could also use Podman, but I ran into issues with that for redirecting the logs.

So we are actually using Docker itself, and Docker is just running the container as you would in a local development environment.

[00:06:24] Fabian Franz: So what we are missing from a standard Kubernetes deployment, that we would normally have, is the ability to deploy a new container. You were saying that if I want to deploy a new container with this simplistic infrastructure right now, I need to shut down the EC2 instances and then start them up again.

[00:06:42] Okay.

[00:06:42] Narayan Newton: So that's the thing. Before this test, Jeremy released a new branch with some changes to make this load test start faster. What I did to deploy that was run terraform destroy, wait for it to kill all the VMs across the world, and then terraform apply and wait for it to recreate all those VMs across the world.

[00:07:03] And that is a management style, honestly, but in this specific case, because we're sometimes doing micro-iterations, it can get really annoying.

[00:07:13] Fabian Franz: Yeah, for sure.

No, no, that makes perfect sense. I just wanted to understand, because I was thinking, in this container world you can just deploy a new container, but obviously you need a Manager for that.

[00:07:23] Narayan Newton: Yes. Yes. I could totally deploy a new container. So what I could do is have Terraform output the list of IPs, and then I could SSH to each of them and pull a new container. But at that point...

[00:07:40] But seriously, there's another Git repository that I have started: a version of this that uses a distribution of Kubernetes called K3s, which is designed for CI systems, IoT, and deployments to the edge. It's a single-binary version of Kubernetes, where everything is wrapped into a single binary and starts on edge nodes, and then you can connect them all together, so we could have a multi-region global cluster of these little Kubernetes agents.

[00:08:08] And then we could spin up Gooses on that. And that I think will actually work.

[00:08:12] Fabian Franz: You totally blew my mind. So now you've just signed up for a follow-up to show that, because that's what you actually want. But now I'm really curious: how does this Terraform configuration actually look? Can you share a little bit about it?

[00:08:29] Narayan Newton: So this is the current tree. If everyone can see it, it's pretty simplistic. This is the main file that gets loaded, and then for each region there's a module that is named after its region. They're all hitting the same actual module, just different instantiations of it. They each take a Worker count, their region, and their provider, and the provider is what actually separates them into regions.

[00:09:02] And then if you look at the region Worker, which is where most of these things are happening, there's a variables file, which is interesting because I have to define an AMI map, because every region has a different AMI. The regions are disparate; there's no consensus between these regions for images.

So one of the reasons I picked CoreOS is because it exists in each of these regions and can handle a single start-up file. When we do the K3s version of this, K3s kind of runs on Ubuntu, and Ubuntu obviously exists in all these regions as well, but I'll still have to do something like this. Or there's another way I can do it, but this was the way to do it for CoreOS.
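The AMI map idea can be sketched as below; the variable names, AMI IDs, and instance type are placeholders for illustration, not the actual values from the repository:

```hcl
# Hypothetical sketch of the per-region AMI lookup inside the
# region-worker module. AMI IDs differ in every AWS region, so the
# module indexes a map by the region it was handed.
variable "region" {
  type = string
}

variable "ami_map" {
  type = map(string)
  default = {
    "us-west-2"    = "ami-00000000000000000"
    "eu-central-1" = "ami-11111111111111111"
  }
}

resource "aws_instance" "worker" {
  ami           = var.ami_map[var.region]
  instance_type = "m5.large"
}
```

Each module instantiation resolves its own AMI from the map, which is why the same module can be reused unchanged across all six regions.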

[00:09:49] And then we set the instance type; this is just a default. The main version of this is very simple. We initialize our key pair, because I want to be able to SSH into these instances at some point, and upload it to each region. We initialize a simple security group that allows me to SSH into each region. And then a simple instance that doesn't really have much; it doesn't even have a large root device, because we're not using it at all.

Basically we're just spinning up a single container and then pushing the logs to Datadog, which is our central log agent, so even the logs aren't being written locally. We associate a public IP address, we spin up the AMI, looking up which AMI we should use based on our region, and then we output the Worker address.

[00:10:41] So the other part of this is the Manager. The only real difference, we basically spin it up the exact same way, is that we also allow the Goose port, which is 5115, and we spin up a DNS record that points to our Manager, because that DNS record is what all the region Workers are going to point at.

And we make use of the fact that they're all using Route 53, so this update propagates really quickly.
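The Manager-only additions described above might look like the following; the security group reference, zone variable, and DNS name are hypothetical placeholders:

```hcl
# Open the Goose Gaggle port (5115) so Workers can connect to the
# Manager from anywhere.
resource "aws_security_group_rule" "goose_manager" {
  type              = "ingress"
  from_port         = 5115
  to_port           = 5115
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.manager.id
}

# Publish a Route 53 record for the Manager; every region Worker is
# configured to connect to this name rather than a raw IP.
resource "aws_route53_record" "manager" {
  zone_id = var.zone_id
  name    = "goose-manager.example.com"
  type    = "A"
  ttl     = 60
  records = [aws_instance.manager.public_ip]
}
```

A short TTL keeps the name propagating quickly when the Manager is destroyed and recreated.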

And that's basically it; it's pretty simple. Each VM is running... Sorry, go ahead.

[00:11:22] Fabian Franz: Where do you actually put in the Goose part? Because I've only seen the VM part.

[00:11:28] Narayan Newton: Yep. So each CoreOS VM can take an Ignition file. The idea behind CoreOS is that it was a project to simplify infrastructure that was based on containers.

[00:11:41] It became an underlying part of a lot of Kubernetes deployments, because it's basically read-only, in essence, at a configuration level. It can even auto-update itself. It's a very interesting way of dealing with an operating system. Its entire concept is that you don't really interact with it outside of containers.

It's just a stable base for containers that remains secure, can auto-update, and is basically read-only in its essence, and it takes these Ignition files that define how it should set itself up on first boot. So if we look at one of these Ignition files...

Okay, we can see that it's basically YAML. We define the SSH key we want to get pushed. We define an /etc/hosts file to push. We then define some systemd units, which include turning off SELinux, because we don't want to deal with that on short-lived Workers. And then we define the Goose service, which pulls down the image.

And right here it actually starts Goose. This is mostly defining the log driver, which ships logs back to Datadog; the actual logging agent is started here. This is one of the Workers, so we pull the temp Umami branch of Goose, start it up, set it to be a Worker, point it at the Manager host, set it to be somewhat verbose, and set the log driver to be Datadog, starting up Datadog so that we get metrics in the logs.
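The shape of the file being walked through above, in Butane/Fedora CoreOS config style, is roughly this; the key, hosts entry, image name, and host names are illustrative placeholders rather than the actual values used in the test:

```yaml
variant: fcos
version: 1.3.0
passwd:
  users:
    - name: core
      ssh_authorized_keys:
        - ssh-ed25519 AAAA... ops@example.com
storage:
  files:
    # Pushed /etc/hosts pinning the target hostname to one IP.
    - path: /etc/hosts
      overwrite: true
      contents:
        inline: |
          127.0.0.1 localhost
systemd:
  units:
    # The Goose Worker runs as a container under a systemd unit;
    # Restart=always lets the same VM serve repeated test runs.
    - name: goose-worker.service
      enabled: true
      contents: |
        [Unit]
        Description=Goose load test Worker
        After=network-online.target
        [Service]
        ExecStart=/usr/bin/docker run example/goose-loadtest \
          --worker --manager-host goose-manager.example.com
        Restart=always
        [Install]
        WantedBy=multi-user.target
```

Because first-boot provisioning is the whole configuration surface, a single file like this is all each region needs.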

[00:13:12] And then that's just how it runs. This will restart over and over and over again, so you can actually run multiple tests with the same infrastructure. You just have to restart Goose on the Manager, and then the Workers will kill themselves and restart.

[00:13:26] Narayan Newton: And so you get this plan, where it shows you all the instances it's going to spin up. It's actually fairly long, just because there are a lot of params for each EC2 instance, and we're spinning up 11 of them, 10 Workers plus the Manager. You say that's fine, and it goes.

[00:13:45] And I will stop sharing my screen now, as this is going to take a bit.

So is this already doing something now?

[00:13:53] Narayan Newton: Yes. And you're probably going to see one of the quirks; this is another thing I dislike about this. Because we're using CoreOS, these are all coming up on an outdated AMI, and they're all going to reboot right there.

[00:14:12] Because they come up, they start pulling the Goose container, and then they start the update process, and they're not doing anything, so at that point they think it's safe to update. And so they update and reboot. It's somewhat cool that that has no impact on anything; the entire infrastructure comes up, updates itself, reboots.

[00:14:31] Then it continues on with what it's doing, but it's another little annoyance that I just don't like. You spin up this infrastructure and you don't really have a ton of control over it.

[00:14:41] And so this is the logs of the Manager process of Goose, and it's just waiting for its Workers to connect. They've all updated and rebooted, and Goose is starting on them. As you can see, eight of them have completed that process.

[00:15:00] Michael Meyers: All of this stuff that you put together here, is this going to be available open source for folks to download and leverage?

[00:15:07] Narayan Newton: Yep.

[00:15:08] Michael Meyers: Awesome.

[00:15:10] Narayan Newton: It's all online, on our Tag1 Consulting GitHub organization, and the K3s version will be as well. And that's the one I'd recommend you use; this one's real annoying. I know I keep going on about it, but this is how skunkworks projects work: you make the first revision and you hate it, and then you decide to never do that again, and then you make the second revision. Okay, this is starting. Now I'm going to switch my screen over to the web browser so we can see what it's doing.

[00:15:40] Fabian Franz: Sure.

Great. The logs that we're seeing there, are they coming from Datadog, or directly from the Manager?

[00:15:49] Narayan Newton: That was a direct connection to the Manager. If we go over to Datadog here, these are going to be the logs. As you can see, the host is just what an EC2 host name looks like, and they're all changing, but we're getting logs from every agent as well as the Workers. You can see they're launching. If we go back to Fastly, we can see that they're getting global traffic. So we're getting traffic on the West coast, the East coast, Ireland, Frankfurt, and Mumbai, and the bandwidth will just keep ramping up from here.

[00:16:34] Fabian Franz: For Datadog, is there a way to also filter by the Manager?

[00:16:42] Narayan Newton: Sure. This is the live tail. We'll go to the past 15 minutes, and then you can go service Goose. And then we have Worker and Manager, so I can do all my Workers, and, sorry, only the Manager. The Manager is pretty quiet. The Workers are not.

[00:17:07] Jeremy Andrews: You must have disabled displaying metrics regularly, because I would have expected to see that on the server.

[00:17:12] Narayan Newton: If I did, I did not intend to, but I probably did.

[00:17:17] Jeremy Andrews: Is it easy to quickly see what command you passed in, or is it hard to go back there from where you're at right now?

[00:17:24] Fabian Franz: It's in Terraform, I think.

[00:17:26] Narayan Newton: It is all set here.

[00:17:31] Jeremy Andrews: So that's interesting. I have to figure out why you're not getting statistics on the Manager, because you should be getting statistics on the Manager. Is this the log you're tailing, or is this what's verbosely put out to the screen?

[00:17:44] Narayan Newton: This is what is put out to the screen.

[00:17:46] Jeremy Andrews: Yeah. Interesting. Okay.

[00:17:48] I would have expected statistics every 30 seconds.

[00:17:53] Narayan Newton: So what's kind of interesting is you can expand this in Fastly and see we're doing significantly less traffic in Asia Pacific, but that makes sense, considering we're only hitting one of the PoPs. And then Europe and North America tend to be about the same, but you can even drill down further.

[00:18:11] Fabian Franz: One quick question. I saw you hard-coded the IP address endpoint in the Terraform. How does Fastly still know which PoP to route to? Are they doing it through magic?

[00:18:22] Narayan Newton: You mean that I put the same IP address everywhere in /etc/hosts? Yeah. It's because of how they're doing traffic.

So it is the same IP address everywhere, but the IP address points to different things, basically. It's cool; a lot of CDNs do it that way. So instead of different IP addresses, it's basically routing tricks.

[00:18:47] Jeremy Andrews: We seem to have maxed out. Can you look at the...

[00:18:49] Narayan Newton: Yeah, this should be about it. It should be all started at this point.

Yeah. So we've launched a thousand users, we've entered the Goose attack. So we have evened out at 14.5 gigabits per second, which is, I think, what we got on one server with 10,000 users as well.

[00:19:05] Jeremy Andrews: This is more than a single server. I think we maxed out at nine gigabits on a single server.

[00:19:10] Michael Meyers: Awesome. Thank you all for joining us. It was really cool to see that in action. All the links we mentioned are going to be posted in the video summary and the blog post that correlates with this. Be sure to check out our website, that's Tag, the number 1. That's where we have all of our talks, documentation, and links to GitHub.

[00:19:33] There are some really great blog posts there that will show you, step by step, with the code, how to do everything that we covered today, so be sure to check that out. If you have any questions about Goose, please post them to the Goose issue queues so that we can share them with the community. And of course, if you liked this talk, please remember to upvote, subscribe, and share it.

[00:19:53] You can check out our past Tag1 Team Talks on a wide variety of topics, from getting funding for your open source projects to things like decoupled systems and architectures for web applications. As always, we'd love your feedback and input on both this episode, as well as ideas for future topics.

[00:20:20] You can email us at Again, a huge thank you to Jeremy, Fabian, and Narayan for walking us through this, and to everyone who tuned in today. Really appreciate you joining us. Take care.