Transcript: How to load test with Goose - Part 3: Bigger instances

This is a transcript. For the video, see How to load test with Goose - part 3: Bigger instances.

[00:00:00] Michael Meyers: Hello, and welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting. Today we're going to be talking about distributed load testing and doing a deep dive into running Gaggles with Tag1's open source Goose load testing framework. Our goal today is to prove to you that Goose is both the most scalable load testing framework currently available and the easiest to actually scale. This is a follow-up talk to one we did very recently on a similar topic. We're going to stress the servers even more in this one. I'm Michael Meyers, the managing director at Tag1 Consulting, and I'm joined today by Narayan Newton, Tag1's CTO, and Fabian Franz, our VP of Technology.

[00:00:44] Fabian, can you give us just a quick background on how this is a follow-up to our last talk?

[00:00:50] Fabian Franz: Sure. So in our last talk, we essentially spun up EC2 instances all over the world, but if you need to change something, you essentially have to destroy the cluster and redeploy it. While recording, we actually ran into the problem that we had to change something, and it wasn't easy to do that. We wanted to change the ulimit, because with Goose, if you put on a lot of users, you usually need to increase the ulimit that Linux comes with, and you obviously need to do that in the VM.

[00:01:24] And we had no real control over that because we only had a start script. So while the solution that you presented was very straightforward, very simple and easy to use, if you quickly want to iterate on something, it can take quite a while, because you have to wait for all the clusters to shut down. You really don't want any EC2 machines hanging around for 10 years, and then you are paying thousands of dollars for nothing because you ran a load test once. So it's important to cleanly shut down and then start up. But that costs a lot of time, and development time is also costly. So today we have a completely new solution, and I'm totally fascinated and excited by it.

[00:02:09] And Narayan, please tell us more.

[00:02:11] Narayan Newton: Yes. If you watched the last talk, I spent an unfortunate amount of it talking about why I disliked the thing I built to spin up EC2 instances, because I couldn't control the endpoints. And then all the things that I was complaining about happened, and we had to stop recording.

[00:02:28] So I got annoyed, and what we have today is similar to what we ran last time, in that it is still kind of the same Terraform tree, and we're spinning up CoreOS nodes in various different regions. But instead of pushing just a Goose container to each of them to run the load test, it is installing K3s, which is a Kubernetes distribution

[00:02:57] that is designed for IoT and CI and running at the edge. It's very small. It is a full Kubernetes distribution, but not in the sense that it runs everything a standard one does. They have made some changes to make it lighter and spin up faster. So now, instead of running Terraform apply and having it spin up the load test,

[00:03:23] it spins up a multi-region Kubernetes cluster, which was interesting to do. There were some oddities to doing that, because when you're spinning up a node in EC2, it has an internal IP address and an external IP address. And if you're spinning up a Kubernetes distribution in a single region, that doesn't really matter, because everyone's talking to each other via the internal IP address.

[00:03:45] But when you're doing multi-region, everyone's talking to each other via the external IP address, and that IP address does not appear on the VM at all. So that was interesting. I will share my screen. Where we were last time is basically spinning up 10 nodes: two nodes in each region, five regions. Before we started, I did that again, but with the new tree.

[00:04:14] Narayan Newton: Okay. So now we are on the Manager node. kubectl gets set up automatically on the Manager node. So.

[00:04:28] There is our cluster as it stands currently. I just ran a get nodes with wide output so I can see extended fields. You can see that we have the control plane, which is what we are on currently, and then all of our Worker nodes. You can see the internal IP addresses and the external IP addresses. If you look at one of these, like, let's look at probably this one.

[00:04:58] You can see it's even tagged by the region it's in. So we're fetching the region at spin-up and tagging each node with whatever region it is in.

[00:05:05] Yeah, these are just boring regions, but there are more interesting regions as well. So to run the load test,

[00:05:19] I have a little YAML directory here, and this is what we're doing instead of pushing the Docker image to each CoreOS node

[00:05:32] and just letting it run without any control. Now you spin up the cluster and you can submit these jobs. So this is the Manager job. It's going to ________ Workers, but we want 10,

[00:05:46] Fabian Franz: But real quick, I got lost a little bit in the translation of things. So we have EC2 instances, and they now have this huge Kubernetes network.

[00:05:58] And what are we using now to deploy Goose, or is Goose already there?

[00:06:03] Narayan Newton: Goose is not there. So this, what I'm copying up to the Manager node, is our deployment of Goose. This is a deployment telling Kubernetes to spin up one replica of the Goose Manager, and so this is going to be the Manager that all the Workers connect to.

[00:06:21] Fabian Franz: Excellent. Perfect.

[00:06:23] Narayan Newton: So we are going to create that, and you can see it creating here. And if we look back at that deployment, you'll see that we have a node selector to tell Kubernetes that I want this to run on the Manager node. I don't want it to run on the Worker nodes, because this is the management instance of Goose.
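
Editor's sketch: the Manager deployment described here (one replica, pinned to the Manager node with a node selector) could look roughly like the manifest built below. The talk does not show the YAML, so the image name, the `app` labels, and the `goose-role` node label are all hypothetical. The `--manager` and `--expect-workers` arguments are Goose's Gaggle flags as of the releases current at the time of the talk; check them against the release you use.

```python
import json


def manager_deployment(image: str, expect_workers: int) -> dict:
    """Build a Deployment manifest for a single Goose Manager pod.

    Hypothetical names: the "goose-manager" labels and the "goose-role"
    node label are illustrative, not taken from the talk.
    """
    labels = {"app": "goose-manager"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "goose-manager", "labels": labels},
        "spec": {
            "replicas": 1,  # exactly one Manager for all Workers to connect to
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only schedule on the node carrying this (assumed) label.
                    "nodeSelector": {"goose-role": "manager"},
                    "containers": [
                        {
                            "name": "goose",
                            "image": image,
                            # Assumed Gaggle flags; verify for your Goose version.
                            "args": ["--manager", "--expect-workers", str(expect_workers)],
                        }
                    ],
                },
            },
        },
    }


# kubectl accepts JSON as well as YAML, so this output can be piped to
# `kubectl apply -f -`.
print(json.dumps(manager_deployment("registry.example.com/goose:latest", 10), indent=2))
```

Generating the manifest from a function keeps the Worker count in one place, which matters because the Manager waits until exactly that many Workers have connected.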

[00:06:51] Fabian Franz: Excellent. So can I now look,

[00:06:54] as soon as the Kubernetes one is up, to see that it is waiting for nodes? Can I see the output of that as well?

[00:07:04] Narayan Newton: You can, once it's done creating the container. So what it's doing right now is it's pulling the container down from our container registry.

[00:07:13] Fabian Franz: How long does container deployment usually take with Kubernetes?

[00:07:16] Narayan Newton: It scales by the size of the container. And our container is actually quite large, because I have not put the effort into making it smaller yet.

[00:07:24] Fabian Franz: Okay. So just downloading a few gigabytes of data for the distribution and all the rest of the dependencies, et cetera.

[00:07:34] Narayan Newton: Exactly. Okay. And now this is our Goose Worker: same container, but different arguments to the container, obviously.

[00:07:42] We have some pod anti-affinity rules here, which are kind of interesting. Basically, this is me saying to Kubernetes, I don't want you to schedule this on any node that already has a Worker running, and I don't want you to schedule it on any node that has the Manager running. So it will distribute it to every single node so that there won't be an inactive one.

[00:08:11] Yeah. You wanted to see this. So, there. Before I start the Workers, the Manager is now running, and we can do a kubectl logs, and there are logs.

[00:08:22] Fabian Franz: Nice.

[00:08:23] Narayan Newton: And it's waiting for 10 Workers.

[00:08:28] So let's create the Workers.

[00:08:34] Oh, I fixed this on the other one. So what it's complaining about is these pod anti-affinity rules. It needs a topology key for each one. And so that one's responsible.

[00:08:47] Fabian Franz: So then this topology key essentially means that it's per host name, not per something else. Okay.

[00:08:53] Narayan Newton: Per host name instead of per zone, per region, per rack.

[00:08:57] Fabian Franz: So I could also decide to deploy just one Kubernetes to each region, essentially.
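
Editor's sketch: the anti-affinity rules and the topology key discussed above might be expressed as follows. Each anti-affinity term needs its own `topologyKey`, which is the error hit during the demo. The `app` labels, image, and Manager host are hypothetical, and the `--worker` and `--manager-host` arguments are Goose's Gaggle flags as of the releases current at the time of the talk. `kubernetes.io/hostname` and `topology.kubernetes.io/region` are standard well-known node labels, but whether the demo cluster uses the latter for its region tag is an assumption.

```python
import json

HOSTNAME_KEY = "kubernetes.io/hostname"        # one pod per node
REGION_KEY = "topology.kubernetes.io/region"   # one pod per region (assumed label)


def anti_affinity(avoid_apps, topology_key=HOSTNAME_KEY):
    """Refuse to schedule into any topology domain that already runs one of avoid_apps."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app}},
                    "topologyKey": topology_key,  # required on every term
                }
                for app in avoid_apps
            ]
        }
    }


def worker_deployment(image, replicas, manager_host, topology_key=HOSTNAME_KEY):
    """Build a Deployment that spreads Goose Workers across nodes (or regions)."""
    labels = {"app": "goose-worker"}  # hypothetical label
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "goose-worker", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Avoid nodes that already run a Worker or the Manager.
                    "affinity": anti_affinity(["goose-worker", "goose-manager"], topology_key),
                    "containers": [
                        {
                            "name": "goose",
                            "image": image,
                            # Assumed Gaggle flags; verify for your Goose version.
                            "args": ["--worker", "--manager-host", manager_host],
                        }
                    ],
                },
            },
        },
    }


print(json.dumps(worker_deployment("registry.example.com/goose:latest", 10, "203.0.113.10"), indent=2))
```

Passing `topology_key=REGION_KEY` would give the per-region variant Fabian asks about. With the rules as written, that would also keep Workers out of the Manager's region, so the Manager term would need a different key in that case.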

[00:09:07] Narayan Newton: And now we have our Workers spinning up. And so this is what was happening last time as well. It's just, this was happening when you ran Terraform apply. So it would bring up all the nodes and then each node would be doing this, but without our direct interaction,

[00:09:24] Fabian Franz: I think I like the Manager way more. It's nice to have a little bit of control and to easily take a look at logs.

[00:09:32] Narayan Newton: Yeah.

[00:09:39] Fabian Franz: We have already started to send users. That's great.

[00:09:44] Narayan Newton: Yep. As part of this, the ulimit issues: we were hitting those because in the old one we weren't changing the ulimit, so we had a maximum of 1,024 open file descriptors. We are actually inheriting a ulimit fix that K3s pushes in when it installs, which is helpful.
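
Editor's sketch: the limit being discussed is the per-process open file descriptor limit (`RLIMIT_NOFILE`), which commonly defaults to a soft limit of 1,024 on Linux. The snippet below only illustrates how a process can inspect that limit and raise its own soft limit up to the hard limit. It is not how K3s or the Terraform setup in the talk applies the fix, and the target of 65,536 is an arbitrary illustrative value.

```python
import resource
from typing import Tuple


def raise_nofile_limit(target: int) -> Tuple[int, int]:
    """Raise this process's soft open-file limit toward target; never lower it.

    Returns the (soft, hard) limits in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    # An unprivileged process may raise its soft limit only up to the hard limit.
    wanted = target if hard == unlimited else min(target, hard)
    if soft != unlimited and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            # Some platforms cap the value below the reported hard limit;
            # in that case the existing limit is left untouched.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_limit(65536)
print(f"open-file limit: soft={soft} hard={hard}")
```

Each simulated user holds at least one socket, so a Worker running thousands of users will exhaust a 1,024 descriptor limit quickly. The change above applies to the current process and its children only.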

[00:10:04] Fabian Franz: That's indeed helpful if they have already solved the problem for us,

[00:10:10] Narayan Newton: So that's starting up, you can see. These are Workers transitioning to the running state. We can look at the logs of the Workers as well, which is kind of cool if you think that these Workers are in various regions. I can just run something to pull logs from, like, central Europe

[00:10:30] Fabian Franz: Japan is cooler than central Europe.

[00:10:32] Narayan Newton: Yeah, that's true. Okay. And our load tests should be starting.

[00:10:38] Fabian Franz: I think it was around 15 seconds, right? It's just coming in. Yeah, there we go.

[00:10:47] Oh my God. How fast do we have to ramp up right now?

[00:10:57] Narayan Newton: I had a hundred, and you start going up and it crushes our site. So the ramp up is slower, but it kind of needs to be.

[00:11:07] Fabian Franz: For sure.

[00:11:10] Narayan Newton: And actually, why don't we look at one of the Workers so we can see the ramp up.

[00:11:17] Fabian Franz: That's freaking cool. This will all be open source on the Tag1 server.

[00:11:24] Narayan Newton: It's already there. There is one thing that you should know about it currently. I'll just talk about this while it's ramping up. When you spin up a Kubernetes cluster, there's a network.

[00:11:36] There's a backend network that all the pods communicate on. That is not a real network; it's a network that's Kubernetes-specific, and this particular distribution uses Flannel for that. It's a pluggable thing, so you can decide what you want your network control plane to be. Flannel is detecting the wrong IP address: it is detecting the internal IP address, not the public IP address.

[00:11:58] I haven't personally fixed that because there's a bug open about it with traction, but I'm going to push on the bug to fix that. So as it stands, if you want to just do something like this, where you're going to be using the host network, which is the network of the VM, not the network of the pod,

[00:12:14] that's not an issue. But if you want to do something like inter-pod communication, that will be an issue. So if you just pull it down, you would have to fix that. I'll probably dump the URI of the bug in question in the README just so that people can track that.

[00:12:29] Fabian Franz: So if I wanted to do something like that, I would need to apply a patch, or what would we need?

[00:12:34] Narayan Newton: Yeah. In that issue there's a little deployment you can deploy that will go in and change the annotation on the nodes to point to the correct IP, and then Flannel will update. So it's a bonus.
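
Editor's sketch: the talk does not name the annotation or show the workaround deployment from the bug report. As an illustration of the idea only, the snippet below builds a `kubectl annotate` command that sets a node's public IP for Flannel. The annotation key is the one Flannel's Kubernetes documentation describes for overriding the public IP, as best I recall; treat it as an assumption and confirm it against the Flannel version you run. The node name is hypothetical.

```python
import ipaddress

# Assumed annotation key (from Flannel's Kubernetes docs); verify before use.
FLANNEL_PUBLIC_IP_KEY = "flannel.alpha.coreos.com/public-ip-overwrite"


def annotate_command(node: str, public_ip: str, key: str = FLANNEL_PUBLIC_IP_KEY) -> list:
    """Build the kubectl argv that points a node's Flannel annotation at its public IP."""
    ipaddress.ip_address(public_ip)  # raises ValueError on a malformed address
    return ["kubectl", "annotate", "node", node, f"{key}={public_ip}", "--overwrite"]


# Hypothetical node name and a documentation-range IP address.
print(" ".join(annotate_command("worker-eu-central-1a", "203.0.113.7")))
```

This only prints the command. Running it for every node, for example from a small job inside the cluster, is roughly what the workaround deployment mentioned above is described as doing.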

[00:12:51] Fabian Franz: So you had to get the external IPs manually and put them into configuration or

[00:12:56] Narayan Newton: No. Well, so we're at 20 gigabits, but while that's happening... where's a good place for that? Sure. So I am running curls against the EC2 metadata service to grab the public IP and the region before doing the install, and then passing them to the installer.

[00:13:26] Which is actually a huge security vulnerability if the wrong people get access to that. But it is very useful for setting things up. That's how an EC2 VM knows stuff about itself. Oh, we got our first error. I've noticed that at about 26 gigabits per second, you start pulling errors every once in a while.
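
Editor's sketch: the talk uses curl for this step. The Python equivalent below reads the public IP and region from the EC2 instance metadata service (using the token-based IMDSv2 flow) and turns them into installer arguments. The `public-ipv4` and `placement/region` metadata paths and the K3s `--node-external-ip` and `--node-label` flags exist, but whether the Terraform tree in the talk passes exactly these, and the choice of `topology.kubernetes.io/region` as the label, are assumptions.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"  # link-local address, reachable only from the instance


def imds_fetcher(timeout: float = 2.0):
    """Return a function that reads one metadata path using an IMDSv2 session token."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    def fetch(path: str) -> str:
        req = urllib.request.Request(
            f"{IMDS}/meta-data/{path}",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode().strip()

    return fetch


def node_identity(fetch) -> dict:
    """Collect the two facts the install step needs."""
    return {"public_ip": fetch("public-ipv4"), "region": fetch("placement/region")}


def k3s_agent_args(identity: dict) -> list:
    """Translate the identity into K3s flags (the label key is an assumed choice)."""
    return [
        "--node-external-ip", identity["public_ip"],
        "--node-label", f"topology.kubernetes.io/region={identity['region']}",
    ]
```

On an instance, `k3s_agent_args(node_identity(imds_fetcher()))` would produce the arguments. Taking `fetch` as a parameter means the logic can be exercised off EC2 with a stub, for example `node_identity({"public-ipv4": "203.0.113.7", "placement/region": "eu-central-1"}.__getitem__)`.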

[00:13:47] Michael Meyers: Is there a corollary chart to our Fastly bill here?

[00:13:51] Narayan Newton: Yeah. Yeah, there really is. We've been testing this a while, and I don't know, it just never occurred to me that the Fastly bill would be high, but it is.

[00:14:05] Fabian Franz: Man, that's amazing.

[00:14:06] Narayan Newton: I think we should be at around 25 at this point. Let me see where we're at. Yes, we've launched all the users.

[00:14:17] So we should just sustain at around here.

[00:14:20] Fabian Franz: And this, people, is how you are testing a CDN, as you can see.

[00:14:26] Narayan Newton: And you can see we have fewer locations near the Asia Pacific POPs, but we're holding five gigabits per second in Asia Pacific, and then 10 and 10 in Europe and North America.

[00:14:38] Fabian Franz: Yep. And just to reiterate that a little bit: our Goose users are not real users. They are usually way faster.

[00:14:47] That's why they also create discrete benches, but every user is downloading all the... I mean, we have little breaks in there, but all of the users are also downloading all the assets. So when a page is loaded from Umami, like a nice recipe, which we are talking about, then all those images are also downloaded, like real browsers browsing the site. Only the JavaScript is not executed, because we're not doing that.

[00:15:10] But it's really parsing a lot. It's ensuring everything is correct in that. And you're doing this with 2,000 users on each of 10 nodes, 20,000 users in total, right?

[00:15:24] Narayan Newton: So 2,000 users per node over 10 minutes, and two nodes in each region.
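
Editor's sketch: the numbers quoted in this part of the talk fit together as follows. Reading "over 10 minutes" as the ramp-up time per node is an assumption; the per-region bandwidth figures are the approximate values read off the dashboard in the talk.

```python
USERS_PER_NODE = 2_000
NODES_PER_REGION = 2
REGIONS = 5
RAMP_UP_SECONDS = 10 * 60  # assumption: "over 10 minutes" is the ramp-up window

nodes = NODES_PER_REGION * REGIONS
total_users = USERS_PER_NODE * nodes
# Users started per second on each node during ramp-up.
hatch_rate_per_node = USERS_PER_NODE / RAMP_UP_SECONDS

# Approximate sustained throughput per CDN region, in gigabits per second.
gbps_by_region = {"Asia Pacific": 5, "Europe": 10, "North America": 10}
total_gbps = sum(gbps_by_region.values())
# 8 bits per byte: gigabits per second to gigabytes per second.
total_gigabytes_per_second = total_gbps / 8

print(f"{nodes} nodes, {total_users} users, {hatch_rate_per_node:.2f} users/s per node")
print(f"{total_gbps} Gbit/s total, about {total_gigabytes_per_second} GB/s")
```

The distinction between gigabits and gigabytes matters here: the sustained rate is about 25 gigabits per second, which is roughly 3 gigabytes per second.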

[00:15:30] Fabian Franz: Yeah. So those 20,000 users are now hitting this site real hard.

[00:15:35] Probably a real user doesn't click as fast as we are clicking here, so it's probably more like 200,000 or so real users generating this kind of traffic. Amazing. Really cool. Oh, and we have an error.

[00:15:49] Narayan Newton: And this was the error we got. On our old setup, I would have had no real way to do that. We were pushing logs centrally (and by the way, our bill for the central logging was also great), but there was no good way to separate and look at individual Workers or anything like that. So this is a big improvement as far as manageability.

[00:16:13] Fabian Franz: Yeah, Datadog probably was not as amused with that many logs.

[00:16:17] Narayan Newton: No, they emailed me.

[00:16:22] Yeah. "Can we schedule a call to discuss your new usage?"

[00:16:29] Fabian Franz: Sorry. It was just a one off. We need to show the world how to test Fastly. What's nice about Goose and what you can see here is that every error is very nicely reported. Goose just got a new patch in that also allows us to get an overview of all the errors that ever happened like for all the Workers and everything.

[00:16:46] So this will be a very nice new feature that's launched in the next release, so that you don't have to look through the log to see what errors you have, but you'll get an aggregated per-error-type report at the end. And I think we can essentially stop. We can see Fastly is handling 25 gigabits per second easily.
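
Editor's sketch: to illustrate what an aggregated per-error-type report means, the snippet below merges error records from several Workers and counts them by request and error. It is a generic illustration in Python, not Goose's implementation or output format, and the Worker names and error records are made-up sample data.

```python
from collections import Counter


def aggregate_errors(errors_by_worker: dict) -> list:
    """Merge per-Worker error records into (record, count) pairs, most frequent first."""
    totals = Counter()
    for records in errors_by_worker.values():
        totals.update(records)
    return totals.most_common()


# Hypothetical sample data: (method, path, error) tuples reported by each Worker.
sample = {
    "worker-eu-1": [("GET", "/en/recipes", "503 Service Unavailable")] * 3,
    "worker-ap-1": [
        ("GET", "/en/recipes", "503 Service Unavailable"),
        ("GET", "/en/articles", "connection reset"),
    ],
}

for (method, path, error), count in aggregate_errors(sample):
    print(f"{count:>4}  {method} {path}: {error}")
```

The point of aggregating is that one line per error type, with a count, replaces scanning every Worker's log for repeats of the same failure.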

[00:17:08] Narayan Newton: There's no, there's no real reason to make our bill higher.

[00:17:14] So what we can do is just delete the deployments and just terminate them all.

[00:17:23] Fabian Franz: Not the nicest way to stop a load test, but yeah,

[00:17:28] Narayan Newton: No, no, but it is very forceful.

[00:17:33] Fabian Franz: Yeah. In theory we could have also given the Manager, essentially, a stop signal. And then it would have given us a nice end-of-load-test report, which can also be in HTML. We can show that again some other time, but yeah.

[00:17:47] Narayan Newton: Oh, and with this, one thing you can do, not right now because they're terminating, but if you want to, is actually exec into these containers.

[00:17:57] So you can pull a bash prompt from any of these containers, even the ones in different regions.

[00:18:06] Not something I did. That's just Kubernetes.
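
Editor's sketch: the cluster operations shown during the demo (listing nodes with their region, reading logs, opening a shell in a container, and deleting the deployments) map onto standard kubectl invocations. The helper below only builds the command lines. The deployment and pod names are hypothetical, and the region label column assumes the nodes are tagged with `topology.kubernetes.io/region`.

```python
import shlex


def get_nodes(region_label: str = "topology.kubernetes.io/region") -> list:
    """Wide node listing (internal and external IPs) plus a column for the region label."""
    return ["kubectl", "get", "nodes", "-o", "wide", "-L", region_label]


def logs(target: str, follow: bool = False) -> list:
    """Logs for a pod, or for a deployment when given as deployment/<name>."""
    return ["kubectl", "logs", target] + (["-f"] if follow else [])


def exec_shell(pod: str, shell: str = "bash") -> list:
    """Interactive shell inside a running pod, in whichever region it is scheduled."""
    return ["kubectl", "exec", "-it", pod, "--", shell]


def delete_deployments(*names: str) -> list:
    """Forceful stop: remove the deployments, which terminates their pods."""
    return ["kubectl", "delete", "deployment", *names]


for argv in (
    get_nodes(),
    logs("deployment/goose-manager", follow=True),
    exec_shell("goose-worker-abc123"),
    delete_deployments("goose-worker", "goose-manager"),
):
    print(shlex.join(argv))
```

Deleting the deployments ends the test abruptly, as noted in the talk, so no end-of-test report is produced that way.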

[00:18:10] Fabian Franz: I mean, for sure. No, I really love this new K3s. Is it an acronym for something, K3s?

[00:18:17] Narayan Newton: No. So the abbreviation for Kubernetes is K8s, so K3s would be their joke, as it's a lighter version. It's really cool. It's literally a single binary, and the install deals with all the prerequisites and then places the binary and spins up a systemd unit file that sets everything up.

[00:18:39] Kubernetes by default has an HA multi-instance data store, and they replaced that with SQLite. That sort of thing.

[00:18:50] Fabian Franz: This feels really, really cool. And I think that's really nice too, to have a multi-region service running so easily. Great.

[00:18:58] Narayan Newton: Yep. I think it's a step up from what we had before.

[00:19:02] Fabian Franz: Absolutely. Not only from what we shared before, but also from what you had to do before that. I remember SSH-ing into four different machines to start a Locust test, manually starting eight Workers on each. And so, yeah, it's really nice to have everything automated that way.

[00:19:21] Narayan Newton: And now I'm just bringing it down.

[00:19:27] Michael Meyers: Awesome.

[00:19:27] Thank you guys so much. That was really cool. I look forward to our Fastly and Datadog bill. I really appreciate you guys coming back to show us how that works. We will do another Goose webinar in our series here in a couple of weeks. So please stay tuned.

[00:19:43] We're going to make this a regular series where we show you how to use different features and functionality as we release them, and also show you how to use the tool to profile websites and effectively performance tune, not just get it up and running and slamming your site with traffic. The links we mentioned, we'll throw into the show notes and the description.

[00:20:03] You can check out these other Goose talks at Tag1.com/Goose. That's where we have links to documentation and code and all of the talks and videos that we've done. If you have any questions about Goose the product and how to use it, please head over to the GitHub issue queues and ask them there. And if you want to get engaged and contribute, we'd love it.

[00:20:26] If you have input and feedback on the talk itself, please contact us at Tag1teamtalks@tag1.com. Please remember to upvote, share, and subscribe with all your friends. You can check out our past Tag1 Team Talks at tag1.com/ttt for Tag Team Talks. Again, a huge thank you to Fabian and Narayan for joining us, and thank you to everybody who tuned in and listened today.