This is a transcript. For the video, see How to analyze Goose load testing results using New Relic.

[00:00:00] Michael Meyers: Welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting. Today, we're going to be discussing the practical side of load testing, and we're going to walk you through how to analyze load testing results using New Relic, so you can find the problems that are impacting your site and address them yourselves, using the Goose Load Testing Framework.

[00:00:18] For our example, today we're going to be load testing the Drupal framework in CMS with the default Umami install, and we're gonna be trying to find a performance problem in Drupal core. That we're of course then going to fix. I'm Michael Meyers, the managing director of Tag1 Consulting, and joining me today are two of our performance and scalability experts.

[00:00:39] Jeremy Andrews the founder and CEO of Tag1 and creator of Goose, and Fabian Franz, our VP of Technology, a major contributor to Goose and the man responsible for a lot of Drupal's performance and scalability capabilities. Jeremy and Fabian. Welcome. Thank you guys for joining us again.

[00:00:58] Before we get started, I just want to say a quick thank you to Pantheon. is providing us with a lot of free hardware for our Drupal instances. We're using their elite instances with their global CDN, which is a huge asset when you're throwing a lot of traffic at websites. So thank you, Pantheon.

[00:01:16] Jeremy, could you give us a quick intro and background on the plan?

[00:01:21] Jeremy Andrews: Yeah. Hello. So today we're going to, as Michael said, we're going to run a quick Goose load test against Umami site. It's Drupal with the standard Umami install. Goose is running on a R5N 8x, large EC2 instance.

[00:01:38] So I will be connecting in there right now and get this set up to make sure everything is working. Yeah.

[00:01:44] I've, pre-typed a command here. I'm in the Goose directory. The first thing I did was I actually increased the ulimit cause when simulating lots of users, you can run out of resources if you don't do that. with the standard install we're using Cargo, Rust's package manager, to launch Goose.

[00:01:59] And we're building the example that's in there. So it's an unmodified version of Umami. We run against the host, which is The dash at ease shows the output. I'm going to start off with just running with three users just to make sure everything's working.

[00:02:13] And sure enough, it should kick off quickly, log us in once it's all up and running, it spits out some statistics. And now it's running a load test with just the three users. I chose three users because this particular load set is configured with with this example has three users. It has anonymous user browsing and English pages.

[00:02:31] It has an anonymous user, browsing Spanish pages, and then it has an admin user that actually logs in and posts a couple of things. So with it all up and running, I'm gonna go ahead and cancel it because we saw that it worked. it's going to tell all the users to shut down and then we'll just go back real quick and make sure of a couple of things.

[00:02:49] One, that everything ran at least one time. It was working. So in the first one in the English user, you can see it was working through all the tasks. He didn't quite get as far as contact form, but that's because we didn't leave it running very long. Spanish, same thing. It got all the way almost to the, to the contact form and the admin user successfully logged in, which was the part I cared about the most.

[00:03:09] So now what we're going to do is spin up a more realistic load test. We're going to kick off 1200 users. But we're going to launch them at the rate of two a second, and they should all launch in about 10 minutes, but we'll let it run for about 20 minutes so that there's ample time for it to put some load on the system.

[00:03:25] So kicking that off, it would help if I accidentally wrote the collector, I always think of hatch rate with an R for rate, but it's H for hatch. So

[00:03:37] Fabian Franz: I think you were missing a T on the 20 minutes. We need a T for time. And I think it's R too. That was correct.

[00:03:46] Jeremy Andrews: Let's try this one more time.

[00:03:50] Fabian Franz: There we go.

[00:03:51] Jeremy Andrews: Alrighty.

[00:03:52] And with that, I'll stop the screen share so that Fabian can show us what this traffic looks like.

[00:04:00] Fabian Franz: Great. Hello. Let me share my screen now. So Google Chrome share. Here we go. So we have the I brought up here the Umami site. So for those not knowing it, it's a beautiful Drupal 9 brochure like site it's Umami food magazine.

[00:04:20] It has articles. It has recipes. It has lots of nice things here. Like a vegan chocolate and nut brownies. I think if you want to do that right now feel free while continue watching. And as you can see, we are now hitting the site already with quite some traffic, which we see the side still performing pretty snappy.

[00:04:41] So that's really great. Now let's take a look at how this all looks in, in New Relic. So we're gonna use this, the default view of New Relic. Yeah. And we have like two different views available. Now there's a new view of New Relic that has been recently coming. and there's the old view?

[00:04:59] I personally like the old view a little bit better, so let's switch back and that's also the one I got certified on. So I'm a New Relic certified professional. So what we can see here is we are starting to get traffic. So right now we are still at 19:00 PM. So not that much yet, and because we don't want to wait lots and lots and lots of time, we just choose different part of our time window.

[00:05:25] So I'll be looking at six hours ago and let me see this big thing here, screwing everything up. This kind of happens after we've run a few load tests. We still need to figure out what happened here actually but like just push out this outlier and we can see here, the load test we've already run. And we can see here.

[00:05:44] This was a 1,500 user test, I suppose, a 750 user test. This was a test that actually did go up to 2500 end users. And if you look at the graphs you can to see two things that are interesting here. And the point is that you can see these graphs here kind of flat there's up and downs obviously.

[00:06:06] But if you look, the database section and the PHP section are kind of the same, however in this graph, what happened was the database did go off the roof. So just to show that again, in, in, in isolation, essentially, let's just look at the database so we can see the more users we put on it in the more we done it, let's make me focus just on this load test real quick.

[00:06:31] The more users we had essentially the larger, this did go here and here we also show so Pantheon elite is great. And we have to understand one thing, essentially, all that it goes through and that Pantheon had to deal with are little, the things cleared by the cache of that admin user editing, some content, the searches and the contact form posts and everything else essentially is already buffered out by the Pentagon global CDN.

[00:07:00] And so all the anonymous traffic, we're not even seeing here, but that's how it should be. because. Mostly it offers

[00:07:06] Jeremy Andrews: No that there's so many pages, Fabian, that some of the traffic still does get to Pantheon for a considerable amount of time.

[00:07:13] Fabian Franz: Yeah. Some traffic does get through, but at least it does not get through 1000 times because that's another advantage of CDNs like Pantheon global CDN, they have a kind of shield feature so if there's like like 1000 users coming in.

[00:07:26] Then essentially only one user gets pushed through to the edge servers in the end. And the other part-timer that we are seeing is where we increased the number of users more and more. Our scalability from here started to drop off. So the more users we put, actually the less throughput we got and the reason is that the database really did go to problematic Heights here and this authenticated traffic or traffic that goes to those through to the web server and is not fronted by the CDN is so important because oftentimes we deal with commerce sites. Well obviously you want people to purchase something or they want to fill out a donation form, or they want to fill out a contact form, well, even as simple as sign up for a newsletter and all of those it's data you then want to have, you don't want your database to only be like static.

[00:08:24] But you also, where Drupal's already really good and CDN can really help scale you up, but also for this authenticated traffic where users actually interacting with the site. And this is kind of what we see in this example here: that the traffic is scaling way up in the database doesn't scale anymore.

[00:08:42] But let's quick, take a quick look at a load test where it still did scale. So we go back to our six hour view, half this big chunk in here the 750 users, one just looking at that one. Okay. So here we have one where we know where we put a Goose 750 users. We also see, we have a nice little ramp up.

[00:09:03] This is something I like to do and I like to do it also really much slower still. However, it was so much fronted by the CDN. It was okay to go two users per second. And it's essentially showing how we get more and more and more and more users, till we have to saturate it. And then we, we keep kind of the load test the running for a while and then it goes, goes off again.

[00:09:28] And When I do a test like this and not directly go to the many users, but have a slow ramp up New Relic gives me something really great. And that's a scalability report. So if you go up here over to scalability and didn't take my zoom, that's the wrong one. Okay. Let's go.

[00:09:53] Sometimes New Relic in the new RQI does not take the zoom you want. So let's take that again and click apply. And hopefully if not, we just have to remember the times essentially that is go go one from 1:00 PM to 2:24. Yep. That's the one we want a slow ramp up. And now if we go to scalability, hopefully it keeps it.

[00:10:20] Yes, it does. Because what we want to see is we want to use time selection. And now we can see a really nice graph here up to a 4k RPM. And we see this is kind of flat, and this is what we want. We want a flat graph easily, so it can grow a little bit linearly. but it should be flat. So regardless how many more users I'm putting on it?

[00:10:44] I want a flat graph. And again, if you check the time window selection on the database here, we also see not as flat, but essentially we have this line here at 225 millisecond. And here we have also this kind of wall that we've hit. However, if we look back into the larger time window and look at this, this very first test where we really stressed out the system.

[00:11:09] Didn't take it now. It takes it. Apply to one Oh seven where we had the scalability problem, which was also clearly visible here. we can also see this scalability problem now clearly here. Again, we use the time window selections; CPU is kind of flat, but if you look at the database in the time window selection, you see it goes all over, but you also see something really interesting.

[00:11:40] So we have the same line at 200 milliseconds, and then we go up and go over to the 400 milliseconds. And then we go down again. Why is that? The reason is the number of users and that's not really shown here. but the concurrency at 3K users in the end was much larger. So essentially we just cut off the graph here.

[00:12:02] And this is our problem right here and there. so this is how you read the scalability report, but that's now. Look, the live view. So because our load test is running and here we go. this is a test that will be running up to a 2,200 users, 1,200 users and push away,

[00:12:21] Jeremy Andrews: 100 users. It's almost that scale.

[00:12:24] Fabian Franz: Oh, we're almost there already. Perfect. Great. So we can see, we are scaling nicely up here and again, the RPM is starting and we did, didn't want to use more time here, but with Goose you can even use a fraction of a user. So we can use, for example, 0.2, which we did earlier last week. to really show a very slow ramp up.

[00:12:46] And that gives you lots of great information, but now let's look at some of the traces because essentially when you're running such a load test, you'll always have some requests that are slow and that gives you a great ability to find find problems on the side. Yeah. essentially just to show real quick of what we have here, there's the distributed tracing which we don't have because there's no things for that.

[00:13:09] Then we have this browser applications service, mySQL, Redis, and external. That's essentially how the system is built, gives you a great overview where the hardware used. Then there's an edge case clearer, which we are also contacting sometimes when we are editing content and changing it, there's always a transactions like a node view controller, the contact controller for the contact page search controller for searchers an empty form controller for showing that.

[00:13:37] And a view paid controller, so there's lots of things. And even our, our darling, the page cache of Drupal, serving Drupal since Drupal 3, well and hastily. Yeah, and we can see page caches is really performing really well. We don't have too much throughput here. but we can see essentially all the page cache does is Redis connect and Redis get and that's it.

[00:14:00] So no slow traces. And we see on average, it's around 17 milliseconds, per request. So this is all you can can see here. And yeah, those are the responses where Jeremy said that are still getting through here. We see some like 36 RPM. It's a fraction that hits page cache, but still Drupal's protecting us.

[00:14:21] So it's pretty nice. then we have, for example, the node view controller, and now we see something interesting. New Relic immediately shows us what's going on here. So we have this controller view and this essentially the PHP codes that's running, but then we have this one thing that looks really interesting.

[00:14:38] That's no division select. So I just remove the response time and we see this makes a huge difference in, in what we are seeing here. Then we have key value select we still need to build a module or a module still needs to be finished essentially to front a key value store with Redis, because a lot of that is conflict reads and a lot of those conflict rates are repeatable, so they could be cached perfectly - something to do here.

[00:15:04] So, okay. Let's assume we have that done. Page alias select, okay. Past alias was it changed in, in entities already, but seemingly not yet having the entity cache. Maybe there's something possible there. So that's another candidate for caching and then we have node revision body select, but it's tiny.

[00:15:23] So let's remove that. So we even see it, note, field revision, select, okay. That's that's that, and then there's other, but essentially we can see the biggest problem here is the node revision select. And now we can also see this in a breakdown. We spend 78% of the time on the node view controller for an authenticated user.

[00:15:43] That's looking at this to see this. So

[00:15:47] Jeremy Andrews:Tthat's the authenticated users and not anybody else. You said it's an authenticated user. How do you know it's an authenticated user?

[00:15:56] Fabian Franz: Because the, the probability of it being an authenticated user is much, much higher than it's just being a random user for the frequency of calls we have.

[00:16:06] Yeah. We also see in a moment. So let's take a look at this trace. Maybe we see something interesting. So now we look at one trace where we have some kind of problem with the node revision, select where, but essentially 96% of time spent here. And we can also see that sometimes spent in the renderer.

[00:16:25] Redis, team manager, but that's all not our problem. It's really no division. That's showing this now we need to figure out why first of all, this might be just set. This is a very slow query. So this is the query and we can see if you read some other take one articles, there's an left outer join in here, and then there's an inner join.

[00:16:47] And why joins a fine on databases inner joint? Usually much, much cheaper than if you're doing left joints or left auto joints. So there might be a query that, maybe it's just bad because it performs really bad. That's what we can see here. And now we can look because it took so long. New Relic is essentially providing us here with the database view and we can look at where does this called?

[00:17:13] We call it late supervision and to view builder. So yeah, this seems to be something deep in, but here's something interesting. Here is a module that's click edit module. So there might be something that's that's doing that. And there we see, it's quick edit and review alter. We could have come to the same conclusion using a different trace.

[00:17:32] Just looking at the core stack spent essentially what each function calls each. And then again, we go way down here and see the And note revision select here and can go back and here seeing it's a quick edit interview, older that is creating that. And especially this function is latest revision, get latest revision ID.

[00:17:57] This seems to be one that's really slow and, and screwing us up here. Now, the question is, does ------ entity even need the latest revision or would it be okay for example, it being the default revision, maybe that would be an easy optimization we could be doing because each entity now is if it's a current revision or if it's a later revision.

[00:18:19] So we'll see what happens like that. But for now, what we can see is we've identified successfully that this quick edit interview altar is making things slower. And that brings us to this that is most likely it's not an authenticated user. And most likely even authenticate admin user because probably only an admin user would have this quick edit.

[00:18:42] However, imagine a site with a huge group or whatever. And you were like, Hey, I want to build like a little JIRA-like interface. And you have lots of users in the forum where they can essentially just use click edits to input a title instead of directly having to edit the posts. And then you're using quick edit for normal authenticated users, not admin users, and then you would run into this exact performance problem.

[00:19:08] So it's not just pure theoretical, but it's something that happens. And while it here only happens on scale if you have like a slower mySQL instance, because you're using like a Lando, or DDEV or whatever, local Docker instance locally then you could definitely also be had to hitting this because it can happen that Yeah, this is this is showing that, so this is what, which we could be getting out of this load test already.

[00:19:37] So let's take a look at some more. For example, we can now take a look at the context site page, and there we fortunately don't have revisions, but again, we have some key values select showing up. So if you want to scale the database, we want to remove all the queries as much as possible.

[00:19:55] And we see essentially the big things in core, that's still hitting my SQL and then Drupal 7, we had kind of the goal of of no database access at all. For example, key-value select, pass alias select, menu tree select, conflict select, and then the end we only have Redis get and other, so there's that.

[00:20:17] And here we see a lot of the code is it's essentially just PHP. Why are we spending most of the time? And then the queries are neglectable, but for scaling, it's much easier to scale a brighter center, scale up a database besides just buying a bigger database, which you can do on various cloud platforms obviously.

[00:20:41] And search controller just recently, we had the question of, Hey, what about this Drupal core search? Is it fast enough? Because in all the versions of Drupal core search was a little bit of like, Hm, maybe the search query isn't as optimal for that. And now we have a search API, which for example also has a database backend.

[00:21:04] So what about core search? And it turns out core search actually is pretty okay. Because it's the search index select query, is not much time off the whole thing. On average, it takes two millisecond. So, and this is another ----. That's fine. core search seems to be pretty okay to use. And it also doesn't it does also scale up.

[00:21:28] So even if we get more throughput for the search query we get definitely the same response time. there's some max values. So that's always, when the database, a little bit overloaded, like max three, five, three milliseconds, etc, which again, points to give the database less to do, and you'll have more fun ways to a search queries or other important queries that need to be live and so on.

[00:21:51] And this is a databases view of, of New Relic also really interesting because you can see all database overview, again, not revision, select shows up key value pass alias, and just then Redis and menu tree select. So those would be the three things on the site. we can optimize now and concentrate on optimizing to then run another load test and find the next problems.

[00:22:15] And here you can see the most time consuming queries of all, we can also swap as a slowest query time. So not revision watchdog. Yeah. He has watchdog delete. So I'll be using, right now DB log seemingly should probably replaced it with Twistlock. So we are not logging to the database, which can take some time and that's only running on Drush.

[00:22:38] However, so not too big of a problem. then we have Pantheon queues, key value expire. It's also an interesting thing which we could, could move to Redis probably because there's a module for that, for the expiring key-value store to give them that and such an extended. Again, we can look at the old database operations table.

[00:23:01] We can sort by throughput, for example, so select to see often key-value select this most often Pacey CLLs menu tree. so we see key-value as a query much more often than for example, the note revision, but note revision is much slower. So that might just be a bad query. Have to look at how it's built essentially.

[00:23:22] Jeremy, do we have questions so far?

[00:23:31] Jeremy Andrews: I had some, I should have written them down because I've moved on with my brain. Sorry.

[00:23:37] Michael Meyers: How did people do all this without New Relic?

[00:23:43] Fabian Franz: Almost impossible to do it without New Relic. It's I didn't know. New Relic is still one of my most favorite tools and I was beyond delighted when they both open sourced their PHP agent, as well as made the pricing much more fair, I would say because it's now based on the data you are accumulating and less on how many instances you're running it on .In a Docker world with lots of thousands of instances would if not on a scaled.

[00:24:14] So it was a good move by them, in my opinion, to change their licensing model there. And better. Yeah, it's great. And with Pantheon and you'll get New Relic automatically as part of your plan. So this is just great. And you see Pantheon is scaling perfectly up. We accidentally did one load test on dev and then Pantheon was like, huh, this instance might be a little bit too small for this kind of traffic.

[00:24:40] And then they scaled it up even a little bit. So it was fun that they immediately looked at the traffic on automatically, and saw as something is wrong and scaled up. And this has happened with elite as well. So yeah, and we can see how much more we can push this, essentially. In terms of concurrency, once we have solved this database problems and we can could also essentially disable the external CDN and then do a lot of tests for just the anonymous part.

[00:25:08] And then look at the little bit more into Drupal there, but you probably also need to disable page cache, and, dynamic page cache, because, well, we have a lot of great caches in Drupal, by now, and they all essentially allow Drupal to scale so away. Well, he is like a little Drupal Views thing.

[00:25:28] So they see recipe collections right now takes the most time in that. And it's also called quite a lot. probably one of our admin users are calling that and there are some articles aside sidebar view which we have here was much Monaco account. So you can see which of your views are essentially working well, which are not working well.

[00:25:51] We can see recipes is the most slowest average call time. view right now, A function called count recipe. Collections is called the marvelous. So yeah, this is the load test.

[00:26:04] Jeremy Andrews: Good job because you're right. The the recipe page is one of the ones that the logged in user does and the only other one that calls the front page other than editing a node and the initial login, of course,

[00:26:17] Fabian Franz: yup.

[00:26:17] Then there's Drupal hooks, which is also really interesting. And I love New Relic that you implement already for Drupal 7. And now for Drupal 8, 9, 10. Also this, this hooks implementation essentially because we can immediately seen that hook. MTXS is kind of using, it's consuming the most time, ends up with systems.

[00:26:35] So probably there's something in the entities system access system that takes a bit, a little bit longer than we have here, language fallback can candidate interview all the time, which also takes some time and called quite a lot. And then we have page attachments for mile to obviously, et cetera.

[00:26:55] So again, we can, can look at a lot at, and how much time is spent, which is kind of the energy cost you're paying or the payment in terms of CPU cycles. So if you're, for example, on an EC2 instance where you have a certain number of cycles, then at this most time consuming, directly relates to that, the more cycles you spend.

[00:27:18] Yeah. The more money essentially. And it's, you're on a fixed pay host. Like Pantheon on the more as a cloud. Let me see. The slowest average call time is to view data hook not called a lot of, probably just some Drush or whatever, a cron job, but ND view olders also called a lot. And we now, why we now because of quick edit, because quick edit makes it slower.

[00:27:43] And this brings to average way down. Function call count, feed access, empty access. So we can again, get a, get a lot of ideas of what's taking how long what's just called a lot, but does not really take long. Like this seems, suggested block, it's called a lot, lot, lot, lot, but it's just 0.2. 0.53 millisecond so. Really does not matter. We can also overall see which module is the most time consuming. And here we can see again, as we are logged in admin user content moderation is, is taking the top spot here.But it's yeah. And it's also called quite a lot and then content translation is next. So again, we get a, get a nice idea and overview of, of what takes what, and so can translation may be also is doing some stuff that's a little bit more time consuming.

[00:28:42] And again, we have those things and here again, quick edits shows up as a top spot as lowest average call time. So we now something is opposite. We already figured out what, but if we hadn't yet to, for the traces, it would, this would have given us another entry point of finding, Hey, what's going on with the system. Why is this performance bad?

[00:29:03] Then we have errors here. Pantheon is really great that regardless how much traffic you put there's no errors here. If we essentially sustain all their posts and even larger instance then this is ended before New Relic even sees it. They said advantages and disadvantages at the same time.

[00:29:22] Usually it just means that you just have to wait a little bit longer. They may have like this violations is if you said some performance metrics that you want to hit deployments really useful, if you say, Hey, was this last release really we regressed and we can maybe show that some other time when we improve Drupal core and hopefully fix those problems.

[00:29:42] And so there's lots of nice things. Yeah. And thread profiles and available. Obviously there's more reports which you can go more in, but that would be kind of like the nice overview of of what we, what we have

[00:29:58] Yeah, I think it would be kind of through, we can also look at our load test and since 30 minutes ago we are still running a little bit, but shortly for wrapping up. So I think we should be wrapping up also on load test perspective. And again, we can see, we hit this 3.72 K and then we get to go a little bit down, but overall, the scalability till 1,200 users are still fine, but here a little bit of a peak could now analyze this.

[00:30:27] Sometimes you find things by chance. So we also have some more, way slower queries previous. We'll see larger thing when he here even 94 points, 94 seconds, let's see what's up with that. And here we see just this response sent this essentially meant this web path try to to send a response, but couldn't because something else was, was blocking it for being scheduled again.

[00:30:55] So it was just waiting for a very, very long time to actually send out response. So this is kind of like a false positive. Which is why I like to also look at the to traces that I'm seeing while the load test is just ramping up, but not just when it's running for a very long time already. So if you looked at the more first part of this test, and I have to do it here again, because it does not take it, take

[00:31:23]Jeremy Andrews: it also just finished.

[00:31:27] The load test did just finish as well.

[00:31:29] Fabian Franz: Yep. Perfect. And we look at the transactions within this timeframe in summary. Let me see here we not yet had those crazy request just 16, six sections and could now check again, but probably just node revision again. Yep. Node revision select, sorry.

[00:31:52] This is where we stop, essentially, because we need to fix this problem first and then we can continue. Oh, and one last thing - very important. If you look at database things and you see lots of cache key, cache node, cache, whatever things, then your Redis is not set up correctly. That was what we had first and they fixed it and then it was much quicker already.

[00:32:17] So yeah.

[00:32:19] Michael Meyers: Awesome. That was really cool. I think, you know, what fascinates me is that there's so many components to this, you know, knowing how to set up a load test, you know, and how to properly configure, you know, anonymous versus authenticated traffic and how you split out your users that would interpret the results and find the problem, fix that problem before you move on to the next set of testing.

[00:32:42] So we're gonna continue to cover topics like this. We've touched on a lot of what I just mentioned in conference presentations and other Tag1 Team Talks that give you insight into how to plan your load test and the process. so it was really great to cover this because we haven't really dug into to this aspect much yet.

[00:33:01] You can check out for more information on those other topics. You can find the documentation for Goose, the GitHub links, as well of our other Tag1 team talks, blogs and conference presentations. You guys have any questions about Goose? Please use the GitHub issue queue. If you like this talk, please remember to subscribe and share it out.

[00:33:25] You can check out past Tag1 team talks at As always. We'd love your feedback on this show. If there are other topics that you'd like us to cover with Goose, or in general, please email us at and again, you know, Jeremy, Fabian, thank you guys so much for joining us and walking us through that. That was, that was really cool to watch. When I say a huge, thank you again to Pantheon for providing all of the hardware for our Drupal instances and their global CDN, I saw our last CDN bill. So I'm particularly grateful. You know, we throw a lot of traffic at these sites in our tests and it's we're really appreciative for that.

[00:34:06] And of course everybody who tuned in and joined us today. Thank you all for joining us. We'll see you soon.