This is a transcript of the A09 - Security Logging and Monitoring Failures training session.


Greg Lund-Chaix: [00:00:00] Well, hello everybody, and welcome to a pivot from development to infrastructure. So this, this is all infrastructure all the time now. Um, my name is Greg Lund-Chaix, and here, I'll pop up my info.

Let's get... there we go. Um, I am an infrastructure engineer with Tag1 Consulting. I've been doing this for close to 30 years now; I think I started in the early 90s. So I'm here to talk, initially, about security logging. But I would like to invite all of you to speak up and interrupt me, because I can riff on this stuff for hours and I want to make sure I'm talking about stuff that's actually interesting to all of you.

So please don't hesitate to speak up and ask questions and interrupt me because I think that'll make it a much more interesting presentation.

So.

For this first segment, what I would really like to talk about is, kind of [00:01:00] in descending priority here: what these various things mean, why they're important, and some of the tools that matter for logging, both at the application level as well as the system level.

And then: what do we do with those logs once they've been created? One thing I'm going to keep coming back to as a recurring thread is that doing something is better than doing nothing. It's one of the reasons these are in this order: first of all, you have to log something.

Then you have to do something with those logs, then you have to actually analyze those logs, and then notify somebody about it. Intrusion detection is kind of off on the side a little bit, and we'll get to it later, but it's important to note that if you don't log anything, none of the rest of this stuff matters. So that's where we're going to start. [00:02:00] And really there are two branches to this. There's the logging that happens within your application, say a web app or something like that, where you need to be looking at some key things. All of the lists I'm posting in this presentation are a starting place.

They aren't necessarily exhaustive, especially when I'm talking about tools, because there are a zillion tools out there and finding the right tool really depends on your use case. But the most important things you really want to make sure you're tracking are user interactions.

Users logging in and logging out; I assume you have authentication of some sort. You always want to track errors within the application and log them somewhere, because an error is one of the most common places for an attacker to break into the system: find an error [00:03:00] and exploit it. Whenever possible, also log when the application starts and stops, because an unexpected start or stop can be an indication of somebody doing something they shouldn't, like maybe they've exploited something and they're restarting the system with exploited code.

So you want to be watching for those things and making sure that you're logging them. And of course, this is just the beginning, right? You have to actually log them before you can do anything with these logs. The last thing is similar to the start/stop: you want to log your deployments.

You want to log when you're making changes to the system, so that you have a line in your log files that says: before this point we were running on this code, after this point we're running on that code. That way, if you're, say, hotfixing a vulnerability, you know where in those logs to start looking. [00:04:00] This can also be an indication that somebody has done something.

If somebody has broken into your system and deployed something malicious, logging those deployments is an important way of detecting that something hinky is going on.
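[As a rough illustration of those four application-side events (auth activity, errors, start/stop, and deployment markers), here is a minimal sketch using Python's standard logging module. The event names, fields, and version string are hypothetical, not something from the session.]

```python
import logging

# One logger for security-relevant events, separate from debug noise.
log = logging.getLogger("app.security")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

def log_login(username: str, source_ip: str, success: bool) -> None:
    # Log both successes and failures; the failures are what later
    # brute-force detection keys off of.
    level = logging.INFO if success else logging.WARNING
    log.log(level, "auth login user=%s ip=%s success=%s",
            username, source_ip, success)

def log_deployment(version: str) -> None:
    # A single marker line: everything before this ran the old code,
    # everything after runs the new code.
    log.info("deploy version=%s", version)

if __name__ == "__main__":
    log.info("app startup")          # unexpected starts/stops are suspicious
    log_deployment("v2.3.1")
    log_login("alice", "203.0.113.7", success=True)
    try:
        1 / 0
    except ZeroDivisionError:
        # Always log errors: they're a favorite entry point for attackers.
        log.exception("unhandled error during request")
    log.info("app shutdown")
```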

On the system side, it's a little different. On the system side, when you're tracking things, you're more likely to be detecting outliers. You're going to look and see: is there a spike in CPU utilization? Is it using more RAM all of a sudden? Is it using a ton of extra disk space?

Are you all of a sudden using way more bandwidth? Logging this information, especially after there's an incident, is really useful, because you can go back and look at it. Especially when you're logging the application logs and the system logs and aggregating them together, you can look and say: okay, memory spiked here.

[00:05:00] Compare those timestamps to what was in the application logs at the same time, and sometimes, by putting those two pieces together, you can detect: oh, here's what happened. So monitoring these is a key piece of detecting incidents. And of course it's also just useful in regular day-to-day operation.

If you get slashdotted, you kind of want to know that all of a sudden you're getting ten times the normal traffic, right?
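[A hedged sketch of that kind of periodic system-metric sampling, assuming the third-party psutil package; the output format is invented, and you would run this on an interval from cron or a loop.]

```python
import logging
import psutil  # third-party: pip install psutil

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("sys.metrics")

def sample() -> None:
    # Snapshot the outlier-prone numbers mentioned above: CPU, RAM,
    # disk, and network. Logged with timestamps so you can line them
    # up against the application logs after an incident.
    cpu = psutil.cpu_percent(interval=1)
    ram = psutil.virtual_memory().percent
    disk = psutil.disk_usage("/").percent
    net = psutil.net_io_counters()
    log.info(
        "cpu=%.1f%% ram=%.1f%% disk=%.1f%% net_sent=%d net_recv=%d",
        cpu, ram, disk, net.bytes_sent, net.bytes_recv,
    )

if __name__ == "__main__":
    sample()
```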

You definitely also want to track access logs for people logging into the backend, whether that's the Kubernetes servers or the virtual machines or whatever, because of course that's the most likely place somebody is going to break in: get into the server and try to change your application code.

And just like at the application layer, you're looking for errors in your logs. Out-of-memory errors are the big one, [00:06:00] because overflows are a common way to break into systems. But in general, looking for anomalies and errors in your system log is a good way to see that maybe somebody is doing something they shouldn't. And by the way, I'm very caffeinated, so I'm going to talk quite fast. So please, by all means, interrupt.
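[To make the out-of-memory point concrete, here is a small sketch that scans a syslog-style file for kernel OOM-killer lines. The log path and the exact message wording vary by distribution and kernel version, so treat both as assumptions.]

```python
import re
from pathlib import Path

# The Linux kernel logs a line roughly like
#   "Out of memory: Killed process 1234 (mysqld) ..."
# when the OOM killer fires; exact wording varies by kernel version.
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")

def find_oom_kills(logfile: str) -> list[tuple[str, str]]:
    hits = []
    for line in Path(logfile).read_text(errors="replace").splitlines():
        m = OOM_RE.search(line)
        if m:
            hits.append((m.group(1), m.group(2)))  # (pid, process name)
    return hits

if __name__ == "__main__":
    # /var/log/syslog is Debian/Ubuntu; RHEL-family systems use /var/log/messages.
    for pid, name in find_oom_kills("/var/log/syslog"):
        print(f"OOM killer took pid={pid} ({name}); worth investigating")
```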

But I also have a bit more to talk about here, and I don't want to waste too much of our time. So, pulling all of this together: it's really useful just to have these logs in the first place. If you're doing absolutely nothing else, at least gathering this base information will help you determine, hopefully at the time but definitely after the fact, whether something has gone wrong, and then let you trace it back and figure out what's going on. For network applications, obviously include the source of things: for logins, both the [00:07:00] user authentication and the session that triggered the error, or something like that.

All of that information, logging source IP addresses and things like that, is very important because it will help you look for outliers. If you have an application that's primarily used in the continental US, say, and you see a login from Siberia, you're probably going to think:

hmm, I don't think that's quite right. And in fact, geo-fencing is a pretty common thing to do, blocking systems off if you have something you know will never be used outside of a certain area.
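[A hedged sketch of that kind of login-origin check, assuming the third-party geoip2 package and a local MaxMind GeoLite2 country database; neither is named in the session, and the allowed-country list is made up.]

```python
import geoip2.database  # third-party: pip install geoip2

# Hypothetical policy: this application is only used in North America.
ALLOWED = {"US", "CA", "MX"}

def check_login_origin(ip: str, db_path: str = "GeoLite2-Country.mmdb") -> bool:
    # Look up the country for the source IP and flag anything outside
    # the expected region, like the "login from Siberia" example above.
    with geoip2.database.Reader(db_path) as reader:
        country = reader.country(ip).country.iso_code
    if country not in ALLOWED:
        print(f"suspicious login: ip={ip} country={country}")
        return False
    return True

if __name__ == "__main__":
    check_login_origin("203.0.113.7")
```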

So, okay. I'm going to move on because I'm not hearing any questions.

Greg Lund-Chaix: The second thing, and this is important for multiple reasons, is that you need to pull those logs off of those servers. You need to store them somewhere else. This is very important because the first thing a bad guy is going to do when they break into a system is [00:08:00] erase the logs. They're going to try to cover their tracks and hide their footprints.

In the early days of running systems, we would rsync logs off the server periodically: every five or ten or fifteen minutes, or every hour, or whatever, we would shovel the local logs off to somewhere else. That doesn't do a whole lot of good, because it means an attacker has that sync window to erase their footprints.

They can erase their login information, things like that. So it's very, very useful to use tools, even something as basic as rsyslog, but more commonly something like the ELK stack (Elasticsearch, Logstash, Kibana) or Splunk, or commercial tools like Papertrail or Datadog, to stream those logs in real time off to another location.

That way they're stored somewhere the bad guy who broke into the system can't [00:09:00] get to them, so they can't cover their tracks. It's very important that you have consistent logs. The other nice thing about aggregation, of course, is that when you're dealing with a clustered service with a lot of containers or a lot of servers, it can be a real pain to find something, because you have to figure out, okay, which server are they on, then go look at the logs on that particular server.

Aggregation, by piling everything together, also allows you to see all of the systems in one place, which makes it much easier to see patterns. If there is a spike but the spike only happens on one machine, it's going to be more obvious if you have everything aggregated together.

Attendee: I actually have a question for you if you don't mind.

Greg Lund-Chaix: Yeah. Bring it on.

Attendee: So you were talking about streaming logs to some sort of service. What I've been wondering is, [00:10:00] that scenario sounds like it's going to go over, obviously, HTTP or something like that most likely, and that seems like a lot less reliable than writing logs to disk, per se.

So I guess the question is: is that reliable enough to not write to disk? Or would you write to both?

Greg Lund-Chaix: Good question. It is reliable. They tend not to use HTTP; they usually use their own protocol, usually on a different port. But generally, yes, it's good to write to both.

If for no other reason than that if a bad guy breaks in, they're going to change the copy written to disk, and you'll be able to see the difference. But it's also sometimes convenient just to have local logs. If you know you're having a problem with a specific server and you don't want the noise of all the other servers pouring in, you can just go to that one and look at its logs.

So generally, on systems that I manage, that's exactly what I do. I tee the logs off, so they get written to the local /var/log and also get written to, you know, Datadog or wherever I'm doing my [00:11:00] aggregation.

Attendee: Cool.

Greg Lund-Chaix: Awesome. Cool. Thank you. Thanks for the questions.
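[A minimal sketch of that tee approach using Python's standard logging handlers: one handler writes the local file, a second streams the same records to a remote syslog collector over TCP. The file path, hostname, and port are placeholders, not anything named in the session.]

```python
import logging
import logging.handlers
import socket

log = logging.getLogger("app")
log.setLevel(logging.INFO)

# Handler 1: local file, convenient for per-server debugging.
log.addHandler(logging.FileHandler("/var/log/myapp.log"))

# Handler 2: stream the same records to a remote syslog collector in
# real time, so an intruder who wipes the local file can't erase the
# remote copy. Host and port are placeholders for your aggregator.
remote = logging.handlers.SysLogHandler(
    address=("logs.example.com", 514),
    socktype=socket.SOCK_STREAM,  # TCP; many collectors also accept UDP
)
log.addHandler(remote)

log.info("user alice logged in from 203.0.113.7")  # goes to both places
```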

So, aggregation. We'll get into Datadog especially, but Papertrail and Datadog and ELK and Splunk really go beyond just aggregation, obviously, because there's the alerting component as well. These kind of overlap into the next piece, which is of course monitoring and alerting.

Once we have all these logs, that's good. At least we have them, so we can go back and dig into them, do a little spelunking, and say: okay, what have we got? What happened? Why did this happen? And then go back and hopefully resolve the issue. Of course, it doesn't do a whole lot of good unless you actually have somebody looking at those logs and monitoring them.

So there are a ton of tools out there, from the very basic to [00:12:00] the pretty in-depth. This is where things start to branch a bit, because all the tools are a bit different. We've got something very basic like CloudWatch, which will just be pulling in metrics from all your systems and monitoring them.

And if they hit a certain threshold, it kicks off an alert. That's very basic, but it's useful. If you're using Nagios, same thing, right? Nagios will look and say: oh, look, CPU's high, throw an alert. In fact, Nagios even has plugins for scanning for patterns or strings in log files and things like that.

So you can have it watching log files and regexing for error messages to kick up an alert. At the very basic level, if you have nothing else, at least get something like that in place, so that you have something going: whoa, something's going on.
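[Nagios check plugins are just programs that print one status line and use exit codes 0, 1, and 2 for OK, WARNING, and CRITICAL. Here's a hedged sketch of a log-scanning check in that style; the log path, pattern, and thresholds are invented for the example.]

```python
import re
import sys
from pathlib import Path

# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL,
# and print a single human-readable status line.
WARN, CRIT = 5, 20  # made-up thresholds for ERROR lines

def main(logfile: str = "/var/log/myapp.log") -> int:
    text = Path(logfile).read_text(errors="replace")
    errors = len(re.findall(r"\bERROR\b", text))
    if errors >= CRIT:
        print(f"CRITICAL - {errors} ERROR lines in {logfile}")
        return 2
    if errors >= WARN:
        print(f"WARNING - {errors} ERROR lines in {logfile}")
        return 1
    print(f"OK - {errors} ERROR lines in {logfile}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```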

[00:13:00] And then we have kind of two branches: Prometheus on one side, and ElastAlert and Watcher on the other. Prometheus looks at metrics; it looks at numbers, and if they hit a certain threshold, it kicks off an alert. Prometheus has an agent that's monitoring the system, whether it's a backend Kubernetes server or a VM or whatever.

It's going to watch things like RAM usage, a lot of those system stats, and alert you if it sees something. Whereas something like ElastAlert or Watcher is looking at that stream of log files as they come in and aggregate together, and watching for things in those.

So there, it's going to be more likely to see error messages or logins and things like that. [00:14:00]
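[On the metrics branch, Prometheus alerting would normally be expressed as alerting rules evaluated by Prometheus and routed through Alertmanager; to keep every sketch here in one language, this hedged example instead polls Prometheus's HTTP query API with the requests package. The host and the node_exporter metric names are assumptions.]

```python
import requests  # third-party: pip install requests

PROM = "http://prometheus.example.com:9090"  # placeholder host

def memory_alert(threshold: float = 0.90) -> None:
    # PromQL: fraction of memory in use per node, assuming node_exporter
    # metrics are being scraped.
    query = "1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)"
    resp = requests.get(f"{PROM}/api/v1/query",
                        params={"query": query}, timeout=5)
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        value = float(series["value"][1])  # value is [timestamp, "number"]
        if value > threshold:
            print(f"ALERT: {instance} memory at {value:.0%}")

if __name__ == "__main__":
    memory_alert()
```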

These two tools do similar things, but in different ways, and they watch for different things. One of the things you'll hear a lot in security is talk about layered defense.

That's also true of your tools here: you don't want to watch just one dimension, you want to watch multiple dimensions at once. So it's very likely you could be using both. You could be using Prometheus to monitor system stats and things like that,

but then also be watching the logs with ElastAlert or Watcher or something like that. There are organizations I've worked with that do both, and that can be useful because they're going to alert on different things. And then we get into some of the fancier stuff. Datadog is a good example.

Papertrail does something similar, but I think Datadog's a bit more advanced, and I use it quite a bit on projects that I work on. It's a service; you stream your logs to it. It has a [00:15:00] huge number of integrations that can monitor anything from the application layer to the server, and it will do

a whole bunch of interesting, kind of intelligent alerting. There are some dangers here that I'll get into in a minute, but it can learn. It's not the smartest, but it can watch and see: okay, this group of servers tends to run a load of four, but this group of servers tends to run a load of eight, and that's okay.

That's normal. And so what it can do is watch for deviations from the norm: if something gets off by one standard deviation, it'll alert. So it can do some intelligent alerting, which is really useful, and I use that quite a bit.
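[This is not Datadog's actual algorithm, just a toy version of the idea: learn a per-group baseline from history and flag readings more than one standard deviation away from it.]

```python
from statistics import mean, stdev

def is_anomaly(history: list[float], current: float, sigmas: float = 1.0) -> bool:
    # Learn what "normal" looks like from past readings, then flag a
    # reading that deviates from the mean by more than `sigmas`
    # standard deviations. Real anomaly monitors are far smarter
    # (seasonality, trends, etc.); this is just the core idea.
    mu, sd = mean(history), stdev(history)
    return abs(current - mu) > sigmas * sd

if __name__ == "__main__":
    # This group of servers normally runs a load around four.
    group_a_loads = [3.8, 4.1, 4.0, 3.9, 4.2, 4.0]
    print(is_anomaly(group_a_loads, 4.1))  # False: normal for this group
    print(is_anomaly(group_a_loads, 8.0))  # True: fine for another group, not this one
```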

There is a danger there, though, and it's also true with New Relic: you can get too many false positives. And when you get too many false positives, you stop paying attention, so when you do get a real [00:16:00] alert, it gets lost in the noise of the false positives. So you want to be really careful with these alerting systems in particular, in your tuning and what you do with them.

Especially early on, when you're setting them up, you want to tune them down as best you can, so that you're minimizing the number of false positives and the likelihood that you're going to miss a real alert. Lastly, New Relic, if any of you are familiar with that. There are other tools out there, but it's the classic example of APM,

application performance monitoring. It actually looks for deviations in the performance of the application. Usually it's inserted into, say, the PHP stack or something like that, so it's monitoring response time at the application server level, and usually it also inserts some code into the page being served,

so it can actually report back end-user [00:17:00] timing and things like that. The nice thing about that is that not only does it provide performance monitoring, it also provides you with notifications if there is, say, a huge increase in traffic, or a huge increase in traffic from another location, something like that.
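[A toy illustration of the server-side half of that idea, not how New Relic's agent actually hooks into the runtime: time each request handler and log the slow outliers. The threshold and handler are made up.]

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("apm")

SLOW_MS = 500.0  # made-up threshold for a "slow" request

def timed(handler):
    # Wrap a request handler, record wall-clock duration, and log an
    # alert-worthy line when it's unusually slow. Real APM agents hook
    # the language runtime itself instead of using decorators.
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return handler(*args, **kwargs)
        finally:
            ms = (time.perf_counter() - start) * 1000
            level = logging.WARNING if ms > SLOW_MS else logging.INFO
            log.log(level, "%s took %.1f ms", handler.__name__, ms)
    return wrapper

@timed
def render_homepage():
    time.sleep(0.05)  # stand-in for real work
    return "<html>...</html>"

if __name__ == "__main__":
    render_homepage()
```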

So with monitoring and alerting, more than likely you're going to be using more than one tool, because they all monitor different things. And some tools are more expensive than others. Datadog and New Relic are especially tricky in that if you don't set them up right, they get very expensive, because up until very recently New Relic charged per node.

When you're working in, say, a container environment, that becomes crushingly expensive, because you could have hundreds of nodes. I noticed somewhat recently that New Relic has just changed this; they've switched to per-user and per-gigabyte-ingested [00:18:00] pricing, so their pricing model has changed.

I haven't had a chance to really analyze what that means, pricing-wise. Datadog, for example, charges per node, which is less than ideal; however, in a containerized environment Datadog at least charges per backend node, not per container. So if you have, say, an ECS cluster with eight machines, it's going to charge you for eight nodes, not for the however-many-hundred containers you have floating around out there.

So the roll-your-own options, Prometheus and the ELK stack and ElastAlert and things like that, are often more cost-effective, especially in larger environments. Certainly tools like Datadog, especially with their built-in intelligence and capacity to do that sort of deviation-based alerting, are really helpful if you can make them fit [00:19:00] in your environment.

Okay, and I'm keeping an eye on the time here. So, intrusion detection. This is an interesting side topic, in that IDS is in some ways being sidelined. I was talking with my coworkers yesterday about this, and we tend not to do very much with IDS these days, in an environment where we're dealing with much more abstraction from the hardware.

Especially in things like Kubernetes or OpenShift, or especially PaaS providers, it really doesn't make a whole lot of sense. They're still useful. However, the problem we ran into a lot with them is that they're even more likely than the monitoring tools I mentioned on the last slide

to have [00:20:00] false positives. They're a lot more work to tune, and a lot more work to monitor, and it's very much getting to the point of diminishing returns. There are good tools out there; Snort, Suricata, things like that are very good, and there are a zillion commercial tools for this. If there's an organizational requirement to have one, for one reason or another, then yeah,

by all means use them. But I haven't found a whole lot of need currently. In order to have what I consider a secure, well-monitored system, IDS really doesn't add that much, and it tends to detract: I'd rather spend time tuning my aggregation and monitoring systems than spend time trying to tune an IDS, because at this point, with the way we're streaming logs and monitoring them and looking for outliers, those [00:21:00] are going to detect things as quickly as an IDS.

We're going to see somebody trying to brute force an account at the same time the IDS sees it and alerts, so in a lot of ways it's becoming redundant. They're still useful, and in some environments they're still worth money, but they're less important, I think, than the previous stuff, the monitoring and alerting tools.
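[That brute-force case is exactly the kind of thing the log-monitoring side catches. A hedged sketch, counting failed logins per source IP against the hypothetical log format from the earlier application-logging example; the pattern and threshold are made up.]

```python
import re
from collections import defaultdict

# Assumed log format, matching the earlier logging sketch:
#   "... WARNING ... auth login user=alice ip=203.0.113.7 success=False"
FAIL_RE = re.compile(r"auth login user=\S+ ip=(\S+) success=False")
THRESHOLD = 10  # made-up: failures from one IP before we alert

def brute_force_ips(log_lines: list[str]) -> list[str]:
    failures: dict[str, int] = defaultdict(int)
    for line in log_lines:
        m = FAIL_RE.search(line)
        if m:
            failures[m.group(1)] += 1
    return [ip for ip, n in failures.items() if n >= THRESHOLD]

if __name__ == "__main__":
    sample = ["auth login user=admin ip=198.51.100.9 success=False"] * 12
    for ip in brute_force_ips(sample):
        print(f"possible brute force from {ip}")
```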

And then, of course, you look at things like the SolarWinds breach last year, SolarWinds being one of the large IDS providers. In fact, I had them on this slide until I pulled them off because they had a breach. And so then your IDS is actually a gateway and a problem itself.

So I've covered a whole bunch of information, and I realize we had a relatively short time here to do this. But I would like to know if [00:22:00] anybody has any questions, or if there are any specific topics you'd like to talk about. We probably ought to take a short break here so I can get a drink, and then our next session is a general Q&A discussion. I haven't prepared any specific talking points for it, because I wanted to hear from all of you about what things you're interested in, and then we can talk about them.

So what I'll suggest is: let's take a couple minutes' break, I'll refill my coffee cup, and then we can come back and start the general infrastructure discussion. If nothing else, I suppose I could share a couple of horror stories to start things off, because in the course of my career I've had some pretty large infrastructure disasters.

It's always fun to talk about those, so we'll start with that. I'll be back in maybe a [00:23:00] minute, minute and a half.