
Joining Today:

Matt Glenn

Nate Sherman
In this episode of 10,000 Feet, Nate Sherman and Matt Glenn dive into the recent wave of major cloud outages impacting AWS, Azure, and Cloudflare, exploring what went wrong and why these failures are so disruptive. They discuss the growing risks of globally applied changes, the importance of error budgets, and strategies for building resilience in modern infrastructure. The conversation also covers best practices in site reliability engineering, monitoring, and alerting, as well as the role of AI and automation in change management. Packed with insights for architects, SREs, and IT leaders, this episode offers practical guidance on balancing speed, reliability, and risk in today’s cloud-driven world.
Enjoy the episode!
Need to know more?
Matt and the team are always excited to chat!
Episode Transcript
Transcript (Auto Transcribed)
00:00:00 Nate
Hello, welcome to 10,000 feet, the Vervint podcast. My name is Nate Sherman, one of our architects here. I have been here for just about a year now. I’m joined here with Matt Glenn.
00:00:09 Matt
Hello there, Matt Glenn. I head our modern infrastructure team and our managed services teams.
00:00:16 Nate
All right, so.
00:00:18 Nate
Matt, there’s been a lot of talk in the news about all of these major cloud outages, something that we’ve seen a lot in the field, something that a lot of people have read about and it’s really caused a lot of disruption. I know there’s been a lot of big players that have been hit. Can you just maybe talk, I guess, a little bit of just a general overview of what’s been going on and why it’s caused so many problems?
00:00:39 Matt
Yeah. Well, over the past few weeks there have been a few major outages. You know, the first outage is, you know, AWS had an outage of essentially most services in one of their regions. Azure had an outage of what they call the Front Door service, which sits in
00:00:59 Matt
front of, you know, a lot of the other services that they have, as well as being used to protect a lot of web applications and APIs. And then most recently we have the Cloudflare outage,
00:01:13 Matt
which, you know, took down about 20% of the Internet, is what some estimates are, and caused a lot of disruption.
00:01:25 Nate
That’s a lot of disruption. So when people talk about the cloud going down, I know one misconception is it’s an entire.
00:01:27 Matt
Yes.
00:01:35 Nate
An entire service or an entire company, right? It sounds like what you're describing is more of a subset of services going down that's impacting functionality. Would you agree with
00:01:47 Nate
that assessment?
00:01:49 Matt
Yeah, well, just look at the Cloudflare outage. So Cloudflare is one company, and
00:01:55 Matt
there are other companies that do the same thing that Cloudflare does, but, you know, 20% of the services out there that you use every day actually use Cloudflare in their service. And what they do, which is kind of important to the outage, is they are
00:02:12 Matt
a content delivery network. So Cloudflare has, I think the last number I saw was, about 330 data centers around the world, and they have, you know, data centers that are essentially close to you. In fact, there's one here in Grand Rapids. And when I say
00:02:32 Matt
data center, it's just, like,
00:02:34 Matt
two racks of equipment. It's not huge, but it's close. And what they do is they cache data from big websites, like, say, Netflix or something like that, so that when you go get the loading images and things like that for a website, it's actually
00:02:54 Matt
coming from something that may actually be just miles away rather than, you know, on the East Coast or West Coast or something like that.
00:03:04 Matt
That’s one main thing that they do, and then the other thing that they also do is they help protect websites against bot traffic. So you.
00:03:14 Matt
Know.
00:03:14 Matt
Bots that are essentially trying to go to nike.com and buy out the latest shoe getting released so that they can drive up prices and resell them or something like that.
00:03:25 Matt
So, you know, they do those two things. The thing is that, as a consumer, you have to go through this Cloudflare service in order to get to nike.com.
00:03:36 Matt
And, you know, so when it goes down, it feels like the internet's down, because the things that you want to go to are also down. That's also why it's so impactful and difficult
00:03:50 Matt
to work around.
00:03:51 Matt
So the Internet hasn't gone down, but some very big services have gone down, and,
00:03:57 Matt
as engineers, it becomes more and more difficult for us to figure out how we protect ourselves and the apps that we put out there against this stuff.
00:04:07 Matt
So I can just dive in and talk a little bit more about what happened with Cloudflare, because that's kind of interesting, and it's different from the other two outages that I mentioned. So, with Cloudflare,
00:04:26 Matt
what happened is actually with this bot protection service.
00:04:32 Matt
So, you know, to get into a little bit of the technical side, the way that it works is, you know, they have essentially an executable that is running on a server, right? And there might be tens of thousands of these servers. But essentially your traffic goes through this executable, and it's looking for the behavior of a bot, you know, and trying to figure out,
00:05:06 Matt
do I need to flag you and have you go through a CAPTCHA or something like that? And it reads a text file, and that text file contains the rules that it's supposed to evaluate you against.
00:05:06 Matt
And this executable was recently rewritten in a language called Rust. One of Rust's biggest features is that it's big on memory protection. So, like, if you allocate a variable, you have to specifically scope it, and
00:05:25 Matt
you know, a hacker can't figure out how to access that memory by feeding you weird stuff.
00:05:32 Matt
And so they rewrote it in Rust to provide better security.
00:05:36 Matt
Well, that means, you know, they allocated the ability to have 200
00:05:42 Matt
rules, and then the code essentially just assumed it's never going to have more than 200 rules, so that's the amount of memory it allocated. And then,
00:05:54 Matt
if that assumption doesn't hold, it just dies, right? Because they basically had a piece of code that said, oh, this will never happen. And then, you know, there's a script that essentially generates those rules by reading from a MySQL database.
00:06:14 Matt
And somebody made a security change to MySQL which effectively caused that script to return a lot more stuff, things from, like, dev and test and whatever else. And so now there are more than 200 rules.
00:06:27 Matt
That little executable reads it, it dies, your traffic stops, and that process started rolling out across the globe, all 330 data centers,
00:06:43 Matt
blocking everybody's traffic.
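To make that failure mode concrete, here is a minimal Rust sketch of the general pattern described above: a hard preallocated rule limit treated as something that can never be exceeded. It is an illustration only, not Cloudflare's actual code; the names, the limit check, and the file format are assumptions.

```rust
// Illustrative sketch only, not Cloudflare's code: a loader that assumes the
// generated rule file can never exceed a fixed limit, and kills the whole
// process when that assumption breaks.
const MAX_RULES: usize = 200;

fn load_rules(raw: &str) -> Vec<String> {
    let rules: Vec<String> = raw
        .lines()
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.to_string())
        .collect();

    // The "this will never happen" branch: more than 200 rules is treated as
    // unrecoverable, so the bot-protection executable simply dies, taking the
    // traffic flowing through it down with it.
    assert!(
        rules.len() <= MAX_RULES,
        "rule file has {} entries, limit is {}",
        rules.len(),
        MAX_RULES
    );
    rules
}

fn main() {
    // A database-side change that suddenly includes dev/test rows would push
    // the generated file past the limit and panic here.
    let rules = load_rules("block-scraper-a\nchallenge-headless-b\n");
    println!("loaded {} rules", rules.len());
}
```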
00:06:46 Matt
It's also very difficult to troubleshoot, because there are thousands of MySQL clusters out there across all the different sites. Every site probably has multiple clusters and so forth, and so when that script goes to refresh, it might point to a SQL cluster that hasn't been updated yet,
00:07:06 Matt
and so it writes the right stuff and things run. Then the next time it refreshes, it points to one that is screwed
00:07:14 Matt
up and pulls the wrong data, and then it fails. And so as they first started seeing the problem, not everything failed. It was just little pieces going off here and there and here and there, until eventually, I think two hours into the outage,
00:07:32 Matt
everything finally got to a stable point where it was all 100% failed, which is really difficult to troubleshoot.
00:07:39 Matt
Yeah. So as an end user, it doesn't matter what a company like Netflix or whatever else has done, or, actually, more specifically, because this did actually hit Nike, it doesn't matter, you know, what they do.
00:07:55 Matt
You can't get to them, because all of the traffic to nike.com from you goes through this content delivery
00:08:03 Matt
network. The only options that they have
00:08:06 Matt
take a lot longer to resolve than what Cloudflare is going to do, so there's not really a lot that you can do as an engineer,
00:08:16 Matt
except, you know, possibly evaluate other services.
00:08:20 Matt
Now, that's not to say the Cloudflare service is
00:08:23 Matt
bad. They're a good
00:08:25 Matt
service, and they've done a wonderful job, you know, giving all the technical details of what went wrong and what they're doing to fix it and prevent it again. But
00:08:34 Matt
it does have some disturbing things about it.
00:08:38 Nate
Yeah, it almost sounds like the CDN for a lot of different companies out there is really more of a single point of failure, right, with not necessarily easy ways to work around it, just by nature of how it works.
00:08:50 Matt
Well, and to be honest,
00:08:53 Matt
the only thing that makes it a single point of failure is how Cloudflare managed the update. So if you were to roll out to, say, one-eighth of the data centers
00:09:07 Matt
and then leave it for an hour, you would probably see problems, but your blast radius, you know, the maximum amount of damage that you could have done, is one-eighth of your
00:09:17 Matt
traffic.
00:09:18 Matt
Still pretty big, but not everything, not everybody. And you'd also have better diagnostic data to be able to roll it back, because you know what change
00:09:28 Matt
caused it, rather than a bunch of little things firing all over the place
00:09:34 Matt
and then fixing themselves and then breaking again.
00:09:38 Matt
So the scary part really is blasting all 330 data centers with an automated flow.
00:09:48 Matt
That's what makes it the single point of failure.
00:09:51 Nate
Sure.
00:09:52 Nate
I want to jump ship here a little bit, because I know all of these outages, although they were all disruptive, were very different from a technical perspective, right? Yep. Maybe diving into the Azure one a little bit, I know that was something we got a lot of activity on from some of our clients. What happened there with the Front Door service? I know it's similar in the way that it serves web traffic,
00:10:13 Nate
as Cloudflare does, but I know that there were some technical differences in what actually went
00:10:17 Matt
wrong. Yeah. So Front Door, for those that don't know what Front Door
00:10:20 Matt
does: it essentially does a combination of a few things. One is it acts as a web application firewall for websites or APIs that are behind it. It also acts as a traffic router, so if you have multiple instances that might be serving that
00:10:43 Matt
traffic, it can route to them depending on, you know, what path you're going to and so forth. And there are a lot of features there. So it's really doing multiple things at one
00:10:55 Matt
time. And so essentially Azure Front Door, you know, started having issues with some changes that were rolled out and stopped responding. The problem is that Azure Front Door is not only used by us to protect things like APIs or websites or whatever else;
00:11:16 Matt
It’s also used by Azure and Microsoft 365.
00:11:22 Matt
To protect all of their services. So now you start to see problems where.
00:11:28 Matt
There.
00:11:29 Matt
You can’t access the actual Azure portal that a lot of people would use to administer Azure and to try to figure out how to get around it or figure out what the heck’s going on, you start having issues where Office 365 is inaccessible. You can’t get to the web interface.
00:11:49 Matt
which a lot of people may use. And then there are other services inside of Azure that you might depend on that are then inaccessible, and so that chain gets pretty complicated and also takes a while to resolve.
00:12:03 Matt
With Azure, it also became kind of an interesting single-point-of-failure problem, because their issue occurred globally as well. You know, Azure has regions, and regions are generally supposed to be separate, but Front Door kind of works
00:12:24 Matt
across regions, and the issue is that when the failure started to happen, it impacted global services for Front Door. So it doesn't matter if you have an application that's deployed in two or three regions, even, you know, a region on the other side of the planet; users
00:12:44 Matt
still can't get to it. Now, there are ways, as an engineer, that you can account for that. It's difficult. You know, you could actually have multiple services instead of just Front Door; you could have a backup service that you could fail over to.
00:13:01 Matt
But then you would have to have some sort of external provider hosting your DNS that would actually be able to test and see if something's going wrong and fail over. That's not really practical for most purposes.
00:13:16 Matt
So again, it's another case of a global failure that's very difficult to engineer around.
00:13:22 Nate
Well, it sounds like when Front Door failed, it was almost a double whammy, right? Because not only is that powering the services that an organization is hosting with that particular tool, but it's also powering the management interface that you would use to get in and troubleshoot what's going on, or even fail over, possibly manually, to something different. So you're
00:13:41 Nate
kind of stuck, you know.
00:13:42 Matt
Yeah, and sometimes it also takes out the services that you would use to complain about the failure. You know, one of the most comical things that happened with the other failure, with Cloudflare, is
00:13:58 Matt
there's a website called Downdetector that people use to report if, you know, Netflix or whatever else is starting to be inaccessible, and Downdetector is fronted by Cloudflare. So when Cloudflare went down, Downdetector was down, which has caused some pretty funny things to pop up. Like, there's a website called
00:14:20 Matt
Down Detector's Down Detector that detects if Downdetector is down, and so it's gotten pretty funny. But,
00:14:28 Matt
yeah, that definitely complicates our work, because the services that might be impacted are also the same ones that we use to figure out how we're impacted and how to fix it and how to manage it.
00:14:40 Matt
So having backup plans there is also a good idea. Now, one thing I would say about that is, you know, we did experience problems accessing things like the Azure console, but command line services and the ability to access via APIs,
00:15:01 Matt
some of that stuff was still working.
00:15:06 Nate
So, man, it sounds like a lot of these outages were really a result of maybe some poor change management, and there's only so much that you can do as an organization to plan for that. In many situations you can't, because you're getting these tools as a service, you know, with the impression that these services are going to be available to you.
00:15:27 Nate
So in these situations where you have these large companies that are, you know, powering a lot of what we do, how can organizations be better equipped to not necessarily prevent these types of issues from coming up, but work around maybe some change management issues or similar when a service does stop working?
00:15:45 Matt
Well, I think one of the biggest things is trying to kind of change the perspective. You know, a lot of us come from traditional infrastructure backgrounds where we're told, you know, X, Y, and Z has to be up 99.995% of the time.
00:16:01 Matt
And there’s there’s really no business process that has been understood to needing to have that level of uptime. And one of the biggest things in the world of site Reliability engineering is coming up with this idea of an error.
00:16:18 Matt
Budget.
00:16:19 Matt
And why that’s important.
00:16:21 Matt
Is if you look at cloud flare.
00:16:24 Matt
the outage was big, and, you know, things were down effectively for hours. I think the total end-to-end real outage time was about six hours,
00:16:37 Matt
which is a big deal.
00:16:39 Matt
But as a business, if you design a process for your customer, like, say, Uber booking rides, right,
00:16:52 Matt
in your process you need to define what amount of unbookable rides is acceptable to an end user.
00:17:03 Matt
So if, out of every 100 rides I try to schedule with Uber, one of them I can't schedule, is that acceptable to a customer, or do we have to go further? And then what's the cost of going to that level? It might mean you have to overschedule
00:17:23 Matt
drivers and things like
00:17:25 Matt
that, which might then have a cost in order to be able to provide it. So you have to run the business with that idea of an error budget: here's the amount of problems we're allowed to have. And then you have to extend that into how we plan infrastructure and how we plan
00:17:46 Matt
around these types of problems. If you have an error budget and you've determined, OK, we're allowed to miss one ride out of 100,
00:17:55 Matt
well, that's 99% uptime.
00:17:59 Matt
You have a 1% error budget: 1% of the time out of all of the hours in a given month.
00:18:06 Matt
That's quite a bit. That's, what, seven hours, a little more than that. So
00:18:13 Matt
if you have that 1% error budget that you've designed your business around, and Cloudflare is out for six hours, yeah, it's painful,
00:18:22 Matt
but it's not painful enough to say I have to completely redesign our entire infrastructure to get around it.
00:18:31 Matt
And so thinking in terms of error budgets is really important as we start getting into this area where there are a lot of things that are outside of your control, and there might be four or five companies that essentially power everything on the Internet that you use.
00:18:49 Matt
If we then take a look at, you know, Cloudflare, and we say, well, I'm using them as a provider, and they had an outage that used most of my error budget this
00:18:58 Matt
month,
00:18:59 Matt
but what's their trend? Are they generally staying way under that, and this was just very unfortunate? And, you know, we've been able to apologize and we haven't really lost any customers.
00:19:11 Matt
It doesn’t mean you go trying to spend $10 million reengineering everything to work around a CDN.
00:19:20 Matt
You know, you could actually deploy multiple CDNs. You could deploy Cloudflare, you could deploy an AWS CDN, and then there are ways, really complicated ways, that you could load balance between the two at the end user device. It's really complicated; you have a lot of complexity
00:19:42 Matt
to manage to do that. It's not practical for most people. So for your business, if a one percent error
00:19:49 Matt
budget is good, then, you know, that's what you target. Even in healthcare, you know, healthcare is one of our biggest focuses, and people think, well, in healthcare you can't have systems go down at all because lives are on the line. What people don't realize is that
00:20:10 Matt
most of the processes that are used by nurses and doctors and so forth to do the everyday care also have a paper backup process and also have offline systems
00:20:22 Matt
that are used to be able to access some of that patient care data.
00:20:26 Matt
So yes, even a hospital can have an error budget, and you can't not have an error budget. It is impossible for you to have 100% uptime, no matter what you do.
00:20:37 Matt
You have to plan around it as a business and then evaluate your services on whether those services are meeting your error budget or not.
00:20:47 Matt
So with Cloudflare, I mean, it depends on what happens over the next few months. Do they have another
00:20:53 Matt
one of these,
00:20:54 Matt
or do they show signs that they have put root cause prevention in place to prevent this type of thing in the future, that they've learned from it, etcetera? Then it's a blip. It's a painful one, but it's a blip.
00:21:10 Nate
Yeah, it sounds like, from a business continuity perspective, it almost makes more financial sense to just have an error budget than trying to spend a bunch of money to
00:21:22 Nate
reengineer something for failover, when realistically it maybe isn't going to happen that often or is going to exceed the cost of your error
00:21:31 Nate
budget.
00:21:31 Matt
Right. Yeah. And, too, we like to talk in terms of how many nines something is. Is it 99.9% available, 99.999,
00:21:44 Matt
four nines or five nines or three and a half?
00:21:48 Matt
And the thing is, each additional nine is 10 times the cost.
00:21:53 Matt
And there becomes a point where the reliability cost is
00:22:01 Matt
a lot greater than the benefit that you're going to get from it.
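For scale, the same arithmetic at each additional nine: the allowed downtime per 730-hour month shrinks tenfold with every nine, which is roughly where that cost-versus-benefit trade-off starts to bite.

```latex
\[
(1 - A) \times 730\ \text{h}:\quad
99\% \to 7.3\ \text{h},\quad
99.9\% \to 44\ \text{min},\quad
99.99\% \to 4.4\ \text{min},\quad
99.999\% \to 26\ \text{s}
\]
```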
00:22:06 Matt
Traditionally, in the past, these things have been managed separately. You know, IT managers and so forth have one objective that they've been given, and the business is managing to something else. But nowadays we have to architect all the way through the business to the technology
00:22:26 Matt
And understand what is realistic.
00:22:30 Nate
Yeah. To change gears here a little bit, Matt, I want to dive into a topic that we always talk about, especially with site reliability engineering: observability. I'm imagining that when these services started going down, outside of obviously seeing that something stopped functioning from the end user's perspective, there must be something behind the scenes
00:22:52 Nate
that gives some type of indicator that, hey, something's not working,
00:22:56 Nate
right? How do we go about managing that in the context of site reliability engineering, and what does that look like in these types of situations?
00:23:06 Matt
Well, I think, let's start with monitoring first. Monitoring is a big deal that, you know, I still see in many environments and with many clients is
00:23:18 Matt
still an afterthought. And, you know, from a monitoring perspective, in the case of Cloudflare, I would hope that they have monitoring that shows that services were failing and they could see that firing, and it was more a matter of trying to figure out why. But there's
00:23:38 Matt
also this concept of white box monitoring versus black box monitoring. White box is where you are actually looking at the service: you can see this bot service is actually running, and, well, it died, OK, so a monitor could
00:23:54 Matt
fire. But black box monitoring would actually be, say, a service deployed on a desktop someplace that goes and looks at a site that's fronted by Cloudflare from Grand Rapids, MI, and sees it can't get to it all of a sudden.
00:24:12 Matt
And it's pretending to be a user. It's called black box monitoring because it's engineered to not understand what's happening behind the scenes. It just knows that, as an end user, I can't get to it.
00:24:25 Matt
As site reliability engineers, we have to make sure that we have both of these angles covered, and that black box monitoring is especially critical. You see, what matters to the end user, and how do I test that, and test it continuously, like every
00:24:40 Matt
five minutes or so, to see whether an end user can access the service? But then the other piece, too, is you need those alerts to be meaningful.
00:24:53 Matt
And that’s that’s the other pitfall I see people walk into quite often as you deploy a new application or something like that, you turn on all the monitors and all the all the default alerts across the board. And I recently saw this, you know, over the last year.
00:25:14 Matt
With a client that we have that has an Internet enabled device that they were rolling out some updates to and.
00:25:22 Matt
Essentially, they have a whole bunch of, you know, alerts firing to a slack.
00:25:28 Matt
Channel.
00:25:29 Matt
And they get 1000 alerts coming in there any given week and the engineers are looking at this and and they’re they’re trying to figure out what’s going on. And and most of that traffic is.
00:25:41 Matt
Not useful, but every once in a while there’s going to be 1 item in there like you know, hey, Dynamo DB is getting to a point where we’re hitting the maximum traffic threshold and it’s going to start peeking out and that’s going to impact users.
00:25:59 Matt
Very easy to miss that one alert if you got 1000 a week coming into this slack channel. Plus, how do you know somebody’s handled it? So one of the big things is is, you know, focusing you got to make sure everything is monitored, but then you should only alert on the things that are actionable or end user impactful.
00:26:21 Matt
So if something is slow, or something is about to run out of capacity, or something has stopped, you know, those are things that should fire alerts. CPU usage should never fire an alert, because what are you going to do about that?
00:26:36 Matt
You don't know if that's actually impacting anybody. You know, it might be being used for a reason; we bought it, we're trying to use it. But a user having a slow response time or being unable to get to a site at all,
00:26:53 Matt
that's very impactful. So we want to fire alerts for those things and keep them focused. But then we also don't want to use things like a Slack channel as a way to funnel alerts.
00:27:03 Matt
Alerts should always be turned into tickets that, you know, page people, that they then have to take accountability for and take ownership of and then follow through and troubleshoot to the end and close. Otherwise they will be lost. And that way, that one alert that you're about to run out of capacity
00:27:25 Matt
is always going to be seen, and it's going to be owned, and somebody's going to do something with it.
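A toy version of that "alert only on actionable or user-impacting signals" rule. The signal types, thresholds, and `open_ticket` function are illustrative assumptions, not a particular monitoring product's API:

```rust
// Everything is still collected as telemetry; only actionable or
// user-impacting signals page anyone, and those become owned tickets.
#[derive(Debug)]
enum Signal {
    HighCpu { percent: f64 },            // telemetry only: not actionable by itself
    SlowResponse { p95_ms: u64 },        // user-impacting
    CapacityNearLimit { used_pct: f64 }, // actionable before users feel it
    ServiceDown { name: String },        // user-impacting
}

fn should_page(signal: &Signal) -> bool {
    match signal {
        Signal::HighCpu { .. } => false, // we bought the CPU to use it
        Signal::SlowResponse { p95_ms } => *p95_ms > 2_000,
        Signal::CapacityNearLimit { used_pct } => *used_pct > 85.0,
        Signal::ServiceDown { .. } => true,
    }
}

fn open_ticket(signal: &Signal) {
    // In practice this would create a ticket someone must own and close,
    // rather than dropping a message into a Slack channel.
    println!("ticket opened, owner required: {signal:?}");
}

fn main() {
    let signals = vec![
        Signal::HighCpu { percent: 92.0 },
        Signal::CapacityNearLimit { used_pct: 91.0 },
        Signal::ServiceDown { name: "checkout-api".into() },
    ];
    for s in &signals {
        if should_page(s) {
            open_ticket(s);
        } // non-paging signals still land in metrics and log storage
    }
}
```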
00:27:30 Matt
So, and that’s that’s a high level of some of the real big learning learning points I’ve had over the.
00:27:38 Matt
Years.
00:27:39 Matt
Here we do that all the time. That’s that’s how we’ve built managed services for years. But you know out there, you know, outside of Vervint there’s there’s a lot of cases where people are putting up these, putting up new applications and these are all new things.
00:27:58 Matt
And the alerts go into a big recycle bin and everybody’s trying to figure out what’s.
00:28:03 Matt
But.
00:28:04 Matt
So it’s it’s a very valid point to always focus on.
00:28:09 Nate
And it sounds like there’s a difference between.
00:28:13 Nate
Alerting on something versus collecting telemetry data, right? Like, maybe you want to log the use of a CPU over time, but not necessarily trigger when that gets higher. What you want to trigger on are things that somebody can actually take and run with, or an indicator that there's a stoppage of
00:28:31 Matt
services. Sure. I've always said,
00:28:33 Matt
for things like monitoring and metrics,
00:28:36 Matt
collect everything, but only alert on the things that are actionable or impactful. And,
00:28:47 Matt
You know, I’m. I’m adjusting that more recently because in in the cloud logging everything also has a cost. So you do have to.
00:28:58 Matt
You still do have to dial in the things that you monitor and and keep for research purposes, but you know, so log everything asterisk within the realm of feasibility of cost, but only alert on the things that are actionable.
00:29:17 Matt
Or impactful to the?
00:29:19 Matt
End user.
00:29:20 Nate
And how do organizations make some of those determinations? Because if we think of an application within the context of an SDLC, there are a lot of different people and a lot of different roles involved in that process that might be interested in different things. So how do you put actionable monitoring and logging
00:29:39 Nate
into the SDLC process for everyone that's involved, not just infrastructure, but also the developers or DevOps engineers or other people that are looking at different facets of the application?
00:29:50 Matt
Well, I mean that’s that’s kind of the role of a dev OPS engineer is to work with each of the different teams to find out what do you need, what do you need to see and help build some of those dashboards to put put that information together in the right place, the developer is going to care about tracing.
00:30:10 Matt
And things like that, they’re going to want to be able to to look at logs, to understand if an end user hit this website and they went from this service to this one to this one to this one, collect all of the different points and all the different logs across all the different services for that one transaction. And let me see it.
00:30:30 Matt
And they’re they’re not going to want to search hundreds of different separate logs or gigs of content.
00:30:38 Matt
And so that’s where, you know, creating a dashboard that pulls together traces. So a trace being a, you know, pulling together those pieces across all the different logs for that one transaction.
00:30:52 Matt
Is critical to a developer, so creating a dashboard for them, creating a dashboard for your SRE team that has to deal with issues, you know what are all the performance counters that we want to be able to look at. We don’t necessarily want to be able to alert on things like the amount of traffic latency.
00:31:12 Matt
Capacities etc. As well as you know what alerts have fired, what log entries are actually relevant. So different dashboards for different.
00:31:24 Matt
Teams, but the Dev OPS team, that’s one of their primary jobs is to kind of consult with all of those teams to come up with.
00:31:35 Matt
What’s best for them and build out those dashboards?
00:31:38 Nate
So, Matt, I imagine that, much like everything else in the tech world nowadays, machine learning and AI have probably had a pretty major impact on how we handle logging and alerting. What have you seen recently with developments in that space, and how has that transformed the way that
00:31:55 Nate
organizations respond to incidents and collect that information?
00:31:59 Matt
Well, there’s, I’m going to say there’s the marketing answer and there’s the real answer. So one of the biggest things that you know has has been making a shift in the past five years is the idea of AI OPS, where you have a tool that is sitting between.
00:32:20 Matt
Those monitoring metrics and your engineers or the ticketing system that they look at?
00:32:26 Matt
And what it’s doing is it’s trying to figure out help figure out what traffic, you know what, what alerts, what metrics and so forth are relevant. It’s trying to deduplicate, you know, when a service goes down. It’s one service that goes down, but 400 things that fire because of it.
00:32:47 Matt
And you don’t want 400 tickets, you want one and you know, it’s also attempting to prioritize. You know, what’s important versus what’s.
00:32:57 Matt
And to some extent, you know, there’s the dream of automatically resolving issues based off of what we learned.
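A toy version of the deduplication step: many alerts that share a suspected upstream cause collapse into one ticket. The grouping key here is an illustrative assumption, not how any particular AIOps product decides correlation:

```rust
// Many downstream alerts, one suspected root cause, one ticket.
use std::collections::HashMap;

struct Alert {
    source: String,     // what fired
    depends_on: String, // likely upstream cause
}

fn dedupe(alerts: Vec<Alert>) -> HashMap<String, Vec<String>> {
    let mut groups: HashMap<String, Vec<String>> = HashMap::new();
    for a in alerts {
        groups.entry(a.depends_on).or_default().push(a.source);
    }
    groups // one ticket per suspected root cause
}

fn main() {
    let alerts = vec![
        Alert { source: "checkout latency".into(), depends_on: "auth-db".into() },
        Alert { source: "login errors".into(), depends_on: "auth-db".into() },
        Alert { source: "search timeouts".into(), depends_on: "search-index".into() },
    ];
    for (root, explained) in dedupe(alerts) {
        println!("1 ticket for {root}, explaining {} alerts: {:?}", explained.len(), explained);
    }
}
```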
00:33:05 Matt
So the thing with AIOps is,
00:33:09 Matt
you know, there's the promise of completely automating this troubleshooting workflow and reducing ticket volume astronomically. But it requires dedication to constantly tuning those rules and helping it understand, helping the machine learning algorithms
00:33:29 Matt
understand what data was actually relevant in X, Y, and Z.
00:33:36 Matt
The most powerful thing for a site reliability engineer in preventing failure is having failures.
00:33:46 Matt
You know, when something goes down, using that as an opportunity to learn how to prevent things from going down. And in AIOps, you need to use those failures as a way to train the system to detect and prevent it: you know, what capacities were getting used up, what indicators happened beforehand, and trigger
00:34:06 Matt
something before those issues go off. You also need to help it understand what things have been relevant versus not.
00:34:16 Matt
Then, you know, the reason why I say there's the marketing answer and the real answer is because a lot of organizations focus on simply getting the tool implemented rather than getting the value out of the tool. So they'll spend most of their time actually focusing on how many services or systems or applications
00:34:35 Matt
we have covered, and the project's success is based on, you know, what percentage of things are now covered by this tool,
00:34:45 Matt
whereas you've actually bought this thing to try to reduce your number of incidents over time. So if a project is successful, then the number of incidents needs to go down. Who cares how many things you've covered? What matters is how many incidents have been reduced. Maybe that only means 10% of your environment is
00:35:06 Matt
covered, but
00:35:07 Matt
it's the noisiest part of your environment, and you've reduced the noise by, you know, tenfold. So
00:35:16 Matt
that's one of the biggest things that's evolved over the past five years. But I see a lot of organizations have a challenge really implementing it and taking advantage of it. It just becomes the new, fancier, more expensive monitoring system,
00:35:31 Matt
and the traffic and work haven't actually been reduced, because we're not really focusing on the right things and trying to leverage it.
00:35:39 Nate
Yeah. I mean, it sounds like, really, for AI to have a successful implementation in this area, it's not just a matter of having the right tooling to support it; it's really having the people to support it. I know here at Vervint,
00:35:50 Nate
we obviously have a team that's dedicated just to data and AI.
00:35:53 Nate
That's all that team does and what they specialize in. It sounds like maybe in the context of monitoring you have to have something similar, to an extent, where, as you are running through, whether it be simulations or, you know, real-world events, when these failures come up, you have to have resources that are dedicated to that purpose. And that's often the investment that maybe gets missed
00:36:14 Nate
when organizations go to implement AI and adopt that in this area.
00:36:17 Matt
Yeah. And one of the things that we're doing: a number of organizations have
00:36:22 Matt
limited
00:36:23 Matt
AI in ticketing systems, for instance, where, when a ticket gets generated for an alert or something like that, it includes, you know, a best guess of, here's how you're probably going to need to resolve it, or what we're going to need to do to fix it. And we have that as well. But one of the things
00:36:43 Matt
we're doing is, in the ticketing system, you know, engineers, when they close that ticket, are going to be able to rate the relevance of what the AI put in the ticket.
00:36:54 Matt
And then that will feed back automatically into training, or fine-tuning, not training; it'll feed back into fine-tuning for that process to, you know, help it understand what it got right and what it didn't, and also help us refine what data goes into it, but automatically.
00:37:14 Matt
And so everybody's got to be involved in helping the AI understand
00:37:20 Matt
what's going to be of value in the process.
00:37:25 Nate
So.
00:37:26 Nate
Looking back at kind of everything that we've talked about, we started this conversation talking about some of the more major failures within cloud services. If you were in that position as the SRE in that environment, whether it be for someone that's hosting their application with these services, or maybe even in those organizations themselves that were
00:37:47 Nate
providing the service, is there anything that maybe you would have done differently at all, or anything that could have been done differently, to make the situation better?
00:37:54 Matt
Well, I think,
00:37:58 Matt
from two perspectives: if I am a customer of Cloudflare and I'm running a web application that's fronted by it, first of all, I'm going to start measuring my different service providers to see, you know, what effective error budget am I getting from them, and is that
00:38:18 Matt
fitting within my allowed error budget, and is it forecasted to go up or down?
00:38:25 Matt
But also I’m going to try to figure out some ways that I might be able to conceivably work around it. One of the things with a CDN is if you own the client. So if you have, you know, a mobile app.
00:38:40 Matt
That is, you know, like, say, nike.com, they have the Nike app.
00:38:45 Matt
Right. Well, the Nike app should be able to have logic that says if it can’t get to the APIs and the website that it’s trying to reach to have some form of a method to fail over to, you know what’s back at home base. You know beyond the CDN.
00:39:06 Matt
Or, you know, selectively decide what things it’s not going to go get. Maybe it doesn’t get 10 pictures of a shoe it gets.
00:39:14 Matt
1.
00:39:16 Matt
And it gets it from a central space rather than from the CDN. So if you own the client, you know you have the ability to work around some of those things. Now as the the other perspective is from the providers, from cloud flare or so forth is the thing that I see that’s the disturbing trend that I hope reverses.
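A sketch of that own-the-client fallback idea. The URLs and the `fetch` stand-in are hypothetical; a real app would use its normal HTTP client with timeouts and decide per asset what to degrade:

```rust
// If the CDN is unreachable, degrade gracefully: fall back to a leaner
// response served straight from the origin (one hero image instead of ten).
fn fetch(url: &str) -> Result<Vec<u8>, String> {
    // Placeholder: a real client would issue an HTTPS request with a timeout.
    Err(format!("{url} unreachable"))
}

fn load_product_images(product: &str) -> Vec<Vec<u8>> {
    let cdn_url = format!("https://cdn.example.com/{product}/gallery");
    match fetch(&cdn_url) {
        Ok(bytes) => vec![bytes],
        Err(_) => {
            // CDN is down: fail over to home base rather than failing the screen.
            let origin_url = format!("https://origin.example.com/{product}/hero.jpg");
            fetch(&origin_url).map(|b| vec![b]).unwrap_or_default()
        }
    }
}

fn main() {
    let images = load_product_images("air-max");
    println!("loaded {} image(s)", images.len());
}
```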
00:39:36 Matt
Now, the other perspective is from the providers, from Cloudflare and so forth. The thing that I see that's the disturbing trend that I hope reverses is globally applying changes
00:39:40 Matt
and having them roll out so fast that the disaster is not prevented.
00:39:50 Matt
You know, in the case of Cloudflare, it's a little tough, because they are trying to respond to real-time threats, and so they're trying to roll that data out as quickly as they can. You know, because, say, a bot service out there has
00:40:08 Matt
come up with some new code so that they can emulate random mouse movements on a website that they're trying to scrape, and that fools the system into thinking it's not a bot, and they then build in detection for that signature. Well, during that time, before your protection is in place,
00:40:28 Matt
websites are getting hit by that bot service now. So,
00:40:33 Matt
you know, there's still a way to roll out in a staggered manner so that we don't hit the entire globe at one time. And I think that's one of the biggest things, as an SRE, that I see as a failure that really has to be answered for.
00:40:53 Matt
And the other issue is most of these organizations, as are we, are switching to very heavily AI-driven development processes and AI-driven automation. So, you know, some of these processes might not even have a human babysitting them.
00:41:13 Matt
It might go from something being recognized to a change being rolled out to the entire environment within minutes,
00:41:20 Matt
which is great from the perspective of protecting you quicker, until it doesn't work, and then you have a problem.
00:41:31 Matt
And getting into this process of using AI to do development and AI to do automation on its own is a new space. We're going to have problems; there are going to be things that fail and go wrong, and
00:41:45 Matt
we need to build that into the process, because there's always going to be learning any time that we make a change.
00:41:51 Nate
So we talked a little bit about AI. What else do you see as being the next big thing as we start talking about the reliability of these cloud services and observability? What do you think is next? And as a follow-up to that, what would your advice be for people that are in the SRE role, or architects, or organization leaders that are trying to adapt
00:42:10 Nate
around what's going to be next?
00:42:13 Matt
Well, yeah, my
00:42:17 Matt
brain is going to go to a pretty broad space there. I'll focus in on a couple of things. One is we're all relying on AI services, like, you know, just your simple ChatGPT and stuff like that. But also things like, you know, we rely pretty heavily on Claude Code and those kinds of tools.
00:42:36 Matt
Those services are very heavily dependent on services like Cloudflare, because they're very chatty, and in order to get responsiveness in the user interface, they have to use services like Cloudflare or, you know, AWS's equivalent and so forth
00:42:57 Matt
to increase that performance.
00:43:00 Matt
That means that a lot more of our workflow in
00:43:03 Matt
the day-to-day
00:43:05 Matt
is going to be vulnerable to some of these services having a problem. You know, I'll tell you, when I write a significantly sized document these days,
00:43:19 Matt
I don't write it anymore. I write an outline, and I do all the research, and I say, I want to say this, this, this, and this.
00:43:26 Matt
Here's my past writing; please fill in the grammar for me. And I'm going to have AI write it, and then I'm going to rewrite it from there. It's a long way from the old days of just sitting down with a word processor and writing something.
00:43:43 Matt
But then, if that stuff's not available,
00:43:46 Matt
that becomes a big impact. And there's going to be a point in time over the next few years where,
00:43:51 Matt
you know, just like years ago everybody used typewriters and eventually moved to, you know, WordPerfect (I'm old), which changed the way that we did all the work that we do, and now, you know, a network cable in the closet
00:44:13 Matt
getting cut is the end of the world, whereas it wasn't before, now we're moving from doing a lot of manual work to having AI do a lot of the, you know, content creation and business process stuff that we do.
00:44:29 Matt
The day that service becomes unavailable, everything can stop, and stop in ways you didn't know to expect before.
00:44:38 Matt
You know, now emails aren't getting sent, meetings aren't getting transcribed, you know,
00:44:47 Matt
the stuff that we need to do to get our work done is just not happening.
00:44:51 Matt
So that's the thing that I see coming.
00:44:55 Matt
Eventually we'll figure out what those pain points are and we'll work around it, we'll engineer around it, but there are going to be pain points.
00:45:04 Nate
Well, Matt, thanks so much for your time and for all the insight on all these topics. If you have any questions out there, we're Vervint; this is the stuff that we do, this is the space that we live in, and we would love to hear feedback on anything that we've talked about. Or, if you have ideas for future topics to talk about on this podcast, feel free to reach out to us. We'd love to get your commentary
00:45:25 Nate
and hear your feedback. And, yeah,
00:45:28 Nate
Thanks again, Matt.
00:45:28 Matt
Fantastic. Well, thanks for having me.