Episode 48: Automation Infrastructure with Jim VanderMey and Allen Derusha – Part 1

In this episode:

In this episode of Ten Thousand Feet, CIO Jim VanderMey and Sr. Engineer Allen Derusha talk about automation infrastructure. Learn more about Allen’s day-to-day responsibilities. VanderMey and Derusha chat about security and latency in the cloud. We’ll also examine the question of making a cloud-like atmosphere for private workloads. Enjoy the episode!

This podcast content was created prior to our rebrand and may contain references to our previous name (OST) and brand elements. Although our brand has changed, the information shared continues to be relevant and valuable.

Episode Transcript

Kiran: Welcome to 10,000 feet, the OST podcast, I’m Kiran Patel. And for today’s episode, we have OST CIO, Jim VanderMey, who will be having a conversation with Allen Derusha about automation infrastructure. Enjoy the episode.

Jim: So I’m Jim VanderMey, I’m the chief innovation officer here at OST, and I have one of our oldest employees– longest term employees.

Allen Derusha. Allen, would you like to introduce yourself?

Allen: Hey, Jim, thanks for joining me today here. My name is Allen Derusha. I work as a solution architect, kind of, within OST and just a technology-driven sort of data center guy. And so I focus on the realities of operating a data center full of expensive and hard-to-configure equipment and just how people deal with that at scale and in our customer environments.

Jim: Allen joined us years ago when we made the shift from proprietary Unix systems into the Intel x86 space with the HP C-Class blade enclosures in the early two thousands. And you’ve been living in the data center space since then. What is your current gig at OST?

Allen: So currently I’m engaged with one of our customers that’s a large financial organization and who runs a pretty major international trading platform. So they’re a stock exchange that, I’ll just say, you’ve heard of. They have recently purchased a market data broker company which is also a large Wall Street presence, I’ll just say. They operate in 500 different points of presence, tracking market data for 2500 different exchanges around the world. So they’re exchanges that you’re thinking of right there, the New York Stock Exchange and, you know, stuff like that. But it’s also the Chicago mercantile board of trade and, you know, futures trading for cattle in Thailand and all over the world. So just think of that. Just any sort of place where commodities, goods, and services are sold at a price, and we track those prices. And then that information is bottled up and sold to people who would trade or make investment decisions based on that sort of pricing.

So the realities of operating in that environment: it’s heavily regulated, right? So we’ve got a lot of regulators breathing down your neck pretty regularly because billions to trillions of dollars are transacting. It’s also a target, presumably, for those who would be interested in the movement of a trillion dollars of things. So security is obviously a major concern. Scale obviously is a big deal, but for us, and this is something that’s a little bit unique in this customer environment that I haven’t experienced in other places, latency is a huge issue for us, right?

So we need to be able to turn that data around. You know, as an example, during market open, you know, globally market open, kind of in the middle of the day in the United States, we’ll see somewhere on the order of about 85 million price changes per second that are happening around the world. Those are communicated to customers with an SLA of 10 milliseconds; 98% of those hit within one millisecond. So between someone making a trade somewhere and you being able to respond to that information is somewhere around a thousandth of a second, depending on where you are in the world. So it’s high bandwidth, low latency, hundreds of thousands of systems under management spread across the world in a very tightly regulated and highly secure environment.
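Editor's sketch: the latency targets Allen quotes can be written down as a simple check over measured delivery times. This is only an illustration in Python. The function names and the sample data are made up, and reading the SLA as "every delivery within 10 ms, at least 98% within 1 ms" is an assumption, not something the episode spells out.

```python
from typing import Sequence


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the share of samples delivered at or under the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    hits = sum(1 for value in latencies_ms if value <= threshold_ms)
    return hits / len(latencies_ms)


def meets_targets(
    latencies_ms: Sequence[float],
    sla_ms: float = 10.0,      # figure quoted in the episode
    fast_ms: float = 1.0,      # figure quoted in the episode
    fast_share: float = 0.98,  # figure quoted in the episode
) -> bool:
    """Assumed reading: all samples within the SLA, 98% within the fast bound."""
    return (
        fraction_within(latencies_ms, sla_ms) == 1.0
        and fraction_within(latencies_ms, fast_ms) >= fast_share
    )


# Made-up sample: 98 fast deliveries plus 2 slower ones that still beat 10 ms.
sample = [0.4] * 98 + [3.0, 7.5]
print(meets_targets(sample))
```

At the quoted 85 million price changes per second, a check like this would run over sampled or aggregated measurements rather than every individual event.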

Jim: So you are currently serving as a, if I recall correctly, a virtualization architect.

Allen: That’s correct.

Jim: And some of those systems are in the U.S., and some of them are clearly in global locations as well. And we hear a lot about moving to the cloud, and these workloads, by virtue of security, scale, predictable SLA, and latency, aren’t going into the public cloud as of right now.

Allen: As of right now. Right. And of that list that you put together, I think number three is by far the most important. So it’s the latency question that, right now, and I say that with a huge asterisk behind it, right now the major public cloud vendors don’t have a great answer for. That is probably going to change soon.

And there’s been a lot of traction from both Microsoft and Amazon looking at this problem. And, you know, they’re certainly very interested in providing low latency services to their customers. And so I think it’s reasonable to expect at some point in the future, you know, public cloud might be an option for an organization like this for those workloads, which for the other reasons, you know, are able to go into the public cloud.

So, you know, latency is a huge issue for us, as I’ve already said, and that’s a problem in the public cloud. Yeah, this is an organization that’s been around for a while, right? You know, they’ve got a hundred years’ worth of legacy, and data centers that go back decades and that sort of thing, like a lot of our customers.

And so, okay, yes, this is somewhat of a rare situation where they’re, you know, this huge financial thing, but legacy systems are a problem that almost all of our customers have to deal with at some point, and these guys are no different. So yes, I’m virtualizing x86 workloads, but I’m also currently looking at a Digital Equipment Alpha VMS emulator that we’re evaluating for some 20-plus-year-old systems that, you know, we have to keep running for some part of it.

So there’s always these workloads, which just were created without the cloud in mind.

Jim: No, I’m not, I’m a little older than you. So you’re not suggesting that I dust off my OpenVMS skills to help you out in this, are you?

Allen: You know, it’s a billable engagement. And so we’d love to have you, if you could join us right now.

So I want to be careful here, right? There’s a lot of reasons that some of these workloads are not in the cloud and some of them are just, you know, what the cloud provides. But I think a lot of our customers are facing this problem. They have workloads that, for one reason or another, are likely going to stay in their data center. Right. And so, all right: how do we operate it? How do we make a cloud-like experience for our internal customers?

Jim: And I think that’s the most important quality that I wanted to talk about, because you and I had lunch about a month or so ago, and we were talking about this work. There’s always a constraint that keeps a workload in the data center, in this case the SLAs of a shared public infrastructure not meeting the needs, but that does not mean that you’re going to lose the qualities of public cloud inside of your server estate. You can build that in. You can build automation and flexibility and agility and speed of deployment and the ability to stand up and tear down workloads. So this server environment that you’re describing is highly heterogeneous.

Allen: Correct. In many parts on purpose. You know, I think that’s a good observation. So I think any legacy data center, right? Anyone that’s operating a data center today and has been for a while, they’re heterogeneous just by nature of having been around for 20 years or 30 years or whatever else, unless they are extremely disciplined or maybe on a lease cycle where they are rotating systems in and out every three to five years, which the people who sell you the server systems suggest you should do. So unless they’re doing that, which let’s be honest most people aren’t.

Then they have these workloads kicking around, right? So this is a reality that I think comes for everybody. And historically, a lot of the data center technologies were focused around this notion that, you know, you own and operate all the pieces. All the complexity is surfaced to the organization that owns it. Right. And so if I’m going to deploy an enterprise application, I’ll just pick something randomly: SAP. You know, it’s a very popular, big fat application. You’ll spend millions of dollars on the software. You’ll spend millions of dollars on the hardware that supports it. You’ll spend another million dollars on the people, you know, the swarm of consultants that will show up.

We don’t do SAP consulting, to my knowledge. I don’t know.

Jim: We don’t.

Allen: All right. So I’m not advertising our services here, but there’s, there has never been an SAP put in without a zillion consultants showing up, right? So you have all this stuff going on and they’re going to highly tune this environment for your specific needs that will meet– that will work in your data center.

And if they don’t do that, fire those guys and get somebody that will do that. And so once they’ve done all this, you’ve got a highly tuned, optimized, I’ll say hand-optimized in most cases, environment, right? Where a consultant, you know, she sat in front of the keyboard and, you know, made that tweak.

Jim: Made all the configurations.

Allen: Yeah. Right. And so that’s why you hired her to come in and do that thing. That’s where these consultants come from. However, that doesn’t always translate well into the cloud. Right? All those knobs and buttons aren’t always available to you. So you have this kind of surface level interface that you’re able to interact with. And you’ve got to kind of work within that.

So now you have this legacy of highly hand-optimized systems that you paid a truckload of money for. Like, you hired that consultant and she did what you asked her to do. And now you’re trying to think of, okay, yeah, one step is transitioning that into cloud, right? I don’t have all that capability there; I can’t dig to that layer necessarily. I might not want to. Maybe that cloud vendor has their own staff of smart people that are going to do that for you, or maybe not, depending on the application. But when we go into this concept that I want to make a cloud-like environment for my on-premises data center customers, for my internal customers, how do we marry these two things?

This notion that we might want highly tuned, a highly optimized hand tuned application deployments, but we also want to automate and deploy these at scale, right? Like we’ve got these conflicting set of requirements. We’ve got very hands-on one by one operations. And then we have this need to bring this out to a hundred thousand systems spread across the world in 500 different data centers.

Jim: And being supported by Cisco, IBM, HP...

Allen: Dell, white box, Supermicro, right? Whoever it might be, server manufacturers. Yeah. And then Pure, and, like, all your storage manufacturers and all your backup manufacturers and your anti-virus agents and your security, like, yeah.

Jim: Yeah. And this is the piece that I latched onto in our conversation: if you believe that change is inevitable, and if you believe that we have to create data centers that are responsive to change, some of it is vendor change, because I can say that Dell might be the vendor of choice right now, but it might be HP next time. It might be Supermicro the next time, but I don’t want to have a discrete management process for every single brand of technology in my environment. And so we then have to create an abstraction tier, don’t we? For managing this highly varied server estate.

Allen: Yeah. Yeah. So, you know, in the case of that SAP application that we kind of were talking about, applications, I think that people understand that kind of stack. And so we have all this legacy approach to doing things that we’re trying to shoehorn into this cloud-like experience.

So then there’s these parallels that can be drawn to the underlying infrastructure that supports all this. So, you know, in the cloud, if I go to Amazon Web Services and I upload a Linux image, like, what physically is that going to be running on? Well, the answer is, it depends. And also they’re not going to tell you, so it doesn’t matter.

It doesn’t matter. It’s just x86 compute. In the case of Amazon and Facebook and Google and Microsoft, the answer to that really is Amazon, Google, Facebook, and Microsoft; that’s who designed that hardware. They make their own server systems. I can’t do that. My customers are big, but they’re not that big.

They’re not going to design their own hardware. Right.

Jim: You don’t even know what chip set. You might know the chip set.

Allen: In some cases it might not even be x86. Right. So Amazon’s been going long on Arm, as is Microsoft, right? Amazon bought an Arm license. And so they’re cranking out their own stuff.

So yeah, it’s all over the map, but you as a customer don’t care, right? So that’s the service that Amazon is offering to you: they’re operating all of this stuff. You don’t know what it is. You don’t care. Sometimes you’ll see the abstractions sneak through. So it used to be, if you go into AWS and you set up a load balancer, you would see things that looked a lot like a Citrix NetScaler. It turns out that in fact that was what they used to do. Right. My understanding is they’ve since replaced it, but for the most part, there are no leaky abstractions.

When you ask for a service and you interact with a service at one of those cloud providers, you have a very tight wrapper around whatever it is they’ve got in the back end. You have very little control or knowledge of what’s sitting in the back end. And that is a huge benefit to those who are operating it from the front end, for the customer, because when maybe a storage migration happens, or they swap out some load balancers or whatever else, I don’t care, because the interface being presented to me hasn’t changed.

All right. So let’s take that back into our infrastructure world. So when you buy x86, you know, we’ve kind of gone through that list: HP, Dell, Lenovo, Cisco, Supermicro, probably the major ones you’re going to find, probably in that order. So, you know, in my current customer, it’s a mix of HP and Dell.

I will say they do that strategically, just because if you get yourself buying everything from one vendor, we all know what that relationship can turn into. If the vendor believes they’ve got you locked in, behaviors are going to change. And so I will say in many environments, this isn’t the result of just legacy, but it’s the result of strategic decision making in their purchasing process that they’re going to have multiple vendors on the floor.

So let’s talk a little bit about kind of the history of configurable, automatable, what HP calls composable, hardware. You mentioned earlier in this conversation the HP blades. And when I came to OST, that was wrapped around that product offering. It was back in the p-Class, I think, when we first started here, and there was a generation before that. What blade offerings did for the customer, or at least what they proposed to do, was to create something of an abstraction between your compute and the resources which compute relied on, namely storage and network. Right. And so I could separate those parts out and then kind of mix and match to a degree. Oftentimes, I won’t say through code; early on, their APIs were not present. You had to log into a console and click around on things, and it was Java or ActiveX, right? Like, this is a while ago. Right. But you did those things and you could configure this thing in a certain way. And I could recreate that with a few more mouse clicks. And what that allowed me to do is to purchase a piece of equipment from HP.

I put an HP blade in. I could buy an HP blade, slot that in. And I could, instead of walking up to that server and physically moving cables around, I could log into a web interface and I could say, okay, you know, this port, I want connected to this thing over here, kind of a very limited capacity, but I could do stuff like that.

Cisco came out with some great tools around that and their own blade offerings. Dell’s done the same, that sort of thing. And none of them are compatible with each other in any reasonable way. So as soon as I bring that Dell hardware into my environment, I now have a completely different set of procedures which I have to go through in order to do those same tasks that I did with the HP hardware.

So if I want to connect this network port to this switch on this uplink, I log into a web interface, but it’s a different set of credentials. It’s a different web interface, different versioning, and all the rest of it. Okay. We can live with that. Right.

Jim: Well, but variation, process variation and technical variation, is a huge source of risk.

Allen: Yeah. It’s all sorts of problems that come with it. So, you know, if you looked at it 15 years ago, this wasn’t too bad because everything was managed by logging into a web interface and clicking on stuff. It’s just how this stuff worked. Well, HP and others (Dell and Cisco all followed suit on this) figured out that it would be better if there was an API to this. It would be better if I needed to configure a hundred of these systems, and many of their customers were at that scale, or thousands or more.

That you could either log into a hundred different web interfaces and click through a, you know, a written procedure a hundred times, or man, I wish I had an API for this. Like, as soon as I do something more than three or four times, I’m going to start coding to it. I’m going to reach for automation as soon as a task looks annoying to me.
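Editor's sketch: the difference between clicking through a hundred web interfaces and "coding to it" is roughly a loop over an inventory. The Python below is a minimal illustration. `configure_enclosure` is a hypothetical stand-in for whatever vendor API call would do the real work, and the host names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def configure_enclosure(host: str) -> str:
    """Hypothetical stand-in for one vendor API call against one enclosure."""
    # A real version would call the vendor's documented API here.
    return f"{host}: configured"


def configure_all(hosts, task=configure_enclosure, workers=8):
    """Run the same task against every host, keeping failures separate."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {host: pool.submit(task, host) for host in hosts}
    # Leaving the "with" block waits for every submitted task to finish.
    for host, future in futures.items():
        try:
            results[host] = future.result()
        except Exception as exc:  # one bad host should not stop the rest
            failures[host] = repr(exc)
    return results, failures


hosts = [f"enclosure-{n:03d}" for n in range(1, 101)]
results, failures = configure_all(hosts)
print(len(results), "configured,", len(failures), "failed")
```

Collecting failures per host, rather than stopping at the first error, is one reasonable design choice when the same written procedure has to run a hundred times.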

Jim: So, I want to highlight that though, because this was a really important pivot for you professionally and not all of our clients have made this pivot. I know a number of server and storage administrators that will say that they prefer clicking through. They want to do the discrete actions of managing through the existing tooling, because they’ll say, well, I’ll automate something if I have to do it enough times that it becomes annoying, versus saying I’ll automate it first because I know I’m going to have to do this again. And I know that the repeatable, predictable processes are what’s going to manage the reliability of my infrastructure. And that was a really important pivot for you.

I remember when you started doing PowerShell programming at one of our large insurance company clients. And you made that move from clicking on the interface to doing the administration of the environment through PowerShell.

Allen: Yeah. I think that’s a great example of this kind of situation. So I’m going to kind of sideline stepping away from infrastructure for a moment.

First of all, I’m just going to put the blame for this squarely on Microsoft Windows. So Microsoft Windows has raised a generation of system administrators that are used to connecting into a system as a, you know, as a user account, interactively on a desktop and clicking their way through to whatever solution that they’re, whatever problem they’re trying to solve, and the trouble with that is it doesn’t scale.

Like, and so that’s fine if you want to do it once, not so fine if you want to do it a hundred times. Intractable without a team of people if you need to do it a hundred thousand times; then you start running into just limits as to what an individual can do.

So, you know, yes, yeah. Around that automation. And I think you bring up a very good point in this, and it’s a common, logical approach. I need to do a thing. I want to do it in the least amount of time possible. And if I only need to do it, you know, three or four times, and it takes me five minutes per and it’s gonna take me an hour to write an automation, then why would I go through that whole process of automating a thing and have it take longer for me?
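Editor's sketch: the trade-off Allen describes is simple arithmetic. Using his figures (five minutes per manual run, an hour to write the automation), the Python below works out where the two approaches cost the same. The function names are illustrative, and treating an automated run as costing zero minutes is a simplifying assumption.

```python
import math


def manual_minutes(runs: int, minutes_per_run: float) -> float:
    """Total hands-on time if every run is done by clicking through."""
    return runs * minutes_per_run


def break_even_runs(automation_minutes: float, minutes_per_run: float,
                    automated_minutes_per_run: float = 0.0) -> int:
    """Smallest run count at which writing the automation has paid for itself."""
    saved_per_run = minutes_per_run - automated_minutes_per_run
    if saved_per_run <= 0:
        raise ValueError("automation must save time on each run")
    return math.ceil(automation_minutes / saved_per_run)


print(manual_minutes(4, 5))            # four manual runs: 20 minutes
print(break_even_runs(60, 5))          # automation pays off at 12 runs
print(manual_minutes(100_000, 5) / 60 / 24)  # days of clicking at fleet scale
```

At three or four runs the manual route wins, which is the "common, logical approach" in the conversation. At a hundred thousand runs the manual total is roughly 347 days of uninterrupted work.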

And. Yeah, I gotta tell you, well, I don’t know about you, but like, I do a lot of stuff at home, you know, like I’m a tinkerer type thing and I will automate the stupidest things in the world, right? Like just stuff that takes me three seconds. I will automate to take one second and I will spend days working on this problem. Right. It’s pathological. And I think–

Jim: You do. And having seen it– exactly.

Allen: I’ll spend months so I have to only push one button on my coffee pot instead of one button followed by a second button. So I appreciate that this isn’t always logical behavior. However, it is going to play a huge role when you’re operating at scale. I don’t care how many times you’re recreating a thing. And then I think as we get into this a little deeper, there are going to be places where, you know, what I would generally recommend is that on any production server, Windows, Linux, or otherwise, it’s ideal if no human ever interacted with that individual node directly, ever, right?

Better if you had a way to automate your jobs through any sort of mechanism, right? Some way to take a batch process and send it to that system, let it execute that process, and send you any sort of information, logging, and results back. And then if there’s a fault, well, you should have caught that in your dev/test cycle, but maybe you didn’t, and that sort of thing.

So maybe you log in to do an investigation, or if your automation game’s on point, you just shoot that system in the head, get rid of it, redeploy, because that takes you seconds because you’ve automated all of that too. And then let it go on about its day. So there are huge gains in automation that come from the troubleshooting process, because if I know that I’ve got whatever the system is fully automated, soup to nuts, to bring it back into production, then I spend a lot less time troubleshooting it. ’Cause why would I even bother? If this thing’s goofy, I’m just going to nuke it and then bring it back so it can join its 10,000 buddies and go on about its way.
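Editor's sketch: "nuke it and bring it back" can be pictured as a reconcile pass that replaces unhealthy nodes instead of repairing them. The Python below is a toy model under stated assumptions: `Node`, `redeploy`, and the `generation` counter are hypothetical, and a real rebuild would call whatever automated deployment the environment already has.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Node:
    name: str
    healthy: bool = True
    generation: int = 1  # how many times this slot has been deployed


def redeploy(node: Node) -> Node:
    """Hypothetical: rebuild the node from a known-good automated deployment."""
    return Node(name=node.name, healthy=True, generation=node.generation + 1)


def reconcile(fleet: Iterable[Node],
              rebuild: Callable[[Node], Node] = redeploy) -> List[Node]:
    """Keep healthy nodes as they are; replace unhealthy ones outright."""
    return [node if node.healthy else rebuild(node) for node in fleet]


fleet = [Node("web-01"), Node("web-02", healthy=False), Node("web-03")]
fleet = reconcile(fleet)
print([(n.name, n.healthy, n.generation) for n in fleet])
```

Nothing in this pass inspects why a node went bad. That matches the point in the conversation: when redeploying takes seconds, diagnosis becomes optional rather than the first step.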

Jim: And so it changes your approach to troubleshooting, because instead of trying to go in and diagnose through a set of diagnostic processes, you’re looking at logs, you’re looking at what happened. You’re saying, oh, I’m going to reset this back to a known good state. And I’m going to let this start from there.

Allen: Yeah. Or maybe sideline it, like put it into a sandbox. So if I need to do an analysis I can do that. And then just bring up a second copy so I can keep production running or, you know, whatever. Right. There’s a lot of ways to skin that one, but you know, that’s the magic of automation. It’s not what I’m doing today on these next five hosts I need to touch, but it’s my ability to recreate that until the end of time, across a hundred thousand hosts whenever I should.

And then there’s all these knock-on benefits that come off the other end of it. But that’s all just talking about, okay, you know, a Windows system as an example of automation. But if we go back, now we’ve got this blade enclosure. We bought it from HP in this example, right? Doesn’t matter who it was. And so they’ve got a way to configure it, and now they’re going to give me a way to automate it. So if you go to basically all of these vendors, and I’ve worked with the APIs for Dell, HP, and Cisco, I will say hats off to all three organizations. They all have extremely well-documented APIs.

They can do it from REST. All three of them have PowerShell systems. They all have Python bindings. They have bindings into configuration tools like Ansible and Terraform and the others. So great tooling, good job from all these dudes. So I can create a– I’ll just use PowerShell, something I’m most familiar with. I’ll make a PowerShell script in order to perform some task on an HP blade enclosure.

Great job, Allen. It does whatever it’s supposed to do. I can run it. I can execute it over and over again. And I can handle all of my HP stuff. Now we buy some Dell kit and put that on the floor. It’s competitive, you know, it’s great stuff. Love Dell hardware. And so we do that. Now I want to perform that task over again, except all the tooling is different.

There is an API, but it’s a completely different API. There are PowerShell modules, but they are different modules and they perform different actions in different orders. Constructs that exist in one environment might just not even be there at all in the other. The very logical notion of the task that you’re trying to accomplish, the workflow, might be completely different, or those management constructs might not even exist in this other system.

And so now, okay, great, good job. We’ve automated our way out of this problem. And we can handle this one estate for all of those things, which happened to be HP blades. But I’m going to buy a bunch of Dell rack mounts and put those in. So how does this help me out there? Well, chances are not at all.

You know, I’ve done the legwork. I understand at a conceptual level what this task should look like from inputs and outputs, insofar as that aligns with HP. So I’ve got a map of what I want to do when I want to translate that over to Dell, but I have to recreate the entire journey. So, okay, we’ve automated something, but we still have this problem when I go back into this heterogeneous environment, which again, I recommend. I think it’s a good idea to have tooling from multiple vendors in the environment. There’s a lot of upsides to doing that, but there’s this one major downside. If I do something that applies to one system, it may not apply at all to some other system because they’re just two different vendors.

Jim: And so as a result, you end up with vendor lock-in, or resistance to introducing a second vendor, by virtue of the automation stack that you’re using to manage the server. And so you’ve successfully abstracted the access of the system administrator to the server by using management tools, but you have not abstracted the task from the different vendor implementations. So now we’ve got to abstract the tasks through a potentially open-source tool that might work across multiple manufacturers.
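Editor's sketch: abstracting the task from the vendor implementation usually means one task-level interface with a per-vendor driver behind it. The Python below is a minimal, hypothetical illustration. The driver classes, the `set_boot_device` task, and the strings they return are invented for this example and do not correspond to any real HP or Dell API.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple


class ServerDriver(ABC):
    """Task-level interface the operator codes against."""

    @abstractmethod
    def set_boot_device(self, host: str, device: str) -> str:
        ...


class VendorADriver(ServerDriver):
    def set_boot_device(self, host: str, device: str) -> str:
        # Hypothetical vendor workflow: a single call does the job.
        return f"vendor-a {host}: boot={device}"


class VendorBDriver(ServerDriver):
    def set_boot_device(self, host: str, device: str) -> str:
        # Hypothetical vendor workflow: stage the change, then apply it.
        staged = f"stage boot={device}"
        return f"vendor-b {host}: {staged}; apply"


DRIVERS: Dict[str, ServerDriver] = {
    "vendor-a": VendorADriver(),
    "vendor-b": VendorBDriver(),
}


def set_boot_device(inventory: Iterable[Tuple[str, str]], device: str) -> List[str]:
    """Apply one task to a mixed fleet; each host is paired with its vendor."""
    return [DRIVERS[vendor].set_boot_device(host, device)
            for host, vendor in inventory]


mixed_fleet = [("blade-01", "vendor-a"), ("rack-07", "vendor-b")]
for line in set_boot_device(mixed_fleet, "pxe"):
    print(line)
```

The caller never sees that one vendor needs two steps and the other needs one. Adding a third manufacturer means writing one more driver, not rewriting every script.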

Allen: It’s almost like that was on purpose. So HP has provided some great tooling and, you know, weirdly it only talks to HPE stuff. Right. And Dell made some great software.

Jim: And it’s licensed.

Allen: And it’s licensed. Yeah. Yeah. Well, some of it, you have to pay for, right? And so, you know, there are centralized management solutions from the major vendors as well for, you know, an extra fee, and it’s not a small one per system, and they will automate a bunch of these things for you.

And I’m kind of leaving those a little bit out of the conversation, but they definitely exist. And they’re just compounding this problem. HP has got some great tools. At Dell, I love OME. I can log in. I can automate a bunch of stuff right within that environment.

Jim: And there are third party tools that do the same thing.

Allen: and that will lack it. But with laxy window that Dell offering. All right, so we’ve gone from this time-consuming error prone click. I can do five of these in an hour to this okay, I’ve got a much better solution for it. I can go to dell.com. I can download the Python modules for an MX series blade enclosure, and I can bang out some code and I can do this task now in minutes across a thousand enclosures.

Okay, great. But I can’t do that across the HP enclosures, and I can’t do it to Dell rack mounts or, you know, whatever that might be. So then there’s this next level. And I think this is kind of what we, as people who operate in the data center, have just been living with. So then the cloud came around, and I’m a big fan of cloud.

I’m a data center guy. Like, I love touching hardware. Cloud drives me nuts because I can’t touch it. Right. But I will acknowledge that at the end of the day, someone can touch that hardware, and I sleep better at night because it’s hardware and it’s blinking some lights and I’m happy. But, you know, I’m back in this data center land, and we’ve got these systems, you know, from these various vendors, and I would like to be able to solve this problem in a platform-agnostic way.

Wouldn’t it be great if someone had come up with some tooling that will talk to disparate systems from disparate vendors and thank goodness my cloudy friends have already invented it. And they’ve been using it for a good long time. Right. And so, you know, the joy of cloud is they’ve never had this heritage of being able to walk up to a system and push buttons on it.

Like it’s locked away. They don’t know where it is. Right. Your AWS code, like a half of their stuff is somewhere across 30 data centers in Virginia. Right. But the reality is like, you know, all that stuff is away from me you can’t get in. So they write these tools that they have that allow a, usually a software developer, not a data center enterprise storage administrator or anything else.

It’s an application developer, someone who’s writing Node.js apps or whatever it is. Right. And I’m not dogging on developers here, but generally they are not data center monkeys, right? Like, they have not spent their career dealing with storage IO performance, right? Just not a problem that a Node developer normally has to deal with.

And we put this control in their hands, so that someone who knows very little about enterprise storage can run a couple of commands and get some storage provisioned. Right.

Jim: And so think about VLANs, think about–

Allen: Networks, load balancers, firewalls, like all this stuff that wraps around provisioning network, but also just, you know, operating systems and patches and all the rest of it, and configuring of those operating systems to enable an application.

Jim: We’ll talk about firmware updates. Sorry, we won’t. That’s another conversation.

Allen: Another conversation. Yes. Yeah. And as soon as we solve this one, we’ll have that done. So the cloud world has always operated on this notion: one, we’re not going to click on stuff. We’re going to do our best to make everything just automated by default. And if it can’t be automated, it doesn’t show up. It has to be automated, because they can’t sell it to a customer if it doesn’t have an API attached to it, right? That’s just what their billing model fundamentally requires. And that tooling has to be platform agnostic. It’s no good if it can only talk to HP. Maybe your vendor is running HP. Maybe they aren’t. Chances are real good they aren’t; like we talked about earlier, they build their own stuff. But even if they were, they wouldn’t let you talk to those administrative interfaces anyway. Like, you don’t have access to that stuff. So they’ve got to abstract these tools.

And I will say the cloud marketplace is a healthy competition right now. We don’t have a runaway bestseller. For a while it was Amazon. 90-plus percent of the market was all Amazon. Microsoft has made some inroads. I think they’re both great, you know, whatever. But what I really like is that they’re competitive.

Because of that, someone making this tooling can’t just say, hey, we’re going to focus on AWS, or they’re going to leave 40% of the market out of their own capabilities. So they’ve gotta be able to talk to AWS and Azure and probably also Google Cloud and some of the other vendors that are out there. Wouldn’t it be great if I could use that tooling instead?

Kiran: This was the first episode of a two-part series on automation infrastructure featuring OST CIO, Jim VanderMey and senior engineer, Allen Derusha.

In the next episode, we’ll talk about Redfish infrastructure and which of these have been adopted at OST.