Welcome to the first installment of the summer series of The Download: Tech Talks by HPCC Systems
Meet our presenters
Flavio Villanustre - Flavio Villanustre is CISO and VP of Technology for LexisNexis® Risk Solutions. He also leads the open source HPCC Systems platform initiative, which is focused on expanding the community gathering around the HPCC Systems Big Data platform, originally developed by LexisNexis Risk Solutions in 2001 and later released under an open source license in 2011. Flavio’s expertise covers a broad range of subjects, including hardware and systems, software engineering, and data analytics and machine learning. He has been involved with open source software for more than two decades, founding the first Linux users’ group in Buenos Aires in 1994.
Richard Chapman - VP Technology, HPCC Systems Development, LexisNexis Risk Solutions. Richard is the core lead for open-source platform development. He has been involved in all cloud-related projects and every aspect of the planning and delivery of new enhancements. Richard has been a part of HPCC Systems for over 20 years, and is responsible for the creation and development of a number of components within HPCC Systems. ROXIE is his brainchild as one of the original designers of the HPCC Systems platform.
Flavio: Several years ago, HPCC Systems began supporting certain types of containers, but has recently made a renewed effort to support containerization and other capabilities that benefit users running HPCC Systems on a dynamic cloud platform. What are the latest developments as HPCC Systems moves toward operating the platform in the cloud?
Richard: Well, as you say, for several years people have been using HPCC Systems in containers or in cloud environments. But, they’ve been doing so using the standard HPCC Systems platform, built for bare metal environments, and simply installing it into a container, or into a lift and shift type operation, which is fine, and it works. HPCC Systems has been designed ever since day one to work on clusters. So, the cloud is a natural home for it. But, there are things about doing it that way that are not ideal. Partly, you’re not taking advantage of some things that the cloud will give you, like the ability to increase your capacity when a job comes in, and decrease it when there are no jobs running. Or, like different options for storing your data that may have different performance profiles and different cost profiles.
The other disadvantage of doing it this way is that, in traditional HPCC Systems installations, we’ve always used the disk storage on our compute nodes as our primary data storage. We’ve looked at it as getting disk storage almost “for free” because you need the compute power or, vice versa, getting the compute power almost “for free” because you need the disk. That has given us great economies of cost, because you pay for one and get both. And it has also given us great performance, because the storage is local.
However, that doesn’t work well in a cloud environment because it relies on you keeping expensive nodes up simply for the storage when you don’t need the compute power, for example. Also, cloud infrastructures tend not to be organized around keeping machines up for long periods of time and caring what’s on the machines. They treat the machines as ephemeral objects, there for you to start up and shut down at will. “Treat your servers like cattle, not like pets,” as someone put it. And in the HPCC Systems world, traditionally they’ve been a little more like pets.
So we needed a little more re-imagining of how HPCC Systems would work if we were designing it for the cloud. And that’s what we’ve got. From version 7.8.0 of the platform onwards, we have what we are calling the “cloud native mode,” where we are actually building containers for use on the cloud. We are providing Helm charts for deploying those containers on any cloud platform that supports Kubernetes, which is basically any cloud provider. And we’re providing example configurations for how to connect your data to a persistent cloud storage, such as Azure blob storage or S3 files on Amazon.
Flavio: So, are you saying that if you have data somewhere in this blob storage, for example, Amazon or Azure, and you submit a work unit, you would not necessarily have a Thor running at all times? Will Thor just spawn dynamically with another work unit?
Richard: Yes, in the cloud native system, Thor is not running unless there is a job to run. If there are multiple jobs to run, we will spawn as many Thors as we want to, and run those jobs in parallel.
Flavio: How much more complicated does this make it for the person that is approaching HPCC Systems for the first time, or even the user who has been using it for a long time? What does the configuration look like compared to what we had before?
Richard: The configuration is actually much simpler. Because we’re targeting a containerized system, we don’t have to worry about an awful lot of variables that there might be on a bare metal system. We don’t need to let people say, “I want to choose which port so that it doesn’t clash,” or, “I want to use the exact options that suit my particular target operating system,” because we’ve already made that choice for them. We’ve chosen the right operating system in the container. We choose the options that are appropriate for that installation. So, it’s actually much simpler to configure than the traditional system.
Flavio: That is amazing. It seems to me like a pretty significant change of mindset right there. You also mentioned the assumptions that we originally made, where we said, “Well, if you already have compute, I’m sure you already have storage attached to it, because it’s essentially free. It’s in the computer.” And vice versa: if you have storage available, well, you have some CPUs in that computer that we can use. How does decoupling these variables, and changing these assumptions, also change the execution model? Do you think that this is going to get us some benefits from a performance standpoint, too?
Richard: It’s too early to say what the performance impact is today. In general, I suspect the performance will be worse, rather than better. But there are lots of ways of measuring performance. Do you measure it in absolute clock time, or do you measure it in cost, or do you measure it in reliability, or whatever? In terms of how long a single existing Thor job might take, it may well take a little bit longer. But, if you remember that you can run as many Thor jobs as you want in parallel, your throughput may well be higher. You’re not paying for a cluster that isn’t being used, so you don’t have to size your cluster for the maximum load, like we do in the bare metal version. In the traditional case you just can’t say, “Oh well, I’ll build out 500 Thor clusters so that I can run 500 jobs in parallel.” You could, but it would be very expensive, and you wouldn’t be using most of them most of the time. Whereas in the cloud, if you have a lot of jobs to run in parallel, you’ll run them, but you won’t be paying for the peak when you’re not at peak load.
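To make the cost side of this trade-off concrete, here is a minimal sketch in Python. It is not taken from HPCC Systems: the function names, the price, the workload and the 20% slowdown are all made-up assumptions, chosen only to show how paying for peak capacity around the clock compares with paying only while jobs run.

```python
# Illustrative cost model only. Every number here is a hypothetical assumption,
# not an HPCC Systems measurement or price.

def fixed_cluster_cost(peak_jobs, hours, cluster_hour_price):
    """Cost of keeping enough clusters up to serve the peak for the whole period."""
    return peak_jobs * hours * cluster_hour_price

def on_demand_cost(jobs_per_hour, job_hours, cluster_hour_price, slowdown=1.0):
    """Cost of paying only while jobs run.

    slowdown > 1 models each individual job taking longer in the cloud.
    """
    return sum(n * job_hours * slowdown * cluster_hour_price for n in jobs_per_hour)

# Hypothetical day: 22 quiet hours with 2 one-hour jobs each,
# and 2 peak hours with 50 one-hour jobs each.
jobs_per_hour = [2] * 22 + [50] * 2
price = 10.0  # hypothetical currency units per cluster-hour

fixed = fixed_cluster_cost(max(jobs_per_hour), len(jobs_per_hour), price)
elastic = on_demand_cost(jobs_per_hour, job_hours=1.0,
                         cluster_hour_price=price, slowdown=1.2)

print(f"sized for peak: {fixed:.0f}, on demand: {elastic:.0f}")
```

With these invented numbers, the cluster sized for peak costs several times more than the on-demand run, even though each on-demand job is assumed to be 20% slower. Different workloads will give very different ratios; a steady load close to peak all day would narrow or reverse the gap.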
Flavio: Maybe the runtime of an individual job is not that important if you are able to run as many jobs in parallel as you need. Perhaps, it’s okay to spend a little extra time on just one job.
Richard: Yes, if you look at how people are spending their time. When people are waiting for the results of their Thor job, what are they waiting for? They are waiting for a Thor to become available; they are waiting for the compiler to become available. Some of the jobs take a while to compile. And they’re waiting for the Thor jobs to actually run. Well, the third of those may take a teeny bit longer, but the first two can be much quicker. So, I’m expecting the overall user experience to be generally better. In fact, one thing we're going to have to be careful of is that we don't make the user experience so much better that the costs go through the roof because people think, “Oh, I'll submit a job, it'll run right away, and I can re-run if it’s wrong,” rather than, “This job is going to run in four hours’ time, so I better make sure it's perfect before I submit it (or there’s another four hour delay before I can fix it).”
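The waiting-time argument can be sketched the same way. The toy model below is an assumption-laden illustration, not a description of how HPCC Systems schedules work: it ignores compile time, assumes all jobs are submitted together, and uses hypothetical function names and a hypothetical 25% slowdown. It compares jobs queueing for a fixed number of clusters with jobs that each get their own cluster immediately.

```python
import heapq

def turnaround_fixed(run_times, clusters):
    """Finish times for jobs submitted together that queue, first come first served,
    for a fixed number of clusters."""
    free_at = [0.0] * clusters  # time at which each cluster next becomes free
    heapq.heapify(free_at)
    finished = []
    for run in run_times:
        start = heapq.heappop(free_at)  # earliest available cluster
        end = start + run
        heapq.heappush(free_at, end)
        finished.append(end)
    return finished

def turnaround_on_demand(run_times, slowdown):
    """Finish times when every job starts at once on its own cluster,
    but each job runs slower by the given factor."""
    return [run * slowdown for run in run_times]

# Hypothetical workload: eight one-hour jobs submitted at the same moment.
jobs = [1.0] * 8
queued = turnaround_fixed(jobs, clusters=2)
elastic = turnaround_on_demand(jobs, slowdown=1.25)

print("mean turnaround, 2 fixed clusters:", sum(queued) / len(queued))
print("mean turnaround, on demand:", sum(elastic) / len(elastic))
```

In this example the queued jobs finish after 1, 1, 2, 2, 3, 3, 4 and 4 hours, a mean of 2.5 hours, while the on-demand jobs all finish after 1.25 hours. Each job is individually slower, yet nobody waits for a cluster to become free.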
Flavio: I can see how that could help here. I can imagine setting up quotas to ensure that people don’t run up cloud costs that go through the roof. But Richard, how tied is this to a particular cloud provider? If we go this route, will it require AWS or Azure, or can I run it in my own data center, or even on my own computer? What are the requirements?
Richard: Well, my target isn't a specific cloud provider, my target is Kubernetes, which is very neutral when it comes to cloud providers. We will be providing specific plugins for interfacing to specific cloud providers’ storage. So, there will be an Azure blob plugin. There will be an Amazon EFS plugin. There will be whatever the equivalent for Google is. And there'll be more than one of those for Azure. There are different classes of storage, which have different price points and different performance points. And we might have different plugins for different ones, and the same is true in Amazon. You can run Kubernetes in the data center on top of bare metal or on top of something like OpenStack. You can run it on your desktop machine on top of Docker Desktop, for example. So, we're absolutely cloud provider neutral on this by targeting Kubernetes.
Flavio: That is really amazing. I can't think of a reason anyone would want to use the old method if this is the case. And then I also assume, and I think it's probably the right assumption, that when you deploy this in a container it is a lot less dependent on what you have installed in your operating system. So if you need to patch or you need to change the entire operating system underneath, as long as you support Kubernetes you are good to go, right?
Richard: Yes, absolutely. It's very much a different philosophy as well, in terms of how you install and maintain the systems in your traditional data center. In the traditional data center, your servers are pets, if you like. You maintain a server. You add patches to a server and update the system on it, and stuff like that. In the Kubernetes world it's all zero touch. If you want a new version of the system you don't deploy the new version to your existing servers. You bring up new Kubernetes pods that have the new system on them. And the old system can carry on running, finishing the jobs it's on, or whatever. It is unrelated to the new system you just brought up. It's all zero touch configuration.
Flavio: Okay, I'm sold. So for anyone in the audience who wants to test it today, what do they need? What version? What particular branch? When will this be available for people to use?
Richard: Well, version 7.8.0 that we released in April was the first version that revealed this technology to the world. And we have been patching the 7.8.x series ever since. Once a week, we've been doing point releases. It's not production-ready. It is currently proof of concept level. So, if you're running HPCC Systems 7.8.x, yes, you can bring up a cluster and run your jobs. There are some blogs out there on how you can get started on that: HPCC Systems and the Path to the Cloud by Richard Chapman, and Setting up a default HPCC Systems cluster on Microsoft Azure Cloud Using HPCC Systems 7.8.x and Kubernetes, by Jake Smith. But by default, as it's configured, we're not using persistent storage anywhere. So when you bring your cluster down again your data will disappear. This is brilliant for developers like me. It means we always start with a completely clean system, but it's not great for production work. Now in 7.10.0, which is due out at the start of July, we will have included everything that's needed to be able to persist your data. You can do it now; read the blog Persisting data in an HPCC Systems Cloud native environment by Gavin Halliday. But there is a caveat. This is very much a moving target. In particular, we are not going to want to slow down progress toward the eventual final version by having to think about, “Did we just break existing installations?” So if you're using HPCC Systems cloud native in 7.10.x, you need to be aware that we may require you to make changes if you use it in 7.12.x. And the same will be true until we consider it fully production-ready, when we get to version 8.0.0 next year.
Flavio: Okay, version 8.0.0 will be the production ready version. But nevertheless you will have the ability to test everything that Richard described with the ephemeral storage today. You will be able to test some persistent storage, blob storage for example, in the following version 7.10.x. And we expect that version 7.12.0 will be better suited for additional capabilities, and version 8.0.0 will be production ready by January.
Richard: By version 7.12.0, we think we will have finished providing all functionality, and don't expect to need to make significant changes that would break backward compatibility. So that's the point where we would think people can start evaluating whether it's going to suit them and moving it into production. But it's not until we get to version 8.0.0 that we'll be saying, “Okay, this is it. This is our offering. This is not going to change.” We're not going to make you change everything. We'll maintain backward compatibility with this version as much as we can in future releases.
Flavio: Thank you very much Richard. This was a great talk and start to the summer series. We will start connecting with each one of the core developers on the platform. We currently have many interesting features that are being developed and conceptualized. So, it's great to be a little bit inside the kitchen and see how things are happening, which is the beauty of open source projects. We can really get into the source code and talk to those that are actually developing new things.
More information on HPCC Systems and the cloud can be found in the following blogs:
- HPCC Systems and the Path to the Cloud by Richard Chapman
- Setting up a default HPCC Systems cluster on Microsoft Azure Cloud Using HPCC Systems 7.8.x and Kubernetes, by Jake Smith
- Persisting data in an HPCC Systems Cloud native environment by Gavin Halliday