Don’t turn your data lake into a data swamp

In today’s world, we’re creating and storing more and more data. It’s even said that each year we generate more new data than in all the previous years combined, and since we’re generating a little over 2 exabytes (or 2 billion gigabytes) of data per day, that claim is probably right. So there’s no shortage of data to put into your data lake, but should you store everything you can in there? The short answer is “No”. The long answer is “Only if it’s useful”.


Data isn’t free

Storing data in your data lake isn’t free. Apart from the obvious cost of storage, there’s also the need to onboard the data in the first place and to maintain the connection that keeps the data flowing. All this needs to be supported by a team of professionals, and those guys (and girls) aren’t cheap either. For this reason, you should only store data in your data lake if you actually use it or expect to use it in the near future.
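To make the “only store what you use” rule a bit more tangible, here is a minimal Python sketch of a usage audit. Everything in it is an assumption for illustration: the catalog layout, the dataset names, the cost figures and the 90-day idle limit are made up, and a real data lake would expose this information through its own catalog or access logs.

```python
from datetime import date, timedelta


def stale_datasets(catalog, today, max_idle_days=90):
    """Return the names of datasets nobody has read for more than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return [name for name, info in catalog.items() if info["last_read"] < cutoff]


def monthly_cost(catalog, names):
    """Add up the (hypothetical) monthly cost of the given datasets."""
    return sum(catalog[name]["monthly_cost"] for name in names)


if __name__ == "__main__":
    # Hypothetical catalog: names, dates and costs are invented for this example.
    catalog = {
        "web_clickstream": {"last_read": date(2018, 11, 28), "monthly_cost": 400},
        "legacy_printer_logs": {"last_read": date(2017, 6, 15), "monthly_cost": 250},
        "firewall_logs": {"last_read": date(2018, 11, 30), "monthly_cost": 900},
    }
    unused = stale_datasets(catalog, today=date(2018, 12, 1))
    print(unused, monthly_cost(catalog, unused))
```

A dataset that shows up in this list isn’t automatically useless, since you may still expect to need it in the near future, but it’s a good candidate for a conversation about why it’s there.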


The Use Case drives the need

Adding data to your data lake should be driven by a need and not by some Pokémon-esque compulsion to “catch them all”. Yes, I’m sure that your security team and compliance officer are saying that you should log and store everything™, but having that data is only relevant if you can (and will) actually do something with it.


It starts with a problem

At first there will be a problem. The problem can be highly complex, but it usually boils down to a lack of information that’s costing your company money: either you can’t make the proper decisions on time and are bleeding money because of it, or a more informed decision would’ve helped you leave less (or no) money on the table for your competitors.


Use cases in the different domains

From an IT Operations perspective, it could be a lack of end-to-end visibility into your most critical components and thus a lack of capacity management, meaning that you’ve been caught with your pants down when you should’ve been expanding capacity to keep up with the growth of your customer base.


The same thing goes for IT Security, where everything is done in order to reduce risk to near-zero. A lack of information (and thus visibility) here could mean that there’s no vulnerability management. With new exploits arising almost daily, you’ll have no way of knowing which of your assets are vulnerable. Combine that with a lack of visibility into your patch management and you won’t know whether your landscape has installed the updates required to mitigate the exploit. Even worse, you might be on your way to a real security incident without even knowing it.


And then there’s the business side of things. Each organisation makes money in its own way and thus has a mostly unique need for information, which can only come from its own data. Data warehouses that take two weeks to generate a report aren’t good enough anymore. Information needs to be available now, in real time, in order to make business decisions that will move your organisation forward. But is real time even enough anymore? With technology such as Machine Learning, the world of Predictive Analytics is coming closer and closer to everyone. No longer will you have to guess when something’s going to happen; instead, your own data will tell you (with a fair degree of certainty).

How to approach the use case?

Whatever your problem is, it will require a solution, and the solution will require different components. You might need some updated processes, some new technology, maybe some infrastructure and, most likely, data. It’s when data turns out to be part of the solution that you decide to unlock a new data stream towards your data lake. The data is necessary for satisfying a specific information need that your organisation has, so it has a place in your data lake.


Next is determining how we’re going to pull the information out again, because only then will it become valuable. Not all data makes sense in a pie chart, and the wrong type of visualisation can even be misleading. However you represent your data, it’s important that it encompasses enough data points to give a true representation of reality and that the human brain can absorb the information in one go. Nobody’s going to study your dashboard or report: either it’s easy to understand or you need to rethink the way you’re presenting the information.


Last is the aspect of orchestration and automating a response. Any decent data lake platform can automate the search for pre-configured occurrences and deviations from what is normal. Do you want automated processes to kick in when something happens? Maybe you want to add a couple of servers to a load balancer when more customers are visiting your website? Or blacklist an IP address that’s doing one too many probes on your firewall? Perhaps even create a new work order for an engineer when the predictive maintenance algorithms detect wear and tear in your industrial process? Automating your use cases can save your organisation millions.
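As an illustration of the firewall example, here is a minimal Python sketch of such an automated response. It’s a simplification under stated assumptions: the FirewallEvent record, the threshold of 5 denied connections per time window and the block_ip callback are all hypothetical, and in practice the detection would be a scheduled search in your data lake platform with the action handed off to your firewall or orchestration tooling.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical threshold: how many denied connections from one source
# we tolerate within a single time window before reacting.
MAX_PROBES_PER_WINDOW = 5


@dataclass(frozen=True)
class FirewallEvent:
    """One simplified firewall log record (illustrative fields only)."""
    source_ip: str
    action: str  # "deny" or "allow"


def find_probing_ips(events, max_probes=MAX_PROBES_PER_WINDOW):
    """Return the source IPs with more than max_probes denied connections."""
    denied = Counter(e.source_ip for e in events if e.action == "deny")
    return sorted(ip for ip, count in denied.items() if count > max_probes)


def respond(events, block_ip):
    """Run the detection and call block_ip once for every offending address."""
    offenders = find_probing_ips(events)
    for ip in offenders:
        block_ip(ip)
    return offenders


if __name__ == "__main__":
    # One time window of events, using documentation-only IP addresses.
    window = (
        [FirewallEvent("203.0.113.7", "deny")] * 6
        + [FirewallEvent("198.51.100.4", "deny")] * 2
        + [FirewallEvent("198.51.100.4", "allow")] * 10
    )
    blocked = []
    respond(window, blocked.append)  # stand-in for a real blocking action
    print(blocked)  # prints ['203.0.113.7']
```

The detection and the action are deliberately kept apart: the same rule could just as well raise a ticket or send an alert instead of blocking the address, simply by passing in a different callback.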

And to wrap it all up

Data in your data lake needs to have a reason for being there. When you’re storing data just to fill it up, your data lake will start to feel mushy like a swamp. Not only will you drive up the cost for no reason, but the data that you actually need will become harder to work with because there are too many unrelated ones and zeros clogging up the storm drain. Make sure that everything in your data lake has a purpose and don’t succumb to the human desire to collect everything just for the sake of collecting. In the end, your data lake should be as clear as any Bounty beach and not murky like the swamps of Dagobah.


December 2018