Why it’s time to dive into data lakes

Miguel Blanco

11 April 2019


The power of data is undeniable. Any business worth its salt has invested in Machine Learning and Artificial Intelligence. These two tools have been the subject of hundreds of articles – we’ve talked about how businesses are using them, and how chat bots, forecasting and predictive technology, and automatic clustering are going to change the productivity game. But we’ve yet to fully explore the tools that matter most – those that will have the biggest impact.

In this article, we’ll be focusing on data lakes – the key to meeting any upcoming challenges head-on.

What exactly is a data lake?

Seven years ago, Forbes published an article in which technology analyst Dan Woods explained that the most impactful data could not be organised into rows and columns. To make the most of this data, we have to develop a way of organising and storing many different data formats – in their purest and most raw state – in order to avoid any transformation of data that could bias future analysis.

We’re on a moving train right now, and it’s becoming increasingly difficult to predict the kind of analysis that will be possible two or three years from now. That said, we can be certain of one thing: the analysis you’ll be doing in a couple of years’ time will be informed by the data you’re failing to store today.

Technologies change quickly. Data Marts and Data Warehouses proved useful in the past – and they continue to prove useful in some cases – because the data they hold is structured, so we can access it quickly and extract insights easily.

But in a world where the IoT is taking over our living rooms, and 2.5 quintillion bytes of data are created every day, we should assume that, over the next few years, our current ability to structure real-time data is not going to be enough. We need to start storing data in its rawest state. Formats like .jpg and .pdf are becoming easier to structure every day, and machine learning allows us to automatically tag and organise images, taxi receipts, and incoming packages, and to extract beautiful and powerful insights.

Dealing with all that data

The main difference between data lakes and traditional data warehouses (or data marts) is that data is no longer organised and structured at the point of entry. Still, we need to figure out how to handle that much data. Well, there are a number of available technologies designed to help you do just that:

[Image: screenshot of available data lake technologies]

Depending on your needs, your Data Lake can be made up of any number of useful tools.
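
To make the difference described above more concrete, here is a minimal Python sketch of storing data raw and structuring it only when it is read. It is an illustration under stated assumptions, not a recommended architecture: the folder layout, the file names and the helper functions store_raw and read_receipts are all hypothetical, and a local temporary folder stands in for whatever storage your data lake actually uses.

```python
import json
import tempfile
from pathlib import Path


def store_raw(lake: Path, name: str, payload: bytes) -> Path:
    """Write the bytes exactly as received: no parsing, no schema."""
    lake.mkdir(parents=True, exist_ok=True)
    target = lake / name
    target.write_bytes(payload)
    return target


def read_receipts(lake: Path) -> list:
    """Apply structure at read time, and only to the files this analysis needs."""
    rows = []
    for path in sorted(lake.glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        rows.append({
            "vendor": record.get("vendor"),
            "total": float(record.get("total", 0)),
        })
    return rows


# Hypothetical sample data: a temporary folder plays the role of the lake.
lake = Path(tempfile.mkdtemp()) / "raw"
store_raw(lake, "receipt_001.json", b'{"vendor": "taxi", "total": "12.50"}')
store_raw(lake, "receipt_002.json", b'{"vendor": "courier", "total": "8"}')
store_raw(lake, "photo_001.jpg", b"\xff\xd8\xff\xe0 not parsed yet")

receipts = read_receipts(lake)
total_spend = sum(row["total"] for row in receipts)
```

Note that the image is stored alongside the receipts even though nothing reads it yet. It stays untouched in the lake until a future analysis, such as automatic image tagging, has a use for it.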

The data mart is dead, shop somewhere else

Data Marts and Data Warehouses are not as powerful as data lakes, and the era of AI demands a powerful approach to data storage. Ask yourself: is there an expert in your organisation capable of fishing for information inside your new data lake? What you need is a Data Scientist, a role some have called the sexiest job of the century. It’s a role that has evolved significantly over the past few years. However, keep in mind that you need more than a single data scientist with an in-depth understanding of data storage and analysis. You also need to ensure that the majority of your workforce has at least a basic understanding of the tools and processes data scientists use, so that they can keep track of and assess the value of the data being extracted.

The good news is that the development of new tools is making the field of data science more and more accessible. Platforms like BigML, a Good Rebels partner, offer both structured and unstructured machine learning tools with an easy user interface and a powerful workflow that allows projects to scale without limit. Remember, there’s a little bit of data scientist in all of us.