Since the digital transformation began, vast amounts of data have been gathered from industrial processes and operations. Technologies such as machine learning, big data, and cloud computing were developed in this period to handle this data and turn it into something useful.

 

Within this scenario, a new storage concept has come under discussion. In this text we will cover the main characteristics, definitions, and benefits of a Data Lake, and the opportunities this concept creates for data scientists who want to extract maximum value from data.

 

What is it?

Data Lake is a relatively recent concept, conceived by James Dixon. A Data Lake is a repository designed to store large volumes of raw data in its native format. Lakes provide an unrefined view of the data, since it has not yet been processed for any specific purpose.

 


The structure of the data in a Data Lake is only defined when it is queried, an approach known as schema-on-read.
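As a minimal sketch of schema-on-read (the event fields and names here are hypothetical), raw records can land in the lake exactly as produced, with structure imposed only at query time:

```python
import json

# Raw events are stored as-is; no schema is enforced when they are written.
raw_events = [
    '{"sensor": "t-01", "temp_c": 21.5, "ts": "2024-01-01T00:00:00"}',
    '{"sensor": "t-02", "humidity": 0.4}',  # a different shape, still accepted
]

# Schema-on-read: the notion of a "temperature record" exists only in the query.
def query_temperatures(lines):
    records = (json.loads(line) for line in lines)
    return [r["temp_c"] for r in records if "temp_c" in r]

print(query_temperatures(raw_events))  # [21.5]
```

Note that the second event, which lacks a `temp_c` field, is not rejected at write time; it is simply ignored by this particular query.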

 

The framework of a traditional Lake consists of:

Data capture: stream of data coming from applications, databases, cloud services, and IoT devices.

Storage: a single repository for holding many different types of data.

Distribution: knowledge, insights, and analytics drawn from querying the repository. From there, the data becomes usable by Analytics and Data Science tools.
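The capture, storage, and distribution steps above can be sketched as a toy in-memory lake (the class and source names are illustrative, not a real product API):

```python
from collections import defaultdict

class DataLake:
    """A toy lake: one repository holding many data types, keyed by source."""

    def __init__(self):
        self._store = defaultdict(list)  # storage: a single repository

    def capture(self, source, payload):
        # Capture: accept any payload as-is, regardless of its format.
        self._store[source].append(payload)

    def query(self, source):
        # Distribution: downstream tools pull raw data and derive insights.
        return list(self._store[source])

lake = DataLake()
lake.capture("iot", {"sensor": "t-01", "temp_c": 21.5})  # structured record
lake.capture("app", "POST /login 200")                   # unstructured log line
print(lake.query("iot"))
```

The point of the sketch is that `capture` never validates or transforms the payload; interpretation is deferred entirely to whoever calls `query`.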

 

What is it used for?

 

Because the data is stored statically in its native format, analysts can use it for predictive modeling with fast access and no restrictions imposed on its original format.


The data lake stores all kinds of data without restrictions, so it can serve as a data source that later feeds Machine Learning and Big Data algorithms.


The data held in a data lake can be accessed directly, without requiring data transfer between different systems.
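A small illustration of this in-place access, using only the standard library (the file contents and column names are hypothetical): a raw CSV sitting in the lake is read and aggregated where it lives, without first importing it into a separate database.

```python
import csv
import io

# Stand-in for a raw CSV file stored in the lake, read directly in place.
raw_csv = "sensor,temp_c\nt-01,21.5\nt-02,19.0\n"

def mean_temperature(fileobj):
    # Interpret the raw bytes as CSV only at read time, then aggregate.
    reader = csv.DictReader(fileobj)
    temps = [float(row["temp_c"]) for row in reader]
    return sum(temps) / len(temps)

print(mean_temperature(io.StringIO(raw_csv)))  # 20.25
```

In practice the same idea is what query engines do at scale: they scan the lake's files where they are stored instead of copying them into a separate system first.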

 

Data Swamps

Data lakes require routine maintenance to keep them from turning into data swamps. Since all kinds of data are stored in the repository, a lack of governance can make the lake inaccessible to users. In addition, these repositories are highly scalable, meaning they can grow to enormous sizes very quickly.

 

Conclusion

Data lakes are large repositories of structured, semi-structured, or unstructured data in native format. Because they are highly scalable, they require maintenance to prevent them from becoming repositories that are inaccessible to users.

Data lakes are a great advantage for data scientists, who can work with information in its native format quickly and reliably. They can also use this data to run projects involving AI and ML faster, without needing to import the data into other systems.


 

About the Author: Andre Andrade