In the world of AI, cloud services, and automation, the tools and expertise engineering managers need to mine large-scale data sets are changing, fast.
Big Data was born with the internet age, when web servers, social media platforms, and ecommerce applications started generating so much valuable information that only automated tools could deliver the insight digital businesses craved. As we enter the generative AI era, big data is only getting bigger, and the tools that defined the previous era are getting left behind.
Data storage options are proliferating, leaving engineering managers to grapple with a complex web of terminology and vendor options to store and mine their organizational data for insights. As big data sheds its capital letters and becomes just, well, data, it’s time to rethink your approach.
The history of big data: Hadoop and beyond
To understand the current big data landscape, you need to know how the industry has evolved over the past decade.
The early big data age was largely defined by modern analytics frameworks through the Apache projects Hadoop, Spark, and Kafka. These open-source software utilities allowed organizations to process large amounts of data in ways that had been previously impossible. Initially released in 2006, Hadoop made parallel data processing on commodity hardware possible, putting the computing power needed to deal with the rising tide of internet-scale data within reach of organizations without access to supercomputer-scale machines. Leveraged correctly, it enabled a wider range of businesses to handle huge volumes of data, from a variety of sources at near real-time velocity.
These technologies powered companies like Cloudera, Hortonworks, Confluent, and Databricks to massive valuations. But now, Hortonworks and Cloudera are one company, Confluent’s market cap has dropped from $24 billion to $6 billion, and Databricks is busy pivoting towards AI ahead of a hotly anticipated IPO.
While revolutionary, all of these tools were complex and expensive, with a steep learning curve. “It demanded professionals with extensive certifications and experience, limiting its adaptability. Small to mid-size companies found it resource-intensive to implement and sustain the big data promise,” says Ayush Chauhan, associate director at TechAhead.
Meanwhile, as the hyperscale cloud providers made cheap, flexible storage and compute widely available, organizations were suddenly able to expand the scale of their stored data.
“Product companies and managers began appreciating the value of raw data to derive analytics for informed decision-making regarding their products,” Chauhan says. “The advent of new, user-friendly, cost-efficient services and technologies such as data warehousing, data lakes, and data lakehouses, along with highly custom architecture which can be targeted to give the best of both worlds while complying with various standards straight out of the box, nudged Hadoop closer to obsolescence.”
Data lake vs. data warehouse vs. data lakehouse
Data warehouses, lakes, and lakehouses are the various places where organizations can store their data, but the homely terminology can be confusing for the uninitiated. Let’s break it down:
- A data lake is designed to capture and store raw data, which may or may not be structured. “Data lakes are flexible, raw, and unstructured, making them ideal for exploratory analytics and predictive modeling,” says Ben Herzberg, chief scientist of data security platform Satori. For instance, HubSpot stores data for marketers in large data lakes, then provides them with the tools and interfaces they need to understand it.
- A data warehouse, by contrast, consists of “structured, cleaned, and integrated data optimized for analysis and reporting,” says Kenneth Henrichs, senior customer solutions architect at Caylent. This data may be stored in a traditional relational database or a similarly structured format, where it can be analyzed using standard business intelligence reporting and analysis tools. Financial institutions typically make frequent use of data warehouses, as financial data is generally stored in well-established formats, with the structure of a database and its associated reporting tools providing secure and accurate reports and transactions.
- A data lakehouse, as the name implies, lies somewhere between the two. It “combines the advantages of both architectures, providing a unified platform for batch and real-time processing, analytics, and machine learning,” says Herzberg. For instance, a retail business might create a data lakehouse that includes both structured financial data from banks and point-of-sale systems, along with other less structured inventory or marketing information, which can be analyzed together for business insights.
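To make the distinction concrete, here is a minimal PySpark sketch of the three patterns side by side. The bucket paths, table names, and the choice of Delta as the table format are illustrative assumptions rather than a reference architecture, and the lakehouse step assumes the Delta Lake package is configured for the Spark session.

```python
# A minimal sketch of lake, warehouse, and lakehouse access patterns in PySpark.
# Paths and table names are placeholders; the Delta step assumes delta-spark is configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-patterns").getOrCreate()

# Data lake: schema-on-read -- point the engine at raw files and infer structure.
clicks_raw = spark.read.json("s3a://example-bucket/raw/clickstream/2024/*.json")

# Data warehouse: schema-on-write -- query a curated, modeled table with SQL.
revenue = spark.sql(
    "SELECT region, SUM(amount) AS revenue FROM sales.orders GROUP BY region"
)

# Data lakehouse: an open table format over lake storage, adding warehouse-style
# transactions and SQL on top of the raw files.
clicks_raw.write.format("delta").mode("append").saveAsTable("analytics.clickstream")
```

The practical difference is where the schema lives: the lake defers it to read time, the warehouse enforces it at write time, and the lakehouse layers governed tables over lake storage so both styles of work can share one copy of the data.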
The new big data vendor landscape
This shift has created a whole new category of options to consider for your data management needs. Where the big data era was defined by companies building enterprise versions on top of powerful open-source technology like Hadoop and Spark, the new era has been defined by the hyperscale cloud providers delivering these options as part of their platforms. Snowflake has also done its part to simplify big data management in the cloud, but certainly not at a lower cost.
“As data volumes grow exponentially, engineering managers must embrace more automated, scalable approaches like data lakes and lakehouses powered by cloud platforms and machine learning pipelines for managing and deriving insights from big data,” says Caylent’s Henrichs. “Mastering these new architectures and tools will be essential for engineering leaders to deliver data-driven products and decision-making in their organizations.”
“This story has completely changed. Indecipherable technologies have been replaced by boring ones, like a big Postgres database worth north of $50 billion, and open source SQL templates,” writes Benn Stancil, CTO and co-founder of the business intelligence company Mode.
There are a variety of vendors and platforms defining the new era of big data. Here’s a taste of some of the main players.
- Amazon Redshift, AWS’s managed data warehouse offering, which uses columnar storage to allow fast processing of very large data sets.
- Azure Synapse, Microsoft’s big data analytics platform, which combines Apache Spark (the successor to MapReduce as the engine for Hadoop) with various structured query language (SQL) technologies and the company’s own Azure Data Explorer for log analytics.
- Google BigQuery, a managed, serverless data warehouse with built-in machine learning capabilities. BigQuery can store petabytes of data and be queried using SQL.
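For a sense of how little infrastructure these managed warehouses expose, here is a hedged BigQuery sketch using the google-cloud-bigquery Python client against one of Google’s public sample datasets; it assumes a GCP project and application default credentials are already set up.

```python
# A minimal BigQuery query, assuming a GCP project and default credentials.
from google.cloud import bigquery

client = bigquery.Client()  # picks up the default project and credentials

# Standard SQL over a public sample table -- no servers or clusters to manage.
query = """
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.corpus, row.total_words)
```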
Other important commercial offerings in this space include:
- Snowflake, a cloud-based big data analytics platform that offers support for structured and semi-structured data. Snowflake provides automation that makes the process of storing data and assigning metadata simpler.
- Databricks, a cloud-based tool based on Apache Spark for processing, transforming, and exploring large datasets, including AI features such as natural language queries.
- HubSpot, which provides organizations with big data tools to analyze vast pools of marketing information.
There are also some important open-source tools you should be aware of:
- Dask, a Python library for distributed computing that can scale up Python data science tools to work with very large datasets (see the sketch after this list).
- PostgreSQL, a free and open-source Relational Database Management System (RDBMS).
- Apache Iceberg, an open-source table format specifically designed for use in data lakes, which allows multiple SQL engines to access the same tables at the same time.
- Apache SAMOA, a big data analytics tool that provides a platform for developing machine learning algorithms.
- ATLAS.ti, a qualitative data analysis tool that can help ingest and analyze data as varied as transcripts, field notes, and network diagrams.
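Of the open-source tools above, Dask is the quickest to try because it mirrors the pandas API while evaluating lazily across partitions. A minimal sketch, with a made-up dataset path and column names (reading from S3 also needs the s3fs package):

```python
# A minimal Dask sketch: pandas-style operations scaled across partitions.
# The CSV path and column names are illustrative placeholders.
import dask.dataframe as dd

# Each partition is loaded lazily, so the full dataset never has to fit in memory.
events = dd.read_csv("s3://example-bucket/events/*.csv")

# Familiar pandas-style operations build up a task graph...
daily_revenue = events.groupby("event_date")["revenue"].sum()

# ...which only executes, in parallel, when you call .compute().
print(daily_revenue.compute().head())
```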
Looming behind developments in these platforms – along with the questions of what will come next – is the rush to make sense of how generative AI fits into the big data picture.
AI needs big data in the sense that truly massive data sets are required to train useful AI models, and organizations are beginning to think about what they can do with AI models trained on their data in particular. Generative AI can make it easier for humans to query and make sense of that data, with tools like Microsoft Power BI and Tableau AI bringing natural language queries to questions that previously required specialized command interfaces like SQL.
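Under the hood, these natural-language features generally follow a text-to-SQL pattern: give a model the table schema and the question, and get back a query to review and run. Here is a hedged sketch of that pattern using the OpenAI Python client; the model name and schema are illustrative assumptions, and this is not how Power BI or Tableau implement the feature internally.

```python
# A hedged text-to-SQL sketch; the model name and schema are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

schema = "orders(order_id INT, region TEXT, amount NUMERIC, ordered_at DATE)"
question = "What were total sales by region last quarter?"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": f"Translate the user's question into SQL for this schema: {schema}. "
                       "Return only the SQL statement.",
        },
        {"role": "user", "content": question},
    ],
)

generated_sql = response.choices[0].message.content
print(generated_sql)  # review before running it against a real warehouse
```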
New considerations when managing big data
One of the key roles of an engineering manager in this space is to hire and mentor engineers, which means having a clear understanding of what they can (and can’t) do with your company’s data resources. New tools aim to be simpler to use than their predecessors, but the increasingly heavy emphasis on AI and ML tools means that AI-savvy hires have a place on your team.
- Data science teams can start to leverage tools like Amazon SageMaker to build, test, and train custom AI/ML models (see the sketch after this list).
- Data engineering teams can operationalize AI/ML models by applying DevOps principles.
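As one illustration of the first point, here is a hedged sketch of launching a managed training job with the SageMaker Python SDK; the role ARN, S3 path, training script, and framework version are placeholders to adapt to your own account, not a recommended configuration.

```python
# A hedged SageMaker training sketch; ARNs, paths, and versions are placeholders.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()

estimator = SKLearn(
    entry_point="train.py",  # your training script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder ARN
    instance_type="ml.m5.large",
    instance_count=1,
    framework_version="1.2-1",  # a scikit-learn container version; check current docs
    sagemaker_session=session,
)

# Launches a managed training job against data already staged in S3.
estimator.fit({"train": "s3://example-bucket/training-data/"})
```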
There are also some security and data privacy concerns to stay abreast of. “To truly succeed in this space, data privacy, compliance, and security must remain top priorities,” says Satori’s Herzberg. An ecosystem of data governance tools has emerged to meet these needs, but tools on their own are not enough. Engineering managers need to put policies in place early on to maintain user privacy and keep data locked down. In particular, organizations need to know what personally identifiable information they’re dealing with and how they’re obligated to protect and handle it in each region.
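As a small illustration of what such policies look like in practice, here is a minimal sketch of pseudonymizing a direct identifier before a dataset is handed to analysts, assuming pandas; a real deployment would add a secret salt and lean on the governance tooling mentioned above.

```python
# A minimal pseudonymization sketch; in production, add a secret salt and
# use your data governance platform's masking features instead.
import hashlib

import pandas as pd

users = pd.DataFrame({
    "email": ["ada@example.com", "grace@example.com"],
    "country": ["UK", "US"],
    "lifetime_value": [1200, 950],
})

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a stable token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16]

# Analysts can still join and aggregate on the token without seeing raw emails.
users["email"] = users["email"].map(pseudonymize)
print(users)
```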
Finally, as a manager, you’ll be tasked with communicating with your organization’s management to help them understand what business problems your data strategy can solve. Developments in machine learning are expanding that problem space like never before, so you’ll need to keep up with industry developments to suggest new ways that your team can unlock value – but also to head off hype bubbles your executive team might encounter in the press.
The future of big data
In the near term, managers must also be realistic about how much investment in time, tools, and personnel they’ll need in order to launch any data initiative, and should make sure management’s expectations are similarly realistic. That means understanding what your data can and can’t do, which can help ensure that such initiatives are worth it. For instance, marketers have been grappling with the limits of what collected customer data can tell them, and coming to understand where human judgment is necessary.
You will also need to find the right approach and tools to get the most value from your data. Fortunately, lower cloud storage and compute costs can put that possibility within reach, even for smaller businesses.
Big data’s future, then, is getting a little smaller: small and midsized companies undertaking more focused projects on lower-cost cloud compute platforms. This, says TechAhead’s Chauhan, is what the post-Hadoop world looks like: companies seeking to do big data with fewer resources have found “a more inclusive approach, favoring less resource-intensive technologies, easier to manage, and more in line with modern data analytics requirements.”