Data Mesh is transforming how organizations manage their data at scale. Despite the benefits Data Mesh is expected to bring, it also introduces several challenges. What can you do to avoid basic mistakes and make sure you get it right?
While many CSPs pride themselves on striving to let data drive innovation, as their businesses grow, they discover that issues with that data inevitably begin to emerge...
- Organizational silos and lack of data sharing
- No shared understanding of what data means outside the context of its business domain
- Incompatible technologies make it difficult to extract actionable insights
- Data is increasingly difficult to push through ETL pipelines
- Growing demands for ad hoc queries and shadow data analytics
One solution is an architecture that emphasizes data democratization at the business domain level while creating space for different technologies and data analytics approaches. But because experiments with this technique have proved expensive, any organization taking this path should tread with caution: start small, monitor the project to ensure it proves its worth, and then allow it to scale as the business grows.
For many organizations, a data mesh architecture offers a much better solution. It can start small, grow as needed and is budget friendly. Building a successful data mesh architecture requires organizations to address many technical and operational hurdles. This blog (1/2) is the beginning of a two-part series. Let’s delve into the technical challenges and how they can be overcome.
A data mesh is a distributed approach to data management that views different datasets as domain-oriented “data products”. Each set of domain data products is managed by product owners and engineers who have the best knowledge of the domain. The idea is to employ a distributed level of data ownership and responsibility sometimes lacking in centralized, monolithic architectures like data lakes. In many ways, it’s similar to the microservice architectures commonly used throughout the industry. And because each domain implements its own data products and is responsible for its own pipelines, it avoids the tight coupling of ingestion, storage, transformation and consumption of data typical in traditional data architectures like data lakes.
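To make the ownership model concrete, here is a minimal sketch of a domain-owned data product in Python. The domain, class and field names (a hypothetical customer churn product) are illustrative assumptions, not a prescribed interface: the point is that ingestion and transformation stay inside the domain boundary, and consumers only ever see the output port.

```python
# A minimal, illustrative sketch of a domain-owned data product.
# All names (CustomerChurnRecord, CustomerChurnDataProduct) are hypothetical.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class CustomerChurnRecord:
    customer_id: str
    churn_risk: float   # 0.0 - 1.0
    as_of_date: str     # ISO-8601 date


class CustomerChurnDataProduct:
    """Owned end to end by the (hypothetical) Customer domain team:
    ingestion, transformation and serving live inside the domain,
    so consumers never couple to the internal pipeline."""

    def _ingest(self) -> Iterable[dict]:
        # Pull raw events from the domain's own operational store (stubbed here).
        return [{"customer_id": "C-1001", "calls_dropped": 4, "date": "2024-05-01"}]

    def _transform(self, raw: Iterable[dict]) -> Iterable[CustomerChurnRecord]:
        # Domain-specific business logic stays hidden behind the product boundary.
        return [
            CustomerChurnRecord(r["customer_id"], min(r["calls_dropped"] / 10, 1.0), r["date"])
            for r in raw
        ]

    def read(self) -> list[CustomerChurnRecord]:
        """The only public 'output port' other domains consume."""
        return list(self._transform(self._ingest()))
```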
#1: Failure to follow DATSIS principles
The success of your data mesh is contingent on it being discoverable, addressable, trustworthy, self-describing, interoperable and secure (DATSIS). A sketch of what these principles can look like in practice follows the list:
- Discoverable: enable consumers to research and identify data products produced by different domains – typically via a centralized tool like a data catalog
- Addressable: like microservices, data products must be accessible via a unique address and standard protocol (REST, AMQP, possibly SQL)
- Trustworthy: domain owners must provide high-quality data products that are useful and accurate
- Self-describing: data product metadata must provide enough information to ensure consumers don’t need to query domain experts
- Interoperable: data products must be consumable by other data products
- Secure: access to each data product must be automatically regulated through in-built access policies and security standards
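As a minimal illustration of the discoverable, addressable, self-describing and secure principles, here is a sketch of the kind of metadata a data product could publish to a central catalog. The field names, URL and values are illustrative assumptions, not any specific catalog's schema.

```python
# A minimal sketch of a data product catalog entry; all names are illustrative.
from dataclasses import dataclass, field


@dataclass
class DataProductDescriptor:
    name: str                  # discoverable: searchable in the catalog
    domain: str
    address: str               # addressable: unique endpoint, standard protocol
    version: str
    description: str           # self-describing: what the data means
    schema: dict = field(default_factory=dict)
    quality_slo: str = ""      # trustworthy: freshness / completeness targets
    access_policy: str = ""    # secure: who may read, enforced automatically


churn_scores = DataProductDescriptor(
    name="customer-churn-scores",
    domain="customer",
    address="https://data.example.com/customer/churn-scores/v2",  # hypothetical URL
    version="2.1.0",
    description="Daily churn-risk score per active subscriber, 0.0-1.0",
    schema={"customer_id": "string", "churn_risk": "float", "as_of_date": "date"},
    quality_slo="refreshed daily by 06:00 UTC; <0.1% null customer_id",
    access_policy="role:analytics-reader",
)
```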
#2: Failure to invest in automated testing
Because a data mesh is a decentralized collection of data, it’s crucial to ensure consistent quality across data products owned by different teams who may not even be aware of one another. Following these principles helps:
- Every domain team must be responsible for the quality of their own data. The type of testing will depend on the nature of that data and be decided by each team.
- Take advantage of the fact that the data mesh is read-only. This means tests can be run not only against mock data but often repeatedly against live data as well. Also take advantage of time-based reporting: testing historical data, which is static, allows you to easily detect issues such as changing data structures.
- Run data quality tests against mock and live data. These tests can be plugged into developer laptops, CI/CD pipelines or live data accessed through specific data products or an orchestration layer. Typical data quality tests verify that a field contains values between 0 and 60, that values match a specific alphanumeric format, or that the start date of a project is at or before its end date (see the sketch after this list). Test-driven design is another approach that can be used successfully in a data mesh.
- Include business-domain subject-matter experts when designing your tests
- Include data consumers when designing your tests. Data meshes should be driven by data consumers and it’s important to make sure your data products meet their needs
- Use automated testing frameworks that specialize in API testing
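Here is a minimal sketch of the kinds of data quality checks mentioned above, written as plain pytest-style functions over a pandas DataFrame. The column names, value ranges and format rule are illustrative assumptions, not part of any specific data product; in practice the same checks could run against mock data on a laptop or a read-only slice of live data in a CI/CD pipeline.

```python
# Illustrative data quality tests; column names and thresholds are assumptions.
import re

import pandas as pd


def load_sample() -> pd.DataFrame:
    # Stand-in for mock data or a read-only extract from a live data product.
    return pd.DataFrame({
        "call_minutes": [12, 45, 60],
        "subscriber_code": ["AB1234", "ZX9876", "QW5555"],
        "start_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
        "end_date": pd.to_datetime(["2024-06-30", "2024-02-28", "2024-03-15"]),
    })


def test_call_minutes_in_range():
    df = load_sample()
    assert df["call_minutes"].between(0, 60).all()


def test_subscriber_code_format():
    df = load_sample()
    pattern = re.compile(r"^[A-Z]{2}\d{4}$")  # illustrative format rule
    assert df["subscriber_code"].map(lambda v: bool(pattern.match(v))).all()


def test_start_date_not_after_end_date():
    df = load_sample()
    assert (df["start_date"] <= df["end_date"]).all()
```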
#3: Tight coupling between data products
The design influence of microservices on a data mesh is apparent in its flexible nature. A data mesh can expand and contract to match your data topology as it grows in some areas and shrinks in others. Different technologies like streaming can be used where needed and data products can scale up and down to meet demand.
As with microservices, tight coupling is the enemy of a highly functional data mesh. The “independently deployable” rule as applied to microservices also applies to data meshes: every data product on a mesh should be deployable at any time without making corresponding changes to other data products in the mesh. Adhering to this rule in a data mesh often implies there is some versioning scheme applied to data products.
#4: Failure to accurately version data products
Data products need to be versioned as data changes, and users of that data product (including maintainers of dashboards) must be notified about such changes – both breaking and non-breaking. Meanwhile, consumed data products need to be managed like resources in Helm charts or artifacts in a Maven repository: as explicit, versioned dependencies.
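One way to make this concrete, assuming semantic versioning is applied to data products, is for each consumer to pin the major version it was built against and treat a major bump as a breaking change. The sketch below is an illustration of that idea, not a prescribed mechanism.

```python
# A minimal sketch of a consumer-side compatibility check, assuming the data
# product follows semantic versioning. Names and versions are illustrative.
MAJOR_PINNED = 2   # the consumer (e.g. a dashboard) was built against major version 2


def is_compatible(published_version: str, pinned_major: int = MAJOR_PINNED) -> bool:
    """Non-breaking changes bump minor/patch; a new major version is breaking."""
    major = int(published_version.split(".")[0])
    return major == pinned_major


# The catalog reports the product is now at 2.3.0: still compatible.
assert is_compatible("2.3.0") is True
# A jump to 3.0.0 signals a breaking change the dashboard owner must react to.
assert is_compatible("3.0.0") is False
```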
#5: Sync vs async vs pre-assembled results
If you’re using synchronous REST calls to package the output from multiple data products, chances are the performance will be acceptable. But if the data mesh is used for more in-depth analytics, combining a larger number of data products (such as the analysis typically done by a data lake), it’s easy to see how synchronous communication might become a performance issue.
One solution is to use Command and Query Responsibility Segregation (CQRS) to pre-build and cache data results on a regular cadence. The cached results can then be combined into a more complex data structure when the data product is queried, unless you genuinely require up-to-the-moment results.
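The sketch below shows one way this CQRS-style approach could look: a scheduled job pre-assembles the expensive multi-product combination into a cached read model, and queries read from the cache instead of fanning out synchronously. The product names, cache and cadence are illustrative assumptions; in practice the cache would likely be Redis or a materialized table, and the rebuild would be triggered by a scheduler such as cron or Airflow.

```python
# Illustrative CQRS-style read model: build once on a cadence, query from cache.
# All names, products and the in-memory cache are stand-ins for real components.
import time

CACHE: dict[str, dict] = {}         # stand-in for Redis, a materialized table, etc.
CACHE_TTL_SECONDS = 15 * 60         # rebuild cadence, e.g. every 15 minutes


def fetch_churn_scores() -> dict:   # would call one data product's API
    return {"C-1001": 0.4}


def fetch_billing_summary() -> dict:  # would call another data product's API
    return {"C-1001": {"overdue": True}}


def rebuild_read_model() -> None:
    """Run on a schedule: combine the data products once and cache the result."""
    churn, billing = fetch_churn_scores(), fetch_billing_summary()
    combined = {
        cid: {"churn_risk": risk, **billing.get(cid, {})}
        for cid, risk in churn.items()
    }
    CACHE["customer_360"] = {"built_at": time.time(), "rows": combined}


def query_customer_360() -> dict:
    """The query side reads the pre-assembled result instead of calling each product."""
    entry = CACHE.get("customer_360")
    if entry is None or time.time() - entry["built_at"] > CACHE_TTL_SECONDS:
        rebuild_read_model()
        entry = CACHE["customer_360"]
    return entry["rows"]
```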