Data architecture
Design and evolve a fit-for-purpose data and analytics architecture.
(Figure: A well-designed data and analytics architecture)
The data and analytics architecture is the blueprint of an organization's capabilities to derive value from data. It helps the organization gain new business insights and is a catalyst for business growth. To support business needs, a modern data architecture should align with short-term and long-term business goals and be unique to the organization's cultural and contextual requirements. In today's world, the successful implementation and adoption of a data and analytics architecture are based on the principle of delivering the right data to the right consumer at the right time.
This is achieved by planning and organizing how an organization's data assets are modeled, physically or logically; how the data is secured; and how these data models interact with one another to address business problems, uncover unknown patterns, and generate insights.
Start
Define overarching capability
In the current business environment, it is critical for a modern data analytics platform to derive value from data to support the organization's various domains. Instead of adopting a single data architecture approach, a modern data architecture combines purpose-built approaches so that each domain and use case is served by the capabilities that fit it best.
Organize data zones
How the data is organized and stored for quick and easy access is a critical aspect of data architecture. This can be achieved by setting up custom data zones within a data lake. The data zones are categorized as follows:
- Raw data that's collected from heterogeneous sources
- Curated and transformed data to support the analytical needs of each domain
- Use case or product-based data marts for reporting needs
- Externally exposed data with security and compliance controls
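The zones above can be made concrete with a consistent object-key convention in the data lake. The following is a minimal sketch; the zone names come from the list above, but the bucket layout, domain, and dataset names are illustrative assumptions, not a prescribed standard.

```python
# Illustrative zone-based key convention for a data lake.
# Zone names follow the four zones described above; the prefix layout
# (zone/domain/dataset/file) is an assumption.

DATA_ZONES = ("raw", "curated", "data-mart", "external")

def zone_key(zone: str, domain: str, dataset: str, filename: str) -> str:
    """Build an object key that keeps each zone, domain, and dataset separate."""
    if zone not in DATA_ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{domain}/{dataset}/{filename}"

key = zone_key("curated", "sales", "orders", "2024-01.parquet")
```

Keeping the zone as the leading prefix makes it straightforward to attach zone-wide policies (access controls, lifecycle rules) later.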
Plan for agility and democratization of data
The effectiveness of an analytics platform depends on the speed of provisioning data as well as democratizing the provisioned data for consumption. Data provisioning agility is achieved by the ability of the data architecture to procure and process data in a variety of ways, such as real-time, near-real-time, batch, micro-batch, or hybrid, based on the use case. Data democratization is achieved by defining data sharing and access control workflows that are monitored by data stewards. Implementing a data marketplace is one of the enablers for democratizing data.
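To illustrate one of the processing modes mentioned above, the sketch below groups streaming records into micro-batches before transformation. The batch size and record shape are illustrative assumptions.

```python
# Minimal micro-batch sketch: records arriving from a stream are grouped
# into small fixed-size batches before being transformed or loaded.
# batch_size=3 and the record shape are illustrative choices.

from typing import Iterable, Iterator

def micro_batches(records: Iterable[dict], batch_size: int = 3) -> Iterator[list]:
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

stream = ({"id": i} for i in range(7))
batches = list(micro_batches(stream))
```

The same pipeline code can then serve batch and near-real-time use cases by tuning the batch size and trigger interval.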
Define secure data delivery
A modern data architecture is a fortress to the outside world in terms of security, but it allows easy access to employees and data users as defined by their job functions, and it adheres to compliance restrictions such as the Health Insurance Portability and Accountability Act (HIPAA).
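As a hedged illustration of access defined by job function, the sketch below maps roles to the data classifications they may read. The role names and classification labels are hypothetical, not taken from HIPAA or any specific framework.

```python
# Illustrative job-function-based access check. In practice this logic
# would live in an IAM policy or data catalog; the roles and
# classifications here are made-up examples.

ROLE_PERMISSIONS = {
    "analyst": {"public", "internal"},
    "data-steward": {"public", "internal", "restricted"},
}

def can_read(role: str, classification: str) -> bool:
    """Return True if the role's job function permits reading the data."""
    return classification in ROLE_PERMISSIONS.get(role, set())
```

Unknown roles fall through to an empty permission set, so the default is deny.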
Plan for cost-effectiveness
Traditional data warehouses provide tightly coupled compute and storage with a high cost of resource utilization. A modern architecture decouples compute and storage, and implements tiered storage based on the data lifecycle. For example, on AWS you can use Amazon Simple Storage Service (Amazon S3) to decouple data storage from compute, and use Amazon S3 storage classes to tier data by access pattern and control costs.
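Tiered storage on Amazon S3 is typically expressed as a lifecycle rule that transitions objects to cheaper storage classes over time. The rule below follows the shape expected by the S3 lifecycle API; the `raw/` prefix and the 30/90-day thresholds are illustrative assumptions you would replace with values matching your own data lifecycle.

```python
# Sketch of an S3 lifecycle rule implementing tiered storage for a data
# lake zone. Prefix and day thresholds are illustrative assumptions.

lifecycle_rule = {
    "ID": "tiered-storage-raw-zone",
    "Status": "Enabled",
    "Filter": {"Prefix": "raw/"},
    "Transitions": [
        # Move objects to S3 Standard-IA after 30 days...
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        # ...and to S3 Glacier after 90 days.
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
}

# Applied (not run here) with boto3, for example:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="analytics-lake",
#     LifecycleConfiguration={"Rules": [lifecycle_rule]})
```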
Advance
A modern data architecture can be further enhanced to increase the breadth of data usage, from standard analytics that supports business and operational functions to more complex capabilities that support predictions and insights, which in turn enables faster decision-making. To achieve this, the architecture supports the capabilities described in the following sections.
Understand feature engineering
Feature engineering supports machine learning and involves setting up feature stores or feature marts. Data science teams create new features (derived attributes) for both supervised and unsupervised learning models and store them in feature marts for simplified transformation and enhanced data accuracy. Enterprises can reuse the features across multiple analytics models, which improves speed to market.
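The reuse described above can be sketched as follows: features are derived once, registered in a feature mart, and then read by multiple models. The feature definitions (`order_count`, `avg_order_value`) and the in-memory mart are illustrative assumptions; a production feature store would persist and version these.

```python
# Illustrative feature-mart sketch: derive features once, reuse them
# across models. Feature names and the dict-based mart are assumptions.

from statistics import mean

FEATURE_MART = {}

def register_features(customer_id, orders):
    """Derive features once and store them for reuse by multiple models."""
    features = {
        "order_count": len(orders),
        "avg_order_value": mean(orders) if orders else 0.0,
    }
    FEATURE_MART[customer_id] = features
    return features

register_features("c1", [10.0, 30.0])
churn_input = FEATURE_MART["c1"]  # reused by a churn model
ltv_input = FEATURE_MART["c1"]    # and by a lifetime-value model
```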
Plan to denormalize datasets
Constructing denormalized datasets or data marts can significantly simplify the data for business users by making the required data readily available in a single location and increasing the speed of analytics. If designed carefully, one record can support multiple usage models and reduce the overall development lifecycle. Effective governance of denormalized datasets is also important, for two reasons. First, implementing denormalized data can create a large number of redundant datasets, which become a challenge to manage at scale. Second, these datasets can be increasingly difficult to repurpose if they aren't modeled correctly.
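Denormalization amounts to joining normalized tables into one flat record per row, as in the sketch below. The table contents are made-up examples; the point is that a business user reads one record instead of joining two tables.

```python
# Sketch of denormalizing two normalized tables (customers, orders) into
# a single flat dataset for business users. Table contents are made up.

customers = {"c1": {"name": "Acme", "region": "EMEA"}}
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 100},
    {"order_id": 2, "customer_id": "c1", "amount": 250},
]

def denormalize(orders, customers):
    """Join customer attributes onto each order record."""
    return [
        {**order, **customers[order["customer_id"]]}
        for order in orders
    ]

flat = denormalize(orders, customers)
```

Note how the customer attributes (`name`, `region`) are now duplicated on every order row; that duplication is exactly the redundancy the governance concerns above refer to.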
Design portability and scalability
Large organizations seldom have all their applications and users on a single data platform. Their applications and data stores are typically distributed across legacy on-premises and cloud platforms, making it difficult for analytics teams to mix and merge data. We recommend that you containerize data based on characteristics such as domain, geography, business use cases, and so on. This containerization increases portability between various platforms and applications and supports more effective consumption. Segmenting data into containers and exposing them through APIs helps you scale your data architecture more easily. It enables hybrid, end-to-end data flow and helps on-premises and cloud-based applications work seamlessly.
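One minimal way to sketch the API-based consumption described above is a resolver that maps a logical (domain, dataset) request to its physical location, whether on premises or in the cloud. The registry entries and backend URIs below are hypothetical.

```python
# Hedged sketch: domain-segmented datasets exposed through one API layer,
# so consumers need not know where each domain's data physically lives.
# The backends and URIs are illustrative assumptions.

DOMAIN_BACKENDS = {
    "sales": "s3://analytics-lake/curated/sales",    # cloud data lake
    "finance": "jdbc:postgresql://onprem/finance",   # on-premises store
}

def resolve_dataset(domain: str, dataset: str) -> str:
    """Map a logical (domain, dataset) request to its physical location."""
    try:
        backend = DOMAIN_BACKENDS[domain]
    except KeyError:
        raise KeyError(f"unknown domain: {domain!r}") from None
    return f"{backend}/{dataset}"
```

Because consumers depend only on the logical names, a domain's data can be migrated between platforms by updating the registry, which is the portability the section describes.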
Excel
As a modern analytics architecture evolves within an organization, it is important to manage that change by introducing reusable concepts. These concepts increase durability and adoption while keeping costs in check. Some of the concepts to consider are discussed in the following sections.
Design a configurable framework
Organizations often create multiple, complex models to address their unique business needs. These models require the creation of multiple data pipelines and engineered features. Over time, this creates significant redundancy and increases operating costs. Creating a framework that incorporates a set of parameter-driven, configurable base models reduces the development time and operating costs. The analytical engine can implement these configurable models to provide the desired output.
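The parameter-driven base model idea can be sketched as a factory that produces configured scoring functions, so new business needs are met with new parameters rather than new pipelines. The weight names and the linear scoring formula are illustrative assumptions.

```python
# Sketch of a parameter-driven base model: one configurable scoring
# function replaces several near-duplicate pipelines. The weights and
# the linear formula are illustrative assumptions.

def make_scoring_model(weights):
    """Return a scoring function configured by parameters, not new code."""
    def score(record):
        return sum(weights[k] * record.get(k, 0.0) for k in weights)
    return score

# Two "models" from the same base, differing only in configuration:
risk_model = make_scoring_model({"late_payments": 2.0, "utilization": 1.5})
value_model = make_scoring_model({"order_count": 0.5, "avg_order_value": 1.0})

risk = risk_model({"late_payments": 3, "utilization": 0.8})
```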
Plan to build a unified analytical engine
Business problems are unique and often require custom technologies to address requirements, resulting in multiple analytical engines in an organization. Designing and developing a unified AI-based analytical engine interface that can support multiple programming paradigms simplifies usage and reduces costs.
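A unified interface of the kind described above can be sketched as a single entry point that dispatches to engines for different paradigms. The engine names and their toy behaviors below are hypothetical stand-ins for real query and dataframe engines.

```python
# Illustrative sketch of a unified analytical-engine interface: one entry
# point over multiple programming paradigms. Engine names and behaviors
# are hypothetical stand-ins.

def run_query(engine: str, payload):
    """Single interface that dispatches to paradigm-specific engines."""
    engines = {
        "sql": lambda q: f"executed SQL: {q}",
        "dataframe": lambda rows: len(rows),  # e.g. a row-count operation
    }
    if engine not in engines:
        raise ValueError(f"unsupported engine: {engine!r}")
    return engines[engine](payload)
```

Consumers code against one interface, and engines can be added or swapped behind it without changing callers.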
Define DataOps
Most data professionals spend a significant amount of time performing data operations such as locating the right data, transforming, modeling, and so on. Having agile data operations (DataOps) can greatly enhance the data architecture by breaking down the silos between data engineers, data scientists, data owners, and analysts. DataOps enables better communication between teams, reduces cycle time, and ensures high data quality.

Data and analytics architectures have undergone numerous transformations over time because of changing business needs and technological advancements. An organization must strive to develop, implement, and maintain a data and analytics architecture that evolves over time and supports its business.