The Data Quality Jigsaw: Why & How
Written By: Chakravarthy Varaga
Unreliable Data Leads to Hindered Decision Making
Imagine your finance team presenting a financial health report to the CEO, covering metrics such as revenues, profits, and cash flows categorized by product. It's vital that essential business metrics, like Total Bookings, Total Revenues by product, Total Spend on products, and Available Cash, are accurate for the report to reflect the true financial health. Accurate data empowers the CEO to make informed decisions that support business growth.
Inaccuracies in such a report can result in:
- Impeded business decisions hindering growth.
- Questions about the ROI of data teams, potentially leading to reduced investments.
- Disconnection among teams within the organization.
The quality of data also mirrors the inefficiencies and operational complexities within the organization.
Why Would a CTO/CDO Invest in a Data Team?
The CTO's/CDO's willingness to invest in a data (engineering) team hinges on its ability to produce accurate, reliable data. Data Quality is a measure that directly impacts the ROI of a Data Team.
Advocacy for data teams often comes from sponsors such as product teams, marketing, and finance, who heavily depend on various metrics like Business Metrics, North Star Metrics, Product Metrics, and Engineering Platform Metrics. Data Quality is a measure of the success of the partnerships between product, data, and business.
Data Warehouse and Data Lake: Central Hubs with a Pitfall
A Data Lake and the Data Warehouse built on top of it act as centralized hubs for storing diverse data types, structured and unstructured, at any scale. Their primary goal is to mitigate data silos within an organization by providing a unified storage solution: consolidating data into a single repository facilitates seamless integration and analysis of data from sources across the organization. However, as the Data Warehouse expands, managing the data becomes increasingly complex.
From my past experience managing data teams: for a company operating at scale, picture more than 20 product lines generating thousands of tables with dozens of columns each. These tables undergo daily transformations and enrichments via thousands of pipelines, with terabytes of data flowing every day. In an ecosystem this size, with tens of analysts, product teams, and data scientists trying to understand the domain, validate the data, and track its usage, any glitch in a source or a pipeline can escalate into a data quality problem.
The flow starts with the data producer generating data, transmitting it to the data infrastructure, and landing it in the data lake. The data is then enriched and written out to the data warehouse. Downstream consumers (Data Science, BI, Analytics/Reporting, Analysts, Applications) then struggle to make sense of the data in the lake or the warehouse because they lack the producer's domain knowledge, so they often go back to the producer to understand the intricacies of the data.
As the Data Warehouse grows in size and complexity, it can transition from being a strategic advantage to a technical debt for the organization.
Understanding Data Quality
Data quality is good when the data is reliable and meaningful. Several factors determine this reliability (a minimal check sketch follows the list):
- Accuracy: Is the data accurate? For example, are the total average orders/day for the last 7 days accurate? Can we corroborate this with verifiable alternate pipelines?
- Consistency: Inconsistencies can lead to problems. For instance, if marketing campaigns are personalized based on the 'current_country' field, a discrepancy can lead to a bad customer experience.
- Completeness: Different stakeholders have varying expectations of data completeness. Business heads might focus on error percentages or deviations in presented metrics, while product and engineering teams determine mandatory fields in domain entities.
- Validity: Validity often depends on business and product teams. Representing customer addresses consistently across geographies is an example of maintaining validity.
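To make these dimensions concrete, here is a minimal sketch of what such checks might look like in practice. The field names (order_id, order_price, current_country), the allowed country codes, and the tolerance are hypothetical illustrations, not part of any particular pipeline.

```python
# Minimal sketch of per-record quality checks; field names and allowed values are hypothetical.
MANDATORY_FIELDS = {"order_id", "order_price", "current_country"}   # completeness
ALLOWED_COUNTRIES = {"IE", "IN", "US"}                              # validity / consistency

def check_record(record: dict) -> list:
    """Return the list of quality violations for a single order record."""
    issues = []
    missing = MANDATORY_FIELDS - record.keys()
    if missing:
        issues.append(f"completeness: missing fields {sorted(missing)}")
    if record.get("current_country") not in ALLOWED_COUNTRIES:
        issues.append(f"consistency/validity: unexpected country {record.get('current_country')!r}")
    if not isinstance(record.get("order_price"), (int, float)):
        issues.append("validity: order_price is not numeric")
    return issues

def corroborate(metric: float, alternate: float, tolerance: float = 0.01) -> bool:
    """Accuracy: compare a metric (e.g. avg orders/day over 7 days) against an alternate pipeline."""
    return abs(metric - alternate) <= tolerance * max(abs(alternate), 1.0)

# Usage
batch = [
    {"order_id": 1, "order_price": 42.0, "current_country": "IE"},
    {"order_id": 2, "order_price": "42,00", "current_country": "XX"},  # bad type and country
]
for record in batch:
    print(record["order_id"], check_record(record))
print("accuracy corroborated:", corroborate(1043.0, 1040.0))
```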
Apart from these characteristics, addressing two fundamental problems is crucial to resolve data quality issues:
- Governance & Integrity: A lack of proper data governance can result in unintended consequences. For example, an incorrect promotional campaign sent to the wrong country due to improper access can lead to a negative customer experience.
- Ownership & Accountability: Data quality is a shared responsibility. Different data elements have different owners at various stages of the pipeline, and ensuring a clear lineage of changes and raising alerts for respective owners and consumers is essential.
If the format of 'order_price' changes, it can disrupt downstream pipelines and metrics owned by other teams. Similarly, pipeline breaks, even without format changes, can lead to data loss and inaccuracies in metrics.
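As a sketch of what ownership-aware alerting could look like, the snippet below checks an incoming batch against the types the producer previously agreed to, and names the owning team and downstream consumers to notify when a drift is found. The field names, owning teams, and consumer names are hypothetical.

```python
# Minimal sketch: detect a format/type drift on a field and identify whom to alert.
# Field names, owning teams, and downstream consumer names are hypothetical.
EXPECTED_TYPES = {"order_id": int, "order_price": float}
FIELD_OWNERS = {"order_id": "orders-team", "order_price": "payments-team"}
DOWNSTREAM = {"order_price": ["revenue-metrics", "finance-dashboard"]}

def detect_drift(batch: list) -> list:
    """Compare incoming records against the agreed types and report any drift."""
    alerts = []
    for record in batch:
        for field_name, expected in EXPECTED_TYPES.items():
            value = record.get(field_name)
            if value is not None and not isinstance(value, expected):
                alerts.append(
                    f"{field_name}: expected {expected.__name__}, got {type(value).__name__}; "
                    f"alert {FIELD_OWNERS[field_name]} and downstream {DOWNSTREAM.get(field_name, [])}"
                )
    return alerts

# Usage: the producer has started sending order_price as a string.
print(detect_drift([{"order_id": 7, "order_price": "19.99"}]))
```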
In essence, "data quality" is a collective responsibility involving all stakeholders.
Addressing Data Quality Challenges: Data Contracts
While there's no one-stop solution to data quality challenges, one effective approach is to left-shift the detection of breaking changes to the data source and its owners. This entails an automated process for reviewing and approving changes with data producers.
The industry is beginning to embrace the concept of "Data Contracts." These contracts serve as schematic representations of data agreements between data producers/owners and data consumers. They act as a means to enforce end-to-end data quality. When a producer modifies the data, downstream data owners and applications are promptly alerted. This forces the producer to review the changes and cascade them downstream if necessary.
From a technical perspective, the solution involves defining the contract and validating it at various stages.
Data Contract Definition Should Include:
- Schema: Data field or data set schemas, allowed ranges, data types, allowed values, lengths, etc.
- Semantic Definition: Downstream applications, their owners, and metrics powered by the data field.
- SLA Definitions: Data freshness, validity, and allowed data latency.
- Alerting: Teams to be informed of changes.
- Governance Layer: Access permissions, data sensitivity, PII information, and roles allowed to access the column.
Validation: Contracts are then validated and enforced at various stages of the pipelines; a minimal sketch of a contract and its enforcement follows.
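As an illustration, here is a minimal sketch of how a contract covering these elements might be expressed and enforced at a pipeline stage. Every concrete name and value (the dataset, field types, owners, alert channel, roles, and freshness window) is a hypothetical example rather than a prescribed format.

```python
# Minimal sketch of a data contract and its enforcement; all concrete values are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class DataContract:
    dataset: str
    schema: dict                      # field name -> expected Python type
    owners: list                      # semantic: who owns the data
    downstream: list                  # applications/metrics powered by the data
    max_staleness: timedelta          # SLA: agreed data freshness
    alert_channel: str                # where to notify teams of violations
    allowed_roles: set = field(default_factory=set)  # governance: who may read it

    def enforce(self, batch: list, produced_at: datetime, reader_role: str) -> list:
        """Return the violations found for a batch at a given pipeline stage."""
        violations = []
        if reader_role not in self.allowed_roles:
            violations.append(f"governance: role {reader_role!r} not permitted")
        if datetime.utcnow() - produced_at > self.max_staleness:
            violations.append("sla: data is staler than the agreed freshness")
        for record in batch:
            for name, expected in self.schema.items():
                if not isinstance(record.get(name), expected):
                    violations.append(f"schema: {name} is not {expected.__name__}")
        if violations:
            print(f"alert {self.alert_channel}: {self.dataset} -> {violations}")
        return violations

# Usage
orders_contract = DataContract(
    dataset="orders",
    schema={"order_id": int, "order_price": float},
    owners=["orders-team"],
    downstream=["revenue-metrics", "finance-dashboard"],
    max_staleness=timedelta(hours=6),
    alert_channel="#data-quality-alerts",
    allowed_roles={"analyst", "finance"},
)
orders_contract.enforce(
    [{"order_id": 1, "order_price": "9.99"}],                # wrong type -> schema violation
    produced_at=datetime.utcnow() - timedelta(hours=12),     # stale batch -> SLA violation
    reader_role="marketing",                                 # not permitted -> governance violation
)
```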
Wrapping Up
In today's economic landscape, organizations favor cost-efficient data management, a better return on data investment (ROI), and leveraging existing tools and platforms over allocating additional resources to chase data quality issues.
Shifting ownership responsibilities to producers puts data contracts front and center, elevating their role in upholding data quality and fostering collaboration in data-driven organizations.
How data contracts are adopted and integrated into organizations' broader governance frameworks will be a fascinating development to watch in the community.
#dataquality #datascale #datacontracts