The Case For Responsible Data Collection

The race to implement artificial intelligence solutions has many companies feeling the pressure to move quickly and get a solution up and running. And while we fully support a streamlined approach to AI implementation, doing so without the right foundation in place can be risky.

We’re seeing this play out with cloud-based HR and finance giant Workday, Inc., which has landed itself in hot water over reports that its AI-driven hiring tool, HiredScore, was discriminating against job seekers aged 40 and over.

The reason? Biased training data.

This real-world case shows how poor data collection practices can lead to legal liability, reputational damage, loss of user and employee trust, and real harm to vulnerable populations. Even for organizations with mature AI solutions, this story serves as a wake-up call to make sure responsible data collection practices are firmly in place.

Getting Data Collection Right from the Start

Responsible data collection is not a one-time activity. Organizations committed to data integrity recognize it as a continuous process, with clear roles and responsibilities for how data is collected, maintained, and used. That includes regular audits and feedback loops, an active commitment to sourcing diverse and representative datasets, and full transparency with data subjects. Informed, explicit consent is critical—but so is ensuring genuine understanding.

Of course, putting these steps into practice is often easier said than done. Some of the biggest challenges we see include:

  • Pressure to deploy AI models quickly without adequate governance
  • Absence of clear accountability
  • Difficulty tracking data origins, modifications, or lineage
  • Lack of proper documentation
  • Not planning for future strategic use

Without a strong process in place, companies tend to fall into predictable patterns: limited diversity due to convenience sampling, over-reliance on automated data without validation, missing or incorrect fields, and false assumptions about sample representation. As the Workday example shows, these patterns can trigger a chain of ethical, reputational, and legal risks that are tough and expensive to recover from.
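Several of these failure patterns can be caught with lightweight automated checks before they ever reach a model. As a minimal sketch (the field names, the group field, and the representation threshold below are illustrative assumptions, not fixed standards), a sample audit might flag missing required fields and under-represented groups in one pass:

```python
from collections import Counter

def audit_sample(records, required_fields, group_field, min_share=0.05):
    """Flag missing fields and under-represented groups in a sample.

    `records` is a list of dicts. The 5% default share threshold is an
    illustrative assumption; pick one that fits your population.
    """
    missing = Counter()
    groups = Counter()
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                missing[field] += 1
        groups[rec.get(group_field, "unknown")] += 1

    total = len(records)
    underrepresented = [g for g, n in groups.items() if n / total < min_share]
    return {"missing_counts": dict(missing),
            "underrepresented_groups": underrepresented}

# Hypothetical usage: three records, one with a blank field,
# and an age band that falls below the chosen threshold.
report = audit_sample(
    [{"age_band": "40+", "role": "engineer"},
     {"age_band": "under_40", "role": "analyst"},
     {"age_band": "under_40", "role": ""}],
    required_fields=["age_band", "role"],
    group_field="age_band",
    min_share=0.4,
)
```

A check like this won't prove a dataset is representative, but run routinely during collection it surfaces convenience-sampling skew while there is still time to correct course.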

Balancing Speed with Ethics

At Quantum Rise, we take an education-first approach that helps our clients move quickly without cutting corners. That means investing in clean, reliable data early to prevent costly mistakes and to build more trustworthy models from the start.

We also help teams implement agile-friendly tools like dbt, Airflow, Azure Data Factory, and Microsoft Purview. These tools support rapid data lineage mapping and proactive quality monitoring, so teams can maintain momentum without sacrificing integrity.

Part of this process involves establishing clear guardrails around “minimum viable data quality,” which enables fast iterations without compromising essential ethical or performance standards.

And when the foundation is in place, the impact can be immediate. After tightening its defect taxonomy and relabeling just a few thousand images, a global steel mill watched its vision model accuracy jump from 76% to 93% across 38 defect classes in just two weeks. No changes were made to the network architecture. The improvement came entirely from better data.

Seven Steps to Better Data Collection Practices  

If you're looking to take one concrete step toward more responsible data practices, start by creating a cross-functional data council with executive support. This team should have the authority to block any data feed—shop-floor ERP screens included—until it passes basic logging, schema, and validation checks. Tying clean-data KPIs to each department’s objectives and key results (OKRs) ensures issues get fixed at the source, before a machine learning project is underway.
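The kind of gate a data council might enforce can be surprisingly simple. As a hedged sketch (the schema, column names, and sample rows below are hypothetical), a feed could be rejected unless every row matches an agreed schema:

```python
def validate_feed(rows, schema):
    """Reject a data feed unless every row matches the expected schema.

    `schema` maps column name -> expected Python type. Returns a list
    of human-readable errors; an empty list means the feed passes.
    """
    errors = []
    for i, row in enumerate(rows):
        for col, expected_type in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], expected_type):
                errors.append(f"row {i}: '{col}' should be {expected_type.__name__}")
    return errors

# Hypothetical feed: the second row carries an ID as a string,
# which the gate should catch before ingestion.
errors = validate_feed(
    [{"employee_id": 101, "hours": 7.5},
     {"employee_id": "102", "hours": 8.0}],
    schema={"employee_id": int, "hours": float},
)
```

In practice you would express the same contract in your pipeline tooling (for example, as dbt schema tests), but the principle is identical: no feed enters the warehouse until the checks pass.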

Beyond that, consider the following best practices:

  1. Define your scope: Clearly outline which data you truly need to minimize unnecessary collection
  2. Include diverse perspectives: Involve end-users, subject matter experts, and impacted communities from the outset
  3. Standardize labeling: Create clearly defined standards for consistent, objective data annotation
  4. Conduct mid-collection audits: Regularly audit datasets during collection to detect emerging biases early
  5. Establish documentation frameworks: Use tools like Datasheets for Datasets to document data composition, limitations, and known biases
  6. Assign ownership: Identify and empower team members to oversee bias detection and data quality
  7. Track drift: Monitor usage and shifts over time to guide downstream analytics and AI models
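Step 7 is the easiest to put a number on. One common drift measure is the Population Stability Index (PSI) between a baseline distribution and the current one; as a minimal sketch (the category names and the rule-of-thumb threshold are illustrative assumptions):

```python
import math

def population_stability_index(baseline, current):
    """Population Stability Index between two categorical distributions.

    Inputs are dicts of category -> proportion (each summing to 1).
    A common rule of thumb treats PSI > 0.2 as meaningful drift, but
    the right threshold depends on your data and risk tolerance.
    """
    eps = 1e-6  # avoid log(0) for categories absent on one side
    psi = 0.0
    for cat in set(baseline) | set(current):
        b = baseline.get(cat, 0.0) + eps
        c = current.get(cat, 0.0) + eps
        psi += (c - b) * math.log(c / b)
    return psi

# Hypothetical example: the share of older applicants in incoming
# data has halved relative to the baseline.
psi = population_stability_index(
    {"under_40": 0.6, "40_plus": 0.4},
    {"under_40": 0.8, "40_plus": 0.2},
)
```

Tracked over time per feature, a metric like this turns "monitor for drift" from an aspiration into a dashboard that can trigger a re-audit before a model quietly degrades.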

The Quantum Rise Approach

At Quantum Rise, we guide clients through the complete journey of responsible AI implementation. From documenting data lineage and facilitating governance committees to untangling complex data systems, we help organizations build AI solutions that are both powerful and ethical.

We also promote a culture of transparency through frameworks like Datasheets for Datasets and Data Nutrition Labels, helping ensure AI systems perform reliably, responsibly, and in a way you can stand behind.

Ready to build an AI solution you can trust? Contact us today to talk through your data strategy.

_____

Matt King, Senior Data Engineer

Subscribe to our newsletter
to be the first to know about the latest and greatest from Quantum Rise.