Protege Launches DataLab to Advance AI Data Science


March 12, 2026

Protege, an AI data platform specializing in trusted, real-world data at scale, has announced the launch of DataLab at Protege, a new research institution dedicated to elevating the science of AI data. DataLab brings rigorous standards, reproducible methodologies, and scientific discipline to dataset design, construction, evaluation, and safety—addressing critical limitations in data quality, representation, complexity, and selection that increasingly constrain frontier AI progress.

Quick Intel

  • DataLab at Protege establishes AI data as a formal scientific discipline, focusing on high-quality dataset innovation, methodological rigor, and measurable performance impact.
  • At launch, a majority of the “Magnificent 7” AI companies and leading frontier AI labs are actively collaborating with DataLab on training and evaluation data projects.
  • The institution operates across three pillars: scientific partnerships with frontier researchers, development of high-value datasets and data products, and leadership in open AI data research through publications, benchmarks, and evaluations.
  • Led by Engy Ziedan, Co-Founder and Chief Scientific Officer at Protege, DataLab combines machine learning researchers, economists, and domain experts to integrate academic rigor with applied commercial expertise.
  • Early releases include multimodal healthcare benchmark datasets (MedScribe and Medcode) reflecting diagnostic ambiguity and longitudinal context, with ongoing collaborations on high-stakes challenges in advanced cancers, agentic task selection, audio de-identification, and international healthcare representation.
  • DataLab aims to treat data as core infrastructure—not exhaust—driving more capable, reliable AI systems through disciplined marginal analysis and opportunity-cost-aware dataset decisions.

Elevating the Data Layer in AI Development

As AI models scale in size and capability, progress increasingly hinges on access to high-quality, well-curated training and evaluation data rather than solely on compute or architecture. DataLab addresses this underdeveloped pillar by applying scientific ambition comparable to that of frontier model labs: establishing clear quality standards, reproducible processes, and rigorous evaluation frameworks.

“We understand the three core pillars driving AI: models, chips, and data. We are convinced that with the right datasets—the third, underdeveloped pillar—you can push the entire frontier forward,” said Bobby Samuels, CEO of Protege. “We created DataLab to treat data as infrastructure, not exhaust. If we want more capable, reliable systems, we need standards, reproducibility, and real scientific discipline at the data layer.”

“The strength of DataLab is its ability to integrate perspectives that are often siloed,” said Engy Ziedan, Co-Founder and Chief Scientific Officer at Protege. “Advancing AI requires more than larger models or more data alone. It requires thinking at the margin, where we weigh the marginal value of a datapoint on learning and the opportunity cost of choosing the wrong dataset. This requires disciplined dataset design, careful evaluation, and a deep understanding of real-world complexity. Our team is structured to deliver exactly that.”

Real-World Impact and Early Collaborations

DataLab engages directly with frontier AI researchers to co-design datasets and navigate complex technical challenges, while translating insights into commercially viable data products. Its active research agenda includes publishing cutting-edge work, creating new benchmarks, and identifying gaps in current training and evaluation data landscapes.

Recent and ongoing efforts include multimodal healthcare benchmarks (MedScribe and Medcode) that incorporate diagnostic ambiguity and longitudinal clinical context, alongside collaborations addressing advanced cancer data challenges, agentic task selection, audio de-identification, and improved international healthcare representation.

“Data quality has become the defining constraint in frontier AI development, yet investment and innovation have lagged,” said Nikhil Basu Trivedi, Co-Founder and General Partner at Footwork. “That changes with DataLab at Protege, which brings the same level of rigor and expertise to AI data that we have for AI chips and models. DataLab experts are doing the essential AI data infrastructure work and research that moves the AI frontier forward.”

DataLab invites collaboration from frontier labs, academic researchers, and domain experts committed to raising standards for how AI data is built, validated, and measured.

About Protege

Protege is an AI data platform designed to unlock real-world data at scale. By enabling high-quality, cross-domain data networks, Protege helps AI teams overcome the most critical bottleneck in AI development and deploy more capable, reliable models across industries such as healthcare, media, audio, motion capture, and beyond.
