Protege, an AI data platform specializing in trusted, real-world data at scale, has announced the launch of DataLab at Protege, a new research institution dedicated to elevating the science of AI data. DataLab brings rigorous standards, reproducible methodologies, and scientific discipline to dataset design, construction, evaluation, and safety—addressing critical limitations in data quality, representation, complexity, and selection that increasingly constrain frontier AI progress.
As AI models scale in size and capability, progress increasingly hinges on access to high-quality, well-curated training and evaluation data rather than solely on compute or architecture. DataLab addresses this underdeveloped pillar by applying scientific ambition comparable to that of frontier model labs: establishing clear quality standards, reproducible processes, and rigorous evaluation frameworks.
“We understand the three core pillars driving AI: models, chips, and data. We are convinced that with the right datasets—the third, underdeveloped pillar—you can push the entire frontier forward,” said Bobby Samuels, CEO of Protege. “We created DataLab to treat data as infrastructure, not exhaust. If we want more capable, reliable systems, we need standards, reproducibility, and real scientific discipline at the data layer.”
“The strength of DataLab is its ability to integrate perspectives that are often siloed,” said Engy Ziedan, Co-Founder and Chief Scientific Officer at Protege. “Advancing AI requires more than larger models or more data alone. It requires thinking at the margin, where we weigh the marginal value of a datapoint on learning and the opportunity cost of choosing the wrong dataset. This requires disciplined dataset design, careful evaluation, and a deep understanding of real-world complexity. Our team is structured to deliver exactly that.”
DataLab engages directly with frontier AI researchers to co-design datasets and navigate complex technical challenges, while translating insights into commercially viable data products. Its active research agenda includes publishing cutting-edge work, creating new benchmarks, and identifying gaps in current training and evaluation data landscapes.
Recent and ongoing efforts include multimodal healthcare benchmarks (MedScribe and Medcode) that incorporate diagnostic ambiguity and longitudinal clinical context, alongside collaborations addressing advanced cancer data challenges, agentic task selection, audio de-identification, and improved international healthcare representation.
“Data quality has become the defining constraint in frontier AI development, yet investment and innovation have lagged,” said Nikhil Basu Trivedi, Co-Founder and General Partner at Footwork. “That changes with DataLab at Protege, which brings the same level of rigor and expertise to AI data that we have for AI chips and models. DataLab experts are doing the essential AI data infrastructure work and research that moves the AI frontier forward.”
DataLab invites collaboration from frontier labs, academic researchers, and domain experts committed to raising standards for how AI data is built, validated, and measured.
About Protege
Protege is an AI data platform designed to unlock real-world data at scale. By enabling high-quality, cross-domain data networks, Protege helps AI teams overcome the most critical bottleneck in AI development and deploy more capable, reliable models across industries including healthcare, media, audio, and motion capture.