Meetings are messy by default. People talk over each other, shift contexts constantly, and bury key decisions inside long, chaotic conversations. Turning that into accurate, reliable AI summaries is far more difficult than transcription alone.
Brian Farrell from Fathom.ai unpacks the engineering behind modern AI meeting intelligence — from solving speaker detection and hallucinations to balancing latency, accuracy, and contextual depth in real time. He also shares why specialized models, retrieval systems, and high-quality training data are becoming critical to building AI that can truly understand human conversations.
The messiness of meetings is something you don’t fully appreciate until you’re in the data. People talk over each other, trail off, circle back to something from twenty minutes ago, and make commitments so casually you almost miss them. The model has to understand what was said and what was meant. That’s a much harder problem than transcription.
Speaker detection is a good example. It sounds simple until you try to build it. With a bot in the meeting, the platform itself tells you who’s speaking. With bot-free capture, you don’t have that benefit. You’re reading signals from a user’s machine that are filtered, sometimes encrypted, and not designed for this use. And there’s no one-size-fits-all approach; every implementation is tailored to the OS and the client. For instance, one of our engineers spent a long time digging through Zoom client signals just to find a couple of hidden markers that map to the active speaker. None of that is documented. You have to go find it.
On the AI side, the challenge shifts depending on the task. A post-call, final summary takes the full meeting, which is dense, noisy input where you’re extracting signal. A live summary works off the last few minutes, but the quality bar is higher because it’s being read in real time. There are different constraints and different failure points. You’re essentially building separate systems off the same source material.
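As a rough illustration of those two input regimes, here is a minimal sketch in Python. The `Utterance` structure, the helper names, and the 180-second window are illustrative assumptions, not Fathom's actual pipeline:

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One transcribed turn of speech (hypothetical structure)."""
    speaker: str
    start_sec: float
    text: str


def live_window(transcript: list[Utterance], now_sec: float,
                window_sec: float = 180.0) -> list[Utterance]:
    """Input for a live summary: only the last few minutes of speech."""
    return [u for u in transcript if u.start_sec >= now_sec - window_sec]


def final_input(transcript: list[Utterance]) -> list[Utterance]:
    """Input for the post-call summary: the entire meeting."""
    return transcript
```

The point of the sketch is that the two tasks consume different slices of the same source material, which is why they fail in different ways.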
The transcript is the ground truth. Everything in a Fathom summary should trace back to something that was actually said. That constraint does a lot of the work because we’re not asking the model to generate from general knowledge, but rather to summarize a specific, bounded input.
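One way to enforce that constraint mechanically is to check whether each summary sentence can be traced back to the transcript. The sketch below uses a crude content-word overlap as the traceability test; the stop-word list and the 0.5 threshold are illustrative assumptions, and a production system would use something far more robust:

```python
def traceable(summary_sentence: str, transcript_text: str,
              min_overlap: float = 0.5) -> bool:
    """Crude groundedness check: what fraction of the sentence's
    content words also appear somewhere in the transcript?"""
    stop = {"the", "a", "an", "to", "of", "and", "we", "on", "in", "is"}
    words = [w.strip(".,!?").lower() for w in summary_sentence.split()]
    content = [w for w in words if w and w not in stop]
    if not content:
        return True  # nothing substantive to verify
    transcript_words = {w.strip(".,!?").lower() for w in transcript_text.split()}
    hits = sum(1 for w in content if w in transcript_words)
    return hits / len(content) >= min_overlap
```

A sentence that fails this kind of check is a candidate hallucination: it asserts something the bounded input never said.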
On the AI team, we spend a lot of time reading through data. When we train a new model, a significant amount of time and energy goes into reading test outputs to verify accuracy. It becomes an iterative loop: generate outputs, find issues, make small changes, and repeat until the model is as accurate as we can make it. Over time, you build a strong intuition for this work, which makes each pass more efficient.
That said, errors still happen. Models can attribute something to the wrong speaker, merge separate threads, or smooth over ambiguity in a way that changes meaning. The way we push against that is through very high-quality training data. When you’re working with smaller datasets, each example matters more, so the bar for what goes in is extremely high. A model trained on mediocre data produces consistently mediocre output. That’s not acceptable if someone is relying on a summary to brief their team or close a deal.
When a model is well-trained for a specific task, it doesn’t need to do much exploration to get to a good result. It already knows what it’s supposed to do. The real trade-off happens upfront in the training, not at inference.
They’re really two different products that happen to use the same data.
A live summary is read in the moment. If you zone out for a few minutes, you want to catch up without interrupting the meeting. The model works on a small, recent slice and needs to be fast enough that the output is still relevant when you read it. Depth is intentionally limited. You don’t need everything. You need what just happened.
A final summary works with the full meeting, which means dealing with a longer, denser input and figuring out what actually matters across an hour of conversation. Latency isn’t really the concern there since you can wait a bit after the meeting ends. The real challenge is maintaining coherence while prioritizing the right information.
In practice, these are separate models with separate pipelines. If you treat them as one problem, you end up with something that’s okay at both and great at neither.
Recurring meetings are where things get interesting. The real context goes beyond today’s conversation and into things like what was committed to last week, what’s been a pattern over time, and what never got resolved.
If the model only sees the current meeting, the summary can be accurate but shallow. Retrieval adds depth. Before generating an answer or summary, the system can pull in relevant history like prior summaries, action items, and key moments, and incorporate that into the input.
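A retrieval step like that might look roughly like the following. The bag-of-words `embed` function stands in for a real embedding model, and `retrieve` is a hypothetical helper, not Fathom's API:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, history: list[str], k: int = 2) -> list[str]:
    """Pull the k prior summaries most relevant to the query, to be
    prepended to the model's input before generation."""
    q = embed(query)
    ranked = sorted(history, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]
```

The design choice is that relevance is decided before generation: the model only ever sees history that scored well against the question, which keeps the input bounded.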
Ask Fathom is a good example. You can ask something like “what did we agree on about pricing in the last three calls?” and get a synthesized answer grounded in actual meeting history. The model isn’t guessing because it’s working from real records.
User feedback is one of the highest-signal inputs we get. When someone flags a missed action item or an incorrect summary, they’re pointing directly to a failure case. That’s much more useful than aggregate metrics.
Not all feedback is equal, though. Some of it reflects real model errors; some reflects ambiguity in the original conversation. That distinction matters because training on noisy or unclear examples makes the model worse. There’s a curation step before anything gets fed back into training.
When high-quality feedback does make it in, iteration is targeted. We’re not retraining everything. We focus on specific failure modes, expand the data in that area, and check that improvements don’t introduce regressions elsewhere. The goal is to get better at the things users actually notice.
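That acceptance logic, improve the targeted failure mode without regressing elsewhere, can be sketched as a simple gate. The function names, the case format, and the 1% tolerance are all illustrative assumptions:

```python
def evaluate(model, cases) -> float:
    """Fraction of cases a (hypothetical) model gets right.
    `model` is any callable mapping an input string to an output string."""
    return sum(model(c["input"]) == c["expected"] for c in cases) / len(cases)


def accept_new_model(old, new, targeted_cases, regression_cases,
                     tolerance: float = 0.01) -> bool:
    """Accept the new model only if it improves on the targeted failure
    mode AND does not regress on the general suite beyond tolerance."""
    improved = evaluate(new, targeted_cases) > evaluate(old, targeted_cases)
    held = evaluate(new, regression_cases) >= evaluate(old, regression_cases) - tolerance
    return improved and held
```

The two-suite split mirrors the idea in the text: one suite is built from curated user-flagged failures, the other guards everything users already rely on.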
We handle that mostly through specialization. If you rely on one general-purpose model for everything, you either pay a latency penalty where you don’t need it or sacrifice quality where you do. Instead, we break things apart.
The live summary model does one thing. It tracks a conversation in real time and produces a running summary. It’s not general-purpose, and that’s the point. Because it’s narrowly scoped, it can be both fast and reliable.
More complex tasks – cross-meeting synthesis, deeper question answering – run in a different part of the pipeline where latency matters less. Real-time experiences use lightweight, purpose-built models. Tasks that require depth get the computing support they need.
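In spirit, that specialization is just a routing decision made before any model runs. The task labels and model names below are made up for illustration:

```python
# Hypothetical routing table: real-time tasks go to a small,
# purpose-built model; depth-heavy tasks go to a larger one
# in a part of the pipeline where latency matters less.
FAST_MODEL = "live-summary-small"
DEEP_MODEL = "synthesis-large"

ROUTES = {
    "live_summary": FAST_MODEL,
    "final_summary": DEEP_MODEL,
    "ask_history": DEEP_MODEL,
}


def route(task: str) -> str:
    """Pick a model for a task; unknown tasks default to the deep path."""
    return ROUTES.get(task, DEEP_MODEL)
```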
Future-proofing is a big challenge at Fathom, and consequently a priority for us. One of the biggest issues we face is the constant churn and deprecation of frontier models from the big labs. The labs are all chasing bigger and better models with little regard for speed or cost, and because resources are scarce, the smaller models we rely on are often retired early. At Fathom, we’re addressing this by training our own in-house models, usually smaller open-source models that we custom-train for our purposes. That gives us control over our models and guarantees future access, greatly reducing the time and resources spent on migrations.
The other piece is staying current. This field moves fast. What was hard last quarter can be straightforward today. Keeping up with new models, research, and how other teams are solving similar problems isn’t optional.
Brian Farrell is an AI Engineer at Fathom, where he works directly on the core AI pipeline behind the platform's meeting intelligence, including training and fine-tuning the models that power Live Summaries, final meeting summaries, and action item detection. He holds an M.S. in Natural Language Processing and Artificial Intelligence from UC Santa Cruz and a B.S. in Computer Science with First Class Honors from Trinity College Dublin, and previously conducted hallucination research in large language models at the UCSC NLP Lab. His blend of academic research and hands-on industry experience gives him deep technical insight into how AI systems are built and applied in real-world meeting workflows.
Fathom, the #1-rated AI meeting partner, captures what matters in every conversation and turns it into clear, actionable outcomes. By surfacing and connecting key moments, decisions, and action items, it makes every meeting searchable and gives teams the information they need to work smarter and faster. Fathom syncs insights directly into CRM and productivity tools, eliminating manual work. Recognized on G2’s 2026 Best Software Awards Top 100 and named as HubSpot’s 2025 Most Used App of the Year, Fathom is trusted by hundreds of thousands of companies worldwide and backed by Telescope Partners, Maven Ventures, Character, Active Capital, Rackhouse Ventures, and more than 1,300 individual investors.
Learn more at fathom.ai