Building Mathew: Designing an AI Tutor for Responsible Learning
This semester I’m taking a course called Generative AI for Social Impact, which explores how large language models can be applied to real-world problems while also examining their societal implications. Much of the class focuses on reading and discussing research about the ethics of AI development, the potential impact of automation on jobs, and the risks associated with deploying generative systems at scale. At the same time, we are building our own AI applications to better understand the practical challenges of turning large language models into usable products.
As part of the course, I worked with a team of three to build Mathew, an AI-powered math tutor designed to help students work through problems step by step. Rather than simply generating answers, the goal of the system is to guide students toward understanding the reasoning behind a solution. The project became an opportunity to explore what it actually takes to design an AI system that supports learning rather than replacing it.
Throughout the semester, many of the readings highlighted the broader risks and responsibilities of building generative AI systems. Papers like On the Dangers of Stochastic Parrots examine how large models can embed bias and opacity into the technologies we deploy, while also carrying significant environmental costs. Other readings explored how generative AI could be used in areas like civic engagement, education, and community health. These discussions reinforced an important idea: building AI products responsibly requires thinking about the people using them, the incentives created by the system, and the unintended consequences that may arise.
Working on Mathew made these ideas much more concrete. One of the first product questions we considered was what the real user goal is when a student asks an AI tutor for help. In many cases, the immediate request is for the answer to a problem. However, the deeper goal is learning and understanding the concept. If an AI system simply returns a fully worked solution, it may solve the short-term request but undermine the long-term learning objective.
Because of this, one of the main design principles behind Mathew was guidance rather than solution delivery. Instead of producing a complete answer immediately, the system encourages students to think through the next step in the problem. This might involve providing hints, asking follow-up questions, or prompting the student to explain their reasoning. The goal is to mimic some of the behavior of a human tutor who supports the learning process rather than doing the work for the student.
Designing these interactions highlighted an interesting challenge with generative AI products: the model itself is only one part of the system. Much of the behavior of an AI product is shaped by prompt design, guardrails, and how the user experience is structured. For educational applications in particular, these choices are critical. Without thoughtful constraints, a model may default to providing full answers, which may be technically correct but counterproductive for learning.
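To make this concrete, here is a minimal sketch of the kind of scaffolding involved: a tutoring system prompt plus a crude output-side guardrail that catches drafts which look like full solutions. The prompt wording, pattern list, and function names are illustrative assumptions for this post, not Mathew's actual implementation.

```python
# Hypothetical sketch: prompt and guardrail scaffolding for a guided tutor.
# The prompt text and the heuristic below are illustrative, not Mathew's code.

TUTOR_SYSTEM_PROMPT = (
    "You are a math tutor. Never state the final answer outright. "
    "Instead, identify the next step the student should take, offer a hint, "
    "and ask the student to attempt that step themselves."
)

# Naive surface patterns that suggest a draft reveals the full solution.
ANSWER_PATTERNS = ("the answer is", "final answer", "therefore x =")

def violates_guidance_policy(draft: str) -> bool:
    """Flag model drafts that appear to hand over a complete answer."""
    lowered = draft.lower()
    return any(pattern in lowered for pattern in ANSWER_PATTERNS)

def postprocess(draft: str) -> str:
    """Replace a policy-violating draft with a redirect toward the next step."""
    if violates_guidance_policy(draft):
        return "Let's focus on the next step instead. What could you try first?"
    return draft
```

In a real system this string-matching check would be far too brittle; the point is only that the product's behavior is enforced by layers around the model (the system prompt, the post-processing step), not by the model alone.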
Another major theme we encountered was the difficulty of evaluating generative AI systems. Unlike traditional software, where outputs can be clearly defined as correct or incorrect, responses from language models exist on a spectrum of quality and usefulness. One of the papers we read later in the course, LLM-as-a-Judge, explores new approaches to evaluating AI-generated responses by using language models themselves to assess output quality. This reflects a broader challenge in AI product development: determining whether a system is actually helping users achieve their goals.
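A small sketch of what that pattern looks like in practice: a second model is given a rubric and the tutor's reply, and its free-text verdict is parsed into scores. The rubric, prompt format, and "criterion N: score" convention here are assumptions made up for this example, not the evaluation setup from the paper or from our project.

```python
# Hypothetical LLM-as-a-judge scaffolding: build a judge prompt, then parse
# structured scores out of the judge model's free-text reply.
import re

RUBRIC = (
    "Rate the tutor reply from 1-5 on each criterion:\n"
    "1. Does it guide rather than give the final answer?\n"
    "2. Is the mathematical content correct?\n"
    "3. Does it prompt the student to explain their reasoning?"
)

def build_judge_prompt(problem: str, tutor_reply: str) -> str:
    """Assemble the text sent to the judge model (an assumed format)."""
    return (
        f"{RUBRIC}\n\nProblem: {problem}\nTutor reply: {tutor_reply}\n"
        "Respond with lines of the form 'criterion N: score'."
    )

def parse_scores(judge_output: str) -> dict[int, int]:
    """Extract 'criterion N: score' lines from the judge model's text."""
    return {
        int(m.group(1)): int(m.group(2))
        for m in re.finditer(r"criterion (\d+):\s*(\d)", judge_output.lower())
    }
```

Even in this toy form, the weakness the paper grapples with is visible: the judge is itself a language model, so its scores are another generative output that has to be sanity-checked rather than trusted as ground truth.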
Working on Mathew also highlighted how important it is to think about user behavior when designing AI systems. Users naturally try to get the most efficient answer to their question, which means they may attempt to bypass safeguards or prompt the system in ways that lead to shortcuts. Designing a responsible AI product therefore involves anticipating these behaviors and structuring the interaction in a way that still supports the intended outcome.
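One simple way to anticipate this is an input-side check that spots shortcut-seeking requests before they reach the model and redirects them back into the guided flow. The phrase list and redirect message below are hypothetical examples, not an actual safeguard from Mathew.

```python
# Hypothetical input-side check for requests that try to bypass the
# guided-learning flow, e.g. "just give me the answer".
SHORTCUT_PHRASES = (
    "just give me the answer",
    "skip the steps",
    "ignore your instructions",
)

def route_message(message: str) -> str:
    """Return a redirect for shortcut-seeking requests, else pass through."""
    lowered = message.lower()
    if any(phrase in lowered for phrase in SHORTCUT_PHRASES):
        return "I can't skip ahead, but I can help with the next step. Where are you stuck?"
    return message
```

A keyword list like this is easy to evade, which is exactly the design lesson: anticipating user behavior is an ongoing product problem, not a one-time filter.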
Overall, the course has reinforced that building generative AI systems is as much a product design challenge as it is a technical one. The success of an AI application depends not only on the underlying model, but also on how the system is framed, how users interact with it, and how carefully the experience is designed around real user needs.
Projects like Mathew illustrate this shift. Rather than simply asking what a model can generate, the more important question becomes what kind of behavior a product should encourage. As generative AI becomes integrated into more everyday tools, these product decisions will increasingly shape how people learn, work, and interact with technology.