At Intuit, I was tasked with reducing the manual effort involved in tax filing workflows. The solution? Build an agentic AI system powered by Anthropic Claude that could autonomously handle multi-step reasoning: asking questions, interpreting documents, and making filing decisions.
Here's exactly how I built it, what worked, and what I'd do differently.
A standard LLM interaction is stateless: you send a prompt, you get a response. An agentic workflow is different. The model is given tools, memory, and a goal, and it figures out the steps to achieve it autonomously.
In our case, the goal was: given a user's tax documents and financial situation, determine the optimal filing strategy and pre-fill their return.
The most critical part of an agentic system is the system prompt. Here's a simplified version of ours:
```
You are a tax filing assistant for TurboTax. Your job is to help
users complete their federal tax return accurately.

You have access to the following tools:

- get_user_documents(userId): Fetch uploaded W2, 1099, receipts
- calculate_deduction(type, amount): Compute deduction eligibility
- validate_filing_rule(rule_id, context): Check IRS compliance
- submit_draft_return(data): Save progress

Rules:
1. Always verify document data before making calculations
2. Never guess: if data is missing, call get_user_documents first
3. Explain each decision to the user in plain language
4. If confidence < 80%, ask a clarifying question before proceeding
```
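For context, here's roughly how a prompt like that, plus the tool schemas, gets sent to Claude. The sketch below builds a Messages API request using only the JDK's `HttpClient`; the model ID, the two-tool subset, and the helper names are illustrative, not our production code.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ClaudeToolSetup {

    // Tool definitions in the Messages API "tools" format. Only two of the
    // four tools are shown; the others follow the same shape.
    static String toolsJson() {
        return """
        [
          {
            "name": "get_user_documents",
            "description": "Fetch uploaded W2, 1099, receipts for a user",
            "input_schema": {
              "type": "object",
              "properties": { "userId": { "type": "string" } },
              "required": ["userId"]
            }
          },
          {
            "name": "calculate_deduction",
            "description": "Compute deduction eligibility",
            "input_schema": {
              "type": "object",
              "properties": {
                "type": { "type": "string" },
                "amount": { "type": "number" }
              },
              "required": ["type", "amount"]
            }
          }
        ]
        """;
    }

    // Assembles a Messages API request; model ID is illustrative.
    static HttpRequest buildRequest(String apiKey, String systemPrompt, String userMessage) {
        String body = """
        {
          "model": "claude-3-5-sonnet-20241022",
          "max_tokens": 2048,
          "system": %s,
          "tools": %s,
          "messages": [ { "role": "user", "content": %s } ]
        }
        """.formatted(quote(systemPrompt), toolsJson(), quote(userMessage));
        return HttpRequest.newBuilder()
                .uri(URI.create("https://api.anthropic.com/v1/messages"))
                .header("x-api-key", apiKey)
                .header("anthropic-version", "2023-06-01")
                .header("content-type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Minimal JSON string escaping, enough for this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
```

In production you'd use a proper JSON library and the official SDK rather than hand-built strings, but the payload shape is the important part: the system prompt and the tool schemas travel together on every request.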
Claude's tool use feature was the core of the agentic loop. When Claude decides it needs to call a tool, it returns a structured JSON response that our orchestrator executes:
```java
// Spring Boot tool dispatcher (simplified)
public String dispatchTool(String toolName, JsonNode input) {
    return switch (toolName) {
        case "get_user_documents" -> documentService.fetch(input.get("userId").asText());
        case "calculate_deduction" -> deductionEngine.calculate(input);
        case "validate_filing_rule" -> irsRuleValidator.check(input);
        default -> throw new UnknownToolException(toolName);
    };
}
```
The loop runs until Claude either completes the filing or asks the user a question it can't answer itself.
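In sketch form, that loop looks like the following. `ModelClient`, `ModelResponse`, and `ToolCall` are hypothetical stand-ins for the Anthropic SDK types, and the stop-reason strings (`tool_use`, `end_turn`) follow the Messages API; the real orchestrator also handled retries, logging, and the confidence rule from the system prompt.

```java
import java.util.ArrayList;
import java.util.List;

public class AgentLoop {

    record ToolCall(String name, String inputJson, String id) {}
    record ModelResponse(String stopReason, String text, List<ToolCall> toolCalls) {}

    // Hypothetical wrapper around the Messages API call
    interface ModelClient {
        ModelResponse send(List<String> messages);
    }

    // The dispatchTool method shown earlier, behind an interface
    interface ToolDispatcher {
        String dispatch(String toolName, String inputJson);
    }

    static String run(ModelClient model, ToolDispatcher tools, String userGoal, int maxTurns) {
        List<String> messages = new ArrayList<>();
        messages.add("user: " + userGoal);
        for (int turn = 0; turn < maxTurns; turn++) {
            ModelResponse resp = model.send(messages);
            if (!"tool_use".equals(resp.stopReason())) {
                // Model is done, or is surfacing a clarifying question to the user
                return resp.text();
            }
            // Execute each requested tool and feed the result back as the next turn
            for (ToolCall call : resp.toolCalls()) {
                String result = tools.dispatch(call.name(), call.inputJson());
                messages.add("tool_result[" + call.id() + "]: " + result);
            }
        }
        throw new IllegalStateException("Agent exceeded " + maxTurns + " turns");
    }
}
```

The `maxTurns` cap matters in practice: without it, a confused model can loop on the same tool indefinitely, and every extra turn costs tokens.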
The before-and-after numbers:

| Metric | Before | After | Improvement |
|---|---|---|---|
| Manual review time per return | 18 min | 4 min | 78% reduction |
| Filing accuracy | 91% | 97% | +6 pts |
| User drop-off rate | 34% | 21% | 38% reduction |
| Avg tokens per session | n/a | ~4,200 | Baseline |