Keeper's chatbot vs industry incumbents
A detailed comparison & lessons learned
AI chat assistants are the subject of an arms race in the consumer tax prep space, with Intuit TurboTax and H&R Block both investing heavily in them. But the initial reviews of these features have been poor: a recent Washington Post article tested the TurboTax and H&R Block chatbots and found their answers were wrong or unhelpful up to half the time.
We at Keeper have also invested a lot of energy in AI-backed live tax assistance. We’ve learned the hard way that simply hooking ChatGPT up to your product with a few tax-related resources isn’t nearly good enough. You need a robust embedding system, well-calibrated evaluations, integrated human expert review, and contextual hooks that encourage users to try the new experience… In short, not all chatbots are created equal.
So we put the exact same questions that stumped the TurboTax and H&R Block bots to Keeper’s AI assistant. We present a few of its responses below to illustrate some of the lessons we’ve learned along the way.
The incumbents’ responses exemplify two of the major challenges in generating a helpful answer:
- Retrieving the relevant tax code information, and
- Synthesizing that information into an actionable, easy-to-understand response
TurboTax fails at part 1: The bot pulls irrelevant information from a TurboTax Community forum thread. Keeper solves this problem by 1) curating a knowledge base of tax law documents and expert-written articles, and 2) using a cutting-edge embedding model to retrieve only the most relevant portions of these documents before generating the response.
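To make that retrieval step concrete, here’s a minimal sketch of embedding-based retrieval over a curated knowledge base. It isn’t Keeper’s production code: `embed` is a placeholder for whichever embedding model you use, and cosine similarity over precomputed passage vectors is just one reasonable scoring choice.

```python
# Illustrative only: retrieve the passages most relevant to a user's question.
from dataclasses import dataclass

import numpy as np


@dataclass
class Passage:
    source: str         # e.g. an IRS publication or an expert-written article
    text: str
    vector: np.ndarray  # precomputed embedding of `text`


def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model and return a vector."""
    raise NotImplementedError


def retrieve(question: str, knowledge_base: list[Passage], k: int = 4) -> list[Passage]:
    """Return the k passages most similar to the question (cosine similarity)."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    scored = [
        (float(np.dot(q, p.vector / np.linalg.norm(p.vector))), p)
        for p in knowledge_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]
```

Only the top-scoring passages make it into the prompt, which keeps forum noise and unrelated documents out of the response.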
H&R Block’s bot retrieves the relevant tax code information, but fails at part 2. Its response requires a lot of reading/parsing/synthesis on the user’s part. It’s correct but it’s not *helpful*.
In other words, returning a factual answer isn’t the only thing that matters. Our automated evaluation system employs a suite of criteria, all of which must be met for a response to be satisfactory. This H&R Block response satisfies the “factuality” criterion, but it fails “succinctness” (the user doesn’t need to know this much!) and “delight” (so much reading! taxes feel *difficult*!).
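As an illustration (not Keeper’s actual implementation), a multi-criteria evaluation pass can be as simple as running every response past a set of named graders and requiring all of them to pass. The `judge` function below is a stand-in for an LLM-based grader.

```python
# Illustrative sketch: a response is satisfactory only if every criterion passes.
CRITERIA = ["factuality", "succinctness", "comprehensiveness", "delight"]


def judge(criterion: str, question: str, response: str, sources: list[str]) -> bool:
    """Placeholder: ask an LLM grader whether `response` meets `criterion`."""
    raise NotImplementedError


def evaluate(question: str, response: str, sources: list[str]) -> dict[str, bool]:
    """Score the response against every criterion."""
    return {c: judge(c, question, response, sources) for c in CRITERIA}


def is_satisfactory(results: dict[str, bool]) -> bool:
    return all(results.values())
```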
Keeper’s answer is conversational. It raises the concept of claiming the grandparent as a dependent. If the user has questions about this, they can ask a follow-up question. Our users do this all the time: 80% of our conversations involve multiple user messages.
A good accountant doesn’t just dump info on you. They meet you where you are and help you make the best decisions!
Again, the TurboTax bot fails to retrieve relevant information, instead returning irrelevant search results. The H&R Block response is misleading: it assumes the daughter will have income in Arizona. But if she doesn’t, she won’t need to file a return in Arizona. This illustrates another advantage of Keeper’s conversation-oriented approach: if the bot makes an inappropriate assumption, the user can push back and clarify. In this case our bot handles the situation with a conditional answer (“if she earns income…”), which simplifies things further: the user doesn’t need to ask a follow-up question at all.
Unfortunately, when the H&R Block bot makes a mistaken assumption, there’s no way to correct it, because the bot has no conversational memory: it treats each question as the beginning of a new conversation. LLMs are designed to handle conversations – why not take advantage of this capacity?
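The pattern is simple in principle: keep the running message history and send it with every turn, so the model can revise an earlier assumption when the user pushes back. A minimal sketch, assuming a generic `chat_completion` call rather than any particular provider’s API:

```python
# Illustrative only: a conversation that remembers every prior turn.
def chat_completion(messages: list[dict]) -> str:
    """Placeholder: call your LLM with the full message history."""
    raise NotImplementedError


class Conversation:
    def __init__(self, system_prompt: str):
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_message: str) -> str:
        self.messages.append({"role": "user", "content": user_message})
        reply = chat_completion(self.messages)  # the model sees every prior turn
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```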
In the language of our automated evaluation system, the H&R Block response has failed the “comprehensiveness” criterion. By assuming the daughter has income in Arizona, it fails to completely address the user’s more general question.
This time the H&R Block bot decides it can’t answer the question. (If only the TurboTax bot had this much wisdom!) We can’t know for sure why the H&R Block bot made this determination, but it’s reminiscent of another issue we’ve dealt with: tax code changes.
Renewable energy is a rapidly changing area of the tax code. Incentives expire and are replaced by new ones annually, the IRS issues guidance continuously, and we’re constantly incorporating new expert articles into our knowledge base. Giving the correct answer here requires keeping your knowledge base up to date and resolving conflicts between sources of truth as they arise.
We’ve discovered that a well-calibrated evaluation system is your best bet here – it can easily double as a surveillance system for your knowledge base. Every time we generate a response, we automatically evaluate whether the resources we retrieved contradict one another. If so, a human steps in to remove outdated information or add clarification. The result is a near-seamless knowledge base.
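One way to picture that surveillance step is a pairwise contradiction check over the passages retrieved for each response, with any conflicts queued for human review. This is only a sketch; `contradicts` stands in for an LLM-based comparison, not a real library call.

```python
# Illustrative only: flag conflicting source passages for a human to resolve.
from itertools import combinations


def contradicts(passage_a: str, passage_b: str) -> bool:
    """Placeholder: ask an LLM whether two passages give conflicting guidance."""
    raise NotImplementedError


def find_conflicts(passages: list[str]) -> list[tuple[str, str]]:
    """Return every pair of retrieved passages that appear to contradict each other."""
    return [(a, b) for a, b in combinations(passages, 2) if contradicts(a, b)]
```

When a conflict surfaces, a human removes the outdated passage or adds clarification, and the corrected material goes back into the knowledge base.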
But even so, there will inevitably be gaps in the bot’s knowledge. “I don’t know” is not a good enough answer – the user is paying you to help them make decisions. We’ve solved this problem with a combination of chain-of-thought reasoning and old-fashioned human support. First, our bot assesses whether or not it can answer the user’s question. If not, it notifies a human tax expert to join the conversation. (In our system, humans handle about 5% of responses.) After addressing the user’s concern, the human provides feedback on the bot’s performance in the conversation. This feedback is incorporated into the knowledge base, creating a virtuous cycle of improvement.
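Put together, the flow reads roughly like the sketch below. Every function name here is a hypothetical stand-in for the corresponding step: assess whether the question is answerable (the chain-of-thought part), answer if confident, and otherwise hand off to a human tax expert.

```python
# Illustrative only: answer when confident, escalate to a human expert when not.
def can_answer(question: str, retrieved: list[str]) -> bool:
    """Placeholder: have the model reason step by step about whether the
    retrieved material is sufficient to answer confidently."""
    raise NotImplementedError


def generate_answer(question: str, retrieved: list[str]) -> str:
    """Placeholder: generate a response grounded in the retrieved passages."""
    raise NotImplementedError


def escalate_to_expert(question: str) -> str:
    """Placeholder: notify a human tax expert and return their reply."""
    raise NotImplementedError


def respond(question: str, retrieved: list[str]) -> str:
    if can_answer(question, retrieved):
        return generate_answer(question, retrieved)
    # In the numbers above, humans handle roughly 5% of responses; their
    # feedback later flows back into the knowledge base.
    return escalate_to_expert(question)
```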
Ultimately the proof is in the pudding: our users love our live tax assistance. Why wouldn’t they? They get instant help from a helpful, attentive tax expert. Our customer satisfaction has never been higher. Users can file more quickly than ever. And our costs are down. That’s what’s possible when you build a really great AI product.