Ask an AI Accountant
Version 2.0 by
This AI has ingested US tax law
Every few days, a (human) tax professional reviews all answered questions
So far, the AI has been broadly correct 97% of the time
AI has been fine-tuned on US tax law
The base model is OpenAI’s GPT-4, but it has been fine-tuned on US federal and state tax legislation and on IRS announcements.
Answers are periodically graded by tax pros
Graded answers display a symbol indicating whether the answer was correct or incorrect.
The AI has 96% accuracy (vs 94% for human tax pros)
The AI response accuracy will be continuously updated as more questions get submitted. So far, 129 questions have been graded.
How should I use this tool?
We trained GPT-4 on the latest tax updates for 2023 and made some important technical updates from V1. (More on those below!)
When will I get an answer?
You’ll see the AI accountant’s answer right away (or… in about 10-40 seconds). It’ll show up right under the window where you typed it in.
Will the answers be fact-checked?
A (human!) CPA, EA, or JD will review responses periodically. (Read about one CPA's experience fact-checking AI responses here!)
We’ll incorporate their feedback — if they had any — into the version of the answer published above.
How do I know which answers have been graded?
When you’re browsing past questions, a symbol next to the answer means a tax professional has already reviewed it. A filled-in green checkmark means the AI got it right, while a red "stop" icon means it got it wrong. You'll also see the tax pro's comments on the individual question's page.
Who are our graders?
Below is a list of the graders used in this assessment. They were pre-vetted by the Keeper team for reputability and reliability. “Accuracy” is determined using a blind test — by comparing the answers of each grader to a source-of-truth answer. There is no difference between how the AI is graded and how the humans are graded.
Please also keep in mind that the questions being evaluated are designed to be tricky, and that the single Q&A format of this experiment is not entirely representative of the typical way accountants engage with clients.
What guidelines were used for grading?
- Overly ambiguous questions are not graded. This means questions that are essentially “unanswerable” without making major assumptions on behalf of the user, e.g. "I have two kids and am making 35k in California; how much will I owe in taxes?"
- Overly contentious topics are not graded. This includes topics without clear case law, or where two or more of the five tax professionals disagreed with the rest, e.g. “I’m a basketball referee who needs to be able to run up and down the court with the players. Can I claim my gym membership fee as a tax deduction as an ordinary and necessary business expense?”
- Only the accuracy and relevancy of each answer are assessed. Other factors, such as tone or the offering of information that might be helpful but doesn’t directly answer the question asked, were not factors in grading.
- Only questions pertaining to US individual tax law are graded. Corporate tax questions and other financial advisory inquiries are excluded from official grading.
- Only questions written in English are graded. While the AI can answer questions written in other languages, those responses are not graded due to the language limitations of our human graders.
What technical improvements were made over V1?
- Comprehensive embeddings encompassing all federal US tax codes, every state tax code, and IRS announcements from the past ten years.
- A chain-of-reasoning retrieval system devised to identify the most relevant vector embeddings for each question and generate a suitable response based on those embeddings.
- A more rigorous evaluation system with an extended training period, in which dozens of human accountants provided expert grading of answers and established correct responses.
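To make the retrieval step above concrete: an embedding-based retriever ranks stored document vectors by similarity to the question's vector and passes the top matches to the model. The sketch below is purely illustrative (not Keeper's actual pipeline) and uses tiny hand-made 3-dimensional vectors in place of real model embeddings; the document names are hypothetical stand-ins for tax-code sections.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(question_vec, doc_vecs, k=2):
    # Rank every document embedding by similarity to the question embedding
    # and return the IDs of the k closest documents.
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(question_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 3-dimensional embeddings standing in for real model outputs.
docs = {
    "IRC 162 (business expenses)": [0.9, 0.1, 0.0],
    "IRC 274 (travel and meals)":  [0.8, 0.3, 0.1],
    "CA FTB residency rules":      [0.1, 0.2, 0.9],
}
question = [0.85, 0.2, 0.05]  # e.g. "Can I deduct travel costs?"
print(top_k(question, docs))
```

In a real system the retrieved passages would then be inserted into the model's prompt so the answer is grounded in the actual statute text rather than the model's memory.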
How does V2 perform on standardized accounting tests?
We ran V2 against a sample of questions from Part 1 of the EA exam. It scored 88%, which is a passing grade. (Last year, humans only needed 66% to pass.) You can find the detailed results here.
Note that these questions have less ambiguity and involve fewer contentious topics than the questions in our sample.
What are V2's known weaknesses?
- Certain city and state tax law intricacies. We haven’t finished embedding all state and city tax law into V2 yet. These will be part of V3.
- Complex math. This is a known issue with LLMs: the AI may fail if asked to do math involving more than five or so steps. It’s especially evident when asking the AI to calculate income tax for an individual in a high tax bracket. This will be addressed via a plugin in a later version.
- Overly conservative. Due to its training and reliance on the actual underlying tax code, it can sometimes give answers that are correct but overly conservative, e.g. claiming that a business owner cannot claim per diem rates for business travel on a cruise ship due to specific IRS rules on conventions held aboard cruise ships. This is technically correct, but likely to be overlooked in practice.
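To see why high-bracket income tax is a many-step calculation of the kind LLMs tend to flub, here is the arithmetic in code. This is a minimal sketch using the published 2023 single-filer federal brackets only; it ignores deductions, credits, filing-status variations, and state tax.

```python
# 2023 federal brackets for single filers: (lower bound of bracket, marginal rate).
BRACKETS_2023_SINGLE = [
    (0,       0.10),
    (11_000,  0.12),
    (44_725,  0.22),
    (95_375,  0.24),
    (182_100, 0.32),
    (231_250, 0.35),
    (578_125, 0.37),
]

def tax_owed(taxable_income, brackets=BRACKETS_2023_SINGLE):
    """Sum the tax due on each bracket slice below taxable_income."""
    uppers = [b[0] for b in brackets[1:]] + [float("inf")]
    tax = 0.0
    for (lower, rate), upper in zip(brackets, uppers):
        if taxable_income <= lower:
            break
        # Tax only the portion of income that falls inside this slice.
        tax += (min(taxable_income, upper) - lower) * rate
    return tax

# A high earner crosses six bracket slices: six multiplications plus a running sum.
print(f"${tax_owed(250_000):,.2f}")  # $59,394.50
```

Each bracket slice adds a multiply-and-accumulate step, so a $250,000 earner already requires six of them, which is exactly the multi-step arithmetic chain the section above flags as a failure mode.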
Disclaimer
We’ve provided this information for educational purposes, and it does not constitute tax, legal, or accounting advice. If you would like a tax expert to clarify it for you, feel free to sign up for Keeper.