07/21 2025
This article is based on publicly available information and intended solely for information exchange purposes. It does not constitute investment advice.
When reading financial reports, we usually want the key financial data but get sidetracked by complex business descriptions and lengthy management statements. Sifting out the useful information takes significant effort.
Especially for Hong Kong and US stocks, most domestic financial software displays information tailored to domestic market standards. Consequently, when confronted with non-standard financial statements, there are often discrepancies in the extracted items.
With the arrival of large AI models, such obstacles in financial research may be overcome, as these models excel at language summarization and data computation.
In this article, we evaluate six prominent domestic large models to assess their financial report analysis capabilities and identify any existing issues.
Reading tip: Due to the technical nature and length of the evaluation content, you can scroll down to the "Conclusion" section at the bottom of the report to obtain the final evaluation results.
01
Evaluation Objects, Logic, and Standards
For the evaluation, we selected six major domestic models:
DeepSeek-R1
Qwen3-235B-A22B
Hunyuan-T1
Kimi-K1.5
ERNIE-X1-Turbo
GLM-4-Plus
In terms of evaluation logic, we adopted a "layered and progressive" question construction approach. To become an exceptional "AI financial analyst," a model must possess multi-level capabilities.
Therefore, we designed four levels of tests across six dimensions, gradually progressing from basic to advanced:
Level 1: Basic Information Extraction
The fundamental ability an AI must possess is to accurately read financial reports. Errors in data extraction render analysis meaningless.
Level 2: Analysis, Calculation, and Verification
While models excel at calculations, they must also know how to utilize data, evolving from "readers" to "analysts".
Level 3: Inductive Reasoning and Insight
Models need to delve deeper, surpassing literal information to uncover the hidden logic behind the words. Hence, we designed two evaluation dimensions: "efficient inductive and refining ability" and "keen risk and emotional recognition ability".
Level 4: Strategic Summary and External Knowledge Integration
Top-tier analysis requires industry vision, which in turn requires an understanding of a company's strategic statements. In addition, the content of a limited knowledge base is not enough, so models need to connect with external sources for horizontal comparisons. For this reason, we also designed two evaluation dimensions: "identification of corporate strategy and positioning" and "external information search and integration".
At the standard level, we input the same prompt into each model (detailed prompt information provided later) to ensure uniformity in rules.
02
Cross-Evaluation of Six Financial Analysis Capabilities
1) Accurate Data Extraction Ability - The Foundation of Models, Precision is Paramount
Can the model, akin to a meticulous accountant, extract key financial data, specific expense items, and business achievements mentioned by management from PDF financial reports without any errors? The performance of this ability directly influences the reliability of subsequent analyses. We will focus on examining its accuracy and stability.
Prompt:
Test1.1: Please extract the following key financial data from the provided "Meituan - Q1 2025" financial report and return the results in a table format: 1. Total operating revenue; 2. Operating costs; 3. Net profit.
Test1.2: Please identify and list the specific amounts of the following expense items and return the results in a table format: 1. Research and development expenses; 2. Sales and marketing expenses.
Test1.3: Please carefully read the "Business Review and Outlook" section of the "Meituan - Q1 2025" financial report and summarize the three most important business highlights or achievements mentioned by management.
Evaluation Conclusion:
All models evaluated in this article successfully extracted the specified core financial data and specific project expenses.
Among them, ERNIE-X1-Turbo, Hunyuan-T1, Kimi-K1.5, and Qwen3-235B-A22B helpfully converted the units in the financial report from thousands of yuan to hundreds of millions of yuan, aligning more closely with user habits.
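The unit conversion mentioned above is simple but easy to get wrong by a factor of ten. A minimal sketch follows; the function name and the sample amount are illustrative assumptions, not figures from the report.

```python
from decimal import Decimal

# 1 hundred million = 100,000,000 = 100,000 thousands
THOUSANDS_PER_HUNDRED_MILLION = Decimal(100_000)


def thousands_to_hundred_millions(amount_in_thousands, places=2):
    """Convert an amount stated in thousands of yuan to hundreds of millions of yuan."""
    value = Decimal(str(amount_in_thousands)) / THOUSANDS_PER_HUNDRED_MILLION
    return round(value, places)


# Placeholder amount for illustration only (not taken from the report).
print(thousands_to_hundred_millions(12_345_678))  # 123.46
```

Using Decimal rather than float keeps the displayed value exact after rounding, which matters when the converted figure is later compared against the original.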
For non-financial key information, the models' focus varied slightly but primarily centered on robust growth in core local business revenue and profit, rapid development of flash sales and instant retail businesses, continuous optimization of catering takeout businesses, and upgrades to the rider rights and interests protection system.
2) Rigorous Calculation and Verification Ability - Beyond Mere Counting, Explaining is Crucial
After extracting data, can the model function as an "auditor"? This involves two levels:
First, can it utilize the correct formulas to calculate core financial indicators like gross profit margin and current ratio based on the extracted data and explain their meanings?
Second, when confronted with management's performance statements, can it independently verify the data and assess their authenticity? This directly tests the model's logical reasoning and "critical thinking".
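For reference, the two ratios named above follow their standard textbook definitions. The sketch below uses hypothetical function names and placeholder inputs; it is not the evaluation's own tooling.

```python
def gross_profit_margin(revenue, cost_of_revenue):
    """Gross profit margin in percent: (revenue - cost of revenue) / revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost_of_revenue) * 100 / revenue


def current_ratio(total_current_assets, total_current_liabilities):
    """Current ratio: total current assets / total current liabilities."""
    if total_current_liabilities <= 0:
        raise ValueError("total_current_liabilities must be positive")
    return total_current_assets / total_current_liabilities


# Placeholder inputs for illustration only (not taken from the report).
print(gross_profit_margin(1000, 600))  # 40.0
print(current_ratio(300, 150))         # 2.0
```

Note that the current ratio takes total current assets as its numerator, not a single line item such as cash.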
Prompt:
Test2.1: Based on the data in the "Meituan - Q1 2025" financial report, calculate the company's gross profit margin. Please list the calculation formula, specific data used, and explain what the gross profit margin value reflects about the company's profitability.
Test2.2: Please use the balance sheet data from the "Meituan - Q1 2025" financial report to calculate the company's current ratio. Specify which data you used for the calculation and explain the short-term debt repayment risk revealed by this ratio.
Test2.3: Management claims in the report that "the operating profit margin of the core local business increased by 3.2 percentage points year-on-year to 21.0%". Please verify the accuracy of this statement based on the financial report data and explain your basis for judgment.
Evaluation Conclusion:
Of the six models, only Kimi-K1.5 failed this test.
Although Kimi-K1.5 had obtained the correct operating revenue and operating costs, it miscalculated the gross profit margin: the correct value is 37.4477%, but the model answered 37.49%.
Figure: Kimi-K1.5 calculating gross profit margin
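One way to judge such an answer is to compare it with the recomputed value under a small tolerance. The sketch below uses the two numbers quoted above; the 0.01-point tolerance and the function name are assumptions, since the article does not state a grading threshold.

```python
def matches_reference(model_value, reference_value, tolerance=0.01):
    """Return True when the model's figure is within `tolerance` of the reference."""
    return abs(model_value - reference_value) <= tolerance


reference_margin = 37.4477  # correct gross profit margin quoted in the article
kimi_answer = 37.49         # Kimi-K1.5's answer quoted in the article

print(round(reference_margin, 2))                        # 37.45
print(matches_reference(kimi_answer, reference_margin))  # False
print(matches_reference(37.45, reference_margin))        # True
```

A correctly rounded answer (37.45) passes, while 37.49 differs by about 0.04 points and does not, so the gap is more than a rounding artifact.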
In addition, when calculating the current ratio, Kimi-K1.5 mistook "Cash and cash equivalents" in the "Condensed Consolidated Statement of Financial Position" for "Total current assets," leading to another calculation error.
Figure: Kimi-K1.5 calculating current ratio
Regarding the explanation of financial ratios, all models provided the definitions of the above financial ratios and concluded that the short-term debt repayment ability was stable.
Additionally, the other information provided by different models varied:
DeepSeek-R1: Advantages and risks of Meituan's asset structure, and hidden dangers that need attention;
ERNIE-X1-Turbo and GLM-4-Plus: Did not provide any additional information;
Hunyuan-T1: Sufficient safety margin, advantages in asset liquidity structure, controllable current liabilities, and potential risk points;
Kimi-K1.5: Strong profitability, effective cost control, and profitability reflected in business structure optimization;
Qwen3-235B-A22B: Explanation of profitability, cost control ability, and industry comparison.
In terms of data verification, all models correctly calculated the operating profit margin for Q1 2024 and 2025, verifying the given statement in the prompt.
Notably, DeepSeek-R1 also provided business implications, while Hunyuan-T1 included potential risk warnings.
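The verification in Test2.3 amounts to recomputing the margin for both periods and checking the difference in percentage points. A rough sketch follows; the input figures are placeholders chosen only to reproduce the 21.0% margin and 3.2-point increase quoted in the prompt, not the report's actual amounts, and the 0.05-point tolerance is an assumption.

```python
def operating_margin(operating_profit, revenue):
    """Operating profit margin in percent."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return operating_profit * 100 / revenue


def verify_margin_claim(current, prior, claimed_margin, claimed_change_pp,
                        tolerance=0.05):
    """Check a 'margin rose by X points to Y%' statement.

    `current` and `prior` are (operating_profit, revenue) pairs.
    """
    current_margin = operating_margin(*current)
    prior_margin = operating_margin(*prior)
    change_pp = current_margin - prior_margin
    return {
        "current_margin": round(current_margin, 1),
        "prior_margin": round(prior_margin, 1),
        "change_pp": round(change_pp, 1),
        "claim_holds": (abs(current_margin - claimed_margin) <= tolerance
                        and abs(change_pp - claimed_change_pp) <= tolerance),
    }


# Placeholder (operating_profit, revenue) pairs for illustration only.
result = verify_margin_claim(current=(210, 1000), prior=(178, 1000),
                             claimed_margin=21.0, claimed_change_pp=3.2)
print(result["claim_holds"])  # True
```

The change is compared in percentage points (a difference of two margins), not as a relative percentage change, which is the distinction the prompt is testing.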
3) Efficient Inductive and Refining Ability - Transitioning from "Copy and Paste" to "Extracting the Essence"
Financial reports contain intricate information, and the ability to distill core points for different audiences is crucial in measuring AI efficiency.
This ability assesses whether the model can act like a seasoned editor, crafting a concise 200-word performance summary for ordinary investors and accurately summarizing the main challenges mentioned by management in the "Discussion and Analysis" section.
We will evaluate the accuracy, completeness, and information value of its summaries.
Prompt:
Test3.1: Please write a summary of the three most important conclusions of this financial report for an ordinary domestic investor in no more than 200 words.
Test3.2: Please summarize the main challenges faced by the company mentioned in the "Management Discussion and Analysis" section.
Evaluation Conclusion:
In terms of overall performance summaries, all models were able to accurately provide correct conclusions supported by data.
Among them, DeepSeek-R1, Hunyuan-T1, Kimi-K1.5, and Qwen3-235B-A22B listed the conclusions in bullet points, giving a clearer structure than the other two models, which placed the conclusions in a single paragraph.
DeepSeek-R1 also demonstrated another highlight, using easy-to-understand language styles such as "soaring profitability" and "strong financial position to resist risks".
In terms of specific section summaries, all models located the relevant original text accurately, then summarized and classified the challenges faced by the company logically, presenting them as clear, readable bullet points.
Among them, DeepSeek-R1, ERNIE-X1-Turbo, and Qwen3-235B-A22B all cited relevant data in their responses, making their conclusions more convincing, and DeepSeek-R1 additionally marked the source of the information.
In terms of information comprehensiveness, although GLM-4-Plus provided multiple answers, the content appeared somewhat empty due to the lack of specific evidence. ERNIE-X1-Turbo, as always, continued its concise response style.
4) Keen Risk and Emotional Recognition Ability - Deciphering the "Subtext"
Top analysts can "read between the lines." This dimension tests whether the models possess that advanced cognitive ability.
Can it identify potential business risks implied but not explicitly stated in the financial report? Can it comprehensively analyze performance and management wording to accurately judge the overall emotional tone (optimistic, cautious, pessimistic) conveyed by the entire report?
Prompt:
Test4.1: Does the financial report imply any other potential business risks? Please provide examples.
Test4.2: Based on the performance data and management wording in the entire financial report, do you think the overall tone conveyed to investors is optimistic, cautious, or pessimistic? Please give your judgment and provide at least 2 reasons.
Evaluation Conclusion:
When analyzing potential business risks, all models except Kimi-K1.5 were able to list potential risks item by item based on statements mentioned in the financial report.
Kimi-K1.5 analyzed from a macro perspective based on Meituan's main business and did not focus on hidden information in the financial report.
Figure: Kimi-K1.5 analyzing potential business risks
Additionally, Kimi-K1.5 initially provided 50 risks in one response, which was confusing.
The responses from DeepSeek-R1, Hunyuan-T1, and Qwen3-235B-A22B were the clearest, using a fixed structure and clearly indicating the source of information, making it easy for users to quickly identify risks.
DeepSeek-R1 first elaborated using the structure of "risk category" - "driving event" - "original text from the financial report" - "risk point," then provided risks not explicitly stated but derivable from the financial report, and finally gave conclusions and recommendations for investors.
Figure: DeepSeek-R1 analyzing potential business risks
Hunyuan-T1 and Qwen3-235B-A22B also adopted similar response structures, demonstrating strong reasoning abilities while accurately grasping the core contradictions.
ERNIE-X1-Turbo and GLM-4-Plus adopted a segmented discussion approach, explaining the causes of risks and the sources of evidence in the financial report in each paragraph. The content was complete but not sufficiently rich in expansion, and the structure was less clear compared to the above three models.
In the overall emotional tone judgment task, all six models gave an overall tone of optimism.
However, DeepSeek-R1, Hunyuan-T1, and Qwen3-235B-A22B directly or indirectly used the term "cautiously optimistic".
While GLM-4-Plus and Kimi-K1.5 identified the risks and challenges mentioned in the report, they believed that the flaws did not overshadow the merits.
ERNIE-X1-Turbo's response did not mention any pessimistic factors.
DeepSeek-R1, Hunyuan-T1, and Qwen3-235B-A22B therefore have a slight edge: while reading the entire text and judging its overall emotional tone, they grasp both the details and the overall picture and balance "facts" against "emotions," which makes their conclusions more three-dimensional and credible.
5) Corporate Strategy and Positioning Inference Ability - A Comprehensive Question Demanding "Knowledge Reserve"
This is a leap from data to insights.
Can the model integrate financial report data with its own knowledge to function as a 'strategic analyst', identifying the competitive landscape? We need the model to deduce a company's competitive strategy (whether cost leadership or technology-driven) based on data such as gross margin and R&D investment, and comprehensively evaluate its market position in the industry (whether as a leader, a challenger, or a niche player) by synthesizing all information.
Prompt:
Test5.1: Using the description of its business in the 'Meituan - Q1 2025' financial report, combined with your general knowledge, please list the main competitors in the industry where the company operates (at least two).
Test5.2: Analyze the 'Gross Margin' and 'R&D Expenses as a Percentage of Revenue' in the report. Based on these metrics and your knowledge of typical industry levels, deduce which competitive strategy the company is more likely to adopt: a 'cost leadership' strategy (prioritizing efficiency and cost reduction) or a 'differentiation/technology-driven' strategy (emphasizing product uniqueness and high added value). Please explain your reasoning process.
Test5.3: Based on the entire financial report (including revenue growth rate, profit margin, and management's discussion), provide a comprehensive assessment of the company's market position in the industry. Do you consider it an 'industry leader', a 'strong challenger', or a 'specific niche market participant'? Support your conclusion with at least two pieces of evidence:
1. One piece of financial data (e.g., profit margin or growth rate compared to industry average).
2. One qualitative description from the 'Management's Discussion and Analysis' section.
Evaluation Conclusion:
In identifying the competitive landscape, all six models tested in this article accurately listed the most significant competitors in the current market (Ele.me, Douyin Local Services, and JD Daojia), and appropriately matched them with specific business lines.
This demonstrates AI's capability to precisely match business descriptions in financial reports with real-world business entities in its knowledge base.
However, the models approached the answers differently. DeepSeek-R1, GLM-4-Plus, Hunyuan-T1, and Qwen3-235B-A22B listed competitors first, followed by their areas of competition and rationale. In contrast, ERNIE-X1-Turbo and Kimi-K1.5 listed areas of competition first, followed by major competitors and their competitive relationships.
DeepSeek-R1 and Hunyuan-T1 cited the financial report verbatim when providing rationale, making their answers more persuasive; other models relied more on content from general knowledge bases.
Additionally, Qwen3-235B-A22B and Kimi-K1.5 respectively noted international competitors and proprietary delivery systems, which were unexpected insights.
Inferring competitive strategies was the most challenging task, requiring AI models to complete the full closed loop of 'data extraction' - 'external knowledge comparison' - 'business theory application' - 'logical reasoning'.
In terms of data extraction, GLM-4-Plus used hypothetical data, so the gross margin figures in its subsequent analysis were incorrect and its results cannot be used as a reference; the other models extracted correct data.
Figure: GLM-4-Plus inferring competitive strategy
In the reasoning analysis, although industry average data is not authoritative, all models except ERNIE-X1-Turbo used it as a reference for external knowledge comparison, effectively enhancing the quality of analysis.
Figure: ERNIE-X1-Turbo inferring competitive strategy
Due to their different focus areas, ERNIE-X1-Turbo, Hunyuan-T1, and Kimi-K1.5 generated 'nuanced' conclusions based on the above comparisons, rather than choosing from pre-defined options.
Regarding market position assessment, all six models judged Meituan as an 'industry leader' by citing management discussion quotes, quantitative analysis, and qualitative analysis. The argumentation was rigorous and highly credible, with little difference in ability among the models.
6) Online Comparison Ability Integrating External Knowledge - Expanding Capability Boundaries
Finally, we transcended the limitation of a single document to examine the model's ability to connect with the real world.
Can it obtain financial data (such as gross margin, current ratio, etc.) of competitors during the same period through online search functions and make accurate horizontal comparisons?
Prompt:
Test6.1: In Q1 2025, how does Meituan rank in terms of sales gross margin compared to JD.com, Alibaba, Baidu, and Kuaishou? Obtain required data through online search, ensuring accuracy and prohibiting fabrication or assumption of data.
Test6.2: In Q1 2025, how does Meituan rank in terms of current ratio compared to JD.com, Alibaba, Baidu, and Kuaishou? Obtain required data through online search, ensuring accuracy and prohibiting fabrication or assumption of data.
Test6.3: In Q1 2025, how does Meituan rank in terms of asset-liability ratio compared to JD.com, Alibaba, Baidu, and Kuaishou? Obtain required data through online search, ensuring accuracy and prohibiting fabrication or assumption of data.
This ability is directly linked to the practical value of AI as an intelligent assistant.
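The horizontal comparison these prompts ask for can be sketched as a ranking that keeps unverified values out of the ordering instead of guessing them. The company labels and ratio values below are placeholders, not real data, and the function names are hypothetical; the asset-liability ratio uses its standard definition (total liabilities divided by total assets).

```python
def asset_liability_ratio(total_liabilities, total_assets):
    """Asset-liability ratio in percent: total liabilities / total assets."""
    if total_assets <= 0:
        raise ValueError("total_assets must be positive")
    return total_liabilities * 100 / total_assets


def rank_companies(metric_by_company, descending=True):
    """Rank companies by a metric; companies without a verified value
    (None) are reported separately rather than given a made-up number."""
    known = {name: v for name, v in metric_by_company.items() if v is not None}
    missing = sorted(name for name, v in metric_by_company.items() if v is None)
    ranked = sorted(known, key=known.get, reverse=descending)
    return ranked, missing


# Placeholder values for illustration only.
ratios = {"Company A": 45.0, "Company B": None, "Company C": 30.5, "Company D": 62.1}
ranked, missing = rank_companies(ratios, descending=False)
print(ranked)   # ['Company C', 'Company A', 'Company D']
print(missing)  # ['Company B']
```

Listing the missing company explicitly mirrors the prompt's requirement that data must not be fabricated or assumed.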
Evaluation Conclusion:
All six models evaluated in this article performed unsatisfactorily at collecting online information.
For sales gross margin, DeepSeek-R1, ERNIE-X1-Turbo, and Hunyuan-T1 obtained all correct data for the five companies.
However, for current ratio and asset-liability ratio, no model obtained all correct data.
DeepSeek-R1 and ERNIE-X1-Turbo had relatively stronger information search capabilities, obtaining more than 10 correct data points each, with the former having no fabricated data and the latter having one incorrect data point.
Kimi-K1.5 and Qwen3-235B-A22B had moderate information accuracy rates, with some cases of failing to obtain data or fabricating data when calculating the current ratio and asset-liability ratio.
GLM-4-Plus and Hunyuan-T1 performed poorly, especially when calculating the asset-liability ratio, frequently fabricating data.
GLM-4-Plus even searched an unrelated webpage and fabricated five false data points, which could seriously mislead users.
In summary, there is still much room for improvement in this area: large AI models rarely query authoritative data channels when searching online, and the internet is rife with false and misleading information. Because this can lead to serious errors in financial report analysis, we do not recommend using online search functions to obtain critical financial data.
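A tally of the kind used above (correct, incorrect, or not obtained) could be produced along the following lines. This is only a sketch under stated assumptions: the keys, values, and tolerance are placeholders, and an automated check like this cannot tell a fabricated figure from an honest mistake, so both are counted as "incorrect".

```python
from collections import Counter


def grade_data_points(answers, reference, tolerance=0.01):
    """Count correct, incorrect, and missing data points against a reference.

    `answers` and `reference` map (company, metric) pairs to numbers;
    a missing key or a None value in `answers` counts as "missing".
    """
    tally = Counter()
    for key, truth in reference.items():
        answer = answers.get(key)
        if answer is None:
            tally["missing"] += 1
        elif abs(answer - truth) <= tolerance:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally


# Placeholder reference values and model answers for illustration only.
reference = {
    ("Company A", "gross_margin"): 40.0,
    ("Company A", "current_ratio"): 1.9,
    ("Company B", "gross_margin"): 15.8,
}
answers = {
    ("Company A", "gross_margin"): 40.0,
    ("Company A", "current_ratio"): None,
    ("Company B", "gross_margin"): 12.0,
}
tally = grade_data_points(answers, reference)
print(tally["correct"], tally["incorrect"], tally["missing"])  # 1 1 1
```

Keeping "missing" separate from "incorrect" reflects the distinction drawn in the text between a model that fails to find data and one that returns wrong data.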
03
Conclusion
To present the evaluation results more intuitively, we have created the following table:
Without considering online information search:
For professional investors or financial analysts, DeepSeek-R1, Hunyuan-T1, and Qwen3-235B-A22B are trustworthy 'assistants' that can not only enhance work efficiency but also provide valuable insights.
For ordinary users or students, ERNIE-X1-Turbo is also a good choice, fully capable of quickly obtaining core data and basic information.
However, the accuracy of online information search is a threshold difficult for all models to cross at this stage. We can accept AI not finding information, but we cannot accept AI providing false information as true.
Finally, based on our somewhat subjective evaluation criteria, we have compiled a radar chart of the financial analysis capabilities of the six major models for your reference: