The State of AI in Insurance (Vol. I): Large Language Models for Insurance - How do They Compare?

Executive summary

  • Performance comparison of six different Large Language Models (LLMs) applied to common insurance industry processes
  • LLMs featuring a larger context size - the maximum number of tokens the model can remember when generating text - generally perform better, although there are exceptions
  • Larger context size comes at a cost premium, but is necessary to achieve desired performance for certain use cases
  • Effective prompt engineering is key to obtaining the best possible performance from LLMs
  • Performance metrics for LLMs are unique to the use case and must be evaluated carefully to ensure business requirements are being met
  • The choice of which LLM to use should be based on a combination of use case, acceptable performance and cost

From the Editor

Generative Artificial Intelligence (Gen AI) applications and the underlying Large Language Models (LLMs) that support this technology have captured the attention of the insurance industry. This technology has the potential to significantly increase the efficiency and accuracy of underwriting, claims and fraud/risk processes.  

Yet, for as much interest as there is in Generative AI, there is also uncertainty and there are unanswered questions. Insurers face an unprecedented amount of information about where Generative AI can deliver benefits, where it may fall flat, and which models may be best suited to a variety of use cases. These are just some of the issues insurers must evaluate when considering how to bring Generative AI into their technology stacks and business processes.

Shift Technology has been a pioneer in AI for insurance since 2014. Over the past decade we have built one of the industry’s largest data science teams dedicated to AI in insurance. This team is engaged in research and development to advance the state of AI for insurance use cases, as well as the application of that R&D to develop innovative solutions for our insurance customers.

This report is the first in a series that will periodically highlight findings from research our data scientists have undertaken to better understand the performance of specific LLMs when applied to common insurance processes. The goal is to provide insurance professionals with a trusted source of information about AI in order to help them make the best decisions possible when evaluating this technology.

Thank you to the Shift data scientists and researchers who made this report possible.

LLM model comparison: Data extraction and document classification

Methodology

The data science and research teams devised four test scenarios to evaluate the performance of six different
publicly available LLMs: GPT3.5, GPT4, Mistral Large, Llama2-70B, Llama2-13B, and Llama2-7B.

The scenarios include:

  • Information extraction from English-language airline invoices
  • Information extraction from Japanese-language property repair quotes
  • Information extraction from French-language dental invoices
  • Document classification of English-language documents associated with travel insurance claims

The LLMs were tested for:

  • Coverage - did the LLM actually extract data when the ground truth (the value we expect when we ask a model to predict something) showed there was something to extract?
  • Accuracy - when the LLM did extract something, did it present the correct information? (A brief sketch of how these two metrics can be computed follows this list.)
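To make these definitions concrete, the minimal sketch below shows one way Coverage and Accuracy could be computed per field for a single document. The field names, data shapes and exact-match comparison are illustrative assumptions on our part; this is not Shift's evaluation code.

# Minimal, illustrative sketch of per-field Coverage and Accuracy.
# Field names, data shapes and the exact string match are assumptions.

def coverage_and_accuracy(predictions, ground_truth):
    """predictions / ground_truth: dicts mapping field name -> value or None."""
    labelled = 0   # fields where the ground truth has something to extract
    covered = 0    # labelled fields where the model returned a value
    correct = 0    # covered fields where the value matches the label

    for field, expected in ground_truth.items():
        if expected is None:
            continue                      # nothing to extract for this field
        labelled += 1
        predicted = predictions.get(field)
        if predicted is not None:
            covered += 1
            if predicted == expected:     # a real pipeline would normalise values first
                correct += 1

    coverage = covered / labelled if labelled else None
    accuracy = correct / covered if covered else None
    return coverage, accuracy

# Example with hypothetical values: coverage = 2/3, accuracy = 1/2
pred = {"booking_number": "AB1234", "currency": "USD", "document_date": None}
gold = {"booking_number": "AB1234", "currency": "EUR", "document_date": "2024-02-15"}
print(coverage_and_accuracy(pred, gold))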

Prompt engineering for all scenarios was developed by the Shift data science team. For each individual scenario,
the team engineered a single prompt that was utilized by all six of the tested LLMs.
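Purely as an illustration (the prompts engineered by the Shift team are not reproduced here), a single, model-agnostic extraction prompt might be structured along the lines of the hypothetical template below; the wording and field names are our own assumptions.

# Hypothetical, simplified example of a shared extraction prompt template.
# The wording and fields are illustrative; they are not the prompts used in this study.
PROMPT_TEMPLATE = """You are an assistant that extracts structured data from insurance documents.

Extract the following fields from the document below:
- provider_name
- document_date (ISO 8601 format)
- total_amount

Return ONLY a JSON object with exactly these keys.
If a field is not present in the document, use null as its value.

Document:
{document_text}
"""

def build_prompt(document_text: str) -> str:
    # The same rendered prompt would then be sent to each of the six LLMs.
    return PROMPT_TEMPLATE.format(document_text=document_text)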

Reading the Tables

LLM performance must be evaluated against the specific use case and the relative performance achieved. The tables included in this report reflect that reality and are color coded based on the relative performance of each LLM applied to the use case, with shades of blue representing the highest relative performance levels, shades of red representing subpar relative performance for the use case, and shades of white representing average relative performance. As such, a performance rating of 90% may be coded red when 90% is the lowest rating in the range for that use case; while 90% may be acceptable for the use case, it is still subpar relative to how the other LLMs performed.

Results & analysis

English-language airline invoices

This scenario used 85 anonymized English-language airline invoices.

The extraction prompt sought the following results:

  • Provider Name
  • Start Date
  • End Date
  • Document Date
  • Booking Number
  • Flight Number (for all associated flights)
  • Last Four Credit Card Digits
  • Currency
  • Base Fare for all Passengers
  • Taxes and Fees for all Passengers
  • Additional Fees for all Passengers
  • Payments - a complex field consisting of: Payment Date, Amount & Status
  • Travellers - a complex field consisting of: Traveller Name, Basic Fare, Total Taxes & Total Amount (an illustrative sketch of this nested structure follows the list)
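Because Payments and Travellers are nested, a usable extraction needs to preserve that structure rather than flatten it. The record below is a hypothetical sketch of what a single extracted invoice could look like; the key names and values are illustrative, not the schema used in the study.

# Hypothetical shape of one extracted airline-invoice record, including the
# nested "payments" and "travellers" fields. Keys and values are illustrative.
extracted_invoice = {
    "provider_name": "Example Airways",
    "start_date": "2024-03-01",
    "end_date": "2024-03-08",
    "document_date": "2024-02-15",
    "booking_number": "XY12AB",
    "flight_numbers": ["EA101", "EA102"],
    "last_four_credit_card_digits": "1234",
    "currency": "USD",
    "base_fare_all_passengers": 620.00,
    "taxes_and_fees_all_passengers": 85.40,
    "additional_fees_all_passengers": 30.00,
    "payments": [
        {"payment_date": "2024-02-15", "amount": 735.40, "status": "paid"},
    ],
    "travellers": [
        {"name": "Jane Doe", "basic_fare": 310.00, "total_taxes": 42.70, "total_amount": 367.70},
        {"name": "John Doe", "basic_fare": 310.00, "total_taxes": 42.70, "total_amount": 367.70},
    ],
}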


Analysis

GPT4, GPT3.5 and Mistral Large proved most adept in this scenario. The Llama models lagged significantly behind, especially when it comes to Coverage. The Llama models may simply have a harder time finding the relevant information or formatting the output. The results may also be influenced by Llama2's context size of only 4k tokens, the smallest of any model tested. Any document larger than the context window would simply not be processed, and the model would return no result, directly lowering its Coverage score.
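As a hypothetical illustration of this failure mode, a pipeline could pre-check whether a document is likely to fit within a model's context window before sending it. The rough word-based token estimate and the prompt overhead below are assumptions; real token counts depend on each model's tokenizer.

# Illustrative pre-check for the context-size limitation discussed above.
# The word-count approximation and overhead figure are rough assumptions.
def fits_in_context(document_text: str, context_size: int = 4096,
                    prompt_overhead_tokens: int = 500) -> bool:
    approx_doc_tokens = int(len(document_text.split()) * 1.3)  # crude estimate, not a real tokenizer
    return approx_doc_tokens + prompt_overhead_tokens <= context_size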

GPT4 and Mistral Large performed well when dealing with complex fields. These LLMs can not only extract nested information but also output the result in a usable format.

Performance on list fields, while adequate, may have been negatively affected by the complexity associated with these extractions.

In the case of Payment Date, we did witness lower accuracy, which can be attributed to the models’ tendency to substitute the document date for the payment date when the payment date is unavailable.

Japanese-language property repair quotes

In this scenario we applied each of the test LLMs against 100 anonymized Japanese-language property repair quotes. The documents represented quotes from multiple providers in non-standard formats and would not be considered templated.

The extraction prompt sought the following results:

  • Provider Name
  • Provider Address
  • Post Code
  • Provider email
  • Tax Amount
  • Total Amount with Tax
  • Discount Amount

Analysis

Overall, GPT4, GPT3.5 and Mistral Large performed best in both Coverage and Accuracy, with some exceptions. While Llama2-70B and Llama2-13B were only slightly worse in Accuracy, their Coverage is clearly lacking. This may be due to the same factors identified for underperformance in the airline invoice scenario described above.

French-language dental invoices

In this scenario, each LLM was applied against a dataset of 119 French-language dental invoices. 79 of the invoices are considered to have a strong layout, meaning they could be described as templated documents. The remaining 60 were selected at random to mimic what may be experienced in an insurer’s data.

The extraction prompt sought the following results:

  • Document Date
  • Provider Name
  • Provider FINESS (Fichier National des Établissements Sanitaires et Sociaux)
  • Provider RPPS (Répertoire Partagé des Professionnels de Santé)
  • Provider Post Code
  • Total Incurred Amount
  • Paid Amount

Analysis

For this scenario, GPT4, GPT3.5 and Mistral Large performed well in both Coverage and Accuracy. Of the remaining models, Llama2-70B performed well, but not at the same level as the best performers.

We did note that the Provider FINESS identifier underperformed across the board with all models, in both Coverage and Accuracy. This may be attributed to a unique feature of French health invoices: the Provider FINESS identifier is not always clearly indicated and may easily be confused with other provider identifiers such as SIRET (Système d'identification du répertoire des établissements). This could affect both the models’ ability to identify what should be extracted and the content ultimately extracted.

The witnessed underperformance could also be the result of label quality. The field is confusing for the LLM because it is genuinely confusing in and of itself, even for a human, which means the labels we use to evaluate the LLM may not be as reliable as the labels for other fields. Additional prompt engineering could potentially help, but if the ground truth itself is unreliable, measured performance will be hard to improve. This demonstrates the importance of establishing good-quality labels when evaluating LLMs.

English-language documents for travel claims

This dataset consisted of 405 anonymized English-language documents provided to support travel insurance claims.

The classification prompt sought the following results:

  • A classification for each page
  • A group of pages related to the same document

The expected output would be a list of segmented documents including the document type and its span of pages (indicating start and end page).

In addition to metrics for individual document types, we also compute an aggregated performance at the file level, defined below as PerfectClassif and PerfectTypes. We consider a model's output for a file correct when all of the segmented documents in the file are correct (PerfectClassif) or when all of the document types in the file are correct (PerfectTypes).
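A minimal sketch of how these file-level aggregates could be computed is shown below. The tuple representation of a segmented document and the exact-match comparison are our own assumptions, not the study's implementation.

# Illustrative sketch of the file-level aggregates described above.
# A segmented document is represented as (doc_type, start_page, end_page);
# this representation and the exact-match logic are assumptions.

def perfect_classif(predicted_docs, ground_truth_docs):
    # True when every segmented document (type AND page span) matches the ground truth.
    return sorted(predicted_docs) == sorted(ground_truth_docs)

def perfect_types(predicted_docs, ground_truth_docs):
    # True when the document types in the file are all correct,
    # regardless of how the pages were segmented.
    return sorted(t for t, _, _ in predicted_docs) == sorted(t for t, _, _ in ground_truth_docs)

# Example with hypothetical values:
pred = [("boarding_pass", 1, 1), ("receipt", 2, 3)]
gold = [("boarding_pass", 1, 1), ("receipt", 2, 4)]
print(perfect_classif(pred, gold))  # False: one page span differs
print(perfect_types(pred, gold))    # True: the document types match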

Analysis

GPT4 is clearly ahead of all other LLMs tested when it comes to Coverage and Accuracy, both for individual document types and for our aggregated categories. The Llama models did not fare well in this particular test, which may be related to their 4k context size or to an inability to deliver results in the output format mandated in the prompt.

We believe what could be considered underperformance for certain document types (e.g. Receipt - Activities Reservations, Cancellation Policy, Proof of Payment) may be due either to slightly vague document type descriptions in the prompt or to document types that closely resemble others. Additional prompt engineering could resolve some of the witnessed underperformance.

Cost comparison

For the LLMs we tested, there are essentially two price ranges: GPT3.5 and the Llama models are relatively inexpensive, while GPT4 and Mistral Large cost more to use. Perhaps not surprisingly, our analysis shows that, overall, the more expensive LLMs performed better. It is interesting, however, that GPT3.5, while less expensive, delivers performance levels closer to those of the more expensive models. When comparing cost to performance for GPT3.5 and the Llama models, the key to the discrepancy appears to be context size: although GPT3.5 costs about the same as the Llama models, it features a context size four times that of any of the Llama models, providing a measurable performance edge.

Conclusion

In the world of Generative AI and LLMs, it is important to remember that one size does not fit all. You must first determine the job that needs to be done and then apply the right tool to accomplish it. Evaluating performance against cost is also important. As we have seen, GPT4 and Mistral Large consistently outperform the other LLMs in this comparison, with GPT4 performing exceptionally well on classification tasks.

GPT3.5's performance is close behind that of the leading LLMs and may be good enough for many use cases, especially when its price point is taken into consideration.

The Llama models we tested were simply not competitive, particularly in the classification scenario, especially when compared with the pricing and performance of GPT3.5.

Our data science team is consistently testing LLMs and we will continue to report on their results in future editions
of this report.