Executive summary
Generative Artificial Intelligence (Gen AI) applications and the underlying Large Language Models (LLMs) that support this technology have captured the attention of the insurance industry. This technology has the potential to significantly increase the efficiency and accuracy of underwriting, claims and fraud/risk processes.
Yet, for as much interest as there is in Generative AI, there is also uncertainty and there are unanswered questions. Insurers face an unprecedented amount of information about where Generative AI can deliver benefits, where it may fall flat, and which models may be best for a variety of use cases. These are just some of the issues insurers must evaluate when considering how to bring Generative AI into their technology stack and business processes.
Shift Technology has been a pioneer in AI for insurance since 2014. Over the past decade we have built one of the industry’s largest data science teams dedicated to AI in insurance. This team is engaged in research and development to advance the state of AI for insurance use cases, as well as the application of that R&D to develop innovative solutions for our insurance customers.
This report is the first in a series that will periodically highlight the findings from research our data scientists have undertaken to better understand the performance of specific LLMs when applied to common insurance processes. The goal is to provide insurance professionals with a trusted source of information when it comes to AI in order to help them make the best decisions possible when evaluating this technology.
Thank you to the Shift data scientists and researchers who made this report possible.
The data science and research teams devised four test scenarios to evaluate the performance of six different publicly available LLMs: GPT3.5, GPT4, Mistral Large, Llama2-70B, Llama2-13B, and Llama2-7B.
Prompt engineering for all scenarios was developed by the Shift data science team. For each individual scenario, the team engineered a single prompt that was utilized by all six of the tested LLMs.
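To make this setup concrete, the sketch below shows what a single shared extraction prompt applied across all six models might look like. The field names, model identifiers, and the call_model() helper are illustrative assumptions, not the actual prompts or client code used in the evaluation.

```python
# Hypothetical sketch: one extraction prompt reused verbatim across every tested model.
# Field names and call_model() are illustrative placeholders, not Shift's actual prompt or tooling.

EXTRACTION_PROMPT = """You are an assistant that extracts structured data from insurance documents.
From the invoice text below, return a JSON object with the keys:
"invoice_number", "payment_date", "total_amount", "currency", "passengers" (a list of names).
If a field is not present in the document, return null for that field.

Invoice text:
{document_text}
"""

MODELS = ["gpt-4", "gpt-3.5-turbo", "mistral-large", "llama2-70b", "llama2-13b", "llama2-7b"]

def run_scenario(documents, call_model):
    """Send the same prompt to every model.

    call_model(model, prompt) stands in for whichever client library is used
    to reach each provider; it is assumed, not part of the report.
    """
    results = {}
    for model in MODELS:
        results[model] = [
            call_model(model, EXTRACTION_PROMPT.format(document_text=doc))
            for doc in documents
        ]
    return results
```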
LLM performance must be evaluated in the context of the specific use case and relative to the other models tested. The tables included in this report reflect that reality: they are color coded by the relative performance of each LLM on the use case, with shades of blue representing the highest relative performance, shades of white representing average relative performance, and shades of red representing subpar relative performance. As a result, a performance score of 90% may be coded red when it is the lowest score in the range for a given use case; 90% may be perfectly acceptable for that use case, but it is still subpar relative to how the other LLMs performed.
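As an illustration of this relative coding, the sketch below maps each model's score within a single use case to a color band. The thresholds are assumptions chosen purely for illustration; the report's actual shading is not derived from this code.

```python
# Illustrative sketch of relative color coding within one use case (assumed thresholds).
# Scores are compared only against the other models on the same use case, so a 90% score
# can still be coded red if it is the lowest score in that table.

def color_code(scores: dict[str, float]) -> dict[str, str]:
    lo, hi = min(scores.values()), max(scores.values())
    shades = {}
    for model, score in scores.items():
        # Position of this score within the range observed for this use case.
        relative = (score - lo) / (hi - lo) if hi > lo else 0.5
        if relative >= 0.67:
            shades[model] = "blue"   # highest relative performance
        elif relative >= 0.33:
            shades[model] = "white"  # average relative performance
        else:
            shades[model] = "red"    # subpar relative performance
    return shades

# Example: 90% is the lowest value here, so it is coded red despite being a high absolute score.
print(color_code({"GPT4": 0.97, "Mistral Large": 0.95, "Llama2-7B": 0.90}))
```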
English-language airline invoices
This scenario used 85 anonymized English-language airline invoices.
GPT4, GPT3.5, and Mistral Large proved most adept in this scenario. The Llama models lagged significantly behind, especially on Coverage. The Llama models may simply have a harder time finding the relevant information or formatting the output. The results may also be influenced by Llama2's context window of only 4k tokens, smaller than that of any other model tested: any document larger than the context window was simply not processed, the model returned no result, and its Coverage score suffered accordingly.
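For readers who want a concrete picture of the metrics, the sketch below shows one plausible way to compute Coverage and Accuracy for a single field, consistent with the description above that a document exceeding the context window yields no result and therefore lowers Coverage. These working definitions are assumptions made for illustration, not Shift's exact formulas.

```python
# Assumed working definitions, for illustration only:
# Coverage = share of documents for which the model returned a value for the field.
# Accuracy = share of returned values that match the labeled ground truth.

def coverage_and_accuracy(predictions: list, labels: list) -> tuple[float, float]:
    # Pair up only the documents for which the model produced an answer.
    answered = [(p, l) for p, l in zip(predictions, labels) if p is not None]
    coverage = len(answered) / len(labels) if labels else 0.0
    accuracy = (
        sum(1 for p, l in answered if p == l) / len(answered) if answered else 0.0
    )
    return coverage, accuracy

# A document skipped because it overflowed the 4k context contributes None, lowering
# Coverage even though it never gets the chance to be scored for Accuracy.
print(coverage_and_accuracy(["2024-01-05", None, "2024-02-10"],
                            ["2024-01-05", "2024-01-08", "2024-02-11"]))
```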
GPT4 and Mistral Large performed well when dealing with complex fields. These LLMs can not only extract nested information but also output the result in a usable format.
Performance on list fields, while adequate, may have been negatively affected by the complexity associated with these extractions.
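To illustrate what nested and list fields look like in practice, the hypothetical output below sketches the kind of structured result an airline-invoice extraction might be expected to return. The schema and field names are assumptions, not the actual fields used in the evaluation.

```python
# Hypothetical example of nested and list fields in an extraction result.
# The schema is illustrative only; it is not the schema used in the evaluation.
expected_output = {
    "invoice_number": "INV-2023-0142",
    "payment_date": "2023-06-12",
    "total": {"amount": 412.50, "currency": "USD"},  # nested field
    "passengers": ["A. Dupont", "B. Martin"],        # list field
    "flights": [                                     # list of nested objects
        {"flight_number": "AF123", "from": "CDG", "to": "JFK", "date": "2023-07-01"},
        {"flight_number": "AF124", "from": "JFK", "to": "CDG", "date": "2023-07-15"},
    ],
}
```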
In the case of Payment Date, we did witness lower accuracy, which can be attributed to the models’ tendency to substitute the document date for payment date if the payment date is unavailable.
Japanese-language property repair quotes
In this scenario we applied each of the tested LLMs to 100 anonymized Japanese-language property repair quotes. The documents represented quotes from many different providers in non-standard formats and would not be considered templated.
Overall, GPT4, GPT3.5, and Mistral Large performed best in both Coverage and Accuracy, with some exceptions. While Llama2-70B and Llama2-13B were only slightly worse in Accuracy, their Coverage was clearly lacking. This may be due to the same factors identified as causes of underperformance in the airline invoices scenario described previously.
French-language dental invoices
In this scenario, each LLM was applied to a dataset of 119 French-language dental invoices. 79 of the invoices are considered to have a strong layout, meaning they could be described as templated documents; the remainder were selected at random to mimic what may be encountered in an insurer's data.
For this scenario, GPT4, GPT3.5, and Mistral Large performed well in both Coverage and Accuracy. Of the remaining models, Llama2-70B performed well, though not at the same level as the best performers.
We did note that the Provider FINESS identifier underperformed across the board, in both Coverage and Accuracy, with all models. This may be attributed to a unique feature of French health invoices: the Provider FINESS identifier is not always clearly indicated and may be easily confused with other provider identifiers such as SIRET (Système d'identification du répertoire des établissements). This could impact the models' ability to identify what should be extracted as well as the content ultimately extracted.
The observed underperformance could also be the result of inherent ambiguity: the field is confusing for the LLM because it is genuinely confusing in and of itself, even for a human. As a consequence, the labels we use to evaluate the LLM on this field may not be as accurate as the labels for the other fields. Additional prompt engineering could potentially help, but if the ground truth itself is unreliable, performance will be hard to improve. This demonstrates the importance of establishing good-quality labels when evaluating LLMs.
English-language travel insurance claim documents
This dataset consisted of 405 anonymized English-language documents provided to support travel insurance claims.
The expected output is a list of segmented documents, each with its document type and page span (start and end page).
In addition to metrics for each individual document type, we also compute aggregated, file-level performance, defined below as PerfectClassif and PerfectTypes. A model's output for a file is counted as PerfectClassif when every segmented document in the file is correct, and as PerfectTypes when every document type in the file is correct.
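The sketch below shows how these file-level aggregates can be computed from segmented outputs, under our reading of the definitions above. The tuple representation of a segment (document type, start page, end page) and the example document types are assumptions made for illustration.

```python
# Sketch of the file-level aggregates, under assumed definitions:
# PerfectClassif: every segmented document in the file is correct (type and page span),
# PerfectTypes:   every document type in the file is correct, ignoring page boundaries.
# Segments are modeled as (doc_type, start_page, end_page) tuples; this is an assumption.

def perfect_classif(predicted: list[tuple], expected: list[tuple]) -> bool:
    return sorted(predicted) == sorted(expected)

def perfect_types(predicted: list[tuple], expected: list[tuple]) -> bool:
    return sorted(t for t, _, _ in predicted) == sorted(t for t, _, _ in expected)

expected = [("Flight Ticket", 1, 2), ("Proof of Payment", 3, 3), ("Medical Report", 4, 6)]
predicted = [("Flight Ticket", 1, 2), ("Proof of Payment", 3, 4), ("Medical Report", 5, 6)]

print(perfect_classif(predicted, expected))  # False: one page span is off
print(perfect_types(predicted, expected))    # True: all document types match
```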
GPT4 is clearly ahead of all other LLMs tested when it comes to Coverage and Accuracy, both for individual document types and for our aggregated categories. The Llama models did not fare well in this particular test, which may be related to their 4k-token context size or to their inability to deliver results in the output format mandated in the prompt.
We believe what could be considered underperformance for certain document types (e.g., Receipt - Activities Reservations, Cancellation Policy, Proof of Payment) may be due either to slightly vague document type descriptions included in the prompt or to document types that closely resemble others. Additional prompt engineering could resolve some of the observed underperformance.
For the LLMs we tested, there are essentially two price ranges: GPT3.5 and the Llama models are relatively inexpensive, while GPT4 and Mistral Large cost more to use. Perhaps not surprisingly, our analysis shows that, overall, the more expensive LLMs performed better. It is interesting, however, that GPT3.5, while less expensive, delivers performance closer to that of the expensive models. When comparing cost to performance for GPT3.5 and the Llama models, the key to the discrepancy appears to lie in context size: although GPT3.5 costs about as much as the Llama models, it features a context window four times larger than any of the Llama models, providing a measurable performance edge.
In the world of Generative AI and LLMs, it is important to remember that one size does not fit all. You must first determine the job that needs to be done and then apply the right tool to accomplish it. Evaluating performance against cost is also important. As we have seen, GPT4 and Mistral Large consistently outperformed the other LLMs in this comparison, with GPT4 performing exceptionally well on classification tasks.
GPT3.5's performance is close behind that of the leading LLMs, and the model may perform well enough for many use cases, especially when its price point is taken into consideration.
The Llama models we tested were simply not competitive, particularly in the classification scenario, especially when compared to the pricing and performance of GPT3.5.
Our data science team is continually testing LLMs, and we will continue to report on the results in future editions of this report.