Executive Summary
- The LLM market is rapidly evolving, with a range of models now available to suit a variety of use cases
- Determining which LLM is best for which use case involves comparing context size, overall cost and performance
- Focused prompt engineering and tuning can be the difference between exceptional and disappointing performance
Advances in the use of artificial intelligence (AI) and generative artificial intelligence (GenAI) to improve critical
insurance processes continue to captivate the industry. At the same time, it can be difficult to navigate the rapidly
changing landscape and decide how to implement these innovations for the best results.
In the inaugural The State of AI in Insurance report, we explored the performance of six different Large Language
Models (LLMs) applied to several insurance-specific use cases. Shift data scientists and researchers sought not only
to compare relative performance on a set of predetermined tasks, but also to illustrate the cost/performance
comparison for each of the LLMs tested.
In Vol. II, we are testing eight new LLMs and have retired two that appeared in the previous report. The newly tested
models include Llama3-8b, Llama3-70b, GPT4o, Command r, Command r+, Claude3 Opus, Claude3 Sonnet, and
Claude3 Haiku. The Llama2 models that appeared in the inaugural report have been removed from the comparison
and replaced with the Llama3 models, which are more representative of the current state of the art among
available LLMs.
Further, the report now features a new table highlighting an F1 score generated for each model. For this report, the
F1 score aggregates coverage and accuracy along two axes: the specific use case (e.g. French-language Dental
Invoices) and the individual fields associated with the use case. This approach allows us to generate a single
performance metric per use case as well as an aggregated overall score, including the cost associated with analyzing
100,000 documents. The following formula was used to generate the F1 score:
F1 = 2 × Cov × Acc / (Cov + Acc), where Cov is coverage and Acc is accuracy.
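As a rough illustration of the F1 formula above, the following minimal Python sketch computes the score from a coverage value and an accuracy value. The function name `f1_score`, the zero-input handling, and the sample values are assumptions made for illustration only; they are not taken from the report and do not reproduce how Shift aggregates scores across fields or use cases.

```python
def f1_score(coverage: float, accuracy: float) -> float:
    """Harmonic mean of coverage and accuracy: 2 * Cov * Acc / (Cov + Acc)."""
    if coverage + accuracy == 0:
        # Assumption: define the score as 0 when both inputs are 0,
        # since the formula would otherwise divide by zero.
        return 0.0
    return 2 * coverage * accuracy / (coverage + accuracy)


# Hypothetical inputs (not results from the report).
balanced = f1_score(0.90, 0.80)
lopsided = f1_score(1.00, 0.50)

print(round(balanced, 4))  # 0.8471
print(round(lopsided, 4))  # 0.6667
```

Because the formula is a harmonic mean, a model with high coverage but low accuracy (or the reverse) scores lower than a simple average of the two values would suggest: in the second hypothetical case the arithmetic mean would be 0.75, while the F1 score is about 0.67.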
Thank you to the Shift data science and research teams that make this report possible.