The Report in Brief
- LLM technology is advancing at a rapid pace, with both new versions of existing models and entirely new models being introduced.
- We are reaching a confluence point, with several models achieving highly comparable performance.
- Price/performance comparisons may be an important determining factor when selecting LLMs until new performance gains can be established.
The introduction of the F1 score in Vol. II of this ongoing series of publications has allowed us to think
differently about how we report our findings on the evaluation of Large Language Model (LLM) performance
when applied to specific insurance use cases. You will see that evolved thinking reflected in this report.
We believe the aggregated F1 score, both per scenario and overall, provides the insights required
to understand how the tested LLMs perform against common insurance industry use cases and whether the costs
associated with their deployment are in line with that performance. This approach also makes it easier to see
how each model performs against what could be considered “simple scenarios” (information extraction from text
fields, amounts, dates, etc.) as opposed to “complex scenarios” (tasks involving several steps and/or information
extraction from lists or from fields that are themselves complex objects).
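For reference, the F1 score for a single scenario is the harmonic mean of precision and recall. One simple way to aggregate it across scenarios is an unweighted (macro) average, sketched below; the equal weighting is an illustrative assumption, as the report does not prescribe a specific aggregation formula.

\[
\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},
\qquad
\mathrm{F1}_{\mathrm{overall}} = \frac{1}{S} \sum_{s=1}^{S} \mathrm{F1}_s
\]

Here \( \mathrm{F1}_s \) is the score on scenario \( s \) and \( S \) is the number of scenarios tested.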
Based on advancements in LLM technology introduced to the market since Vol. II, our data science and
research teams have added six new LLMs to the testing: GPT-4o mini, Claude 3.5 Sonnet, Mistral Large 2407,
Llama 3.1-405B, Llama 3.1-70B, and Llama 3.1-8B. Command R and Command R+ were removed from the
evaluation.
As always, this report would not be possible without the efforts of Shift’s data science and research teams.