ChainladderGPT: Large Language Reserving
Here is a fun thought I had the other day: ChatGPT is a conversational model. But, in addition to deep meaningful conversations (DMCs), it can also complete things. So... can it complete a triangle?
Let's find out :)
The Data: A Claims Development Triangle
First, we need a triangle. chainladder-python can help with that:
```python
import chainladder as cl
import pandas as pd

data = cl.load_sample('raa')
data
```

| | 12 | 24 | 36 | 48 | 60 | 72 | 84 | 96 | 108 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1981 | 5,012 | 8,269 | 10,907 | 11,805 | 13,539 | 16,181 | 18,009 | 18,608 | 18,662 | 18,834 |
| 1982 | 106 | 4,285 | 5,396 | 10,666 | 13,782 | 15,599 | 15,496 | 16,169 | 16,704 | |
| 1983 | 3,410 | 8,992 | 13,873 | 16,141 | 18,735 | 22,214 | 22,863 | 23,466 | | |
| 1984 | 5,655 | 11,555 | 15,766 | 21,266 | 23,425 | 26,083 | 27,067 | | | |
| 1985 | 1,092 | 9,565 | 15,836 | 22,169 | 25,955 | 26,180 | | | | |
| 1986 | 1,513 | 6,445 | 11,702 | 12,935 | 15,852 | | | | | |
| 1987 | 557 | 4,020 | 10,946 | 12,314 | | | | | | |
| 1988 | 1,351 | 6,947 | 13,112 | | | | | | | |
| 1989 | 3,133 | 5,395 | | | | | | | | |
| 1990 | 2,063 | | | | | | | | | |
The Language Model Approach
Now, we need to think about how to ask ChatGPT to reserve this triangle. The easy option is to leverage Code Interpreter, a ChatGPT plugin that allows ChatGPT to run code. But that's a bit boring: it would just import the chainladder-python package and apply a chainladder. I want ChatGPT to really think about what it's doing.
What if we trick ChatGPT into thinking of a triangle as a numeric sentence? Then ChatGPT can use all its great language knowledge to complete the sentence!
But, how do we make a triangle a sentence?
Well, each accident year is like a line in a story. It's just that, for some reason, we have cut the triangle across the diagonal and lost half of it. Getting the idea?
Like this:
```python
# First sentence:
data.to_frame().iloc[0].to_csv()
# Second sentence:
data.to_frame().iloc[1].to_csv()
# ... and so forth
```

So let's give this a shot. Let's make the latest diagonal of our triangle a series of placeholders and ask ChatGPT to fill them in. No other guidance!
```python
# Copy the triangle and blank out the latest diagonal
train_data = data.copy().to_frame()
for j, accident_date in enumerate(train_data.index):
    train_data.iloc[len(train_data.index) - j - 1, j] = "BLANK"
train_data
```

| | 12 | 24 | 36 | 48 | 60 | 72 | 84 | 96 | 108 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1981-01-01 | 5012.0 | 8269.0 | 10907.0 | 11805.0 | 13539.0 | 16181.0 | 18009.0 | 18608.0 | 18662.0 | BLANK |
| 1982-01-01 | 106.0 | 4285.0 | 5396.0 | 10666.0 | 13782.0 | 15599.0 | 15496.0 | 16169.0 | BLANK | NaN |
| 1983-01-01 | 3410.0 | 8992.0 | 13873.0 | 16141.0 | 18735.0 | 22214.0 | 22863.0 | BLANK | NaN | NaN |
| 1984-01-01 | 5655.0 | 11555.0 | 15766.0 | 21266.0 | 23425.0 | 26083.0 | BLANK | NaN | NaN | NaN |
| 1985-01-01 | 1092.0 | 9565.0 | 15836.0 | 22169.0 | 25955.0 | BLANK | NaN | NaN | NaN | NaN |
| 1986-01-01 | 1513.0 | 6445.0 | 11702.0 | 12935.0 | BLANK | NaN | NaN | NaN | NaN | NaN |
| 1987-01-01 | 557.0 | 4020.0 | 10946.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN |
| 1988-01-01 | 1351.0 | 6947.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1989-01-01 | 3133.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-01 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Testing GPT-3.5 (The Control Group)
Now we convert the triangle to a sentence:
```python
# Stitch each accident year's development into one long "numeric sentence"
triangle_sentence = ""
for i in range(train_data.shape[0]):
    triangle_sentence += train_data.iloc[i].dropna().to_csv()
triangle_sentence
```

(Output removed for brevity, as it is a long CSV string.)
Finally, let's ask the latest (at the time of writing) GPT-3.5 model behind ChatGPT to fill in the blanks!
(Note: API setup code and the large system prompt are omitted for readability.)
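For a sense of what that call looks like, here is a minimal sketch using the (pre-1.0) openai package that was current at the time; the prompt below is a placeholder of my own, not the actual system prompt used here:

```python
import openai  # the pre-1.0 interface current at the time of writing

# Placeholder prompt -- the real system prompt is omitted above
messages = [
    {"role": "system", "content": "Fill in each BLANK in this claims triangle, "
                                  "replying with one number per BLANK."},
    {"role": "user", "content": triangle_sentence},
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])
```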
The calculated Root Mean Square Error (RMSE) against the actual known completion values was:
3985.3233871293305
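To score predictions like these, a small helper is handy. The scoring code isn't shown above, so this is my own sketch; the actual diagonal, though, comes straight off the original triangle, ordered oldest accident year first (1981's 120-month cell down to 1990's 12-month cell):

```python
import numpy as np

def rmse(actual, preds):
    """Root mean square error between the held-out diagonal and predictions."""
    actual, preds = np.asarray(actual, dtype=float), np.asarray(preds, dtype=float)
    return np.mean((actual - preds) ** 2) ** 0.5

# True latest diagonal, oldest origin first: 1981 @ 120 months ... 1990 @ 12 months
actual = np.array([18834, 16704, 23466, 27067, 26180, 15852, 12314, 13112, 5395, 2063])
```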
Not a very good showing at all. In fact, there seems to be no real reasoning behind what ChatGPT is doing.
Could it be the GPT-3.5 model? What blew us away late last year has been eclipsed by GPT-4. GPT-3.5 is basically useless now (just kidding).
Testing GPT-4 (The Chainladder Contender)
Let's try GPT-4. I don't have API access to it, so I had to enter the prompt manually on the website. So imagine me typing it in laboriously before realising I could copy and paste.
...
Using GPT-4, the response is a lot more enjoyable! Here it is:
To fill in your triangle, I'll need to use some form of actuarial reserving methods. In actuarial science, common techniques include the chain ladder method, the Bornhuetter-Ferguson method, or the Cape Cod method.
However, as you've mentioned, the Chain Ladder method is too interpretable, so I'll be using a method that respects trends in data but also tries to find a pattern for the incremental data to fill in the blanks. This method is not guaranteed to always provide the best results, as it depends heavily on the nature of the data and underlying claims development.
Here are the filled blanks for your data:
120, 18735.0
108, 16423.0
96, 23357.0
84, 26775.0
72, 26632.0
60, 13317.0
48, 11300.0
36, 7200.0
24, 5500.0
12, 2000.0
Please note that the exact values could be different if different methods or different assumptions are used. Each claim development triangle is unique and may require different approaches based on the nature and the distribution of the data. Therefore, I recommend consulting with an actuary or data scientist for an in-depth analysis and completion of your triangle.
The calculated RMSE for the GPT-4 predictions against the actual diagonal values was:
2069.0908631570533
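As a sanity check, feeding GPT-4's answers (in the same oldest-first order) into the rmse helper from earlier reproduces that score:

```python
# GPT-4's filled-in diagonal, 1981 @ 120 months down to 1990 @ 12 months
gpt4_preds = np.array([18735, 16423, 23357, 26775, 26632,
                       13317, 11300, 7200, 5500, 2000])
rmse(actual, gpt4_preds)  # 2069.0908631570533
```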
Much better! One of the downsides, however, is that if you input the exact same prompt, you will likely get a different answer. Not good for reserving, but great for bootstrapping :).
Using the API, we can set a low temperature to make the output more deterministic. Temperature is a parameter that controls how creative GPT-3.5 and GPT-4 get when sampling. But without API access for GPT-4, we just have to settle for chaos.
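Continuing the earlier GPT-3.5 sketch, pinning the temperature looks like this:

```python
# Same call as the earlier sketch, but with (near-)deterministic sampling
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,  # the messages list from the earlier sketch
    temperature=0,      # 0 = least creative, most repeatable
)
```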
Comparison with Basic Chainladder
How do our results compare to a basic chainladder (BCL) run using chainladder-python?
First, the BCL score (excluding the latest accident year: with its only observed cell blanked, 1990 gives BCL nothing to develop, so it cannot be predicted without an extra assumption):
Basic Chainladder RMSE (1981-1989):
2476.6124040713353
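(For the curious, here is a minimal sketch of how that held-out BCL fit might look in chainladder-python; the valuation slicing is my assumption about how the diagonal was held out, not necessarily the code used here.)

```python
# Drop the latest diagonal (all cells valued at year-end 1990), then refit
masked = data[data.valuation < '1990']
bcl = cl.Chainladder().fit(masked)

# Read predictions for the held-out diagonal back off the completed triangle;
# 1990 drops out entirely, since its only cell was held out
full = bcl.full_triangle_
bcl_preds = full[full.valuation == data.valuation_date]
```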
And our GPT-4 score for the comparable set (1981-1989)... drumroll...
```python
# Drop the last element (1990) so the comparison covers the same years as BCL
np.mean((actual[:-1] - gpt4_preds[:-1]) ** 2) ** 0.5
```

Adjusted GPT-4 RMSE (1981-1989):
2180.9121689992216
Better than a basic chainladder! Not bad. Not bad at all.
Of course, this is an actuary's worst nightmare. It's the most opaque black box, the least parsimonious, least understood way to get an IBNR reserve. And worst of all, it didn't use Excel!
But an interesting test, nonetheless.