ChainladderGPT: Large Language Reserving

Here is a fun thought I had the other day: ChatGPT is a conversational model. But, in addition to deep meaningful conversations (DMCs), it can also complete things. So... can it complete a triangle?

Let's find out :)

The Data: A Claims Development Triangle

First, we need a triangle. chainladder-python can help with that:

import chainladder as cl
import pandas as pd

# load the classic RAA sample triangle that ships with the package
data = cl.load_sample('raa')
data
         12     24     36     48     60     72     84     96    108    120
1981  5,012  8,269 10,907 11,805 13,539 16,181 18,009 18,608 18,662 18,834
1982    106  4,285  5,396 10,666 13,782 15,599 15,496 16,169 16,704
1983  3,410  8,992 13,873 16,141 18,735 22,214 22,863 23,466
1984  5,655 11,555 15,766 21,266 23,425 26,083 27,067
1985  1,092  9,565 15,836 22,169 25,955 26,180
1986  1,513  6,445 11,702 12,935 15,852
1987    557  4,020 10,946 12,314
1988  1,351  6,947 13,112
1989  3,133  5,395
1990  2,063

The Language Model Approach

Now, we need to think about how we ask ChatGPT to reserve this triangle. The easy option is to leverage Code Interpreter, a ChatGPT plugin that allows ChatGPT to run code. But that's a bit boring: it would just use the chainladder-python package and apply a chainladder. I want ChatGPT to really think about what it's doing.

What if we trick ChatGPT into thinking of a triangle as a numeric sentence? Then ChatGPT can use all its great language knowledge to complete the sentence!

But, how do we make a triangle a sentence?

Well, each accident year is like a line in a story. And for some reason we have cut the triangle across the diagonal, losing half of it. Getting the idea?

Like this:

# First sentence:
data.to_frame().iloc[0].to_csv()
 
# Second sentence:
data.to_frame().iloc[1].to_csv()
 
# ... and so forth

So let's give this a shot. Let's make the latest diagonal of our triangle a series of placeholders and ask ChatGPT to fill them in. No other guidance!

train_data = data.copy().to_frame()

# blank out the latest diagonal: column j's most recent observed
# value sits in row (n - 1 - j)
for j, accident_date in enumerate(train_data.index):
    train_data.iloc[len(train_data.index) - j - 1, j] = "BLANK"
 
train_data
                  12      24      36      48      60      72      84      96     108     120
1981-01-01    5012.0  8269.0 10907.0 11805.0 13539.0 16181.0 18009.0 18608.0 18662.0   BLANK
1982-01-01     106.0  4285.0  5396.0 10666.0 13782.0 15599.0 15496.0 16169.0   BLANK     NaN
1983-01-01    3410.0  8992.0 13873.0 16141.0 18735.0 22214.0 22863.0   BLANK     NaN     NaN
1984-01-01    5655.0 11555.0 15766.0 21266.0 23425.0 26083.0   BLANK     NaN     NaN     NaN
1985-01-01    1092.0  9565.0 15836.0 22169.0 25955.0   BLANK     NaN     NaN     NaN     NaN
1986-01-01    1513.0  6445.0 11702.0 12935.0   BLANK     NaN     NaN     NaN     NaN     NaN
1987-01-01     557.0  4020.0 10946.0   BLANK     NaN     NaN     NaN     NaN     NaN     NaN
1988-01-01    1351.0  6947.0   BLANK     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1989-01-01    3133.0   BLANK     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1990-01-01     BLANK     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

Testing GPT-3.5 (The Control Group)

Now we convert the triangle to a sentence:

triangle_sentence = ""

# one "sentence" per accident year: its observed values as a CSV line,
# with the NaN padding dropped and the BLANK placeholders kept
for i in range(train_data.shape[0]):
    triangle_sentence += train_data.iloc[i].dropna().to_csv()
 
triangle_sentence

(Output removed for brevity, as it is a long CSV string.)

Finally, let's ask the latest (at the time of writing) GPT-3.5 model behind ChatGPT to fill in the blanks!

(Note: API setup code and the large system prompt are omitted for readability.)
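For context, a chat-completion call of this kind looks roughly like the sketch below, written against the pre-1.0 openai Python package that was current at the time. The system prompt shown is a placeholder, not the actual (much longer) one used.

import openai

# A minimal sketch only, not the actual setup: the real system prompt
# was much longer, and openai.api_key configuration is omitted too.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Fill in each BLANK in the triangle below."},  # placeholder
        {"role": "user", "content": triangle_sentence},
    ],
)
reply = response["choices"][0]["message"]["content"]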

The calculated Root Mean Square Error (RMSE) against the actual known completion values was:

3985.3233871293305
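For reference, the RMSE itself is a one-liner; `actual` and `gpt35_preds` here are hypothetical numpy arrays holding the true latest-diagonal values and the model's fill-ins, in accident-year order.

import numpy as np

# hypothetical arrays: true latest-diagonal values vs GPT-3.5's guesses
rmse = np.mean((actual - gpt35_preds) ** 2) ** 0.5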

Not a very good showing at all. In fact, there seems to be no real reasoning behind what ChatGPT is doing.

Could it be the GPT-3.5 model? What blew us away late last year has been eclipsed by GPT-4. GPT-3.5 is basically useless now (just kidding).

Testing GPT-4 (The Chainladder Contender)

Let's try GPT-4. I don't have API access to it, so I had to enter the prompt manually on the website. So imagine me typing it in laboriously before realising I can copy and paste.

...

Using GPT-4, the response is a lot more enjoyable! Here it is:

To fill in your triangle, I'll need to use some form of actuarial reserving methods. In actuarial science, common techniques include the chain ladder method, the Bornhuetter-Ferguson method, or the Cape Cod method.

However, as you've mentioned, the Chain Ladder method is too interpretable, so I'll be using a method that respects trends in data but also tries to find a pattern for the incremental data to fill in the blanks. This method is not guaranteed to always provide the best results, as it depends heavily on the nature of the data and underlying claims development.

Here are the filled blanks for your data:

120, 18735.0
108, 16423.0
96, 23357.0
84, 26775.0
72, 26632.0
60, 13317.0
48, 11300.0
36, 7200.0
24, 5500.0
12, 2000.0

Please note that the exact values could be different if different methods or different assumptions are used. Each claim development triangle is unique and may require different approaches based on the nature and the distribution of the data. Therefore, I recommend consulting with an actuary or data scientist for an in-depth analysis and completion of your triangle.

The calculated RMSE for the GPT-4 predictions against the actual diagonal values was:

2069.0908631570533

Much better! One of the downsides, however, is that if you input the exact same prompt, you will likely get a different answer. Not good for reserving, but great for bootstrapping :).

Using the API, we can set a low temperature to make the output more deterministic. Temperature is a parameter that controls the creativity of GPT-3.5 and GPT-4: the lower it is, the less random the sampling. But without API access for GPT-4, we just have to settle for chaos.
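For illustration, reusing the hypothetical call sketched earlier, a near-deterministic request would look something like this:

# same sketch as before, but with near-deterministic sampling
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": triangle_sentence}],
    temperature=0,  # low temperature = less random, more repeatable
)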

Comparison with Basic Chainladder

How do our results compare to a basic chainladder (BCL) run using chainladder-python?

First, the BCL score (excluding the latest accident year: once its single observed value is blanked, 1990 has no data at all for BCL to develop).
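A fit like that can be sketched with chainladder-python as follows. This assumes the latest diagonal is dropped from the training data so the model has to re-project it; it is an illustration, not necessarily the exact code behind the number below.

import chainladder as cl

# drop the latest diagonal, mirroring the BLANK setup above
train = data[data.valuation < data.valuation_date]

# fit a basic chainladder on the remaining triangle
model = cl.Chainladder().fit(train)

# full_triangle_ holds the fitted values, including the re-projected
# diagonal, which can then be scored against the known actuals
projected = model.full_triangle_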

Basic Chainladder RMSE (1981-1989):

2476.6124040713353

And our GPT-4 score for the comparable set (1981-1989)... drumroll...

# drop the final (1990) year from both hypothetical arrays for comparability
np.mean((actual[:-1] - gpt4_preds[:-1]) ** 2) ** 0.5

Adjusted GPT-4 RMSE (1981-1989):

2180.9121689992216

Better than a basic chainladder! Not bad. Not bad at all.

Of course, this is an actuary's worst nightmare. It's the most opaque black box, the least parsimonious, least understood way to get an IBNR reserve. And worst of all, it didn't use Excel!

But an interesting test, nonetheless.