ChainladderGPT: Large Language Reserving
Here is a fun thought I had the other day: ChatGPT is a conversational model. But, in addition to deep meaningful conversations (DMCs), it can also complete things. So... can it complete a triangle?
Let's find out :)
The Data: A Claims Development Triangle
First, we need a triangle. chainladder-python can help with that:
```python
import chainladder as cl
import pandas as pd

data = cl.load_sample('raa')
data
```

| | 12 | 24 | 36 | 48 | 60 | 72 | 84 | 96 | 108 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1981 | 5,012 | 8,269 | 10,907 | 11,805 | 13,539 | 16,181 | 18,009 | 18,608 | 18,662 | 18,834 |
| 1982 | 106 | 4,285 | 5,396 | 10,666 | 13,782 | 15,599 | 15,496 | 16,169 | 16,704 | |
| 1983 | 3,410 | 8,992 | 13,873 | 16,141 | 18,735 | 22,214 | 22,863 | 23,466 | | |
| 1984 | 5,655 | 11,555 | 15,766 | 21,266 | 23,425 | 26,083 | 27,067 | | | |
| 1985 | 1,092 | 9,565 | 15,836 | 22,169 | 25,955 | 26,180 | | | | |
| 1986 | 1,513 | 6,445 | 11,702 | 12,935 | 15,852 | | | | | |
| 1987 | 557 | 4,020 | 10,946 | 12,314 | | | | | | |
| 1988 | 1,351 | 6,947 | 13,112 | | | | | | | |
| 1989 | 3,133 | 5,395 | | | | | | | | |
| 1990 | 2,063 | | | | | | | | | |
The Language Model Approach
Now, we need to think about how to ask ChatGPT to reserve this triangle. The easy option is to leverage Code Interpreter, a ChatGPT plugin that allows ChatGPT to run code. But that's a bit boring: it would just import the chainladder-python package and apply a chainladder. I want ChatGPT to really think about what it's doing.
What if we trick ChatGPT into thinking of a triangle as a numeric sentence? Then ChatGPT can use all its great language knowledge to complete the sentence!
But, how do we make a triangle a sentence?
Well, each accident year is like a line in a story. It's just that, for some reason, we have cut the triangle across the diagonal and lost half of it. Getting the idea?
Like this:
```python
# First sentence:
data.to_frame().iloc[0].to_csv()
# Second sentence:
data.to_frame().iloc[1].to_csv()
# ... and so forth
```

So let's give this a shot. Let's make the latest diagonal of our triangle a series of placeholders and ask ChatGPT to fill them in. No other guidance!
```python
# Copy the triangle and blank out the latest diagonal
train_data = data.copy().to_frame()
for j, accident_date in enumerate(train_data.index):
    train_data.iloc[len(train_data.index) - j - 1, j] = "BLANK"
train_data
```

| | 12 | 24 | 36 | 48 | 60 | 72 | 84 | 96 | 108 | 120 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1981-01-01 | 5012.0 | 8269.0 | 10907.0 | 11805.0 | 13539.0 | 16181.0 | 18009.0 | 18608.0 | 18662.0 | BLANK |
| 1982-01-01 | 106.0 | 4285.0 | 5396.0 | 10666.0 | 13782.0 | 15599.0 | 15496.0 | 16169.0 | BLANK | NaN |
| 1983-01-01 | 3410.0 | 8992.0 | 13873.0 | 16141.0 | 18735.0 | 22214.0 | 22863.0 | BLANK | NaN | NaN |
| 1984-01-01 | 5655.0 | 11555.0 | 15766.0 | 21266.0 | 23425.0 | 26083.0 | BLANK | NaN | NaN | NaN |
| 1985-01-01 | 1092.0 | 9565.0 | 15836.0 | 22169.0 | 25955.0 | BLANK | NaN | NaN | NaN | NaN |
| 1986-01-01 | 1513.0 | 6445.0 | 11702.0 | 12935.0 | BLANK | NaN | NaN | NaN | NaN | NaN |
| 1987-01-01 | 557.0 | 4020.0 | 10946.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN |
| 1988-01-01 | 1351.0 | 6947.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1989-01-01 | 3133.0 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1990-01-01 | BLANK | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Testing GPT-3.5 (The Control Group)
Now we convert the triangle to a sentence:
```python
# Stitch each accident year's development into one long "numeric sentence"
triangle_sentence = ""
for i in range(train_data.shape[0]):
    triangle_sentence += train_data.iloc[i].dropna().to_csv()
triangle_sentence
```

(Output removed for brevity, as it is a long CSV string.)
Finally, let's ask the latest (at the time of writing) GPT-3.5 model behind ChatGPT to fill in the blanks!
(Note: API setup code and the large system prompt are omitted for readability.)
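For a sense of what that call looks like, here is a minimal sketch using the (pre-1.0) openai package that was current at the time; the prompt below is a placeholder of my own, not the actual system prompt used here:

```python
import openai  # the pre-1.0 interface current at the time of writing

# Placeholder prompt -- the real system prompt is omitted above
messages = [
    {"role": "system", "content": "Fill in each BLANK in this claims triangle, "
                                  "replying with one number per BLANK."},
    {"role": "user", "content": triangle_sentence},
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])
```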
The calculated Root Mean Square Error (RMSE) against the actual known completion values was:
3985.3233871293305
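To score predictions like these, a small helper is handy. The scoring code isn't shown above, so this is my own sketch; the actual diagonal, though, comes straight off the original triangle, ordered oldest accident year first (1981's 120-month cell down to 1990's 12-month cell):

```python
import numpy as np

def rmse(actual, preds):
    """Root mean square error between the held-out diagonal and predictions."""
    actual, preds = np.asarray(actual, dtype=float), np.asarray(preds, dtype=float)
    return np.mean((actual - preds) ** 2) ** 0.5

# True latest diagonal, oldest origin first: 1981 @ 120 months ... 1990 @ 12 months
actual = np.array([18834, 16704, 23466, 27067, 26180, 15852, 12314, 13112, 5395, 2063])
```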
Not a very good showing at all. In fact, there seems to be no real reasoning behind what ChatGPT is doing.
Could it be the GPT-3.5 model? What blew us away late last year has been eclipsed by GPT-4. GPT-3.5 is basically useless now (just kidding).
Testing GPT-4 (The Chainladder Contender)
Let's try GPT-4. I don't have API access to it, so I had to enter the prompt manually on the website. So imagine me typing it in laboriously before realising I could copy and paste.
...
Using GPT-4, the response is a lot more enjoyable! Here it is:
To fill in your triangle, I'll need to use some form of actuarial reserving methods. In actuarial science, common techniques include the chain ladder method, the Bornhuetter-Ferguson method, or the Cape Cod method.
However, as you've mentioned, the Chain Ladder method is too interpretable, so I'll be using a method that respects trends in data but also tries to find a pattern for the incremental data to fill in the blanks. This method is not guaranteed to always provide the best results, as it depends heavily on the nature of the data and underlying claims development.
Here are the filled blanks for your data:
120, 18735.0
108, 16423.0
96, 23357.0
84, 26775.0
72, 26632.0
60, 13317.0
48, 11300.0
36, 7200.0
24, 5500.0
12, 2000.0
Please note that the exact values could be different if different methods or different assumptions are used. Each claim development triangle is unique and may require different approaches based on the nature and the distribution of the data. Therefore, I recommend consulting with an actuary or data scientist for an in-depth analysis and completion of your triangle.
The calculated RMSE for the GPT-4 predictions against the actual diagonal values was:
2069.0908631570533
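As a sanity check, feeding GPT-4's answers (in the same oldest-first order) into the rmse helper from earlier reproduces that score:

```python
# GPT-4's filled-in diagonal, 1981 @ 120 months down to 1990 @ 12 months
gpt4_preds = np.array([18735, 16423, 23357, 26775, 26632,
                       13317, 11300, 7200, 5500, 2000])
rmse(actual, gpt4_preds)  # 2069.0908631570533
```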
Much better! One of the downsides, however, is that if you input the exact same prompt, you will likely get a different answer. Not good for reserving, but great for bootstrapping :).
Using the API, we can set a low temperature to make the output more deterministic. Temperature is a parameter that controls how creative GPT-3.5 and GPT-4 get when sampling. But without API access for GPT-4, we just have to settle for chaos.
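Continuing the earlier GPT-3.5 sketch, pinning the temperature looks like this:

```python
# Same call as the earlier sketch, but with (near-)deterministic sampling
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,  # the messages list from the earlier sketch
    temperature=0,      # 0 = least creative, most repeatable
)
```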
Comparison with Basic Chainladder
How do our results compare to a basic chainladder (BCL) run using chainladder-python?
First, the BCL score (excluding the latest accident year: with its only observed cell blanked, 1990 gives BCL nothing to develop, so it cannot be predicted without an extra assumption):
Basic Chainladder RMSE (1981-1989):
2476.6124040713353
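(For the curious, here is a minimal sketch of how that held-out BCL fit might look in chainladder-python; the valuation slicing is my assumption about how the diagonal was held out, not necessarily the code used here.)

```python
# Drop the latest diagonal (all cells valued at year-end 1990), then refit
masked = data[data.valuation < '1990']
bcl = cl.Chainladder().fit(masked)

# Read predictions for the held-out diagonal back off the completed triangle;
# 1990 drops out entirely, since its only cell was held out
full = bcl.full_triangle_
bcl_preds = full[full.valuation == data.valuation_date]
```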
And our GPT-4 score for the comparable set (1981-1989)... drumroll...
```python
# Drop the last element (1990) so the comparison covers the same years as BCL
np.mean((actual[:-1] - gpt4_preds[:-1]) ** 2) ** 0.5
```

Adjusted GPT-4 RMSE (1981-1989):
2180.9121689992216
Better than a basic chainladder! Not bad. Not bad at all.
Of course, this is an actuary's worst nightmare. It's the most opaque black box, the least parsimonious, least understood way to get an IBNR reserve. And worst of all, it didn't use Excel!
But an interesting test, nonetheless.