Chain Ladder in Python


Recently I was trying to implement a simple chain ladder model in Python. I had tried this a few years back when my skills were not as... skilled. Back then I knew there was a chain ladder library for R, but when I tried searching for a similar package in Python, I found nothing notable.

I hadn't approached the problem for a while, until a few months back when, out of interest, I searched again. I found an excellent module developed by a GitHub user by the name of JBogaardt. It seems to be a passion project of his, as he has been consistently developing and maintaining it largely on his own.

You can view the documentation here: https://chainladder-python.readthedocs.io/en/latest/index.html

What's great is that it aims to be compatible with the Pandas and Scikit-Learn APIs. This means that if you are used to using either of these modules you'll quickly grasp chainladder. Further, if you are a fan of sklearn pipelines, then you can easily incorporate chainladder into your pipelines, as I will show below.

There is some good documentation available in the docs, but I figured I'd write a bit of an introductory guide here too to demonstrate its power.

The Use Case -- Calculating IBNR for an Insurer

To demonstrate some of the abilities of chainladder we will be calculating the IBNR for an insurer across six lines of business, using a blend of the loss ratio and Cape Cod methods.

(As a note, this post assumes prior knowledge of short-term actuarial reserving. Also, I won't be worrying about the quality of the reserving; the methods are purely for demonstration.)

Loading the data

We will use a dataset made up of claims data from 376 US insurance groups. This dataset is commonly used in research and is built into chainladder. You may recognize it as Schedule P data from the Casualty Actuarial Society.

The data is already triangulated for us and in the format required by chainladder: its Triangle class, which all of its methods expect. I will explain how the Triangle class works after loading the data.

import chainladder as cl
import pandas as pd
 
data = cl.load_sample('clrd')
data
Triangle Summary
Valuation: 1997-12
Grain: OYDY
Shape: (775, 6, 10, 10)
Index: [GRNAME, LOB]
Columns: [IncurLoss, CumPaidLoss, BulkLoss, EarnedPremDIR, EarnedPremCeded, EarnedPremNet]

This is a summary of the triangle data. It gives us a high-level snapshot of all the pertinent details.

Valuation tells us the valuation date of the data, in this case December 1997.

The Grain is the granularity of our triangle. This data is loaded by default on an Origin Year/Development Year basis (OYDY), but it can be loaded at another grain. We can also easily change grain between monthly, quarterly and annual periods, as sketched below.
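As a minimal sketch of changing grain, using another sample dataset bundled with chainladder (I am assuming, as in the docs, that the 'quarterly' sample loads with development periods in quarters):

quarterly = cl.load_sample('quarterly')  # sample triangle at a finer grain
quarterly.grain('OYDY')  # aggregate development quarters up to annual

Note that grain() aggregates from finer to coarser; you cannot disaggregate an annual triangle into quarters.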

The Shape is the same as the shape we are familiar with from pandas. You will note the shape of the triangle is four dimensional. This might sound strange, since the triangles we are familiar with are two dimensional. But part of the power of the chainladder Triangle data type is that it allows us to work with multiple triangles at once. The first two items of the shape tuple tell us the number of rows and columns our set has, and the last two tell us the number of origin periods and development periods we have.

If we were to visualise the triangle, it would look something like this:

Column 1 Column 2 ... Column 6
1 triangle(1, 1) triangle(1, 2) ... triangle(1, 6)
2 triangle(2, 1) triangle(2, 2) ... triangle(2, 6)
... ... ... ... ...
775 triangle(775, 1) triangle(775, 2) ... triangle(775, 6)

So each cell of our dataframe is itself a 2-dimensional triangle with 10 origin periods and 10 development periods.
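Each of these four axes is also exposed directly on the Triangle, which is a quick way to orient yourself (these attributes are part of the documented Triangle API):

data.index        # DataFrame of (GRNAME, LOB) pairs -- the 775 rows
data.columns      # the six measures
data.origin       # the ten origin (accident) years
data.development  # the ten development ages, in months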

Now, let's look at Index and Columns to complete our understanding of the Triangle data type.

We have a multi-level index. The first level for this dataset is GRNAME, the name of the insurance group. The second level is LOB, the line of business. For each insurance group, we have claims data for each line of business the group writes. Finally, we have the Columns: for each GRNAME and LOB combination, we have a triangulation of each of these measures. Let's revisit the visualisation, but this time allowing for Index and Columns:

IncurLoss CumPaidLoss ... EarnedPremNet
(Adriatic Ins Co, othliab) triangle(1, 1) triangle(1, 2) ... triangle(1, 6)
(Adriatic Ins Co, ppauto) triangle(2, 1) triangle(2, 2) ... triangle(2, 6)
... ... ... ... ...
(Zurich Ins (Guam) Inc, wkcomp) triangle(775, 1) triangle(775, 2) ... triangle(775, 6)

Now, let's zoom into one of the triangles, say the cumulative paid losses (CumPaidLoss) for the Workmen's Compensation (wkcomp) line of business for Allstate Ins Co Grp:

data.loc[('Allstate Ins Co Grp', 'wkcomp'), 'CumPaidLoss']
12 24 36 48 60 72 84 96 108 120
1988 70,571 155,905 220,744 251,595 274,156 287,676 298,499 304,873 321,808 325,322
1989 66,547 136,447 179,142 211,343 231,430 244,750 254,557 270,059 273,873 NaN
1990 52,233 133,370 178,444 204,442 222,193 232,940 253,337 256,788 NaN NaN
1991 59,315 128,051 169,793 196,685 213,165 234,676 239,195 NaN NaN NaN
1992 39,991 89,873 114,117 133,003 154,362 159,496 NaN NaN NaN NaN
1993 19,744 47,229 61,909 85,099 87,215 NaN NaN NaN NaN NaN
1994 20,379 46,773 88,636 91,077 NaN NaN NaN NaN NaN NaN
1995 18,756 84,712 87,311 NaN NaN NaN NaN NaN NaN NaN
1996 42,609 44,916 NaN NaN NaN NaN NaN NaN NaN NaN
1997 691 NaN NaN NaN NaN NaN NaN NaN NaN NaN

And there is the triangle we are familiar with! We have 775 * 6 = 4,650 of these in our data, all easily accessible using pandas-style indexing.
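As one more taste of the pandas-style API, groupby and sum work much as they do in pandas; for example, we can aggregate the 775 insurance groups into six industry-level triangles, one per line of business:

data.groupby('LOB').sum()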

Okay, so that is actually a lot of data, and it is a bit overwhelming. Let's just focus on one insurance group, National American Ins Co. And the only columns we really care about for now are CumPaidLoss and EarnedPremDIR.

naic = data.loc['National American Ins Co', ['CumPaidLoss', 'EarnedPremDIR']]
naic
Triangle Summary
Valuation: 1997-12
Grain: OYDY
Shape: (6, 2, 10, 10)
Index: [LOB]
Columns: [CumPaidLoss, EarnedPremDIR]

This is the data we will work with. It now has only one index level, LOB, because we selected a single group, and only two columns: the cumulative paid loss and the direct earned premium.

Useful methods of the Triangle class

Triangles have a number of useful built-in methods to make working with them easier. Let's focus on Cumulative Claims Paid for Workmen's Compensation for these examples, just to simplify some of the indexing.

wkcomp = naic.loc['wkcomp', 'CumPaidLoss']
wkcomp
12 24 36 48 60 72 84 96 108 120
1988 106 287 340 473 544 571 581 581 581 581
1989 672 1,839 1,547 1,650 1,810 1,838 1,872 1,872 1,897 NaN
1990 1,436 2,963 4,184 4,728 5,033 5,134 5,011 5,120 NaN NaN
1991 1,810 3,852 5,056 5,703 6,148 6,313 6,526 NaN NaN NaN
1992 3,938 9,948 12,783 13,800 14,195 14,509 NaN NaN NaN NaN
1993 4,961 11,196 14,062 15,193 15,696 NaN NaN NaN NaN NaN
1994 4,581 11,727 15,090 16,442 NaN NaN NaN NaN NaN NaN
1995 4,364 11,029 13,526 NaN NaN NaN NaN NaN NaN NaN
1996 5,295 12,017 NaN NaN NaN NaN NaN NaN NaN NaN
1997 6,478 NaN NaN NaN NaN NaN NaN NaN NaN NaN

The Triangle class gives us quick access to many useful methods. We can easily convert from cumulative to incremental, and back:

wkcomp.cum_to_incr()
12 24 36 48 60 72 84 96 108 120
1988 106 181 53 133 71 27 10 NaN NaN NaN
1989 672 1,167 -292 103 160 28 34 NaN 25 NaN
1990 1,436 1,527 1,221 544 305 101 -123 109 NaN NaN
1991 1,810 2,042 1,204 647 445 165 213 NaN NaN NaN
1992 3,938 6,010 2,835 1,017 395 314 NaN NaN NaN NaN
1993 4,961 6,235 2,866 1,131 503 NaN NaN NaN NaN NaN
1994 4,581 7,146 3,363 1,352 NaN NaN NaN NaN NaN NaN
1995 4,364 6,665 2,497 NaN NaN NaN NaN NaN NaN NaN
1996 5,295 6,722 NaN NaN NaN NaN NaN NaN NaN NaN
1997 6,478 NaN NaN NaN NaN NaN NaN NaN NaN NaN
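And incr_to_cum() takes us back the other way, so the two methods round-trip:

wkcomp.cum_to_incr().incr_to_cum()  # recovers the original cumulative triangle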

We can quickly get development factors for each cell of the triangle:

wkcomp.link_ratio
12-24 24-36 36-48 48-60 60-72 72-84 84-96 96-108 108-120
1988 2.7075 1.1847 1.3912 1.1501 1.0496 1.0175 1.0000 1.0000 1.0000
1989 2.7366 0.8412 1.0666 1.0970 1.0155 1.0185 1.0000 1.0134 NaN
1990 2.0634 1.4121 1.1300 1.0645 1.0201 0.9760 1.0218 NaN NaN
1991 2.1282 1.3126 1.1280 1.0780 1.0268 1.0337 NaN NaN NaN
1992 2.5262 1.2850 1.0796 1.0286 1.0221 NaN NaN NaN NaN
1993 2.2568 1.2560 1.0804 1.0331 NaN NaN NaN NaN NaN
1994 2.5599 1.2868 1.0896 NaN NaN NaN NaN NaN NaN
1995 2.5273 1.2264 NaN NaN NaN NaN NaN NaN NaN
1996 2.2695 NaN NaN NaN NaN NaN NaN NaN NaN

And because the API is similar to pandas, we can also plot very easily:

wkcomp.T.plot()

wkcomp.link_ratio.plot()

There are numerous other methods, and I urge you to explore them in the documentation or while playing around with the data in a Jupyter notebook.
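To give a flavour, here are a few other conveniences on the Triangle class (check the docs for their exact behaviour):

wkcomp.latest_diagonal  # the most recent valuation of each origin year
wkcomp.dev_to_val()     # relabel columns from development age to valuation date
wkcomp.heatmap()        # shade cells by value to spot anomalies (notebook only)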

Using basic chain ladder to calculate IBNR

Now that we have our triangle, the next step is to apply good old chain ladder. In a standard actuarial workflow this normally involves a bit of Excel work, where we calculate development factors and multiply them out to get our ultimate claims figures. We then subtract the latest paid position to get the IBNR portion. It is not terribly difficult if we have standardised workbooks, but there is always some issue that pops up. Not to mention, without really stringent workbook controls, Excel is a very uncontrolled environment, prone to human and operational error. I am sure we have all experienced the darker side of Excel (usually late at night before a deadline).

With chainladder things are a lot easier:

bcl = cl.Chainladder().fit(naic)

And that's pretty much it.

Okay, I am oversimplifying things. But, technically, that one line applied a basic chain ladder to every line of business. We can get an IBNR for our insurance group with the following line:

bcl.ibnr_['CumPaidLoss'].sum()
2261
1988 NaN
1989 161
1990 392
1991 592
1992 761
1993 1,037
1994 1,861
1995 4,130
1996 10,857
1997 30,653

(In case the 2261 column header above looks odd: chainladder values ultimates at a far-future date, 2261-12-31 by default, and the IBNR inherits that valuation.)

Let's focus on Workmen's Compensation again. We can see the full projected triangle, and we can also back-fit the chain ladder to get a full expected triangle:

Full triangle

bcl.full_triangle_.loc['wkcomp', 'CumPaidLoss']
12 24 36 48 60 72 84 96 108 120 132 9999
1988 106 287 340 473 544 571 581 581 581 581 581 581
1989 672 1,839 1,547 1,650 1,810 1,838 1,872 1,872 1,897 1,897 1,897 1,897
1990 1,436 2,963 4,184 4,728 5,033 5,134 5,011 5,120 5,172 5,172 5,172 5,172
1991 1,810 3,852 5,056 5,703 6,148 6,313 6,526 6,621 6,689 6,689 6,689 6,689
1992 3,938 9,948 12,783 13,800 14,195 14,509 14,649 14,863 15,015 15,015 15,015 15,015
1993 4,961 11,196 14,062 15,193 15,696 16,055 16,211 16,447 16,615 16,615 16,615 16,615
1994 4,581 11,727 15,090 16,442 17,186 17,579 17,749 18,008 18,192 18,192 18,192 18,192
1995 4,364 11,029 13,526 14,782 15,450 15,804 15,957 16,190 16,355 16,355 16,355 16,355
1996 5,295 12,017 15,143 16,549 17,298 17,694 17,865 18,126 18,311 18,311 18,311 18,311
1997 6,478 15,468 19,492 21,302 22,265 22,775 22,995 23,331 23,569 23,569 23,569 23,569

Backfit

bcl.full_expectation_.loc['wkcomp', 'CumPaidLoss']
12 24 36 48 60 72 84 96 108 120 132 9999
1988 160 381 480 525 549 561 567 575 581 581 581 581
1989 521 1,245 1,569 1,715 1,792 1,833 1,851 1,878 1,897 1,897 1,897 1,897
1990 1,422 3,394 4,277 4,675 4,886 4,998 5,046 5,120 5,172 5,172 5,172 5,172
1991 1,838 4,390 5,532 6,045 6,319 6,463 6,526 6,621 6,689 6,689 6,689 6,689
1992 4,127 9,854 12,417 13,570 14,184 14,509 14,649 14,863 15,015 15,015 15,015 15,015
1993 4,567 10,904 13,741 15,017 15,696 16,055 16,211 16,447 16,615 16,615 16,615 16,615
1994 5,000 11,939 15,045 16,442 17,186 17,579 17,749 18,008 18,192 18,192 18,192 18,192
1995 4,495 10,734 13,526 14,782 15,450 15,804 15,957 16,190 16,355 16,355 16,355 16,355
1996 5,033 12,017 15,143 16,549 17,298 17,694 17,865 18,126 18,311 18,311 18,311 18,311
1997 6,478 15,468 19,492 21,302 22,265 22,775 22,995 23,331 23,569 23,569 23,569 23,569

We can also see development factors:

Cumulative development factors

bcl.cdf_.loc['wkcomp', 'CumPaidLoss']
12-Ult 24-Ult 36-Ult 48-Ult 60-Ult 72-Ult 84-Ult 96-Ult 108-Ult 120-Ult 132-Ult
(All) 3.6383 1.5237 1.2092 1.1064 1.0586 1.0349 1.0249 1.0102 1.0000 1.0000 1.0000

Age-to-age (incremental) development factors

bcl.ldf_.loc['wkcomp', 'CumPaidLoss']
12-24 24-36 36-48 48-60 60-72 72-84 84-96 96-108 108-120 120-132 132-144
(All) 2.3877 1.2602 1.0929 1.0452 1.0229 1.0097 1.0146 1.0102 1.0000 1.0000 1.0000

As you can see, chainladder adopts the sklearn API. If you can use sklearn, you can use chainladder.
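To make good on the pipelines promise from earlier, here is a minimal sketch. The step choices and parameters are purely illustrative, and cl.Pipeline is chainladder's thin wrapper around the sklearn Pipeline:

pipe = cl.Pipeline(steps=[
    ('dev', cl.Development(average='volume')),  # volume-weighted development factors
    ('tail', cl.TailCurve()),                   # extrapolate a tail beyond 120 months
    ('model', cl.Chainladder()),                # project ultimates
])
pipe.fit(naic['CumPaidLoss'])
pipe.named_steps.model.ibnr_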

Creating a simple reserving algorithm

Now, simply fitting a basic chain ladder is not always appropriate. We need to consider the sparsity of the triangle. Where there is too little data to use chain ladder based methods we will use the loss ratio approach. For other cases we will use the Cape Cod approach.

So how do we go about this? Well first, we need an indicator to tell us if the triangle is too sparse to apply chain ladder based methods.

If we convert the non-zero cells in the triangle to 1, we can compare their sum to the total number of cells to get a density; one minus the density gives us a sparsity measure. If the sparsity is above a certain threshold, we say the triangle is too sparse and apply our simple loss-ratio-based approach.

Indicator for non-zero elements

# n_cells is hardcoded for example purposes but should be derived in a
# real-world workflow (a 10x10 annual triangle has 10 * 11 / 2 = 55 cells)
def get_sparsity(triangle, n_cells=55):
    # Flatten the single triangle to an origin x development DataFrame
    frame = triangle.to_frame()
    # Dividing the frame by itself gives 1 for non-zero cells and NaN elsewhere
    indicator = frame / frame
    # Density is the share of populated cells; sparsity is its complement
    density = indicator.sum().sum() / n_cells
    return 1 - density
 
get_sparsity(naic.loc['prodliab', 'CumPaidLoss'])
0.8363636363636364

So what threshold should we choose? That's quite subjective, so let's just pick 25% and go with it. If a triangle has more than 25% sparsity, more than a quarter of its cells have no data, and in a cumulative triangle that is a lot (in my subjective opinion).

Now, if a triangle is sparse, we want to apply the loss ratio method. chainladder does not have one built in (and why would it?). It is not too difficult to create one, but for brevity we will just write a function that returns the IBNR implied by a given loss ratio, applied from a given accident year onwards.

def loss_ratio_ibnr(triangle, earned_premium, loss_ratio, from_year):

    # Expected ultimate claims = loss ratio x earned premium, per origin year
    expected = earned_premium.to_frame()
    expected.columns = ['IBNR']
    expected['IBNR'] = loss_ratio * expected['IBNR']

    # Claims paid to date, per origin year
    latest_paid = triangle.latest_diagonal.to_frame()
    latest_paid.columns = ['IBNR']

    # IBNR = expected ultimate less paid to date...
    ibnr = expected - latest_paid

    # ...zeroed out for origin years before the method applies
    ibnr.loc[ibnr.index.year < from_year, 'IBNR'] = 0

    return ibnr
 
loss_ratio_ibnr(
    naic.loc['prodliab', 'CumPaidLoss'],
    naic.loc['prodliab', 'EarnedPremDIR'].latest_diagonal,
    0.5,
    1995
)
IBNR
1988 0.0
1989 0.0
1990 0.0
1991 0.0
1992 0.0
1993 0.0
1994 0.0
1995 281.0
1996 416.5
1997 770.0

Great, so we have a way to get an IBNR for sparse triangles. Now we can create a simple reserving algorithm that uses the loss ratio method when sparsity is above 25%, and Cape Cod otherwise.

def reserving_algorithm(triangle, sparsity_threshold):

    # The lines of business sit on the triangle's index,
    # which chainladder exposes as a DataFrame
    lobs = triangle.index['LOB'].to_numpy()
    ibnr_by_lob = {}

    for lob in lobs:

        claims = triangle.loc[lob, 'CumPaidLoss']
        premium = triangle.loc[lob, 'EarnedPremDIR'].latest_diagonal

        if get_sparsity(claims) >= sparsity_threshold:
            # Too sparse for development methods: flat 100% loss ratio
            # from 1995 onwards (hardcoded for this example)
            ibnr_by_lob[lob] = loss_ratio_ibnr(claims, premium, 1.0, 1995)
        else:
            # Otherwise Cape Cod, with earned premium as the exposure base
            ibnr_by_lob[lob] = cl.CapeCod(decay=0.75).fit(
                claims,
                sample_weight=premium
            ).ibnr_.to_frame()
        # A single common column label so the frames align when concatenated
        ibnr_by_lob[lob].columns = ['Cape Cod']

    ibnr = pd.concat(ibnr_by_lob, axis=1)
    ibnr['Total'] = ibnr.sum(axis=1)

    return ibnr.fillna(0)

reserving_algorithm(naic, 0.25)
comauto medmal othliab ppauto prodliab wkcomp Total
Cape Cod Cape Cod Cape Cod Cape Cod Cape Cod Cape Cod
1988 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
1989 199.700077 0.0 2.433665 0.000000 0.0 0.000000 202.133742
1990 374.176681 0.0 5.381205 0.000000 0.0 113.670274 493.228161
1991 384.622895 0.0 27.550540 0.000000 0.0 404.705647 816.879082
1992 294.215562 0.0 -41.698234 0.189356 0.0 431.483501 684.190186
1993 255.948328 0.0 -154.280465 -0.430910 0.0 537.545822 638.782775
1994 420.565099 0.0 -65.160634 -253.447370 0.0 1280.055827 1382.012922
1995 393.962604 0.0 474.803152 -338.502120 577.0 2242.557573 3349.821209
1996 979.192035 0.0 1570.368156 1029.200060 879.0 5393.652281 9851.412531
1997 2907.991561 0.0 3529.706847 4076.591330 1584.0 11639.363016 23737.652754

And there we have it: an (admittedly oversimplified) automated and repeatable reserving algorithm. (Medmal is zero because its triangle is 100% sparse and there is no earned premium.)

I am focusing purely on the chainladder package and its usage, so let's ignore any discussion around whether the reserving approach is correct or not. For the purpose of demonstration it doesn't matter.

Where do we go from here?

In the real world, life isn't so simple. What I have shown is only a small part of the reserving process, but I hope I have demonstrated the usefulness of the chainladder package, and how free open-source software can be extremely valuable in a commercial context. It's not hard to imagine how this process could be extended in either direction to completely automate the reserving process of an insurer.

For example, we started this process with clean, triangulated data. In reality, you get messy, tabular data from a system. You would need to pull the data from the system and push it through cleaning and preprocessing steps before it could feed a more robust and complete version of the chainladder reserving process we've built above. With the right skills this is achievable; it simply takes an investment of person-hours, which will save considerable person-hours in future.
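To make the triangulation step concrete, here is a minimal sketch. The file and column names are hypothetical placeholders for whatever your system produces; the Triangle constructor arguments are the documented ones:

import pandas as pd
import chainladder as cl

# Hypothetical cleaned extract; substitute your own source and column names
tabular = pd.read_csv('claims_extract.csv')

triangle = cl.Triangle(
    data=tabular,
    origin='accident_year',        # column holding origin periods
    development='valuation_date',  # column holding development/valuation dates
    columns=['paid', 'premium'],   # the measures to triangulate
    index=['lob'],                 # one triangle per line of business
    cumulative=True,               # amounts in the extract are cumulative
)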

After the chainladder process above, the results can be pushed to dashboarding systems, databases, or even directly into other actuarial models, such as a pricing model or a capital model.

Closing remarks

I hope some will find this post useful and begin experimenting with the chainladder library. Please reach out if you would like to discuss, point out errors or silly mistakes, or if you'd like to share your interesting findings.