Calibration Use Cases

This site is focused on the following calibration scenarios, each of which is discussed in more detail below:

Calibration of probability of default (PD) models (and supporting infrastructure) to improve their accuracy in the assignment of probabilities

Auxiliary calibration modelling in support of the first scenario above, covering master rating scale (MRS) analysis and remodelling; credit cycle modelling; and general time series modelling (back-casting, forecasting, etc.)

Estimation of PD term structures to compute expected credit losses (ECL) in the context of impairment allowances

Fine-tuning of existing models following a validation or review, which is closer to parameter estimation

At present, our efforts are focused primarily on the calibration of PD models, which is a large domain in itself. However, we have considerable experience in the other problem types and will be offering services for those requirements very shortly. Our modelling platform (currently in testing) will also cover these other requirements.

PD Model Calibration

Terminology

All banks, whether small challenger banks or large, established multinationals, have a complex credit risk measurement task that requires ongoing attention, especially in changing conditions. “Risk measurement” typically means models, and models mean parameters, so a point of clarification is required here regarding terminology. The terms ‘parameter estimation’ and ‘calibration’ are synonymous in some modelling contexts, but on this site calibration refers specifically to the tuning of credit risk model outputs, not the estimation of the core parameters.

In practical terms, this often means introducing a new component that transforms raw model scores such that the overall outputs are ‘well-calibrated’. In many cases, of course, this component already exists, and the calibration exercise is to re-estimate its parameters (leaving the ‘core model’ as it is). It is also not at all unusual for there to be more than one calibration component, with additional components designed to quantify uncertainty (in the form of prediction intervals) and/or control for dynamics in the original input space.
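
To make this concrete, here is a minimal sketch (illustrative data only, and not a prescription for any particular architecture) of such a component: a univariate logistic mapping fitted on the core model’s raw scores, whose two parameters are all that the calibration exercise estimates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: raw scores from an existing core model and observed
# default flags (1 = default within the horizon, 0 = no default).
rng = np.random.default_rng(42)
raw_scores = rng.normal(size=5000)
defaults = rng.binomial(1, p=1 / (1 + np.exp(-(raw_scores - 3.0))))

# Calibration component: a univariate logistic mapping from raw score to PD.
# The core model is left untouched; only the mapping's two parameters
# (intercept and slope) are estimated.
calibrator = LogisticRegression()
calibrator.fit(raw_scores.reshape(-1, 1), defaults)

# Calibrated PDs for new scores produced by the unchanged core model.
new_scores = np.array([[-1.0], [0.0], [2.5]])
calibrated_pd = calibrator.predict_proba(new_scores)[:, 1]
print(calibrated_pd)
```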

Regardless of the final architecture — and at the risk of over-simplifying a rigorous modelling exercise — the process of calibration has at least two stages: [1] analysis (measuring the degree of potential mis-calibration and performing diagnostics), and [2] correction (meaning calibration or re-calibration of the model’s outputs). In some circles, there is a very strong emphasis on the first part, with whole frameworks dedicated to uncertainty quantification alone. However, with the majority of banking systems relying on accurate point estimates, corrective action is an absolutely essential component. For us, “calibration” always involves both parts.
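
For the first stage, a minimal diagnostic sketch (synthetic data and hypothetical column names) is to bin exposures by predicted PD and compare the mean predicted PD with the observed default rate in each bin:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: predicted PDs from the current model/calibration and
# realised default flags over the relevant horizon.
rng = np.random.default_rng(0)
pred_pd = rng.uniform(0.001, 0.10, size=10_000)
default_flag = rng.binomial(1, pred_pd * 1.3)   # deliberately mis-calibrated

df = pd.DataFrame({"pred_pd": pred_pd, "default_flag": default_flag})

# Stage 1 (analysis): bin by predicted PD and compare predicted vs observed.
df["bin"] = pd.qcut(df["pred_pd"], q=10)
diagnostic = df.groupby("bin", observed=True).agg(
    mean_pred_pd=("pred_pd", "mean"),
    observed_dr=("default_flag", "mean"),
    n=("default_flag", "size"),
)
print(diagnostic)
```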

The Calibration Imperative

For PD models to be used in frontline risk measurement at modern lending institutions, they must have both discriminatory power and accuracy. Together, these two complementary properties define the essential purpose of any credit risk modelling task at a very high level. “Discriminatory power” is the ability to separate, or classify, good borrowers from bad. “Accuracy”, in this context, means the ability to assign appropriate probabilities to borrowers and is the province of calibration.

The calibration of PD models, scorecards, rating systems and rating scales is a central challenge for any institution providing credit. Problems with data are common and can materially constrain a calibration exercise; indeed, data issues are so common that regulators now treat some portfolios as a separate category. Careful augmentation with expert opinion, external benchmarks and auxiliary modelling is often required to offset these issues. Augmentation is discussed elsewhere on this site, and will feature heavily in our technical blog going forward.

Calibration is a complex sub-domain of risk model development, with its own special requirements and considerations. Not only is this category of modelling intrinsically complex, it is frequently underestimated by modelling teams, for several reasons that are discussed elsewhere on this site. As such, the average credit risk modeller may not be as familiar with the core issues and modelling idiosyncrasies as they are with, say, the development of a PD model.

Auxiliary Modelling for Calibration

Very rarely is calibration a ‘one-step’, stand-alone exercise. Typically, a number of other calculations and analytical steps are involved, from exploratory data analysis (EDA) to the development of wholly-separate models that compute some intermediate quantity. It’s critically important to factor these other elements in when planning a calibration project, and calibration operations in general.

Modelling requirements vary enormously across calibration exercises and use cases. For some projects, and architectures, modelling the credit cycle is a central component and this brings a lot of complexity with it. For others, it might be more important to focus on portfolio composition and alignment with external benchmarks.

Specific calibrations aside, a number of other analytic activities may be conducted in support of calibration’s broader mission of supporting accurate risk measurement. These include periodic validation; work connected with the master rating scale (such as validation, benchmarking against external ratings, and full re-calibration); and computing ancillary measures (such as cross-bucket roll-rates in the context of provisioning models).

Validation alone is a much-underestimated, mission-critical exercise that contains a great deal of subtlety when done properly. It is probably the most salient type of ‘auxiliary’ modelling. Diagnosing and quantifying the drivers of a mis-calibration is an active area of ongoing research.

PD Term Structure Modelling

With the requirement under accounting standard IFRS9 to provide lifetime expected losses (for at least part of the portfolio) comes the need to generate a term structure of default risk.

It would be difficult to exaggerate the impact that these new requirements have had on credit risk modelling operations in banks; IFRS9 was a significant change, placing increasing pressure on rating systems, their associated methodologies, and PD estimation processes in general.

There are many approaches to generating a PD term structure, and each has its own constraints and assumptions. These approaches are diverse, to say the least, ranging from regression-based approaches grounded in survival analysis to Markov-based recursions. What they all share, however, is the need to generate an initial PD with certain attributes, attributes that differ in fundamental ways from the PD required for capital.

Calibration plays a central role here: the differences between Basel and IFRS9 do not necessarily require a separate PD model for each domain (although some banks elect to build one). The needs of both domains can be met by the same PD model feeding different calibration engines.

Due to the above-mentioned complexity, it’s fair to say that banks’ methodologies are still evolving in this space. One of the key challenges is that even under fairly standard conditions, these methodologies can produce wildly different results. As such, reconciling these results, and – where appropriate – stabilising and tempering them with expert opinion, is an ongoing task.

Survival-analysis approaches, for example, are built around the hazard rate: the instantaneous default intensity of the default time τ, given survival to time t:

$$\lambda \left ( t \right )= \displaystyle \lim_{\Delta t \to 0^{+}}\frac{\mathbb{P}\left [ t < \tau \leq t+\Delta t \,|\, \tau > t \right ]}{\Delta t}$$

Regardless of approach, defining and generating PD term structures, under a broad range of conditions, is now an essential capability.
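
Purely for illustration (a toy two-grade-plus-default example, not a recommended methodology), the Markov-style recursion builds a cumulative PD term structure by repeatedly applying a one-period rating transition matrix in which default is an absorbing state:

```python
import numpy as np

# One-period transition matrix over states [Grade A, Grade B, Default].
# Default is absorbing. Figures are purely illustrative.
P = np.array([
    [0.95, 0.04, 0.01],
    [0.10, 0.85, 0.05],
    [0.00, 0.00, 1.00],
])

def cumulative_pd_term_structure(P: np.ndarray, start_state: int, horizons: int) -> list[float]:
    """Cumulative PD at horizons 1..H for an obligor starting in `start_state`."""
    state = np.zeros(P.shape[0])
    state[start_state] = 1.0
    cum_pds = []
    for _ in range(horizons):
        state = state @ P          # roll the state distribution forward one period
        cum_pds.append(float(state[-1]))  # probability mass absorbed in default so far
    return cum_pds

print(cumulative_pd_term_structure(P, start_state=0, horizons=5))
```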

Fine-Tuning Existing Models

Re-estimating Core Models

Models do not always have to be built from scratch. The parameters of an existing model, one that has been built to high standards, can be updated (i.e. re-estimated) on more current data, leaving the model’s feature space unchanged. In the context of regulatory modelling, this is sometimes called a ‘recalibration’; other terms include ‘model refinement’ and ‘model updating’.

Choosing model refinement over a full re-build can save a huge amount of time and money. When done well, it can leverage all of the previous work done in phases like exploratory data analysis (EDA), feature engineering, feature selection, and expert elicitation.

This kind of re-estimation does not always have to be triggered by a formal re-build requirement; it could be a speculative step in the spirit of a what-if analysis, or a counterfactual. Much can be learned by re-estimating a model on an extended data-set.

From a methodological point-of-view, there are several ways of doing this, ranging from deep, thorough processes that are virtually complete model re-builds without the feature selection step, to more modest processes. Methodologies exist that are perfectly suited to this special case.

There are reproducibility benefits as well: re-estimating the model on fresh code can be a powerful way to validate the results of the original build.
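
A minimal sketch of such a re-estimation, assuming a logistic core model and using hypothetical feature names, simply refits the coefficients on the more current data while holding the feature list fixed:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The existing model's feature list is kept exactly as it is; only the
# coefficients are re-estimated on more current data. Feature names and
# data are hypothetical.
FEATURES = ["ltv", "debt_to_income", "months_on_book"]

rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(8000, 3)), columns=FEATURES)
data["default_flag"] = rng.binomial(1, 0.03, size=len(data))

refreshed_model = LogisticRegression(max_iter=1000)
refreshed_model.fit(data[FEATURES], data["default_flag"])

# Comparing the refreshed coefficients with the original ones is a natural
# diagnostic in a 'model refinement' exercise.
print(dict(zip(FEATURES, refreshed_model.coef_[0])))
```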

Building and Tuning Overlays

In addition, there are many situations where a model may require a temporary or permanent overlay to modify its outputs before they are consumed by downstream processes. This is also referred to by some regulators as a ‘post model adjustment’.

Drivers of overlays include:

  • IFRS9 stage assessment
  • Portfolio scenario analysis or micro-stress testing
  • Adjusting to regime shifts such as that presented by the Pandemic
  • Differential treatment for special categories or segments within a portfolio

Banks have experimented with some quite sophisticated overlays that verge on being auxiliary models. Other overlays are effectively a structured override utility to implement credit policy. In many cases, these overlays are model-like in that they take inputs; process those inputs in some way; produce outputs; and are governed in some sense by parameters and/or logic. As such, they require parameter estimation and tuning. For some overlays, this may mean applying portfolio-level optimisations and heuristics to achieve a specific goal, or to ensure the attainment of a certain threshold.
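
As one hedged illustration of a parameterised overlay (a common pattern sketched with made-up numbers, not a description of any particular bank’s adjustment), a single shift in log-odds space can be tuned so that the portfolio-average PD attains a target set by policy or benchmarking:

```python
import numpy as np
from scipy.optimize import brentq

def apply_overlay(pd_in: np.ndarray, shift: float) -> np.ndarray:
    """Shift PDs by a constant amount in log-odds space."""
    log_odds = np.log(pd_in / (1 - pd_in)) + shift
    return 1 / (1 + np.exp(-log_odds))

def tune_overlay(pd_in: np.ndarray, target_avg_pd: float) -> float:
    """Find the log-odds shift that makes the average adjusted PD hit the target."""
    objective = lambda s: apply_overlay(pd_in, s).mean() - target_avg_pd
    return brentq(objective, -5.0, 5.0)

# Hypothetical model PDs and a portfolio-level target (e.g. from benchmarking).
model_pds = np.array([0.005, 0.01, 0.02, 0.04, 0.08])
shift = tune_overlay(model_pds, target_avg_pd=0.04)
print(shift, apply_overlay(model_pds, shift).mean())
```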

Radius has considerable experience in this area, from the conception and design of overlays to their tuning and subsequent validation.

Current Challenges

The State of Play in 2024

A number of factors make calibration particularly challenging at this point in time:

  • Post-Pandemic default behaviour is difficult to predict
  • High volatility in financial markets
  • Rampant global inflation
  • Appreciation of the US dollar
  • War in Europe
  • Poor Brexit outcomes for the UK
  • Sociological shifts away from conventional salaried work
  • Severe drought in Europe
  • Global challenges to food security

What is almost certainly true is that many portfolios are not behaving in the way they once did. As such, it’s possible that default risk now has a measurably different joint distribution of risk drivers and default indicators, which would also undermine the discriminatory power of models built prior to this period.

Whether this is a structural, persistent shift in the economy, with lasting damage to creditworthiness, is impossible to say at this point, but it’s certainly not a question that can be ignored. If the Pandemic were the only globally-destabilising event, it might be simpler to analyse. However, there are many negative, mutually-influencing forces in play at present, and a fast return to normality seems unlikely.

Going out on a limb, we think it’s extremely likely that we have entered a period in which many portfolios’ baseline level of risk has shifted upwards. Further, we think this displacement is more likely to be a semi-permanent regime shift than something transient in nature. This shift has obvious implications for model calibration: blind deference to historical averages will not yield estimates that are representative of default risk going forward, and modelling practice has to reflect that.

Points of Methodology

On the surface, calibration looks like the simplest of modelling exercises: a univariate model-build, probably regression, on relatively well-behaved data, with a well-defined objective …

… how hard could it be?

Honestly? Very.

Formally, calibration quality can be assessed with scoring rules, where the expected score of a probabilistic forecast P under the true distribution Q is:

$$S\left( P, Q \right)=\mathbf{E}_{Q}\left[ S\left( P, \omega \right) \right]=\int_{\Omega} S\left( P, \omega \right)\, dQ\left( \omega \right)$$
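
As one concrete instance (a standard example, not specific to any methodology described here), the Brier score is a proper scoring rule that can be computed directly from predicted PDs and realised outcomes:

```python
import numpy as np

def brier_score(pd_hat: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared difference between predicted PDs and realised 0/1 outcomes.

    The Brier score is a strictly proper scoring rule: in expectation it is
    minimised by forecasting the true default probabilities.
    """
    return float(np.mean((pd_hat - outcomes) ** 2))

# Hypothetical predicted PDs and realised default indicators.
pd_hat = np.array([0.01, 0.02, 0.05, 0.10])
outcomes = np.array([0, 0, 1, 0])
print(brier_score(pd_hat, outcomes))
```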

Calibration is an area of modelling that is easy to do at a very basic level, but actually very hard to do well; i.e. to a standard befitting a model predicting and pricing risk for a multi-billion dollar portfolio.

It requires a degree of precision that is not expected of the main classification role (performed by the core model), and real-world calibrations often need to meet a number of additional criteria, not just maximise goodness-of-fit. There is a right answer (in the sense that it is possible to implement a close-to-perfect calibration) and hence expectations can run high.

Some calibration set-ups require extensive time series modelling and satellite model builds that provide context for the main model build. Others need elaborate data preparation and benchmarking before modelling can even begin. The most advanced solutions need all of that and a lot more, and are themselves sometimes wrapped in an iterative scheme that demands relatively high levels of operational control.

Lastly, as indicated earlier, there are multiple use cases hidden inside the word “calibration”, and each of these scenarios brings its own complexity.

Variation in Model Architecture

There is a surprising variety of model architectures in the rating model space. The core of a model can have:

  • A purely statistical linear structure based on a dot-product;
  • Factors that themselves have a pre-computational structure;
  • A link function that wraps a linear kernel;
  • A tree structure (such as might come with a decision tree or any recursively partitioned model type);
  • A heavily discretised structure based on weights-of-evidence, or equivalent;
  • A hierarchical structure (as exhibited by ensembles, and other stacked, aggregated and/or boosted structures);
  • A neural network structure, of arbitrary complexity;
  • A rule-based structure, or one based on logical statements; or of course
  • A very basic subjective process at the heart.

In addition to the above (and sometimes driven by it), the ‘type’ of the output can vary as well. Raw model outputs can be purely numeric, model-specific scores; weights-of-evidence; un-calibrated probabilities; ordinal ratings; and so on.

The diversity of layouts, calculation flows and output types makes it exceedingly difficult to specify a blanket approach to calibration; a range of techniques is required.
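
One example of a technique that copes with this diversity (a sketch, not a blanket recommendation) is isotonic regression, which assumes only that the raw output is monotonically related to default risk and can therefore calibrate scores, weights-of-evidence or ordinal ratings alike:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical raw outputs (any monotone score will do) and default flags.
rng = np.random.default_rng(7)
raw_output = rng.normal(size=20_000)
default_flag = rng.binomial(1, 1 / (1 + np.exp(-(raw_output - 3.5))))

# Monotone, non-parametric mapping from raw output to calibrated PD.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_output, default_flag)

print(iso.predict(np.array([-2.0, 0.0, 2.0])))
```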

Rating Philosophy

Models have what we call a ‘cycle signature’ that reflects, primarily, the structure of the information flowing through the model. It’s an intrinsic property of the model; the information embedded in its inputs; and the temporal structure of rating operations.

In most models, this signature does not fully capture the credit cycle, nor does it miss the cycle; it is somewhere in between. In other words, a model’s unmodified cycle signature is neither purely ‘point in time’ (PIT) nor ‘through the cycle’ (TTC), but something that is typically called a ‘hybrid’.

In the early development of international capital regulations, this characterisation came to be called the ‘rating philosophy’ (RP). In our opinion, the RP is often not given the attention it deserves, nor is it defined precisely enough in methodology and policy documentation. Our technical blog (coming soon) will address these issues in more depth.

We call a calibration of the unmodified cycle signature a “Base” calibration and, despite there being some hidden complexity (see below), this is a relatively standard exercise. By contrast, for calibrations to the PIT and TTC ‘extremes’ of the cycle spectrum, specialised solutions need to be built that draw on more advanced data-handling, model architectures, and parameter estimation techniques.
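
As one widely-used illustration of the machinery involved (a Vasicek-style single-factor adjustment, presented here as an example rather than as our prescribed approach), a TTC PD can be converted to a PIT PD by conditioning on a standardised credit-cycle state z_t with asset correlation rho, signed here so that a larger z_t corresponds to a worse point in the cycle:

$$PD^{\mathrm{PIT}}_{t}=\Phi\left(\frac{\Phi^{-1}\left(PD^{\mathrm{TTC}}\right)+\sqrt{\rho}\,z_{t}}{\sqrt{1-\rho}}\right)$$

Inverting the same relationship provides a route back from a PIT-like PD to a cycle-neutral estimate.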

At Radius we use the term ‘calibration archetypes’ for the patterns (1) Base, (2) PIT and (3) TTC, discussed above.

Central Tendency Estimation

The most salient measure of a portfolio’s performance is arguably the ‘central tendency’ (CT) – the average default rate, at portfolio level, under some temporal assumptions. For relatively long periods, covering at least one cycle but preferably more, the CT is an intuitive, summary measure of a portfolio’s default risk.

On the surface, the CT would appear to be based on the most basic of calculations: it is simply an average of historic values. However, for this estimation to be a robust, meaningful analysis of average default risk, several considerations need to be taken into account:

  • Small absolute number of defaults
  • Intrinsically-low default rate
  • Partial cycle coverage
  • Asymmetric cycle structures
  • Complex trend structures
  • Large-scale macroeconomic events
  • Changes in portfolio composition
  • Historical changes to underwriting policy
  • Complex segment dynamics
  • Alignments to external benchmarks
  • Changes to the default definition

Add to these the more basic facts that a scalar measure of the default rate is a profoundly incomplete picture of risk, and that there are nuances even in computing its variance, and it becomes clear that the simple average is a fundamentally limited measure.

One technique for estimating a portfolio’s central tendency is to model past default rates as a function of macroeconomic data. This technique, long used by modellers, is colloquially known as ‘back-casting’ and has recently been recognised by at least one regulator. Back-casting is a time series modelling task, of arbitrary complexity, and requires the same care as any full model build.
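
A minimal back-casting sketch (synthetic data and hypothetical macro series names) regresses observed default rates on macro drivers over the window where both exist, predicts default rates for earlier periods where only the macro data is available, and then takes the central tendency over the extended series:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical quarterly data: macro drivers cover 2000-2023, but observed
# default rates (ODRs) only cover 2014 onwards.
periods = pd.period_range("2000Q1", "2023Q4", freq="Q")
rng = np.random.default_rng(3)
macro = pd.DataFrame(
    {"unemployment": rng.normal(6, 1.5, len(periods)),
     "gdp_growth": rng.normal(2, 1.0, len(periods))},
    index=periods,
)
odr = pd.Series(rng.uniform(0.01, 0.04, len(periods)), index=periods)
odr[odr.index < pd.Period("2014Q1", freq="Q")] = np.nan   # default history is short

# Fit on the overlapping window, then back-cast the missing early default rates.
observed = odr.dropna()
model = LinearRegression().fit(macro.loc[observed.index], observed)
backcast = pd.Series(model.predict(macro), index=periods)
extended_odr = odr.fillna(backcast)

# Central tendency over the (now longer) history.
print(extended_odr.mean())
```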

When default data is limited, for whatever reason, and/or a portfolio’s history is short, modellers need to use other, possibly non-quantitative information to estimate the CT. Subjective expert opinion and external benchmarks are two obvious remedies, but integrating these auxiliary sources requires special care to be defensible and to meet other requirements (such as the quantification of any margin of conservatism, or MOC).

Modelling the Cycle

Modelling and forecasting the credit cycle is absolutely pivotal to non-Base calibrations, and it is a relatively complex task, especially for low-default portfolios. We would argue that modelling the evolution of observed default rates (ODRs) is also of immense value for Base calibrations.

Calibration architectures that condition on the cycle are non-trivial frameworks that require great care in their specification and development. Time series modelling is a discipline in its own right, encompassing a wide range of parameter estimation techniques, model architectures, statistical tests and data structures.
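
A deliberately minimal sketch of the time-series element (an AR(1) fitted to the portfolio ODR in log-odds space, with made-up figures, purely to illustrate the shape of the task):

```python
import numpy as np

def fit_ar1(series: np.ndarray) -> tuple[float, float]:
    """Estimate x_t = c + phi * x_{t-1} + e_t by ordinary least squares."""
    y, x = series[1:], series[:-1]
    X = np.column_stack([np.ones_like(x), x])
    c, phi = np.linalg.lstsq(X, y, rcond=None)[0]
    return c, phi

def forecast_ar1(last_value: float, c: float, phi: float, steps: int) -> list[float]:
    out, x = [], last_value
    for _ in range(steps):
        x = c + phi * x
        out.append(x)
    return out

# Hypothetical quarterly ODRs, modelled in log-odds space so forecasts stay in (0, 1).
odr = np.array([0.012, 0.015, 0.021, 0.028, 0.031, 0.026, 0.019, 0.016, 0.018, 0.022])
log_odds = np.log(odr / (1 - odr))
c, phi = fit_ar1(log_odds)
forecast = forecast_ar1(log_odds[-1], c, phi, steps=4)
print(1 / (1 + np.exp(-np.array(forecast))))   # back to default-rate space
```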

Rating Scales

The master rating scale (MRS) is a core component of a properly-constructed rating system, and it is a relatively basic structure at first glance. However, despite this apparent simplicity, there is still great scope for making errors in specifying a rating scale, and consequently it is a component that often benefits from refining or even re-building.

Periodic validation of this mapping is an essential exercise, and all the usual challenges that attend the estimation of low default rates can complicate or limit this process. A computationally-robust validation approach, one that remains defensible when default counts are low, is therefore essential.
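
One standard building block for that validation (a sketch with illustrative figures) is a one-sided binomial test of each grade’s observed defaults against the PD assigned to that grade; note that this simple form ignores default correlation, which a production test would need to address:

```python
from scipy.stats import binom

def grade_pvalue(n_obligors: int, n_defaults: int, assigned_pd: float) -> float:
    """One-sided p-value: probability of observing at least this many defaults
    if the grade's assigned PD were correct (small values flag under-estimation)."""
    return float(binom.sf(n_defaults - 1, n_obligors, assigned_pd))

# Illustrative grades: (number of obligors, observed defaults, assigned PD).
grades = {"A": (1200, 2, 0.001), "B": (800, 6, 0.005), "C": (300, 9, 0.020)}
for grade, (n, k, pd_) in grades.items():
    print(grade, round(grade_pvalue(n, k, pd_), 4))
```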

Data Augmentation

Where portfolios have a low absolute number of defaults, and/or limited default history in general, it may not be possible to run some statistical processes in an eyes-closed manner. Consideration has to be given to data inadequacies.

In some cases, the data is of such poor quality, and/or quantity, that other methods need to be employed that are outside of conventional, frequentist statistics. Experts and external benchmarks can provide context but are not typically in a form that is easy to use; in fact this information can be a long way from a form that supports parameter estimation. The capture and synthesis of these elements into the modelling flow requires deep domain expertise and knowledge of advanced statistical techniques.
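
One common way to formalise such a synthesis (a minimal sketch; the prior parameters stand in for an elicited expert view and are purely illustrative) is a conjugate beta-binomial update, in which the expert’s central PD estimate and its weight define a Beta prior that is then combined with the observed default count:

```python
from scipy.stats import beta

# Expert view, expressed as an equivalent prior sample: a central PD estimate
# of 1.5% held with the weight of roughly 400 hypothetical observations.
prior_pd, prior_weight = 0.015, 400
a0, b0 = prior_pd * prior_weight, (1 - prior_pd) * prior_weight

# Observed (low-default) portfolio data.
n_obligors, n_defaults = 250, 1

# Conjugate update: posterior is Beta(a0 + defaults, b0 + non-defaults).
a1, b1 = a0 + n_defaults, b0 + (n_obligors - n_defaults)
posterior_mean = a1 / (a1 + b1)
upper_90 = beta.ppf(0.90, a1, b1)   # a simple margin-of-conservatism anchor

print(round(posterior_mean, 4), round(upper_90, 4))
```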

We have considerable experience with formal elicitation of expert opinion and the modelling of these inputs alongside other exogenous sources, and the core data, to produce robust calibration solutions.

Self-Assessment

Pause for a moment and ask yourself the following questions:

  • Do you have all calibration requirements covered and/or fully explored?
  • Do you have a comprehensive calibration framework, with methodologies matched to each special portfolio type and data situation? For example, do you have a methodology for PIT calibration? Can you calibrate your low default portfolios in a robust, defensible manner?
  • Can you describe your institution’s Rating Philosophy (RP) in depth? Can you quantify that description, or do you currently use more qualitative terms? Does this RP suit your total requirement in terms of all the downstream utilisation of PDs?
  • Do you have a clearly-defined architecture for the overall model + calibration array?
  • Are you completely on schedule with all potential calibrations? Or is there some calibration work in the general modelling backlog?
  • How frequently do you calibrate the model set?
  • How thoroughly does your validation framework analyse calibration accuracy?
  • Are you forecasting ODRs with sufficient precision and frequency? More broadly, are you modelling and forecasting the credit cycle?
  • Do you know how to estimate a central tendency (CT) under heavy data-quality constraints?
  • Is your methodology for TTC producing smooth, level PDs? Or something more like a dampened hybrid?
  • What is your strategy for managing exotic recovery patterns (L-shaped) post pandemic?
  • How will you use historical data going forward? Will you re-base? Or make the blind assumption that history remains representative?
  • What is your approach to implementing or injecting a MOC? Do you have a specific methodology for that?
  • Approximately how long does it take you to re-calibrate a model? And from the point of validation?