Joint and Survivor Annuity Valuation with a Bivariate Reinforced Urn Process

We introduce a novel way of modeling the dependence of coupled lifetimes, for the pricing of joint and survivor annuities. Using a well-known Canadian data set, our results are analysed and compared with the existing literature, mainly relying on copulas. Based on urn processes and a one-factor construction, the proposed model is able to improve its performances over time, in line with the machine learning paradigm, and it also allows for the use of experts’ judgements, to complement the empirical data.


Introduction
We propose a novel nonparametric approach to the modeling of joint and survivor annuities, particularly useful in the case of right-censored observations. We build on previous results by Bulla et al. (2007); Muliere et al. (2000); Walker and Muliere (1997). The goal is to propose an alternative to copulas and other parametric constructions already used for the modeling of coupled lifetimes in insurance, for example in Frees et al. (1996), Carriere (2000) and Luciano et al. (2008), to cite a few important contributions.
The basic ingredients of our approach are Pólya urns (Mahmoud, 2008), probabilistic objects with the ability of intuitively representing the idea of learning via reinforcement. As observed in Cirillo et al. (2013) and Cheng and Cirillo (2018), urns are able to learn hidden patterns in the data, discovering previously ignored features. The model here presented can actually be seen as a non-conventional machine learning algorithm (Murphy, 2012).
Differently from the great majority of machine learning algorithms, our Bivariate Reinforced Urn Process (B-RUP) belongs to a particular class of models (Muliere et al., 2000;Walker and Muliere, 1997) with the extremely useful ability of combining some a priori knowledge-possibly referring to experts' judgements-with the information coming from actual data, perfectly in line with the Bayesian paradigm. This possibility allows for the incorporation of trends, tail events or other aspects that can be rarely observed in a data set-hence invisible to standard machine learning approaches, yet possible and with dramatic consequences (Taleb, 2007). For instance, if an expert thinks that their data under-represent a given phenomenon, like for example some unusual lifetime combinations in the modeling of survivor annuities, they could solve the problem by eliciting an a priori assigning a higher mass to those combinations, thus obliging the model to always take into account such a possibility, even if rarely (or not) present in the data. In a sense, a clever use of the priors can thus be an elegant way of dealing with sampling and historical bias (Derbyshire, 2017;Shackle, 1955).
A simple and quite popular extension of the single annuitant case we have just described is given by annuities involving more annuitants, most of the times just two. These are often the members of a married couple or of a partnership, but parent-child and relative-relative combinations are also common (Frees et al., 1996;Brown and Poterba, 2000). An example is offered by joint and last-survivor annuities, in which the insurance company pays as long as at least one of the two annuitants is alive. In this kind of contracts, the death of one of the annuitants may have an impact on the stream of payments that the survivor receives. For instance, for p ∈ [0, 1], in a 100p% (e.g. 50%) joint and survivor annuity contract, payments are made in full while both annuitants are alive, but if one of the two dies, only 100p% (e.g. 50%) of the original amount is paid to the survivor (Sheshinski, 2007;Winklevoss, 1977). A popular version is the contingent joint and survivor annuity, which pays the full benefit as long as the first annuitant (plan member) is alive, and then changes the benefit to a portion which is usually one-half or two-thirds.
From an actuarial point of view, to evaluate a joint and survivor annuity, one needs the marginal and joint survival functions of both annuitants. Let X and Y be two random variables representing the lifetimes of two annuitants in a joint and survivor annuity. Assume that today the age (say in years) of the first annuitant is observed to be x, while for the second we have y. The joint probability of survival for another k years, given the current ages, is S (x,y) XY (k, k) := P(X > x + k, Y > y + k|X > x, Y > y), (1) where k = 0, 1, ..., K.
We can then look at the probability that at least one of the two annuitants survives another k years, i.e. P (x,y) where the subscript LS is the acronym of "last survivor", with S (x,y) Y (k) = P(Y > y + k|X > x, Y > y) indicating the marginal survival probability of the second annuitant given the present situation of the couple, and similarly S (x,y) X (k) for the first annuitant. It is important to stress that, in Equation (2), both marginal probabilities are conditioned on the ages of both annuitants. This is because, in the general situation where the lifetimes X and Y share some kind of dependence, we have that S (x,y) X (k) = S (x,0) X (k) (Youn et al., 2002).
If we assume a constant interest rate r, the basic one-unit pricing formula for a 100% joint and last-survivor annuity is given by In the 100p% situation, with 0 < p < 1, this formula is easily adjusted (Winklevoss, 1977).
Equations (2) and (3) imply that, in order to evaluate joint and survivor annuities, one needs to specify the dependence between the lifetimes of the annuitants. If we assume that their lifetimes are independent, then S (x,y) Y (k), and the estimation of the marginal probabilities of survival is all that is needed. Since the estimation of univariate survival functions is a widely studied and well-known topic (e.g. Cox and Oakes (1984)), pricing annuities under independence makes both modelling and computations considerably easier (Winklevoss, 1977), but not necessarily realistic.
That of independence is indeed always a strong assumption. As Frees et al. (1996); Carriere (2000); Luciano et al. (2008); Sanders and Melenberd (2016) and others have shown, the lifetimes of several couples in annuity contracts present a non-negligible positive dependence. Moreover, some studies suggest that annuitants tend to have higher individual survival probabilities with respect to people who do not buy annuities (Mitchell et al., 1999). Such a result opens to the possibility that also the dependence structure among couples that buy annuities, and couples that do not, differs. Although there is no conclusive study about differences in the coupled lifetimes of annuitants versus non-annuitants, Sanders and Melenberd (2016) observe that the magnitude of the positive dependence for a random subset of 50, 000 couples-not necessarily annuitants-sampled from the whole Dutch population is significantly smaller than what Frees et al. (1996) and Luciano et al. (2008) find for about 15000 couples of annuitants in Canada 2 .
Regardless of its magnitude, there are several reasons that explain, at least qualitatively, why the dependence in the lifetimes of couples (especially married couples and partnerships) is positive. Some of them are rather intuitive, like the possibility of both individuals dying at the same time because of an accident or a contagious disease, or the common habits they might share as a consequence of living together, like the food they eat or the environment they live in. Others are more linked to emotional, psychological and medical aspects, like the "broken heart syndrome", a well-known phenomenon in medicine, in which the sudden passing of one spouse greatly increases (at least temporarily) the probability of death of the surviving one because of bereavement and sadness (Jagger and Sutton, 1991).
A question that naturally arises, when we consider the several possibilities for the modelling of joint mortality, is how these affect the annuity price in Equation (3). This question is well-addressed in Frees et al. (1996), where they compute the ratio between the annuity price under positive dependence and the price under independence (we shall call it annuity ration from now on). To model dependence, they make use of copula models with Gompertz and Weibull marginals, and they find out that the annuity ratio varies between approximately 0.95 and 1.05, depending on the initial ages of the annuitants at contract initiation. This means that, if one assumes independence to simplify modelling, they may end up with the underestimation or the overestimation of the annuity price by as much as a 5%. Compatible results are more recently obtained by Sanders and Melenberd (2016), even if the error they find is smaller than 5% on average, in absolute terms, for different types of contracts. Furthermore, the same authors underline that the error can be particularly larger and relevant for specific types of annuities, in which the joint survival distribution plays a major role, like for instance joint annuities.

Right-censoring
In survival analysis, right-censoring is a well-known problem in estimation (Kaplan and Meier, 1958). This usually occurs when individuals drop from the study before the event of interest has happened. If X is our age variable, when right-censoring occurs, the minimum X * = min(X, T X ) between X and a random censoring time T X is observed 3 . Every potentially censored observation is therefore represented by the couple (X * , δ), where the indicator δ tells if a given observation is censored (δ = 0) or not (δ = 1). Left-censoring can also be defined, but it is less relevant here. For more details see Klein and Moeschberger (2003).
While univariate censoring has been extensively studied in the literature, e.g. in Kaplan and Meier (1958);Elandt-Johnson and Johnson (1980) or Cox and Oakes (1984), far less progress has been made in the bivariate case. One of the first works to study this problem is that of Dabrowska (1988), in which the Kaplan-Meier estimator is extended to the two-dimensional framework. However, this estimator is known to produce negative masses when a large amount of censoring is present (Pruitt, 1991). In Wang and Wells (1997) or Gribkova and Lopez (2015), the censoring conditions are relaxed to obtain better estimators. For example, in the context of coupled lifetimes, Gribkova and Lopez (2015) assume that the age difference between individuals is known and related to the censoring variables; in Lin (1993), conversely, the censoring variable is taken to be the same for both lifetimes. Other recent articles on nonparametric estimators under bivariate censoring are those of Lopez (2012) and Shen and Yan (2008). In all these works, censoring is assumed to be independent from the variables of interest, an assumption we will also make in our construction.
In line with the univariate case, in the presence of bivariate censoring one plays with the quartet (X * , δ, Y * , ), where is the censoring variable associated to Y , and Y * = min(Y, T Y ). The variables (X, Y ) are assumed independent from (T X , T Y ).

The Bivariate Reinforced Urn Process
The Bivariate Reinforced Urn Process (B-RUP) is our proposal for the modeling of joint and survivor annuities. Originally introduced by Bulla (2005); Bulla et al. (2007), it is here extended and adapted to the annuity framework.
We will show that this model is not only able to replicate the results one can obtain with other existing methodologies, like copulas, but, at least withing the realm of positive dependence, it also allows for a more flexible modeling, as it improves its performances over time, and it allows for the exploitation of experts' judgements 4 .
To describe the model, we start by defining its main building blocks: the beta-Stacy process of Walker and Muliere (1997), and the Reinforced Urn Process (RUP) of Muliere et al. (2000). Then we show an intuitive representation of RUPs via urns. Finally we discuss how several RUPs can be easily combined into a one-factor construction leading to the B-RUP.

Reinforced Urn Processes and beta-Stacy processes
The Reinforced Urn Process was first described in Walker and Muliere (1997), where the results of Blackwell and MacQueen (1973) and Ferguson (1973) are extended to right-censored data, thus defining the so-called beta-Stacy process, a neutral-to-the-right (Doksum, 1974) generalization of the Dirichlet process (Ferguson, 1973).
The RUP is a combinatorial stochastic process, and it can be seen as a reinforced random walk over a state space of urns. Depending on how its parameters are specified, it can generate a large number of interesting models. Essential references on the topic are Muliere et al. (2000Muliere et al. ( , 2003 and Fortini and Petrone (2012). Examples of applications can be found for example in Cirillo et al. (2013) and Peluso et al. (2015).
In this paper we specify a reinforced urn process able to generate a discrete beta-Stacy process, a particular random distribution over the space of discrete distributions. Definition 4.1 (Walker and Muliere (1997)). A random distribution function F is a discrete beta-Stacy process with jumps at j ∈ N and parameters {β j , ω j ∈ R + , j ∈ N}, if there exist mutually independent random variables {V j } j∈N , each beta distributed with parameters (β j , ω j ), such that the random mass assigned by F to {j}, written F ({j}), is given by V j i<j (1 − V i ).
Following Definition 4.1, we introduce couples {β j , ω j } ∈ R + × R + , with j ∈ N, such that β j , ω j ≥ 0, β j + ω j > 0, and lim n→∞ n j=0 ωj βj +ωj = 0. Then, given a discrete beta-Stacy process F with parameters {β j , ω j ∈ R + , j ∈ N}, and a sample (X * n , δ n ) of exchangeable 5 and potentially censored observations, with X * n = {X * n , n ≥ 1}, the sequence is the number of observations at x ≥ j. In the context of survival analysis, Equation (4) gives the probability of survival of a new individual (X n+1 ), conditioned on the lifetime observations of previous individuals (X * n ). Notice that, by defining β * j = β j + m * j (x n , d n ) and ω * j = ω j + s j (x n ) − m * j (x n , d n ), we can obtain a new beta-Stacy process F * with parameters {β * j , ω * j ∈ R + , j ∈ N}. The beta-Stacy process is thus conjugate and neutral-to-the-right (Doksum, 1974). The importance of conjugacy is to be stressed, as it allows for continuous updating of the process over time, every time new observations become available. In other words, the posterior one can obtain as the result of a first cycle of Bayesian updates can then represent the a priori of a new cycle, without the necessity of restarting everything from scratch (Muliere et al., 2000).

Urn representation
Following Muliere et al. (2000), we can show that Equation (4) can be obtained via a sequence of two-color Pólya urns. This intuitive representation makes the beta-Stacy process easier to grasp. To further simplify the treatment, we momentarily ignore censoring.
Imagine we have M + 1 urns containing balls of two colors (red and green), as in Figure 1a (hence M = 4). Urn 0 only contains green balls. To generate a RUP, we move along the urns according to the following rules: • The process starts from Urn 0.
• For each urn, the probability that a given color is selected simply depends on the urn composition at the time of sampling.
• Every time we pick a ball, we note the color, and we reinforce the urn, that is we put the ball back together with an extra ball of the same color. This increases the probability of picking that color again in the future.
• If the color of the ball is green, we move forward to the next urn, which we sample to continue our walk. Conversely, if the color is red we must go back to Urn 0, and the process starts anew. Every time we restart from Urn 0, we define a new cycle for the RUP.  Figure 1: Representation of the RUP as a series of Pólya urns. After each sampling the urns are updated in a way that reinforces the probability of that sampling, giving the RUP its name.
In Figure 1b we see an example of a possible trajectory (•, •, •) and the resulting urn composition after each sampling. Notice that the probability of observing the same path in the next cycle has increased, thanks to the reinforcement mechanism. Please notice that from Urn 0 we necessarily move to Urn 1, while from the all other urns, we can move forward or jump back to Urn 0.
If we define the age X of death of an individual as the Urn X where we sample a red ball, and every urn corresponds to one year of life, it is easy to see that the survival probability given by the urn construction above and that of Equation (4) coincide, under no censoring. The pairs {β j , ω j } would correspond to the initial numbers of balls of each color for the j-th urn, and the functions s j (·) and m * j (·, ·) to the extra numbers of balls due to sampling. Further, the exchangeability of the observations is implied by the evolution of the Pólya urns (Mahmoud, 2008).
As said, the urn construction just analyzed generates a RUP without right-censoring. The possibility of censored observations can however be easily introduced: in case of a right-censored value at Urn j, we add extra green balls from Urn 0 to Urn j (included) and no red balls. Then we start again from Urn 0.
At every cycle, which for us represents the lifetime of an individual, the RUP learns and keeps memory of what happened before, thanks to the Pólya reinforcement, combining the information from samplings with the initial compositions of the urns (the a priori). In this way what we learn from a group of people can be used to make inference about other people.
In Muliere et al. (2000), the properties of the urn construction, including the conditions for recurrence, are studied in detail.

Defining a prior
One of the characteristics of the beta-Stacy process, inherited from the Dirichlet process, is that its trajectories are centered around a certain probability distribution G(·), that is E[F ({j})] = G({j}), which-in Bayesian terms-plays the role of the prior. As shown in Walker and Muliere (1997), a necessary condition for this is that where G(j) = P G (X ≤ j) is the probability that X is at most j under the centering measure.
The prior distribution G can also be used to define the initial compositions in a RUP. Following Walker and Muliere (1997) and Muliere et al. (2000), it is sufficient to set with c j denoting the so-called strength of belief in the prior knowledge, and G({j}) = P G (X = j). Notice that, given Equation (6), the necessary conditions for the couples {β j , ω j } to properly define a beta-Stacy process are automatically verified, if G(·) is a well-defined probability distribution.
By giving high values to the strength of belief c j , larger amounts of data are required in order for the posterior distribution, obtained via sampling and given in Equation (4), to move away from the a priori. On the contrary, when c j → 0, Equation (4) tends to the Kaplan-Meier (KM) estimator of Kaplan and Meier (1958). As observed in Walker and Muliere (1997), setting c j = c, ∀j, leads to a Dirichlet prior.

Building dependence
We have seen how the RUP model can be used to model lifetimes of single individuals. However, the ultimate goal of this work is the modelling of coupled lifetimes, and for that reason we need to extend the RUP so that it can deal with this type of problems. We can do so by using the one-factor construction introduced in Bulla (2005): the Bivariate Reinforced Urn Process (B-RUP).
Assume we observe n couples of (possibly) censored lifetime data of the form ((X * n , δ n ), (Y * n , n )), where X * n and Y * n are n-dimensional vectors of observations, and where δ n and n are the corresponding vectors of indicators for right-censoring, as defined in Section 3.
Let A n , B n and C n be three independent RUPs, defined exactly as in Section 4.1 and with their respective parameters One can easily create a bivariate model of the form The lifetimes of X i and Y i are thus composed of a common quantity A i and two personal elements, B i and C i . In this way, the dependence between X i and Y i relies entirely on A i , and thus, conditioned on A i , X i and Y i are independent.
A straightforward calculation yields thus implying that the B-RUP construction can only model positive dependence. For the application at hand, that is the joint modelling of lifetimes of coupled annuitants, this is not a problem, since previous studies have shown a strictly positive dependence (Frees et al., 1996). However, it is important to stress that the B-RUP model should not be used when negative dependence is also possible.
In Bulla (2005), the sequence {(X n , Y n ), n ≥ 1} is shown to be exchangeable (given that the RUPS A n , B n and C n are exchangeable by construction), and therefore, by the de Finetti representation theorem (de Finetti, 2017), there exists a joint random distribution function F XY conditionally on which the elements of (X n , Y n ) are independent and identically distributed according to F XY . The properties of F XY have been studied in detail in Bulla et al. (2007).
Let F X and F Y be the marginal distributions of X and Y , respectively. Clearly, we have with × denoting the convolution operation, so that both F X and F Y are convolutions of beta-Stacy processes. Furthermore, if P is the probability function corresponding to F , one has Although the joint posterior distribution of Equation (9) can in principle be computed analytically 6 (Bulla et al., 2007), its complexity grows quickly with the number of observations, making it already unfeasible for a relatively small sample. A convenient way of by-passing the problem is therefore to use a Markov Chain Monte Carlo (MCMC) approach, also known as Gibbs sampler, to obtain the joint probability distribution from the posterior distributions of A n , B n and C n , using the one-factor construction.
The MCMC method consists of the following steps: n , with the superscript referring to the iteration number, at the k-th iteration conditioned on the available data of the previous iteration. That is, sample A n from its conditional distribution, i.e.

P[A [k]
n = a n |A where, using the fact that P and with the conditional distribution of C n defined in an analogous way. Since n is an exchangeable process, Equation (10) also applies for any element of A n by simply changing A using the values of A n has been obtained. Once this is done, n . Note that, defined in this way, right-censoring is only applied to B n and C n , and not to A [k] n . As already underlined in Frees et al. (1996) and Carriere (2000), such an assumption appears acceptable from an empirical point of view.
2. Create a new combination (A k , B k , C k ) from the respective conditional marginal distributions, as per Equation (4).
4. Repeat steps 1-3 until k reaches the maximum number of iterations N .
Notice that the previous algorithm requires, besides the (possibly) censored observations of (X, Y ), an initial sample for A, that is, n . Since A is not observed in practice, a sensible approach is to run the MCMC several times for different samples A

Empirical results
In this section we apply the B-RUP model to calculate the dependence of coupled lifetimes, in order to price joint and survivor annuities.
We first consider an analytical example, where we sample possibly censored observations from a known distribution. Knowing the original distribution will not only allow us to estimate the difference between the original distribution and the posterior computed through MCMC, but also to study how these differences affect the final price of the annuity.
In the second part of the section, the B-RUP is tested on a well-known Canadian data set originally used by Frees et al. (1996), which we will also use as a benchmark.

Analytical example
Assume that X and Y are linked through the one-factor model of Equation (7). Assume that A follows a Poisson distribution with intensity parameter λ equal to 25 (from now on we use the notation A ∼ Poi(25)), and that B ∼ Poi (35) and C ∼ Poi(40). This implies that the marginals of X and Y are Poisson distributions with parameters 60 and 65, respectively; and that Cov(X, Y ) = Var(A) = 25. The correlation between X and Y is therefore approximately 0.4. Moreover, while not necessary true in general, we will make some assumptions about the dependence between the censoring variables, T X and T Y (see Section 3), so that we can generate censored observations and better study the properties of our construction. We start from the naive observation that the two annuitants of a joint and survivor annuity enter the contract at the same time; say that X 0 is the age of the first annuitant at the date of the signature, while Y 0 is that of the second person. If we denote the observation period by ∆, it is clear that T Therefore, to obtain right-censored observations, we have to choose the distributions of T X , ∆ and θ. In what follows we take T X ∼ Poi(50) , ∆ ∼ Poi(2) and θ = θ * − θ 0 , with θ * ∼ Poi (7) and θ 0 = 5. We then sample n = 10 4 couples of observations, of which almost 90% turn out to be at least partially censored (i.e. at least one of the annuitants in the couple is right-censored).
Following the procedure defined in Section 4, the first step to use the B-RUP is to define prior distributions for the target variables according to Equation (6), as well as the strength of belief parameters. The priors we choose are (20) and G C = Poi (20), respectively. For the strengths of belief, c k with k ∈ {A, B, C}, we consider two different scenarios: one where the strengths of belief are very small, so that the posterior distribution is strongly affected by the incoming data, as if we did not trust our a priori; and another scenario where the strengths of belief are big enough to keep memory of the a priori-which we trust-and, in particular, to influence those areas in the data with less observations. As said, the a priori can indeed be used to complement the data with information about rarely observed events and trends. In Table 1 the values of the strengths of belief for both scenarios 7 are provided. In the following we will refer to them as the "low belief" and the "high belief" scenarios, or with the subscripts l and h, when referring to the results obtained with the B-RUP model. For example, when we write B-RUP l , we refer to the B-RUP estimators in the low belief scenario. B-RUP h is defined analogously for the high belief scenario.
In order to initialize the Gibbs sampler, we also need to define an initial sample for A, which is not observable in practice. Therefore, this sample will be "artificial", and we need a reasonable way of generating it, taking into consideration the characteristics of A, like for example its positiveness. Moreover, the resulting posterior distribution will be affected by this artificial sample, and thus in practice it is recommended to compute the posterior for several appropriate initial samples. In our case, we will sample observations from the same distribution that we chose for the prior of A, so that, for this particular example, we generate 10 4 samples from a Poi(20).
Low belief High belief c A , c B , c C 10 −6 10 2 Table 1: Proposed strength of belief scenarios. Notice that, since we are considering a data set of size 10 4 , even under the high belief scenario, the prior will not be able to actually affect those areas where the observations are more concentrated, but it will definitely influence the areas with fewer data points.
In Figure 2 we show the recovered marginal distribution using the B-RUP via MCMC, as well as the KM estimators and the original distribution. Notice how, due to the lack of exact samples on the right tail of the distribution, the KM estimator is not properly defined on the whole support of the original distribution, but only up to the maximum value of the uncensored observations. The B-RUP model, on the contrary, is able to recover the right tail of the distribution, although the fitting is clearly worse in that part of the curve, because of the evident lack of observations. If available, a better a priori could here be used to improve tail fitting.
We also present the joint distribution in Figures 3a and 3b, for the low and high belief scenarios, respectively. Notice how, in Figure 3a, the fitting is worse in the upper-right corner than in the lower-left corner, in accordance with the right-censoring effect. For the high belief scenario, however, the prior distribution dominates in those areas of few observations and we recover a smoother surface, as per Figure 3b.
A more quantitative comparison can be found in Table 2, where we show the means, the variances and the correlation for both B-RUP scenarios, as well as for the KM estimator, the theoretical analytical solution, and the data. In this last  case we provide the sample estimates under two different points of view: 1) correctly using only the truly uncensored observations ("Uncensored"), and 2) taking into consideration the entire data set, ignoring the presence of censoring ("Whole"). While this second approach is not correct (Klein and Moeschberger, 2003), it can be heuristically useful to better understand the impact of right-censoring in the data. For example we can clearly see that not considering censoring can lead to a serious overestimation of the dependence in the data.
Always in Table 2, observe how the B-RUP is able to recover the correlation, under both scenarios 8 , something not possible when using the KM approach. Moreover, observe how the B-RUP, in particular under the low strength of belief, better captures the variability of both X and Y .
In Table 3 we show permutation tests performed on the means and the variances. Under the null hypothesis of no difference between the obtained estimates and the true analytical values of Table 2, the tests never reject the null for the B-RUP estimators under the low strength of belief scenario, at a standard 5% significance level. These permutation tests where performed simulating samples of size 10 3 from both the B-RUP (low and high strength of belief) and the KM marginals. For the mean, the test statistic used is the absolute difference between the means, while for the variance Good's test (Good, 1994) is employed. Looking at Table 3, we can conclude that the B-RUP with low strength of belief even beats the KM, which is not able to correctly recover the variance of Y . If we also take into account that the KM approach cannot estimate the correlation, as already discussed for Table 2, the B-RUP l performance is even more appreciable.
Further, observe that, since the a priori is considerably different from the analytical solution, it is no surprise that, for the high strength of belief case, the B-RUP h performances worsen for the variances of both X and Y . As known, the variance is in fact particularly sensitive to discrepancies in the tails, and we already said that the B-RUP h puts a non-negligible mass on the right tail, which is never really updated by the data. Once again we want to stress that this is actually a sign of the power and flexibility of the B-RUP, in case of a credible a priori (clearly not the case in this very simple example).
Once we have the joint distribution of the pair (X, Y ), we can price annuities using Equation (3). Following Frees et al. (1996), we are interested in the annuity ratio between the price when assuming dependence and the price when assuming independence. For the independent scenario we simply use the marginals obtained in Figure 2, while for the dependent scenario we use the joint distribution of Figure 3a. Moreover, according to Equation (3), once we know the joint distribution, the annuity price depends on three quantities: the ages of the two annuitants at the beginning of the contract, which we call entry/initial ages, and the interest rate. As in Frees et al. (1996), we first show the dependence of the annuity ratio for several initial ages with a fixed interest rate, and then we vary the interest rate and assume that both annuitants have the same age, so that we can still represent the results in a 3D plot.
In Figures 4a and 4b, we show the annuity ratio obtained with the B-RUP construction and the true (analytical) distribution for an interest rate of r = 0.05. Notice how, in this example, when considering dependence, the price can be up to 7% higher than in the independent scenario for some initial ages, especially when the age difference is large.  Table 2: Comparison of the means, the variances and the correlation of X and Y using the uncensored data, the whole data set (wrongly assuming no censoring), the KM estimator, the B-RUP estimator and the original analytical solution, respectively.  Table 3: Permutation test for the means and the variances of X and Y , using samples of 10 3 observations with 10 4 permutations. The sample are taken from the KM estimator and the B-RUP with low and high strength of belief. In bold we underline the situations for which the null hypothesis of no difference between the estimates and the true values (from the analytical model) is rejected, for a 5% significance level.
For smaller age differences, conversely, the annuity ratio takes values below the unit. For low initial ages for both annuitants, the death event is still far away with high probability, and thus the annuity ratio barely moves from the unity in that area.
Looking at Figures 4a and 4b, it is precisely for high annuity ratios that the B-RUP estimators differ more from the analytical ones. This, again, is a consequence of the censored observations, that do not allow for a proper estimation due to the lack of information about the right-tail of the distribution 9 . In Table 4 we present the absolute value differences between the B-RUP and true estimators for several initial ages and a 5% interest rate.  Table 4: Absolute value differences in the annuity ratios obtained with the B-RUP model and the analytical distribution for several initial ages of the annuitants (Fist Annuitant -FA, Second Annuitant -SA) with an interest rate of 0.05. Given the small differences, we can say that the B-RUP is able to recover the "truth" from the data.
Finally, in Figures 5a and 5b, one can find the annuity ratio as a function of the interest rate for the same initial age for both annuitants. In accordance with the previous results, the annuity ratio is above one for large initial ages, and below one for low initial ages. Furthermore, this effect seems to be more pronounced for small interest rates, since the curves are steeper than for high interest rates. This tendency is naturally the same for both the B-RUP estimator and the original distribution.    shows the price ratio using the actual distribution (via the known analytics). Note how the ratio differs more between the two plots for high initial ages, since, because of the censored observations, the B-RUP fitting is worse in that area. The result is nevertheless quite positive. Frees et al. (1996) Let us now consider a real-world application, using a well-known data set, initially presented in Frees et al. (1996), and later also studied in the relevant works of Youn and Shemaykin (1999), Carriere (2000), Youn and Shemaykin (2001) and Luciano et al. (2008), among others.

Canadian data set of
The data set contains 14,497 contracts from a large Canadian insurer, and the period of observation runs from December 29, 1988, until December 31, 1993. To simplify the interpretation and the comparability of our results with the previous works of Frees et al. (1996) and Luciano et al. (2008), we have removed same sex contracts and, for every couple, we have kept only one contract, leaving a total of 11,454 contracts. Further, we have ignored couples were the annuitants age was less than 40 at the end of the observation period, thus excluding a very limited number of parent-child annuities.
All in all we have ended up with a data set of 11,421 male-female couples.
For each couple, the following information is available: dates of birth, dates of death or ages at the end of observation period, and date of contract initiation. Once the observation period ends, we no longer have any information about the individuals, and thus for those who lived longer than December 31, 1993, the lifetimes are right-censored.
Only a small portion of all the lifetimes in our data are actually uncensored, i.e. fully observed. In particular: 10.87% of the male lifetimes, 3.93% of the female lifetimes, and 1.71% of the joint (couple) lifetimes. The last 1.71% corresponds to 195 couples, of which 24 died within a one-day period, 59 within one month, and 92 within one year. Therefore, a considerable portion (47%) of the uncensored couple deaths are either likely due to an accident-which would explain both spouses' dying within a day-or to some other impactful reason like a highly contagious disease, or a strong broken  heart syndrome (Jagger and Sutton, 1991). In these situations, it is expected that the correlation between the annuitants' lifetimes is strongly positive, and it actually is. However, as we will see shortly, taking only into account the uncensored observations highly overestimates said dependence for all the other couples.
To model the data with the B-RUP, we follow the exact procedure as before. We start by identifying the target variables, define their respective priors, and then apply the Gibbs sampler with predefined strengths of belief to merge the a priori knowledge with the data. The targets are, once again, A, B and C, and the priors we chose are G A = Poi (60), (15) and G C = Poi(15). With this choice, the marginals for X and Y are both Poisson distributions with parameter µ = 75, and the correlation is approximately 0.8. This prior setting (from now on, the Poisson scenario) takes into account the sample moments and correlation one can estimate directly from the uncensored observations, and that are shown in the first column of Table 5. For the strengths of belief we will still use the scenarios in Table 1. As before, for initializing the Gibbs sampler we will generate observations from the prior we chose for A.
The decision of choosing distributions whose moments are in line with (a part of) the empirical data is to show a possible way of eliciting the a priori. Clearly, in case of experts' judgements or other sources of knowledge, one can also elicit completely subjective priors. In Appendix A, we show the impact of changing the a priori.
The rest of the section is structured as follows: in Subsection 5.2.1 we show the B-RUP estimation of the joint survival function, and in Subsection 5.2.2 the relative annuity calculations; then in Subsection 5.2.3, we compare our results with the copula approach one can find for example in Frees et al. (1996). Figure 6 shows the B-RUP marginal distributions for both the low and high strengths of belief. We observe that the marginal distribution for the first annuitants (males) is nicely recovered. However, the differences are clearly bigger 10 for the marginal of Y -the second annuitant (females)-since censoring seems to affect more the right-tail.

B-RUP results for the joint survival function
The lack of information about the right-tail of the distribution of female lifetimes is so large that the KM cumulative probability at the age of 97 is around 0.8, while for the age of 98 the cumulative distribution function (cdf) reaches its maximum value of 1. Taken literally, this would mean that there is a 20% probability for females of dying between 97 and 98 years, something we know is not true. This unrealistic behaviour could be due to the fact that, as per Table 5, female annuitants have a slightly longer life expectancy than males, while, from the initial ages at which they enter the contract, we see that on average (Luciano et al., 2008) females are three years younger than males, when the observation period starts. This combination of younger entry age plus higher life expectancy could be the reason why censoring affects more the distribution of females than that of males for this particular data set.
Some additional numbers can give insight about the lack of data in the tails. Only 48 females have indeed survived until at least 90 years. Conversely, the number of males observed in the same range is 77. While this difference may not seem significant, notice that the life expectancy of males, according to the results in Table 5, is 3 years lower than that of women. Therefore one would expect to observe more women at high ages than men, whereas the opposite happens in our data. This could also be explained by the lower entry ages of females with respect to males.  The upper row shows the results for the low belief scenario and the lower row the same for the high belief scenario. All distributions are conditioned on the minimum value of the uncensored observations. For this example those were X min = 51 and Y min = 46. Table 5 presents the means and the variances of X and Y , and their correlation, for the raw uncensored data, the KM estimator (notice that in this case correlation is not obtained because independence is assumed), and the B-RUP. For completeness, we also provide all quantities when taking all observations as uncensored ("Whole"), as we did in the analytical example.
First notice that using only uncensored observations leads to an underestimation of the average lifetimes and to a likely overestimation of dependence. But this is no surprise, as it is known that ignoring censoring is highly misleading in survival analysis (Klein and Moeschberger, 2003 Table 5: Comparison of the means, variances and correlation of X and Y using: the uncensored data, the whole data set (wrongly assuming no censoring), the KM estimator and the B-RUP estimator for both low and high strength of belief, and the copula model, respectively.
In Figures 7a and 7b we show the joint distributions obtained with the B-RUP model for the low and high belief scenarios, respectively, as well as the prior distribution for comparison purposes. Notice how, although their marginals are very similar-see Figure 6-the shape of the contours is considerably different around the areas of lower probability.
Conversely the joints are more compatible around the inner contours, for the bulk of the distribution, sign that the strength of belief is not large enough to affect the areas where data are concentrated. In particular, for high strengths of belief the contours are somewhat smoother, a phenomenon which we already observed in the analytical example. This makes the distribution of Figure 7b easier to interpret. Nevertheless, notice that this smoothing induced by the strength of belief is a secondary role, while the main purpose of a reliable prior is to embed behaviours that are not captured by the data. Moreover, whereas further increasing the strength of belief would surely result in smoother distributions, we could end up giving a larger weight to the prior distribution than originally intended.
A way to improve our previous results, obtaining a more tractable joint distribution, while maintaining the desired strength of belief, is to use a kernel bivariate density estimation. For example, we can employ a bivariate normal kernel, with bandwidth parameters consistent with Sylverman's rule of thumb (Sylverman, 1986). The obtained distributions are presented in Figures 7c and 7d, for low and high strengths of belief, respectively. Note that the contours of the high belief kernel estimate are considerably more symmetric than those of the low belief kernel estimate, in accordance with the results of Figures 7a and 7b. Notice that, by increasing the strength of belief, the areas with fewer observations are dominated by the prior, and this causes a decline in the correlation, which goes from 0.477 to 0.421, consistently with the findings in Table 5.
As observed in Carriere (2000) for this very same data set-and more recently in Sanders and Melenberd (2016) for a data set concerning the whole Dutch population-the positive dependence between the lifetimes of married couples increases with the ages of its members. In other words, people in their forties are expected to present a smaller positive dependence than a couple in their eighties. To check for this phenomenon, as in Sanders and Melenberd (2016) we compute the relative difference between the joint survival function and the product of the marginal survival distributions. The results are in Figures 8a and 8b for the low and high belief scenarios, respectively, and they are clearly in line with an increase of dependence with age. Moreover, from the same figures we can conclude that the relative difference is bigger under the low belief scenario, which would explain the slight increase in the overall correlation (Table 5).

Annuity pricing with the B-RUP
Now that we have estimated the joint probability distribution, we can calculate the annuity ratio. We still assume a fixed interest rate of 0.05 and show the annuity ratio for several initial ages of the annuitants. Next, when we want to compute the annuity ratio as a function of the interest rate, we will hypothesize that annuitants enter the contract at the same age to simplify the exposition, so that we can easily plot the results as a function of the interest rate only. Figure 9 shows the results for the low and high strengths of belief, and the fixed interest rate. In line with expectations, even for the annuity ratio, the high belief scenario gives again a smoother surface.
Notice that, in both cases, the annuity ratio increases with the age difference between the annuitants. When one of the annuitants dies at a very old age, the probability for the surviving spouse to also die at an old age is larger under positive correlation than under independence. This effect seems to be bigger when the female is the surviving annuitant, probably because of the larger life expectancy of females with respect to males (Carriere, 2000).
When both spouses are around the same age, the annuity ratio takes values below one, specially for very high entry ages. If one annuitant dies at a young age, the probability of the remaining annuitant to also die in the near future is higher when positive dependence is assumed, but in the long term the surviving annuitant will count as being independent from the deceased spouse, and the annuity ratio is closer to 1. Conversely, for high entry ages, the impact of the deceased spouse upon the surviving annuitant is more relevant in the near future than in the long term-according to the meaning of "high" entry ages, there is no long term survival to be expected in the first place-and the annuity ratio keeps on decreasing below the unit. All these findings are in line with the results obtained in Frees et al. (1996) for the same data set with copula models.
Also notice the similarity of these surfaces with those of Figure 4: qualitatively the behaviour is mostly the same. This is because in both cases we have positive dependence, and thus all the reasoning developed in this section also applies for the analytical example. Moreover, since the correlation levels are very similar (around 0.5), we obtain annuity ratios of comparable orders.
The results of the annuity ratio as a function of the interest rate can be found in Figure 10. Please notice how the overall shape is in accordance with what we already observed in Figure 5. For small interest rates and entry ages, the annuity ratio is below one, meaning that the annuity price with independent mortality is overestimated. As we increase the entry age, the annuity ratio decreases up to a minimum (0.9527 for and interest rate equal to 0 and an entry age of 71 years), from where it starts to increase again, also reaching values above the unit.
We would like to stress that, while annuity ratios of 1.02 or 0.98 may seem close enough to 1, and this could lead to the conclusion that it is ok to assume independence, the nominal value for an annuity contract is usually considerably large, and thus the final difference in the prices may not be negligible in monetary terms. In Table 6 we show some values of the annuity ratio for particular entry ages.
The results above are in line with what Frees et al. (1996) and Luciano et al. (2008) found on the same data, using different approaches like copulas, suggesting that the B-RUP can be safely used as an alternative model.
In the next Subsection we compare the B-RUP and a copula approach in some more detail.

What about copulas?
We finally compare the performance of the B-RUP with copulas. In particular, we choose the Frank copula model of Frees et al. (1996), which proved to give good results for the annuity problem.
In Frees et al. (1996), Gompertz distributions are used for the individual lifetimes, and the Frank copula to model the dependence. The Gompertz distribution is given by where µ, σ are the location and scale parameter, respectively.
The Frank copula is defined as where u, v are the marginal distributions for the male and female annuitant, respectively, and α is the parameter controlling the dependence. A negative value of α indicates positive dependence, while α = 0 means that we have independence (Nelsen, 2006).
It was already shown in Frees et al. (1996) that this model can be calibrated via maximum likelihood estimation (MLE), and we refer to the original paper for all the details regarding the estimation of the parameters. In Table 7 we present the results obtained using MLE, where (µ X , σ X ) are the Gompertz estimates for the male annuitant, and (µ Y , σ Y ) the estimates for the female annuitant. Since the value of α is highly negative, we expect a strong positive dependence. It can be seen in Table 8 Table 6: Annuity ratio obtained with the B-RUP model with low strength of belief parameters for several initial ages of the annuitants with an interest rate of 0.05. Assuming independence can definitely be a bad choice in pricing. 783 5.927 90.118 5.145 -4.144 Table 7: Calibration of the copula model via MLE. The subscripts X, Y refer to the marginals of the male and female annuitants, respectively. Notice that α is negative and away from 0, so according to Equation (12) there is positive dependence.
In Figures 11a and 11b we present the marginal distributions for this model, with also the KM estimator for comparison. If we compare these plots with those of Figure 6, we see that both the B-RUP and the copula model return very similar curves. We can also compare the moments using Tables 5 and 8. For the male annuitants, both the mean and the variance are remarkably similar, while there are some discrepancies for female annuitants. As discussed earlier, for this particular data set there is barely any information about the right tail of the females' distribution, and so there is no surprise in models yielding slightly different results.  Figure 11: Comparison of the marginals of X and Y using a Poisson prior. In red we show the Kaplan-Meier estimators and the copula distribution in black. All distributions are conditioned on the minimum value of the uncensored observations. For this example those were X min = 51 and Y min = 46.
In Figure 12 we show the joint distribution given by the copula, with the B-RUP results already presented in Figure 7. As expected, the parametric copula produces a less "cloudy" and elliptical distribution, with less mass on "peripheral" couples like (70, 85). The copula also gives a slightly stronger correlation (0.51 versus 0.48), but we can consider this difference as negligible in practice.
Finally, we check if the small discrepancies in the survival functions of the B-RUP and the copula model are also observed once we compute the annuity ratio. The annuity results for the Frank copula are in Figures 13a and 13b. Apart from the areas where both annuitants are considerably old (> 75) at the time of contract initiation, we see that the B-RUP and copula models tend to give very similar results.
All in all the B-RUP and the Frank copula of Frees et al. (1996) show similar performances-but this holds true also for the other models of Frees et al. (1996) Figure 12: Joint lifetime distribution for the copula construction of Frees et al. (1996).
While working nicely on data, copulas do not have the same flexibility of the B-RUP. In particular when a priori knowledge can be used to improve fitting, or to deal with specific characteristics of the data-like a severe lack of observations in the tails-the B-RUP clearly has an edge over the other commonly used approaches. And this without taking into consideration the model risk involved in every parametric choice behind the use of copulas (Hull, 2015;Mikosch, 2006). This is why we believe that the nonparametric approach the B-RUP proposes, which combines a priori knowledge (when meaningful) and the ability of learning from the data, represents a viable and powerful alternative to the existing approaches to the modeling of coupled lifetimes. The fact that one can get the same results, but also extend them with experts' judgements (eliciting a different a priori as in Appendix A) is a plus we deem worth using.

Conclusions
In this paper we have proposed the bivariate reinforced urn process (B-RUP) as a way of modeling dependent mortality, to price joint and survivor annuities.
The main advantages of the B-RUP lie in its intuitive construction, in the possibility of combining experts' judgements with empirical evidence, in the ability of the model to learn and improve its performances over time, like in many machine learning approaches, but without "black boxes" (Knight, 2017), and in the successful treatment of rightcensored observations, very common in annuity modeling.
In the absence of a credible a priori, which would give a strong competitive edge to its use, the B-RUP is nevertheless able to replicate the performances of other commonly used approaches (e.g. copulas as in Frees et al. (1996) or Luciano et al. (2008)), thus showing an interesting flexibility. Differently from other models, however, the B-RUP can be used on a continuous basis, as it automatically updates its parameters whenever new data become available, without the necessity of re-estimating the model entirely. This can be extremely useful in an online learning environment. Finally, being nonparametric, the B-RUP is less subject to model risk than copulas or similar approaches (Hull, 2015), especially if a clever use of priors can also account for data problems.
Using artificial data and a well-known Canadian data set of annuities (Frees et al., 1996), we have discussed the performances of the B-RUP, which appear definitely satisfactory. In analysing those performances, we have also discussed how a sufficient number of data can correct a wrong a priori, but also how strong priors may try to correct for historical bias and deal with tail events.
(a) Fixed interest rate.
(b) Couples with same age. Figure 13: Ratio between the annuity values of the joint and independent survival models using the copula construction.
The left plot (a) shows the annuity ratio for a fixed interest rate r = 0.05. Similar results when the couples have the same age are presented on the right plot (b).
In terms of performances, the model provides results in a very reasonable time, spanning from a few minutes to a couple of hours, depending on the size of the data set and the amount of predictions required. In particular, for the Canadian data set considered and 10 4 iterations of the Gibbs sampler, the algorithm took approximately 1 minute of computing time in a C++ environment and an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz processor.
In the future, it would be extremely useful to go deeper into the study of credible and reliable a prioris for annuity modeling, by gathering opinions and recommendations from experts. From a computational point of view, conversely, it could be meaningful to find ways of introducing parallelization in the simulations of the B-RUP. While not immediately relevant to the application described in this paper, such a possibility could dramatically expand the applicability and the performances of reinforced urn models in insurance and finance (Amerio et al., 2004;Cirillo, 2018, 2019;Peluso et al., 2015).

Acknowledgements
First of all, the authors would like to thank Prof. Edward W. Frees for sharing his Canadian insurance data with us.
The help of two anonymous referees and the editor in improving the first version of the manuscript is also highly appreciated.

A Changing the prior
In this section we include additional results about the impact of different priors on the joint lifetime distribution for the Canadian data set of Section 5.2. These results are meant to be compared with the Poisson prior and the copula model of Section 5.2.
The results obtained using the Gompertz scenario are in Figure 14, and they are very similar to the Poisson case. On the other hand, the results obtained with the Uniform scenario, shown in Figure 15, are significantly different, especially for high strengths of belief-see Figures 15c and 15d-where the B-RUP and KM marginals have very little in common. Trivially, as one grows older the probability of demise increases, and therefore imposing a uniform distribution for the lifetime of an individual is highly unrealistic, so this mismatch was actually to be expected. In this sense, the B-RUP inherits the pros and the cons of the prior setting used, for sure under high strengths of belief. Naturally and conversely, for low strengths of belief, the B-RUP is able to capture both marginals, except for the areas with very few observations, so that the model is able to override a wrong prior given a sufficient amount of observations.  Figure 14: Comparison of the marginals of X and Y using a Gompertz prior. In green we show the prior distribution, the Kaplan-Meier estimators in red and the B-RUP solution in blue. The upper row shows the results for the low belief scenario and the lower row the same for the high belief scenario. All distributions are conditioned on the minimum value of the uncensored observations. For this example, X min = 51 and Y min = 46.
In Table 9 we show the first two moments and the correlation, similar to what we did in Table 5 for the Poisson prior. As before, the subscript l (h) denotes low (high) strength of belief. Notice that, for low strengths of belief, the difference in the B-RUP results are mainly due to the choice of the initial sample for A in the Gibbs sampler, which in our case is generated from its prior distribution. Consistently with what we have already said, the uniform scenario with a high strength of belief generates the most radical results, including a correlation close to the one (0.8) we can estimate from the uncensored observations. All the other B-RUPs tend to suggest a less extreme correlation, in the vicinity of 0.5, in line with Frees et al. (1996) and Luciano et al. (2008). In any case, a correlation of 0.5 should definitely not be ignored by assuming independence, as per the KM approach.  Table 9: Comparison of the means, variances and correlation of X and Y using: the uncensored data, the whole data set (wrongly assuming no censoring), the KM estimator and the B-RUP estimator with Gompertz and Uniform priors, respectively. The subscript l (h) indicates low (high) strength of belief.  Figure 15: Comparison of the marginals of X and Y using an Uniform prior. In green we show the prior distribution, the Kaplan-Meier estimators in red and the B-RUP solution in blue. The upper row shows the results for the low belief scenario and the lower row the same for the high belief scenario. All distributions are conditioned on the minimum value of the uncensored observations. Here X min = 51 and Y min = 46.