Introduction

The current global spread of a novel influenza strain [1] highlights gaps in our understanding of the spatial component of disease transmission at national and regional scales. For example, the early summer 2009 wave in the United States affected some populations much more than others (Centers for Disease Control, USA), even populations at similar latitudes. Also, there was substantial transmission in parts of southern England throughout the summer of 2009, but very little in most of northern mainland Europe (European Centre for Disease Prevention and Control). This slow progression to national and regional synchrony is not obviously consistent with previous theoretical studies of the within-country dynamics of pandemic influenza [2][3][4], in which census-reported commuting patterns and airline flight data were used to characterize very rapid spatial spread. Explaining these early patterns of spatial spread for the 2009 pandemic will likely be an active area of epidemiological research in the coming years.

Stochastic spatial transmission models, in which individuals or small communities are represented explicitly in space, are an extension of more traditional approaches and have been a valuable tool in the study of infectious diseases in humans and animals [5]. Traditionally, mathematical models of epidemics take the form of deterministic differential equations in which the variables represent the expected number of individuals in broad disease classes (e.g. susceptible, infected, or recovered) [6]. Although such models can be extended to describe the geographic spread of infectious diseases across patches [7], it is difficult to use compartmental approaches with confidence when it is not clear which spatial scales are most important.

Here, we describe an algorithmic refinement of a spatial stochastic model of individuals and their communities. This framework was originally designed to investigate community interventions against influenza in a generic sense [8]. It was later extended to examine the optimal response to a bio-terrorist smallpox attack [9] and to examine the potential for containment of an influenza pandemic in a large, well-mixed population [10]. A spatial component was added to the model to study the feasibility of containing an emergent influenza pandemic in a rural setting in southeast Asia [11]. In a later major development, the underlying algorithm was parallelized to allow it to run with a population of 300 million and was used to predict the likely impact of mitigation measures on an influenza pandemic in the United States [2]. More recently, the same framework has been used to describe the likely fall wave transmission dynamics of H1N1pdm in Los Angeles County [12].

We have implemented a more efficient algorithm for this popular disease transmission model. We demonstrate increased computational efficiency compared with previous implementations and we describe a parameterization scheme for the model using the basic reproductive number, rather than the per contact transmission potential. We illustrate the utility of the refined model with simulation studies of seeding dynamics for a pandemic of influenza in Taiwan.

Methods

The model is, effectively, a highly connected network model. Individuals are grouped into different settings (home, school, workplace, etc.), and we assume that all members of one instance of a setting are connected to all other members. The infection process over this network can be thought of in two stages: contact and infection. Each individual is a member of one of five age groups and belongs to several contact groups. The age groups are preschoolers (0–4 years), school-age children (5–18 years), young adults (19–29 years), adults (30–64 years), and elders (65+ years). The social contact groups include communities, neighborhoods, household clusters, households, work groups, high schools, middle schools, elementary schools, daycare centers, and play groups. Individuals have contacts within their contact groups every day. Each day is divided into a daytime and a nighttime period. During nighttime, contacts occur only in communities, neighborhoods, household clusters, and households, whereas in the daytime, contacts occur in all the contact groups. Each contact between an infected and a susceptible individual results in a potential infection. The probability that an infected individual transmits the virus to a susceptible individual is called the contact probability and depends on the group in which the contact occurs and on the ages of the potential infector and infectee. An individual's age not only affects the probability of infection, it also determines the individual's daytime contact groups: preschoolers stay either in daycare centers or in play groups; school-age children stay either in schools or, as drop-outs, in households; and young adults and adults stay either in work groups or, if unemployed, in households. Further details of the basic model can be found in the supporting information to Ref [2].
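As an illustration only (the data structures and names below are hypothetical and not taken from the original implementation), the population described above can be represented as individuals tagged with an age group and with daytime and nighttime contact-group identifiers:

```python
# Hypothetical sketch of the population structure described above;
# class and field names are illustrative, not those of the original model code.
from dataclasses import dataclass, field
from enum import Enum


class AgeGroup(Enum):
    PRESCHOOL = 0    # 0-4 years
    SCHOOL_AGE = 1   # 5-18 years
    YOUNG_ADULT = 2  # 19-29 years
    ADULT = 3        # 30-64 years
    ELDER = 4        # 65+ years


@dataclass
class Person:
    person_id: int
    age_group: AgeGroup
    # Nighttime contacts: community, neighborhood, household cluster, household.
    night_groups: set = field(default_factory=set)
    # Daytime adds a work group, school, daycare center or play group,
    # depending on age and employment status.
    day_groups: set = field(default_factory=set)


def active_groups(person: Person, is_daytime: bool) -> set:
    """Contact groups in which this person mixes during the current half-day period."""
    if is_daytime:
        return person.night_groups | person.day_groups
    return person.night_groups
```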

The discrete-time simulation of infection events in individual-based epidemic models can be reduced to the generation of binomial deviates. Within any given model there can be many types of infectious individual and many types of susceptible individual; for example, there can be many age groups and many stages in the natural history of the disease. The set of all possible pairs in which the first element is an infectious individual and the second element is a susceptible individual (an IS pair) defines the set from which infection events can be simulated at any point in time. If many of the pairs have exactly the same probability of generating an infection (S of exactly the same type and I of exactly the same type), then many infection events can be generated with relatively few binomial deviates. However, if the pairs are largely different, then many binomial deviates need to be drawn to generate a similar number of infections. The introduction of spatial dimensions into individual-based formulations greatly increases the heterogeneity of the model because every small group of individuals with a unique location forms, effectively, its own risk group.
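For instance, if every IS pair in a group shares exactly the same infection probability, the number of new infections can be drawn with a single binomial deviate, whereas heterogeneous pairs require one Bernoulli trial each. A minimal sketch of the two cases (illustrative Python, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)


def infections_homogeneous(n_pairs: int, p: float) -> int:
    """One binomial deviate replaces n_pairs Bernoulli trials when every
    IS pair has exactly the same infection probability p."""
    return int(rng.binomial(n_pairs, p))


def infections_heterogeneous(pair_probs) -> int:
    """When the pairs differ, each pair needs its own Bernoulli draw."""
    pair_probs = np.asarray(pair_probs)
    return int((rng.random(pair_probs.size) < pair_probs).sum())
```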

A high-level description of a naive algorithm for the basic model is as follows:
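In outline, such an algorithm visits every infectious-susceptible pair at each half-day time step, evaluates the pair-specific infection probability, and draws a Bernoulli trial to decide whether transmission occurs. A minimal sketch of this pair-by-pair procedure (illustrative Python; the function and variable names are ours, not those of the original implementation):

```python
def naive_step(infectious, susceptible, infection_prob, rng):
    """Naive pair-by-pair simulation of one half-day time step (sketch).

    infection_prob(i, s) is assumed to return the probability that
    infectious individual i infects susceptible individual s in this
    period, given their shared contact groups, ages and i's disease stage.
    """
    newly_infected = set()
    for i in infectious:
        for s in susceptible:
            if s in newly_infected:
                continue  # already infected earlier in this step
            if rng.random() < infection_prob(i, s):
                newly_infected.add(s)
    return newly_infected
```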

The algorithm we have developed greatly improves the efficiency with which infection events can be generated across large numbers of similar risk pairs. Here, we briefly describe the key features of the algorithm as they relate to the efficient simulation of spatial epidemics; the methods are described in more detail elsewhere [13]. In essence, the approach is to use lazy evaluation for large groups of pairs with similar probabilities of an infection event. For example, one infectious individual a in community A has a certain maximum probability of infecting members of community B, based on the flow of workers between those two communities. The precise probability of infection for each member of community B will depend on their age and other risk variables. However, the maximum probability for any individual in community B, p_max, may be very small if the worker flow between A and B is small. Therefore, using the Sieve algorithm, our first step is to generate a random variable for the provisional number of infection events by assuming that all pairs have the same probability p_max of an infection occurring. This generates too many infections. The second step is to select specific pairs at random and either accept or reject each provisional infection using the precise probability of infection between individual a and each individual b in the provisional set of infections in community B; we denote this precise probability p_b. If we accept each provisional infection event with probability p_b/p_max, the overall probability of individual b being infected is equal to p_b. Therefore, our method reproduces the same stochastic process as if we evaluated each individual p_b separately, and is not an approximation.

A high-level description of our improved algorithm is as follows:
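In outline, for each infectious individual and each target group, a provisional number of infection events is first drawn using the group-wide upper bound p_max, and each provisional infectee is then accepted with probability p_b/p_max. A sketch of this two-stage procedure for a single infectious individual a and a single community B (illustrative Python; names and signatures are ours, not those of the published implementation):

```python
import numpy as np


def sieve_step(community_B, p_exact, p_max, rng):
    """Two-stage (propose-then-thin) sampling of infections caused by one
    infectious individual among the members of community B (sketch).

    p_max      -- upper bound on the infection probability for any member
                  of B (e.g. derived from the worker flow from A to B).
    p_exact(b) -- precise infection probability for member b, <= p_max.
    rng        -- a numpy.random.Generator.
    """
    members = list(community_B)
    # Stage 1: provisional number of infections, as if every member of B
    # were at the maximum risk p_max.
    k = rng.binomial(len(members), p_max)
    # Stage 2: pick k provisional infectees at random and keep each one
    # with probability p_exact(b) / p_max, so that the marginal infection
    # probability of b is exactly p_exact(b).
    provisional = rng.choice(len(members), size=k, replace=False)
    return [members[idx] for idx in provisional
            if rng.random() < p_exact(members[idx]) / p_max]
```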

We can prove that the statistical behavior of the Sieve algorithm is identical to that of the naive algorithm, in which each candidate is decided one by one, sequentially; the proof of equivalence is given in [13]. Note that our Sieve algorithm decides a set of candidates in a batch. One reason that our algorithm runs faster is that, in practice, p_max is very small, so the number of candidates K selected by the Sieve algorithm is much smaller than N, the size of the pool of people to be considered.
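The marginal equivalence can also be checked empirically. The snippet below (a sketch only, not a substitute for the proof in [13]) compares the number of infections produced by per-individual Bernoulli trials with the number produced by the binomial-then-thin scheme on a pool of heterogeneous probabilities; the two means agree closely:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
p = rng.uniform(0.0, 0.002, size=N)  # heterogeneous per-individual risks
p_max = p.max()


def naive_draw() -> int:
    return int((rng.random(N) < p).sum())


def sieve_draw() -> int:
    k = rng.binomial(N, p_max)
    candidates = rng.choice(N, size=k, replace=False)
    return int((rng.random(k) < p[candidates] / p_max).sum())


naive_counts = [naive_draw() for _ in range(5_000)]
sieve_counts = [sieve_draw() for _ in range(5_000)]
print(np.mean(naive_counts), np.mean(sieve_counts))  # both close to p.sum()
```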

By treating the model explicitly as a network, we calculate the average number of secondary cases a priori, rather than using semi-empirical methods to calibrate the model. The basic reproductive number R0 is the expected number of secondary infections generated by a single typically infectious individual in an otherwise susceptible population [14]. R0 is a threshold parameter that determines whether or not an infectious disease will spread through a population. Strictly, for models with multiple types of infectious individuals, R0 should be defined in terms of a next-generation matrix and an eigenvector for the exponential phase of growth. The eigenvector is important in that it defines what is typical during the exponential phase; often, a typical infectious individual will be different from a randomly chosen individual. For network models of infectious disease, the formal approach presents some problems because every individual is, essentially, a different type. Therefore, we follow many previous network models and use the average number of secondary cases per randomly chosen individual as R0.

Based on the influenza model and parameters, we compute the probability that infectious individual i infects susceptible individual j, namely w_ij, as follows. First, the infection probability resulting from contact between i and j in group k is defined as p_ijk = P_trans × c_k, where P_trans is the disease-dependent transmission probability and c_k is the group-dependent contact probability. Second, D_ij is the set of contact groups shared by i and j in the daytime, and N_ij is the corresponding set for nighttime; the intersection of D_ij and N_ij may be empty or nonempty. Third, when the infectious individual is incubating or asymptomatic, the infection probability is reduced by a factor r, where r > 1; for clarity, we define h_ijk = p_ijk / r. In our model, r is set to 2, following Ref [2]. Finally, by taking all daytime and nighttime contacts into consideration, the daily infection probability is calculated by
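A plausible form of the daily probabilities, under the assumption that contacts in each shared group act as independent transmission opportunities and that a group appearing in both D_ij and N_ij contributes in both the daytime and nighttime periods, is:

w_{ij} = 1 - \prod_{k \in D_{ij}} (1 - p_{ijk}) \prod_{k \in N_{ij}} (1 - p_{ijk}),

H_{ij} = 1 - \prod_{k \in D_{ij}} (1 - h_{ijk}) \prod_{k \in N_{ij}} (1 - h_{ijk}),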

where H_ij is the daily infection probability when individual i is incubating or asymptomatic. Given the natural history model, the total expected number of infections can then be calculated (Figure 1). We also validated the accuracy of the calculation by simulation (Table 1).
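As an illustration of how the expected number of secondary cases could be accumulated over the natural history (a sketch under the assumptions that successive days of infectiousness act independently, that the diagonal entries of the probability matrices are zero, and that infectiousness is reduced to H_ij on incubating/asymptomatic days; the names are ours):

```python
import numpy as np


def expected_secondary_cases(W, H, days_subclinical, days_symptomatic):
    """Expected number of secondary cases for each potential infector (sketch).

    W[i, j] -- daily probability that i infects j while i is symptomatic.
    H[i, j] -- daily probability while i is incubating or asymptomatic.
    Assumes days act independently and that W and H have zero diagonals.
    """
    # Probability that j escapes infection by i over i's whole infectious period.
    escape = (1.0 - H) ** days_subclinical * (1.0 - W) ** days_symptomatic
    # Expected secondary cases generated by i.
    return (1.0 - escape).sum(axis=1)


# R0 taken as the average over randomly chosen infectors, as in the Methods:
# R0 = expected_secondary_cases(W, H, d_sub, d_sym).mean()
```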

Fig. 1: The computation of the probability that individual j is infected by individual i according to the natural history model.

Results

The Sieve algorithm was substantially more efficient than the naive algorithm when applied to a real-world application. We used a population of 23 million people (approximately the size of Taiwan's population) as our example. We calibrated the strength of transmission to give a cumulative infection attack rate of 60% (a severe pandemic) and let the duration of infectiousness be, on average, 3 days. Even with a coarse 0.5 day time step, the naive algorithm would still require of order 10^15 pair interactions to be evaluated (if we assume every infectious individual has a non-zero probability of infecting every susceptible individual). By using the re-sampling approach in the Sieve algorithm, efficiency is dramatically improved with no loss of precision (Figure 2a).
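As a rough back-of-the-envelope check (our arithmetic, with illustrative rounding): a 60% attack rate in a population of 23 million implies roughly 1.4 × 10^7 individuals who are infectious at some point; each is infectious for about 3 days, i.e. 6 half-day time steps, and in each step the naive algorithm evaluates that individual against roughly 2.3 × 10^7 potential contacts, giving approximately 1.4 × 10^7 × 6 × 2.3 × 10^7 ≈ 2 × 10^15 pair evaluations.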

Fig. 2: The precision and efficiency of the Sieve algorithm, as applied to a model of pandemic influenza transmission in Taiwan.

a) Demonstrates the correct implementation of the algorithm in that the attack rate from the Sieve algorithm is the same as the attack rate from the naive algorithm for a range of values of transmissibility.

b) Shows the speedup of the Sieve algorithm for a range of values of transmissibility. Speedup is defined as the ratio of the average computation time for the naive algorithm to that of the Sieve algorithm.

The improved efficiency allows rapid simulation of pandemic influenza outbreaks at the national scale for Taiwan. On a machine with an AMD Opteron OSA250 (dual-core, 2.4 GHz) CPU and 4 GB of DDR1 memory, our implementation completed one simulation in less than 5 minutes for a wide range of values of transmissibility. Simulation time was short up to a threshold value of transmissibility, after which the cumulative number of infections increased substantially. The peak in simulation time does not correspond to the peak in the number of infections, which is at the far right of the chart. The peak simulation time occurs at lower cumulative attack rates because there is a trade-off between the duration of the epidemic and the total number of infections. Large epidemics are faster: when the epidemic is very quick, there are far fewer time steps in which many infections occur. When the epidemic is somewhat smaller but much slower, the computational burden is at its highest.

Increased efficiency allows us to conduct many simulations rather than being forced to report results from a single realization. Therefore, we conducted a number of experiments to assess the degree of variability introduced by the process of generating the mock population and by the simulation algorithm itself.

First, we consider the variation introduced when, given different seeds but the same demographic data, the Initialization module generates many different mock populations. We carried out the following experiment: we randomly picked a mock population and simulated 2,000 baseline realizations with constant transmission parameters. For each of the 2,000 realizations, we extracted the day of the 1000th symptomatic case and the total number of infections. We then treated the statistics from all 2,000 results as if they were the real sample space, assuming that the parameters of the real, unknown sample space are not too far from it; each production run is then merely a sample from this 2,000-run sample space. We first observe that the histograms of these summary statistics are all bell-shaped. We use maximum likelihood estimation to fit a normal distribution to each histogram. In Figure 3, the histogram and the fitted normal distribution are shown on the left. We then compare the observed distribution with the fitted normal distribution in a quantile-quantile (q-q) plot. The q-q plot is a graphical technique for determining whether two data sets come from populations with a common distribution: it plots the quantiles of the first data set against the quantiles of the second, where a quantile is the fraction (or percent) of points below a given value. A 45-degree reference line is also plotted; if the two sets come from populations with the same distribution, the points should fall approximately along this reference line [15]. As can be seen, the normality of the observed distribution is supported both visually by the histogram on the left and by the q-q plot on the right.
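A minimal sketch of this procedure (assuming the per-realization summary statistics are held in a NumPy array; for a normal distribution the maximum likelihood estimates are simply the sample mean and standard deviation):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def summarize_realizations(values):
    """Fit a normal distribution by maximum likelihood and draw the
    histogram with the fitted density (left) and a q-q plot (right)."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std()  # MLE for a normal distribution

    fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(values, bins=40, density=True, alpha=0.6)
    grid = np.linspace(values.min(), values.max(), 200)
    ax_hist.plot(grid, stats.norm.pdf(grid, mu, sigma))
    ax_hist.set_xlabel("per-realization statistic (e.g. total infections)")

    # q-q plot of the standardized values against a standard normal;
    # probplot also draws a fitted reference line.
    stats.probplot((values - mu) / sigma, dist="norm", plot=ax_qq)
    return mu, sigma
```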

Fig. 3: The statistical properties of simulations with different mock populations.


Second, we study the relationship between the width of the confidence interval and the number of realizations. In Table 2, we summarize the width of the confidence interval for several important measurements at sample sizes of 20, 30, and 40. Based on these numbers, repeating each simulation 30 times is a sensible choice.
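The dependence of the confidence-interval width on the number of realizations can be reproduced with standard formulas; a sketch under the assumption (supported by Figure 3) that the summary statistics are approximately normal:

```python
import numpy as np
from scipy import stats


def ci_half_width(values, confidence=0.95):
    """Half-width of the t-based confidence interval for the mean of a set
    of realization-level summary statistics."""
    values = np.asarray(values, dtype=float)
    n = values.size
    sem = values.std(ddof=1) / np.sqrt(n)
    return stats.t.ppf(0.5 + confidence / 2.0, df=n - 1) * sem


# e.g. compare interval widths for 20, 30 and 40 realizations of a
# hypothetical array `attack_rates`:
# for n in (20, 30, 40):
#     print(n, 2 * ci_half_width(attack_rates[:n]))
```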

We illustrate the practical use of the model by simulating severe pandemics in Taiwan. Movie 1 compares two baseline scenarios. From the movie, a number of general aspects of the spatial epidemiology of infectious disease can be observed. First, the capital city, Taipei, is better connected than the mid-latitude city of Chuanghua. The epidemic progressed more rapidly from Taipei to other areas, resulting in a more synchronized epidemic: incidence is at similar values in quite distant locations during the middle part of the epidemic. In contrast, Chuanghua is less well connected, and the epidemic seeded there takes longer to spread to other parts of the population. Hence, incidence in the mid-latitudes close to the seed leads incidence in other areas. This results in a slightly faster and 'less peaky' epidemic. These simulation runs illustrate the general principle that when epidemics fail to synchronize spatially, overall incidence is less peaked. However, the results presented here do not describe local incidence of infection, which will be more 'peaky'.

Taiwan Influenza Epidemic Simulations

Movie 1: Simulation of an epidemic in Taiwan seeded either in Taipei (left) or Chuanghua (right).

Discussion

We have described the application of a general re-sampling algorithm to a widely used spatial model of infectious disease transmission [8]. The resulting epidemic simulation tool achieves substantial speedups compared with our own implementation of a naive algorithm for the same model. Although derived independently, the resulting simulation algorithm is similar to those used to investigate the properties of the re-emergence of smallpox in the UK [16] and of pandemic influenza in Thailand [17] and in the United Kingdom and the United States [4].

We suggest that further research on the underlying algorithms for the model presented here and other similar models is warranted. For example, there are many ecological questions about the spatial properties of the current H1N1 pandemic — not least the need to explain the high degree of spatio-temporal variability observed at the continental scale. More generally, at any scale, improved computational efficiency of epidemic models, similar to that demonstrated here, will substantially increase their utility as tools for theoretical investigation.

Competing Interests

All the authors declare that no competing interests exist.