6. Applications to Disease Prediction and Spread Prevention

6.1 Summary of SIR model

To begin our brief summary of disease modeling, we note from the Wikipedia article (XYZ1) that, as of April 2020, the basic reproduction number $R_0$ for SARS-CoV-2 was estimated to be between 1.4 and 3.9. Computing this value is essential to any mathematical prediction of disease spread, since it describes how many people, on average, one infected individual will spread the virus to. However, this value is often extremely challenging to obtain in real time. Not only is it very difficult to trace where an individual was infected or to whom they may have passed the infection, but the value is also strongly affected by regional factors. For example, an individual in a highly urbanized area such as New York City will generally interact with hundreds of people during everyday life, commuting on subways or walking on crowded sidewalks, and hence encounters many more potential hosts; an individual in a rural area will generally interact with only a few people each day, and hence encounters only a few potential hosts.

Now, while there are many advanced techniques to resolve such issues in data collection, we will not address them here; rather, we will focus on outlining the main steps in developing a mathematical model of the spread of a disease. Then, in the following two sections, two applications of hands-on, real-world data analysis are summarized to show how one can obtain practical information, in real time, about how a disease is spreading.

One of the most commonly utilized infectious disease prediction methods is the so-called SIR model. This model attempts to predict, as time moves forward, the number of Infected and then Recovered individuals from a population of Susceptible individuals. Hence, the name SIR refers to the flow $S\rightarrow I\rightarrow R$. To summarize the logic behind this model, we first define the following functions:

$S\left(t\right)=$ the number of not yet infected individuals susceptible to the disease at time t.
$I\left(t\right)=$ the number of not yet recovered individuals infected with the disease at time t.
$R\left(t\right)=$ the number of previously infected individuals who are now recovered at time t.

The goal of the SIR model is to create a mathematical model between these three functions, commonly starting from some initial data $S\left(0\right)=S_0$ and $I\left(0\right)=I_0$.

Now, while we will not discuss the intricate details, nor how they adjust the modeling, we note a few main assumptions of our simplified model. We assume that all individuals in $S\left(t\right)$ are equally likely to become infected with the disease, and that all individuals in $I\left(t\right)$ are equally likely to spread it. In addition, we assume that once an individual has recovered, hence moved from $I\left(t\right)$ into $R\left(t\right)$, they can no longer spread the disease nor be reinfected by it. Furthermore, the function $R\left(t\right)$ is actually a compartment that collects all individuals after they leave $I\left(t\right)$; hence it includes both individuals who survived and gained immunity and individuals who died. In addition, we do not attempt to model any interventions that adjust these transitions, such as external factors like curative medicines, nor do we introduce any jump transitions, i.e. $S\left(t\right)\rightarrow R\left(t\right)$, that result from external factors such as vaccinations. In short, the SIR model we construct here is just a good starting point for an initial modeling stage in the research.

The commonly accepted assumption (XYZ2) of the SIR model is that the rate of change of $S\left(t\right)$ with respect to time is proportional to the product of the current numbers of susceptible and infected individuals, taken as a ratio of the total population. Thus, defining the constant N as the sum $S\left(t\right)+I\left(t\right)+R\left(t\right)$, this yields a differential equation for $S\left(t\right)$ as

$\frac{dS}{dt}=-\beta\frac{S\left(t\right)I\left(t\right)}{N}$

where β is the to-be-determined proportionality constant. It is worth noting that β can be interpreted as the transmission rate, that is, the per-unit-time likelihood that an individual within $S\left(t\right)$ becomes infected through contact with infected individuals, which is assumed equally likely for all individuals. Also, the negative sign in the differential equation reflects the fact that, as time moves forward, the size of $S\left(t\right)$ decreases, since individuals move out of $S\left(t\right)$ into $I\left(t\right).$ In addition, if we define γ as the to-be-determined proportionality constant modeling the rate of change from $I\left(t\right)$ into $R\left(t\right)$, this yields a differential equation for $R\left(t\right)$ as

$\frac{dR}{dt}=\gamma\ I\left(t\right)$

It is worth noting here that γ can also be viewed as the mean recovery/death rate, which can be approximated from real data in real time.

If it is assumed that this is a closed system, then the rate of change of $I\left(t\right)$ can be computed by applying simple in-minus-out logic, which yields a differential equation for $I\left(t\right)$ as

$\frac{dI}{dt}=\beta\frac{S\left(t\right)I\left(t\right)}{N}-\gamma\ I\left(t\right).$

Thus, a classic 3×3 system of three differential equations for three unknown functions has been obtained. While this is a nonlinear system of differential equations, there are some methods of solution which can yield extremely useful information.

Prior to outlining these solution methods, it is worthwhile to take a step back and consider which pieces of information are most important to obtain in real time as a new disease is spreading. Namely, the basic reproduction number, $R_0$, is one of the most desirable pieces of information to obtain. If the basic reproduction number is known, its value can be used by doctors and governmental officials to determine whether a disease spread will be a minor event, an epidemic, or in the worst case a global pandemic. Now, in the simplest case, one can model the number of infected cases over time, in the early stages of a disease spreading, as simple exponential growth with a logarithmic growth rate of
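As a minimal sketch of how this system behaves, the following Python code integrates the three equations numerically with a fixed-step fourth-order Runge-Kutta method. The parameter values (β = 0.5, γ = 0.2, N = 1000, ten initial infections) and all function names are illustrative assumptions, not values taken from any data set discussed here.

```python
def sir_rhs(s, i, r, beta, gamma, n):
    """Right-hand sides dS/dt, dI/dt, dR/dt of the SIR system."""
    new_infections = beta * s * i / n   # flow S -> I
    removals = gamma * i                # flow I -> R
    return -new_infections, new_infections - removals, removals


def simulate_sir(s0, i0, r_init, beta, gamma, days, dt=0.1):
    """Integrate the SIR system with fixed-step RK4; returns (t, S, I, R) tuples."""
    n = s0 + i0 + r_init                # total population, constant in a closed system
    s, i, r = float(s0), float(i0), float(r_init)
    history = [(0.0, s, i, r)]
    for step in range(1, int(round(days / dt)) + 1):
        k1 = sir_rhs(s, i, r, beta, gamma, n)
        k2 = sir_rhs(s + dt / 2 * k1[0], i + dt / 2 * k1[1], r + dt / 2 * k1[2], beta, gamma, n)
        k3 = sir_rhs(s + dt / 2 * k2[0], i + dt / 2 * k2[1], r + dt / 2 * k2[2], beta, gamma, n)
        k4 = sir_rhs(s + dt * k3[0], i + dt * k3[1], r + dt * k3[2], beta, gamma, n)
        s += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        i += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        r += dt / 6 * (k1[2] + 2 * k2[2] + 2 * k3[2] + k4[2])
        history.append((step * dt, s, i, r))
    return history


# Illustrative (assumed) parameters, not fitted to any real outbreak.
history = simulate_sir(s0=990, i0=10, r_init=0, beta=0.5, gamma=0.2, days=100)
peak_t, _, peak_i, _ = max(history, key=lambda row: row[2])
print(f"peak infections ~ {peak_i:.0f} at day {peak_t:.1f}")
```

Because the three right-hand sides sum to zero, the total S + I + R stays equal to N throughout the run, which is a quick sanity check on any implementation.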

$K=\frac{d}{dt}ln\left[I\left(t\right)\right].$

Then, if it is possible to estimate from data that after time $T_I$ an individual infects exactly $R_0$ new individuals, the value of K can alternatively be computed as

$K=\frac{\ln\left(R_0\right)}{T_I}$

and from this information both the initial growth rate and the basic reproduction number can be approximated. However, it is extremely difficult to obtain the needed information in real time, and often, by the time it is discovered that a disease is actively spreading, the growth has moved much further along in its evolution than can be captured by such a simple model. Thus, a more advanced methodology is commonly required to estimate $R_0$, which is where our full 3×3 system of differential equations can be utilized.

The formal definition of the basic reproduction number is
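To make the relationship between K, $T_I$ and $R_0$ concrete, here is a small Python sketch. The case counts and the value of $T_I$ are hypothetical numbers chosen only for illustration, and the function names are our own.

```python
import math


def growth_rate_from_counts(i_start, i_end, elapsed_days):
    """Average logarithmic growth rate K over an interval, assuming pure exponential growth."""
    return (math.log(i_end) - math.log(i_start)) / elapsed_days


def r0_from_growth_rate(k, t_i):
    """Invert K = ln(R0) / T_I to get R0 = exp(K * T_I)."""
    return math.exp(k * t_i)


# Hypothetical counts: infections rise from 100 to 400 over 5 days.
k = growth_rate_from_counts(100, 400, 5)
print(round(k, 3))                          # 0.277
# Hypothetical T_I of 5 days.
print(round(r0_from_growth_rate(k, 5), 2))  # 4.0
```

The estimate is only as good as the assumption of pure exponential growth, which, as noted above, tends to hold only in the earliest stage of an outbreak.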

$R_0=\beta\tau$

where β is as previously defined, but often unknown and very difficult to estimate in real time, and τ is the mean infectious period, which can often be estimated in real time by observation of real cases. Hence, if one can create a mathematical model from data in real time and then compare it to the model obtained from the solution of our differential equations, it may be possible to extract the value of β, and thus accurately estimate the value of $R_0$.

To begin, we take our system of three differential equations
$\frac{dS}{dt}=-\beta\frac{S\left(t\right)I\left(t\right)}{N},$
$\frac{dI}{dt}=\beta\frac{S\left(t\right)I\left(t\right)}{N}-\gamma\ I\left(t\right),$
$\frac{dR}{dt}=\gamma\ I\left(t\right).$

And we note that, since the sum S+I+R is assumed to be the constant total population N, we have

$\frac{dS}{dt}+\frac{dI}{dt}+\frac{dR}{dt}=0$

And, since $\tau=\frac{1}{\gamma}$, the basic reproduction number can be rewritten as

$R_0=\frac{\beta}{\gamma}.$

Substituting $\beta=R_0\gamma$ into the second equation of the system gives

$\frac{dI}{dt}=\left(R_0\frac{S}{N}-1\right)\gamma\ I\left(t\right)$

and we can observe that the value within the parentheses carries a lot of practical information. First, note that $\gamma\ I\left(t\right)$ is always positive while the disease is present. Thus we can conclude that if $R_0>\frac{N}{S}$ then the disease will continue to spread, as the sign of $\frac{dI}{dt}$ is positive and hence the number of infected individuals is increasing. Thus, with initial data, a disease can be classified as one that will spread to an epidemic outbreak, or in the worst case a pandemic, if

$R_0>\frac{N}{S(0)}$

while it will not be expected to if

$R_0<\frac{N}{S\left(0\right)}$
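The threshold condition can be checked directly from initial data. The following Python sketch does so; the population values and $R_0$ are hypothetical numbers used only to illustrate both sides of the inequality.

```python
def infection_growth_factor(r0, s, n):
    """The factor (R0 * S / N - 1) in dI/dt = (R0 * S / N - 1) * gamma * I."""
    return r0 * s / n - 1


def will_spread(r0, s0, n):
    """True when R0 > N / S(0), i.e. infections are increasing at t = 0."""
    return r0 > n / s0


# Hypothetical population of 1000 with R0 = 2.5.
print(will_spread(r0=2.5, s0=990, n=1000))  # True:  N/S(0) is about 1.01
print(will_spread(r0=2.5, s0=300, n=1000))  # False: N/S(0) is about 3.33
```

The second call shows the same disease failing to take off once the initial susceptible pool is small enough, which is the effect that measures reducing $S\left(0\right)$ aim for.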

This is one of the most powerful pieces of information that doctors and governmental officials can be provided with in real time when making policy decisions for a new disease. The only major issue is that, by the time enough data has been collected to obtain it, the disease has often spread so far that there is not much that can be done to stop it, other than attempts to reduce $S\left(t\right)$ through measures such as social distancing.

Now, to actually obtain a model solution for this system of equations, some algebraic manipulation is needed. To begin, if the first equation is divided by the third, it is obtained that

$\frac{dS}{dR}=-R_0\frac{S\left(t\right)}{N}$

and by routine variable separation for the first order differential equation this becomes

$\frac{1}{S}dS=\left(\frac{-R_0}{N}\right)dR.$

Then, by integrating both sides from 0 to t, it is found that

$\ln\left[S\left(t\right)\right]-\ln\left[S\left(0\right)\right]=\frac{-R_0}{N}R\left(t\right)+\frac{R_0}{N}R\left(0\right)$

from which the solution

$S\left(t\right)=S\left(0\right)e^{-\frac{R_0}{N}\left(R\left(t\right)-R\left(0\right)\right)}$

is obtained. While this is not an exact solution to our 3×3 system of equations, as the value of $S\left(t\right)$ obtained depends on the value of $R\left(t\right),$ it is an extremely useful piece of information which can be used in real time. If, in real time, data is collected to measure the value of $R\left(t\right)$ and possibly even $I\left(t\right)$, this solution, which can be rewritten as

$N-I\left(t\right)=R\left(t\right)+S\left(0\right)e^{-\frac{R_0}{N}\left(R\left(t\right)-R\left(0\right)\right)}$

is an extremely powerful solution. From here, further methods can be applied either to obtain individual closed-form solutions for $S\left(t\right)$ and $I\left(t\right),$ or to yield an approximation for the basic reproduction number. Namely, if a nonlinear regression data fit is performed (for example, using the nls command in R or an equivalent routine in Python), a data-fit solution can be created, and by comparing the two one can extract an approximate value for $R_0$.
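As a hedged illustration of how such a fit might extract $R_0$, the sketch below generates synthetic observations of $I\left(t\right)$ and $R\left(t\right)$ from a simulated SIR run and then recovers $R_0$ by least squares on the relation $N-I=R+S\left(0\right)e^{-\frac{R_0}{N}\left(R-R\left(0\right)\right)}$. A one-parameter golden-section search stands in here for a general nonlinear regression routine such as nls; the data, the parameter values, and the function names are all assumptions made for the example.

```python
import math


def rhs(state, beta, gamma, n):
    s, i, r = state
    infections = beta * s * i / n
    return (-infections, infections - gamma * i, gamma * i)


def rk4_step(state, dt, beta, gamma, n):
    def shifted(k, h):
        return tuple(x + h * dx for x, dx in zip(state, k))
    k1 = rhs(state, beta, gamma, n)
    k2 = rhs(shifted(k1, dt / 2), beta, gamma, n)
    k3 = rhs(shifted(k2, dt / 2), beta, gamma, n)
    k4 = rhs(shifted(k3, dt), beta, gamma, n)
    return tuple(x + dt / 6 * (a + 2 * b + 2 * c + d)
                 for x, a, b, c, d in zip(state, k1, k2, k3, k4))


def synthetic_observations(s0, i0, r_init, beta, gamma, days, dt=0.05):
    """Daily (I, R) pairs from a simulated outbreak; stands in for real data."""
    n = s0 + i0 + r_init
    state = (float(s0), float(i0), float(r_init))
    obs = []
    for _ in range(days + 1):
        obs.append((state[1], state[2]))
        for _ in range(int(round(1 / dt))):
            state = rk4_step(state, dt, beta, gamma, n)
    return n, obs


def sse(r0, n, s0, r_init, obs):
    """Sum of squared residuals of N - I = R + S(0) exp(-R0/N (R - R(0)))."""
    total = 0.0
    for i_obs, r_obs in obs:
        predicted = r_obs + s0 * math.exp(-r0 / n * (r_obs - r_init))
        total += (n - i_obs - predicted) ** 2
    return total


def fit_r0(n, s0, r_init, obs, lo=0.1, hi=10.0, tol=1e-6):
    """Golden-section search for the R0 that minimizes the squared residuals."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if sse(c, n, s0, r_init, obs) < sse(d, n, s0, r_init, obs):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return (a + b) / 2


# Synthetic data generated with beta / gamma = 0.5 / 0.2 = 2.5 (assumed values).
n, obs = synthetic_observations(s0=990, i0=10, r_init=0, beta=0.5, gamma=0.2, days=60)
r0_hat = fit_r0(n, s0=990, r_init=0, obs=obs)
print(f"estimated R0 = {r0_hat:.3f}")
```

With real data the observations would be noisy and $S\left(0\right)$ itself uncertain, so a full regression routine that also reports standard errors would be the more appropriate tool.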

While this is an extremely interesting mathematical model, we will not develop the topic further here. Rather, in the next section we will quickly look at one real-world data example, the spread of COVID-19 in New York City, to illustrate how the statistical methods learned can be applied to actually fit a regression solution. Then we will end this textbook in the subsequent section by discussing an extremely interesting research question: how to determine what factors drive citizens to actively participate in social distancing measures.

This is a very important topic to study, as within the scientific community it is accepted that once a disease is actively spreading as an epidemic, or even worse a pandemic, the most effective way to stop the disease is social distancing. This is due to the fact that, while these mathematical solutions are beautiful to study, in real life once a disease starts spreading from person to person there is no direct way to stop it. The only tools available to us as humans in such a battle against a virus are common-sense preventative measures that limit the spread of the disease in daily life (e.g. wearing filtration masks and gloves or other personal protective equipment), or societal measures such as social distancing. While this is accepted by most in the community, we can now validate it mathematically, since from our analysis of the second equation we noted that the disease will not become a pandemic if

$R_0<\frac{N}{S\left(0\right)}$

While we do not have any control over $R_0$, and the value of N is fixed, we can greatly reduce the value of $S\left(0\right)$ by implementing social distancing measures. In fact, as that value approaches zero the right-hand side of this bound becomes infinite, which ensures society will win the battle against the virus, or here, mathematically, a battle against the direction of an inequality symbol!

6.2 Data illustrations of seeking the exponential inflection point

| Date | Cumulative deaths (day-over-day change) | Cumulative cases |
|---|---|---|
| 3/1/2020 | 0 | 1 |
| 3/3/2020 | 0 | 3 |
| 3/4/2020 | 0 | 8 |
| 3/5/2020 | 0 | 11 |
| 3/6/2020 | 0 | 18 |
| 3/7/2020 | 0 | 25 |
| 3/8/2020 | 0 | 46 |
| 3/9/2020 | 0 | 103 |
| 3/10/2020 | 0 | 173 |
| 3/11/2020 | 1 | 326 |
| 3/12/2020 | 2 | 681 |
| 3/13/2020 | 2 | 1299 |
| 3/14/2020 | 4 | 1941 |
| 3/15/2020 | 10 | 2969 |
| 3/16/2020 | 19 | 5085 |
| 3/17/2020 | 26 | 7532 |
| 3/18/2020 | 47 | 10481 |
| 3/19/2020 | 71 | 14159 |
| 3/20/2020 | 116 | 18144 |
| 3/21/2020 | 157 | 20744 |
| 3/22/2020 | 205 | 23288 |
| 3/23/2020 | 288 | 26790 |
| 3/24/2020 | 382 | 31180 |
| 3/25/2020 | 503 (+31.7%) | 35914 |
| 3/26/2020 | 688 (+36.8%) | 40840 |
| 3/27/2020 | 897 (+30.4%) | 45829 |
| 3/28/2020 | 1162 (+29.5%) | 49214 |
| 3/29/2020 | 1444 (+24.3%) | 52651 |
| 3/30/2020 | 1757 (+21.7%) | 58666 |
| 3/31/2020 | 2126 (+21%) | 63834 |
| 4/1/2020 | 2545 | 68859 |
| 4/2/2020 | 3001 | 74504 |
| 4/3/2020 | 3465 | 80020 |
| 4/4/2020 | 3942 | 83772 |
| 4/5/2020 | 4475 | 87386 |
| 4/6/2020 | 5024 | 93592 |
| 4/7/2020 | 5599 | 99511 |
| 4/8/2020 | 6118 | 104915 |
| 4/9/2020 | 6638 | 109781 |
| 4/10/2020 | 7137 | 114022 |
| 4/11/2020 | 7645 | 117588 |
| 4/12/2020 | 8172 | 120304 |
| 4/13/2020 | 8699 | 123514 |
| 4/14/2020 | 9181 | 127569 |
| 4/15/2020 | 9606 | 131366 |
| 4/16/2020 | 9987 | 134819 |
| 4/17/2020 | 10334 | 138318 |
| 4/18/2020 | 10682 | 140397 |
| 4/19/2020 | 11031 | 142679 |
| 4/20/2020 | 11355 | 146393 |
| 4/21/2020 | 11640 | 149395 |
| 4/22/2020 | 11924 | 152809 |
| 4/23/2020 | 12208 | 155596 |
| 4/24/2020 | 12480 | 157994 |
| 4/25/2020 | 12691 | 159508 |
| 4/26/2020 | 12899 | 160498 |
| 4/27/2020 | 13114 | 162728 |
| 4/28/2020 | 13292 | 165369 |
| 4/29/2020 | 13449 | 167633 |
| 4/30/2020 | 13590 | 169555 |

Now, prior to conducting the analysis of this data set, it is useful to note that, in real time, it is preferable to look at a very simple measure: the rate of change of the cumulative count, not the actual increase in raw numbers. For example, the day-over-day change from March 24th to March 25th was 31.7%, which is easily computed as the cumulative count reported on March 25th minus the cumulative count up to March 24th, with the result divided by the cumulative count up to March 24th to make it a percentage ratio.

$\%=\frac{\#new}{cumulative}=\frac{total-cumulative}{cumulative}=\frac{503-382}{382}\approx31.7\%$

This very simple computation is one of the most important data values to watch in real time: while the number of new cases may still be large, it is the percentage change that really tells the story of when the spread is slowing. Obviously, in the early days of the disease spread that value will change rapidly due to small numbers, but as the disease progresses this number becomes more stable. Namely, in the New York City data this percentage change was growing up to March 26th, when it reached its maximum value of just under 37%. Then the value steadily declined over the next few days, to 30.4% on March 27th and 29.5% on March 28th. The value continued a steady decline, reaching 19.7% on April 1st and then falling to 9.3% on April 8th, staying below 20% for the remainder of the month. Thus, this numerical value is a measure of when the spread of the disease starts to slow, and as one can see within this timeline, the peak corresponds to the time shortly after strict social distancing measures were put in place for NYC residents.

Furthermore, it is debatable which measure is the most accurate to use, the number of infected or the number of deaths. While neither is without some doubt, it is generally accepted that the number of deaths is more practical in real time, since knowing the total number of infected persons would require testing each member of the population, which is not practical in real time. However, to study the models of the prior section in relation to data analysis methods, a data fit for $I\left(t\right)$ is desired. Thus we will conduct one now, but we will not address the actual validity of the data obtained, nor any corrections to the data values that advanced sampling methods could provide.

If a simple regression model is run on the natural log of the number of cases during the month of March, the following result is obtained.
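The day-over-day percentage change described above is straightforward to compute. The following Python sketch applies it to the five cumulative counts listed in the data for March 24th through March 28th; the function name is our own.

```python
def day_over_day_change(cumulative):
    """Percentage change between consecutive cumulative counts, rounded to one decimal."""
    changes = []
    for previous, current in zip(cumulative, cumulative[1:]):
        if previous == 0:
            changes.append(None)  # change is undefined when the previous count is zero
        else:
            changes.append(round(100 * (current - previous) / previous, 1))
    return changes


# Cumulative counts for 3/24 through 3/28, taken from the data above.
counts = [382, 503, 688, 897, 1162]
print(day_over_day_change(counts))  # [31.7, 36.8, 30.4, 29.5]
```

The output reproduces the percentages annotated in the data, with the peak of 36.8% falling on March 26th.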

$\ln{\left(\hat{y}\right)}=0.37x+1.62$

Thus, one can back-solve this to obtain the approximate exponential growth model for the number of infected cases as

$I\left(t\right)\approx I_0e^{0.37t}$, with $I_0=e^{1.62}\approx5.05$,

where t plays the role of the regression variable x, measured in days.

From this, one can compare to either of two things: the formal solution from the methods of the prior section, hence matching parameters, or a training/testing data set from current data. Either way, the model should be effective at making short-term predictions of the future spread of the disease. It would be an interesting study to repeat this same exercise at different times in the future; one could then conduct a post hoc statistical analysis to examine whether measures taken by local authorities had any effect on the spread.
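A log-linear regression of this kind can be sketched in a few lines of Python using ordinary least squares on the natural log of the March case counts listed at the start of this section. How the days are indexed is an assumption here (days since March 1st, with no entry for March 2nd, matching the data), so the fitted intercept may differ somewhat from the one quoted above; the slope should land close to 0.37.

```python
import math

# Cumulative case counts for March 2020 from the data in this section (no row for 3/2).
days = [0] + list(range(2, 31))  # assumed indexing: days since 3/1/2020
cases = [1, 3, 8, 11, 18, 25, 46, 103, 173, 326,
         681, 1299, 1941, 2969, 5085, 7532, 10481, 14159, 18144, 20744,
         23288, 26790, 31180, 35914, 40840, 45829, 49214, 52651, 58666, 63834]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(days, [math.log(c) for c in cases])
print(f"ln(y) = {slope:.2f} x + {intercept:.2f}")

# Back-solve to the exponential model I(t) ~ I0 * exp(slope * t).
i0_hat = math.exp(intercept)
print(f"I(t) ~ {i0_hat:.2f} * exp({slope:.2f} t)")
```

Refitting on a later window, for example April alone, would give the kind of before-and-after comparison suggested above.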