Survival Analysis, Simulation and Sample Size

The nice thing about working as a consultant is the variety of problems one is confronted with. Just as I was preparing the next blog post about the possibilities of simulation in statistics, I received a request to help design an experiment to determine the life-time of a chemical product. Life-time analysis – better known as survival analysis – is a large and interesting part of statistics. We in the “normal” Six Sigma world do not see much of it, but in DFSS projects it is one of the more important elements.

Today, Minitab provides some help in a relatively new sub-menu, and there are specialized programs such as Weibull++ aimed at specialists. Of course, R has a wealth of libraries to use as well. On the other hand, there is nothing better than a simulation to build intuition around a problem, so I decided to go that way, using R for the ease of writing scripts. A further problem in these cases is that calculating sample sizes is not easy and is not really well documented – in Minitab, for example.

So, I wanted to simulate some life data analysis and especially look at the power of the analysis as a function of external stress – something even Weibull++ will not provide, as the company sells a separate program to plan HAST tests. (For those less versed in DFSS: HAST is the abbreviation of Highly Accelerated Stress Test. The idea is that one can define a stress factor (like weight for mechanical assemblies or temperature for electronic equipment) and assume that there is a relationship between the distribution of survival times and the level of the stress factor – the higher the stress, the faster a breakdown occurs. Once this relationship is known, from theory or possibly from earlier experiments, one can run a test for a much shorter duration at an elevated stress level and work out the survival distribution for the normal stress level from the data.)

In our situation we had basically no knowledge of the relationship between the stress factor and the life-time, so what we needed to do was to run life-time tests at different levels of the stress factor, work out the relationship from the results, and THEN use these results to extrapolate the life-time distribution at normal levels. Of course, the easiest would be to simply run the life-time tests at the normal levels; we just do not have the time and the resources to wait for months and months until the product dies.

However, if we want to see the effect of higher stress levels on the product, we absolutely need to know the power of the statistical method used to test the data. This determines the sample size we need at each setting, and as usual, we want to avoid both the risk of not seeing anything in a hopelessly underpowered study and that of wasting a lot of product and effort by using a higher than necessary sample size.

Estimating the power is simple in principle. We define an effect strength, we simulate the data with the defined influence of the stress factor and run the analysis. We repeat this step 2000 (or more) times, count how many times we get a p-value that is less than 0.05, divide by the total number of simulated experiments and there we are: we have a power estimate. Of course, the details are a bit more cumbersome.

 

First, the R code to do this is here:

# Sample size determination for accelerated life testing
# Assume the survival time distribution is exponential, that is
# no manufacturing errors and no aging - constant hazard rate
# Assume that the hazard rate is dependent on a stress factor F

# To define a realistic parametrization we use historical knowledge:
# we know that at F=20 the MTTF = 3*30 = 90 days
# lambda = 1/MTTF = 1/90
# the effect size is given as a reduction of the MTTF - so an increase in the
# forcing factor from 20 to 40 reduces the MTTF by DM days

require(survival)

F = c(20, 40, 60, 80)

# DM is the effect size in MTTF days - in this case a change of 20 in the
# stress factor reduces the MTTF by 10 days - a weak effect
# DM = 10

# Percentage of censored values - for now the censoring is assigned randomly
# CensPer = 0.0

lambda0 = 0.011   # = 1/MTTF = 1/90 days

# Function to estimate the power for a setup consisting of a given sample size,
# effect size and percentage of censored values
SingleSetupRun = function(Nrun=100, DM=10, Ns=10, CensPer=0){
  ps     = numeric(Nrun)
  scales = numeric(Nrun)
  # FOR Nrun times DO
  for(i in 1:Nrun){
    Fs = numeric(0)
    Ts = numeric(0)
    # FOR each value of F DO
    for(n in 0:3){
      # Calculate the lambda corresponding to an MTTF reduced by n*DM days
      lambda = lambda0/(1 - lambda0*n*DM)
      # Add the F values to the vector
      Fs = c(Fs, rep(F[n+1], Ns))
      # Generate Ns samples from the exponential described by lambda
      Ts = c(Ts, rexp(Ns, lambda))
    # ENDFOR
    }
    # Build a censoring vector (1 = failure observed, 0 = censored)
    cs = rep(1, length(Fs))
    # A percentage equal to CensPer is censored, chosen at random
    cptr = sample(1:length(Fs), as.integer(CensPer*length(Fs)), replace=FALSE)
    cs[cptr] = 0
    # Build the data frame
    sdata = data.frame(Fact=Fs, Stime=Ts, Cens=cs)
    S = Surv(sdata$Stime, sdata$Cens)
    # Perform the regression
    mod = survreg(S ~ sdata$Fact, dist="weibull")
    # Get the p-value as printed by survival:::print.summary.survreg
    ps[i]     = summary(mod)$table[11]
    scales[i] = summary(mod)$scale
  # ENDFOR Nrun
  }
  print(sprintf("Power is: %.2f, for effect %d and sample size %d",
                sum(ps < 0.05)/length(ps), DM, Ns))
  sign.ind = which(ps < 0.05)
  print(sprintf("Percent scale>1 %.2f", sum(scales[sign.ind] > 1)/length(sign.ind)))
}

 

Now, let us look at some details.

Setting the parameter of the exponential distribution

 

In life data analysis, probably the most important decision is the choice of the distribution of the survival times. The simplest model is the one where the survival times are described by the exponential distribution. This is equivalent to the assumption that there are no memory effects present – that is, the remaining time until a failure is independent of the age of the product. A different way to describe this is through the concept of the hazard. The hazard at time t is defined as the probability that the product fails exactly at time t, having survived until this moment. The memoryless property (which is equivalent to exponentially distributed survival times) simply says that the hazard is constant in time. To make things even simpler, the hazard is the only parameter of the exponential distribution – defined as “lambda” in the code above. Generating random values from a given exponential distribution is simply done by calling the R function rexp(n, lambda).
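As a quick sanity check (just a sketch, not part of the experiment itself), one can generate a large exponential sample with rexp and verify that its mean recovers the MTTF:

# Quick sanity check: the mean of exponential survival times is the MTTF = 1/lambda
lambda = 1/90                  # hazard corresponding to an MTTF of 90 days
mean(rexp(100000, lambda))     # should come out close to 90 days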

 

So, if we want to simulate different effect sizes we just need to change the lambda parameter accordingly. There is the crux, however – what do we mean by accordingly? We (at least I) are not trained to think in terms of hazards – so how can we decide whether changing the lambda parameter by 0.1 is a strong or a weak effect? I honestly do not know. So I had to find a way to make the effect size easier to grasp, and I came up with the idea of using the Mean Time To Failure, or MTTF. This is simply the expected value of the exponential distribution – and despite its name it is the time at which not 50% but roughly 37% of the tested items still survive. It can be shown that lambda = 1/MTTF, which is handy because most manufacturers provide the MTTF number. Given that this is a time value, we can relate to it much more easily than to the concept of the hazard, so it becomes easy to define a weak, medium or strong effect. We know that our MTTF under normal, unstressed conditions should be around 90 days. So an effect that shifts the MTTF by around 10% can be considered weak. Knowing this, we can work backwards to the change in lambda that reduces the MTTF by 10 days (10%-ish). The details are in the comments of the code.
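In other words, the parametrization in the code simply shifts the MTTF and converts back to a hazard rate; a minimal illustration of the arithmetic:

# Working backwards from a shifted MTTF to the hazard rate
MTTF0 = 90                  # unstressed MTTF in days
DM = 10                     # weak effect: the MTTF is reduced by 10 days
lambda0 = 1/MTTF0
1/(MTTF0 - DM)              # hazard at the stressed level, 1/80 = 0.0125
lambda0/(1 - lambda0*DM)    # the same value, written as in the simulation code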

Censoring

The other special feature of life data analysis is that we have two types of data. There is the “real” type, where the value is simply the time of failure, but we also have “censored” data where we did not see any failure but had to interrupt the observation. E.g. we planned the experiment for 8 weeks and at that point some items just kept on living. This is valuable information and it would be folly to discard it. But we cannot enter a value for the failure time either – it could be the next second or the next month, who knows? In life data analysis we mark these points as “censored” and we can still use them in the analysis – but obviously this analysis has to be of a different type from a normal least squares regression. For the power analysis, we also have the intuition that censoring will reduce the power – a censored data point simply does not carry all the information a “normal” point has. In our R function, CensPer is the variable giving the percentage of the available data that is censored. I simply attached the censored label randomly to data points – a better way would be to assume that the censored values are mostly those with a high life-time.
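If one wanted that more realistic scheme, the random labels could be replaced by censoring everything that survives beyond a fixed study end – a sketch, assuming the 8-week cut-off from the example and the Ts vector of generated failure times:

# Sketch of administrative (Type I) censoring instead of random censoring
study_end = 8*7                     # planned study length: 8 weeks, in days
cs = as.numeric(Ts <= study_end)    # 1 = failure observed, 0 = still alive at the study end
Ts = pmin(Ts, study_end)            # censored items get the study-end time as their observation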

Scale  and power

Before looking at the results we still need to discuss another detail. When analyzing life data, as mentioned before, we can decide which distribution we assume for our data – and the simplest choice is the exponential, with a constant hazard rate. However, in real life we cannot be absolutely sure that this assumption holds – so, in a real situation we should not pick the exponential but something that allows a little more flexibility. Fortunately, there is such a choice, the Weibull distribution, which is flexible enough and contains the exponential distribution as a special case. So, in the analysis our best option is to pick the Weibull distribution, and we should pick this option also when looking at the power in simulations. If you check the code, the line

mod=survreg(S~sdata$Fact, dist="weibull")

is running the regression, with the Weibull distribution as our choice. This has an additional advantage for the power analysis. If we look at the results of the regression, we shall see a calculated value called “scale”. This gives us the direction of the effect that was detected by the regression – if the scale is greater than 1, the hazard increases with increasing stress factor; if it is smaller than 1, the hazard decreases with the stress. By looking at the distribution of the scale values we can judge how well a regression with the given sample and effect size can determine the direction of the effect. In principle, it is not enough to have a p-value less than 0.05 (a statistically significant effect); we would also need a scale that is greater than 1, as we know that we set up the simulation in a way that the hazard increases with the strength of the factor.

Results

Well, the first and most important result is that the code above gives a pretty good idea of what the power will look like. Feel free to paste the code into RStudio and experiment with it.
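For example, the settings discussed below can be reproduced with calls like these (the exact numbers will differ from run to run because of the random sampling; set.seed makes a run repeatable):

set.seed(123)                                        # optional, for repeatable numbers
SingleSetupRun(Nrun=2000, DM=10, Ns=5)               # weak effect, tiny sample
SingleSetupRun(Nrun=2000, DM=10, Ns=20)              # weak effect, larger sample
SingleSetupRun(Nrun=2000, DM=30, Ns=5)               # strong effect
SingleSetupRun(Nrun=2000, DM=30, Ns=5, CensPer=0.5)  # strong effect with 50% censoring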

Now some results. This will not come as a surprise, but small effect sizes and small sample sizes result in very low power – it is almost hopeless to expect to see a significant effect. Moreover, the direction of the effect will be variable and misleading: for an effect of 10 days of shift in the MTTF and a sample of 5 per factor setting I got only 14% significant tests (a power of 14%), and of those only 16% had a scale > 1 – basically useless for practical purposes.

Increasing the sample size strongly increases the power, but has only a mediocre effect on identifying the right direction of the effect. At 10 samples per factor value the power increased to 18%, and of those 27% identified the right direction. At a sample size of 20 the power was 29% and scale > 1 was at 32%. The numbers at the completely unrealistic sample size of 100 are: power at 84%, but percent scale > 1 still only at 46%.

So, we can hope to identify the existence of an effect with this method, but the direction will be iffy even at very high sample sizes. However, if we have a stronger effect, both numbers improve drastically:

At an effect of 30 days on the MTTF, even at a sample size of 5 (!) we would have a power of 99% with scale > 1 at 89%. This improvement grows in a very non-linear way – at an effect of 20 days we would still only have a power of 45%, with the direction correctly identified in only 26% of the cases. So, definitely, the method becomes more reliable at stronger effects.

How does censoring affect the power? As expected, all other things being equal, a higher percentage of censored values reduces the power. For example, if we stay with the previous case but introduce censoring of 50% of our generated data, the power decreases from 99% to 76% and the percent scale > 1 from 89% to 75%.

The effect is not very strong though, and 50% censored is rather extreme, so there is hope that we can use this method, provided we manage to identify a suitable, effective stress factor.

There is a lot more to find out with this simulation, and I encourage everyone to investigate this further on their own.

 

And, as always, any feedback is warmly welcome.

Solving a Probability Problem Through Simulation

I am hoping to extend the blog post about the end of statistics (https://rforsixsigma.wordpress.com/2017/05/03/the-end-of-statistics-as-we-know-it/) with several more entries discussing examples and methods that illustrate the theoretical conclusions of the first entry. Be warned that these entries might contain R code, which is not for the faint of heart (kidding).

The coupon collector’s problem

First, let us return to the problem mentioned in my last post. It comes from a scientifically minded grandmother who wants to figure out how many roller-coaster rides are needed for her grandson to ride at least once in every carriage of a roller-coaster. There are 32 carriages in total and the customers are randomly assigned to a carriage before each ride. (This was an actual problem posed on the www.talkstats.com website.)

As somebody pointed out on the website, this is a known mathematical problem, called (surprisingly) the coupon collector’s problem and formulated as the question of how many coupons we need to buy if we want to collect one of each type (anyone with small children will be familiar with this, from collecting all NBA player cards to all dinosaurs). Of course there are many more practical formulations of this problem, but for the example’s sake we can stick to the roller-coaster rides.

In theory

The advantage of this being a known problem is that we already have a full theoretical solution, easily available on the wiki (https://en.wikipedia.org/wiki/Coupon_collector%27s_problem). Reading this entry we find that the expected number of rides is E(N) = N*H(N), where H(N) is the Nth harmonic number, and that the variance is less than (N*pi)^2/6. The harmonic number is calculated as H(n) = 1 + 1/2 + 1/3 + … + 1/n.

Using Excel I get H(32) = 4.05, so the expected number of rides will be 32*4.05 ≈ 130, with a standard deviation somewhat less than 41 (again, using the formula from the wiki site). So, assuming that the number of rides is normally distributed (which is a pretty wild assumption and I am by no means sure that it is warranted), in 95% of the cases the number of necessary rides will be between 130-2*41 and 130+2*41, that is, in the interval (48, 212) – and that is about all we can say.
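These numbers are easy to verify in R as well (a quick check of the formulas from the wiki page):

# Expected number of rides and the bound on the standard deviation
N = 32
H = sum(1/(1:N))     # harmonic number H(32), about 4.06
N*H                  # expected number of rides, about 130
N*pi/sqrt(6)         # upper bound on the standard deviation, about 41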

In practice

The first, obvious problem with the theoretical solution is that one needs to be aware of the existence of this alternative formulation. Remember, our problem was to find the number of rides and had no obvious relationship whatsoever to coupon collecting. It takes a deep knowledge of probability theory to recognize that the alternative formulation is known and is a solved problem (or, possibly, access to a good statistics help website and a large portion of luck). Most of us practitioners will not have that exposure to probability theory, and luck is, well, not reliable.

The second problem is that even knowing this alternative formulation will not necessarily help to answer the practical questions raised in real life. Even the most loving grandmother will have to plan for a budget, so simply knowing the 95% confidence interval of the number of rides is not enough. How can she get a feel for what is to be expected? (As a side note, if the number of rides were really normally distributed, then the mean and the variance would be enough to fully describe the situation, but we do not know whether this is true or not. The central limit theorem might be an indication, but again, how close are we to normality in this special case of 32 carriages?)

So, chances are that we would not even know of the existence of the theoretical solution, and even if we did, we would just want to know more. In both cases, the answer is to simulate the problem.

Simulation 

My immediate reaction to situations where the math is not available (to me, at least) or the results are not really of practical value, is to try to simulate the problem and just see what happens. An immediate benefit is, always, that we can develop an intuition of what is going on, something that is very valuable for practitioners and is totally missing from the theoretical solutions. So, what to do in this particular case?

Obviously, we only need to simulate the draws of the carriage numbers. We need to generate sequences of random numbers from 1 to 32 and count how many numbers we need to draw until each number from 1 to 32 has appeared in the sequence. Thinking about this a bit more, there is a slight problem. Theoretically, we might end up with very long sequences (potentially infinitely long ones), as we can have sequences that never contain some of the numbers. Given that we want results in a reasonable amount of time, the naïve approach might not work.

The alternative is to generate sequences of fixed length and to count how many of those contain the full set of numbers from 1 to 32. This gives us the cumulative distribution function (https://en.wikipedia.org/wiki/Cumulative_distribution_function). The values of this function give us the probability that the number of rides will be less than a given number – so F(50) is the probability of needing less than 50 rides, F(300) the probability of needing less than 300 rides, and so on. Calculating this function has the double advantage of a fixed run-time (no problems with infinitely long sequences), and it is also exactly the information we need to budget the rides.

Code

This is the way I coded the simulation. It is definitely NOT an optimized version, but for my purpose it runs fast enough, even at 5000 trial runs per experimental setting, on my laptop. By fast enough I mean I barely have time to make a fresh cup of coffee before it is done. If you are not into R, just read the comment lines (starting with #); they should give you an idea of the intention of the code.

# Define the steps
n=seq(from=32, to=532, by=4)
# Define the vector holding the results
Prob=numeric(length(n))
# Define the number of samples for each step
RunLength=5000
# Index of the save vector
Ind=1
# FOR each step DO
for(NTrial in n){
  # Set the counter to 0
  Count=0
  # FOR RunLength times DO
  for(i in 1:RunLength){
    # Generate a random sample of size "step" from the numbers 1:32, with replacement
    s=sample(1:32, NTrial, replace=TRUE)
    # IF the sample contains all the numbers THEN
    if(length(unique(s))==32){
      # Increase the counter
      Count=Count+1
    # ENDIF
    }
  # ENDFOR
  }
  # Save the probability of finishing in less than "step" runs
  Prob[Ind]=Count/RunLength
  # Move the index one position forward
  Ind=Ind+1
# ENDFOR
}

The result is a list of probabilities, one for each number of trials, giving the chance of finishing in fewer trials than that number – for instance, the probability of getting into each carriage in less than 100 rides is 23%.

To have a 50% probability of finishing one needs to go for 124 rides and so on. The cumulative distribution function will look like this:

 

Cumulative plot
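The quoted numbers can be read off directly from the two vectors, and the plot itself is a single line (a sketch; the probabilities will differ slightly between runs):

Prob[which(n == 100)]        # estimated probability of riding every carriage in less than 100 rides
n[min(which(Prob >= 0.5))]   # smallest number of rides giving at least a 50% chance
plot(n, Prob, type="l", xlab="Number of rides", ylab="Probability of riding every carriage")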

What did we gain, what did we lose?

In comparison to the theoretical solution, we definitely gained clarity, in my opinion. With the distribution function in a table we can answer any practical question in an easily understandable way – questions like “what if we had 16 instead of 32 carriages?” or “what can we expect for a given budget?” and so on. They will either be answered directly with this code or with a simple tweak to it, like replacing 32 with 16 in the first line. Also, I may be mistaken in this, but I have a strong feeling that in any company we will have more people who are able to explain a few lines of R code than people who are able to work with harmonic numbers.

But, as everywhere, we lose something as well. There is a special kind of beauty in seeing mathematical results like the ones for the coupon collector’s problem. The formula has a nice link to the old Zeno paradox about why Achilles will never be able to outrun a turtle. Seeing such a pattern turn up in the seemingly everyday problem of a grandmother planning the rides of her grandson is really satisfying, and we might lose such insights in the future. Also, anyone interested in the long-run average behaviour of the system (where the number of carriages tends to infinity) will be better served by the mathematical theory.

However, I strongly doubt that the grandmother in question, or a laboratory manager trying to solve a practical problem, would consider these insights a great loss when weighed against the possibility of getting relevant answers quickly and in a way everybody can understand.

 

The End of Statistics as We Know It

Admittedly this is a provocative title, but it is in line with developments I see in the field, so bear with me. To come to the point quickly: I believe that we are about to see a fundamental change in the way statistics will be used in the future – a move away from the mathematical methods and towards more computationally intensive and pragmatic methodologies. This would mean that we shall worry much (much) less about the normal distribution of our data, and probably a lot more about the correctness of our libraries and scripts. This might also change the way statistics is trained and taught – I would expect a more rigorous separation between practical and theoretical statistics, where, naturally, theoretical statistics would remain the heavily mathematical field it is today, but practical statistics would move more towards engineering and computer science, with a heavy emphasis on computing and programming.

History

This change is made possible by the development of computers and sensors. Historically, statistical methods were developed, mostly in the 1930s, in a situation where data was difficult to come by in quantity and computations were done by hand. It is quite natural that the methods that emerged relied heavily on model assumptions and formulae, as well as on the necessity to draw conclusions from relatively small numbers of data points. Truth be told, Fisher also developed exact, numerical methods that are very well suited to numerical analysis, but these were seen as wildly impractical at the time. So, in this traditional world of statistics one could save a lot of effort in data collection and computation by utilizing mathematical models. Most prominently, the assumption of an approximately normal distribution was the key to many of these methods.

This also ensured that statistics was (is) seen as a mostly mathematical discipline – given some initial assumptions, one develops a mathematical procedure to decide a practically relevant question, and the math ensures that as long as the assumptions hold, the answer will be right. The crux of the matter is that the type of questions we can ask is somewhat limited by the mathematical scaffolding, and, more crucially, checking the assumptions becomes just as important as the answer to our question. This can easily be seen in any statistics forum even today – a majority of the questions asked are some variation on the topic “is this data normally distributed?”, often framed as “can I use ANOVA (t-test, regression, etc.) on this data?”

The real question, though, is a different one, namely: “is there an indication in this data that there is an effect present, as distinct from simple random noise, and if yes, how strong is this effect?” Of course, the mathematically involved traditional methods also answered this question (though in a surprisingly roundabout way), but the answer was more or less along the lines of “if your data conforms to the assumptions of the theory then…”. Hence the need to always check the assumptions first.

Today

In many of the practical problems we see today, even the question of the existence of an effect is less important than a different question: “can I predict future values based on present observations?” Of course, if we can prove the existence of an effect with our present data, then this, by definition, will help us to predict future values. If, after running a DoE, we can select a few factors that have an effect, we can build a model and use it to predict the future performance of the system. The traditional limitation is that we can only build a model if we have, or somehow turn our data into, the type that conforms with the modelling assumptions (is normally distributed, mostly).

Already in the thirties, statisticians knew that the assumptions about the data distribution are not necessary to answer the question about an effect or predictive power. For example, Fisher proposed a so-called “exact test” for binomial data, which did not rely on the normality assumption: it calculated, in theory, all possible permutations of the data and counted how many of those permutations produced an outcome at least as extreme as the actually measured data. This percentage gave a reliable p-value, without any approximations or assumptions – the problem was that no one would be able to calculate all the permutations for any reasonably sized data set by hand. Even today, Minitab, for example, offers a small hidden check-box to calculate Fisher’s exact test – but the default is the approximate p-value based on the assumption of normality.
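As an illustration with made-up counts (both functions are in base R), here is the exact test next to the usual approximate one:

# Hypothetical 2x2 table: 3 failures out of 20 parts vs. 9 failures out of 20 parts
tab = matrix(c(3, 17, 9, 11), nrow=2, byrow=TRUE)
chisq.test(tab)    # the usual approximate test
fisher.test(tab)   # Fisher's exact test, based on enumerating all possible tables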

New ways

What has changed since the thirties is, of course, the availability of computing power unimaginable to the pioneers of statistics. Today, rationally seen, there is no need for most of the assumptions, as long as we have a few milliseconds to wait for the results of a simulation (well, maybe a bit more for large data sets). This is a lucky situation, also because we see the development of many practical statistical methods where the mathematical foundations are much less well understood than for ANOVA, for instance – or, even if they are understood, the math behind them is too complex for practical use. Even if we look at something apparently as simple as multiple regression, the problem of selecting the right model (by picking the correct variables, adding interactions and nonlinear terms) is by far not solved in a satisfactory manner. The modern answer, in most of these cases, is surprisingly simple from a theoretical point of view – pick a random percentage of the data to build the model and test the predictions of the model on the part that was not picked. Measure the prediction error, then repeat with a different split. Select the model that performs best in terms of prediction error. The method, called cross-validation, can be applied to any model of any mathematical complexity. It will not deliver a theoretical value like the p-value, but it will, with high probability, deliver a model that is actually better at predictions.
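A minimal sketch of the idea in R, using the built-in mtcars data purely as an illustration (a repeated random train/test split rather than a full k-fold cross-validation):

# Repeated train/test splits: fit on 70% of the data, measure prediction error on the rest
set.seed(1)
rmse = replicate(100, {
  train = sample(nrow(mtcars), size = floor(0.7*nrow(mtcars)))  # random 70% for fitting
  fit   = lm(mpg ~ wt + hp, data = mtcars[train, ])             # one candidate model
  pred  = predict(fit, newdata = mtcars[-train, ])              # predict the held-out 30%
  sqrt(mean((mtcars$mpg[-train] - pred)^2))                     # root mean squared prediction error
})
mean(rmse)   # compare this number between competing models and keep the best one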

This is not to say that the new methods do not require expertise – but they require expertise of a different kind. Intuition about how random data behaves will be more necessary than ever. My personal experience is that, having studied and worked with statistics for many years, I was sorely lacking this intuition until I started to write simulations. Working with some small data set in the idealized world of mathematics is simply not enough to build intuition. Even today, when I at least know that my intuition is weak, I am surprised by how often my predictions are wrong – but today, knowing this, I just fire up my computer and run a few simulations.

To close, I would like to address a small challenge to anyone willing to take it. In my preferred statistics forum someone asked this eminently practical question: she took her grandson to a roller-coaster ride, where each customer was assigned randomly to a carriage. Her grandson was so enthusiastic about the ride that he wanted to repeat it until he had ridden at least once in each of the carriages. The roller-coaster had 32 carriages in total. How many rides will be needed? What would be your guess? Less than 100, less than 1000? Can you simulate the problem?

For those with a mathematical inclination: this is known as the coupon collector’s problem. The mathematical theory of it is well developed and relatively easy to find on the net. See if you can work out an estimate of the most probable number of rides, the expected number and maybe the variance. Then try to simulate the problem and compare the effort 😊

 

Logistic regression vs. Chi-Squared test

A few years ago I developed a simulation to teach the application of statistical methods to real-life problems. I had to figure out a situation where the students had to apply statistics to continuous and to discrete variables. For the discrete case I used the number of defects caused by materials coming from 7 different sources – actually it was a sci-fi simulation where they needed to decide which planet to mine for fuel, given that the fuel had impurities that caused malfunctions in the rocket engine. To enforce the use of discrete data I designed the story so that much of their measurement equipment was defective and all they could do was count the occurrences of a malfunction (a yes-no type of measurement).

This was easily accepted by the trainees, and references to Halo and Battlestar Galactica also helped. Now, my problem was to analyse this data. What I needed was something similar to ANOVA – a test that takes the discrete data for each category and in one operation gives me a ranking of which sources cause significantly more malfunctions than the others – with the emphasis on the word “significantly”. There are two naive solutions to this, both of which are fast, simple and wrong – either we just compare the numbers as measured, or we run as many two-sample proportion tests as we have pairs.

Simply comparing the numbers is obviously wrong – we completely miss the point of using statistics and of course we cannot differentiate between random flukes and systematic effects. Running pairwise tests is more subtly wrong. Each test has a false alarm probability of 5%. So, if I run 10 comparisons, say, the probability of having at least one false alarm is already about 40% – definitely not what we want. In the continuous case ANOVA handles this problem, because it can run one test and get the answer for all comparisons, with the probability of a false alarm still at 5%. A so-called post-hoc analysis will then deliver all the answers we need to questions about how the levels of the factor are grouped and where we have extreme values.

So, the simple question was: what is the analogue of ANOVA for the discrete case? The surprising and somewhat unpleasant answer was that, except for the case where my factor has two levels (the so-called 2×2 table), there is no really satisfactory method that would do all that ANOVA does for the continuous case. The closest we get is the chi-squared test, but it is not really convincing, even though it would also be the textbook solution for this problem.

For a long time we used the simulation with the recommendation to use the chi-squared test – and glossed over eventual objections by sheepishly remarking that we just did not have a better idea – until last week, when I happened upon a discussion in my favourite statistics forum on the same point. I realized then that there is a nice and easy way to get the information ANOVA would provide in the continuous case: just run a logistic regression with the discrete factor only.

The received wisdom is that binary logistic regression is only useful if our Y is discrete (binary) and our X is continuous – and this is indeed the standard application. In the meantime, the methods and the statistical software we use have evolved to the level where we can use discrete factors as well as continuous ones in a logistic regression – indeed Minitab raises no qualms if we only feed it a single discrete factor.

This has huge advantages compared to a chi-squared test. We actually build a statistical model, so we can estimate the importance of each level of the factor and we can make predictions based on this model – we basically get the same level of knowledge as with ANOVA. The only caveat is that people might not be comfortable with using logistic regression, and even less with using one with discrete factors only. But this is a question of education and opportunity – even the relatively unsophisticated Minitab offers this possibility, so it is time we move away from the age of hand calculations (the overwhelming reason for preferring a chi-squared test is that it is simple enough to be performed by hand) and use the computer for what it was meant to do: perform complex calculations. Then we can do what we are meant to do – interpret the output and provide relevant answers to our clients’ real-life questions.
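A small made-up example of what this looks like in R (the counts are invented purely to show the mechanics):

# Hypothetical defect counts from 7 sources (planets), 100 parts tested per source
defects = c(5, 8, 21, 6, 9, 30, 7)
tested  = rep(100, 7)
planet  = factor(paste0("planet", 1:7))
# The textbook answer: one overall chi-squared test, but no model behind it
chisq.test(cbind(defects, tested - defects))
# Logistic regression with the discrete factor only: an actual model, with a
# coefficient and p-value per level and the ability to make predictions
mod = glm(cbind(defects, tested - defects) ~ planet, family = binomial)
summary(mod)
predict(mod, newdata = data.frame(planet = "planet3"), type = "response")  # estimated defect rate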

A Snark in the Cupboard – Autocorrelated Data

There are problems that simply refuse to go away, and Six Sigma as a theory probably has a larger than normal share of them – like the (in)famous 1.5 sigma shift or the endless and useless discussion about what common cause variation is. There is such a problem that has spooked me for quite a long time, though it is not as well known as the ones mentioned above. It does have links to these immortal ones though.

In every statistical test we use in a normal Six Sigma project (t-test, ANOVA etc.) the assumption is that our data points are independent of each other – or, more precisely, that the error terms, once we have separated out the effects, are independent of each other. This means the system under study has no memory – the deviation from the norm seen today gives us no information about the size of the deviation tomorrow.

It is possible that somewhere out there we have processes that behave like this – but real life is different. Take the game of darts, for example. When I play I have a quite regular pattern – in one game I hit the mark and in the next one I will very probably fail. If an observer knew my performance in one game she could predict my performance with much better than random chance. The same is true for the weather. Where I live, temperatures over a year are in the range between -15 and +40, but no one in his right mind would assume that on any given day -15 is just as likely as +40 for the next day’s temperature, though this is precisely what we should assume if the temperatures were independent. But, of course, they are not – the system has a pretty long-term memory with some occasional shocks – we generally expect that the next day’s temperature will be about the same as today’s.

Most industrial processes have some memory in the same way. If we are measuring the size of sweets on a production line we do not expect to see one that weighs 2 grams sitting in the same row as one that weighs 7 grams – we will see very similar pieces with a slow variation of the weight over a longer period of time. The statistical term for this memory effect is that the measurements are autocorrelated.

Strongly autocorrelated series behave in a special way – they will generally wander around, taking long excursions above or below the mean, and then suffer a shock and move in the other direction. If we are unlucky enough to catch such a phase in our data collection we might get a wildly wrong estimate of the mean. The variance is probably even worse – as the values tend to be close to each other, we will definitely underestimate the variance if we do not measure long enough.

Now imagine that we use the measurement to calculate a process capability – overestimating the mean and underestimating the standard deviation means that we will have a completely wrong picture of our process. So, it is evident that we need to make sure that we measure long enough to capture the whole picture. But how long is long enough? Does this sound like the familiar question – how long do I have to measure to be sure I capture long-term variation? What is certain is that there will be no magical 1.5 in the answer. What we would need to do (actually always, IMHO) is to test our process data for autocorrelation first and then decide how to measure. Minitab (and of course R as well) offers a number of tools that will do this for us – especially the autocorrelation and the partial autocorrelation functions.

Of these two, the partial autocorrelation is probably the better, because it corrects for the effect of one correlation “contaminating” the next. Imagine there is a strong link between the value measured today and the value measured yesterday. If we naively computed the correlations we would necessarily observe a correlation between today’s value and the value from the day before yesterday – not because there is a direct link, but because the value from two days ago influences the value from one day ago, which in turn influences the value from today. So the autocorrelation function would show a number of slowly decreasing correlations over several days, even though in fact we only have an effect for one day. This problem can be detected and corrected for mathematically – and this is what the partial autocorrelation function does for us.

So, what should we do in real life? I would suggest that for any measurement in time order (process data) we should definitely test for autocorrelation BEFORE we start drawing conclusions or applying statistical tests. If we have a significant autocorrelation, we need to apply the necessary corrections, which can be as simple as applying a correction factor to the measured standard deviation. If we have enough data we could also simply apply a sampling scheme where we always skip enough data points to make sure the next data point is no longer correlated with the previous one.
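A quick way to build this intuition is to simulate a strongly autocorrelated (AR(1)) series and look at the two functions – a sketch using only base R:

# Simulate a strongly autocorrelated process and inspect it
set.seed(42)
x = arima.sim(model = list(ar = 0.8), n = 200)  # AR(1) with correlation 0.8 between neighbours
plot.ts(x)   # long excursions above and below the mean - looks like "patterns"
acf(x)       # slowly decaying correlations over many lags
pacf(x)      # only the first lag stands out once the carry-over effect is removed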

We also need to school our intuition – autocorrelated data tends to look like something that definitely has a pattern (aka special cause variation). I attached the picture of such a data set, with the outputs of the acf and pacf functions. I bet anyone who saw this run chart would immediately start to look for periodic patterns and special causes in the process – and this would be like the Hunting of the Snark (an Agony in 8 Fits), definitely not what we want in our projects: https://en.wikipedia.org/wiki/The_Hunting_of_the_Snark

The portrait and signature of a Snark below:

Logistic regression analysis of life expectancy

The next assignment is to use logistic regression to analyse our data. Life expectancy is a continuous variable, so normally one would not use logistic regression for this analysis. For the purposes of this course I will transform my continuous variable into a discrete one with two levels, LOW and not LOW (aka HIGH), with the R code:

LE$LOW=as.factor(LE$LE<65)

so that life expectancy values below 65 years count as LOW. Where did the number 65 come from? Take a look at this graph, which shows the distribution of life expectancies across all countries – there seems to be a split somewhere close to the value 65.

LE Histogram
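The histogram itself is a one-liner (a sketch, assuming the life expectancy column is LE$LE as in the code above; the exact binning of the original figure may differ):

hist(LE$LE, breaks = 20, main = "Life expectancy across countries", xlab = "Years")
abline(v = 65, lty = 2)   # the chosen cut-off between LOW and not LOW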

Logistic regression in R is mostly done using the function glm with the family parameter set to "binomial". Running this gives the following result:

> model=glm(LE$LOW~., family="binomial", data=scaledIV)
Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 

 

No model was built because the algorithm did not converge, and the reason for it is what is generally called complete or quasi-complete separation – we have some variables that completely split the data set into LOW and not LOW (for details see http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm).

Given our foreknowledge from the multiple regression model, the first candidate for causing the quasi-complete separation would be the hivrate. There are several ways to handle complete separation, depending on what we want to achieve with the model. In this case we want to see which factors play a role in deciding whether a country ends up in the LOW or HIGH group, so eliminating hivrate as an explanatory variable is acceptable – this would definitely not be an option if we were trying to build a predictive model, for example.

Running the logistic regression without hivrate does converge and gives the following:

Call:
glm(formula = LE$LOW ~ ., family = "binomial", data = scaledIV[, 
    -6])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.34513  -0.15802  -0.00322   0.24734   2.59204  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -5.881013   1.539680  -3.820 0.000134 ***
alcconsumption        0.290602   0.138858   2.093 0.036367 *  
armedforcesrate      -0.272118   0.236565  -1.150 0.250025    
breastcancerper100th  0.004764   0.048912   0.097 0.922416    
co2emissions         -0.160791   0.257335  -0.625 0.532083    
femaleemployrate      0.047841   0.055835   0.857 0.391544    
internetuserate      -0.226375   0.068223  -3.318 0.000906 ***
polityscore          -0.034174   0.074296  -0.460 0.645542    
suicideper100th      -0.054913   0.080000  -0.686 0.492455    
employrate           -0.062924   0.076186  -0.826 0.408844    
urbanrate             0.010546   0.031410   0.336 0.737053    
logincome            -0.738198   0.521130  -1.417 0.156619    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153.935  on 127  degrees of freedom
Residual deviance:  52.109  on 116  degrees of freedom
AIC: 76.109

 

Now we can proceed to eliminate insignificant variables in exactly the same way as in the case of the multiple regression. Note that we have no R-squared value to judge the quality of our fit, but we can use the Akaike Information Criterion (shown as AIC in the output). The lower the value, the better.

After elimination we get the following model:

Call:
glm(formula = LE$LOW ~ ., family = "binomial", data = scaledIV[, 
    -c(2, 3, 4, 5, 6, 8, 9, 10, 11)])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.12439  -0.19087  -0.00293   0.31316   2.55414  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -6.00511    1.37909  -4.354 1.33e-05 ***
alcconsumption   0.31902    0.10877   2.933  0.00336 ** 
internetuserate -0.23452    0.05636  -4.161 3.17e-05 ***
logincome       -0.77489    0.36492  -2.123  0.03372 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 153.935  on 127  degrees of freedom
Residual deviance:  56.296  on 124  degrees of freedom
AIC: 64.296

 

Not only do we have a smaller model, our AIC value improved as well from 76 to 64.

So, if we completely disregard the hivrate (which seems not to be strongly correlated with the other variables, as seen in the VIF factors of the “normal” multiple regression), alcohol consumption increases the odds of a country being classified as LOW life expectancy, whereas internet user rate and income decrease these odds.

Could we conclude that the internet is actually saving lives? There are many potential confounding factors that were not considered in this data set and that could explain this effect – level of education, quality of the infrastructure and so on – so a direct cause-effect relationship cannot be concluded. The same argument applies to alcohol consumption – all we know is that countries with high alcohol consumption tend to be classified more frequently as LOW life expectancy.

How strong are these effects? First we calculate the 95% confidence intervals for the effects using the confint function:

> confint(model)
Waiting for profiling to be done...
                     2.5 %      97.5 %
(Intercept)     -9.2316766 -3.72705436
alcconsumption   0.1215276  0.55785085
internetuserate -0.3634238 -0.13867174
logincome       -1.5336562 -0.07326821

 

Then the corresponding odds-ratio ranges, by exponentiating the confidence limits:

> exp(confint(model))
Waiting for profiling to be done...
                       2.5 %     97.5 %
(Intercept)     9.788898e-05 0.02406361
alcconsumption  1.129221e+00 1.74691408
internetuserate 6.952917e-01 0.87051373
logincome       2.157454e-01 0.92935154
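The corresponding point estimates come from exponentiating the coefficients themselves:

exp(coef(model))   # point estimates of the odds ratios for each term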

 

So, increasing the alcohol consumption by one unit (refer to the gapminder code-book) increases the odds of being classified as a LOW life expectancy country by 12 to 75% (odds ratio between 1.12 and 1.75).

Increasing the internet user rate by one unit decreases the odds of being classified as LOW by 13 to 30%. Increasing the log of the per capita income decreases the odds by roughly 7 to 78%. The ranges of the odds ratios are huge (with the exception of the internet user rate), so all we can really say is that the effects are positive or negative; the magnitude of the effects is difficult to judge.

 

 

Multiple Regression for Life Expectancy in the Gapminder Dataset

The next assignment in the training is to extend the model with new variables. Although the single variable, the logarithm of per capita income, explained above 60% of the variability in life expectancy, it is worthwhile trying to add more variables to improve the model. It is generally just as instructive to see what does not get into the model as to see what makes it into the final selection.

As I planned to look at most variables I scaled them all first using the handy R function "apply" as below:

scaledIV=data.frame(apply(gap[,-c(1,10)], 2, scale, scale=FALSE))

where IV stands for Independent Variables, aka the X-es. I took the country names and of course the life expectancy out of the scaled data frame. I also considered taking out the variables that are directly related to life expectancy, like the breast cancer or suicide rates, but then decided against it. It could be interesting to see whether they have a statistically visible impact.

As multiple regression models omit data points that have missing values, I also reduced the data set to the complete lines with the function complete.cases. Now everything was set to create a multiple regression model. I used the backward elimination method and chose to do the steps manually instead of using the "step" function. The final model is:

> summary(mod1)

Call:
lm(formula = LE ~ hivrate + polityscore + employrate + logincome, 
    data = scaledIV)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.680  -1.715   0.242   2.087  12.028 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 69.62538    0.39228 177.487   <2e-16 ***
hivrate     -1.14597    0.10717 -10.693   <2e-16 ***
polityscore  0.16255    0.07168   2.268   0.0251 *  
employrate  -0.09297    0.04130  -2.251   0.0261 *  
logincome    3.90576    0.28135  13.882   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.284 on 123 degrees of freedom
Multiple R-squared:  0.8217,	Adjusted R-squared:  0.8159 
F-statistic: 141.7 on 4 and 123 DF,  p-value: < 2.2e-16

R-squared improved to 81%. Income and, surprisingly, HIV rate were the most significant factors. Just as surprisingly, the employment rate seems to have a negative effect on life expectancy, while the polity score seems to improve it.
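For anyone who wants to retrace the manual backward elimination, a single step looks roughly like this (a sketch only; it assumes, as in the logistic model above, that the life expectancy column LE$LE lines up row-for-row with the scaledIV data frame, and the variable dropped first is purely hypothetical):

# One step of manual backward elimination; repeat until only significant terms
# remain, or let step() do the same thing automatically
mod_full = lm(LE$LE ~ ., data = scaledIV)                  # start with all candidate predictors
summary(mod_full)                                          # look for the term with the largest p-value
mod_red  = update(mod_full, . ~ . - breastcancerper100th)  # drop that term (hypothetical choice) and refit
summary(mod_red)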

Checking the model quality: collinearity is of no concern, the Variance Inflation Factors are all safely below 5.

> vif(mod1)
    hivrate polityscore  employrate   logincome 
   1.063580    1.190520    1.159547    1.363082

 

This means that there is no confounding between the variables in the final model. However, I suspect that other factors that were not included in the data set might have a confounding effect, e.g. the dominant religion in a country.

The standardized residuals are between +/- 2 with a few exceptions:

MR_Residuals

but this shows that the model does not fit very well for some countries – something we already saw in the simple regression model.

The normality of the residuals is not very satisfactory either, also hinting at an unsatisfactory fit:

QQPlot

Checking the leverage I see no worrying points:

Leverage

In summary, income per person kept its position as the most important predictor of life expectancy, but the HIV rate is also a globally important factor, whereas breast cancer and suicide rates are not visible as global predictors of life expectancy. Some other factors that one would expect did not make it into the final model, e.g. alcohol consumption or CO2 emissions (as a proxy for air pollution in general). The model could still be improved, e.g. by including non-linear terms or new variables to better explain the life expectancy for low-income countries, where we have the worst deviation from normality for the standardized residuals.