Learning from a Line simulation

In this installment I would like to describe a first set of small experiments we can make using our simulation program, and show what we can learn from them. This is by no means an exhaustive list – it is meant to encourage you to try it out for yourselves.

First, a quick description of the program: in order to make the effects more visible, we changed the original dice game by making the difference between the high- and low-variation cases stronger. In the simulation, each workstation can process up to 50 pieces per round, the actual number processed being uniformly distributed between 1 and 50 in the high-variance case and between 24 and 27 in the low-variance case.

The average number of processed parts and the variance follow from the formulae for the uniform distribution, mean = (a+b)/2 and variance = (b-a)^2/12, where a and b are the endpoints of the distribution. This gives a mean of (1+50)/2 = 25.5 for the high-variance case and (24+27)/2 = 25.5 for the low-variance case. The variances, however, differ: roughly (50-1)^2/12 = 200 in the high-variance case versus (27-24)^2/12 = 0.75 in the low-variance case.

The simulation has a graphical UI, shown in the graph. We can select the number of workstations and the number of simulation runs in increments of 500. We can also choose the line feed strategy – that is, the way new orders come into the system. Random Feed means that they arrive the same way the workstations process, uniformly distributed between 1 and 50 with high variance, while Regular Mean Feed means that we feed the line with the average number of orders it can handle, 25.5 per run, with low variance.

There is also a button called Re-run, which simply creates a new run with the same settings as the previous one. For our first experiments this is the most important button, because by re-running the same setup we can really see the influence of randomness.

We have three output graphs (a minimal R sketch of the simulated mechanics follows the list):

  1. The WIP after each run of the simulation. This is the number of work items that are somewhere inside the production line at the end of a run – the number of semi-finished goods.
  2. A graph describing the throughput of the line. This one is a bit trickier: to make the differences clearly visible we do not display the number of finished pieces per run but the deviation from the expected number of finished goods per run. So a value of 0 roughly means 25 finished pieces, a value of 20 means 25+20 = 45 finished pieces, and a value of -20 means 5 finished pieces. While this takes a bit of getting used to, it shows the deviations from the expectation much more clearly than the raw numbers.
  3. The WIP at each workstation at the end of the simulation. This gives us an idea of where semi-finished goods accumulate, which is often proposed as a way of finding the bottleneck of a line. We will soon know better.
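Since the mechanics just described are simple, here is a minimal R sketch of them – my own reconstruction for illustration, not the actual program behind the UI. Each workstation processes the smaller of its capacity draw and the WIP waiting in front of it, and we track total WIP and finished pieces per run:

set.seed(42)
simulate_line = function(n_stations = 5, n_runs = 1000,
                         draw = function() sample(1:50, 1),    # high-variance workstation
                         feed = function() sample(1:50, 1)) {  # Random Feed
  wip = rep(0, n_stations)            # semi-finished goods waiting at each station
  total_wip = numeric(n_runs)
  finished = numeric(n_runs)
  for (r in 1:n_runs) {
    wip[1] = wip[1] + feed()          # new orders enter the line
    for (s in 1:n_stations) {
      processed = min(draw(), wip[s]) # a station cannot process more than is available
      wip[s] = wip[s] - processed
      if (s < n_stations) wip[s + 1] = wip[s + 1] + processed
      else finished[r] = processed    # the last station ships finished goods
    }
    total_wip[r] = sum(wip)
  }
  list(total_wip = total_wip, finished = finished, wip_by_station = wip)
}
res = simulate_line()
plot(res$total_wip, type = "l", xlab = "run", ylab = "WIP in the line")

Swapping draw and feed for something like function() sample(24:27, 1) would approximate the low-variance and Regular Mean Feed cases.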

So, let us run a first simulation! I set the number of workstations to 5, for 1000 runs. As I am interested in the effects of randomness, I pick the worst scenario – highly variable demand and highly variable machines. The output is on the graph.

Remember, from a traditional point of view this is a nicely balanced line. Still, after 1000 runs we have almost 2000 pieces scattered around production (graph 1), and our throughput was more or less unpredictable, swinging between 20% (5 pieces) and 180% (45 pieces) of the expected volume. Obviously we had some good and some bad days, without any chance of predicting what the next day would bring. By looking at the last graph we can readily identify the bottleneck – workstation 2, with 3 trailing quite close behind it. So, to improve the line performance we need to focus on these two, right?

Before investing resources in improvement work, press the Re-run button. You will obviously get different results, but very likely something similar to what I see now: the bottleneck at 2 disappeared and this workstation now has the best performance, while workstation 1 accumulated the most WIP. Obviously the threat of starting an improvement project at their workstation worked wonders – whereas at workstation 1 people became complacent and need more supervision. Also, the WIP in general evolved in a much better way: we had a long stretch of runs where the WIP actually decreased, and we are now at about 1500 pieces instead of 2000!

So, let us press Re-run again: the WIP trend is a complete mess now, steadily increasing to about 2500 at the end of the runs, and workstation 2 started to accumulate more WIP – however, workstation 1 is still the black sheep. Press again: workstation 1 has improved, but 2 became the bottleneck again, while our focus on the overall WIP trend paid off and we are again down at about 1500 pieces.

The one thing that did not change across all these simulations was the daily performance – it stayed just as variable in all 3 re-runs.

Of course this is all a joke – there is no way the simulation would react to the threat of an improvement project. What we see is the effect of random variations in the line, and the most visible one is the random movement of the “bottleneck”. This, by the way, is the core problem of the Theory of Constraints methodology – we hear very often that a line does not have a single bottleneck but “moving bottlenecks”. This is the effect of random variations in the line.

The second visible effect is the accumulation of WIP. We would naively expect no WIP, or very little, as the line is balanced on average. In reality the random variations cause ever larger queues at random locations in our balanced line, and the prediction of the factory physics formulae is that the WIP will continuously increase as a result. Looking at the graphs it does seem that the WIP is levelling off and stabilizing at around 2000 pieces, but this is an illusion. If you increase the number of runs to 2000 and beyond, you will see that the increase is continuous.

The third, even more worrying effect is the high variability of the throughput. This line, though balanced, can give no guarantee of the number of pieces produced per day (run), despite the best efforts of the production people. So, what will happen in real life? Important orders will be moved forward (expedited) while less important ones will stay in the WIP – all this will create even more randomness and increase the negative effects we already see. I think this is not unlike many situations we see in production lines today.

As we are not trained to discern random effects, we will inevitably come up with explanations for purely random phenomena – like my example of the threat of an improvement project having positive effects, or of people becoming complacent. This is also something we see very often in real life.

So, how would you investigate our line further? Please feel free to try.


A Simulation in Lean Says More Than 1000 Words

 

In general, we sadly lack an intuitive understanding of random processes. Several authors have written about this blind spot in recent years – Taleb in Fooled by Randomness (https://www.amazon.com/Fooled-Randomness-Hidden-Markets-Incerto/dp/0812975219/ref=sr_1_1?ie=UTF8&qid=1503991947&sr=8-1&keywords=fooled+by+randomness) or Sam L. Savage in The Flaw of Averages (https://www.amazon.com/Flaw-Averages-Underestimate-Risk-Uncertainty/dp/1118073754/ref=sr_1_1?s=books&ie=UTF8&qid=1503992030&sr=1-1&keywords=flaw+of+averages) – just to name a few.

To my mind, the basic problem is that we rarely have the opportunity to experience long stretches of random phenomena. Of course we have all seen dice being thrown and the stock market fluctuate – but this is not enough to develop an intuition about random events. To do that, we would need to observe and record long stretches of the phenomenon – and I have yet to meet the person who spent hours and hours throwing a balanced die and recording the results, not to mention analyzing them as well. (Well, there was a true renaissance personality called Cardano who did this in the 16th century, and wrote a book about it, but that was more than unusual.)

 

Studying and improving processes is a field where we really miss an intuitive understanding of how random variations influence the behavior of a system. In principle Six Sigma acknowledges the role of random variations but, sadly, not when looking at process mappings. Lean followed in the same footsteps, simply formulating the request to reduce variation as far as possible in order to achieve a uniform flow. So, in the end, some practical questions are left unanswered by both methodologies, such as: How much variation is acceptable? What precisely are the negative effects of randomness in our concrete process? How could we make a cost-benefit analysis of the variation reduction?

To answer these questions, we would need to understand the effects of randomness on a process. To my knowledge there is one discipline that really attempted to do this: Factory Physics (https://en.wikipedia.org/wiki/Factory_Physics). In this theory we apply queuing theory to manufacturing processes, and the result is a formula that describes precisely the effects of random variations on a manufacturing operation: the (in)famous Pollaczek-Khinchine equation, eventually combined with Little’s Law.
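As a rough sketch of what these formulas say (in the usual single-station queueing notation, not necessarily the exact form used in the Factory Physics text): the Pollaczek-Khinchine result gives the mean waiting time in front of a station as

Wq = (u/(1-u)) * ((1+cs^2)/2) * te

where u is the utilization of the station, cs^2 the squared coefficient of variation of the processing time and te the mean processing time, while Little’s Law connects time to inventory via WIP = throughput * cycle time. The first factor blows up as utilization approaches 1 and the second grows with variability – exactly the combination that makes a balanced but variable line accumulate WIP.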

Now, if you are the type of person who was trained in reading and interpreting equations (like physicists are, no surprise here), then all you need is to look at this equation to understand the effects of variation on a value stream. Unfortunately, very few people can do that, so for the rest of us the equations provide a tantalizing hint of what is going on but fail to deliver the kind of intuitive understanding we would find useful. To achieve that, we would need to be able to observe and track processes with random variations over a long period of time.

This is generally not feasible in real life, due to constraints of cost and time. Processes are rarely constant over a long enough period for us to observe their behavior at leisure. Changing some parameters, like the average processing time or the standard deviation of the processing time at a given step, just to observe the effects of such a change, is of course impossible. So we need a random system that is cheap and fast to change and also easy to observe.

This is why in the 1980s Goldratt invented the Dice Game – an abstract model of a processing line where the only influencing factor is the randomness of the processing times. The game is played as follows: 5-8 people sit at a table, each with a six-sided die. There are matchsticks to the right and to the left of each person, one person’s right matchstick heap being the left heap of the next person. There is also a “feeder” – someone who provides new matchsticks at the input of the process. At the sign of the moderator, each person throws the die and moves as many matchsticks from their left to their right as the count of points on the die. If they do not have that many sticks to their left (say they have 3 sticks and the die says 5), they move as many sticks as they can. The feeder inputs 3, then 4, then 3 again, and so on, at the beginning of each cycle.

 

The operation is then repeated for as many rounds as the moderator sees fit – generally not more than 20 cycles, to avoid boring people to death. Then the exercise is repeated, but instead of a die a coin is tossed. The rule is that heads means moving 3 sticks, tails means moving 4.

 

Now, consider the following: in both cases the average performance of each workstation will be 3.5 sticks per cycle, that is, the line is perfectly balanced in both cases. Before reading on, think about the following questions:

  1. Would you expect a difference between these two cases (die vs. coin)?
  2. Do you expect a difference in the throughput (finished products per cycle) between the two cases?
  3. Where do you expect to see the bottleneck?
  4. If each step can still move 3.5 pieces on average, how would you improve the process?

 

As we are missing an intuitive understanding of random processes, it is not easy to answer these questions. Getting enough people together to play the dice game is not easy either, especially if we want to see the longer-term effects – no one will waste time playing this for several hundred cycles, right? Still, understanding the behavior of the Dice Game is an important step towards understanding random variation in a value stream.

 

The solution is to write a computer simulation with a comfortable user interface and let people simulate the system. In this case the individual steps are fairly simple, so writing the simulation is not a complicated job, and adding a user interface is pretty much standard using R and the Shiny framework. This way we can play several thousand rounds without getting bored, measure the results, change the parameters or simply repeat the runs several times to see how the system behaves. There is no better way to develop our intuition, and in my experience there wasn’t a single person who was not surprised by some aspect of what they saw.
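For readers who prefer to experiment immediately, here is a small stand-alone sketch of the die vs. coin comparison in plain R – my own simplified version of the rules described above, not the Shiny application itself:

set.seed(1)
play_game = function(n_stations = 6, n_cycles = 1000,
                     throw = function() sample(1:6, 1),             # die variant
                     feed = function(cycle) if (cycle %% 2) 3 else 4) {
  sticks = rep(0, n_stations)
  shipped = numeric(n_cycles)
  for (cycle in 1:n_cycles) {
    sticks[1] = sticks[1] + feed(cycle)     # the feeder alternates 3 and 4
    for (s in 1:n_stations) {
      moved = min(throw(), sticks[s])       # move at most what is available
      sticks[s] = sticks[s] - moved
      if (s < n_stations) sticks[s + 1] = sticks[s + 1] + moved
      else shipped[cycle] = moved
    }
  }
  c(mean_throughput = mean(shipped), final_wip = sum(sticks))
}
play_game()                                       # die: mean 3.5, high variance
play_game(throw = function() sample(3:4, 1))      # coin: mean 3.5, low variance

Comparing the two calls already hints at the answers to the questions above.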

 

As it happens, we at ifss developed such a simulation, and it can be accessed by anyone via the Internet. It is not up all the time, but if anybody would like to figure out the answers to the above questions, we would be happy to start the simulation. Just send a message some time in advance and we shall find an opportunity to activate it.

So, feel free to explore and to start developing your intuition about random processes. We would be happy to hear from you, and if you try the simulated VSM Game, sharing what you experienced would be even nicer.

The Joy of Principal Component Analysis

Making sense of a large pile of data – say several hundred measurements of 5 to 10 different characteristics of a product – is a data analyst's dream come true. In manufacturing environments, especially in the chemical industry, it often happens that we measure a wealth of data on each finished product batch. Some of these measurements are important as they directly relate to customer requirements, some are “informative” only, and the goal of collecting all the data is to have an idea of what is happening in the production process. That is the theory, at least. Of course, life is always a bit more complicated, and many operations are simply overwhelmed by the amount of data without having adequate tools to make sense of it. So the data will be faithfully collected, after all it is in the SOP, but no conclusions will be drawn from it.

What I would like to show is a method to actually use this wealth of information. The method is by no means new, but it has definitely failed to find its way into the manufacturing environment. I still recall the first reaction of a colleague when I showed it to him: “oh, this is for PhD work”, he said – meaning more or less “this is way too complex and practically useless”.

The method is called principal component analysis. The idea is to generate artificial variables out of the set of really measured ones in such a way that the new variables capture most of the variation present in the data set. To be more precise, we generate linear combinations of the original variables such that the first linear combination captures the most of the total variation, the second captures the most of the residual variation, and so on. If there is a systematic trend in the data, the hope is that the first few artificial variables (called principal components) will capture this systematic pattern and the latecomers will only catch random variation. Then we can restrict our analysis to the first few principal components and have cleansed our data of most of the random variation.

To make this clearer, imagine a room with temperature sensors in each of its 8 corners. We will have 8 sets of measurements that are somewhat correlated but also vary independently of each other. Running a PCA (principal component analysis) will yield a first principal component that is essentially the average of the 8 measurements – as obviously the average temperature captures most of the variation in the room (just think winter vs. summer, night vs. day). The second principal component will capture the difference between the averages of the top 4 sensors and the bottom 4 sensors (it is generally warmer on top, right?), the third the difference between the side with the windows and the side without... and the rest probably only meaningless measurement variation and possibly some complex but small differences due to the patterns of air currents in the room.
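To see this in action without waiting for a real data set, here is a hedged little R sketch that fakes the sensor data (the split into a common room temperature and a top-of-the-room effect is my assumption, purely for illustration) and lets prcomp recover the structure:

set.seed(7)
n = 500
room = rnorm(n, mean = 21, sd = 4)          # common room temperature (season, day/night)
top_extra = rnorm(n, mean = 1, sd = 0.5)    # extra warmth shared by the upper sensors
sensors = sapply(1:8, function(i) {
  height_effect = if (i <= 4) top_extra else 0   # sensors 1-4 assumed to sit at the top
  room + height_effect + rnorm(n, sd = 0.3)      # independent measurement noise
})
pca = prcomp(sensors, scale. = TRUE)
summary(pca)                    # the first two components capture most of the variance
round(pca$rotation[, 1:2], 2)   # PC1 ~ overall average, PC2 ~ top vs. bottom contrast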

What did we gain? Instead of the original 8 values we only need to analyze 3 – and we will gain a much better insight into what is going on. We did not reduce the number of measurements though – we still need all the sensors to construct our principal components.

What did we lose? As my colleague put it – this is PhD work – so we lose a bit of transparency. It is relatively easy to explain a trend in a sensor at the top of the room, but a trend in the first principal component sounds like academic gibberish. So we will have to learn to explain the results and to relate them to real-life experiences. This is not as difficult as it sounds, but it definitely needs to be carefully prepared, with the message tailored separately to each practical case.

Now, to close with a practical example: I just had the good luck to run into a problem at a customer I work for. They measured 5 components of a product and had the respective data from 2014 onward. All data points represented product batches that were accepted by QA and delivered to the customers, as they were all in spec. The problem was that the customers still complained of highly variable quality, and especially of variations in quality over time.

Looking at five control charts, all consisting of quite skewed, non-normal data, basically just increased the confusion. There were regularly points that were out of control, but did we have more of them in 2015 than in 2014? And if yes, was the difference significant? No one could tell.

So, I proposed to analyze the data with PCA.

In R one easy way to do this is by using the prcomp function in the following way:

Results=prcomp(InputDataFrame, scale=TRUE)

Here the InputDataFrame obviously contains the input data – that is, a table where each column contains the values of one specific measurement of the product (like viscosity or pH or conductivity etc.) and each row is one set of measurements for one batch of the product. The scale = TRUE option asks the function to standardize the variables before the analysis. This makes sense in general, but especially when the different measurements are on different scales (e.g. the conductivity could be a number between 500 and 1000 whereas the pH is less than 10). If they are not scaled, the variables with greater values will dominate the analysis – something we want to avoid.

The Results object contains all the outcomes of the PCA. Its first element, sdev, is a vector of the standard deviations captured by each principal component. The first value is the greatest and the rest follow in decreasing order. From these values we can generate the so-called “scree plot”, which shows how much of the total variance is captured by the first, second etc. principal component. The name comes from the image of a steep mountainside with rubble at the bottom – scree. The steeper the descent, the more variation is captured by the first few principal components – that is, the clearer the discrimination between systematic trends and useless noise.

A quick way to generate the scree plot out of the results of prcomp would be: plot(Results$sdev^2/sum(Results$sdev^2), type = "l").

Simply calling plot on Results will give us a barchart of the variances captured by each principal component, which has comparable information, though it is not exactly the same.

The next element of the prcomp return structure is called “rotation” and it tells us how the principal components are constructed out of the original variables. It is best to see this in a practical example:

                         PC1            PC2
Turbidity             0.45146656    -0.59885316
Moisture              0.56232155    -0.36368385
X25..Sol.Viscosity.   0.06056854     0.04734582
Conductivity         -0.44352071    -0.59106943
pH..1..               0.52876578     0.39686805

In this case we have 5 principal components, but I am only showing the first two as an example. The first component is built as 0.45 times the standardized value for Turbidity, plus 0.56 times Moisture, plus 0.06 times the viscosity, minus 0.44 times the conductivity, plus 0.53 times the pH. For each measured batch we can generate a new data point by multiplying the standardized original measurements by these weights (called loadings in PCA language) and summing up.

What good does that bring us? The benefit is that, after looking at the scree plot, we can see that most of the variation in our data is captured by the first two (maybe three) principal components. This means that we can generate new artificial values that have only two components (PC1 and PC2 instead of the original 5: Turbidity, Moisture etc.) and still describe the systematic trends of our data. This makes graphical analysis much easier – we can restrict ourselves to calculating the new artificial values for PC1 and PC2 and look at a two-dimensional graph to spot patterns. It also means that we have eliminated a lot of random variation from our data: it is captured by the remaining components PC3 to PC5, which we conveniently leave out of the analysis. So, not only do we get a graph we can actually understand (I find my way in two dimensions much faster than in five), but we also manage to clean up the picture by eliminating a lot of randomness.

As it is, we don’t even need to do the calculation ourselves – the Results structure contains the recalculated values for our original data in an element called x, so Results$x[,1] contains the transformed values for the first principal component and Results$x[,2] those for the second. Plotting one against the other can give us a nice picture of the system's evolution.
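For completeness, the kind of plot shown below can be produced with something along these lines (batch_year is a hypothetical factor holding the production year of each batch, in the same row order as InputDataFrame):

plot(Results$x[, 1], Results$x[, 2], col = as.integer(batch_year), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(batch_year),
       col = seq_along(levels(batch_year)), pch = 19)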

Here is the resulting plot of PC2 versus PC1  for the example I was working on.

[Figure: scores plot of PC2 versus PC1, colored by year]

The colors show different years, and we can see that between 2015 and 2016 there was a very visible shift towards increasing PC1 values. Now, looking at the loadings, we see that the PC1 values increase if Turbidity, Moisture and/or pH increase and/or conductivity decreases. Remember, these are all changes that are well within specs and also within the control limits, so they were not discernible with simple control charts. Still, once we knew what we were looking for, trends in the increase of moisture and decrease in conductivity were easily identifiable.

Now we can identify the process changes that led to these trends and strengthen the process control. PCA can also be used to plot the measured values of future batches, to check that there is no creeping of the measured values in the future. Not bad for an academic and useless method, right?

Survival Analysis, Simulation and Sample Size

The nice thing about working as a consultant is the variety of different problems one is confronted with. Just as I was preparing the next blog-post about the possibilities of simulation in statistics, I received a request to help design an experiment to determine the life-time of a chemical product. Life-time analysis – better known as survival analysis – is a large and interesting part of statistics. We in the “normal” Six Sigma world do not see much of it, but in DFSS projects it is one of the more important elements.

Today Minitab provides some help in a relatively new sub-menu, and there are specialized programs such as Weibull++ which help specialists. Of course R has a wealth of libraries to use as well. On the other hand, there is nothing better than a simulation to build intuition around a problem, so I decided to go that way, using R for the ease of writing scripts. A problem in these cases is also that calculating sample sizes is not easy and is not really well documented, in Minitab for example.

So, I wanted to simulate some life data analysis and especially look at the power of the analysis as a function of external stress – something even Weibull++ will not provide, as the company sells a special program to plan HAST tests. (Just for those less versed in DFSS – HAST is the abbreviation of Highly Accelerated Stress Test. The idea is that one can define a stress factor (like weight for mechanical assemblies or temperature for electronic equipment) and assume that there is a relationship between the distribution of survival times and the level of the stress factor – like the higher the stress, the faster a breakdown occurs. Once this relationship is known, from theory or possibly from earlier experiments, one can run a test for a much shorter duration at an elevated stress level, and work out the survival distribution for the normal stress level from the data.)

In our situation we had basically no knowledge of the relationship between the stress factor and the life-time, so what we needed to do was to run life-time tests at different levels of the stress factor and work out the relationship from the results, THEN use these results to extrapolate the life-time distribution at normal levels. Of course, the easiest would be to simply run the life-time tests at the normal levels – we just do not have the time and the resources to wait for months and months until the product dies.

However, if we want to see the effect of higher stress levels on the product, we absolutely need to know the power of the statistical method used to test the data. This determines the sample size we need at each setting, and as usual we want to avoid both the risk of not seeing anything in a stupidly underpowered study and that of wasting a lot of product and effort by using a higher than necessary sample size.

Estimating the power is simple in principle. We define an effect strength, we simulate the data with the defined influence of the stress factor, and we run the analysis. We repeat this step 2000 (or more) times and count how many times we get a p-value that is less than 0.05; dividing by the total number of simulated experiments gives us the power estimate. Of course, the details are a bit more cumbersome.

 

First, the R code to do this is here:

# Sample size determination for accelerated life testing
# Assume the survival time distribution is exponential, that is
# no manufacturing errors and no aging - constant hazard rate
# Assume that the hazard rate is dependent on a stress factor F

# To define a realistic parametrization we use historical knowledge:
# we know that at F=20 the MTTF = 3*30 = 90 days
# lambda = 1/MTTF = 1/90
# the effect size is given as a reduction of the MTTF - so an increase in the forcing
# factor from 20 to 40 reduces the MTTF by DM days

require(survival)

F=c(20, 40, 60, 80)

# DM is the effect size in MTTF days - in this case a change of 20 grades
# reduces the MTTF by 10 days - a weak effect
#DM=10

# Percentage of censored values - for now the censoring will be randomly assigned
#CensPer=0.0

lambda0=0.011  # = 1/MTTF = 1/90 days

# Function to calculate the power for a setup consisting of a given sample size,
# effect size and percentage of censored values
SingleSetupRun=function(Nrun=100, DM=10, Ns=10, CensPer=0){
  ps=numeric(Nrun)
  scales=numeric(Nrun)
  # FOR Nrun times DO
  for(i in 1:Nrun){
    Fs=numeric(0)
    Ts=numeric(0)
    # FOR each value of F DO
    for(n in 0:3){
      # Calculate the lambda so that each step of F reduces the MTTF by DM days
      lambda=lambda0/(1-lambda0*n*DM)
      # Add the F values to the vector
      Fs=c(Fs, rep(F[n+1], Ns))
      # Generate Ns samples from the exponential described by lambda
      Ts=c(Ts, rexp(Ns, lambda))
    # ENDFOR
    }
    # Build a censoring vector (1 = observed failure, 0 = censored)
    cs=rep(1, length(Fs))
    # A percentage equal to CensPer is censored, picked at random
    cptr=sample(1:length(Fs), as.integer(CensPer*length(Fs)), replace=FALSE)
    cs[cptr]=0
    # Build the data frame and the survival object
    sdata=data.frame(Fact=Fs, Stime=Ts, Cens=cs)
    S=Surv(sdata$Stime, sdata$Cens)
    # Perform the regression
    mod=survreg(S~sdata$Fact, dist="weibull")
    # Get the p-value of the factor, as printed by survival:::print.summary.survreg
    ps[i]=summary(mod)$table[11]
    scales[i]=summary(mod)$scale
  # ENDFOR Nrun
  }
  print(sprintf("Power is: %.2f, for effect %d and sample size %d", sum(ps<0.05)/length(ps), DM, Ns))
  sign.ind=which(ps<0.05)
  print(sprintf("Percent scale>1 %.2f", sum(scales[sign.ind]>1)/length(sign.ind)))
}

 

Now, let us look at some details.

Setting the parameter of the exponential distribution

 

In life data analysis probably the most important decision is the choice of the distribution of the survival times. The simplest model is the one where the survival times are described by the exponential distribution. This is equivalent to the assumption that there are no memory effects present – that is, the remaining time until a failure is independent of the age of the product. A different way to describe this is by using the concept of the hazard. The hazard at time t is defined as the probability that the product fails right at time t, having survived until this moment. The memoryless property (which is equivalent to exponentially distributed survival times) simply says that the hazard is constant in time. To make things even simpler, the hazard is the only parameter of the exponential distribution – defined as “lambda” in the code above. Generating random values from a given exponential distribution is simply done by calling the R function rexp(size, lambda).

 

So, if we want to simulate different effect sizes, we just need to change the lambda parameter accordingly. Here is the crux, however – what do we mean by accordingly? We (at least I) are not trained to think in terms of hazards – so how can we decide whether changing the lambda parameter by 0.1 is a strong or a weak effect? I honestly do not know. So I had to find a way to make the effect size easier to grasp, and I came up with the idea of using the Mean Time To Failure, or MTTF. This is simply the expected value of the exponential distribution – and despite its name it is the time at which not 50% but roughly 37% of the tested items still survive. It can be shown that lambda = 1/MTTF, which is handy because most manufacturers provide the MTTF number. And, given that this is a time value, we can relate to it much more easily than to the concept of the hazard, so it will be easy to define a weak, medium or strong effect. We know that our MTTF under normal, unstressed conditions should be around 90 days. So an effect that shifts the MTTF by around 10% can be considered weak. Knowing this, we can work backwards to the change in lambda that will reduce the MTTF by 10 days (10%-ish). The details are in the comments of the code.
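As a quick sanity check of this parametrization (outside the function, using the same formula as in the code), the lambda used at each stress level indeed corresponds to an MTTF reduced by DM days per step:

lambda0 = 1/90
DM = 10
sapply(0:3, function(n) 1 / (lambda0 / (1 - lambda0 * n * DM)))   # MTTF: 90 80 70 60 days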

Censoring

The other special feature of life data analysis is that we have two types of life data. There is the “real” type, where the value is simply the time of failure, but we also have “censored” data, where we did not see any failure but had to interrupt the observation. E.g. we planned the experiment for 8 weeks and at that point some items just kept on living. This is valuable information, and it would be folly to discard it. But we cannot enter a value for the failure time either – it could be the next second or the next month, who knows? In life-data analysis we mark these points as “censored” and we can still use them in the analysis – but obviously this analysis has to be of a different type than a normal least-squares regression. For the power analysis we also have the intuition that censoring will reduce the power – a censored data point simply does not carry all the information a “normal” point has. In our R function, CensPer is the variable giving the percentage of the available data that is censored. I just attached the censored label randomly to data points – a better way would be to assume that the censored values are mostly those with a high life-time.

Scale  and power

Before looking at the results we need to discuss one more detail. When analyzing life data, as mentioned before, we can decide which distribution we assume for our data – and the simplest choice is the exponential, with a constant hazard rate. However, in real life we cannot be absolutely sure that this assumption holds – so in a real-life situation we should not pick the exponential but something that allows a little more flexibility. Fortunately there is such a choice, the Weibull distribution, which is flexible enough and contains the exponential distribution as a special case. So in the analysis our best option is to pick the Weibull distribution, and we should pick this option also when looking at the power in simulations. If you check the code, the line

mod=survreg(S~sdata$Fact, dist="weibull")

is running the regression, with the Weibull distribution as our choice. This has an additional advantage for the power analysis. If we look at the results of the regression, we shall see a calculated value called “scale”. This gives us the direction of the effect that was detected by the regression – if the scale is greater than 1, the hazard will increase with increasing stress factor; if it is smaller than one, the hazard will decrease with the stress. By looking at the distribution of the scale values we can judge how well a regression with the given sample and effect size can determine the direction of the effect. In principle, it is not enough to have a p-value less than 0.05 (a statistically significant effect); we would also need a scale that is greater than 1 (as we know that we set up the simulation so that the hazard increases with the strength of the factor).

Results

Well, the first and most important result is that the code above gives a pretty good idea of what the power will look like. Feel free to paste the code into RStudio and experiment with it.
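For example, the scenarios discussed below can be reproduced (up to simulation noise) with calls along these lines:

SingleSetupRun(Nrun = 2000, DM = 10, Ns = 5)                  # weak effect, small sample
SingleSetupRun(Nrun = 2000, DM = 30, Ns = 5)                  # strong effect, small sample
SingleSetupRun(Nrun = 2000, DM = 30, Ns = 5, CensPer = 0.5)   # strong effect, heavy censoring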

Now some results: this will not come as a surprise, but small effect sizes combined with small sample sizes result in very low power – it is almost hopeless to expect to see a significant effect. Moreover, the direction of the effect will be variable and misleading. For an effect of 10 days of shift in the MTTF and a sample of 5 per factor setting I got only 14% significant tests (a power of 14%), and of those only 16% had a scale > 1 – basically useless for practical purposes.

Increasing the sample size will strongly increase the power, but will only have a mediocre effect on identifying the right direction of the effect. At 10 samples per factor value the power increased to 18% and of those 27% identified the right direction. At a sample size of 20 the power was 29% and scale > 1 was at 32%. The numbers at the completely unrealistic sample size of 100 are: power at 84%, but percent scale > 1 still at only 46%.

So, we can hope to identify the existence of an effect with this method, but the direction will be iffy even at very high sample sizes. However, if we have a stronger effect, both numbers improve drastically:

At an effect of 30 days on the MTTF, even at a sample size of 5 (!) we would have a power of 99% with scale>1 at 89%. This effect grows in a very non-linear way – at an effect of 20 days we would still only have a power of 45% with the scale correctly identified in only 26% of the cases. So, definitely, the method becomes more reliable at stronger effects.

How does censoring affect the power? As expected, all else being equal, a higher percentage of censored values will reduce the power. For example, if we stay with the previous case but introduce a censoring of 50% of our generated data, the power decreases from 99% to 76% and the percent scale > 1 from 89% to 75%.

The effect is not very strong though, and 50% censored is rather extreme, so there is hope that we can use this method, provided we manage to identify a suitable, effective stress factor.

There is a lot more to find out with this simulation, and I encourage everyone to investigate this further on their own.

 

And, as always, any feedback is warmly welcome.

Solving a Probability Problem Through Simulation

I am hoping to extend the blog-post about the end of statistics (https://rforsixsigma.wordpress.com/2017/05/03/the-end-of-statistics-as-we-know-it/) with several more entries, discussing examples and methods that illustrate the theoretical conclusions of the first entry. Be warned that these entries might contain R code, which is not for the faint of heart (kidding).

The coupon collectors problem

First, let us return to the problem mentioned in my last post. It comes from a scientifically minded grandmother who wants to figure out how many roller-coaster rides are needed for her grandson to ride at least once in every carriage of a roller-coaster. There are 32 carriages in total and the customers are randomly assigned to a carriage before the ride. (This was an actual problem posed at the www.talkstats.com website.)

As somebody pointed out on the website, this is a known mathematical problem, called (surprisingly) the coupon collector's problem and formulated as the question of how many coupons we need to buy if we want to collect one of each type (anyone with small children will be familiar with this, from collecting all NBA player cards to all dinosaurs). Of course there are many more practical formulations of this problem, but for the example's sake we can stick to the roller-coaster rides.

In theory

The advantage of this being a known problem is that we already have a full theoretical solution, easily available on Wikipedia (https://en.wikipedia.org/wiki/Coupon_collector%27s_problem). Reading this entry we find that the expected number of rides is E(N) = N*H(N), where H(N) is the Nth harmonic number, and that the variance of the number of rides is less than (N*Pi)^2/6. The harmonic number is calculated as H(N) = 1 + 1/2 + 1/3 + … + 1/N.

Using Excel I get H(32) = 4.06, so the expected number of rides will be 32*4.06 ≈ 130, with a standard deviation somewhat less than 41 (again, using the formula from the wiki site). So, assuming that the number of rides is normally distributed (which is a pretty wild assumption and I am by no means sure it is warranted), the number of necessary rides will in 95% of the cases be between 130-2*41 and 130+2*41, that is, in the interval (48, 212) – and that is about all we can say.
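The same numbers can of course be obtained in a few lines of R instead of Excel:

N = 32
H = sum(1/(1:N))
c(expected_rides = N * H, sd_upper_bound = N * pi / sqrt(6))   # about 130 and 41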

In practice

The first, obvious problem with the theoretical solution is that one needs to be aware of the existence of this alternative formulation. Remember, our problem was to find the number of rides and had no obvious relationship whatsoever to coupon collecting. It takes a deep knowledge of probability theory to recognize that the alternative formulation is known and is a solved problem. (Or, possibly, access to a good statistics help website and a large portion of luck.) Most of us practitioners will not have that exposure to probability theory, and luck is, well, not reliable.

The second problem is that even knowing this alternative formulation will not necessarily help to answer the practical questions raised in real life. Even the most loving grandmother has to plan for a budget, so simply knowing the 95% confidence interval of the number of rides is not enough. How can she get a feel for what is to be expected? (As a side note, if the number of rides were really normally distributed, then the mean and the variance would be enough to fully describe the situation, but we do not know whether this is true or not. The central limit theorem might be an indication, but again, how close are we to normality in this special case of 32 carriages?)

So, chances are that we would not even know of the existence of the theoretical solution, and even if we did, we would just want to know more. In both cases, the answer is to simulate the problem.

Simulation 

My immediate reaction to situations where the math is not available (to me, at least) or the results are not really of practical value, is to try to simulate the problem and just see what happens. An immediate benefit is, always, that we can develop an intuition of what is going on, something that is very valuable for practitioners and is totally missing from the theoretical solutions. So, what to do in this particular case?

Obviously, we only need to simulate the draws of the carriage numbers. We need to generate sequences of random numbers from 1 to 32 and count how many numbers we need to draw until we have each number from 1 to 32 in the sequence. Thinking about this a bit more, there is a slight problem. Theoretically, we might end up with very long sequences (potentially infinitely long ones), as we can have sequences that never contain some of the numbers. Given that we want results in a reasonable amount of time, the naïve approach might not work.

The alternative is to generate sequences of fixed length and to count how many of those contain the full set of numbers from 1 to 32. Theoretically, this gives us the cumulative distribution function (https://en.wikipedia.org/wiki/Cumulative_distribution_function). The values of this function give us the probability that the number of rides will be at most a given number – so F(50) is the probability of needing at most 50 rides, F(300) the probability of needing at most 300 rides, and so on. Calculating this function has the double advantage of having a fixed run-time (no problems with infinite sequences) and of being exactly the information we need to budget the rides.

Code

This is the way I coded the simulation. It is definitely NOT an optimized version, but for my purpose it runs fast enough, even at 5000 trial runs per experimental setting, on my laptop. By fast enough I mean I barely have time to make a fresh cup of coffee before it is done. If you are not into R, the comment lines (starting with #) should give you an idea of what the code intends to do.

# Define the steps (candidate numbers of rides)
n=seq(from=32, to=532, by=4)
# Define the vector holding the results
Prob=numeric(length(n))
# Define the number of samples for each step
RunLength=5000
# Index of the results vector
Ind=1
# FOR each step DO
for(NTrial in n){
  # Set the counter to 0
  Count=0
  # FOR RunLength times DO
  for(i in 1:RunLength){
    # Generate a random sample of size "step" of the numbers 1:32, with replacement
    s=sample(1:32, NTrial, replace=TRUE)
    # IF the sample contains all 32 numbers THEN
    if(length(unique(s))==32){
      # Increase the counter
      Count=Count+1
    # ENDIF
    }
  # ENDFOR
  }
  # Save the probability of finishing in at most "NTrial" rides
  Prob[Ind]=Count/RunLength
  # Move the index one position forward
  Ind=Ind+1
# ENDFOR
}

The result is a list of probabilities, one for each number of rides, giving the chance of finishing in at most that many rides – for instance, the probability of getting into each carriage within 100 rides is about 23%.

To have a 50% probability of finishing one needs to go for 124 rides and so on. The cumulative distribution function will look like this:

 

[Figure: cumulative distribution of the number of rides needed]
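The plot itself can be produced directly from the vectors computed above, for example:

plot(n, Prob, type = "l", xlab = "number of rides",
     ylab = "probability of riding in all 32 carriages")
abline(h = 0.5, lty = 2)   # the chance reaches 50% at roughly 124 rides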

What did we gain, what did we lose?

In comparison to the theoretical solution we definitely gained clarity, in my opinion. With the distribution function in a table we can answer any practical question in an easily understandable way – questions like “what if we had 16 instead of 32 carriages?” or “what can we expect for a given budget?” and such. They will either be answered directly by this code or by a simple tweak of it, like replacing 32 with 16 in the code. Also, I may be mistaken in this, but I have a strong feeling that in any company there will be more people able to explain a few lines of R code than people able to work with harmonic numbers.

But, as everywhere, we lose something as well. There is a special kind of beauty in seeing mathematical results like the ones for the coupon collector's problem. The formula has a nice link to the old Zeno paradox about why Achilles can seemingly never catch up with the tortoise. Seeing such a pattern turn up in the everyday problem of a grandmother planning the rides of her grandson is really satisfying, and we might lose such insights in the future. Also, someone interested in the long-run average behavior of the system (where the number of carriages tends to infinity) will be better served by the mathematical theory.

However, I strongly doubt that the grandmother in question, or a laboratory manager trying to solve a practical problem, would consider these insights a great loss when weighed against the possibility of getting relevant answers quickly and in a way everybody can understand.

 

The End of Statistics as We Know It

Admittedly this is a provocative title, but it is in line with developments I see in the field, so bear with me. To come to the point quickly: I believe that we are about to see a fundamental change in the way statistics will be used in the future – a move away from mathematical methods and towards more computationally intensive and pragmatic methodologies. This would mean that we shall worry much (much) less about the normal distribution of our data, and probably a lot more about the correctness of our libraries and scripts. This might also change the way statistics is trained and taught – I would expect a more rigorous separation between practical and theoretical statistics, where, naturally, theoretical statistics would remain the heavily mathematical field it is today, but practical statistics would move more towards engineering and computer science, with a heavy emphasis on computing and programming.

History

This change is made possible by the development of computers and sensors. Historically, statistical methods were developed, mostly in the 1930s, in a situation where data was difficult to come by in quantity and computations were done by hand. It is quite natural that the methods that emerged relied heavily on model assumptions and formulae, as well as on the necessity to draw conclusions from relatively small numbers of data points. Truth be told, Fisher also developed exact, numerical methods that are very well suited to numerical analysis, but these were seen as wildly impractical at the time. So, in this traditional world of statistics one could save a lot of effort in data collection and computation by utilizing mathematical models. Most prominently, the assumption of an approximately normal distribution was the key to many of these methods.

This also ensured that statistics was (is) seen as a mostly mathematical discipline – given some initial assumptions, one develops a mathematical procedure to decide a practically relevant question, and the math ensures that as long as the assumptions hold, the answer will be right. The crux of the matter is that the type of questions we can ask is somewhat limited by the mathematical scaffolding, and, more crucially, checking the assumptions becomes just as important as answering the question. This can easily be seen in any statistics forum even today – a majority of the questions asked are some variation on the topic “is this data normally distributed?”, often framed as “can I use ANOVA (t-test, regression, etc.) on this data?”

The real question, though, is a different one, namely: “is there an indication in this data that there is an effect present, as opposed to simple random noise, and if yes, how strong is this effect?” Of course the mathematically involved traditional methods also answered this question (though in a surprisingly roundabout way), but the answer was more or less along the lines of “if your data conforms to the assumptions of the theory, then…”. Hence the need to always check the assumptions first.

Today

In many of the practical problems we see today, even the question of the existence of an effect is less important than a different question: “can I predict future values based on present observations?” Of course, if we can prove the existence of an effect with our present data, then this, by definition, will help us to predict future values. If, after running a DoE, we can select a few factors that have an effect, we can build a model and use it to predict the future performance of the system. The traditional limitation is that we can only build a model if we have, or somehow turn our data into, the type that conforms to the modelling assumptions (is normally distributed, mostly).

Already in the thirties statisticians knew that the assumptions about the data distribution are not necessary to answer the question about an effect or predictive power. For example, Fisher proposed a so-called “exact test” for binomial data which did not rely on the normality assumption: it calculated, in theory, all possible permutations of the data and counted how many of those permutations produced an outcome at least as extreme as the actually measured data. This percentage gave a reliable p-value, without any approximations or assumptions – the problem was that no one would be able to calculate all the permutations for any reasonably sized data set. Even today Minitab, for example, offers a small hidden check-box to calculate Fisher's exact test – but the default is the approximate p-value based on a normal approximation.

New ways

What has changed since the thirties is of course the availability of computing power unimaginable to the pioneers of statistics. Today, rationally seen, there is no need for most of the assumptions, as long as we can wait a few milliseconds for the results of a simulation (well, maybe a bit more for large data sets). This is a lucky situation, also because we see the development of many practical statistical methods where the mathematical foundations are much less well understood than for ANOVA, for instance – or, even if they are understood, the math behind them is too complex for practical use. Even if we look at something apparently as simple as multiple regression, the problem of selecting the right model (by picking the correct variables, adding interactions and nonlinear terms) is by far not solved in a satisfactory manner. The modern answer in most of these cases is surprisingly simple from a theoretical point of view: pick a random percentage of the data to build the model and test the predictions of the model on the part that was not picked. Measure the prediction error, then repeat with a different random split. Select the model that performs best in terms of prediction error. The method, called cross-validation, can be applied to any model, of any mathematical complexity. It will not deliver a theoretical value like the p-value, but it will, with high probability, deliver a model that is actually better at prediction.
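To make this concrete, here is a minimal, self-contained sketch of the idea in R (the data and the two candidate models are invented purely for illustration; a real application would of course use the measured data and the models under consideration):

set.seed(3)
mydata = data.frame(x1 = runif(100), x2 = runif(100))
mydata$y = 2 * mydata$x1 + rnorm(100, sd = 0.3)   # x2 has no real effect

cv_error = function(formula, data, k = 5) {
  folds = sample(rep(1:k, length.out = nrow(data)))   # random split into k folds
  errs = sapply(1:k, function(f) {
    fit = lm(formula, data = data[folds != f, ])       # build the model without fold f
    pred = predict(fit, newdata = data[folds == f, ])  # predict the held-out fold
    mean((data$y[folds == f] - pred)^2)                # mean squared prediction error
  })
  mean(errs)
}

cv_error(y ~ x1, mydata)        # the simpler model
cv_error(y ~ x1 + x2, mydata)   # the model with a superfluous term

Whichever model produces the smaller cross-validated error is the one we would trust for prediction – no distributional assumptions needed.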

This is not to say that the new methods do not require expertise – but they require expertise of a different kind. Intuition about how random data behaves will be more necessary than ever. My personal experience is that, having studied and worked with statistics for many years, I was sorely lacking this intuition until I started to write simulations. Working with some small data set in the idealized world of mathematics is simply not enough to build intuition. Even today, when I at least know that my intuition is weak, I am surprised by how often my predictions are wrong – but today, knowing this, I just fire up my computer and run a few simulations.

To close, I would like to pose a small challenge to anyone willing to take it: in my preferred statistics forum someone asked this eminently practical question. She took her grandson to a roller-coaster ride, where each customer was assigned randomly to a carriage. Her grandson was so enthusiastic about the ride that he wanted to repeat it until he had ridden at least once in each of the carriages. The roller-coaster had 32 carriages in total. How many rides will be needed? What would be your guess – less than 100, less than 1000? Can you simulate the problem?

For those with a mathematical inclination: this is known as the coupon collector's problem. The mathematical theory of it is well developed and relatively easy to find on the net. See if you can work out an estimate of the most probable number of rides, the expected number and maybe the variance. Then try to simulate the problem and compare the effort 😊

 

Logistic regression vs. Chi-Squared test

A few years ago I developed a simulation to teach the application of statistical methods to real-life problems. I had to figure out a situation where the students had to apply statistics to continuous and to discrete variables. For the discrete case I used the number of defects caused by materials coming from 7 different sources – actually it was a sci-fi simulation where they needed to decide which planet to mine for fuel, given that the fuel had impurities that caused malfunctions in the rocket engine. To enforce the use of discrete data I designed the story so that much of their measurement equipment was defective and all they could do was count the occurrences of a malfunction (a yes-no type of measurement).

This was easily accepted by the trainees, and references to Halo and Battlestar Galactica also helped. Now, my problem was to analyse this data. What I needed was something similar to ANOVA – a test that takes the discrete data for each category and in one operation gives me a ranking of which sources cause significantly more malfunctions than the others – with the emphasis on the word „significantly“. There are two naive solutions to this, both of which are fast, simple and wrong: either we just compare the numbers as measured, or we run as many two-sample proportion tests as we have pairs.

Simply comparing the numbers is obviously wrong – we completely miss the point of using statistics and of course we cannot differentiate between random flukes and systematic effects. Running pairwise tests is more subtly wrong. Each test has a false alarm probability of 5%. So if I run 10 comparisons, say, then the probability of having at least one false alarm is already roughly 40% – definitely not what we want. In the continuous case ANOVA handles this problem, because it can run one test and get the answer for all comparisons, with the probability of a false alarm still at 5%. A so-called post-hoc analysis will then deliver all the answers we need to questions about how the levels of the factor are grouped and where we have extreme values.
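That number is easy to check in R:

1 - 0.95^10   # probability of at least one false alarm in 10 independent tests, about 0.40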

So, the simple question was: what is the analogue of ANOVA for the discrete case? The surprising and somewhat unpleasant answer was that, except for the case where the factor has two levels (the so-called 2×2 table), there is no really satisfactory method that would do all that ANOVA does for the continuous case. The closest we get is the chi-squared test, but it is not really convincing, even though it would also be the textbook solution for this problem.

For a long time we used the simulation with the recommendation to use the chi-squared test – and glossed over eventual objections by sheepishly remarking that we just did not have a better idea – until last week, when I happened upon a discussion in my favourite statistics forum on the same point. I realized then that there is a nice and easy way to get the information ANOVA would provide in the continuous case: just run a logistic regression with the discrete factor as the only predictor.

The received wisdom is that binary logistic regression is only useful if our Y is discrete (binary) and our X is continuous – and this is indeed the standard application. In the meantime, the methods and the statistical software we use have evolved to the level where we can use discrete factors as well as continuous ones in a logistic regression – indeed, Minitab raises no qualms if we feed it with a single discrete factor only.

This has huge advantages compared to a chi-squared test. We actually build a statistical model, so we can estimate the importance of each level of the factor and we can make predictions based on this model – we basically get the same level of knowledge as with ANOVA. The only caveat is that people might not be comfortable with using logistic regression, and even less with using one with discrete factors only. But this is a question of education and opportunity – even the relatively unsophisticated Minitab offers this possibility, so it is time we move away from the age of hand calculations (the overwhelming reason for preferring a chi-squared test is that it is simple enough to be performed by hand) and use the computer for what it was meant to do: perform complex calculations. Then we can do what we are meant to do – interpret the output and provide relevant answers to our clients' real-life questions.
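As a hedged illustration of the point (with made-up numbers: seven hypothetical planets, 200 tested parts each, one of them genuinely worse), the two analyses would look like this in R:

set.seed(11)
sources = factor(paste0("planet_", 1:7))
tested = rep(200, 7)                                       # parts tested per source
p_true = c(0.05, 0.05, 0.05, 0.12, 0.05, 0.05, 0.05)       # planet_4 causes more malfunctions
defects = rbinom(7, size = tested, prob = p_true)

# Textbook answer: the chi-squared test only tells us that "something differs somewhere"
chisq.test(cbind(defects, tested - defects))

# Logistic regression with a single discrete factor: a model with per-level estimates
mod = glm(cbind(defects, tested - defects) ~ sources, family = binomial)
summary(mod)    # the coefficients show which sources differ from the reference level

Unlike the chi-squared test, the fitted model can also be used to predict malfunction rates for each source.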