Theory of Constraints meets Big Data part 2

I would like to continue the story of the hunt for the constraint, using a lot of historical data and the invaluable expertise of the local team. There is a lot of hype around big data and data being the new oil, and there is also a lot of truth in it. However, I find that ultimately the success of a data mining operation depends on the intimate process knowledge of the team. The local team will generally not have the expertise to mine the data with the appropriate tools, which is absolutely fine, given that data mining is not their daily job. On the other hand, a data specialist will be blind to the fine points of the operation of the process, so cooperation is an absolute must to achieve results. The story of our hunt for the constraint illustrates this point nicely, in my opinion.

After having found proof that we have a bottleneck in the process, our task was to find it, or at least to learn as much as possible about its nature. This might seem an easy task for hardcore ToC practitioners in manufacturing, where the constraint is generally a process step or even a physical entity, such as a machine. In our process of 4 different regions, about 100 engineers per region, intricate long and short term planning and erratic customer behaviour, few of the known methods for finding the bottleneck seemed relevant. For starters, there was no shop-floor we could have visited and no WIP lying around giving us clues about the location of the bottleneck. The behaviour of all regions seemed quite similar, which pointed us in the direction of a systemic or policy constraint. I have read much about those, but a procedure for identifying one was sorely missing from my reading list.

So, we went back to our standard behaviour in process improvements: “when you do not know what to do, learn more about the process”. A hard-core lean practitioner would have instructed us to go to the Gemba, which, I have no doubt, would have provided us with adequate knowledge in time. But we did not have enough time, so our idea was to learn more about the process by building a model of it. This is nicely in line with the CRISP-DM methodology, and it was also our only possibility given the short time period we had to complete the job.

The idea (or maybe I should call it a bet) was to build a well-behaved statistical model of the installation process and then check the residuals. If we have a constraint, we will either be able to identify it with the model or (even better) we will observe that the actual numbers are always below the model predictions, and thus we can pinpoint where and how the bottleneck manifests itself.

Using the tidyverse packages from R, it was easy to summarize the daily data into weekly averages. Then, taking the simplest approach, we built a linear regression model. After some tweaking and adjusting we came up with a model with 4 variables that had an amazing 96.5% adjusted R-squared value. Such high R-squared values are in fact bad news in themselves – they are an almost certain sign of overfitting, that is, of the model tracking the data too faithfully, incorporating even random fluctuations. To test for this we used the model to predict the number of successful installs in Q1 2018. If we had overfitted the 2017 data, then the 2018 predictions should be off the mark – god knows, there was enough random fluctuation in 2017 to lead the model astray.
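For readers who want to reproduce the idea, here is a minimal sketch. The original analysis was done in R with the tidyverse; this Python version uses synthetic weekly data with invented coefficients and variable names (illustrative assumptions, not the project's real columns), purely to show how the adjusted R-squared of such a regression is computed.

```python
# A minimal sketch, not the article's real model: synthetic data and
# invented coefficients, only to illustrate the adjusted R-squared check.
import numpy as np

rng = np.random.default_rng(1)
weeks = 52
utilisation = rng.uniform(20, 30, weeks)     # avg jobs per operator per week
access_rate = rng.uniform(0.5, 0.9, weeks)   # share of customers granting access
successes = 5 * utilisation + 300 * access_rate + rng.normal(0, 2, weeks)

# Ordinary least squares with an intercept
X = np.column_stack([np.ones(weeks), utilisation, access_rate])
beta, *_ = np.linalg.lstsq(X, successes, rcond=None)
fitted = X @ beta

ss_res = np.sum((successes - fitted) ** 2)
ss_tot = np.sum((successes - successes.mean()) ** 2)
n, p = weeks, 2                              # observations, predictors
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2, r2_adj)
```

With only 52 weekly points and a strong synthetic signal, the adjusted R-squared lands well above 0.9, illustrating how easily a suspiciously high value can appear.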

But we were lucky – our predictions fit the new data to within +/- 5%. This meant that the fundamental process did not change between 2017 and 2018, and also that our model was good enough to be investigated for the bottleneck. Looking at the variables we used, we saw that two had a large impact and were process related: the average number of jobs an operator is given per week, and the percentage of cases where the operator was granted access to the meter by the customer. The first was a thinly disguised measure of the utilisation of our capacity, and the other a measure of the quality of our “raw material” – the customers. Looking at this with a process eye, we found a less than earth-shaking conclusion: for a high success rate we need high utilisation and high quality raw materials.

Looking at the model in more detail we found another consequence: there were many different combinations of these two parameters that led to the same number of successes. Low utilisation combined with high quality was just as successful as high utilisation combined with much lower quality. If we plotted the contour lines of an equal number of successes, we got, unsurprisingly, a number of parallel straight lines running from the lower left corner to the upper right corner of the graph. This delivered the message, again not an earth-shaking discovery, that in order to increase the number of successes we need to increase the utilisation AND the quality at the same time.
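For a linear model the iso-success lines can be written down directly. The sketch below uses made-up coefficients (NOT the fitted values from the project) to show why the contours are parallel straight lines: along each line the quality needed falls linearly as utilisation rises, with the same trade-off slope everywhere.

```python
# Iso-success lines of a hypothetical linear model s = b0 + b1*u + b2*q.
# The coefficients are invented for illustration.
b0, b1, b2 = 50.0, 5.0, 300.0

def quality_needed(target, utilisation):
    """Access quality required to reach `target` successes at a given utilisation."""
    return (target - b0 - b1 * utilisation) / b2

# Two points on the same iso-success line (target = 400 successes):
q_at_low_u = quality_needed(400, 20)    # low utilisation needs high quality
q_at_high_u = quality_needed(400, 30)   # high utilisation tolerates lower quality
slope = -b1 / b2                        # constant trade-off along every line
print(q_at_low_u, q_at_high_u, slope)
```

Because the slope is the same for every target value, the contours form a family of parallel lines, exactly the picture described above.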

To me, the real jaw-dropping surprise came when we plotted the weekly data from 2017 over this graph of parallel lines. All weekly performance data for the whole of 2017 (and 2018) moved parallel to one of the constant-success lines. This meant that all the different improvements and ideas that were tried during the whole year were either improving the utilisation while reducing the quality, or improving the quality while reducing the utilisation – sliding up and down along a line of a constant number of successes (see attached graph).

This is a clear case of a policy constraint – there is no physical law forcing the process to move along that single line (well, two lines actually), but there is something that forces the company to stay there. As long as the policies keep the operation on these lines, they will look exactly the same as a physical constraint.

This is about the most we can achieve with data analysis. The job is not yet done – the most important step is now for the local team to identify the policy constraint and to move the company from the mode where it slides parallel to the constant-success lines to a mode where it moves perpendicular to them. We can provide the data, the models and the graphs, but now we need passion, convincing power and commitment – and this is the way data mining can actually deliver on the hype. In the end it is about people able and willing to change the way a company operates, and about the company empowering them to investigate, draw conclusions and implement the right changes. So, business as usual in the process improvement world.

(Graph: historical 2017 data by week)



Theory of Constraints meets Big Data

The theory of constraints is the oldest and probably the simplest (and most logical) of the great process optimization methodologies. One must also add that it is probably the most difficult to sell nowadays, as everybody has already heard about it and is also convinced that for their particular operation it is not applicable. Most often we hear the remark “we have dynamic constraints”, meaning that the constraint is randomly moving from one place in the process to another. Given that ToC postulates one fixed constraint in any process, clearly the method is not applicable to such complex operations. This is an easily refutable argument, though it undoubtedly points to a missing link in the original theory: if there is too much random variation in the process steps, this variation will generate fake bottlenecks that seem to move unpredictably from one part of the process to another. Obviously, we need a more standardized process with less variation in the steps to even recognize where the true bottleneck is, and this leads us directly to Lean with its emphasis on Mura reduction (no typo, Mura is the excessive variation in a process, which is recognized as just as bad as its better known counterpart Muda). This probably eliminates, or at least reduces, the need to directly apply the theory of constraints as a first step.

There are other situations as well. Recently I was working for a large utilities company on a project where they need to gain access to their customers’ homes to execute a meter upgrade, a legal obligation of the company. So, the process starts with convincing customers to grant access to their site and actually be present during the upgrade, then allocating the job to an operator with sufficient technical knowledge to execute the upgrade, getting the operator to the site on time and executing the necessary work. There is a lot of locality and time based variation in this process – different regions have different demographics that react differently to the request for access, people tend to be more willing to grant access outside working hours, but not too late in the day, and so on.


On the other hand this process looks like a textbook example of the Theory of Constraints: we have a clear goal defined by the law, to upgrade X amount of meters in two years. Given a clear goal, the next question is: what is keeping us from reaching it? Whatever we identify here will be our bottleneck, and once the bottleneck is identified we can apply the famous 5 improvement steps of the ToC:

1. Identify the constraint

2. Exploit the constraint

3. Subordinate all processes to the constraint

4. Elevate the constraint

5. Go back to step 1

In a traditional, very much silo-based organization, steps 1-3 alone would already be very valuable. By observing the processes in their actual state we saw that each silo was working hard on improving its part of the process. We literally had tens of uncoordinated improvement initiatives per silo, all trying their best to move closer to the goal. The problem with this understandable approach is nicely summarized in the ToC principle: any improvement at a non-constraint is nothing but an illusion. As long as we do not know where the bottleneck is, running around starting improvement projects will be a satisfying but vain activity. It is clearly a difficult message to send concerned managers that their efforts are mostly generating illusions, but I believe this is a necessary first step in getting to a culture of process (as opposed to silo) management.

The obvious first requirement, then, is to find the bottleneck. In a production environment we would most probably start with a standardization initiative to eliminate the Mura, to clear the smoke-screen that does not allow us to see. But what can we do in a geographically and organizationally diverse, huge organization? In this case our lucky break was that the organization already collected huge amounts of data – and this is where my second theme, “big data”, comes in. One of the advantages of having a lot of data points – several hundred per region per month – is that smaller individual random variations will be evened out, and even in the presence of Mura we might be able to see the most important patterns.

In this case the first basic question was: “do we have a bottleneck?” This might seem funny to someone steeped in ToC, but in practice people need positive proof that a bottleneck exists in their process – or, to put it differently, that the ToC concepts are applicable. Having a large and varied dataset, we could start with several rounds of exploratory data analysis to find the signature of the bottleneck. Exploratory data analysis means that we run through many cycles of looking at the process in detail, setting up a hypothesis, trying to find proof of the hypothesis, and repeating the cycle. The proof is, at the beginning, mostly of a graphical nature – in short, we try to find a representation that tells the story in an easy to interpret way, without worrying too much about statistical significance.

In order to run these cycles there are a few pre-requisites in terms of people and tools. We need some team members who know the processes deeply and are not caught in the traditional silo-thinking. They should also be open and able to interpret and translate the graphs for the benefit of others. We also need at least one team member who can handle the data analysis part – has a good knowledge of the different graphical possibilities and has experience with telling a story through data. And finally we need the right tools to do the work.

In terms of tools I have found that Excel is singularly ill-suited to this task – it handles several hundred thousand lines badly (loading, saving and searching all take ages) and its graphical capabilities are poor and cumbersome. For a task like this I use R with the “tidyverse” packages and of course the ggplot2 graphics library. This is a very handy and fast environment – using pipes with a few well chosen filtering and processing functions and directing the output into the ggplot graphics system allows the generation of hypotheses and publication quality graphs on the fly, during a discussion with the process experts. It does have its charm to have a process expert announce a hypothesis and to show a high quality graph testing it within one or two minutes of the announcement. It is also the only practical way to proceed in such a case.
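For readers who work in Python rather than R, the same summarize-then-plot pipeline can be sketched with pandas. The column names below are invented for illustration; the article's own workflow uses the tidyverse and ggplot2.

```python
# The pipe-to-summary part of the workflow, sketched in Python/pandas.
import pandas as pd

daily = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "week": [1, 1, 1, 1],
    "successes": [10, 14, 8, 12],
})

# Equivalent of a tidyverse group_by() |> summarise() pipe
weekly = (daily
          .groupby(["region", "week"], as_index=False)
          .agg(mean_successes=("successes", "mean")))
print(weekly)
# A chart would follow here, e.g. with plotnine (a ggplot2 port) or matplotlib.
```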

Most of the hypotheses and graphs end up on the dung-heap of history, but some will not. They become the proofs that we do have a bottleneck, and bring us closer to identifying it. Once we are close enough, we can take the second step in the exploratory data analysis and complete a first CRISP-DM cycle by building a statistical model and generating predictions. If we are lucky, our predictions will overestimate our performance in terms of the goal – thus pointing towards a limiting factor (aka the bottleneck), because we achieve LESS than what would be expected based on the model. Once here, we try some new, more concrete hypotheses, generate new graphs and models, and see how close we get to the bottleneck.

So, where are we in real life today? In this concrete example we are through the first cycle, and our latest model, though overoptimistic, predicts the performance towards the goal to within -10%. We are at the second iteration now, trying to find the last piece of the puzzle that gives us the full picture – and of course we already have a few hypotheses.

In conclusion – I think that the oldest and most venerable process optimization methodology might get a new infusion of life by adapting the most modern and up-to-date one. This is a development to watch out for and I will definitely keep my fingers crossed.


Rapid Action Workouts – Lean for NGOs

There are situations where we would like to improve a process but do not have the luxury of working inside a full-fledged lean initiative. This means, most of all, that we cannot build on previous trainings, lean awareness and a changing culture that the teams know about. Also, in these cases, the expectation is to achieve rapid successes, as the effort of getting the teams together cannot be justified by long-term positive evolution alone. In short, the activity has to pay for itself.


In my experience, these situations can arise in two ways: either there is simply a need to improve one process in the organization, and they have not (yet) given thought to a long-term improvement initiative, or the exercise is meant by a promoter of a lean initiative as an appetizer, to convince the organization to start a deeper and more serious lean effort. Either way, it is important to be successful in the allocated short time.

To respond to this need we at ifss developed a rapid process improvement methodology. The methodology addresses several of the constraints we see in this scenario:

  1. The teams are not trained, indeed, not even aware of the lean way of thinking and solving problems
  2. The costs of the workshop need to be minimal, so the action needs to be fast

Our idea is to select the minimal effective subset of the lean toolset. Each day starts with a short training (short meaning a maximum of 1 hour) focusing only on the lean tools that will be needed that day. The rest of the day is spent applying the tools the team learned that day to the problem that needs to be solved. Throughout the day the team has access to the coach, but they have to apply the tools themselves. At the end of the day the results are summarized and the roadmap for the next day is discussed.

Of course, for this to work, problem selection and expectation management are key. As such, the coach has to work with the organization to understand the problem before the RAW and to help the organization select an appropriate one. It would be totally disrespectful to assume that we, as lean coaches, can solve any organizational problem within a workshop of 4 days, but in most cases we can suggest improvements, achieve team buy-in and design a roadmap to be followed. Thus, we must work with the organization to define a problem where this improvement justifies the effort of organizing the workshop. In optimal cases we have the required tools to help them with this – Intermediate Objectives Maps or Prioritization Matrices, to name just a few. Nevertheless, the ultimate decision, and the most important one at that, is in the end the responsibility of the target organization.

The second step for the coach is to select the right tools for the RAW workshop. This can be, in theory, different for each client and problem. In practice we have a set of tools that can be used well in many different situations – SIPOC, Process Mapping, Root Cause Analysis, Future State Definition, Risk Analysis and Improvement Plan will (in this order) generally work. I capitalize the methods, much like chapter titles, because we have a fair number of different techniques for each “chapter”, and the coach has to pick the one that is best suited to the problem and the team.

For example, for Root Cause Analysis the coach might pick an Ishikawa diagram if she judges the causes to be simple (and uncontroversial), or dig deep with an Apollo chart if the contrary. Of course the training for the day the team starts to apply the tool has to be adapted to the choice the coach made.

Because we generally do not get to finish all the actions, and we definitely aim for a sustained improvement effort, I always discuss PDCA as well – and make sure that the team defines a rhythm in which the PDCA cycles will be performed and presented to the local management.

This is all nice in theory, but does it really work? I have had the privilege to work for several years with two NGOs improving processes in Africa and, recently, in the Middle East. The constraints I mentioned above apply very strongly to them, and I found that this approach of combining minimal training with process improvement work met with enthusiastic support and was successful. So, hopefully we will be able to refine this approach further in the future.

Learning from a Line simulation

In this installment I would like to describe a first set of small experiments we can make using our simulation program, and show what we can learn from them. This is by no means an exhaustive list – it serves to encourage you to try it out for yourselves.

First a quick description of the program: in order to make the effects more visible we made some changes to the original dice game, strengthening the difference between the high and low variation cases. Now, in the simulation, each workstation will process up to 50 pieces per round, the actual number being uniformly distributed between 1 and 50 in the high variance case and between 24 and 27 in the low variance case.

The average number of processed parts and the variance follow from the formulae for the uniform distribution: mean = (a+b)/2 and variance = (b-a)^2/12, where a and b are the endpoints of the distribution. This gives a mean of (1+50)/2 = 25.5 for the high variance case and (24+27)/2 = 25.5 for the low variance case. The variances, however, differ: (50-1)^2/12 ≈ 200 for the high and (27-24)^2/12 ≈ 0.75 for the low variance case.
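These figures are easy to verify by simulation. One caveat, stated here as an assumption: if the simulator draws integers (dice-game style), the exact variance of a discrete uniform on a..b is (n^2 - 1)/12 with n = b - a + 1, slightly above the continuous-uniform value quoted in the text.

```python
# Verifying the mean/variance claims by simulation, assuming integer draws.
import numpy as np

rng = np.random.default_rng(0)
high = rng.integers(1, 51, size=1_000_000)   # high-variance station: 1..50
low = rng.integers(24, 28, size=1_000_000)   # low-variance station: 24..27

print(high.mean(), low.mean())   # both near 25.5
print(high.var(), low.var())     # near (50**2 - 1)/12 = 208.25 and (4**2 - 1)/12 = 1.25
```

Either way, the qualitative picture is the same: identical means, and a variance ratio of more than two orders of magnitude.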

The simulation has a graphical UI, shown in the attached graph. We can select the number of workstations and the number of simulation runs in increments of 500. We can also choose the line feed strategy, that is, the way new orders come into the system. Random Feed means that they come in uniformly distributed with high variance, between 1 and 50, while Regular Mean Feed means that we feed the line with the average number of orders the line can handle, 25.5 per run, with low variance.

There is also a button called Re-run, which simply creates a new run with the same settings as the previous one. For our first experiments this is the most important button, because by re-running the same setup we can really see the influence of randomness.

We have 3 output graphs:

  1. The number of WIP for each run of the simulation. This is the number of work items that are somewhere inside the production line after the end of a run – the number of semi-finished goods.
  2. A graph describing the throughput of the line. This is a bit more tricky – in order to clearly see the differences we do not display the number of finished pieces per run but the deviation from the expected number of finished goods per run. So a value of 0 roughly means 25 finished pieces, a value of 20 means 25+20=45 finished pieces, while a value of -20 means 5 finished pieces. While this takes a bit of getting used to, this graph shows the deviations from the expectation much more clearly than the raw numbers.
  3. The number of WIP at each workstation at the end of the simulation. This can give us an idea where we have an accumulation of semi-finished goods and this is often proposed as a way of finding the bottleneck on a line. We will soon know better.
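The experiments below can also be repeated without the UI. Here is a minimal re-implementation of the line in Python; the assumptions (integer uniform draws, a random feed into the first station mirroring the Random Feed setting) are mine, sketched from the description above.

```python
# A minimal re-creation of the simulated line for self-experimentation.
import numpy as np

def simulate_line(stations=5, runs=1000, low=1, high=50, seed=0):
    rng = np.random.default_rng(seed)
    wip = np.zeros(stations)       # semi-finished goods waiting at each station
    finished = []
    for _ in range(runs):
        wip[0] += rng.integers(low, high + 1)        # new orders enter the line
        for s in range(stations):
            done = min(wip[s], rng.integers(low, high + 1))
            wip[s] -= done
            if s + 1 < stations:
                wip[s + 1] += done                   # pass work downstream
            else:
                finished.append(done)                # last station output
    return wip, np.array(finished)

wip, finished = simulate_line()
print(wip.sum(), finished.mean(), finished.std())
```

Changing the seed plays the role of the Re-run button: the station with the largest queue typically wanders from run to run, while the throughput stays just as noisy.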

So, let us run a first simulation! I set the number of workstations to 5, for 1000 runs. As I am interested in the effects of randomness, I pick the worst scenario: highly variable demand and highly variable machines. The output is on the graph.

Remember, from a traditional point of view this is a nicely balanced line. Still, after 1000 runs, we have almost 2000 pieces scattered around production (graph 1), and our throughput was more or less unpredictable, anywhere between 20% (5 pieces) and 180% (45 pieces) of the expected volume. Obviously we had some good and some bad days, without any chance of predicting what the next day would bring. By looking at the last graph we can readily identify the bottleneck: workstation 2, with 3 trailing quite close behind it. So, to improve the line performance we need to focus on these two, right?

Before investing resources in improvement work, press the Re-run button. You will obviously get different results, but very likely something similar to what I see now: the bottleneck at workstation 2 disappeared and this workstation now has the best performance, while workstation 1 accumulated the most WIP. Obviously the threat of starting an improvement project at their workstation worked wonders – however, at workstation 1 people became complacent and need more supervision. Also, the WIP in general evolved in a much better way; we had a long stretch of runs where the WIP actually decreased, and we are now at about 1500 pieces instead of 2000!

So, let us press Re-run again: the WIP trend is a complete mess now, steadily increasing to about 2500 at the end of the runs, and workstation 2 started to accumulate more WIP, though workstation 1 is still the black sheep. Press again: workstation 1 has improved, but 2 became the bottleneck again, while our focus on the overall WIP trend paid off and we are again down to about 1500 pieces.

The one thing that did not change across all these simulations was the daily performance – it stayed just as variable in all 3 re-runs.

Of course this is all a joke – there is no way the simulation would react to the threat of an improvement project. What we see is the effect of random variations in the line, and the most visible one is the random movement of the “bottleneck”. This, by the way, is the core problem of the Theory of Constraints methodology: we very often hear that a line does not have a single bottleneck but “moving bottlenecks”. These are the effect of random variations in the line.

The second visible effect is the accumulation of WIP. We would naively expect no WIP, or very little, as the line is balanced on average. In reality the random variations cause ever larger queues at random locations in our balanced line, and the prediction of the factory physics formulae is that the WIP will continuously increase as a result of these random variations. Looking at the graphs, it does seem that the WIP is levelling off and stabilizing at around 2000 pieces, but this is an illusion. If you increase the number of runs to 2000 and beyond you will see that the increase is continuous.

The third, even more worrying effect is the high variability of the throughput. This line, though balanced, can give no guarantee of the number of pieces produced per day (run), despite the best efforts of the production people. So, what will happen in real life? Important orders will be moved forward (expedited) while less important ones will stay in the WIP – all this will create even more randomness and increase the negative effects we already see. I think this is not unlike many situations we see in production lines today.

As we are not trained to discern random effects we will necessarily come up with explanations for purely random phenomena – like my example of the threat of running an improvement project having positive effects or people becoming complacent.  This is also something we see very often in real life.

So, how would you investigate our line further? Please feel free to try.

A Simulation in Lean Says More Than 1000 Words


In general, we are sadly missing an intuitive understanding of random processes. Several authors have written about this blind spot in recent years – Taleb in Fooled by Randomness or Sam L. Savage in The Flaw of Averages, just to name a few.


To my mind, the basic problem is that we rarely have the opportunity to experience long stretches of random phenomena. Of course we have all seen dice being thrown and the stock market fluctuate, but this is not enough to develop an intuition about random events. To do this, we would need to observe and record long stretches of the phenomenon – and I have yet to see the person who spent hours and hours throwing a balanced die and recording the results, not to mention analyzing them as well. (Well, there was a true renaissance personality called Cardano who did this in the 16th century, and wrote a book about it, but this was more than unusual.)


Studying and improving processes is a field where we really miss an intuitive understanding of how random variations influence the behavior of the system. In principle Six Sigma acknowledged the role of random variations but, sadly, not when looking at process mappings. Lean followed in the same footsteps, simply formulating the request to reduce the variation as far as possible in order to achieve a uniform flow. So, in the end, some practical questions are left unanswered by both methodologies, like: How much variation is acceptable? What exactly are the negative effects of randomness in our concrete process? How could we make a cost-benefit analysis of the variation reduction?

To answer these questions, we would need to understand the effects of randomness on a process. To my knowledge there is one discipline that has really attempted to do this: Factory Physics. In this theory we apply queuing theory to manufacturing processes, and the result is a formula that precisely describes the effects of random variations on a manufacturing operation: the (in)famous Pollaczek-Khinchine equation, eventually combined with Little’s Law.
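For the curious, the mean-value form of these equations is compact enough to compute directly. A hedged sketch with illustrative numbers (not taken from any process in the article): for an M/G/1 queue the Pollaczek-Khinchine formula gives the mean wait Wq = lambda * E[S^2] / (2 * (1 - rho)), and Little's Law turns that into a mean queue length Lq = lambda * Wq.

```python
# Pollaczek-Khinchine mean wait for an M/G/1 queue, plus Little's Law.
def pk_queue(lam, mean_s, var_s):
    rho = lam * mean_s                  # utilisation
    assert rho < 1, "unstable at or above 100% utilisation"
    es2 = var_s + mean_s ** 2           # E[S^2] from variance and mean
    wq = lam * es2 / (2 * (1 - rho))    # mean wait in queue (P-K formula)
    lq = lam * wq                       # mean number waiting (Little's Law)
    return rho, wq, lq

# Same 90% utilisation, different service-time variability:
_, wq_det, _ = pk_queue(lam=0.9, mean_s=1.0, var_s=0.0)   # deterministic service
_, wq_var, _ = pk_queue(lam=0.9, mean_s=1.0, var_s=1.0)   # exponential-like service
print(wq_det, wq_var)
```

At identical utilisation, raising the service-time variance from 0 to the square of the mean doubles the expected wait, which is exactly the kind of variance effect discussed here.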

Now, if you are the type of person who was trained in reading and interpreting equations (as physicists are, no surprise here), then all you need is to look at this equation to understand the effects of variation on a value stream. Unfortunately, very few people can do that, so for the rest of us the equations provide a tantalizing hint of what is going on, but fail to deliver the kind of intuitive understanding we would find useful. To achieve that we would need to be able to observe and track processes with random variations over a long period of time, to see what is going on.

This is generally not feasible in real life, due to constraints of cost and time. The processes are rarely constant over a long enough period for us to observe their behavior at leisure. Changing some parameters, like the average processing time or the standard deviation of the processing time at a given step, just to be able to observe the effects of such a change, is of course impossible. So, we need a random system that is cheap and fast to change and also easy to observe.

This is why in the eighties Goldratt invented the Dice Game – an abstract model of a processing line where the only influencing factor is the randomness of the processing times. The game is played as follows: 5-8 people sit at a table, each with a six-sided die. There are matchsticks to the right and to the left of each person, one person’s right matchstick heap being the left heap of the next person. There is also a “feeder” – someone who provides new matchsticks at the input of the process. At the sign of the moderator, each person throws their die and moves as many matchsticks from their left to their right as the count of points on the die. If they do not have that many sticks to their left (say they have 3 sticks and the die says 5), they move as many sticks as they can. The feeder inputs 3, then 4, then 3 again to the process at the beginning of each cycle.


The operation is then repeated for as many rounds as the moderator sees fit – generally not more than 20 cycles, to avoid boring people to death. Then the exercise is repeated, but instead of throwing a die, a coin is tossed. The rule is that heads means moving 3 sticks, tails means moving 4.


Now, consider the following: in both cases the average performance of each workstation will be 3.5 sticks per cycle, that is, the line is perfectly balanced in both cases. Before reading on, think about the following questions:

  1. Would you expect a difference between these two cases (die vs. coin)?
  2. Do you expect a difference in the throughput (finished products per cycle) between the two cases?
  3. Where do you expect to see the bottleneck?
  4. If each step can still move 3.5 pieces on the average, how would you improve the process?
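One way to build that intuition without recruiting players is to simulate both variants. The sketch below adds a few assumptions of my own beyond the text: 6 players, 1000 cycles, and the 3/4 alternating feed described above.

```python
# Dice Game vs. coin variant: same average capacity (3.5), different variance.
import numpy as np

def play(roll, cycles=1000, stations=6, seed=0):
    rng = np.random.default_rng(seed)
    wip = np.zeros(stations)             # matchstick heaps before each player
    finished = 0
    for t in range(cycles):
        wip[0] += 3 if t % 2 == 0 else 4          # feeder: 3, 4, 3, 4, ...
        for s in range(stations):
            done = min(wip[s], roll(rng))         # move at most the roll
            wip[s] -= done
            if s + 1 < stations:
                wip[s + 1] += done
            else:
                finished += done
    return wip.sum(), finished / cycles

die = lambda rng: rng.integers(1, 7)       # 1..6: mean 3.5, high variance
coin = lambda rng: 3 + rng.integers(0, 2)  # 3 or 4: mean 3.5, low variance

wip_die, tput_die = play(die)
wip_coin, tput_coin = play(coin)
print(wip_die, tput_die, wip_coin, tput_coin)
```

In a typical run, despite identical averages, the die line piles up far more WIP and delivers correspondingly less, which is the whole point of the exercise.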


As we are missing an intuitive understanding of random processes, it is not easy to answer these questions. Getting enough people together to play the dice game is not easy either, especially if we want to see the longer term effects – no one will waste time playing this for several hundred cycles, right? Still, understanding the behaviour of the Dice Game is an important step towards understanding random variation in a value stream.


The solution is to write a computer simulation with a comfy user interface and let people simulate the system. In this case the individual steps are fairly simple, so writing the simulation is not a complicated job, and adding a user interface is pretty much standard using R and the Shiny framework. This way we can play several thousand rounds without getting bored, measure the results, change the parameters or simply repeat the runs several times to see how the system behaves. There is no better way to develop our intuition, and in my experience there was not a single person who was not surprised by some aspect of what they saw.


As it happens, we at ifss developed such a simulation, and it can be accessed by anyone via the Internet. It is not up all the time, but if anybody would like to figure out the answers to the above questions, we would be happy to start the simulation. Just send a message some time in advance, and we shall find an opportunity to activate the simulation.

So, feel free to explore and to start developing your intuition about random processes. We would be happy to hear from you, and if you try the simulated VSM Game, sharing what you experienced would be even nicer.

The Joy of Principal Component Analysis

Making sense of a large pile of data – say, several hundred measurements of 5 to 10 different characteristics of a product – is a data analyst’s dream come true. In manufacturing environments, especially in the chemical industry, it often happens that we measure a wealth of data on each finished product batch. Some of these measurements are important as they might directly relate to customer requirements; some are “informative” only, and the goal of collecting all the data is to have an idea of what is happening in the production process. This is the theory, at least. Of course, life is always a bit more complicated, and many operations will simply be overwhelmed by the amount of data without having the adequate tools to make sense of it. So, the data will be faithfully collected – after all, it is in the SOP – but it will not be used to draw conclusions.

What I would like to show, is a method, to actually use this wealth of information. The method is by no means new but it definitely failed to find its way into the manufacturing environment. I still recall the first reaction of a colleague when I showed it to him: “oh, this is for PhD work” he said – meaning more or less “this is way too complex and practically useless”.

The method is called principal component analysis. The idea is to generate artificial variables out of the set of measured ones in a way that the new variables capture most of the variation present in the data set. To be more precise, we generate linear combinations of the original variables such that the first linear combination captures the most of the total variation, the second captures the most of the residual variation, and so on. If there is some systematic trend in the data, the hope is that the first few artificial variables (called principal components) will capture this systematic pattern and the latecomers will only catch random variation in the data. Then we can restrict our analysis to the first few principal components and have cleansed our data of most of the random variation.

To make this clearer, imagine a room in which we have temperature sensors in each of the 8 corners. We will have 8 sets of measurements that are somewhat correlated but also vary independently of each other. Running a PCA (principal component analysis) will yield a first principal component that is essentially the average of the 8 measurements – as obviously the average temperature in the room captures most of the variation of the room temperature (just think winter vs. summer, night vs. day). The second principal component will capture the difference between the averages of the top 4 sensors and the averages of the bottom 4 sensors (it is generally warmer on top, right?), the third component the difference between the side where we have the windows and the side that has no windows… and the rest probably only meaningless measurement variation and possibly some complex but small differences due to the patterns of air currents in the room.
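This room example is easy to try out. The sketch below simulates the 8 sensors as a common room temperature plus a top/bottom gradient plus independent sensor noise (all numbers are made up for illustration) and lets prcomp recover exactly that structure:

```r
set.seed(42)
n = 200
overall = rnorm(n, mean = 20, sd = 5)   # common room temperature
topbot  = rnorm(n, mean = 0,  sd = 1)   # top vs. bottom gradient
# sensors 1-4 are "top", sensors 5-8 are "bottom", each with its own noise
sensors = sapply(1:8, function(s)
  overall + ifelse(s <= 4, 1, -1) * topbot + rnorm(n, sd = 0.3))
pca = prcomp(sensors, scale. = TRUE)
summary(pca)                    # PC1 and PC2 capture almost all the variance
round(pca$rotation[, 1:2], 2)   # PC1: all loadings alike (the average),
                                # PC2: top sensors against bottom sensors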

What did we gain? Instead of the original 8 values we only need to analyze 3 – and we will gain a much better insight into what is going on. We did not reduce the number of measurements though – we still need all the sensors to construct our principal components.

What did we lose? As my colleague put it – this is PhD work – so we lose a bit of transparency. It is relatively easy to explain a trend in a sensor at the top of the room, but a trend in the first principal component sounds like academic gibberish. So, we will have to learn to explain the results and to relate them to real-life experiences. This is not as difficult as it sounds, but it definitely needs to be carefully prepared and the message tailored separately for each practical case.

Now to close with a practical example :  I just had the good luck to run into a problem at a customer I work for. They measured 5 components of a product and had the respective data from 2014 onward. All data points represented product batches that were accepted by the QA and delivered to the customers, as they were all in spec. The problem was that the customers still complained of highly variable quality and especially of variations in quality over time.

Looking at five control charts, all consisting of quite skewed, non-normal data, basically just increased the confusion. There were regularly points that were out of control, but did we have more of them in 2015 than in 2014? And if yes, was the difference significant? No one could tell.

So, I proposed to analyze the data with PCA.

In R one easy way to do this is by using the prcomp function in the following way:

Results = prcomp(InputDataFrame, scale. = TRUE)

Here, obviously, InputDataFrame contains the input data – that is, a table where each column contains the values of one specific measurement of the product (like viscosity, pH or conductivity) and each row is one set of measurements for one batch of the product. The scale. = TRUE option asks the function to standardize the variables before the analysis. This makes sense in general, but especially when the different measurements are on different scales (e.g. the conductivity could be a number between 500 and 1000 whereas the pH is less than 10). If they are not scaled, then the variables with greater values will dominate the analysis – something we want to avoid.

The Results object will contain all the outcomes of the PCA analysis. The first element, sdev, is a list of the standard deviations captured by each principal component. The first value is the greatest and the rest follow in decreasing order. From these values we can generate the so-called “scree plot”, which shows us how much of the total variance is captured by the first, second etc. principal component. The name comes from the image of a steep mountainside filled with rubble at the bottom – scree. The steeper the descent, the more variation is captured by the first few principal components – that is, the clearer the discrimination between systematic trends and useless noise.

A quick way to generate the scree plot out of the results of prcomp would be as follows: plot(Results$sdev^2/sum(Results$sdev^2), type = "l").

Simply calling plot on Results will give us a bar chart of the variances captured by each principal component, which carries comparable information, though it is not exactly the same.

The next element of the prcomp return structure is called “rotation” and it tells us how the principal components are constructed out of the original values. Best is to see this in a practical example:

                            PC1          PC2
Turbidity            0.45146656  -0.59885316
Moisture             0.56232155  -0.36368385
X25..Sol.Viscosity.  0.06056854   0.04734582
Conductivity        -0.44352071  -0.59106943
pH..1..              0.52876578   0.39686805

In this case we have 5 principal components, but I am only showing the first two as an example. The first component is built as 0.45 times the standardized value for Turbidity, plus 0.56 times Moisture, plus 0.06 times the viscosity, minus 0.44 times the conductivity, plus 0.53 times the pH. For each measured data point we can generate a new value by multiplying the standardized original measurements with these weights (called loadings in PCA language) and summing up.
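To see that this multiply-and-sum really is all there is to it, here is a small self-contained check – with random stand-in data, since the customer data cannot be shared: the scores computed by hand from the loadings agree with the ones prcomp returns.

```r
set.seed(1)
# stand-in for the real InputDataFrame: 100 batches, 5 measurements
InputDataFrame = as.data.frame(matrix(rnorm(100 * 5), ncol = 5))
Results = prcomp(InputDataFrame, scale. = TRUE)
# standardize with the centers and scales used by prcomp...
Z = scale(InputDataFrame, center = Results$center, scale = Results$scale)
# ...then multiply by the loadings and sum, i.e. a matrix product
byhand = Z %*% Results$rotation[, 1]
all.equal(as.numeric(byhand), as.numeric(Results$x[, 1]))   # TRUE
```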

What good would that bring us? The benefit is that, looking at the scree plot, we can see that most of the variation in our data is captured by the first two (maybe three) principal components. This means that we can generate new artificial values that have only two components (PC1 and PC2 instead of the original 5: Turbidity, Moisture etc.) and still describe the systematic trends of our data. This makes graphical analysis much easier – we can restrict ourselves to calculating the new artificial values for PC1 and PC2 and look at a two-dimensional graph to spot patterns. It also means that we have eliminated a lot of random variation from our data: it is captured by the remaining components PC3–PC5, which we conveniently leave out of the analysis. So, not only do we get a graph we can actually understand (I find my way in two dimensions much faster than in 5), but we also managed to clean up the picture by eliminating a lot of randomness.

As it is, we do not even need to do the calculation ourselves – the Results structure contains the recalculated values for our original data in a component called x, so Results$x[,1] contains the transformed values for the first principal component and Results$x[,2] those for the second. Plotting one against the other can give us a nice picture of the system’s evolution.

Here is the resulting plot of PC2 versus PC1  for the example I was working on.


The colors show different years, and we can see that between 2015 and 2016 there was a very visible shift towards increasing PC1 values. Now, looking at the loadings, we see that the PC1 values increase if Turbidity, Moisture and/or pH increase and/or conductivity decreases. Remember, these are all changes well within specs and also within the control limits, so they were not discernible with simple control charts. Still, once we knew what we were looking for, the trends of increasing moisture and decreasing conductivity were easily identifiable.

Now we can identify the process changes that led to these trends and strengthen the process control. PCA can also be used to plot the measured values of future batches, to check that the measurements do not creep away in the future. Not bad for an academic and useless method, right?

Survival Analysis, Simulation and Sample Size

The nice thing about working as a consultant is the variety of problems one is confronted with. Just as I was preparing the next blog post about the possibilities of simulation in statistics, I received a request to help design an experiment to determine the lifetime of a chemical product. Lifetime analysis – better known as survival analysis – is a large and interesting part of statistics. We in the “normal” Six Sigma world do not see much of it, but in DFSS projects it is one of the more important elements.

Today, Minitab provides some help in a relatively new sub-menu, and there are specialized programs such as Weibull++ which help specialists. Of course, R has a wealth of libraries to use as well. On the other hand, there is nothing better than a simulation to build intuition around a problem, so I decided to go that way, using R for the ease of writing scripts. A further problem in these cases is that calculating sample sizes is not easy and is not really well documented – in Minitab, for example.

So, I wanted to simulate some life data analysis and especially look at the power of the analysis as a function of external stress – something even Weibull++ will not provide, as the company sells a special program to plan HAST tests. (For those less versed in DFSS: HAST is the abbreviation of Highly Accelerated Stress Test. The idea is that one can define a stress factor (like weight for mechanical assemblies or temperature for electronic equipment) and assume that there is a relationship between the distribution of survival times and the level of the stress factor – the higher the stress, the faster a breakdown occurs. Once this relationship is known, from theory or possibly from earlier experiments, one can run a test for a much shorter duration at an elevated stress level, and work out the survival distribution for the normal stress level from the data.)

In our situation we had basically no knowledge of the relationship between the stress factor and the lifetime, so what we needed to do was to run lifetime tests at different levels of the stress factor, work out the relationship from the results, and THEN use these results to extrapolate the lifetime distribution at normal levels. Of course, the easiest would be to simply run the lifetime tests at the normal levels – we just did not have the time and the resources to wait for months and months until the product dies.

However, if we want to see the effect of higher stress levels on the product, we absolutely need to know the power of the statistical method used to test the data. This determines the sample size we need at each setting, and as usual, we want to avoid both the risk of not seeing anything in a stupidly underpowered study and that of wasting a lot of product and effort by using a higher than necessary sample size.

Estimating the power is simple in principle. We define an effect strength, simulate the data with the defined influence of the stress factor and run the analysis. We repeat this step 2000 (or more) times, count how many times we get a p-value less than 0.05, divide by the total number of experiments, and there we are: we have a power estimate. Of course, the details are a bit more cumbersome.


First, here is the R code to do this:

# Sample size determination for accelerated life testing
# Assume the survival time distribution is exponential, that is
# no manufacturing errors and no aging - constant hazard rate
# Assume that the hazard rate is dependent on a stress factor F

library(survival)

# To define a realistic parametrization we use historical knowledge:
# we know that at F=20 the MTTF = 3*30 = 90 days
# lambda = 1/MTTF = 1/90
# The effect size is given as a reduction of the MTTF - so an increase in the
# forcing factor from 20 to 40 reduces the MTTF by DM days

F = c(20, 40, 60, 80)

lambda0 = 1/90  # = 1/MTTF = 1/90 days

# Function to estimate the power for a setup consisting of a given number of
# runs, effect size DM (in MTTF days, e.g. DM=10 means a change of 20 grades
# reduces the MTTF by 10 days - a weak effect), sample size Ns per factor
# level and percentage of censored values (as per now, randomly decided)
SingleSetupRun = function(Nrun=100, DM=10, Ns=10, CensPer=0){
  ps = numeric(0)
  scales = numeric(0)
  # FOR Nrun times DO
  for(i in 1:Nrun){
    Fs = numeric(0)
    Ts = numeric(0)
    # FOR each value of F DO
    for(n in 0:3){
      # Calculate the lambda corresponding to an MTTF reduced by n*DM days
      lambda = lambda0/(1 - lambda0*n*DM)
      # Add the F values to the vector
      Fs = c(Fs, rep(F[n+1], Ns))
      # Generate Ns samples from the exponential described by lambda
      Ts = c(Ts, rexp(Ns, lambda))
    }
    # Build a censoring vector: a percentage equal to CensPer is
    # censored (0), the rest are observed failures (1)
    cs = ifelse(runif(length(Ts)) < CensPer, 0, 1)
    # Build the data frame
    sdata = data.frame(Fact=Fs, Stime=Ts, Cens=cs)
    S = Surv(sdata$Stime, sdata$Cens)
    # Perform the regression
    mod = survreg(S~sdata$Fact, dist="weibull")
    # Get the p-value of the stress factor and the estimated scale
    ps = c(ps, summary(mod)$table["sdata$Fact", "p"])
    scales = c(scales, mod$scale)
  }
  sign.ind = which(ps < 0.05)
  print(sprintf("Power is: %.2f, for effect %d and sample size %d",
                sum(ps<0.05)/length(ps), DM, Ns))
  print(sprintf("Percent scale>1 %.2f",
                sum(scales[sign.ind]>1)/length(sign.ind)))
}



Now, let us look at some details.

Setting the parameter of the exponential distribution


In life data analysis probably the most important decision is to determine the distribution of the survival times. The simplest model is the one where the survival times are described by the exponential distribution. This is equivalent to the assumption that there are no memory effects present – that is, the remaining time until a failure is independent of the age of the product. A different way to describe this is by using the concept of the hazard. The hazard at time t is defined as the probability that the product fails exactly at time t, after having survived until this moment. The memoryless property (which is equivalent to the exponential survival time distribution) simply says that the hazard is constant in time. To make things even simpler, the hazard is the only parameter of the exponential distribution – defined as “lambda” in the code above. Generating random values from a given exponential distribution is simply done by calling the R function rexp(size, lambda).


So, if we want to simulate different effect sizes we just need to change the lambda parameter accordingly. There is the crux, however – what do we mean by accordingly? We (at least I) are not trained to think in terms of hazards – so how can we decide whether changing the lambda parameter by 0.1 is a strong or a weak effect? I honestly do not know. So, I had to find a way to make the effect size easier to estimate, and I came up with the idea of using the Mean Time To Failure, or MTTF. This is simply the expected value of the exponential distribution – and despite its name, it is the time at which not 50% but roughly 37% of the tested items still survive. It can be proven that lambda = 1/MTTF, which is handy because most manufacturers provide the MTTF number. And given that this is a time value, we can relate to it more easily than to the concept of the hazard, so it will be easy to decide what is a weak, medium or strong effect. We know that our MTTF under normal, unstressed conditions should be around 90 days. So, an effect that shifts the MTTF by around 10% can be considered weak. Knowing this, we can work out backwards the change in lambda that will reduce the MTTF by 10 days (10%-ish). The details are in the comments of the code.
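This bookkeeping is easy to check in isolation. Under the assumptions above (an MTTF of 90 days and an effect of DM = 10 days per step of the stress factor), the adjusted lambda should correspond to MTTFs of 90, 80, 70 and 60 days:

```r
lambda0 = 1/90  # hazard corresponding to MTTF = 90 days
DM = 10         # effect size: MTTF reduction in days per factor step
for (n in 0:3) {
  lambda = lambda0 / (1 - lambda0 * n * DM)
  cat(sprintf("n = %d: MTTF = %.0f days\n", n, 1 / lambda))
}
# and the mean of a large exponential sample approaches 1/lambda:
mean(rexp(1e6, rate = lambda0))   # close to 90
```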


The other special feature of life data analysis is that we have two types of life data. There is the “real” type, where the value is simply the time of failure, but we also have “censored” data, where we did not see any failure but had to interrupt the observation. E.g. we planned the experiment for 8 weeks and at this time some items just kept on living. This is valuable information and it would be folly to discard it. But we cannot enter a value for the failure time either – it could be the next second or the next month, who knows? In life data analysis we mark these points as “censored” and can still use them in the analysis – but obviously this analysis has to be of a different type from a normal least squares regression. For the power analysis, we also have the intuition that censoring will reduce the power – a censored data point simply does not carry all the information a “normal” point has. In our R function, CensPer is the variable giving what percentage of the available data is censored. I just attached the censored label randomly to data points – a better way would be to assume that the censored values are mostly those with a high lifetime.
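For completeness, that “better way” would be Type I censoring: instead of labelling random points as censored, everything still alive at a fixed study end is censored and recorded at that time. A sketch, using the 8 week horizon from the example above:

```r
set.seed(7)
study_end = 56                  # an 8 week experiment, in days
true_t = rexp(100, rate = 1/90)               # true (unobserved) failure times
status = ifelse(true_t <= study_end, 1, 0)    # 1 = failure seen, 0 = censored
obs_t  = pmin(true_t, study_end)              # what we actually record
mean(status == 0)   # fraction censored, around exp(-56/90), i.e. roughly half
```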

Scale and power

Before looking at the results we still need to discuss another detail. When analyzing life data, as mentioned before, we can decide which distribution we assume for our data – and the simplest choice is the exponential, with a constant hazard rate. However, in real life we cannot be absolutely sure that this assumption holds – so, in a real-life situation we should not pick the exponential but something that allows a little more flexibility. Fortunately, there is such a choice: the Weibull distribution, which is flexible enough and contains the exponential distribution as a special case. So, in the analysis our best option is to pick the Weibull distribution, and we should pick this option also when looking at the power in simulations. If you check the code, the line

mod = survreg(S~sdata$Fact, dist="weibull")

is running the regression, with the Weibull distribution as our choice. This has an additional advantage for the power analysis. If we look at the results of the regression, we shall see a calculated value called “scale”. This gives us the direction of the effect detected by the regression: if the scale is greater than 1, the hazard increases with increasing stress factor; if it is smaller than 1, the hazard decreases with the stress. By looking at the distribution of the scale values we can judge how well a regression with the given sample and effect size can determine the direction of the effect. In principle, it is not enough to have a p-value less than 0.05 (a statistically significant effect); we would also need a scale greater than 1 (as we know that we set up the simulation in a way that the hazard increases with the strength of the factor).


Well, the first and most important result is that the code above gives a pretty good idea of what the power will look like. Feel free to paste the code into RStudio and experiment with it.

Now some results. This will not come as a surprise, but small effect sizes and small sample sizes result in very low power – it is almost hopeless to expect to see a significant effect. Moreover, the direction of the effect will be variable and misleading. For an effect of 10 days of shift in the MTTF and a sample of 5 per factor setting, I got only 14% significant tests (a power of 14%), and of those only 16% had a scale > 1 – basically useless for practical purposes.

Increasing the sample size strongly increases the power, but has only a mediocre effect on identifying the right direction of the effect. At 10 samples per factor value the power increased to 18%, and of those 27% identified the right direction. At a sample size of 20 the power was 29%, with scale > 1 at 32%. The numbers at the completely unrealistic sample size of 100 are: power at 84%, but percent scale > 1 still at only 46%.

So, we can hope to identify the existence of an effect with this method, but the direction will be iffy even at very high sample sizes. However, if we have a stronger effect, both numbers improve drastically:

At an effect of 30 days on the MTTF, even at a sample size of 5 (!) we would have a power of 99%, with scale > 1 at 89%. This improvement grows in a very non-linear way – at an effect of 20 days we would still only have a power of 45%, with the scale correctly identified in only 26% of the cases. So, definitely, the method becomes more reliable at stronger effects.

How does censoring affect the power? As expected, all other things being equal, a higher percentage of censored values reduces the power. For example, if we stay with the previous case but introduce censoring of 50% of our generated data, the power decreases from 99% to 76% and the percent scale > 1 from 89% to 75%.

The effect is not very strong though, and 50% censored is quite extreme, so there is hope that we can use this method, provided we manage to identify a suitable effective stress factor.

There is a lot more to find out with this simulation, and I encourage everyone to investigate this further on their own.


And, as always, any feedback is warmly welcome.