How to calculate simple linear model for noisy data?
I have data on the number of hits a website gets per day. This data is very noisy. I've seen someone calculate a stepwise linear model in order to simplify the graph and highlight significant changes, like this: [image]. What do you think is the right approach / method / tool to generate the linear model (red line)? PS: I posted in this forum because I assume statistics is involved in the process.
I think a better method would be to plot a best-fit line through the data. Excel can do that, as can (for example) R.

The stepwise approach can be accommodated, but you need to make some assumptions to get it done. One set of assumptions: the steps occur at set breaks in time. With this assumption you only need to fit a constant on each subset of the data (which just means using the mean to estimate the data on each subinterval). However, this is not very satisfying. Since your data is a time series, I believe there is a more appropriate model for piecewise constant regression. Use an average of the last N data points. This will smooth the data, and jumps in the average will become more apparent. Use these jumps to determine where the breaks between the piecewise approximations should occur, and then use the average on each interval as the fit there. Note that this second formulation is NOT a linear regression, because the fit is not linear in the data (the jump points themselves move with the data).
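The moving-average procedure described above could be sketched roughly as follows. This is only a minimal illustration, not a definitive method: the function names (`moving_average`, `find_breaks`, `piecewise_constant_fit`), the window length `n`, and the `threshold` are all assumptions made for this example, and the rule "compare two adjacent N-point averages and take the largest jump" is just one plausible way of reading "use these jumps".

```python
import numpy as np

def moving_average(y, n):
    """Mean of each run of n consecutive points (result has len(y) - n + 1 entries)."""
    return np.convolve(y, np.ones(n) / n, mode="valid")

def find_breaks(y, n, threshold):
    """Hypothetical break detector: flag index i where the mean of the n points
    before i and the mean of the n points from i onward differ by more than
    `threshold`. Within each run of flagged indices, keep only the largest jump."""
    m = moving_average(np.asarray(y, dtype=float), n)
    jump = np.abs(m[n:] - m[:-n])  # jump[j] compares y[j:j+n] with y[j+n:j+2n]
    breaks = []
    j = 0
    while j < len(jump):
        if jump[j] > threshold:
            k = j
            while k < len(jump) and jump[k] > threshold:
                k += 1
            peak = j + int(np.argmax(jump[j:k]))
            breaks.append(peak + n)  # first index of the new level
            j = k
        else:
            j += 1
    return breaks

def piecewise_constant_fit(y, breaks):
    """Replace each segment between consecutive breaks by its mean."""
    y = np.asarray(y, dtype=float)
    fitted = np.empty(len(y))
    edges = [0, *breaks, len(y)]
    for start, stop in zip(edges[:-1], edges[1:]):
        fitted[start:stop] = y[start:stop].mean()
    return fitted

if __name__ == "__main__":
    # Synthetic data (made up for illustration): the level shifts at day 30.
    rng = np.random.default_rng(0)
    hits = np.concatenate([np.full(30, 100.0), np.full(30, 150.0)])
    hits += rng.normal(0.0, 5.0, size=hits.size)
    breaks = find_breaks(hits, n=7, threshold=25.0)
    print(breaks, piecewise_constant_fit(hits, breaks)[[0, -1]])
```

The threshold has to be chosen relative to the noise level: too low and ordinary fluctuations get flagged, too high and real shifts are missed.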
Thank you for your great post. I will definitely do some reading on piecewise constant regression. I agree with you that in order to use the very simplified linear model it must be assumed that "the mean of the data basically stays the same and only changes from time to time". The core problem here, of course, is to find where these breaks happen (this will later be used in a news-like fashion, notifying the user about these breaks). I guess I could apply a threshold to the moving average, since "jumps in the average will become more apparent", or is there a better way? By the way, on the slide where I saw this, the author also wrote down the confidence level for each break. How might that fit into the model?
Do you expect any seasonality in your data? For example, is Saturday's volume likely to be greater than Monday's each week? If so, it is wrong to attempt to put a trend line through your raw data; instead you should seasonally adjust the data first. A simple way of doing that is to create a 7-day rolling average (7-day if you think your data is weekly seasonal) and then try to fit a linear model to that. By the way, why do you expect the underlying trend in your data to be linear? Maybe the data reflects the amount of promotion you are getting; if that promotion is done in bursts, then I might expect a sawtooth shape in your underlying data.
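As a rough sketch of the 7-day rolling average followed by a linear fit, something like the following might work. The function names `weekly_rolling_mean` and `fit_trend` are made up for this example, the data is synthetic, and NumPy's `polyfit` is just one convenient way to fit the line.

```python
import numpy as np

def weekly_rolling_mean(hits, window=7):
    """Average each run of `window` consecutive days, which cancels a
    repeating weekly pattern. Entry j covers days j .. j + window - 1."""
    hits = np.asarray(hits, dtype=float)
    return np.convolve(hits, np.ones(window) / window, mode="valid")

def fit_trend(smoothed):
    """Least-squares straight line through the smoothed series.
    Returns (slope per day, intercept at the first smoothed point)."""
    days = np.arange(len(smoothed))
    slope, intercept = np.polyfit(days, smoothed, deg=1)
    return slope, intercept

if __name__ == "__main__":
    # Synthetic example (assumed data): linear growth plus a weekly cycle.
    days = np.arange(56)
    weekly_pattern = np.tile([-30, -20, -10, 0, 10, 20, 30], 8)
    hits = 100 + 2 * days + weekly_pattern
    print(fit_trend(weekly_rolling_mean(hits)))
```

Because each smoothed value spans a full week, the weekday effect averages out before the line is fitted. If the underlying shape is a sawtooth rather than a line, the residuals from this fit should make that visible.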

