Author: user name hidden, 23 April 2013, 12:35, problem set
INPUT TO A REGRESSION PROBLEM
Simple regression: $(x_1, Y_1), (x_2, Y_2), \dots, (x_n, Y_n)$
Multiple regression:
$((x_1)_1, (x_2)_1, (x_3)_1, \dots, (x_K)_1, Y_1)$,
$((x_1)_2, (x_2)_2, (x_3)_2, \dots, (x_K)_2, Y_2)$,
$((x_1)_3, (x_2)_3, (x_3)_3, \dots, (x_K)_3, Y_3)$,
… ,
$((x_1)_n, (x_2)_n, (x_3)_n, \dots, (x_K)_n, Y_n)$
The variable Y is designated as the “dependent variable.” The only distinction between the two situations above is whether there is just one x predictor or many. The predictors are called “independent variables.”
There is a certain awkwardness about giving generic names for the independent variables in the multiple regression case. In this notation, x1 is the name of the first independent variable, and its values are (x1)1, (x1)2, (x1)3, … , (x1)n . In any application, this awkwardness disappears, as the independent variables will have application-based names such as SALES, STAFF, RESERVE, BACKLOG, and so on. Then SALES would be the first independent variable, and its values would be SALES1, SALES2, SALES3, … , SALESn .
The listing for the multiple regression case suggests that the data are found in a spreadsheet. In application programs like Minitab, the dependent variable and the independent variables may appear in any columns, in any order. Microsoft’s Excel requires that you identify the independent variables by blocking off a section of the spreadsheet; this means that the independent variables must appear in consecutive columns.
MINDLESS COMPUTATIONAL POINT OF VIEW
The output from a regression exercise is a “fitted regression model.”
Simple regression: $\hat{Y} = b_0 + b_1 x$
Multiple regression: $\hat{Y} = b_0 + b_1 (x_1) + b_2 (x_2) + b_3 (x_3) + \dots + b_K (x_K)$
Many statistical summaries are also produced. These include $R^2$, the standard error of estimate, t statistics for the b’s, an F statistic for the whole regression, leverage values, path coefficients, and many more. This work is generally done by a computer program, and we’ll give a separate document listing and explaining the output.
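To make the simple regression output concrete, here is a minimal Python sketch of the standard least-squares formulas for b0 and b1, together with R² and the standard error of estimate. The function name and the data are made up for illustration; in practice a statistics package would do this work and report far more.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of Y-hat = b0 + b1 * x (illustrative sketch)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual and total sums of squares.
    fitted = [b0 + b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "b0": b0,
        "b1": b1,
        "r_squared": 1 - sse / sst,
        # Standard error of estimate uses n - 2 degrees of freedom.
        "std_error": math.sqrt(sse / (n - 2)),
    }

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = fit_simple_regression(x, y)
print(result)
```

The sketch needs at least three data points with x values that are not all equal; otherwise the divisions by `sxx` or `n - 2` fail.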
WHY DO PEOPLE DO REGRESSIONS?
A cheap answer is that they want to explore the relationships among the variables.
A slightly better answer is that we would like to use the framework of the methodology to get a yes-or-no answer to this question: Is there a significant relationship between variable Y and one or more of the predictors? Be aware that the word significant has a very special jargon meaning.
A simple but honest answer pleads curiosity.
The most valuable (and correct) use of regression is in making predictions; see the next point. Only a small minority of regression exercises end up by making a prediction, however.
HOW DO WE USE REGRESSIONS TO MAKE PREDICTIONS?
The prediction situation is one in which we have new values of the predictor variables but do not yet have the corresponding Y.
Simple regression: We have a new x value, call it $x_{\text{new}}$, and the predicted (or fitted) value for the corresponding Y value is
$\hat{Y}_{\text{new}} = b_0 + b_1 x_{\text{new}}$.
Multiple regression: We have new predictors, call them $(x_1)_{\text{new}}, (x_2)_{\text{new}}, (x_3)_{\text{new}}, \dots, (x_K)_{\text{new}}$. The predicted (or fitted) value for the corresponding Y value is
$\hat{Y}_{\text{new}} = b_0 + b_1 (x_1)_{\text{new}} + b_2 (x_2)_{\text{new}} + b_3 (x_3)_{\text{new}} + \dots + b_K (x_K)_{\text{new}}$
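The prediction formula is just the intercept plus a sum of coefficient-times-predictor terms. A minimal Python sketch follows; the function name, the coefficients, and the new predictor values are all hypothetical and stand in for whatever a fitted regression would actually produce.

```python
def predict(coefficients, new_predictors):
    """Return Y-hat_new = b0 + b1*(x1)new + ... + bK*(xK)new.

    coefficients   -- [b0, b1, ..., bK] from a fitted regression
    new_predictors -- [(x1)new, ..., (xK)new]
    """
    if len(coefficients) != len(new_predictors) + 1:
        raise ValueError("need one intercept plus one coefficient per predictor")
    b0, slopes = coefficients[0], coefficients[1:]
    return b0 + sum(b * x for b, x in zip(slopes, new_predictors))

# Hypothetical fitted coefficients b0, b1, b2, b3 and new predictor values.
b = [10.0, 2.0, -0.5, 4.0]
x_new = [3.0, 8.0, 1.5]
y_hat_new = predict(b, x_new)
print(y_hat_new)
```

Simple regression is the special case with a single predictor, for example `predict([1.0, 2.0], [4.0])`.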
CAN I PERFORM REGRESSIONS WITHOUT ANY UNDERSTANDING OF THE UNDERLYING MODEL AND WHAT THE OUTPUT MEANS?
Yes, many people do. In fact, we’ll be able to come up with rote directions that will work in the great majority of cases. Of course, these rote directions will sometimes mislead you. And wisdom still works better than ignorance.
WHAT’S THE REGRESSION MODEL?
The model says that Y is a linear function of the predictors, plus statistical noise.
Simple regression: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
Multiple regression: $Y_i = \beta_0 + \beta_1 (x_1)_i + \beta_2 (x_2)_i + \beta_3 (x_3)_i + \dots + \beta_K (x_K)_i + \varepsilon_i$
The coefficients (the $\beta$’s) are nonrandom but unknown quantities. The noise terms $\varepsilon_1, \varepsilon_2, \varepsilon_3, \dots, \varepsilon_n$ are random and unobserved. Moreover, we assume that these $\varepsilon$’s are statistically independent, each with mean 0 and (unknown) standard deviation $\sigma$.
The model is simple, except for the details about the $\varepsilon$’s. We’re just saying that each data point is obscured by noise of unknown magnitude. We assume that the noise terms are not out to deceive us by lining up in perverse ways, and this is accomplished by making the noise terms independent.
Sometimes we also assume that the noise terms are taken from normal populations, but this assumption is rarely crucial.
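One way to get a feel for the model is to generate data from it. The Python sketch below simulates the simple regression model with independent noise terms. It assumes normal noise, which the model only sometimes requires, and the function name, parameter values, and seed are chosen purely for illustration.

```python
import random

def simulate_simple_regression(x, beta0, beta1, sigma, seed=0):
    """Generate Y_i = beta0 + beta1 * x_i + noise_i for each x_i.

    The noise terms are independent draws with mean 0 and standard
    deviation sigma (normal noise is an assumption of this sketch).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = simulate_simple_regression(x, beta0=5.0, beta1=2.0, sigma=1.5)
for xi, yi in zip(x, y):
    print(xi, round(yi, 2))
```

In real data we see only the x’s and Y’s; the values of beta0, beta1, sigma, and the individual noise terms stay hidden, which is why they have to be estimated.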
WHO GIVES ANYONE THE RIGHT TO MAKE A REGRESSION MODEL? DOES THIS MEAN THAT WE CAN JUST SAY SOMETHING AND IT AUTOMATICALLY IS CONSIDERED AS TRUE?
Good questions. Merely claiming that a model is correct does not make it correct. A model is a mathematical abstraction of reality. Models are selected on the basis of simplicity and credibility. The regression model used here has proved very effective. A careful user of regression will make a number of checks to determine if the regression model is believable. If the model is not believable, remedial action must be taken.
HOW CAN WE TELL IF A REGRESSION MODEL IS BELIEVABLE? AND WHAT’S THIS REMEDIAL ACTION STUFF?
Patience, please. It helps to examine some successful regression exercises before moving on to these questions.
THERE SEEMS TO BE SOME PARALLEL STRUCTURE INVOLVING THE MODEL AND THE FITTED MODEL.
It helps to see these things side-by-side.
Simple regression:
The model is $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
The fitted model is $\hat{Y} = b_0 + b_1 x$
Multiple regression:
The model is $Y_i = \beta_0 + \beta_1 (x_1)_i + \beta_2 (x_2)_i + \beta_3 (x_3)_i + \dots + \beta_K (x_K)_i + \varepsilon_i$
The fitted model is $\hat{Y} = b_0 + b_1 (x_1) + b_2 (x_2) + b_3 (x_3) + \dots + b_K (x_K)$
The Roman letters (the b’s) are estimates of the corresponding Greek letters (the $\beta$’s).
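How the b’s are obtained from the data is not spelled out here; the usual method is least squares, and the Python sketch below assumes that. It solves the normal equations by Gaussian elimination using only the standard library. The function name and the data are hypothetical, and the data are deliberately noise-free so that the estimates reproduce the coefficients used to build them.

```python
def fit_multiple_regression(rows, y):
    """Least-squares estimates [b0, b1, ..., bK].

    rows -- one list [x1, ..., xK] of predictor values per data point
    y    -- the observed Y values, in the same order
    """
    # Design matrix with a leading 1 in each row for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Noise-free illustration built from Y = 2 + 3*x1 - 1*x2.
rows = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 + 3 * x1 - x2 for x1, x2 in rows]
b = fit_multiple_regression(rows, y)
print([round(v, 6) for v in b])
```

With real, noisy data the b’s would only approximate the unknown coefficients, and production software typically uses numerically safer methods than the normal equations.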
WHAT ARE THE FITTED VALUES?
In any regression, we can “predict” or retro-fit the Y values that we’ve already observed, in the spirit of the PREDICTIONS section above.
Simple regression:
The model is $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
The fitted model is $\hat{Y} = b_0 + b_1 x$
The fitted value for point i is $\hat{Y}_i = b_0 + b_1 x_i$
Multiple regression:
The model is $Y_i = \beta_0 + \beta_1 (x_1)_i + \beta_2 (x_2)_i + \beta_3 (x_3)_i + \dots + \beta_K (x_K)_i + \varepsilon_i$
The fitted model is $\hat{Y} = b_0 + b_1 (x_1) + b_2 (x_2) + b_3 (x_3) + \dots + b_K (x_K)$
The fitted value for point i is $\hat{Y}_i = b_0 + b_1 (x_1)_i + b_2 (x_2)_i + b_3 (x_3)_i + \dots + b_K (x_K)_i$
Indeed, one way to assess the success of the regression is the closeness of these fitted Y values, namely $\hat{Y}_1, \hat{Y}_2, \hat{Y}_3, \dots, \hat{Y}_n$, to the actual observed Y values $Y_1, Y_2, Y_3, \dots, Y_n$.
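A small Python sketch of this comparison: compute the fitted value for each data point, subtract it from the observed Y, and summarize the gaps. The coefficients and data are hypothetical, and the sum of squared differences is used here as one common measure of closeness; it is an assumption of this sketch, not the only possible choice.

```python
def fitted_values(coefficients, rows):
    """Fitted value b0 + b1*(x1)_i + ... + bK*(xK)_i for every data point i."""
    b0, slopes = coefficients[0], coefficients[1:]
    return [b0 + sum(b * x for b, x in zip(slopes, row)) for row in rows]

# Hypothetical fitted coefficients [b0, b1] and observed data.
b = [1.0, 2.0]
rows = [[1.0], [2.0], [3.0], [4.0]]   # one predictor per data point
y = [3.1, 4.9, 7.2, 8.8]              # observed Y values

y_hat = fitted_values(b, rows)
# Differences between observed and fitted values.
residuals = [yi - fi for yi, fi in zip(y, y_hat)]
sum_sq = sum(r ** 2 for r in residuals)

print(y_hat)
print([round(r, 2) for r in residuals])
print(round(sum_sq, 4))
```

Small differences mean the fitted values track the observed ones closely; large ones suggest the regression is not capturing much.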
THIS IS LOOKING COMPUTATIONALLY HOPELESS.
Indeed it is. These calculations should only be done by computer. Even a careful, well-intentioned person is going to make arithmetic errors if attempting this by a non-computer method. You should also be aware that computer programs seem to compete in using the latest innovations. Many of these innovations are passing fads, so don’t feel too bad about not being up-to-the-minute on the latest changes.
The notation used here in the models is not universal. Here are some other possibilities.
Notation here | Other notation
$Y_i$ | $y_i$
$x_i$ | $X_i$
$\beta_0 + \beta_1 x_i$ | $\alpha + \beta x_i$
$\varepsilon_i$ | $e_i$ or $r_i$
$(x_1)_i, (x_2)_i, (x_3)_i, \dots, (x_K)_i$ | $x_{i1}, x_{i2}, x_{i3}, \dots, x_{iK}$
$b_j$ | $\hat{\beta}_j$