predict.glm -> which class does it predict?

Jul 10, 2009; 10:46pm

predict.glm -> which class does it predict?

2 posts

Hi,

I have a question about logistic regression in R.

Suppose I have a small list of proteins P1, P2, P3 that predict a
two-class target T, say cancer/noncancer. Lets further say I know that I
can build a simple logistic regression model in R

model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the dataset of
the Proteins).

This works fine. T is a factored vector with levels cancer, noncancer.
Proteins are numeric.

Now, I want to use predict.glm to predict a new data.

predict(model, newdata=testsamples, type="response") (testsamples is
a small set of new samples).

The result is a vector of the probabilites for each sample in
testsamples. But probabilty WHAT for? To belong to the first level in T?
To belong to second level in T?

Is this fallowing expression
factor(predict(model, newdata=testsamples, type="response") >= 0.5)
TRUE, when the new sample is classified to Cancer or when it's
classified to Noncancer? And why not the other way around?

Thank you,

Peter

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Marc Schwartz-3

Jul 10, 2009; 11:37pm

Re: predict.glm -> which class does it predict?

1330 posts

On Jul 10, 2009, at 9:46 AM, Peter Schüffler wrote:

> Hi,
>
> I have a question about logistic regression in R.
>
> Suppose I have a small list of proteins P1, P2, P3 that predict a
> two-class target T, say cancer/noncancer. Lets further say I know
> that I can build a simple logistic regression model in R
>
> model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the
> dataset of the Proteins).
>
> This works fine. T is a factored vector with levels cancer,
> noncancer. Proteins are numeric.
>
> Now, I want to use predict.glm to predict a new data.
>
> predict(model, newdata=testsamples, type="response") (testsamples
> is a small set of new samples).
>
> The result is a vector of the probabilites for each sample in
> testsamples. But probabilty WHAT for? To belong to the first level
> in T? To belong to second level in T?
>
> Is this fallowing expression
> factor(predict(model, newdata=testsamples, type="response") >= 0.5)
> TRUE, when the new sample is classified to Cancer or when it's
> classified to Noncancer? And why not the other way around?
>
> Thank you,
>
> Peter

... [show rest of quote]

As per the Details section of ?glm:

A typical predictor has the form response ~ terms where response is
the (numeric) response vector and terms is a series of terms which
specifies a linear predictor forresponse. ***For binomial and
quasibinomial families the response can also be specified as a factor
(when the first level denotes failure and all others success)*** or as
a two-column matrix with the columns giving the numbers of successes
and failures. A terms specification of the form first + second
indicates all the terms in first together with all the terms in second
with any duplicates removed.

So, given your description above, you are predicting
"noncancer"...that is, you are predicting the probability of the
second level of the factor ("success"), given the covariates.

If you want to predict "cancer", alter the factor levels thusly:

T <- factor(T, levels = c("noncancer", "cancer"))

By default, R will alpha sort the factor levels, so "cancer" would be
first.

Think of it in terms of using a 0,1 integer code for absence,presence,
where you are predicting the probability of a '1', or the presence of
the event or characteristic of interest.

BTW, using 'T' as the name of the response vector is not a good habit:

> T
[1] TRUE

'T' is shorthand for the built in R constant TRUE. R is generally
smart enough to know the difference, but it is better to avoid getting
into trouble by not using it.

HTH,

Marc Schwartz

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Peter Dalgaard

Jul 10, 2009; 11:48pm

Re: predict.glm -> which class does it predict?

2360 posts

In reply to this post by Peter Schüffler-2

Peter Schüffler wrote:

> Hi,
>
> I have a question about logistic regression in R.
>
> Suppose I have a small list of proteins P1, P2, P3 that predict a
> two-class target T, say cancer/noncancer. Lets further say I know that I
> can build a simple logistic regression model in R
>
> model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the dataset of
> the Proteins).
>
> This works fine. T is a factored vector with levels cancer, noncancer.
> Proteins are numeric.
>
> Now, I want to use predict.glm to predict a new data.
>
> predict(model, newdata=testsamples, type="response") (testsamples is
> a small set of new samples).
>
> The result is a vector of the probabilites for each sample in
> testsamples. But probabilty WHAT for? To belong to the first level in T?
> To belong to second level in T?
>
> Is this fallowing expression
> factor(predict(model, newdata=testsamples, type="response") >= 0.5)
> TRUE, when the new sample is classified to Cancer or when it's
> classified to Noncancer? And why not the other way around?

... [show rest of quote]

It's the probability of the 2nd level of a factor response (termed
"success" in the documentation, even when your modeling the probability
of disease or death...), just like when interpreting the logistic
regression itself.

I find it easiest to sort ut this kind of issue by experimentation in
simplified situations. E.g.

> x <- sample(c("A","B"),10,replace=TRUE)
> x
[1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A"
> table(x)
x
A B
4 6

(notice that the relative frequency of B is 0.6)

> glm(x~1,binomial)
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
In addition: Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'x' converted to a factor

(OK, so it won't go without conversion to factor. This is a good thing.)

> glm(factor(x)~1,binomial)

Call: glm(formula = factor(x) ~ 1, family = binomial)

Coefficients:
(Intercept)
0.4055

Degrees of Freedom: 9 Total (i.e. Null); 9 Residual
Null Deviance: 13.46
Residual Deviance: 13.46 AIC: 15.46

(The intercept is positive, corresponding to log odds for a probability
> 0.5 ; i.e., must be that "B": 0.4055==log(6/4))

> predict(glm(factor(x)~1,binomial))
1 2 3 4 5 6 7
8
0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651
0.4054651
9 10
0.4054651 0.4054651
> predict(glm(factor(x)~1,binomial),type="response")
1 2 3 4 5 6 7 8 9 10
0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6

As for why it's not the other way around, well, if it had been, then you
could have asked the same question....

--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) (*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - ([hidden email]) FAX: (+45) 35327907

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Gabor Grothendieck

Jul 11, 2009; 1:27am

Re: predict.glm -> which class does it predict?

7686 posts

2009/7/10 Peter Dalgaard <[hidden email]>:

> Peter Schüffler wrote:
>>
>> Hi,
>>
>> I have a question about logistic regression in R.
>>
>> Suppose I have a small list of proteins P1, P2, P3 that predict a
>> two-class target T, say cancer/noncancer. Lets further say I know that I can
>> build a simple logistic regression model in R
>>
>> model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the dataset of
>> the Proteins).
>>
>> This works fine. T is a factored vector with levels cancer, noncancer.
>> Proteins are numeric.
>>
>> Now, I want to use predict.glm to predict a new data.
>>
>> predict(model, newdata=testsamples, type="response") (testsamples is a
>> small set of new samples).
>>
>> The result is a vector of the probabilites for each sample in testsamples.
>> But probabilty WHAT for? To belong to the first level in T? To belong to
>> second level in T?
>>
>> Is this fallowing expression
>> factor(predict(model, newdata=testsamples, type="response") >= 0.5)
>> TRUE, when the new sample is classified to Cancer or when it's classified
>> to Noncancer? And why not the other way around?
>
> It's the probability of the 2nd level of a factor response (termed "success"
> in the documentation, even when your modeling the probability of disease or
> death...), just like when interpreting the logistic regression itself.
>
> I find it easiest to sort ut this kind of issue by experimentation in
> simplified situations. E.g.
>
>> x <- sample(c("A","B"),10,replace=TRUE)
>> x
> [1] "B" "A" "B" "B" "A" "B" "B" "A" "B" "A"
>> table(x)
> x
> A B
> 4 6
>
> (notice that the relative frequency of B is 0.6)
>
>> glm(x~1,binomial)
> Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1
> In addition: Warning message:
> In model.matrix.default(mt, mf, contrasts) :
> variable 'x' converted to a factor
>
> (OK, so it won't go without conversion to factor. This is a good thing.)
>
>> glm(factor(x)~1,binomial)
>
> Call: glm(formula = factor(x) ~ 1, family = binomial)
>
> Coefficients:
> (Intercept)
> 0.4055
>
> Degrees of Freedom: 9 Total (i.e. Null); 9 Residual
> Null Deviance: 13.46
> Residual Deviance: 13.46 AIC: 15.46
>
> (The intercept is positive, corresponding to log odds for a probability >
> 0.5 ; i.e., must be that "B": 0.4055==log(6/4))
>
>> predict(glm(factor(x)~1,binomial))
> 1 2 3 4 5 6 7 8
> 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651 0.4054651
> 0.4054651
> 9 10
> 0.4054651 0.4054651
>> predict(glm(factor(x)~1,binomial),type="response")
> 1 2 3 4 5 6 7 8 9 10
> 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6
>
> As for why it's not the other way around, well, if it had been, then you
> could have asked the same question....
>

... [show rest of quote]

Or more specifically:

> resp <- factor(c("cancer", "noncancer", "noncancer", "noncancer"))
> mod <- glm(resp ~ 1, family = binomial)
> predict(mod, type = "response")
1 2 3 4
0.75 0.75 0.75 0.75

and since noncancer occurs 75% of the time in the sample clearly
its predicting the probability of noncancer.

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Peter Dalgaard

Jul 11, 2009; 2:10am

Re: predict.glm -> which class does it predict?

2360 posts

In reply to this post by Peter Dalgaard

> As for why it's not the other way around, well, if it had been, then you
> could have asked the same question....

...and come to think about it, it is rather convenient that it meshes
with the default ordering of levels in factor(x) is x is 0/1 or FALSE/TRUE.

--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) (*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - ([hidden email]) FAX: (+45) 35327907

______________________________________________
[hidden email] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.