# Preparing for Conducting a Logistic Regression

## Contents

In a meeting with SS, he explained to me that a few preparatory steps are needed in order to conduct a logistic regression. After talking to him, I also started reading up on logistic regression in the book SPSS Survival Manual (7th ed.) by Julie Pallant. To summarize, the following steps need to be taken before you can conduct a logistic regression: 1) Motivate the predictors (independent variables) chosen; 2) Make sure assumptions are met; 3) Prepare the data.

### 1. Motivate the predictors chosen

Firstly, one needs to decide on which predictors to include and motivate why they should be included. This should be motivated theoretically based on what is known through earlier research in the field.

#### 1.1. Predictors chosen for the logistic regression

In my case, after discussing with SS, I decided to add ‘Earlier experience of being actively involved in research’ as an additional predictor to the list of already decided-upon predictors. Thus, the predictors (independent variables) for willingness to be actively involved in research that I would like to include in the model are (with the SPSS variable names in parentheses):

- Age (AGE)
- Sex (GENDER)
- Level of education (Q19_recode)
- Economic situation (Q21)
- Self-rated health (Q27)
- Birthplace (“Är du född i ett annat land än Sverige”, i.e. “Were you born in a country other than Sweden?”) (Q22)
- Municipality classification (Kommunklassificering) (MunicipalityClassification)
- Earlier experience of being actively involved in research (ActivelyInvolved)

For the last independent variable in the list, I realised that I needed to create a new variable, which goes something like this:

IF Q3._1_recode = 1 OR Q3._2_recode = 1 OR Q3._3_recode = 1 OR Q3._4_recode = 1 OR Q3._5_recode = 1 OR Q3._6_recode = 1 OR Q3._7_recode = 1 -> Have been actively involved in research = 1.

Thus, for this I created a new variable called *ActivelyInvolved* (through Transform -> Compute variable as described here as ‘Method 2’). And it worked out just fine!

I also manually entered a ‘99’ for all the cases for which Q3._1_recode AND Q3._2_recode AND Q3._3_recode AND Q3._4_recode AND Q3._5_recode AND Q3._6_recode AND Q3._7_recode = 99.

Unfortunately, I found no way to automatically enter a ‘2’ for the rest of the answers, so I had to enter it manually.
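For reference, the three rules above could be sketched outside SPSS in a few lines of Python. This is only a sketch under my own assumptions: the function name `derive_actively_involved` is hypothetical, each respondent's seven Q3 items are assumed to arrive as a plain list of integer codes, and the example respondents are made up for illustration.

```python
MISSING = 99  # code used in the text for "missing"


def derive_actively_involved(items):
    """Collapse the seven Q3 item codes into one ActivelyInvolved code."""
    if any(value == 1 for value in items):
        return 1  # at least one item equals 1: has been actively involved
    if all(value == MISSING for value in items):
        return MISSING  # every item is missing
    return 2  # all remaining cases: not actively involved


# Made-up respondents for illustration only (not survey data).
respondents = [
    [2, 2, 1, 2, 2, 2, 2],
    [99, 99, 99, 99, 99, 99, 99],
    [2, 2, 2, 2, 2, 2, 2],
]
for items in respondents:
    print(derive_actively_involved(items))
```

The order of the checks matters: the "any item equals 1" rule is tested first, so a respondent with one ‘1’ and six ‘99’ values still counts as actively involved.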

### 2. Make sure assumptions are met

Secondly, the assumptions must be met in order for the results to be valid. Assumptions here relate to three aspects: a) Sample size; b) Multicollinearity; c) Outliers.

#### Sample size

Here a potential issue concerns the number of cases in the sample and the number of predictors (independent variables) one wishes to include in the logistic regression model.

In the book it is recommended to run **Descriptive Statistics** on each of the predictors, and to consider collapsing or deleting categories if they have limited numbers.

My question here is: How do you know if the numbers are limited, and what is considered limited? Is there some form of measure for this?

Update: After doing a bit of reading, I found the following formula on page 155 in Pallant’s book:

Tabachnick and Fidell (2013, p. 123) give a formula for calculating sample size requirements, taking into account the number of independent variables that you wish to use: *N* > 50 + 8*m* (where *m* = number of independent variables).

Thus, for me that number would be: 50 + (8 x 9) = 122.
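The arithmetic can be double-checked with a tiny Python helper. This is just a sketch: the function name `required_sample_size` is my own, and the value 9 for *m* is simply the number used in the calculation above.

```python
def required_sample_size(m):
    """Return the threshold 50 + 8m that the sample size N must exceed."""
    return 50 + 8 * m


threshold = required_sample_size(9)
print(threshold)  # 122, so N must be larger than 122
```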

I decided then to run Descriptive Statistics in SPSS on each of the chosen variables and the result is presented here:

##### Running Descriptive Statistics in SPSS

- Overview with focus on frequency
    - Age (AGE)
        - Scale
    - Sex (GENDER)
        - 1 = Man (male)
        - 2 = Kvinna (female)
    - Level of education (Q19_recode)
        - 1 = Grundskola (compulsory school)
        - 2 = Gymnasium (upper secondary school)
        - 3 = Eftergymnasial utbildning, mindre än tre år (post-secondary education, less than three years)
        - 4 = Eftergymnasial utbildning, tre år eller mer (post-secondary education, three years or more)
        - 5 = Forskarutbildning (doctoral education)
        - 6 = Annat/vill ej uppge (other/prefer not to say)
    - Economic situation (Q21)
        - 1 = Dålig (poor)
        - 2 = Någorlunda (fair)
        - 3 = God (good)
        - 4 = Mycket god (very good)
        - 5 = Utmärkt (excellent)
    - Self-rated health (Q27)
        - 1 = Dålig (poor)
        - 2 = Någorlunda (fair)
        - 3 = God (good)
        - 4 = Mycket god (very good)
        - 5 = Utmärkt (excellent)
    - Birthplace (“Är du född i ett annat land än Sverige”) (Q22)
        - 1 = Ja (yes)
        - 2 = Nej (no)
    - Municipality classification (Kommunklassificering) (MunicipalityClassification)
        - A1 = Large cities
        - A2 = Commuting municipalities near large cities
        - B3 = Medium-sized towns
        - B4 = Commuting municipalities near medium-sized towns
        - B5 = Commuting municipalities with a low commuting rate near medium-sized towns
        - C6 = Small towns
        - C7 = Commuting municipalities near small towns
        - C8 = Rural municipalities
        - C9 = Rural municipalities with a visitor industry
    - Have been actively involved in research? (ActivelyInvolved)
        - 1 = Yes
        - 2 = No

##### Conclusions drawn from running Descriptive Statistics in SPSS

**Age (AGE)**

Since AGE is a scale variable, the outcome of Descriptive Statistics doesn’t really matter, and we do not have to make any changes due to the sample size for this variable.

**Sex (GENDER)**

Since the variable GENDER has only two categories and the split is roughly 50/50, there is no need in this case to worry about too small a sample size for this variable. So the categories here are:

- Man (n=412)
- Kvinna (n=462)

**Level of education (Q19_recode)**

Here it becomes a bit trickier. To begin with, there are six individuals (0.7%) who filled in *‘Annat, nämligen’* (“Other, namely”). First and foremost, since they are so few, they cannot constitute a standalone category appropriate for the logistic regression. Moreover, since one cannot simply assume that the kind of education these individuals have can be graded along the same lower-to-higher scale as the others, they need to be excluded. What code should they be given? Ideally they should be assigned a special code, but in order not to complicate things I will give them the code ‘99’ (= Missing), since I will in any case only use valid percent for the logistic regression.

Next, only 18 individuals in the sample have *‘Forskarutbildning’*, which seems too few to constitute an independent category in this case. Thus, I will merge this category into the existing category ‘Eftergymnasial utbildning, tre år eller mer’.

Given the sample size and how it is distributed across the categories, I have decided to recode the variable ‘Level of education’ for the logistic regression and reduce the number of categories from six to four, namely the following:

- 1 = Grundskola (n=210)
- 2 = Gymnasium (n=155)
- 3 = Eftergymnasial utbildning mindre än tre år (n=201)
- 4 = Eftergymnasial utbildning tre år eller mer (n=284 + 18 = 302)

Thus, I did the following recoding and named the new variable *Q19_Recode_for_LR*:

- 1 -> 1 = Grundskola
- 2 -> 2 = Gymnasium
- 3 -> 3 = Eftergymnasial utbildning mindre än tre år
- 4 -> 4 = Eftergymnasial utbildning tre år eller mer
- 5 -> 4 = Eftergymnasial utbildning tre år eller mer
- 6 -> 99 = Missing
- 7 -> 99 = Missing
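The same kind of recoding can be expressed as a lookup table, which also makes it easy to verify that the collapsed categories have the expected counts. The following Python sketch is not what I ran in SPSS. The names `EDUCATION_RECODE` and `recode` are my own, the input list is rebuilt from the frequencies reported above, and treating any unmapped value as missing is an assumption.

```python
from collections import Counter

# Old code -> new code, as in the list above (99 = Missing).
EDUCATION_RECODE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 99, 7: 99}


def recode(values, mapping, missing=99):
    """Map each old code to its new code; unmapped values become missing (assumption)."""
    return [mapping.get(value, missing) for value in values]


# Frequencies of the original codes as reported in the text.
original_counts = {1: 210, 2: 155, 3: 201, 4: 284, 5: 18, 6: 6}

# Rebuild one value per respondent from the frequencies, then recode.
original_values = [code for code, n in original_counts.items() for _ in range(n)]
recoded_counts = dict(Counter(recode(original_values, EDUCATION_RECODE)))

print(recoded_counts)
```

Keeping the mapping in one dictionary means the same `recode` function can be reused for the other variables by swapping in a different mapping.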

**Economic situation (Q21)**

The category ‘Dålig’ has 46 responses and ‘Utmärkt’ has 53. I consider both of these categories too small to be used on their own in the logistic regression. Thus, I will recode the variable to contain the following categories:

- Dålig - Någorlunda (n = 46 + 226 = 272)
- God (n = 390)
- Mycket god - Utmärkt (n = 154 + 53 = 207)

Thus, I did the following recoding and named the new variable *Q21_Recode_for LR*:

- 1 -> 1 = Dålig - Någorlunda
- 2 -> 1 = Dålig - Någorlunda
- 3 -> 2 = God
- 4 -> 3 = Mycket god - Utmärkt
- 5 -> 3 = Mycket god - Utmärkt

**Self-rated health (Q27)**

The category ‘Dålig’ has 22 responses, and ‘Utmärkt’ has 91. I think at least the former category is too small. Thus, I will recode the variable to contain the following categories:

- Dålig - Någorlunda (n=22 + 216 = 238)
- God (n = 338)
- Mycket god - Utmärkt (n = 210 + 91 = 301)

Thus, I did the following recoding and named the new variable *Q27_Recode_for LR*:

- 1 -> 1 = Dålig - Någorlunda
- 2 -> 1 = Dålig - Någorlunda
- 3 -> 2 = God
- 4 -> 3 = Mycket god - Utmärkt
- 5 -> 3 = Mycket god - Utmärkt

**Birthplace (“Är du född i ett annat land än Sverige”) (Q22)**

Here I think it is justified to rephrase the question, since a vast majority were born in Sweden and since this could be argued to constitute the “normal” case. In other words, I will create a new variable called BornInSweden where 1 = Yes and 2 = No. Thus, the categories for this variable will be:

- Yes (n = 799)
- No (n = 78)

Thus, I did the following recoding and named the new variable *Q22_Recode_for LR* [Born in Sweden]:

- 1 -> 2 = “No”
- 2 -> 1 = “Yes”

**Kommunklassificering (MunicipalityClassification)**

The current classification has nine categories. However, some of these categories do not contain very many responses. It is hard to know what should be considered sufficient in this context, but I drew the line at 123, and thus decided to recode the variable so that it contains the following six categories instead of the current nine:

- A1 = Large cities (n=134)
- A2 = Commuting municipalities near large cities (n=134)
- B3 = Medium-sized towns (n=195)
- B4 + B5 = Commuting municipalities near medium-sized towns (n=95 + 58 = 153)
- C6 = Small towns (n=123)
- C7 + C8 + C9 = Commuting municipalities near small towns OR Rural municipalities (n=70 + 65 = 135)

Here, I also realised that it is preferable to have a data set with simple numbers rather than codes, so I first created a new variable called *MunicipalityClassification_numbers* in which I deleted the initial letter of the code, so that A1 -> 1; A2 -> 2; B3 -> 3, etc.

Thus, I did the following recoding and named the new variable *MunicipalityClassification_Recode_for LR*:

- 1 -> 1 = A1: Large cities
- 2 -> 2 = A2: Commuting municipalities near large cities
- 3 -> 3 = B3: Medium-sized towns
- 4 -> 4 = B4 + B5: Commuting municipalities near medium-sized towns
- 5 -> 4 = B4 + B5: Commuting municipalities near medium-sized towns
- 6 -> 5 = C6: Small towns
- 7 -> 6 = C7 + C8 + C9: Commuting municipalities near small towns OR Rural municipalities
- 8 -> 6 = C7 + C8 + C9: Commuting municipalities near small towns OR Rural municipalities
- 9 -> 6 = C7 + C8 + C9: Commuting municipalities near small towns OR Rural municipalities
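The two steps for this variable (dropping the letter, then collapsing nine categories into six) can be sketched together in Python. This is only an illustration: the names `municipality_to_number`, `MUNICIPALITY_RECODE` and `recode_municipality` are hypothetical, and the codes are assumed to always have the form of one letter followed by one digit.

```python
def municipality_to_number(code):
    """Drop the leading letter of a code such as 'A1' or 'C9' and keep the digit."""
    return int(code[1:])


# Numeric code -> collapsed category, as in the list above.
MUNICIPALITY_RECODE = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 5, 7: 6, 8: 6, 9: 6}


def recode_municipality(code):
    """Go from the original letter-digit code straight to the six-category code."""
    return MUNICIPALITY_RECODE[municipality_to_number(code)]


for code in ["A1", "A2", "B3", "B4", "B5", "C6", "C7", "C8", "C9"]:
    print(code, "->", recode_municipality(code))
```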

**Have been actively involved in research? (ActivelyInvolved)**

There are only two categories, with 119 responses (13.5%) in the smaller one, so there is not much to discuss. However, I decided to assign a ‘0’ to no active involvement and a ‘1’ to active involvement. Thus, I did the following recoding and named the new variable *ActivelyInvolved_recode_for LR*:

- 1 -> 1 = Yes
- 2 -> 0 = No

**In summary**

Birthplace has only 78 responses in the ‘No’ category, which, judging by the formula for multiple regression, is too small a number. The question is whether this requirement also needs to be fulfilled when running a logistic regression. Would it then be better to create two different models rather than one big one? For N = 78 to be acceptable, a model can contain at most *m* independent variables, where:

78 > 50 + 8*m*, i.e. *m* < 3.5

Meaning, it could at most contain three independent variables.
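This limit can also be found by simply trying increasing values of *m* in the formula *N* > 50 + 8*m* quoted earlier. A minimal Python sketch (the function name `max_predictors` is my own):

```python
def max_predictors(n):
    """Largest whole number m for which n > 50 + 8 * m still holds."""
    m = 0
    while 50 + 8 * (m + 1) < n:
        m += 1
    return m


print(max_predictors(78))  # 3, since 50 + 8 * 3 = 74 < 78 but 50 + 8 * 4 = 82 > 78
```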

#### Multicollinearity

On page 136, Pallant writes:

Before performing a correlation analysis, it is a good idea to generate a scatterplot. This enables you to check for outliers and for violation of the assumptions of linearity. […] Inspection of the scatterplots also gives you a better idea of the nature of the relationship between your variables.

The question is, which variables should I check for correlation between? Can I do it for all eight independent variables at once? As I read further in the book, I found that this is possible by following the method described on page 143. However, intertwining that many variables should be done with caution, and it is good to have a strategy or plan for what one is doing and why. I also found a good article on Medium on the topic. However, in the chapter of Pallant’s book devoted specifically to logistic regression, she writes (on page 176):

…you should always check for high intercorrelation among your predictor (independent) variables. Ideally, your predictor variables will be strongly related to your dependent variable but not strongly related to each other. Unfortunately, there is no formal way in the logistic regression procedure of IBM SPSS Statistics to test for multicollinearity, but you can use the procedure described in Chapter 13 to request collinearity diagnostics under the **Statistics** button. Ignore the rest of the output, but focus on the **Coefficients table** and the columns labelled **Collinearity Statistics**. Tolerance values that are very low (less than .1) indicate that the variable has high correlations with other variables in the model. You may need to reconsider the set of variables that you wish to include in the model and remove one of the highly intercorrelating variables.
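To make the tolerance rule in the quote concrete, here is a small Python sketch that flags predictors with a tolerance below .1 and also reports the corresponding VIF (variance inflation factor, which is 1 divided by the tolerance). Everything specific in it is an assumption for illustration: the function name `flag_low_tolerance`, the predictor names and the tolerance values are invented placeholders, not output from my data.

```python
def flag_low_tolerance(tolerances, cutoff=0.1):
    """Return {predictor: VIF} for every predictor whose tolerance is below the cutoff."""
    flagged = {}
    for name, tolerance in tolerances.items():
        if tolerance < cutoff:
            flagged[name] = 1 / tolerance  # VIF = 1 / tolerance
    return flagged


# Invented placeholder values, NOT results from the survey data.
example_tolerances = {
    "predictor_a": 0.85,
    "predictor_b": 0.05,
    "predictor_c": 0.40,
}

for name, vif in flag_low_tolerance(example_tolerances).items():
    print(name, "has low tolerance, VIF =", round(vif, 1))
```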

I followed the instructions on page 169 and this was the **Coefficients table** I got:

#### Outliers

Pallant writes on page 176:

It is important to check for presence of outliers or cases that are not well explained by your model. In logistic regression terms, a case may be strongly predicted by your model to be one category but in reality be classified in the other category. These outlying cases can be identified by inspecting the residuals, a particularly important step if you have problems with the goodness of fit of your model.
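To illustrate what “strongly predicted to be one category but in reality classified in the other” means in terms of residuals, here is a Python sketch using the Pearson (standardized) residual, (observed − p) / √(p(1 − p)), where p is the predicted probability of the outcome coded 1. This is my own illustration and not taken from the book: the function names are hypothetical, the cutoff of 2.5 is an assumed rule of thumb, and the example cases are made up.

```python
import math


def standardized_residual(observed, predicted_probability):
    """Pearson residual for one case; observed is 0 or 1, probability strictly between 0 and 1."""
    p = predicted_probability
    return (observed - p) / math.sqrt(p * (1 - p))


def find_poorly_predicted(cases, cutoff=2.5):
    """Return (index, residual) for cases whose absolute residual exceeds the cutoff."""
    flagged = []
    for index, (observed, p) in enumerate(cases):
        residual = standardized_residual(observed, p)
        if abs(residual) > cutoff:
            flagged.append((index, residual))
    return flagged


# Made-up (observed outcome, predicted probability) pairs for illustration.
example_cases = [(1, 0.80), (0, 0.95), (1, 0.10), (0, 0.30)]

for index, residual in find_poorly_predicted(example_cases):
    print("case", index, "residual", round(residual, 2))
```

In this made-up example, the second and third cases are flagged: the model gave them a very high (or very low) probability, but the observed outcome was the opposite.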

Here is the **Residuals table** I got when I followed the instructions on page 169:

### 3. Prepare the data

Here, I need to make the dependent variable dichotomous, and not only that, I should also according to Pallant (p. 177)…

…code the responses as 0 and 1 […] The value of 0 should be assigned to whichever response indicates a lack or absence of the characteristic of interest.

The dependent variable here is *Q7 Skulle du kunna tänka dig att medverka aktivt i forskning om åldrande och hälsa?* (“Would you consider actively participating in research on ageing and health?”), which is currently coded in the following way:

- 1 = Ja (yes)
- 2 = Nej (no)
- 3 = Kanske (maybe)

In order to code it properly for the logistic regression I recoded it in the following way and I named the new variable *Q7_recode_for LR*:

- 1 -> 1
- 2 -> 0
- 3 -> 1