Handling Categorical Data

The Logistics Regression data analysis tool cannot analyze non-numeric data, but alphanumeric data can be converted to numeric form using the Real Statistics Extract Columns from a Data Range data analysis tool.

Example 1: Convert the data in range A3:D19 on the left side of Figure 1 to numeric form.

Categorical coding dialog box

Figure 1 – Categorical coding of alphanumeric data

Press Ctrl-m and choose the Extract Columns from a Data Range option. When the dialog box shown on the right side of Figure 1 appears, insert range A3:D19 into the Input Range field (or highlight the range A3:A19 B3 and then press the Fill button) and press the OK button. The dialog box is now changed as shown in Figure 2.

First we simply copy column A into the output as shown in Figure 2. This is done by clicking on the Age item in the list of column headings shown in the dialog box and clicking on the Add Column button.

Extract Columns dialog box

Figure 2 – Extract Columns dialog box

Next, select the Party item in the dialog box and click on the Add Code button. This will use ordinary tag coding (aka dummy coding) to code the Party values. Since there are three values for Party, 3 – 1 = 2 dummy variables will be used.

Next, select both the Gender and Vote items in the dialog box (i.e. click on Gender and then while holding down the Shift key click on Vote). Now change the Code type to Categorical coding and click on the Add Code button. Finally, click on the Done button to close the Extract Columns from a Data Range dialog box. The result is shown in Figure 3.

Categorical coding output

Figure 3 – Categorical coding output

Note that we can also use the Categorical coding option even when the categorical variable contains more than two outcomes. E.g. if we had done this with the Party variable, then the output would consist of one column (instead of two columns) containing 0’s, 1’s and 2’s. The only problem with this approach for Logistic regression is that an implied order would be assumed, namely  Rep < Dem < Ind, since Rep would be coded as 0, Dem as 1 and Ind as 2 (i.e. the order in which they first appear in column B).

Leave a Reply

Your email address will not be published. Required fields are marked *