Coding of Categorical Variables

As described elsewhere in this website, especially regarding regression (see ANOVA using Regression), it is common to create dummy (or tag) coding for categorical variables. We have employed both the usual coding (using 1 and 0) as well as the alternative coding (using 1, 0, -1).

On this webpage we show how to create the dummy coding using supplemental functions found in the Real Statistics Resource Pack. We also demonstrate an even more straightforward coding, called the simple coding, where categories such as Democrat, Republican, Independent are coded as 0, 1, 2.

Real Statistics Functions: The following array functions are provided by the Real Statistics Resource Pack.

CATCODE(R1): If R1 is an m × n array then this function returns an m × n array with the simple coding of the corresponding cell in R1, i.e.  a number from 0 to k–1 where k = the number of unique values in R1 (the numbering is in the order of the first appearance of that unique value in R1).

TAGCODE (R1, b): if R1 is an m × 1 column range with k unique values, then this function returns an m × (k–1) range with the dummy coding of the values in R1 using the 1/0 coding when b = TRUE (the default) and the 1/0/-1 coding when b = FALSE.

Note that empty cells are not coded (alternatively you can think of them as being coded as empty).

Example 1: Find the three different types of coding for the Season variable based on the data in range B4:B13 of Figure 1

Dummy coding functions

Figure 1 – Categorical Coding

The simple coding of the Season column is shown in the range F4:F13, which is calculated using the array formula =CATCODE(B4:B13). The E and G columns are copied from columns A and C since we are not coding them.

The usual dummy coding is calculated using the array formula =TAGCODE(B4:B13). Since there are 4 values for the Season variable, there are 4–1 = 3 dummy variables with values shown in range I4:K13.

The alternative dummy coding is shown in range M4:O13 using the array formula =TAGCODE(B4:B13,FALSE).

Real Statistics Data Analysis Tool: The Real Statistics Extract Columns data analysis tool also provides a way to perform any one of the codings described above. For more information about this, see Categorical Coding for Regression.

12 thoughts on “Coding of Categorical Variables”

  1. Hi,

    I am trying to use the CATCODE Function to create categorical variable values, but every time I attempt it with my data I receive a single entry of zero.

    How should I be inputting the values for this function? What am I doing wrong?

    Thank you in advance for your help!

    Reply
  2. Hey Charles,
    Real statistics is a great portal for learning analytics.
    I had a doubt can we use simple coding of categorical variables to convert it into numeric data and apply multiple regression on it.

    Reply
    • You can use simple coding provided there is order to the categories (in which case the data is not categorical). If there is no order then you need to use dummy coding.
      Charles

      Reply
  3. The example of how to use tagcode in the last two paragraphs doesn’t look right to me. Shouldn’t the formulas be =TAGCODE(B4:B13) and =TAGCODE(B4:B13,false)?

    Reply
    • Alan,
      You are correct. I have changed the webpage to correct this error. Thanks very much for catching this error.
      Charles

      Reply
  4. This is not working. Could you explain more of what is being called and returned in the tagcode function? Am I to enter that funtion in one cell and it populates to the whole range?

    Reply
    • James,

      Suppose you have the following data in range A1:A6

      A
      2
      A
      3
      A
      2

      Highlight the range C1:D6 (i.e. any 6 x 2 range — 6 because the original data has 6 rows and 2 (= 3 – 1) because the original data has 3 distinct values (A, 2, 3). If you now enter the formula =TAGCODE(A1:A6) in the highlighted range and press Ctrl-Shft-Enter the output is as follows:

      1 0
      0 1
      1 0
      0 0
      1 0
      0 1

      Charles

      Reply
  5. Wonderful site and wonderful work! Statistics for the masses . . .

    Small correction re: categorical variables:

    “The E and F columns are ” should read “The E and G columns are “.

    Reply

Leave a Comment