빈도 분석 (Frequencies and Crosstabs)

Sunday. October 22, 2017

preface 이번 포스트에서는 categorical variables 분석을 위하여 frequency table 과 contingency table 을 만드는 방법을 알아봅니다. 또한 이들 간의 독립성을 테스트하고, 연관성 척도를 계산하고, 결과를 그래프를 통해 나타내는 방법을 알아봅니다.

다음 자료를 참고하였습니다:

https://www.statmethods.net/stats/frequencies.html

Frequency Table 만들기 (Generating Frequency Tables)

R 에서는 frequency table 과 contingency table 을 만드는 다양한 방법을 제공합니다. 아래에 3가지 예시가 준비되어 있습니다. 예제에서 A, B, C 는 categorical variable 을 의미합니다.

Frequency Table: table( )

table( ) 함수를 이용하여 frequency table 을 만들 수 있습니다. 빈도수 대신 비율 테이블을 만들려면 prop.table( ) 함수를 사용합니다. marginal frequencies 는 margin.table( ) 함수를 사용합니다.

# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table

margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)

prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages

table ( ) 함수를 이용하면 3개 이상의 categorical variable 에 대한 테이블도 만들 수 있습니다. 이 경우 ftable( ) 함수를 이용하면 결과를 보다 깔끔하게 출력할 수 있습니다.

# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)

table ( ) 함수는 결측치를 무시합니다. 결측치(NA)도 포함하려면, 해당 변수가 vector 인 경우 옵션에 exclude=NULL를 추가합니다. 해당 변수가 factor 인 경우 newfactor <- factor(oldfactor, exclude=NULL) 로 새로운 factor 를 만들어줍니다.

다음 코드를 이용하면 간단한 frequency table 을 dataframe 에서 볼 수 있습니다.

tmp <- as.data.frame(table(mytable$myvar))

xtabs( )

xtabs( ) 함수는 formula 형태의 입력을 지원합니다. 왼쪽에 위치한 변수는 vector of frequencies 로 간주됩니다.

# 3-Way Frequency Table
mytable <- xtabs(~A+B+c, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of indepedence

Crosstable

gmodels 패키지의 CrossTable( ) 함수를 사용하면 SAS 의 PROC FREQ 나 SPSS 의 CROSSTABS 와 같은 crosstable 을 만들 수 있습니다. 해당 함수는 다양한 옵션을 가지고 있습니다. (There are options to report percentages (row, column, cell), specify decimal places, produce Chi-square, Fisher, and McNemar tests of independence, report expected and residual values (pearson, standardized, adjusted standardized), include missing values as valid, annotate with row and column titles, and format as SAS or SPSS style output.)

# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)

독립성 테스트 (Tests of Independence)

Chi-Square Test

2차원 테이블에 대해서 chisq.test(mytable) 함수를 사용하면 row and column variable 간의 독립성을 테스트할 수 있습니다. p-value 는 asymptotic chi-squared distribution 으로 계산하는 것이 디폴트입니다. 옵션 설정을 통해 p-value 를 Monte Carlo simultation 으로 계산할 수 있습니다.

Null Hypothesis: 변수 간의 관련성이 없다

Fisher Exact Test

fisher.test(x) 는 표본 크기가 적을 때 사용할 수 있는 Fisher Exact Test 를 계산합니다. x 는 matrix 형태의 2차원 contingency table 형태여야 합니다.

Mantel-Haenszel test

Use the mantelhaen.test(x) function to perform a Cochran-Mantel-Haenszel chi-squared test of the null hypothesis that two nominal variables are conditionally independent in each stratum, assuming that there is no three-way interaction. x is a 3 dimensional contingency table, where the last dimension refers to the strata.

Loglinear Models

MASS 패키지의 loglm( ) 함수를 사용해서 log-linear model 을 만들 수 있습니다. 다음은 변수 A, B, C 에 대한 3차원 contingency table 을 만드는 방법을 소개합니다.

library(MASS)
mytable <- xtabs(~A+B+C, data=mydata)

다음 테스트를 실시할 수 있습니다:

Mutual Independence: A, B, and C are pairwise independent.

loglm(~A+B+C, mytable)

Partial Independence: A is partially independent of B and C (i.e., A is independent of the composite variable BC).

loglin(~A+B+C+B*C, mytable)

Conditional Independence: A is independent of B, given C.

loglm(~A+B+C+A*C+B*C, mytable)

No Three-Way Interaction

loglm(~A+B+C+A*B+A*C+B*C, mytable)

연관성 척도 (Measures of Association)

vcd 패키지의 assocstats(mytable) 함수를 사용하면 phi coefficient, contingency coefficient, and Cramer’s V for an rxc table 을 계산할 수 있습니다.

vcd 패키지의 kappa(mytable) 함수를 사용하면 Cohen’s kappa weighted kappa for a confusion matrix 를 계산할 수 있습니다.

Richard Darlington 이 쓴 Measures of Association in Crosstab Tables 에 보다 자세히 나와있습니다.

결과 시각화 (Visualizing results)

1차원 빈도를 나타낼 때에는 바차트와 파이차트를 사용합니다.

categorical data 간의 관계를 나타낼 때에는 vcd 패키지의 mosaic plot 과 association plot 을 사용합니다.

ca 패키지를 사용하면 correspondence analysis 를 실시할 수 있습니다. (contingency tables 의 행과 열의 관계를 시각화 할 수 있습니다.)

Converting Frequency Tables to an “Original” Flat file

frequency table 로부터 원자료를 복구해야할 필요가 있을지도 모릅니다. Marc Schwartz가 그 방법에 대해서 설명합니다.

Hyeongjun Kim

Financial Economist, Data Scientist, and Hearthstone Player