IMPACT OF SAMPLE SIZE ON MULTICOLLINEARITY WITH HIGH DIMENSIONAL DATA IN LOGISTIC REGRESSION ANALYSIS
Gerald Ohene Agyekum
Kumasi Technical University
Samuel Akwasi Adarkwa
Kumasi Technical University
Richard Yaw Kusi
Kumasi Technical University
Keywords: logistic regression, multicollinearity, high dimensional data, principal component analysis
Abstract
In fields like epidemiology and the biological sciences, logistic regression is essential for predicting or categorizing binary outcomes. However, multicollinearity, where predictor variables are highly correlated, can distort the model and lead to erroneous conclusions about each predictor’s effect. While maximum likelihood estimation is commonly used to obtain model parameters, it can be problematic with small sample sizes. There is little research on how sample size affects multicollinearity in high-dimensional binary logistic regression. To address multicollinearity, researchers often suggest methods such as dropping variables or principal component analysis (PCA). This study aimed to evaluate the feasibility of using PCA to manage multicollinearity in logistic regression with high-dimensional data and to examine how sample size affects multicollinearity, using samples of size 100, 200, 500, 1000, 1500 and 2000. Results indicate that standard errors (SEs) and variance inflation factors (VIFs) decrease as sample size increases and increase as sample size decreases, even with no correlation between predictors, suggesting that sample size plays a crucial role in multicollinearity. The study recommends a sample size of at least 500 to avoid issues with multicollinearity in logistic regression. If obtaining a sample of this size is not possible, PCA is a useful alternative.
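The reported pattern, VIFs that shrink as the sample grows even when the predictors are uncorrelated, can be illustrated with a minimal simulation sketch. It is not the authors' procedure: the number of predictors (20), the standard normal predictors, and the helper names `vifs` and `pca_scores` are assumptions made for illustration, and the VIFs are computed from the predictor correlation matrix alone rather than from a fitted logistic model. Only the sample sizes are taken from the abstract.

```python
import numpy as np


def vifs(X):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


def pca_scores(X):
    """Principal component scores of the centered data, computed via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt.T


rng = np.random.default_rng(0)
n_predictors = 20  # assumption: the abstract does not state the dimension
sample_sizes = [100, 200, 500, 1000, 1500, 2000]  # sizes named in the abstract

# Independent predictors: any VIF above 1 comes from sampling noise alone.
mean_vif = {}
for n in sample_sizes:
    X = rng.standard_normal((n, n_predictors))
    mean_vif[n] = float(vifs(X).mean())
    print(f"n={n:5d}  mean VIF={mean_vif[n]:.3f}")

# Principal component scores are mutually uncorrelated, so their VIFs are 1.
pc_vif = vifs(pca_scores(rng.standard_normal((100, n_predictors))))
print("max VIF of PC scores at n=100:", float(pc_vif.max()))
```

In this sketch the inflation at small n is purely a sampling effect, since chance correlations among many independent predictors are larger in small samples. Replacing the predictors with their principal component scores removes those correlations by construction, which is consistent with the abstract's suggestion of PCA when a large sample is not available.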
Author Biographies
Gerald Ohene Agyekum, Kumasi Technical University
Department of Statistical Sciences
Samuel Akwasi Adarkwa, Kumasi Technical University
Department of Statistical Sciences
Richard Yaw Kusi, Kumasi Technical University
Department of Statistical Sciences