IMPACT OF SAMPLE SIZE ON MULTICOLLINEARITY WITH HIGH DIMENSIONAL DATA IN LOGISTIC REGRESSION ANALYSIS

Gerald Ohene Agyekum

Kumasi Technical University

Samuel Akwasi Adarkwa

Kumasi Technical University

Richard Yaw Kusi

Kumasi Technical University

Keywords: logistic regression, multicollinearity, high dimensional data, principal component analysis


Abstract

In fields such as epidemiology and the biological sciences, logistic regression is essential for predicting or classifying binary outcomes. However, multicollinearity, in which predictor variables are highly correlated, can distort the model and lead to erroneous conclusions about each predictor’s effect. Maximum likelihood estimation is commonly used to obtain the model parameters, but it can be problematic with small sample sizes. Researchers often suggest remedies such as dropping variables or principal component analysis (PCA), yet there is little research on how sample size affects multicollinearity in high-dimensional binary logistic regression. This study evaluated the feasibility of using PCA to manage multicollinearity in logistic regression with a large number of predictor columns, and examined how sample size affects multicollinearity using samples of size 100, 200, 500, 1000, 1500 and 2000. The results indicate that standard errors (SEs) and variance inflation factors (VIFs) decrease as sample size increases and increase as sample size decreases, even when the predictors are uncorrelated, suggesting that sample size plays a crucial role in multicollinearity. The study recommends a sample size of at least 500 to avoid multicollinearity problems in logistic regression. If a sample of this size cannot be obtained, PCA is a useful alternative.
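The sample-size effect described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the number of predictors (`p = 50`), the standard normal predictors, the random seed, and the helper names `vifs` and `pca_scores` are all assumptions made for illustration. The sketch computes VIFs as the diagonal of the inverse correlation matrix for independently generated predictors at each of the study's sample sizes, then checks that principal component scores have VIFs of 1.

```python
import numpy as np

def vifs(X):
    """VIF of each column, taken as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def pca_scores(X, k):
    """Project standardised X onto its first k principal components (via SVD)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:k].T

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
p = 50                          # assumed number of predictors

# Predictors are drawn independently, so any VIF above 1 is sampling noise.
mean_vif = {}
for n in (100, 200, 500, 1000, 1500, 2000):
    X = rng.standard_normal((n, p))
    mean_vif[n] = float(vifs(X).mean())
    print(n, round(mean_vif[n], 3))

# Principal component scores are uncorrelated in-sample, so their VIFs equal 1.
scores = pca_scores(rng.standard_normal((100, p)), k=10)
pc_vifs = vifs(scores)
```

Under these assumptions the mean VIF should shrink toward 1 as the sample grows, while the component scores have no inflation at any sample size, which is why PCA is a plausible fallback when a large sample is unavailable.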


Author Biographies

Gerald Ohene Agyekum, Kumasi Technical University

Department of Statistical Sciences

Samuel Akwasi Adarkwa, Kumasi Technical University

Department of Statistical Sciences

Richard Yaw Kusi, Kumasi Technical University

Department of Statistical Sciences