Abstract
Social media has become a popular source of data for health analysis. However, the mathematical and computational techniques needed to work with massive social media data are challenging for public health practitioners, and results from traditional machine learning techniques are difficult to interpret. This study proposes a simple new solution: regressing the primary outcome of interest (e.g., the number of retweets of a tweet, or whether a tweet contains certain keywords) on the frequency of common terms appearing in the tweet. The method reduces the term matrix based on the fitted regression scores, such as relative risks or odds ratios. It also addresses the data sparsity issue and transforms text data into continuous summary scores. With the proposed scores, it becomes easier to analyze social media data and to interpret the results. We applied the method to a Twitter data set on Autism Spectrum Disorder (ASD), fitting Poisson, hurdle, and logistic regression models, with model selection based on the Youden index. We found that the terms with significant results generally reflect the key factors associated with ASD in the existing literature.
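As a rough illustration of the idea, and not the authors' implementation, the sketch below scores each common term by its log odds ratio against a binary outcome, sums those scores into one continuous summary score per tweet, and picks a cutoff with the Youden index (sensitivity + specificity - 1). Several pieces are assumptions made for the example: the toy tweets and labels are invented, the function names are hypothetical, the odds ratio comes from a 2x2 table on term presence rather than from a fitted regression on term frequency, a 0.5 continuity correction is added to avoid division by zero, and both outcome classes are assumed to be present.

```python
import math
from collections import Counter


def term_log_odds_ratios(tweets, labels, min_count=2):
    """Log odds ratio of a binary outcome for each sufficiently common term."""
    docs = [set(t.lower().split()) for t in tweets]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    doc_freq = Counter(term for doc in docs for term in doc)
    scores = {}
    for term, df in doc_freq.items():
        if df < min_count:  # drop rare terms: this is the matrix reduction step
            continue
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
        b = df - a      # term present, outcome 0
        c = n_pos - a   # term absent, outcome 1
        d = n_neg - b   # term absent, outcome 0
        # 0.5 continuity correction (an assumption of this sketch)
        scores[term] = math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
    return scores


def tweet_score(tweet, scores):
    """Continuous summary score: sum of the scores of the terms in the tweet."""
    return sum(scores.get(term, 0.0) for term in set(tweet.lower().split()))


def youden_cutoff(values, labels):
    """Cutoff maximizing J = sensitivity + specificity - 1 (predict 1 if value >= cutoff)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if v >= cut and y == 1)
        tn = sum(1 for v, y in zip(values, labels) if v < cut and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


# Invented toy data: 1 = outcome of interest observed for the tweet, 0 = not.
tweets = [
    "autism vaccine myth",
    "autism early diagnosis",
    "vaccine myth debunked",
    "early diagnosis support",
    "autism support group",
    "vaccine schedule",
]
labels = [1, 0, 1, 0, 0, 1]

scores = term_log_odds_ratios(tweets, labels)
tweet_scores = [tweet_score(t, scores) for t in tweets]
best_cut, best_j = youden_cutoff(tweet_scores, labels)

for term, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:10s} log OR = {s:+.2f}")
print("Youden-optimal cutoff:", round(best_cut, 2), "J =", round(best_j, 2))
```

Terms with a positive log odds ratio are associated with the outcome and terms with a negative value are associated with its absence, which is what makes the per-term scores directly interpretable. On real data the cutoff should be chosen and evaluated on separate samples; the toy example scores the same tweets it was fitted on.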
Original language | American English |
---|---|
State | Published - Mar 26 2018 |
Event | Eastern North American Region International Biometric Society (ENAR) - Duration: Mar 25 2018 → … |
Keywords
- Autism Spectrum Disorder
- Text-Mining Data
DC Disciplines
- Biostatistics
- Public Health