Predictive Data Modeling: How it Works


A Q&A with Jin Yu, Data Scientist and UCLA Graduate Student

What is the goal of UCIPT’s Twitter sentiment model?

The goal of the Twitter sentiment model is to predict the sentiment (positive, neutral, or negative) of a given tweet accurately. However, it is only a part of the big picture that our group has in mind. By developing a model that can predict sentiments of tweets accurately, we intend to automate the entire process of labeling an incoming stream of tweets. Through such automation, we hope to predict meaningful things such as students’ stress level and GPA by analyzing a vast amount of data available on Twitter.

How do you use a neural network to improve the accuracy of your predictions?

We have tried several machine learning models. We started with basic, popular ones such as logistic regression, random forest, and support vector machine, all of which relied on a bag-of-words approach. However, we quickly recognized the limitations of that approach: such models cannot capture the semantic or syntactic information of words and are insensitive to word order. This inspired us to implement a deep convolutional neural net, which is sensitive to word order. Moreover, each word is mapped to a continuous, high-dimensional vector, which allows neural net models to capture word semantics and syntax to some extent. The results are promising: the convolutional neural net model has outperformed all the other models we tried by about 5% in classification accuracy.
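To make the word-order limitation concrete, here is a minimal sketch in Python. It is an illustration only, not the group's actual feature pipeline; the `bag_of_words` helper and the two example phrases are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens, discarding word order."""
    return Counter(text.lower().split())


# Two phrases that use the same words in a different order.
a = "good not bad"
b = "bad not good"

# A bag-of-words model sees identical inputs...
same_bag = bag_of_words(a) == bag_of_words(b)

# ...even though the token sequences differ.
same_sequence = a.split() == b.split()

print(same_bag, same_sequence)
```

Any classifier that only receives these counts has no way to tell the two phrases apart, which is the gap an order-sensitive model is meant to close.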

How many layers are there in your model? What are they and how do they work?

Our model has five layers: word embedding, convolutional, pooling, fully-connected, and soft-max. The word-embedding layer maps words to high-dimensional vectors. Those word vectors are learned through the examples we feed into the model. The convolutional layer maintains a filter, which contains a number of trainable weights, and applies the filter to the sentence matrix in a sliding window fashion. (Note that neural nets convert each word into a vector, meaning each sentence is mapped to a matrix.) The convolutional layer allows the model to be sensitive to word order and capture local features of input sentences. The pooling layer works to summarize the features learned through the convolutional layer. The fully-connected layer, which is used in conventional neural networks, helps to map learned features to the final classification results. In our case, we have three classes: positive, neutral, and negative. Finally, the soft-max layer computes the final probability of the input sentence being classified as one of the sentiments.
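The five layers described above can be sketched as a single forward pass in plain Python. This is a toy under stated assumptions, not the group's implementation: the vocabulary, the dimensions, the random untrained weights, the ReLU activation, and the use of max pooling are all illustrative choices that the interview does not specify.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible

EMBED_DIM = 4    # illustrative; real word vectors are much larger
WINDOW = 2       # number of consecutive words each filter covers
NUM_FILTERS = 3  # assumption: several filters, one pooled feature each
CLASSES = ["positive", "neutral", "negative"]


def rand_vector(n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]


# Hypothetical toy vocabulary; "<unk>" stands in for unseen words.
VOCAB = ["i", "love", "hate", "this", "class", "<unk>"]
embedding = {word: rand_vector(EMBED_DIM) for word in VOCAB}
filters = [rand_vector(WINDOW * EMBED_DIM) for _ in range(NUM_FILTERS)]
fc_weights = [rand_vector(NUM_FILTERS) for _ in CLASSES]
fc_bias = [0.0] * len(CLASSES)


def embed(tokens):
    """Word-embedding layer: map each word to a vector (sentence -> matrix)."""
    return [embedding.get(t, embedding["<unk>"]) for t in tokens]


def convolve(matrix, filt):
    """Convolutional layer: slide one filter over windows of adjacent words."""
    features = []
    for i in range(len(matrix) - WINDOW + 1):
        window = [x for row in matrix[i:i + WINDOW] for x in row]
        activation = sum(w * x for w, x in zip(filt, window))
        features.append(max(0.0, activation))  # ReLU (assumed activation)
    return features


def softmax(scores):
    """Soft-max layer: turn class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(sentence):
    tokens = sentence.lower().split()
    tokens += ["<unk>"] * max(0, WINDOW - len(tokens))  # pad short inputs
    matrix = embed(tokens)
    # Pooling layer: keep the strongest response of each filter.
    pooled = [max(convolve(matrix, f)) for f in filters]
    # Fully-connected layer: map pooled features to one score per class.
    scores = [sum(w * p for w, p in zip(ws, pooled)) + b
              for ws, b in zip(fc_weights, fc_bias)]
    return dict(zip(CLASSES, softmax(scores)))


probs = predict("i love this class")
print(probs)
```

Because the weights here are random rather than learned, the printed probabilities are meaningless as predictions; the sketch only shows how data flows from words to a three-way probability distribution. In a trained model, the embedding vectors, filters, and fully-connected weights would all be fitted to labeled tweets.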

What other Twitter sentiment models exist? How do UCIPT’s models compare to theirs? Why should people have confidence in them?

Our model is based on a paper by Yoon Kim at New York University. The model was also used in a Twitter sentiment analysis task by Severyn and Moschitti at the 2015 International Workshop on Semantic Evaluation. There are other approaches that use standard machine learning algorithms with handcrafted features. However, those models require laborious effort to come up with features and extract those features. In contrast, our model extracts features automatically and shows better performance in general.

What are the greatest challenges that you face as a data analyst?

The greatest challenge to me as a data analyst is distinguishing the tasks that are achievable from those that are not, given the limited resources I have access to. Of course, coming up with the right algorithm and model, and refining the model to perform better, is always challenging. But setting the right direction at the beginning of the process is more difficult and seems to have a bigger impact on the final outcome of the project. I believe that ability comes with experience and intuition, and it is definitely one of the things that make a good data analyst.
