The Opportunities and Challenges of Working with Big Data


A Q&A with Wenchao Yu, Research Assistant and PhD Candidate

What is your process of creating software to analyze tweets?

The typical tweet analysis process includes four steps: data collection, data modeling, algorithm design, and performance evaluation. The very first step is to collect the data from Twitter. We use the free REST APIs provided by Twitter. However, this approach is rate-limited per user and per application, so if you need to pull a large number of tweets, say millions, the Topsy APIs are a better choice. The tweet data then needs to be modeled according to the task at hand, for example, detecting events from Twitter data. One possible approach is to model the data as a graph in which the nodes are words/phrases extracted from tweets and the edge weights between nodes are quantified by a similarity measure. The step after data modeling is to design and analyze the algorithms. In the event detection example, we may simply deploy existing community detection algorithms to find word/phrase clusters, which serve as a summary of events, or we can propose a new community detection algorithm by designing a new metric to measure the closeness of nodes. The last step is performance evaluation. We can choose from different evaluation metrics such as root-mean-square error (RMSE), area under the curve (AUC), Pearson product-moment correlation, and F1 score. In this example, we may have ground-truth event labels for the tweets in a specific time frame, so we can simply use the F1 score, which considers both precision and recall.
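As a minimal sketch of steps two through four, one could build a word co-occurrence graph and run an off-the-shelf community detection algorithm; the tokenizer, the co-occurrence weighting used as the similarity measure, and the ground-truth labels below are simplified placeholders rather than the exact pipeline described above.

```python
# Sketch: model tweets as a word graph, detect communities as candidate
# "events", and score the result with F1. Tokenizer, similarity weighting,
# and labels are hypothetical placeholders.
import itertools
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import f1_score

tweets = [
    "earthquake hits downtown area",
    "strong earthquake downtown tonight",
    "new phone release event tomorrow",
]

# Data modeling: nodes are words, edge weights count co-occurrence within tweets.
graph = nx.Graph()
for tweet in tweets:
    words = set(tweet.lower().split())                # naive tokenizer (placeholder)
    for w1, w2 in itertools.combinations(sorted(words), 2):
        weight = graph.get_edge_data(w1, w2, {}).get("weight", 0) + 1
        graph.add_edge(w1, w2, weight=weight)

# Algorithm: off-the-shelf community detection; each community summarizes an event.
communities = greedy_modularity_communities(graph, weight="weight")
for i, community in enumerate(communities):
    print(f"event candidate {i}: {sorted(community)}")

# Evaluation: with ground-truth event labels per tweet (hypothetical here),
# precision and recall are combined into an F1 score.
y_true = [1, 1, 0]   # hypothetical labels: 1 = "earthquake" event
y_pred = [1, 1, 0]   # predicted labels from assigning tweets to communities
print("F1:", f1_score(y_true, y_pred))
```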

What is your process in formulating/validating a hypothesis related to social media data?

I’ll take crime rate prediction using tweets as an example. First we develop a research question: for example, can we use tweets to predict crime rates at the state/county level? Then we collect the relevant datasets: crime data and tweets. At the same time, the question needs to be progressively refined (i.e., what kind of tweets can be used in the prediction?). One study showed that “low economic status, ethnic heterogeneity, residential mobility, and family disruption lead to community social disorganization, which, in turn, increases crime delinquency rates.” Thus a possible solution is to create a tweet filter that selects tweets related to low economic status, ethnic heterogeneity, and so on. A simple way to perform crime rate prediction is to study the correlation between crime incidents and Twitter posts. A strong association between tweets and crime incidents indicates the potential predictive power of social media data.
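A toy sketch of that correlation check might compare monthly counts of filter-matched tweets with reported crime incidents for one county; the counts below are made-up illustrative numbers, not real data.

```python
# Sketch: correlate filtered tweet counts with crime incident counts.
# The monthly counts are hypothetical illustration values.
from scipy.stats import pearsonr

filtered_tweet_counts = [120, 95, 140, 180, 160, 210]  # tweets matching the filter
crime_incident_counts = [34, 28, 40, 52, 47, 60]        # reported incidents

r, p_value = pearsonr(filtered_tweet_counts, crime_incident_counts)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3f}")
# A strong, significant correlation would suggest predictive potential,
# though it does not by itself establish a predictive model.
```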

How can you adapt your analysis tools to understand slang?

Detecting and understanding slang words is very hard due to the lack of labeled tweets. Twitter users also create new slang words in several ways, which makes detection even harder. One possible approach to detecting slang words is to start with a slang word database (e.g., a drug slang dictionary). One can then deploy a frequent pattern mining method to find “patterns” (words) in tweets that co-occur with the seed slang words from the database. Here each tweet is viewed as a transaction, and each word as an item. Finally, the detected slang word candidates are verified with the help of a domain expert.
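A rough sketch of this transaction/item view is below: each tweet is a transaction, each word an item, and words that frequently co-occur with seed slang terms become candidates for expert review. The seed list and support threshold are hypothetical, and a real pipeline would use a curated slang dictionary and a proper frequent-pattern miner (e.g., Apriori or FP-growth) rather than simple counting.

```python
# Sketch: surface words that frequently co-occur with seed slang terms.
# Seed terms, tweets, and threshold are hypothetical placeholders.
from collections import Counter

seed_slang = {"molly", "kush"}            # hypothetical seed slang terms
tweets = [
    "feeling great after that molly last night",
    "molly and good vibes all night",
    "picked up some kush for the weekend",
]

candidate_counts = Counter()
for tweet in tweets:
    transaction = set(tweet.lower().split())   # one transaction per tweet
    if transaction & seed_slang:                # tweet contains a seed term
        for word in transaction - seed_slang:
            candidate_counts[word] += 1

# Words co-occurring with seed terms in enough tweets become slang candidates
# to be verified by a domain expert.
min_count = 2                                   # hypothetical support threshold
candidates = [w for w, c in candidate_counts.items() if c >= min_count]
print(candidates)
```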

When you produce software with your publications, you make the software open source. What are the benefits of this approach?

Open source gives people other than the author the right to read, learn from, and improve a project. Formal open source software usually has an associated license that helps manage contributions and version releases. By publishing research work under an open source license, for example, other researchers interested in the topic can help validate the conclusions, contribute to the work with proper attribution, and deepen the influence of the research. I’ve been a beneficiary of this approach: I can use the code published by other authors as a baseline for my own research papers, which saves time and helps ensure the fairness of the comparison.

What is the greatest challenge you have ever faced as a computer scientist?

I would say the greatest challenge is dealing with “big data.” Data sets are growing rapidly and show no signs of slowing down. Therefore all proposed models have to be scalable, which means they need to perform analyses and make decisions quickly given the large scale of the data. The scalability of an algorithm can be evaluated by its time complexity, and it’s now very common to include a computational complexity analysis section for a proposed algorithm in a research paper. Another choice is to implement algorithms on cluster computing systems such as Spark and Hadoop, which can handle large volumes of structured and unstructured data efficiently on commodity hardware.
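As a brief sketch of the scale-out option, the PySpark job below counts hashtag frequencies across a large tweet collection in parallel; the input path and cluster configuration are placeholders, not a description of any specific deployment.

```python
# Sketch: distributed hashtag counting with PySpark.
# The HDFS path is a hypothetical placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hashtag-counts").getOrCreate()

# One tweet per line in a (hypothetical) text dump; Spark splits the work
# across the cluster's executors.
tweets = spark.sparkContext.textFile("hdfs:///data/tweets/*.txt")

hashtag_counts = (
    tweets.flatMap(lambda line: line.lower().split())
          .filter(lambda token: token.startswith("#"))
          .map(lambda tag: (tag, 1))
          .reduceByKey(lambda a, b: a + b)
)

# Print the ten most frequent hashtags.
for tag, count in hashtag_counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(tag, count)

spark.stop()
```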
