recvProjDetailInfo({"projects":[{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cl300","m4uni":"","pid":"201412-89","m2uni":"","timestring":"Mon Dec 8 14:55:01 2014","m4fname":"","language":"Assembly","m3lname":"","dataset":"Wonderful Dataset","m1lname":"Lin","industry":"","analytics":"Human Brain","m2fname":"","description":"Great project","m1fname":"Ching-Yung","projectname":"Final Project Upload Test","m3fname":""},{"m2lname":"Ellis","m4lname":"","m3uni":"","m1uni":"bwj2105","m4uni":"","pid":"201412-7","m2uni":"jge2105","timestring":"Mon Dec 8 16:08:42 2014","m4fname":"","language":"Python, C++, MATLAB, Shell","m3lname":"","dataset":"We collected our own dataset mined from IMDB. Our software can support any dataset consisting of visual media with concept/keyword annotations along with emotion/affective semantic labels.","m1lname":"Jou","industry":"","analytics":"Web crawling, audio feature extraction, SIFT, BoW, VLAD, Fisher encoding, dense trajectories, color histograms, LBP, face detection, face expression recognition, saliency, object detection, deep learning, Sentibank, SVM, multiple instance learning, logistic regression","m2fname":"Joseph","description":"We seek to predict concepts in movie trailers to understand how affect, or emotion, is elicited in cinematic audiovisual media. We collect our own data from IMDB along with keywords and annotations. We apply a number of content-based audiovisual features along with state-of-the-art machine learning to detect these concepts toward affect prediction.","m1fname":"Brendan","projectname":"Affective Computational Cinematography","m3fname":""},{"m2lname":"Oh","m4lname":"","m3uni":"gg2529","m1uni":"skk2142","m4uni":"","pid":"201412-5","m2uni":"mo2499","timestring":"Mon Dec 8 18:33:02 2014","m4fname":"","language":"Java, MacOS","m3lname":"Gupta","dataset":"We use artificial data with noise added to it. Our project can support any image data in csv format. 
","m1lname":"Keshri","industry":"","analytics":"Matrix multipllication, matrix inversion, sampling random variable algorithms from Mahout and Spark. ","m2fname":"Min-hwan","description":"Our objective is to reconstruct denoised image of a neuron. We use dictionary learning and sparse regression related ideas to solve the denoising problem. The toolkits involve dictionary learning and ADMM to solve the reconstruction problems. In general, these toolkits can be used for binary dictionary learning for images. Also, the research is effort towards solving the brainbow problem. The data is very big and there is a need to have packages which can handle big data. ","m1fname":"Suraj","projectname":"Brainbow: Reconstruction of Neurons","m3fname":"Gaurav"},{"m2lname":"Zhu","m4lname":"","m3uni":"","m1uni":"to2232","m4uni":"","pid":"201412-4","m2uni":"kz2232","timestring":"Mon Dec 8 19:44:02 2014","m4fname":"","language":"Java, PHP, R, Pig, Javascript","m3lname":"","dataset":"We used the genomic and the clinical profiles from the 3,316 patients in TCGA data set. Since we transformed the data set into the common CSV format, our programs can take any RNAseq/microarrary-based genomic data if needed.","m1lname":"Ou Yang","industry":"","analytics":"1. Map-Reduce Concordance Index Algorithm
We created the map-reduce concordance index function by mapping all possible combinations of the predictions into pairs and checking whether each pair is valid and concordant with the response. In the reduce step, we collect the valid and concordant pairs.

We reduced the input size by 66% by hashing the patient's status and survival time into a single floating-point number and then sorting the hashed responses in order of prediction. As a result, we transmit only an array of floating-point numbers instead of a matrix.

2. Patient-Based Treatment Recommendation Engine
We created our own treatment recommendation engine in three steps. First, compute the similarity between each patient in the data set and the profile uploaded by the user. Second, sort the most similar profiles by expected survival time. Finally, return the treatment plan received by the patient with the longest expected survival.

The algorithm was validated using LOOCV over the 3,316 cancer patients, which yielded a true positive rate of 87.24%.

3. D3.js-based visualization for the Bayesian networks
We created the Bayesian network of the 30 genes using the bnlearn package in R, then visualized it using D3.js on our website. The subnetworks were validated using a publicly available gene ontology analysis tool.

4. PHP-Based Interactive Visualization
We created an interactive facility using PHP and R. Because we created the REST API for our concordance index function and the recommendation engine, we demonstrate them as a web service. ","m2fname":"Kaiyi","description":"The objective of our project is to analyze the massive cancer genome profiles to create the model for diagnosis and treatment suggestion. Because of the feature number and the size of the samples, the analysis could only be done with the Big Data Analytics tools.
In our project, we implemented the similarity functions used in biomedical research. Our innovations are:

1. We created the map-reduce concordance index function and we successfully reduced the input size by more than 50%.
2. We identified 30 genes that are related to patient outcomes using the toolkit and visualized their network. We also validated that they represent the cancer hallmarks.
3. We created a patient-based treatment recommendation engine using R, which returns the analysis and suggestion in one second with the plot.
4. We created the web API and the facility of the recommendation engine and the concordance index function.

This project is important because no diagnosis and treatment planning tools had been created on a Pan-Cancer basis, and the tools we provide had not previously been implemented on Hadoop platforms. The project may not only shed light on the molecular pathologies of complicated malignancies and on potential therapeutic regimens, but also provide a tool for medical professionals to save lives.
","m1fname":"Tai-Hsien","projectname":"Network Analysis on the Big Cancer Genome Data","m3fname":""},{"m2lname":"Hsu","m4lname":"","m3uni":"","m1uni":"hy2368","m4uni":"","pid":"201412-21","m2uni":"mh3346","timestring":"Mon Dec 8 20:17:07 2014","m4fname":"","language":"Python, MongoDB, Heroku","m3lname":"","dataset":"Using our own web crawling program to fetch information and articles from major news and magazine websites","m1lname":"Yin","industry":"","analytics":"Apache Pig: for single word count
Apache Mahout: for word clustering
Python NLTK: for collocation analysis and pre-processing","m2fname":"Meng-Yi","description":"Lack of interesting ideas? This is the app for all you writers to discover new topics to write about that interest everybody. From sports to movies, from politics to travel destinations, and from high fashion to street styles, you will find something you want to open your laptop and write about right away.
TrendyWriter finds the hit list of topics you want to write about every day in one place. Whether you are a professional writer, a student, or just someone who wants to take a break from everyday life by writing about interesting topics, you will always discover something new in TrendyWriter.
Who should use TrendyWriter?
- Everyone! You can start with a topic you like and get creative from there.
Why TrendyWriter?
- Hundreds of topics to choose from every day
- Easily find topics you want by tap and swipe
- Personalize, add topics you like and save for later
- Get topics from music, movie, travel destination, food, fashion, politics, science, sports and so much more!","m1fname":"Hang","projectname":"TrendyWriter","m3fname":""},{"m2lname":"Enkebol","m4lname":"","m3uni":"","m1uni":"dj2374","m4uni":"","pid":"201412-32","m2uni":"ale2124","timestring":"Mon Dec 8 20:30:11 2014","m4fname":"","language":"Heroku, Flask(Python), PostgreSQL, HTML/Javascript, AWS s3, Git. Neo4j is on the roadmap","m3lname":"","dataset":"PlayPalate uses the following data sources:

Facebook API: user music listening history. This data is rendered by connected streaming music apps such as Spotify, Rdio, iHeartRadio, Pandora, and Shazam.

Rovi API: artist biographies. I have created a program that updates an S3 bucket with artist biographies if I don't already have the text file downloaded from this API. I have about 2,500 biographies, which amounts to 20 MB of data so far. ","m1lname":"Jones","industry":"","analytics":"Python NLTK & sklearn for TFIDF & Cosine similarity
","m2fname":"Andy","description":"The objective of PlayPalate is to generate personalized spotify playlists for an end user based on their music listening history. Recommendations are based on similarities in artist biographies computed via NLP feature extraction.

The product and toolkit are important to users who want to find new music and listen to personalized playlists. ","m1fname":"Devin","projectname":"PlayPalate","m3fname":""},{"m2lname":"Gangopadhyay","m4lname":"Maharishi","m3uni":"akn2114","m1uni":"anm2147","m4uni":"em2852","pid":"201412-8","m2uni":"ag3202","timestring":"Mon Dec 8 21:18:24 2014","m4fname":"Esha","language":"MongoDB, Mongo MapReduce, Node.js, Amazon EC2, AWS Elastic Beanstalk, Objective-C","m3lname":"Naganath","dataset":"Our app uses crowdsourced data to generate the best path from a starting location to an end location. We generated simulated data based on pre-populated beginning and end coordinates for our paths, which we stored in the database ","m1lname":"Mishra","industry":"","analytics":"
Comparing two points
• Compute the Euclidean distance between the two points
• Use an epsilon radius to determine if the two points are essentially the same
• Latitudes and longitudes in a local area differ by ~10^-4. We thus set our epsilon cutoff to be an order of magnitude higher, i.e. 10^-3
Comparing two paths
• Paths are represented as a “bag of points”, where each point is a latitude-longitude coordinate
• Use the Jaccard index (Tanimoto similarity metric) to determine similarity between paths
• Paths should be considered the same if they have many shared points.
• We thus use a Tau cutoff of 0.98 for the determination of similarity of two paths
Path clustering algorithm
• Algorithm loosely derived from k-means clustering
• First, bucket paths based on the Tanimoto similarity metric
• Then, denote the largest bucket as the highest-voted path
E6893 Big Data Analytics – Final Project Presentation
© 2014 CY Lin, Columbia University
Big Data Algorithms
MapReduce algorithm
• Documents are stored in JSON in MongoDB in the form {
start : [ latitude, longitude ],
end : [ latitude, longitude ],
points : [ [lat,long], [lat,long] ... ]
}
• The map() function emits a concatenation of truncated start and end latitudes and longitudes as the key, and points array as the value
• The reduce() function computes clusters on the sets of points and outputs a random set (all are essentially equal) from the largest cluster
• MapReduce is useful to parallelize our computation of the buckets and implement scalability within our system
Determining a path from CoreLocation data
• Goal was to transform a continuous sequence of points into discrete paths
• Implemented a buffering mechanism based on whether the user was static or moving to delimit the start and end of a given path

For the system modules we used a Node.js server and MongoDB to store the paths. AWS Beanstalk/EC2 is used for cloud deployment and load balancing. For the visualization we used iOS8 SDK. ","m2fname":"Anirban","description":"GoogleMaps, Waze, and other path-recommendation services use a limited and pre-defined set of attributes to determine the worth of a path
We wish to utilize the very rich knowledge of individuals in their known environments to recommend paths
We do not measure any attributes directly, but rather assume that a user taking a path is “voting” for that path as holistically better than any other
By dynamically updating our “knowledge” of an environment through end users’ choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales

What did we do?
• Replace traditional centrally-orchestrated route suggestion methods (GoogleMaps, Waze) with crowdsourced route suggestions
• Our app considers a person taking a particular route as a “vote” for that route, and then suggests to users the route with the most “votes”
Rationale
• Locals in an area 1) account for the majority of routes taken and 2) know the area best
• Therefore, the highest number of votes should go to the route voted best by locals
What makes it unique?
• Crowdsourced knowledge of environments provides detailed and nuanced information
• Human senses and perception are far more holistic than any specific metrics
• Crowdsourced data provides real-time updates on environment status
• Examples: construction on a road, a fallen tree, snow/ice that has not yet been shoveled
","m1fname":"Abhinav","projectname":"People Maps","m3fname":"Aditya"},{"m2lname":"CHENG","m4lname":"LI","m3uni":"zz2283","m1uni":"cz2321","m4uni":"ll2871","pid":"201412-27","m2uni":"jc4215","timestring":"Mon Dec 8 22:06:03 2014","m4fname":"LINGXUE","language":"Languages: Java, R Platforms: Amazon Elastic MapReduce, S3, Hadoop, Rstudio ","m3lname":"ZHU","dataset":"Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone.

The most current crawl data sets include three different types of files: Raw Content, Text Only, and Metadata. The data sets from before 2012 contain only Raw Content files.","m1lname":"ZHAO","industry":"","analytics":"Map-Reduce
Hadoop
Text clustering algorithms

","m2fname":"JIAHUI","description":"In this project, we want to analyze web crawl data to make comparison of different telecoms operators, clustering their features, and recommend users different carriers regarding their service, data package,internet speed, etc.
We'd like to build two different web applications: a textual interface and a graphical interface.

Our project will benefit different kinds of people, e.g. users who want to know information about different carriers, journalists who want to get the latest news on different carriers, and carriers who want to improve their services.

","m1fname":"CHENYUN","projectname":"Comparison Analysis of Different Telecoms Operators","m3fname":"ZHENYING"},{"m2lname":"Singh","m4lname":"Madala","m3uni":"mm3557","m1uni":"ss4609","m4uni":"rbm2150","pid":"201412-6","m2uni":"ms4826","timestring":"Mon Dec 8 22:09:11 2014","m4fname":"Rajesh","language":"Python, AWS, MySql, MongoDB, NLP","m3lname":"Misra","dataset":"Twitter Data
Collected through a Twitter account using the Tweepy streaming API.","m1lname":"Shrivastava","industry":"","analytics":"Google Corpus, Tweepy, NLP, PyPlot, Stream, Json, Auth","m2fname":"Mandeep","description":"Objective:

To generate stock signals from real-time news analysis. We use news posted on Twitter so that there is minimal lag, and we have developed a scoring algorithm that summarizes the relevant news on an hourly basis and generates a real-time signal that could be used to make trading decisions.

Innovation - Bespoke Scoring Algorithm:

1. Creating training data for Scoring
a. Construct training data using a set of filtered tweets
b. Manually assign ratings from -2 to +2
i. “Apple preparing for another potential blockbuster debt sale” (Score +2)
ii. “Apple stock takes a big hit, dragging U.S. markets with it” (Score -2)
iii. “iPhone 6: T-Mobile Brings Exciting Cyber Monday Deals” (Score 0)
2. Developing, Testing & Refining Scoring Model
a. Word corpus - word scores, pairs of words, placement of words
b. Source - relevance, geography, number of followers, number of retweets
3. Scoring output
a. Each tweet assigned a score +2, +1, 0, -1, -2 based on range of score
b. Generate score every hour (normalized based on no. of tweets in that hour)
c. Plot hourly summarized scores

Importance:

Apart from creating a methodology we have also proposed an enterprise framework which could be leveraged to generate signals from real time news for any purpose and not just a particular stock and at any scale (from small volume of data to very large volumes).","m1fname":"Shreyas","projectname":"Stock signal generation using real time news analysis","m3fname":"Mayank"},{"m2lname":"Wang ","m4lname":"Yi","m3uni":"rr2950","m1uni":"lj2351","m4uni":"yj2306","pid":"201412-10","m2uni":"yw2586","timestring":"Mon Dec 8 22:31:29 2014","m4fname":"Jiang","language":"Python,R,Mahout on Hadoop","m3lname":"Ran","dataset":"Five relational datasets about project information: donations, outcomes, resources, essays, and projects.
","m1lname":"Jin","industry":"","analytics":"Text Clustering: Kmeans, Canopy;
Classification: Naïve Bayes, SGD.
","m2fname":"Yuezhi","description":"The goal of this project is to help DonorsChoose.org identify projects that are exceptionally exciting to the business, at the time of posting. While all projects on the fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, they will improve funding outcomes, better the user experience, and help more students receive the materials they need to learn.
","m1fname":"Lina","projectname":"Predicting Excitement at donorschoose.org","m3fname":"Ran"},{"m2lname":"Wang","m4lname":"Wang","m3uni":"zg2203","m1uni":"sz2476","m4uni":"xw2344","pid":"201412-23","m2uni":"ww2373","timestring":"Mon Dec 8 22:31:56 2014","m4fname":"Xuebo ","language":"Java, Maven, Eclipse","m3lname":"Gong","dataset":"Consumer Expenditure Survey, 2011: Interview Survey and Detailed Expenditure Files (ICPSR 34441)
From the website http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34441?q=34441&searchSource=revise","m1lname":"Zhang","industry":"","analytics":"K-means clustering, map reduce","m2fname":"Wenxin","description":"The project we are trying to do is mainly for B2C websites. By taking the raw data of consumer expenditure, try to analyze the data and find the change and trend of purchase of customers. Together with the basic information of the market, try to show the most likely future growth of markets and give detail suggestions. Moreover, if the market is good enough for a start-up retail industry, we also like to give our own suggestions.","m1fname":"Siyuan","projectname":" Market strategy suggestions for B2C websites","m3fname":"Zixuan"},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"rw2526","m4uni":"","pid":"201412-26","m2uni":"hz2308","timestring":"Mon Dec 8 22:32:57 2014","m4fname":"","language":"Python/Java/Matlab","m3lname":"","dataset":"ICML 2004 Physiological Dataset.
The dataset is obtained from the data access in Purdue University.
","m1lname":"Wang","industry":"","analytics":"We are trying to combine Linear Chain Conditional Random Fields with EM steps (L-CRF-EM Algorithm) and Ultra-Fast Forest Tree (UFFT Algorithm) together, via which could efficiently predict human activity with high accuracy.","m2fname":"Hongzhuo","description":"Objective: Human behavior recognition with physiological data from sensors

Innovation: Implemented streaming data updates and modified existing CRF algorithm to give user-specified performance and reliability.

Capabilities: Higher accuracy than most algorithms with the same dataset size; the training phase can be sped up by modifying the size of the data maintained; takes concept drift and the time-variant nature of human physiological indicators into consideration.

Importance: Instead of regarding the problem merely as a multi-dimension machine learning problem, we took a more practical approach, and treated our implementation as a potential commercial application. With our modifications, the algorithm can be applied to various users in real time, with fast performance, user-specified setup and self-updating features. ","m1fname":"Ruoyu","projectname":"Reversal Prediction from Physiology Data","m3fname":""},{"m2lname":"Arkilic","m4lname":"Liu","m3uni":"tz2237","m1uni":"ah3209","m4uni":"yl3199","pid":"201412-14","m2uni":"aa3438","timestring":"Mon Dec 8 22:38:19 2014","m4fname":"Yuheng","language":"Hadoop, Mahout, Python(Pydoop, Scikit, etc)","m3lname":"Zhang","dataset":"Stock price data from:
1. Yahoo Finance
2. Google Finance
3. S&P 500 free data","m1lname":"Hong","industry":"","analytics":"1. SGD classification and prediction algorithm in Mahout
2. Mapreduce with Pydoop
3. Scikit-learn in Python
4. Mapreduce performance evaluating tools","m2fname":"Arman","description":"Objective:
1. Predicting stock price movements is essential for portfolio risk management and is at the core of any trading model
2. Stock tick data is available to the public with various time ranges and frequencies
3. Output is straightforward and easy to evaluate
4. Combination of Big Data Analytics and stock market is urgent and crucial

Reasons to use Pydoop + Scikit:
1. Hadoop can provide a Python API but doesn’t support C/C++ wrapped Python libraries
2. Pydoop tackles this by wrapping Hadoop C++ pipes (Boost.Python) and libhdfs
3. Pydoop provides both HDFS access and MapReduce tasks with pure Python code (no Jython)
4. Better than using the stdin/stdout utilities within Hadoop that are common to all languages, given that one might want to explore the data in HDFS and/or submit large chunks as part of the YARN task
5. Scikit-learn provides simple tools for data mining and analysis
6. Scikit-learn provides a Stochastic Gradient Descent approach to fit linear models, similar to Mahout

Reasons to use mahout SGD prediction:
SGD is an efficient classification method in Mahout, we use our own code to use SGD in stock price movement prediction.","m1fname":"Ao","projectname":"Stock price Movement Prediction with Hadoop+Mahout & Pydoop+Scikit","m3fname":"Tian"},{"m2lname":"Huang","m4lname":"Xu","m3uni":"fy2188","m1uni":"wc2467","m4uni":"mx2151","pid":"201412-18","m2uni":"lh2647","timestring":"Mon Dec 8 22:41:48 2014","m4fname":"Mingrui","language":"HTML, Javascript, PHP, Python,Java","m3lname":"Ye","dataset":"We use a relatively big dataset that contains more than 240,000 rows of job information. Each row contains the job's id, title, description, location, normalized location, jobType, time, company, category, salary, salaryNorm and sourceName.
And we use a mysql database on Amazon AWS RDS to save the dataset. Also we can put new postings into the database to make the update if we get more data. The original data is from Kaggle.com. ","m1lname":"Cao","industry":"","analytics":"Item-based and User-based Recommendation.
Random Forest;
Support Vector Machine;
Linear Regression;
KNN;","m2fname":"Lin","description":"Our project is a very helpful web application and tool for both jobseekers and employers. We want to help employers figure out the market worth of different positions by building a prediction engine for the salary of a potential job position.
For jobseekers, they can use the recommendation system of our website to search for positions that match their backgrounds and career expectations. And all the job data can be accessed by using the search engine of application. As graduating college students, we deeply feel that the job information is very important. So we provide a good platform for people we want help from the job data analysis.","m1fname":"Wei","projectname":"Salary Engine","m3fname":"Fan"},{"m2lname":"Huang","m4lname":"Xu","m3uni":"fy2188","m1uni":"wc2467","m4uni":"mx2151","pid":"201412-18","m2uni":"lh2647","timestring":"Mon Dec 8 22:47:52 2014","m4fname":"Mingrui","language":"HTML, Javascript, PHP, Python, Java ","m3lname":"Ye","dataset":"We use a relatively big dataset that contains more than 240,000 rows of job information. Each row contains the job's id, title, description, location, normalized location, jobType, time, company, category, salary, salaryNorm and sourceName.
And we use a mysql database on Amazon AWS RDS to save the dataset. Also we can put new postings into the database to make the update if we get more data. The original data is from Kaggle.com. ","m1lname":"Cao","industry":"","analytics":"Item-based and User-based Recommendation.
Random Forest;
Support Vector Machine;
Linear Regression;
KNN.","m2fname":"Lin","description":"Our project is a very helpful web application and tool for both jobseekers and employers. We want to help employers figure out the market worth of different positions by building a prediction engine for the salary of a potential job position.
For jobseekers, they can use the recommendation system of our website to search for positions that match their backgrounds and career expectations. And all the job data can be accessed by using the search engine of application. As graduating college students, we deeply feel that the job information is very important. So we provide a good platform for people we want help from the job data analysis.","m1fname":"Wei","projectname":"Salary Engine","m3fname":"Fan"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"zh2210","m4uni":"","pid":"201412-13","m2uni":"","timestring":"Mon Dec 8 22:49:28 2014","m4fname":"","language":"Python Matlab R","m3lname":"","dataset":"Experimental result data from CMU.
Using for studying brain activity.","m1lname":"Hu","industry":"","analytics":"Gaussian Bayes
Bayes Network
SVM","m2fname":"","description":"We want to use machine learning methods to classify the cognitive state of a human subject based on fMRI data over a single time interval.

","m1fname":"Zhao","projectname":"Learning Brain Activity From fMRI Images","m3fname":""},{"m2lname":"Paine","m4lname":"","m3uni":"","m1uni":"ma2799","m4uni":"","pid":"201412-19","m2uni":"tkp2108","timestring":"Mon Dec 8 22:52:51 2014","m4fname":"","language":"Java, R, Python, Mahout","m3lname":"","dataset":"This tool uses Forex data dumps that simply list conversion pairs. It can also work with data streams that transmit the same information.","m1lname":"Aligbe","industry":"","analytics":"The visualizations are currently limited to line and bar graphs, but trend similarity visualization (by way of showing which conversions are currently displaying similar trends) is planned.

The analytics suite consists of simple statistics like averages and variances, and more complex techniques such as classification and clustering of conversion rates over various date ranges (trends). Additionally, we provide recommendations for quantity and duration of investments, as well as which markets to invest in.","m2fname":"Timothy","description":"The Forex Trend Analyzer seeks to make a convenient library for Forex market analysis. By providing a simple and extensible interface, our tool makes it easy to integrate Forex market analysis into any desired application. Other tools focus on providing an entire trading application, but fail at providing a full suite of analytics.","m1fname":"Mark","projectname":"Forex Trend Analyzer","m3fname":""},{"m2lname":"Liu","m4lname":"Tan","m3uni":"lq2145","m1uni":"jl4174","m4uni":"rt2521","pid":"201412-25","m2uni":"jl4143","timestring":"Mon Dec 8 22:55:50 2014","m4fname":"Ruixin","language":"Java, Python","m3lname":"Qiu","dataset":"Numerical Data: Stock daily prices, Nasdaq index, S&P 500 index, Treasury Yield 5 Years from Yahoo Finance
Text: Financial news data from Business Insights: Essentials.","m1lname":"Li","industry":"","analytics":"Natural Language Processing, Lasso Regression, Time Series, Map Reduce","m2fname":"Jie","description":"Traditional predictors used in predicting stock prices are always numerical. It is well known that news items have a significant impact on stock indices and prices. To make a better prediction, we combine quantitative methods with NLP feature analysis in our model to predict daily stock prices.

","m1fname":"Jingnan","projectname":"Stock Price Prediction Based on News","m3fname":"Lu"},{"m2lname":"Lin","m4lname":"Xu","m3uni":"yl3284","m1uni":"wz2270","m4uni":"zx2152","pid":"201412-16","m2uni":"sl3833","timestring":"Mon Dec 8 22:55:54 2014","m4fname":"Zhefeng","language":"Java, mahout, Matlab, Tableau, Pig-Latin ,Eclipse,R Platforms: Mac OSX","m3lname":"Lin","dataset":"citi bike trip histories in system data website: http://www.citibikenyc.com/system-data","m1lname":"Zhang","industry":"","analytics":"Algorithms: K-means Clustering, Recommendation, Time Series Analysis

Visualisation: Tableau and R","m2fname":"Sun-Yi ","description":"
Citi Bike is an innovative bike-sharing system set up in recent years. It provides a simple, convenient, and eco-friendly way for New Yorkers and visitors to travel around the city.

Looking at this system, we face the following questions:

Where do Citi Bikers ride?
When do they ride?
How far do they go?
Which stations are most popular?
What days of the week are most rides taken on?

Thus, we want to carry out a full evaluation of the system.
","m1fname":"Wenxuan","projectname":"Citi Bike System Data Analysis","m3fname":"Yen-Hsi "},{"m2lname":"Grossmann","m4lname":"Liu","m3uni":"jx2238","m1uni":"yz2575","m4uni":"bl2547","pid":"201412-12","m2uni":"jg3538","timestring":"Mon Dec 8 22:59:08 2014","m4fname":"Boren","language":"Java, Hadoop, Mahout, ","m3lname":"Xu","dataset":"Million Musical Tweet Dataset","m1lname":"Zou","industry":"","analytics":"mahout recommendation algorithms (user-
based, item-based with different similarity measurement),
geographic averaging

K-means Clustering for geographic information","m2fname":"John","description":"The rise of portable mp3 players and downloaded music has resulted in music recommendation becoming a larger part of major e-commerce and massively used applications (iTunes, Amazon). With the widespread use of social media sites, it is possible to efficiently mine user contextual data along with the music preferences of a vast and diverse population of people.
This new data renders old music recommendation algorithms, based solely on music content and preference, obsolete.

1. We want our toolkit to be easy for other developers to use in any kind of data processing application (sequential or map-reduce).

2. Many large data sets contain just boolean data instead of a rating field (e.g. tweet data, Amazon purchase history). We want our application to be well suited to that kind of data set.
","m1fname":"Yihan","projectname":"Music-Links","m3fname":"Jiaying"},{"m2lname":"Rajan","m4lname":"","m3uni":"","m1uni":"efj2106","m4uni":"","pid":"201412-3","m2uni":"asr2171","timestring":"Mon Dec 8 23:01:04 2014","m4fname":"","language":"R, Java, Hadoop, Mahout, Caffe","m3lname":"","dataset":"Please see presentation deck.","m1lname":"Johnson","industry":"","analytics":"Please see presentation deck.","m2fname":"Anand","description":"Leveraging the Yahoo! Labs Flickr dataset we plan to test and develop upon feature extraction methods utilizing a parallelized computing system to efficiently extract image characteristics.
Using these image characteristics, we will train and test classifiers on these images and evaluate them based on precision. Going beyond this step, we also plan to experiment with a GPU-powered processing system to evaluate the added benefits and performance benchmarks that might be had during the image analysis stage over a standard distributed system.
","m1fname":"ERic","projectname":" Image Classification in the Cloud and GPU (H-Classification & G-Classification)","m3fname":""},{"m2lname":"Liu","m4lname":"Tan","m3uni":"lq2145","m1uni":"jl4174","m4uni":"rt2521","pid":"201412-25","m2uni":"jl4143","timestring":"Mon Dec 8 23:01:10 2014","m4fname":"Ruixin","language":"Java, Python","m3lname":"Qiu","dataset":"Numerical Data: Stock daily prices, Nasdaq index, S&P 500 index, Treasure Yield 5 Years from Yahoo Fiance
Text: Financial news data from Business Insights: Essentials.","m1lname":"Li","industry":"","analytics":"Natural Language Processing, Lasso Regression, Time Series, Map Reduce","m2fname":"Jie","description":"Traditional predictors used in predicting stock prices are always numerical. It is well known that news items have a significant impact on stock indices and prices. To make a better prediction, we combine quantitative methods with NLP feature analysis in our model to predict daily stock prices.

","m1fname":"Jingnan","projectname":"Stock Price Prediction Based on News","m3fname":"Lu"},{"m2lname":"Ma","m4lname":"Chen","m3uni":"yl3249","m1uni":"zg2201","m4uni":"xc2291","pid":"201412-29","m2uni":"ym2491","timestring":"Mon Dec 8 23:03:48 2014","m4fname":"Xi","language":"Java, Pig, Mahout","m3lname":"Liu","dataset":" We get our datasets from Yahoo Labs:http://webscope.sandbox.yahoo.com/catalog.php. This dataset contains totally 7,462 users, 11,915 movies and 211,231 ratings. We will try to find more related datasets during our project.","m1lname":"Guo","industry":"","analytics":"We will use some clustering algorithms like KMeans to analyze movies relationships. Recommendation algorithms like User-based and Item-based would be used to recommend movies for other users.","m2fname":"Yunge","description":"Many patients especially those elders don't have enough medical knowledge and meeting a doctor always requires reservation and most patients have to wait a long time to see a doctor from public hospital.
With our recommendation system, patients can get every information online and it's both convenient and efficient to do so.","m1fname":"Zhiyuan","projectname":"Oscar Award Analysis based on Big Data","m3fname":"Yuxuan"},{"m2lname":"Grossmann","m4lname":"Liu","m3uni":"jx2238","m1uni":"yz2575","m4uni":"bl2547","pid":"201412-12","m2uni":"jg3538","timestring":"Mon Dec 8 23:06:40 2014","m4fname":"Boren","language":"Java, Hadoop, Mahout, ","m3lname":"Xu","dataset":"Million Musical Tweet Dataset (http://www.cp.jku.at/datasets/MMTD/)
Our software can support any data that is stored in the following format:
UserID, ItemID, Preference/Rating, Longitude, Latitude","m1lname":"Zou","industry":"","analytics":"mahout recommendation algorithms (user-
based, item-based with different similarity measurement),
geographic averaging

K-means Clustering for geographic information","m2fname":"John","description":"The rise of portable mp3 players and downloaded music has resulted in music recommendation becoming a larger part of major e-commerce and massively used applications (iTunes,
Amazon). With the widespread use of social media sites, it is possible to efficiently mine user contextual data along with the music preferences of a vast and diverse population of people.
This new data renders old music recommendation algorithms, based solely on music content and preference, obsolete.

1. We want our toolkit to be easy for other developers to use in any kind of data processing application (sequential or map-reduce).

2. Many large data sets contain only boolean data instead of a rating field (such as tweet data, Amazon purchase history, etc.). We want our application to work well for that kind of dataset.

3. Many new applications run on small-scale hardware (iPhones, iPads). We wanted to be able to utilize the variety and accuracy of large data sets, as well as cater to the performance requirements of all kinds of computational applications. By utilizing our geospatial clustering preprocess, we not only drastically cut down the computational time for creating recommendations, but we also add an extra degree of similarity in the similarity measurement, making the result more accurate.","m1fname":"Yihan","projectname":"Music-Links","m3fname":"Jiaying"},{"m2lname":"Zhang","m4lname":"Li","m3uni":"rg2930","m1uni":"rh2648","m4uni":"ml3695","pid":"201412-15","m2uni":"yz2698","timestring":"Mon Dec 8 23:14:09 2014","m4fname":"Mengge","language":"R, Java, Python","m3lname":"Gaur","dataset":"We use a Meetup.com web crawling data set that has:
# Users: 4,448,454 # Groups: 42,052
# Events: 1,595,833 # Tags: 77,810
# User-Group Pairs: 8,863,235 # User-Event Pairs: 13,553,134
# User-Tag Pairs: 15,057,535 # Group-Tag Pairs: 144,793
# Users with Locations: 3,741,699 # Events with Locations: 983,333
","m1lname":"Huang","industry":"","analytics":"Social Network Analysis, D3
Clustering Algorithms as described in this paper -
Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P1000","m2fname":"Yiwei","description":"The motivations for this project are as follows:
1. Unique value of event-based social network (EBSN)
Both online and offline social interactions
Analyze & compare properties and dynamics of the 2 networks
Commercial value: industrial trends, recommendation of services/ products based on user preference

2. Big Fan of Meetup.com
Popularity across academia, industry and recreation
Excellent API: user, group, event, tags – location & time

3. Great opportunity to apply big data techniques and tools
Graph database: Neo4j with Cypher
Clustering, Recommendation
Large scale social network analysis

","m1fname":"Rongyao","projectname":"Exploring the Online and Offline Social World","m3fname":"Rahul"},{"m2lname":"Jiang","m4lname":"","m3uni":"","m1uni":"cgs2135","m4uni":"","pid":"201412-17","m2uni":"yj2338","timestring":"Mon Dec 8 23:15:54 2014","m4fname":"","language":"Python, C++, JavaScript, Hadoop, Hive, OpenCV","m3lname":"","dataset":"Extracted imagery from Google StreetView (JavaScript used to call Google's API will be provided as a deliverable for the project.) This project can support any imagery but is designed for pictures of locations with metadata tags.","m1lname":"Stathis","industry":"","analytics":"- OpenCV was leveraged to use their implementation of ORB (Oriented FAST, Rotated BRIEF) image feature detecting algorithm, and for visualization
- Locality Sensitive Hashing was implemented in Python using various metrics, for efficiently storing and matching image features
- Hadoop + Hive is used for storing a database of images and their features in a scalable way that is suitable for parallel processing","m2fname":"Yongchen","description":"The goal of this project is to build a scalable image matching platform, especially for the case where a database of imagery tagged with GPS data is queried with an image that is not, and a suggestion of where that picture was taken is derived by matching against the database images.

Image matching is a hot subject of research and is being improved on every year. The most commonly used image matching algorithms today are quite accurate but are not fast. A naive pair-wise approach to matching imagery in a database becomes infeasible very quickly.

This project applies recent work in approximate nearest-neighbor algorithms and in binary image feature extraction, together with Hadoop's distributed data store capabilities, to provide a fast method for image matching against a database that can scale up to millions of images and beyond. ","m1fname":"Christopher","projectname":"Image Similarity and Matching for Localization","m3fname":""},{"m2lname":"Wang","m4lname":"Li","m3uni":"xz2350","m1uni":"jw3127","m4uni":"wl2501","pid":"201412-22","m2uni":"mw2969","timestring":"Mon Dec 8 23:24:13 2014","m4fname":"Wanding","language":"Mahout, Hadoop, Java, Java Script","m3lname":"Zhang","dataset":"--Forex
--WTI, Brent Oil Price index
--GDP","m1lname":"Wang","industry":"","analytics":"Regression models","m2fname":"Mengnan","description":"Our project is initially designed to provide users with the following contents.

Exchange rates for all world currencies
Forward exchange rates
Cross-exchange rates
Daily exchange rates back to the 1920s
Basic analysis of exchange rates
Latest news about exchange rates

The expected outcome of this project is to set up a website with the mentioned functions and information with searching access.

With international trade and commerce becoming more important, it is necessary not only for a specific group of industries but for all of us to have daily access to updated currency rates. With more convenient access to exchange rate information, which we hope our project will provide, people are equipped with more relevant knowledge to make better decisions regarding the currency market.","m1fname":"Jianze","projectname":"Exchange Rates Inquiry and Analysis","m3fname":"Xiaomeng"},{"m2lname":"Malhotra","m4lname":"","m3uni":"osk2106","m1uni":"dn2367","m4uni":"","pid":"201412-24","m2uni":"mm2625","timestring":"Mon Dec 8 23:44:34 2014","m4fname":"","language":"Languages: Python, Javascript","m3lname":"Kiyani","dataset":"We tested on the MTA's B63 dataset. It is a sample pull from 2011. In the future we will use our own dataset that we record from polling the MTA API.

The dataset can be found here: http://bustime.mta.info/wiki/Developers/ArchiveData","m1lname":"Nair","industry":"","analytics":"Pandas and Python were used to parse and analyze the bus records. They were visualized using matplotlib.

","m2fname":"Manav","description":"The MTA provides access to real time bus location information. This allows us to do two things:

Provide consumers with an interface that makes accessing this information natural. This is based on the notion that bus use in NYC is mostly for short distances, and there are many parallel routes

Understand the optimal bus spacing and timing based on demand
","m1fname":"Dhruv","projectname":"Project Transeo: Making public buses more efficient and accessible","m3fname":"Omar"},{"m2lname":"Jagannathan","m4lname":"","m3uni":"","m1uni":"sy2511","m4uni":"","pid":"201412-31","m2uni":"vj2192","timestring":"Mon Dec 8 23:53:02 2014","m4fname":"","language":"Python, Java Script\u000b, Flask (Website creation – Uses MVC framework) Scrapy (for scrubbing the data) AWS (Hosting the website, offline processing tasks)","m3lname":"","dataset":"Scraping Source – We selected iMDB website and tv.com as source.

Scraping what - Find the list of upcoming episodes using tv.com. Use iMDB to obtain past relevant information about a particular TV show, if it is not already present in the database. A user can also specify their own list of TV shows. The data for those shows will also be obtained, if not already present.

Scraping how - Use html parser to find tags that have the various parameters we need.

Save all the information in a json file.
","m1lname":"Yadav","industry":"","analytics":"D3 visualization
Hadoop
Mahout (for machine learning part)
HBase ","m2fname":"Vaibhav","description":"-- It is always a challenge to keep track of which TV Shows are going to air new episodes.
-- Users may not be able to follow all the series they are interested in.
-- Even if a user manages to keep track of all their favourite shows, if many of them are released around the same time, the user may not have the luxury to watch all of them.
-- If the user is presented with a list of upcoming episodes, sorted by the release date and a probabilistic rating, then the choice is clear.
","m1fname":"Shubhanshu","projectname":"TV Analytics","m3fname":""},{"m2lname":"Arnoux","m4lname":"","m3uni":"tja2117","m1uni":"npb2124","m4uni":"","pid":"201412-9","m2uni":"pa2398","timestring":"Tue Dec 9 00:05:18 2014","m4fname":"","language":"Java, Scala, Spark","m3lname":"Ansari","dataset":"20 Newsgroup","m1lname":"Bobra","industry":"","analytics":"LDA","m2fname":"Pierre","description":"Develop an open source LDA algorithm for Spark. Enable developers to use in-memory clustering algorithm over big data.","m1fname":"Neraj","projectname":"Spark NLP","m3fname":"Talha"},{"m2lname":"Wang","m4lname":"Li","m3uni":"xz2350","m1uni":"jw3127","m4uni":"wl2501","pid":"201412-22","m2uni":"mw2969","timestring":"Tue Dec 9 00:32:52 2014","m4fname":"Wanding","language":"Mahout, Hadoop, Java, Java Script","m3lname":"Zhang","dataset":"--Forex
--WTI, Brent Oil Price index
--GDP","m1lname":"Wang","industry":"","analytics":"Regression models","m2fname":"Mengnan","description":"Our project is initially designed to provide users with the following contents.

Exchange rates for all world currencies
Forward exchange rates
Cross-exchange rates
Daily exchange rates back to the 1920s
Basic analysis of exchange rates
Latest news about exchange rates

The expected outcome of this project is to set up a website with the mentioned functions and information with searching access.

With international trade and commerce becoming more important, it is necessary not only for a specific group of industries but for all of us to have daily access to updated currency rates. With more convenient access to exchange rate information, which we hope our project will provide, people are equipped with more relevant knowledge to make better decisions regarding the currency market.","m1fname":"Jianze","projectname":"Exchange Rates Inquiry and Analysis","m3fname":"Xiaomeng"},{"m2lname":"Lee","m4lname":"","m3uni":"","m1uni":"wh2318","m4uni":"","pid":"201412-3","m2uni":"wl2468","timestring":"Tue Dec 9 00:34:02 2014","m4fname":"","language":"Python, Twitter API, R, Bash, Linux Command line.","m3lname":"","dataset":"Twitter data, Twitter API
Yahoo Finance","m1lname":"Ho","industry":"","analytics":"Twitter API with Python, Data Manipulation and Visualization with R, Mahout Naive Bayes Classification","m2fname":"William","description":"Automatic tools for useful features and leading indicators
Analysis directly lead to solution and strategy
General purpose prediction

","m1fname":"Wei-Chieh","projectname":"Correlating Price / Volume of Low Volume Stocks with Social Media","m3fname":""},{"m2lname":"Lin","m4lname":"","m3uni":"tw2470","m1uni":"ss4716","m4uni":"","pid":"201412-28","m2uni":"ml3662","timestring":"Tue Dec 9 00:56:41 2014","m4fname":"","language":"Python, Java, Pig","m3lname":"Wang","dataset":"We use the datasets from basketball-references.com. It contains tons of data including each player's comprehensive stats for each game as well as the team stats. ","m1lname":"Shen","industry":"","analytics":"Implement Map Reduce From Scratch!","m2fname":"Miao","description":"It provide inside advice for NBA team managers to pick the most matched players.","m1fname":"Su","projectname":"Hunting for NBA Players","m3fname":"Tianji"},{"m2lname":"Lin","m4lname":"","m3uni":"tw2470","m1uni":"ss4716","m4uni":"","pid":"201412-28","m2uni":"ml3662","timestring":"Tue Dec 9 00:57:55 2014","m4fname":"","language":"Python, Java, Pig","m3lname":"Wang","dataset":"We use the datasets from basketball-references.com. It contains tons of data including each player's comprehensive stats for each game as well as the team stats. ","m1lname":"Shen","industry":"","analytics":"Implement Map Reduce From Scratch!","m2fname":"Miao","description":"Provide inside advice for NBA team managers to pick the most matched players.","m1fname":"Su","projectname":"Hunting for NBA Players","m3fname":"Tianji"},{"m2lname":"Lin","m4lname":"","m3uni":"tw2470","m1uni":"ss4716","m4uni":"","pid":"201412-28","m2uni":"ml3662","timestring":"Tue Dec 9 01:04:42 2014","m4fname":"","language":"Python, Java, Pig","m3lname":"Wang","dataset":"We use the datasets from basketball-references.com. It contains tons of data including each player's comprehensive stats for each game as well as the team stats.","m1lname":"Shen","industry":"","analytics":"We implement Map Reduce from scratch. 
Also we might implement some visualization to make our program more user-friendly.","m2fname":"Miao","description":"Provide insight advices for NBA team managers to pick and trade in the most matched players for their team.","m1fname":"Su","projectname":"Hunting for NBA Players","m3fname":"Tianji"},{"m2lname":"Feng","m4lname":"Xu","m3uni":"wj2227","m1uni":"kz2229","m4uni":"hx2168","pid":"201412-30","m2uni":"kf2435","timestring":"Tue Dec 9 01:31:15 2014","m4fname":"Hongliang","language":"JAVA, Hadoop, Mahout","m3lname":"Jiang","dataset":"Lynxsy is a company that matches job seekers with start-up companies. We partner with lynxsy, work on the ~3000 resumes they provided to develop this classification tool.
","m1lname":"Zhang","industry":"","analytics":"PDF extraction
Text Parsing
Pre-Filtering
Classification with mahout on hadoop","m2fname":"Kaicheng","description":"Candidate/Employer matching helps people find jobs and employers find the right candidates.
It is critical to match the right profiles to the relevant positions. But relying on HR to personally review thousands of resumes is both inefficient and unreliable.
We are developing a tool that streamlines and automates resume classification, to
analyze, filter, classify and evaluate resumes automatically and efficiently. ","m1fname":"Kaiwei","projectname":"Résumé Category Classification","m3fname":"Wentao"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rjb2150","m4uni":"","pid":"201412-20","m2uni":"","timestring":"Tue Dec 9 01:50:19 2014","m4fname":"","language":"Java (personally, I developed on OS X)","m3lname":"","dataset":"The dataset used is a corpus of chess games in the public domain, compiled by Norman Pollock, available here: http://hoflink.com/~npollock/chess.html

dataset id: gm2006.pgn
number of games: 74,726
number of players: 1,227
minimum player Elo rating: 2475
years included: 2006 - 2014
gameplay restrictions: no blitz or correspondence games","m1lname":"Barker","industry":"","analytics":"primarily Mahout, with various other support libraries (Guava, etc.)","m2fname":"","description":"The goal of this project is to be able to answer the following question:
• given a game state, who will win?

Specifically, this project focuses on games that are:
• two-team
• each team can be represented as a multiset of predefined members

Many games fall into this category:
• chess (each team is composed of N pawns, M rooks, etc.)
• deck-building games (each team is composed of predefined cards)
• MMORPGs (each team is composed of various “classes”)
","m1fname":"Raymond","projectname":"Game Outcome Analysis","m3fname":""},{"m2lname":"Zhang","m4lname":"Li","m3uni":"rg2930","m1uni":"rh2648","m4uni":"ml3695","pid":"201412-15","m2uni":"yz2698","timestring":"Tue Dec 9 09:39:35 2014","m4fname":"Mengge","language":"Python, Java, Neo4j, R","m3lname":"Gaur","dataset":"Self-collected Web Scrapping Meetup Data.","m1lname":"Huang","industry":"","analytics":"- Social Network Analytics
- Fiedler Methods for clustering social network
- Gephi & d3 for visualization","m2fname":"Yiwei","description":"This project is motivated by:

- Unique value of event-based social network (EBSN)
Both online and offline social interactions
Commercial value: industrial trends, recommendation of services/ products based on user preference

- Big Fan of Meetup.com
Popularity across academia, industry and recreation
Excellent API: user, group, event, tags – location & time

- Great opportunity to apply big data techniques
Graph database: Neo4j with Cypher
Clustering/ Community Detection
Large scale social network analysis
","m1fname":"Rongyao","projectname":"Exploring the Meetup.com Social World: Large Scale Event-Based Social Network Analysis","m3fname":"Rahul"},{"m2lname":"Zhang","m4lname":"Li","m3uni":"rg2930","m1uni":"rh2648","m4uni":"ml3695","pid":"201412-15","m2uni":"yz2698","timestring":"Tue Dec 9 10:28:07 2014","m4fname":"Mengge","language":"R, Java, Python, Cypher","m3lname":"Gaur","dataset":"Self-collected web scrapping data of meetup.com.","m1lname":"Huang","industry":"","analytics":"Social Network Analytics
Fiedler Method for Social Network Clustering/Community Detection
Gephi and D3 for Network Visualization

","m2fname":"Yiwei","description":"This project is motivated by:
- Unique value of event-based social network (EBSN)
Both online and offline social interactions
Commercial value: industrial trends, recommendation of services/ products based on user preference

- Big Fan of Meetup.com
Popularity across academia, industry and recreation
Excellent API: user, group, event, tags – location & time

- Great opportunity to apply big data techniques
Graph database: Neo4j with Cypher
Clustering/ Community Detection
Large scale social network analysis
","m1fname":"Rongyao","projectname":"Exploring the Meetup.com Social World: Large Scale Event-Based Social Network Analysis","m3fname":"Rahul"},{"m2lname":"Zhong","m4lname":"Liu","m3uni":"cl3295","m1uni":"dl2856","m4uni":"sl3763","pid":"201412-11","m2uni":"qz2198","timestring":"Tue Dec 9 10:40:11 2014","m4fname":"Sung-Yen","language":"Java, Python, Hadoop (Mahout), AngularJS, SQL","m3lname":"Lin","dataset":"1. Tweets collected via Twitter API
2. Web data: Amazon product information, provided by J.McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013
","m1lname":"Liu","industry":"","analytics":"1. Fetch tweets from the followers Using Twitter API

Detect desire by NLP and get the product name

Get the product department by classification (TFIDF,Naïve Bayes, MapReduce)

Query product with department information on Amazon

2. Read the tweet that users respond to our recommendation on twitter

Implement sentiment algorithm to convert the tweet to rating (MapReduce)

Store the rating into DB, for further user-based recommendation in our own website app.
","m2fname":"Qianyi","description":"1. Discover potential purchase demand in social network (Twitter)
2. Provide a reciprocal platform to benefit retailers and customers
","m1fname":"Dongxue","projectname":"Twitter-Based Product / Sales Events Recommender","m3fname":"Chia-Jung"},{"m2lname":"Zhong","m4lname":"Liu","m3uni":"cl3295","m1uni":"dl2856","m4uni":"sl3763","pid":"201412-11","m2uni":"qz2198","timestring":"Tue Dec 9 11:24:04 2014","m4fname":"Sung-Yen","language":"Java, Python, Hadoop (Mahout), AngularJS, SQL","m3lname":"Lin","dataset":"1. Tweets collected via Twitter API
2. Web data: Amazon product information, provided by J.McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013
","m1lname":"Liu","industry":"","analytics":"1. Fetch tweets from the followers Using Twitter API

Detect desire by NLP and get the product name

Get the product department by classification (TFIDF,Naïve Bayes, MapReduce)

Query product with department information on Amazon

2. Read the tweet that users respond to our recommendation on twitter

Implement sentiment algorithm to convert the tweet to rating (MapReduce)

Store the rating into DB, for further user-based recommendation in our own website app.
","m2fname":"Qianyi","description":"1. Discover potential purchase demand in social network (Twitter)
2. Provide a reciprocal platform to benefit retailers and customers
","m1fname":"Dongxue","projectname":"Twitter-Based Product / Sales Events Recommender","m3fname":"Chia-Jung"},{"m2lname":"Chen","m4lname":"Liu","m3uni":"ym2491","m1uni":"zg2201","m4uni":"yl3249","pid":"201412-29","m2uni":"xc2291","timestring":"Tue Dec 9 12:18:48 2014","m4fname":"Yuxuan","language":"java, Pig, Mahout, Shell script","m3lname":"Ma","dataset":"We get our datasets from Yahoo Labs:http://webscope.sandbox.yahoo.com/catalog.php. This dataset contains totally 7,462 users, 11,915 movies and 211,231 ratings. We will try to find more related datasets during our project. ","m1lname":"Guo","industry":"","analytics":"We will use some clustering algorithms like KMeans to analyze movies relationships. Recommendation algorithms like User-based and Item-based would be used to recommend movies for other users.","m2fname":"Xi","description":"Analyse the relationship between the ratings of movies from Yahoo movies dataset and the nomination of famous movie awards like Osacar and Golden Globe award, which to some degree, could reflect the objectivity of the awards awarded by committees. In addition, we will study the taste of raters of different age and their relation with the nomination of movie awards. ","m1fname":"Zhiyuan","projectname":"Oscar Award Analysis based on Big Data","m3fname":"Yunge"},{"m2lname":"Zhong","m4lname":"Liu","m3uni":"cl3295","m1uni":"dl2856","m4uni":"sl3763","pid":"201412-11","m2uni":"qz2198","timestring":"Tue Dec 9 12:33:24 2014","m4fname":"Sung-Yen","language":"Java, Python, Hadoop (Mahout), AngularJS, SQL","m3lname":"Lin","dataset":"1. Tweets collected via Twitter API
2. Web data: Amazon product information, provided by J.McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013
","m1lname":"Liu","industry":"","analytics":"1. Fetch tweets from the followers Using Twitter API

Detect desire by NLP and get the product name

Get the product department by classification (TFIDF,Naïve Bayes, MapReduce)

Query product with department information on Amazon

2. Read the tweet that users respond to our recommendation on twitter

Implement sentiment algorithm to convert the tweet to rating (MapReduce)

Store the rating into DB, for further user-based recommendation in our own website app.
","m2fname":"Qianyi","description":"1. Discover potential purchase demand in social network (Twitter)
2. Provide a reciprocal platform to benefit retailers and customers
","m1fname":"Dongxue","projectname":"Twitter-Based Product / Sales Events Recommender","m3fname":"Chia-Jung"},{"m2lname":"Wang","m4lname":"Li","m3uni":"xz2350","m1uni":"jw3127","m4uni":"wl2501","pid":"201412-22","m2uni":"mw2969","timestring":"Tue Dec 9 12:37:45 2014","m4fname":"Wanding","language":"Mahout, Hadoop, Java, Java Script, Matlab","m3lname":"Zhang","dataset":"--Forex
--WTI, Brent Oil Price index
--GDP ","m1lname":"Wang","industry":"","analytics":"Regression models","m2fname":"Mengnan","description":"Our project is initially designed to provide users with the following contents.

Exchange rates for all world currencies
Forward exchange rates
Cross-exchange rates
Daily exchange rates back to the 1920s
Basic analysis of exchange rates
Latest news about exchange rates

The expected outcome of this project is to set up a website with the mentioned functions and information with searching access.

With international trade and commerce becoming more important, it is necessary not only for a specific group of industries but for all of us to have daily access to updated currency rates. With more convenient access to exchange rate information, which we hope our project will provide, people are equipped with more relevant knowledge to make better decisions regarding the currency market.","m1fname":"Jianze","projectname":"Exchange Rates Inquiry and Analysis","m3fname":"Xiaomeng"},{"m2lname":"Shen","m4lname":"","m3uni":"ml3662","m1uni":"tw2470","m4uni":"","pid":"201412-28","m2uni":"ss4716","timestring":"Tue Dec 9 12:48:04 2014","m4fname":"","language":"Python, Java, Pig","m3lname":"Lin","dataset":"We use the datasets from basketball-references.com. It contains tons of data including each player's comprehensive stats for each game as well as the team stats. ","m1lname":"Wang","industry":"","analytics":"We implement Map Reduce from scratch. Also we might implement some visualization to make our program more user-friendly.","m2fname":"Su","description":"Provide insight advices for NBA team managers to pick and trade in the most matched players for their team.","m1fname":"Tianji","projectname":"Hunting for NBA Players","m3fname":"Miao"},{"m2lname":"Shen","m4lname":"","m3uni":"ml3662","m1uni":"tw2470","m4uni":"","pid":"201412-28","m2uni":"ss4716","timestring":"Tue Dec 9 13:40:01 2014","m4fname":"","language":"Python, Java, Pig","m3lname":"Lin","dataset":"We use the datasets from basketball-references.com. It contains tons of data including each player's comprehensive stats for each game as well as the team stats. ","m1lname":"Wang","industry":"","analytics":"We implement Map Reduce from scratch. 
Also we might implement some visualization to make our program more user-friendly.","m2fname":"Su","description":"Provide insight advices for NBA team managers to pick and trade in the most matched players for their team.","m1fname":"Tianji","projectname":"Hunting for NBA Players","m3fname":"Miao"},{"m2lname":"Liu","m4lname":"Ma","m3uni":"xc2291","m1uni":"zg2201","m4uni":"ym2491","pid":"201412-29","m2uni":"yl3249","timestring":"Tue Dec 9 16:35:42 2014","m4fname":"Yunge","language":"java, Pig, Mahout, Shell script","m3lname":"Chen","dataset":"We get our datasets from Yahoo Labs:http://webscope.sandbox.yahoo.com/catalog.php. This dataset contains totally 7,462 users, 11,915 movies and 211,231 ratings. We will try to find more related datasets during our project.","m1lname":"Guo","industry":"","analytics":"We will use some clustering algorithms like KMeans to analyze movies relationships. Recommendation algorithms like User-based and Item-based would be used to recommend movies for other users.","m2fname":"Yuxuan ","description":"Analyse the relationship between the ratings of movies from Yahoo movies dataset and the nomination of famous movie awards like Osacar and Golden Globe award, which to some degree, could reflect the objectivity of the awards awarded by committees. In addition, we will study the taste of raters of different age and their relation with the nomination of movie awards. ","m1fname":"Zhiyuan ","projectname":"Oscar Award Analysis based on Big Data","m3fname":"Xi"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tma2131","m4uni":"","pid":"201412-67","m2uni":"","timestring":"Tue Dec 9 23:57:12 2014","m4fname":"","language":"Languages: C/C++, Java, Ruby, bash scripting, PHP, HTML, Javascript; Platforms: Hadoop map/reduce, Hadoop streaming, Opencv, ffmpeg, Matlab.","m3lname":"","dataset":"Project dataset came from the Open Connectome Project. I focused on the Kasthuri11 dataset. 
","m1lname":"Adams","industry":"","analytics":"Hadoop streaming HDF5 download service. HTML5, JS, PHP web service. Hadoop streaming ffmpeg conversion service (HDF5 --> Video). Wrapped opencv computer vision algorithms for Hadoop streaming. 3D gradient feature extraction in Java map/reduce. ","m2fname":"","description":"The objective of this project is to develop a framework for automatic reconstruction of connectomes based on labelled nodes (i.e. functional neurons) and pathways (i.e. synapses). A prototype will be developed with capabilities in 5 of the following 6 areas:
1. On-demand Web Extraction Service : allows user to carve out HDF5 imagery from data hosted by the Open Connectome Project.
2. Distributed Download Service : automatically launches distributed download and HDFS indexing of data from OCP using hadoop streaming.
3. Image Handling Service : transcodes data for follow-on processing (i.e. feature extraction).
4. Feature Extraction Service : extracts 2D and 3D features for matching.
5. Feature Matching Service : returns scores based on feature similarity and network consistency.
6. Graph Reconstruction : build representative network from user-provided annotations.

The main innovation would be to facilitate application of new advanced analytics on high-res electron microscopy images. This might include probabilistic graphical models, belief propagation, non-parametric statistics/dynamics or deep learning (in the future).

Capability is a prototype framework that can incorporate new analytics running on large-scale image slices. Also, at least one algorithm will be implemented in 5 of the 6 areas listed above.

This research is critical for understanding both brain structure and brain function. ","m1fname":"Terrence","projectname":"Brain Edge Detection ","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tma2131","m4uni":"","pid":"201412-67","m2uni":"","timestring":"Wed Dec 10 00:17:27 2014","m4fname":"","language":"Languages: C/C++, Java, Ruby, bash scripting, PHP, HTML, Javascript; Platforms: Hadoop map/reduce, Hadoop streaming, Opencv, ffmpeg, Matlab.","m3lname":"","dataset":"Project dataset came from the Open Connectome Project. I focused on the Kasthuri11 dataset. ","m1lname":"Adams","industry":"","analytics":"Hadoop streaming HDF5 download service. HTML5, JS, PHP web service. Hadoop streaming ffmpeg conversion service (HDF5 --> Video). Wrapped opencv computer vision algorithms for Hadoop streaming. 3D gradient feature extraction in Java map/reduce. ","m2fname":"","description":"The objective of this project is to develop a framework for automatic reconstruction of connectomes based on labelled nodes (i.e. functional neurons) and pathways (i.e. synapses). A prototype will be developed with capabilities in 5 of the following 6 areas:
1. On-demand Web Extraction Service : allows user to carve out HDF5 imagery from data hosted by the Open Connectome Project.
2. Distributed Download Service : automatically launches distributed download and HDFS indexing of data from OCP using Hadoop streaming.
3. Image Handling Service : transcodes data for follow-on processing (i.e. feature extraction).
4. Feature Extraction Service : extracts 2D and 3D features for matching.
5. Feature Matching Service : returns scores based on feature similarity and network consistency.
6. Graph Reconstruction : builds a representative network from user-provided annotations.

The main innovation would be to facilitate application of new advanced analytics on high-res electron microscopy images. This might include probabilistic graphical models, belief propagation, non-parametric statistics/dynamics or deep learning (in the future).

The capability delivered is a prototype framework that can incorporate new analytics running on massive high-res image slices. Also, at least one algorithm will be implemented in 5 of the 6 areas listed above.

This research is critical for understanding both brain structure and brain function. ","m1fname":"Terrence","projectname":"Brain Edge Detection ","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cl300","m4uni":"","pid":"201412-89","m2uni":"","timestring":"Wed Dec 10 09:48:39 2014","m4fname":"","language":"Test","m3lname":"","dataset":"Test","m1lname":"Lin","industry":"","analytics":"Test","m2fname":"","description":"Test","m1fname":"CY","projectname":"Test","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tma2131","m4uni":"","pid":"201412-67","m2uni":"","timestring":"Wed Dec 10 13:11:33 2014","m4fname":"","language":"Languages: C/C++, Java, Ruby, bash scripting, PHP, HTML, Javascript; Platforms: Hadoop map/reduce, Hadoop streaming, Opencv, ffmpeg, Matlab.","m3lname":"","dataset":"Project dataset came from the Open Connectome Project. I focused on the Kasthuri11 dataset. ","m1lname":"Adams","industry":"","analytics":"Hadoop streaming HDF5 download service. HTML5, JS, PHP web service. Hadoop streaming ffmpeg conversion service (HDF5 --> Video). Wrapped opencv computer vision algorithms for Hadoop streaming. 3D gradient feature extraction in Java map/reduce. Stood up and administered 5-node cluster (Cloud Ganita) for the purpose of this course. Did networking using cat5e and commodity routers. Implemented website http://cloud.ganita.org for the course and final project. ","m2fname":"","description":"The objective of this project is to develop a framework for automatic reconstruction of connectomes based on labelled nodes (i.e. functional neurons) and pathways (i.e. synapses). A prototype will be developed with capabilities in 5 of the following 6 areas:
1. On-demand Web Extraction Service : allows user to carve out HDF5 imagery from data hosted by the Open Connectome Project.
2. Distributed Download Service : automatically launches distributed download and HDFS indexing of data from OCP using Hadoop streaming.
3. Image Handling Service : transcodes data for follow-on processing (i.e. feature extraction).
4. Feature Extraction Service : extracts 2D and 3D features for matching.
5. Feature Matching Service : returns scores based on feature similarity and network consistency.
6. Graph Reconstruction : builds a representative network from user-provided annotations.

The main innovation would be to facilitate application of new advanced analytics on high-res electron microscopy images. This might include probabilistic graphical models, belief propagation, non-parametric statistics/dynamics or deep learning (in the future).

The capability delivered is a prototype framework that can incorporate new analytics running on massive high-res image slices. Also, at least one algorithm will be implemented in 5 of the 6 areas listed above.

This research is critical for understanding both brain structure and brain function.","m1fname":"Terrence","projectname":"Brain Edge Detection","m3fname":""},{"m2lname":"Zhu","m4lname":"","m3uni":"","m1uni":"to2232","m4uni":"","pid":"201412-4","m2uni":"kz2232","timestring":"Wed Dec 10 14:05:13 2014","m4fname":"","language":"Java, PHP, R, Pig, and Javascript ","m3lname":"","dataset":"We used the genomic and the clinical profiles from the 3,316 patients in TCGA data set. Since we transformed the data set into the common CSV format, our programs can take any RNAseq/microarrary-based genomic data if needed. ","m1lname":"Ou Yang","industry":"","analytics":"We accomplished the following deliverables:

1. Map-Reduce Concordance Index Algorithm
We created the map-reduce concordance index function by mapping all possible combinations of the predictions into pairs and checking whether each pair is valid and concordant with the response. In the reduce phase, we then collect the valid and concordant pairs.

We reduced the input size by 66% by hashing the patient's status and survival time into a single floating-point number and then sorting the hashed responses in order of prediction. We therefore transmit only an array of floating-point numbers instead of a matrix.

2. Patient-Based Treatment Recommendation Engine
We created our own treatment recommendation engine, which works in three steps. First, it computes the similarity between each patient in the data set and the profile uploaded by the user. Second, it sorts the most similar profiles by expected survival time. Finally, it returns the treatment plan received by the patient with the longest expected survival.

The algorithm was validated using LOOCV over the 3,316 cancer patients, which yielded an 87.24% true positive rate.

3. D3.js-based visualization for the Bayesian networks
We created the Bayesian network of the 30 genes using the bnlearn package in R and then visualized it using D3.js on our website. The subnetworks were validated using a publicly available gene ontology analysis tool.

4. PHP-Based Interactive Visualization
We created an interactive facility using PHP and R. Because we created the REST API for our concordance index function and the recommendation engine, we demonstrate them as a web service.","m2fname":"Kaiyi","description":"The objective of our project is to analyze the massive cancer genome profiles to create the model for diagnosis and treatment suggestion. Because of the feature number and the size of the samples, the analysis could only be done with the Big Data Analytics tools.
In our project, we implemented the similarity functions used in biomedical research. Our innovations are:

1. We created the map-reduce concordance index function and we successfully reduced the input size by more than 50%.
2. We identified 30 genes that are related to patient outcomes using the toolkit and visualized their network. We also validated that they represent the cancer hallmarks.
3. We created a patient-based treatment recommendation engine using R, which returns the analysis and suggestion, together with a plot, within one second.
4. We created a web API and an interactive facility for the recommendation engine and the concordance index function.

This project is important because no diagnosis and treatment planning tools had previously been created on a Pan-Cancer basis, and the tools we provide had not been implemented on Hadoop platforms before. The project may not only shed light on the molecular pathologies of complicated malignancies and on potential therapeutic regimens but also provide a tool for medical professionals to save lives.
","m1fname":"Tai-Hsien","projectname":"Network Analysis on the Big Cancer Genome Data","m3fname":""},{"m2lname":"Pan","m4lname":"Machado","m3uni":"adc2171","m1uni":"xs2229","m4uni":"jkm2155","pid":"201412-50","m2uni":"zp2162","timestring":"Wed Dec 10 15:30:48 2014","m4fname":"Joseph","language":"JAVA, Hadoop, Mahout, Google Maps API, PHP, Javascript","m3lname":"Cunha","dataset":"Dataset:
MTA’s subway dataset of New York City during November 9, 2014 to November 15, 2014. This dataset is provided by MTA and is not public.
Our software can also support MTA's bus dataset.","m1lname":"Shang","industry":"","analytics":"Algorithms:
1. We use a Hadoop map-reduce algorithm to pre-process the dataset.
2. We create station-based predictions using Mahout's user-based recommendation. We take different stations, subway lines, and actual delays as input to predict the estimated delay.
Visualization:
We create a GUI for users to input the departure position and destination and show the optimized result as well.","m2fname":"Zhao","description":"Objectives:
We aim to calculate accurately the time spent traveling between a departure position and a destination by subway in New York City. We use the dataset provided by MTA. Due to the size of the dataset, the analysis can only be done with big data methods.

Innovations:
1. We extract the data from the MTA dataset and pre-process it with Hadoop.
2. We predict delays with user-based recommendation using Mahout.
3. We use the Google Maps API to find the path that takes the least time.
4. We combine the estimated delay and the subway travel time to calculate the total time spent between the departure position and the destination.
5. We create a GUI to show the optimized result.

This project is important because, although plenty of transportation APIs on the market try to help users find the best route among different transportation services, most of them do not account for the delays that occur all the time, so their estimated times are not fully accurate. We try to fix this by cooperating with MTA.
","m1fname":"Xia","projectname":"Best Transportation choice","m3fname":"Andre"},{"m2lname":"Nong","m4lname":"Wang","m3uni":"yc2998","m1uni":"zm2221","m4uni":"yw2625","pid":"201412-47","m2uni":"sn2603","timestring":"Wed Dec 10 16:56:43 2014","m4fname":"Yizhe","language":"Java Eclipse","m3lname":"Chen","dataset":"Our dataset is provided by Duke University. But the original data has been provided by a telecom company.
This is a Modeling Tournament dataset","m1lname":"Miao","industry":"","analytics":"K-Means Clustering","m2fname":"Shibiao","description":"Telecom companies are trying to make more profits, so they need to understand their customers better.

Customers’ behaviors reflect their consumption patterns. We assume that the customers of telecom companies can be divided into different groups based on certain characteristics. By analyzing other characteristics across these groups, we can discover the relationships between customers’ behavior and their characteristics.
","m1fname":"Zhilei","projectname":"Analysis of telecom service in cellular network","m3fname":"Yaqi"},{"m2lname":"Liu","m4lname":"","m3uni":"ht2358","m1uni":"yl3055","m4uni":"","pid":"201412-44","m2uni":"dl2870","timestring":"Wed Dec 10 17:42:19 2014","m4fname":"","language":"Java 7, JavaScript, Mahout","m3lname":"Tong","dataset":"Movie reviews with labeled scores from rottentomatoes.com, collected by Bo Pang, available at http://www.cs.cornell.edu/People/pabo/movie-review-data/

Reviews are raw text with scores ranging from 0 to 3, and there are 4,500 reviews in total. We shuffled the data and used 80% as the training set and 20% as the testing set.
","m1lname":"Liu","industry":"","analytics":"Naive Bayes, Logistic Regression, Demo Website","m2fname":"Di","description":"We love movies, and movie reviews help us to find a good movie.

In this project we will use large-scale machine learning and natural language processing technologies to experiment on a sentiment analysis task on movie reviews.

Rather than simply deciding whether a review is “thumbs up” or “thumbs down”, we want to label a review on a scale of four values: one to four “stars”.
","m1fname":"Yunao","projectname":"Sentiment Analysis on Movie Reviews","m3fname":"Hao"},{"m2lname":"Wang","m4lname":"Yan","m3uni":"wz2311","m1uni":"ddc2122","m4uni":"jy2677","pid":"201412-34","m2uni":"hw2465","timestring":"Wed Dec 10 18:52:10 2014","m4fname":"Jiayi","language":"Java, Perl, Neo4J","m3lname":"Zhang","dataset":"Dataset was from the CAIDA (Center for Applied Internet Data Analysis) and was based on monitor information which was gathered by their tools against autonomous systems (networks) behind routers. The dataset, preparsed, contained monitor information as well as all paths observed by each monitor. The software requires the data to be in a to, from, pathlength format in order to generate the node locations and directional paths.","m1lname":"Cadigan","industry":"","analytics":"Primairly the java toolkit that is available for Neo4J. Project was written in eclipse and contains a java based gui. Visualization of the graph as well as additional recommendation tools are still under development.","m2fname":"Hongjie","description":"Design and implement a toolkit which could take in internet base network and routing information and map out AS to AS network paths. This toolkit would take in all AS information as well as the path lengths between each node. 
In developing the tools, we would learn more about the importance of network based routing algorithms.","m1fname":"David","projectname":"Network Congestion / Path Analysis","m3fname":"Wei"},{"m2lname":"Wen","m4lname":"Chen","m3uni":"dk2814","m1uni":"mz2417","m4uni":"dc3026","pid":"201412-46","m2uni":"cw2758","timestring":"Wed Dec 10 19:02:41 2014","m4fname":"Duo","language":"Python, Java","m3lname":"Kuchhal","dataset":"Yelp Challenge Dataset - http://www.yelp.com/dataset_challenge ","m1lname":"Zhou","industry":"","analytics":"We proposed a 3-part technical approach utilizing both rating similarity and review similarity.","m2fname":"Chen","description":"Identify group fake reviews among Yelp restaurant reviews.
Opinion spamming is detrimental to Yelp's credibility, creates unfair competitions among businesses and provides deceptive information for customers.","m1fname":"Mo","projectname":"Yelp Fake Review Detection","m3fname":"Dhruv"},{"m2lname":"Yan","m4lname":"","m3uni":"ql2163","m1uni":"cc3736","m4uni":"","pid":"201412-35","m2uni":"jy2654","timestring":"Wed Dec 10 19:56:39 2014","m4fname":"","language":"Hadoop,Mahout,Pig,Java,Matlab, Python","m3lname":"Li","dataset":"Accelerometer(3-axial linear acceleration)
Gyroscope (3-axial angular velocity)
sampling rate: 50Hz
Each person performed six activities
walking, walking_upstairs, walking_downstairs, sitting, standing, lying
Dataset
http://www.smartlab.ws
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones ","m1lname":"Chen","industry":"","analytics":"1. Signal analysis detected from the accelerometers and gyroscope and convert it into dataset.

2. Try different classification methods on the useful data extracted from raw signals and compare the accuracy of the classifiers.

3. Create a predictive model which could indicate the human activity state.

4. Apply our model to a different dataset of elderly volunteers aged 60-70 to test its effectiveness.

Algorithm:
Data Preprocessing: FFT
Classification: Support Vector Machines; Naive Bayes networks; K-Nearest Neighbors ","m2fname":"Junkai","description":"Human activity monitoring and prediction system
health care, near-emergency early warning, fitness monitoring and assisted living
Sensor data from the practical, small and unobtrusive platform -- smartphone ","m1fname":"Chao","projectname":"Human Activity Monitoring and Prediction","m3fname":"Qi"},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"yc2911","m4uni":"","pid":"201412-37","m2uni":"jw3153","timestring":"Wed Dec 10 20:28:34 2014","m4fname":"","language":"Python, R, Mahout","m3lname":"","dataset":"Yelp 2014 Challenge Dataset

The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh:

42,153 businesses
320,002 business attributes
31,617 check-in sets
252,898 users
955,999 edge social graph
403,210 tips
1,125,458 reviews","m1lname":"Cheng","industry":"","analytics":"Algorithms: Topic Modeling and Classification

● Latent Dirichlet allocation (LDA)
To get the subtopics for our classification tasks, we will first use LDA. It is a widely used topic model for generating topics from documents. The results of this model will be a given number of topics composed of a set of terms (topic words).

● Naïve Bayes
The Naïve Bayes classifier is selected in this study because it is a traditional probabilistic method for text categorization. The model applies Bayes’ theorem and assumes strong independence between variables.

● Logistic Model
The logistic classifier is a very popular probabilistic model used for predicting a categorical dependent variable. It models the probability that the response belongs to a particular category.

● Support Vector Machines (SVM)
SVM is one of the most widely used supervised learning algorithms for classification. It differs from the logistic model in that it is a non-probabilistic binary linear classifier. Moreover, non-linear classification can also be performed with SVM using a specific kernel.

● Others (mentioned in final report)
","m2fname":"Jingchi ","description":"Description of Tasks
● Topic Modeling. To categorize restaurant reviews into a number of subtopics. Ideally, we expect them to represent different aspects, such as food, service, environment, etc.
● Prediction (non-text and textual classification). After getting the subtopics, we will predict useful reviews by fitting various models to either numerical or textual data.

Expected Contributions
● Providing a more functional and targeted review service to Yelp users, which can assure them of valuable reviews based on the topics they are interested in.
","m1fname":"Yu Hua","projectname":"Predicting Useful Restaurant Reviews by Subtopics using Yelp Data","m3fname":""},{"m2lname":"El-Refaey","m4lname":"Zong","m3uni":"vjr21","m1uni":"jx2211","m4uni":"sz2477","pid":"201412-33","m2uni":"mae2150","timestring":"Wed Dec 10 20:33:24 2014","m4fname":"Shi","language":" Apache Mahout, OpenCV, Java, MashApe CamFind API, HBase, Hive, D3 and JavaScript, Matlab","m3lname":"Rubino","dataset":"Interest levels data set provided by Neuromatters.
The dataset is about 32 individuals viewing 64 videos. Video length varies from 15 seconds to one minute.","m1lname":"Xie","industry":"","analytics":"MashApe CamFind API with Java. Classification algorithms, Contextual Analysis and Collaborative Filtering (CF) to build the recommendation engine. Visualizations from D3 toolkit for implementing a visualization to represent the output.","m2fname":"Mohamed","description":"Objectives:We will implement an interest-based Video Recommendation engine based on “interest levels”.

Innovations: An EEG-based “interest level” could, in theory, give insight into video preferences that cannot easily be expressed in words. We could use the EEG-based dataset to make recommendations to users.

Capabilities: Our recommendation engine will provide users with generic EEG-based recommendations that do not require user interaction (i.e. no explicit rating).

Importance: Obviously, in computer vision, people are better recognized without more interactions with computer. For example, people don't have to fill out rating forms to get recommendations about what they like.","m1fname":"Jingwen","projectname":"EEGoVid: An EEG-Based Interest Level Video Recommendation Engine","m3fname":"Vincent "},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cc3613","m4uni":"","pid":"201412-41","m2uni":"","timestring":"Wed Dec 10 20:54:12 2014","m4fname":"","language":"Java, Hadoop, Mahout/ Mac","m3lname":"","dataset":"the dataset is acquired from databasebasketball.com. The site admin only allows me to use dataset up to 2009 season since up-to-date data costs 650 dollars.

If this algorithm works correctly, the up-to-date data should be able to generate trustworthy results as well.","m1lname":"Chao","industry":"","analytics":"k-mean clustering is implemented to cluster players using their stats.","m2fname":"","description":"Fantasy Basketball has been really popular in the US. Many websites even let people bet on this. A toolkit that helps Fantasy owners to win the game will generate a lot of interests. The goal is to use existing data to come out with a good algorithm to help owners to win in different situations.","m1fname":"Chia Kang","projectname":"Fantasy Basketball Prediction Using previous Seasons' Data","m3fname":""},{"m2lname":"Pan","m4lname":"Machado","m3uni":"adc2171","m1uni":"xs2229","m4uni":"jkm2155","pid":"201412-50","m2uni":"zp2162","timestring":"Wed Dec 10 21:00:13 2014","m4fname":"Joseph","language":"JAVA, Pig, Hadoop, Mahout, Google Maps API, PHP, Javascript","m3lname":"Cunha","dataset":"MTA’s subway dataset of New York City during November 9, 2014 to November 15, 2014. This dataset is provided by MTA and is not public.
Our software can also support MTA's bus dataset.","m1lname":"Shang","industry":"","analytics":"Algorithms:
1. We use a Hadoop map-reduce algorithm and bash to pre-process the dataset.
2. We use Pig to remove useless data and calculate the average delay.
3. We create station-based predictions using Mahout's user-based recommendation. We take different stations, subway lines, and actual delays as input to predict the estimated delay.

Visualization:
We create a GUI for users to input the departure position and destination and show the optimized result as well.","m2fname":"Zhao","description":"Objectives:
We aim to calculate accurately the time spent traveling between a departure position and a destination by subway in New York City. We use the dataset provided by MTA. Due to the size of the dataset, the analysis can only be done with big data methods.

Innovations:
1. We extract the data from the MTA dataset and pre-process it with Hadoop.
2. We predict delays with user-based recommendation using Mahout.
3. We use the Google Maps API to find the path that takes the least time.
4. We combine the estimated delay and the subway travel time to calculate the total time spent between the departure position and the destination.
5. We create a GUI to show the optimized result.

This project is important because, although plenty of transportation APIs on the market try to help users find the best route among different transportation services, most of them do not account for the delays that occur all the time, so their estimated times are not fully accurate. We try to fix this by cooperating with MTA. ","m1fname":"Xia","projectname":"Best Transportation Choice","m3fname":"Andre"},{"m2lname":"Pan","m4lname":"Machado","m3uni":"adc2171","m1uni":"xs2229","m4uni":"jkm2155","pid":"201412-50","m2uni":"zp2162","timestring":"Wed Dec 10 21:02:02 2014","m4fname":"Joseph","language":"JAVA, Pig, Hadoop, Mahout, Google Maps API, PHP, Javascript","m3lname":"Cunha","dataset":"MTA’s subway dataset of New York City during November 9, 2014 to November 15, 2014. This dataset is provided by MTA and is not public.
Our software can also support MTA's bus dataset.","m1lname":"Shang","industry":"","analytics":"Algorithms:
1. We use a Hadoop map-reduce algorithm and bash to pre-process the dataset.
2. We use Pig to remove useless data and calculate the average delay.
3. We create station-based predictions using Mahout's user-based recommendation. We take different stations, subway lines, and actual delays as input to predict the estimated delay.

Visualization:
We create a GUI for users to input the departure position and destination and show the optimized result as well.","m2fname":"Zhao","description":"Objectives:
We aim to calculate accurately the time spent traveling between a departure position and a destination by subway in New York City. We use the dataset provided by MTA. Due to the size of the dataset, the analysis can only be done with big data methods.

Innovations:
1. We extract the data from the MTA dataset and pre-process it with Hadoop.
2. We predict delays with user-based recommendation using Mahout.
3. We use the Google Maps API to find the path that takes the least time.
4. We combine the estimated delay and the subway travel time to calculate the total time spent between the departure position and the destination.
5. We create a GUI to show the optimized result.

This project is important because, although plenty of transportation APIs on the market try to help users find the best route among different transportation services, most of them do not account for the delays that occur all the time, so their estimated times are not fully accurate. We try to fix this by cooperating with MTA. ","m1fname":"Xia","projectname":"Best Transportation Choice","m3fname":"Andre"},{"m2lname":"Pan","m4lname":"Machado","m3uni":"adc2171","m1uni":"xs2229","m4uni":"jkm2155","pid":"201412-50","m2uni":"zp2162","timestring":"Wed Dec 10 21:03:04 2014","m4fname":"Joseph","language":"JAVA, Pig, Hadoop, Mahout, Google Maps API, PHP, Javascript","m3lname":"Cunha","dataset":"MTA’s subway dataset of New York City during November 9, 2014 to November 15, 2014. This dataset is provided by MTA and is not public.
Our software can also support MTA's bus dataset.","m1lname":"Shang","industry":"","analytics":"Algorithms:
1. We use a Hadoop map-reduce algorithm and bash to pre-process the dataset.
2. We use Pig to remove useless data and calculate the average delay.
3. We create station-based predictions using Mahout's user-based recommendation. We take different stations, subway lines, and actual delays as input to predict the estimated delay.

Visualization:
We create a GUI for users to input the departure position and destination and show the optimized result as well.","m2fname":"Zhao","description":"Objectives:
We aim to calculate accurately the time spent traveling between a departure position and a destination by subway in New York City. We use the dataset provided by MTA. Due to the size of the dataset, the analysis can only be done with big data methods.

Innovations:
1. We extract the data from the MTA dataset and pre-process it with Hadoop.
2. We predict delays with user-based recommendation using Mahout.
3. We use the Google Maps API to find the path that takes the least time.
4. We combine the estimated delay and the subway travel time to calculate the total time spent between the departure position and the destination.
5. We create a GUI to show the optimized result.

This project is important because, although plenty of transportation APIs on the market try to help users find the best route among different transportation services, most of them do not account for the delays that occur all the time, so their estimated times are not fully accurate. We try to fix this by cooperating with MTA. ","m1fname":"Xia","projectname":"Best Transportation Choice","m3fname":"Andre"},{"m2lname":"Chuang","m4lname":"Gu","m3uni":"ls3201","m1uni":"kt2536","m4uni":"yg2384","pid":"201412-36","m2uni":"hc2751","timestring":"Wed Dec 10 21:22:27 2014","m4fname":"Yujia","language":"Matlab, Hadoop, Mahout","m3lname":"Su","dataset":"We use the dataset from http://www.basketball-reference.com
Using NBA Player data in 2009-2010
Each player has several performance statistics, including field goals made, field goal attempts, three-pointers made, free throws made, free throw attempts, rebounds, assists, steals, blocks, and points","m1lname":"Tsai","industry":"","analytics":"algorithms:K-means Clustering, Recommendation, Linear regression, PLA algorithm","m2fname":"Hao-Hsiang","description":"Sports play a big role in the lives of people in the USA, and many people spend their leisure time watching the NBA, MLB, NFL, etc. Within the last five years, ESPN and Yahoo have released online sports games for all the professional sports leagues, which give people a chance to form a simulated team online and compete with other gamers. The ranking of the online teams is based on the real-world performance of the players they pick for their teams. Therefore, drafting becomes a key factor in winning the game. We wish to provide a winning strategy for drafting. Moreover, we hope to provide a practical dynamic strategy for real NBA teams to choose players.","m1fname":"Kun-Yen","projectname":"ESPN Fantasy Basketball – Winning Strategy !","m3fname":"Lin"},{"m2lname":"Rubino","m4lname":"Zong","m3uni":"jx2211","m1uni":"mae2150","m4uni":"sz2477","pid":"201412-33","m2uni":"vjr21","timestring":"Wed Dec 10 21:39:29 2014","m4fname":"Shi","language":"Apache Mahout, OpenCV, Java, MashApe CamFind API, HBase, Hive, D3 and JavaScript, Matlab ","m3lname":"Xie","dataset":"Interest levels data set provided by Neuromatters.
The dataset is about 32 individuals viewing 64 videos. Video length varies from 15 seconds to one minute. ","m1lname":"El-Refaey","industry":"","analytics":"MashApe CamFind API with Java. Classification algorithms, Contextual Analysis and Collaborative Filtering (CF) to build the recommendation engine. Visualizations from D3 toolkit for implementing a visualization to represent the output. ","m2fname":"Vincent","description":"Objectives:We will implement an interest-based Video Recommendation engine based on “interest levels”.

Innovations: An EEG-based “interest level” could, in theory, give insight into video preferences that cannot easily be expressed in words. We could use the EEG-based dataset to make recommendations to users.

Capabilities: Our recommendation engine will provide users with generic EEG-based recommendations that do not require user interaction (i.e. no explicit rating).

Importance: Obviously, in computer vision, people are better recognized without more interactions with computer. For example, people don't have to fill out rating forms to get recommendations about what they like.","m1fname":"Mohamed","projectname":"EEGoVid: An EEG-Based Interest Level Video Recommendation Engine","m3fname":"Jingwen"},{"m2lname":"Xiangliang","m4lname":"Yumeng","m3uni":"Xia","m1uni":"Yu","m4uni":"Xu","pid":"201412-61","m2uni":"Yang","timestring":"","m4fname":"yx2251","language":"1. Historical NAVs for All ProShares ETFs.
We obtained it from the Internet.
","m3lname":"Yu","dataset":"The objective is to forecast the price of the stocks according to the large history dataset.

The innovation is the use of a neural network algorithm to calculate the potential price and trend more accurately.

This research has its merit in that it provides some rational analytics about the stock prices, thus can give valuable support to decision makers. ","m1lname":"Yi","industry":"Back-propagation neural network \u000b \u000b","analytics":"Languages: JAVA/JSP/ platforms: Eclipse/Hadoop","m2fname":"yy2496","description":"Stock Forecasting With BP Neuron Networking Algorithm and Hadoop Map-Reduce","m1fname":"Wed Dec 10 21:51:21 2014","projectname":"yx2239","m3fname":"xy2220"},{"m2lname":"Xiangliang","m4lname":"Yumeng","m3uni":"Xia","m1uni":"Yu","m4uni":"Xu","pid":"201412-61","m2uni":"Yang","timestring":"","m4fname":"yx2251","language":"Dataset is \"Historical NAVs for All ProShares ETFs\", We get it from Internet. ","m3lname":"Yu","dataset":"The objective is to forecast the price of the stocks according to the large history dataset. The innovations include use neuron networking algorithm to calculate the potential price and tend more accurately. This research has its merit in that it provides some rational analytics about the stock prices, thus can give valuable support to decision makers. ","m1lname":"Yi","industry":"Back-propagation neural network algorithm","analytics":" Languages: JAVA/JSP/ platforms: Eclipse/Hadoop","m2fname":"yy2496","description":"Stock Forecasting Using Hadoop Map-Reduce","m1fname":"Wed Dec 10 22:02:01 2014","projectname":"yx2239","m3fname":"xy2220"},{"m2lname":"Xu","m4lname":"Yang","m3uni":"yx2251","m1uni":"yy2496","m4uni":"xy2220","pid":"201412-61","m2uni":"yx2239","timestring":"Wed Dec 10 22:06:22 2014","m4fname":"Xiangliang","language":"Languages: JAVA/JSP platforms: Eclipse/Hadoop","m3lname":"Xia","dataset":"Dataset is \"Historical NAVs for All ProShares ETFs\", We get it from Internet.","m1lname":"Yu","industry":"","analytics":"Back-propagation neural networking algorithm","m2fname":"Yumeng","description":"The objective is to forecast the price of the stocks according to the large history dataset. 
The innovations include using a neural network algorithm to calculate the potential price and trend more accurately. The merit of this research is that it provides rational analytics about stock prices and can thus give valuable support to decision makers.
","m1fname":"Yi","projectname":"Stock Forecasting Using Hadoop Map-Reduce","m3fname":"Yu"},{"m2lname":"Parikh","m4lname":"","m3uni":"ps2791","m1uni":"ssb2171","m4uni":"","pid":"201412-45","m2uni":"yp2348","timestring":"Wed Dec 10 22:11:53 2014","m4fname":"","language":"Java, Python, JavaScript, Hive, AWS - S3, RDS, EBS, EMR","m3lname":"Sinha","dataset":"The dataset is taken from the Yelp Dataset Challenge. The link to the dataset is http://www.yelp.com/dataset_challenge. It has over 40,000 businesses, 250,000 users and 1.12 million reviews spanning multiple cities.

The software can support data that has 2 set of components, i.e users and businesses and a rating between those 2 components. Other examples are movie reviews, product reviews.","m1lname":"Boobna","industry":"","analytics":"We implemented a user based collaborative filtering algorithm using a weighted bipartite graph. We used a recommendation power similarity score instead of other similarity measures like cosine similarity.

Using a 95-5 cross-validation ratio, we analyzed the recommendations of our algorithm. Since the algorithm provides rating predictions, we used the root mean square error (RMSE) to measure its accuracy.

We ran Hive queries on Amazon Elastic MapReduce to compute the recommendation power scores and rating predictions.

We built a REST api on Elastic BeanStalk that gets these recommendations from Amazon S3.

For visualization, we built a web page that gets the recommended data using the REST api. These businesses are plotted on the map for better visualization.","m2fname":"Yash","description":"The data provided by Yelp currently can be sometimes overwhelming for the users to make a choice since they have a myriad of choices even for a specific set of businesses.

We have built a recommendation system which predicts what ratings the user would give to a particular business.

We have considered various factors like common businesses rated by the users, total ratings for a business, average user rating, etc. to give our prediction.

These predictions are arranged from highest to lowest to give recommendations to the user.","m1fname":"Siddharth","projectname":"Yelp Recommendation Analysis","m3fname":"Prateek"},{"m2lname":"Wu","m4lname":"","m3uni":"","m1uni":"mc3894","m4uni":"","pid":"201412-65","m2uni":"yw2560","timestring":"Wed Dec 10 22:15:00 2014","m4fname":"","language":"Java, Eclipse,Hadoop","m3lname":"","dataset":"The dataset we choose for back testing is from stock future index of Shanghai Future Exchange ","m1lname":"Chen","industry":"","analytics":"The algorithm is given as combination of moving average convergence divergence(MCAD) and Relative Strength Index(RSI)
MACD: the long average period and the short average period
RSI: the upper threshold, the lower threshold and the calculation period

The project is divided into two parts:
The first is the MapReduce platform, Hadoop
The second is the set of trading strategies used to test the data on this platform
","m2fname":"Yifan","description":"With the spread of internet and the improvement of internet speed and storage capacity, data has been flooding everywhere, and here comes the need of a tool to deal with the scale and complexity of the data. Hadoop is a rapidly growing tool for parallel computing and dealing with raw data, and it uses the idea of MapReduce programming to realize the parallel computing.
For this project, we used a MapReduce program to optimize the parameters of a trading strategy.
Algorithmic trading is the use of computer programs to enter trading orders, in which computer algorithms decide on every aspect of the order, such as its timing, price, and quantity.
Before an algorithm is put into practical use, it is necessary to back test the algorithm using enough historical price data to validate and optimize the algorithm in terms of profitability, stability, etc. ","m1fname":"Meibin","projectname":"Parameters Optimization in Algorithm Trading using Hadoop MapReduce","m3fname":""},{"m2lname":"Gupta","m4lname":"Saingayapally","m3uni":"nnr2107","m1uni":"as4626","m4uni":"ss4728","pid":"201412-49","m2uni":"ag3468","timestring":"Wed Dec 10 22:20:39 2014","m4fname":"Sankalp","language":"Java and Javascript","m3lname":"Rau","dataset":"The data set is raw, and needs to be cleaned and made into csv format. We have a software that does this too. Once this is done, we do analysis on data.","m1lname":"Sridhar","industry":"","analytics":"1) Naive Baised Classification
2) Machine Learning Libraries used (Mahout)
3) Data set sorting through Hive

Repository Info: https://github.com/ss91/big_data_analytics/","m2fname":"Abhaar","description":"Objectives:

1) Used machine learning libraries such as Apache Mahout to perform classifications on raw data sets for banks and states to ensure they have better understanding of customer sentiments

2) Performed data analysis on the data sets using Hive, to give a detailed overview of the banks’ performance from a customer sentiment perspective.

3) Developed our own novel metric system that assigns priorities to customers’ complaints. This helps banks prioritize customers’ problems on specific constraints such as response time, etc.

Importance:

One of the biggest challenges for banks is minimizing customer attrition rate which is directly dependent on customer satisfaction. Customers are inclined to choose the banks who can be trusted for their services. \"Customer is King\"

","m1fname":"Avinash","projectname":"Customer Complaint Analyses","m3fname":"Nachiket"},{"m2lname":"Zhang ","m4lname":"Fang","m3uni":"bd2384","m1uni":"yq2158","m4uni":"zf2150","pid":"201412-60","m2uni":"gz2202","timestring":"Wed Dec 10 22:24:48 2014","m4fname":"Zheng ","language":"Language: HTML, CSS, Javascript, Angular, Bootstrap, Java, PHP Platforms: Eclipse, MySQL, PHP server","m3lname":"Dang","dataset":"We used the Yahoo finance dataset for historical values of NASDAQ stock exchange data and use yahoo API to get the real-time values","m1lname":"Qin","industry":"","analytics":"We encoded the stock time series, found episodes and further implemented the Jaro Winkler Algorithm","m2fname":"Guangyang","description":"Our project is aimed at designing a system that is able to recommend stocks based on customized choices. The recommendation items cover all the NASDAQ stocks. The innovation of our system lies on the point that we smartly switch the problem of stock trend prediction to the problem of finding out the stock which is most similar to a certain chosen stock in terms of a customized criterion. Besides, we are able to give the investor an idea about how the markets are moving and also some unforeseen insights if we get event similarity between stocks. ","m1fname":"Yuechen ","projectname":"Stock Recommendation System","m3fname":"Bowen"},{"m2lname":"Huang","m4lname":"Li","m3uni":"yz2698","m1uni":"rg2930","m4uni":"ml3695","pid":"201412-15","m2uni":"rh2648","timestring":"Wed Dec 10 22:25:24 2014","m4fname":"Mengge","language":"R, Java, Python, Cypher ","m3lname":"Zhang","dataset":"Self-collected web scrapping data of meetup.com. ","m1lname":"Gaur","industry":"","analytics":"Social Network Analytics
Fiedler Method for Social Network Clustering/Community Detection
Gephi and D3 for Network Visualization","m2fname":"Rongyao","description":"This project is motivated by:
- Unique value of event-based social network (EBSN)
Both online and offline social interactions
Commercial value: industrial trends, recommendation of services/ products based on user preference

- Big Fan of Meetup.com
Popularity across academia, industry and recreation
Excellent API: user, group, event, tags – location & time

- Great opportunity to apply big data techniques
Graph database: Neo4j with Cypher
Clustering/ Community Detection
Large scale social network analysis ","m1fname":"Rahul","projectname":"Exploring the Online and Offline Social World","m3fname":"Yiwei"},{"m2lname":"Srinivasan","m4lname":"Ji","m3uni":"mg3419","m1uni":"ak3674","m4uni":"yj2345","pid":"201412-40","m2uni":"cs3315","timestring":"Wed Dec 10 22:30:31 2014","m4fname":"Yuzhong","language":"Python, Beautiful Soup, Apache Mahout, JAVA, MySQL , Amazon Web Server (RDS), node js with express framework(backend), Bootstrap and Jade ","m3lname":"Guan","dataset":"Our dataset will be news articles from BBC news, New York Times, Yahoo News etc. Currently our dataset is small, we are getting news articles from BBC news, and New York Times RSS feed. For each article we have {title,description,article,link}.","m1lname":"Khan ","industry":"","analytics":"KMeans, Canopy Clustering
","m2fname":"Chithra ","description":"Our project aims to create a web based application for people to get to know what is going on around them in an easy and efficient way. By clustering news articles we save are users from the hassle of visiting multiple websites for news and give them a one stop solution. This allows the user to get quick access to all the related news articles of news he or she may find interesting.
Dataset: Our dataset will be news articles from BBC News, New York Times, Yahoo News, etc. Currently our dataset is small; we are getting news articles from the BBC News and New York Times RSS feeds. For each article we have {title,description,article,link}.
","m1fname":"Abdus Samad","projectname":"Nova: News Articles Clustering System","m3fname":"Mingyun "},{"m2lname":"Guo","m4lname":"","m3uni":"mw2917","m1uni":"zw2291","m4uni":"","pid":"201412-57","m2uni":"jg3527","timestring":"Wed Dec 10 22:40:16 2014","m4fname":"","language":"python, java, eclipse","m3lname":"Wang","dataset":"The dataset is we capture from the Douban website, contains the movie, user, and point of the movie","m1lname":"Wang","industry":"","analytics":"item-based recommendation","m2fname":"Jing","description":"With the rapid growth of movie industry, people are facing numerous choices of different kinds of movies. People are overwhelmed by the choices and it may take a lot of time to decide which movie to watch.

Our program can help to recommend the movie to the user.","m1fname":"Zihao","projectname":"Movie Recommendation and analysis of this application","m3fname":"Mingyuan"},{"m2lname":"Guan","m4lname":"Wu","m3uni":"yt2443","m1uni":"el2756","m4uni":"mw2987","pid":"201412-51","m2uni":"yg2392","timestring":"Wed Dec 10 22:40:43 2014","m4fname":"Mengting","language":"Java, Mahout, Hadoop","m3lname":"Tan","dataset":"Yelp Dataset Challenge, including users, businesses, and reviews","m1lname":"Liao","industry":"","analytics":"Sentence segmentation, tokenization, lemmatization, part-of-speech tagging, naive Bayesian classification, latent Dirichlet allocation
","m2fname":"Yuqing","description":"Consumer-Oriented:
Recommend businesses to consumers by their raw text
Business-Oriented:
Classify raw review text into two classes: good reviews and bad reviews
Give commercial advice by keywords in good and bad reviews
","m1fname":"Enrui","projectname":"Yelp Review Analysis and Recommendation","m3fname":"Ying"},{"m2lname":"Rubino","m4lname":"Zong","m3uni":"jx2211","m1uni":"mae2150","m4uni":"sz2477","pid":"201412-33","m2uni":"vjr21","timestring":"Wed Dec 10 22:48:26 2014","m4fname":"Shi","language":"Apache Mahout, OpenCV, Java, MashApe CamFind API, HBase, Hive, D3 and JavaScript, Matlab","m3lname":"Xie","dataset":"Interest levels data set provided by Neuromatters. The dataset is about 32 individuals viewing 64 videos. Video length varies from 15 seconds to one minute.

We also use ImageNet and WordNet to have a try on feature extraction part.","m1lname":"El-Refaey","industry":"","analytics":"MashApe CamFind API with Java. Classification algorithms, Contextual Analysis and Collaborative Filtering (CF) to build the recommendation engine. Visualizations from D3 toolkit for implementing a visualization to represent the output.","m2fname":"Vincent","description":"Objectives: We will implement an interest-based Video Recommendation engine based on “interest levels”.

Innovations: EEG-based “interest level” could, in theory, give insight into video preferences that cannot be easily expressed in words. We could use the EEG-based dataset to make recommendations to users.

Capabilities: Our recommendation engine will provide users a generic EEG-based recommendation that doesn't require users' interaction (i.e. no explicit rating).

Importance: In all, the main contribution of our system is that it seems that no one has combined brain activity with a recommendation engine. Based on EEG patterns, users can receive recommendations for movies they will truly feel are similar, even without having to click or tap anything.","m1fname":"Mohamed","projectname":"EEGoVid: An EEG-Based Interest Level Video Recommendation Engine","m3fname":"Jingwen"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"","m4uni":"","pid":"201412-48","m2uni":"","timestring":"Wed Dec 10 22:52:42 2014","m4fname":"","language":"Java, Javascript","m3lname":"","dataset":"Dataset provided by University of California, Irvine, Center for Machine Learning and Intelligent Systems. It is public but requires citation if used for research.

https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption","m1lname":"","industry":"","analytics":"Regression, JFreeChart, Tableau Software for Visualisation","m2fname":"","description":" Write a Time Series Analysis Yarn Application to run on Hadoop Clusters
- Use Regression to forecast monthly energy consumption in a typical French household
- Application should accept time range into the future to make forecasts
- The output is dumped to file and available for visualization using a tool like Tableau or jfreechart","m1fname":"","projectname":"Minimizing Risk for Energy Arbitrage","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"eb2871","m4uni":"","pid":"201412-68","m2uni":"","timestring":"Wed Dec 10 22:53:02 2014","m4fname":"","language":"Language: Java, Platform: Mac OS X, Linux.","m3lname":"","dataset":"I used Pascal VOC 2009 dataset and Object Detection dataset from Microsoft: http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff91b301/

Also, I used some of my personal photos. ","m1lname":"Barsoum","industry":"","analytics":"Here is what is implemented so far:

Algorithms:
------------
Gaussian: blur a set of images with a Gaussian filter; the Gaussian parameters are set at the command line.
Median: blur a set of images with a median filter; the median parameters are set at the command line.
Thumbnail: create a set of thumbnails from a set of images; the thumbnail size is set at the command line.
FaceDetector: detect all faces in each image in the sequence file and highlight them in the result.
Color2Gray: convert a set of color images into grayscale images.

Analytics:
----------
ImageSearch: content based image search.
ImageSearchTotal: content based image search using total order partition.
FaceStat: Provide a summary about how many images have 0, 1, 2, 3 or more faces.

Visualization:
--------------
I wrote a number of tools that dump the results in a format that is easy to visualize.

","m2fname":"","description":"With the huge amount of images captured every day and uploaded online, the need for a scalable Vision and Image Processing platform that can process, analyze and make sense of such data is becoming more important than ever.

Apache Hadoop was designed to tackle similar problems for text processing by splitting and distributing the workload across a number of clusters built on top of commodity hardware, with each node working on a small subset of the data. One of the benefits of Hadoop is its fault tolerance against unreliable hardware.
Although Hadoop provides a great platform for processing huge amounts of data, it was not designed for processing image data.

Hadoop Vision (HVision) is an open source platform built on top of Hadoop MapReduce, with the goal of giving developers and researchers an easy way to use scalable Computer Vision (CV) algorithms and image processing functionality. In other words, it is OpenCV on Hadoop.
","m1fname":"Emad","projectname":"HVision","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"mw2917","m1uni":"jg3527","m4uni":"","pid":"201412-57","m2uni":"zw2291","timestring":"Wed Dec 10 22:53:35 2014","m4fname":"","language":"Python, Java, Eclipse ","m3lname":"Wang","dataset":"We captured the dataset from the Douban website, contains the movie, user, and point of the movie whichi is evaluated by the user.","m1lname":"Jing","industry":"","analytics":"Item-based recommendation","m2fname":"Zihao","description":"With the rapid growth of movie industry, people are facing numerous choices of different kinds of movies. People are overwhelmed by the choices and it may take a lot of time to decide which movie to watch.

Our program can help to recommend the movie to the user.","m1fname":"Guo","projectname":"Movie Recommendation and analysis of this algorithm","m3fname":"Mingyuan"},{"m2lname":"Singhal","m4lname":"","m3uni":"ncb2127","m1uni":"dmm2215","m4uni":"","pid":"201412-59","m2uni":"as4521","timestring":"Wed Dec 10 22:53:39 2014","m4fname":"","language":"JAVA, Hive, Mahout, Hadoop, Python","m3lname":"Buch","dataset":"This data captures the process of offering incentives (a.k.a. coupons) to a large number of customers and forecasting those who will become loyal to the product. Let's say 100 customers are offered a discount to purchase two bottles of water. Of the 100 customers, 60 choose to redeem the offer. These 60 customers are the focus of this competition. You are asked to predict which of the 60 will return (during or after the promotional period) to purchase the same item again.

To create this prediction, you are given a minimum of a year of shopping history prior to each customer's incentive, as well as the purchase histories of many other shoppers (some of whom will have received the same offer). The transaction history contains all items purchased, not just items related to the offer. Only one offer per customer is included in the data. The training set is comprised of offers issued before 2013-05-01. The test set is offers issued on or after 2013-05-01.

Dataset available on - https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data","m1lname":"Mehta","industry":"","analytics":"Adaptive Logistic Regression,
Online Logistic Regression, and
Mahout’s CrossFoldLearner","m2fname":"Ayushi","description":"Consumer brands often offer discounts to attract new shoppers to buy their products. The most valuable customers are those who return after this initial incented purchase. With enough purchase history, it is possible to predict which shoppers, when presented an offer, will buy a new item. However, identifying the shopper who will become a loyal buyer -- prior to the initial purchase -- is a more challenging task.","m1fname":"Dharmen","projectname":"Acquire Valued Shoppers","m3fname":"Nimai"},{"m2lname":"Wang","m4lname":"","m3uni":"mw2917","m1uni":"jg3527","m4uni":"","pid":"201412-57","m2uni":"zw2291","timestring":"Wed Dec 10 22:54:19 2014","m4fname":"","language":"Python, Java, Eclipse ","m3lname":"Wang","dataset":"We captured the dataset from the Douban website, contains the movie, user, and point of the movie whichi is evaluated by the user.","m1lname":"Jing","industry":"","analytics":"Item-based recommendation","m2fname":"Zihao","description":"With the rapid growth of movie industry, people are facing numerous choices of different kinds of movies. People are overwhelmed by the choices and it may take a lot of time to decide which movie to watch.

Our program can recommend movies to the user.","m1fname":"Guo","projectname":"Movie Recommendation and analysis of this algorithm","m3fname":"Mingyuan"},{"m2lname":"Wang","m4lname":"","m3uni":"mw2917","m1uni":"jg3527","m4uni":"","pid":"201412-57","m2uni":"zw2291","timestring":"Wed Dec 10 22:55:54 2014","m4fname":"","language":"Python, Java, Eclipse ","m3lname":"Wang","dataset":"We captured the dataset from the Douban website, including the movie name, user id, and point of the movie which is evaluated by the user.","m1lname":"Jing","industry":"","analytics":"Item-based recommendation","m2fname":"Zihao","description":"With the rapid growth of movie industry, people are facing numerous choices of different kinds of movies. People are overwhelmed by the choices and it may take a lot of time to decide which movie to watch.

Our program can recommend movies to the user.","m1fname":"Guo","projectname":"Movie Recommendation and analysis of this algorithm","m3fname":"Mingyuan"},{"m2lname":"Hwang","m4lname":"","m3uni":"sy2513","m1uni":"sh3246","m4uni":"","pid":"201412-64","m2uni":"ih138","timestring":"Wed Dec 10 22:56:24 2014","m4fname":"","language":"Hadoop, Spark, Python, and Google Cloud","m3lname":"Yoo","dataset":"Yahoo Finance Tick data for S&P 500","m1lname":"Huh","industry":"","analytics":"Big Data: In-Memory, Distributed File System
Value at Risk: Monte Carlo simulation, Parametric method
","m2fname":"Iljoon","description":"Objective: Developing a real-time risk management system (intraday Value at Risk) for large, complex portfolios in a unified framework

Expected Outcome: A system that calculates stressed VaR, \"what-if\" scenarios, and stress tests on complex portfolios with a large number of underlying risk factors and vectors in real time.

Importance: Risk management is crucial throughout investment and trading activities, from the front trading desk to the back office. However, because of the complexity of calculating VaR for a large multi-asset portfolio, legacy systems cannot deliver VaR in real time. Big Data with in-memory multi-dimensional analytics can resolve this issue.
","m1fname":"Sung Joon","projectname":"Real-time Risk Management System","m3fname":"Sung Woo"},{"m2lname":"SHEN","m4lname":"CAO","m3uni":"fs2523","m1uni":"hz2361","m4uni":"yc2978","pid":"201412-52","m2uni":"ys2821","timestring":"Wed Dec 10 22:59:35 2014","m4fname":"YONGJIE","language":"hadoop mahout java c++","m3lname":"SONG","dataset":"Two datasets is used in the project. One database is from Amazon movie review center, which contains 7911684 movie reviews. Each review contains movie name, movie rating score, date of the review and comments. The other dataset is from Yahoo lab. The dataset contains the rating of different movies from various customers. It contains train text and test text files. Test text files are gathered chronologically after the training data. In the train data there are 211231 ratings from 7642 customers covering 11915 movies. In the test data there are 10136 ratings.","m1lname":"ZOU","industry":"","analytics":"Since we are targeting at both producer and viewer markets, we have to extract related information which could be utilized in the analysis such as key words extraction, cluster and recommendation. For key words extraction, we might want to pick up some of the most frequent informative words and cluster them to see what kind of categories they lie in. As for recommendation, we could simply recommend some related movies to viewers who express his or her potential interest in their previous comments. All these works could be done by Mahout in either local or HDFS mode. ","m2fname":"YUZHE","description":"The project could benefit the audience as well as the movie producers. For audience: According to the movie review data, the low rated movies would be filtered out in terms of categories, in other words, only the top rated movies left in the filtered and categorized data base. The data system could generate and list the top rated movies once the audience specify their preferred category. 
Moreover, the movies could also be classified and listed by audience-preferred actors. For producers: The average rating score of the movies could be calculated with regard to diverse categories. Nevertheless, the data system could summarize the keywords in the review and list the movie names by their rating. For instance, the system could summarize the keyword ‘son / daughter’ in the review, the movies that contain the keyword in the review would be listed and the number of listed movies is counted with their average rating score.","m1fname":"HUI","projectname":"Movie Exploration","m3fname":"FENGYI"},{"m2lname":"Sharpe","m4lname":"Vaidya","m3uni":"is2439","m1uni":"jae2154","m4uni":"pv2259","pid":"201412-63","m2uni":"sbs2193","timestring":"Wed Dec 10 23:02:03 2014","m4fname":"Preeti ","language":"Mahout, Python, Amazon Redshift, Php, Yii, Java/Eclipse, Neo4j","m3lname":"Sayal","dataset":"The dataset we used is private tunings data we got from analytics media group. They are a media analytics company that has access to set top box data from households that tracks everything they are watching on the television. ","m1lname":"Edgerton","industry":"","analytics":"We used item based and user based recommendations with Mahout. All of the data preprocessing/aggregation was done with/in Amazon Redshift using SQL. We are currently trying to build recommendations in Neo4j for comparisons and visualizations with out Mahout. ","m2fname":"Samuel","description":"Not much research has been done in terms of building recommendations for TV series based solely on the viewing behavior, not any implicit user-provided ratings. We were able to use 'pseudo-ratings' to show that we can make good recommendations using just viewing behavior. We built upon this further by integrating social media data to strengthen not only our ratings, but also to provide general trends about various trending series to complement the tunings data.
","m1fname":"Joshua","projectname":"AMG TV Genome Project / Recommendation Engine","m3fname":"Ishaan"},{"m2lname":"Han","m4lname":"","m3uni":"yd2303","m1uni":"hg2388","m4uni":"","pid":"201412-54","m2uni":"th2569","timestring":"Wed Dec 10 23:35:31 2014","m4fname":"","language":"Languages - PHP/MySQL/HTML/JAVA/Gremlin. Platforms - Linux/Apache/MySQL/Gremlin/Eclipse.","m3lname":"Du","dataset":"MovieLens 1M dataset - including 1 million ratings from 6000 users on 4000 movies. http://grouplens.org/datasets/movielens/","m1lname":"Guan","industry":"","analytics":"1. Movie recommendation algorithm implemented in graph database.
2. Online website hosted by an apache server.

","m2fname":"Tian","description":"Objective: Develop our own online movie shopping website, implement movie recommendation functionality, and use Google Analytics for user behavior analysis, which is important for sending advertisements.

Innovations:
1. Using PHP/HTML to build an online movie website.
2. Using Gremlin/Neo4j graph database to do movie recommendation.
3. Using MySQL to manage the movie and customer information.
4. Using Google analytics to do user behavior analysis

Importance:
1. Based on statistics, online recommendation functionality could increase sales by 30%.
2. The online video vendor Netflix announced \"The $1 million Netflix challenge\" to encourage the development of movie recommendation algorithms.
","m1fname":"Hang","projectname":" Google-Analytics, Gremlin - based Online Movie Recommendation System ","m3fname":"Yifan"},{"m2lname":"Parikh","m4lname":"","m3uni":"ps2791","m1uni":"ssb2171","m4uni":"","pid":"201412-45","m2uni":"yp2348","timestring":"Thu Dec 11 00:02:22 2014","m4fname":"","language":"Java, Python, JavaScript, Hive, AWS - S3, RDS, EBS, EMR","m3lname":"Sinha","dataset":"The dataset is taken from the Yelp Dataset Challenge. The link to the dataset is http://www.yelp.com/dataset_challenge. It has over 40,000 businesses, 250,000 users and 1.12 million reviews spanning multiple cities.

The software can support data that has 2 set of components, i.e users and businesses and a rating between those 2 components. Other examples are movie reviews, product reviews. ","m1lname":"Boobna","industry":"","analytics":"We implemented a user based collaborative filtering algorithm using a weighted bipartite graph. We used a recommendation power similarity score instead of other similarity measures like cosine similarity.

Using a 95-5 cross-validation ratio, we analyzed the recommendations of our algorithm. Since the algorithm provides rating predictions, we used the root mean square error (RMSE) to measure its accuracy.

We ran Hive queries on Amazon Elastic MapReduce to compute the recommendation power scores and rating predictions.

We built a REST api on Elastic BeanStalk that gets these recommendations from Amazon S3.

For visualization, we built a web page that gets the recommended data using the REST api. These businesses are plotted on the map for better visualization.","m2fname":"Yash","description":"The data provided by Yelp currently can be sometimes overwhelming for the users to make a choice since they have a myriad of choices even for a specific set of businesses.

We have built a recommendation system which predicts what ratings the user would give to a particular business.

We have considered various factors like common businesses rated by the users, total ratings for a business, average user rating, etc. to give our prediction.

These predictions are arranged from highest to lowest to give recommendations to the user.","m1fname":"Siddharth","projectname":"Yelp Recommendation Analysis","m3fname":"Prateek"},{"m2lname":"Parikh","m4lname":"","m3uni":"ps2791","m1uni":"ssb2171","m4uni":"","pid":"201412-45","m2uni":"yp2348","timestring":"Thu Dec 11 00:06:13 2014","m4fname":"","language":"Java, Python, JavaScript, Hive, AWS - S3, RDS, EBS, EMR","m3lname":"Sinha","dataset":"The dataset is taken from the Yelp Dataset Challenge. The link to the dataset is http://www.yelp.com/dataset_challenge. It has over 40,000 businesses, 250,000 users and 1.12 million reviews spanning multiple cities.

The software can support data that has 2 set of components, i.e users and businesses and a rating between those 2 components. Other examples are movie reviews, product reviews. ","m1lname":"Boobna","industry":"","analytics":"We implemented a user based collaborative filtering algorithm using a weighted bipartite graph. We used a recommendation power similarity score instead of other similarity measures like cosine similarity.

Using a 95-5 cross-validation ratio, we analyzed the recommendations of our algorithm. Since the algorithm provides rating predictions, we used the root mean square error (RMSE) to measure its accuracy.

We ran Hive queries on Amazon Elastic MapReduce to compute the recommendation power scores and rating predictions.

We built a REST api on Elastic BeanStalk that gets these recommendations from Amazon S3.

For visualization, we built a web page that gets the recommended data using the REST api. These businesses are plotted on the map for better visualization.","m2fname":"Yash","description":"The data provided by Yelp currently can be sometimes overwhelming for the users to make a choice since they have a myriad of choices even for a specific set of businesses.

We have built a recommendation system which predicts what ratings the user would give to a particular business.

We have considered various factors like common businesses rated by the users, total ratings for a business, average user rating, etc. to give our prediction.

These predictions are arranged from highest to lowest to give recommendations to the user.","m1fname":"Siddharth","projectname":"Yelp Recommendation Analysis","m3fname":"Prateek"},{"m2lname":"Jain","m4lname":"Jain","m3uni":"nsk2141","m1uni":"rg2936","m4uni":"sj2659","pid":"201412-58","m2uni":"nj2303","timestring":"Thu Dec 11 01:49:12 2014","m4fname":"Sanket","language":"Python, Javascript, jQuery, HTML","m3lname":"Kenkre","dataset":"Yelp Academic Dataset Challenge: http://www.yelp.com/dataset_challenge
","m1lname":"Goel","industry":"","analytics":"Query-based Heatmap - using Google Maps API v3 (Javascript, HTMl, jQuery)
Semantic Analysis - LDA using MALLET, Python
Gamification: tapping user activity using metadata and frequency counts (Python)
","m2fname":"Naman","description":"We perform three tasks: one, to benefit the users; two, to aid the businesses, and three, to help Yelp improve their own product.

I. User Perspective: Query-based HeatMap
• Helps users identify hubs for their interests
• Improves the usability and the look-and-feel of the interface

II. Business Perspective: Semantic Analysis & Topic Modeling
• Can help businesses figure out their strengths and weaknesses
• Monetization based on insights about what attracts the users

III. Yelp Perspective: Gamification
• Helps Yelp increase their customer base, market value, brand name, user loyalty
• Drive better customer retention and lifetime value","m1fname":"Rhea","projectname":"Yelp-er: Analyzing Yelp Data","m3fname":"Natasha"},{"m2lname":"Cai","m4lname":"","m3uni":"jz2606","m1uni":"bw2459","m4uni":"","pid":"201412-42","m2uni":"yc2870","timestring":"Thu Dec 11 01:54:34 2014","m4fname":"","language":"Python, Java","m3lname":"Zhang","dataset":"Dataset from Bitfinex (a trading platform for Bitcoin)
URL: https://www.bitfinex.com/
","m1lname":"Wang","industry":"","analytics":"Logistic Regression is often used to predict a binary response from a binary predictor, used for predicting the outcome of a categorical dependent variable (i.e., a class label) based on one or more predictor variables (features).
","m2fname":"Yufan","description":"Using classification models to give a prediction on whether the asset price will go up or down by incorporating unstructured data stream
","m1fname":"Bowen","projectname":"Trading Using Nonparametric Time Series Classification Models","m3fname":"Junchao"},{"m2lname":"Xu","m4lname":"Zhu","m3uni":"lk2578","m1uni":"zl2335","m4uni":"cz2311","pid":"201412-56","m2uni":"yx2242","timestring":"Thu Dec 11 02:44:46 2014","m4fname":"Cong","language":"JavaScript, Hadoop","m3lname":"Kuang","dataset":"We have collected browser data from 20 contributors at Columbia, 10 from EE and 10 from CS.
For privacy reasons, we will not release it publicly.

","m1lname":"Li","industry":"","analytics":"Cluster users using kmeans method.
Label websites based on the label of their cluster.
Make visualize web page based on d3.js","m2fname":"Yifei","description":"Have you ever been in such a turmoil that you cannot fight against this bizarre and motley internet distracting you?
Our preliminary idea is to apply big data techniques to daily life: we provide a full visualization of your browser events and let you compare yourself with peers, so that you can improve your efficiency and save time.
The extension works by tracing and classifying your browser history, computing the percentage of time you spend on academic exploration, education, entertainment, and shopping, and then rating your concentration against students with a similar background.
","m1fname":"Zheang","projectname":"User's Web Events Analysis Based on Browser Extension","m3fname":"Linjun"},{"m2lname":"Li","m4lname":"DeGiacomo","m3uni":"aj2599","m1uni":"pkd2108","m4uni":"ndd2122","pid":"201412-66","m2uni":"tl2617","timestring":"Thu Dec 11 09:27:20 2014","m4fname":"Nick","language":"Matlab, Hive, Hadoop, AWS, Bigquery","m3lname":"Jahan","dataset":"2012/2013 NYC Taxi Dataset

Obtained from the TA","m1lname":"Dutta","industry":"","analytics":"We used queries in AWS and Bigquery. And, we did some basic statistics work in Matlab and are now using Hive and AWS more. AWS and Bigquery are similar and are being used to check the reproducibility of our analyses. ","m2fname":"Tingting","description":"Objectives: What is the flux of taxis into a given neighborhood over particular periods of time?And, Can we find optimal fixed fare prices for specific (popular) destinations?

Innovations/Capabilities - Cost Analysis

Research Importance - Help taxi drivers and customers make profit/save","m1fname":"Preetam","projectname":"Location Specific Optimization of Taxi Efficiency in NYC","m3fname":"Aamir"},{"m2lname":"Luo","m4lname":"Gao","m3uni":"zg2185","m1uni":"yy2495","m4uni":"hg2357","pid":"201412-38","m2uni":"yl3026","timestring":"Thu Dec 11 09:50:39 2014","m4fname":"Huan","language":"Java Spring Framework, JQuery, Neo4j","m3lname":"Gao","dataset":"We query the Google Custom Search APIs to get the data and our system would generate the users' data gradually.","m1lname":"Yang","industry":"","analytics":"Term Frequency based Classification, graph database based recommendation, Natural Language Topic Extraction.","m2fname":"Yuanhui","description":"With the rapid development of the Internet, dozens of education resources are available on the Internet. It provides abundant information to educators and students. Meanwhile, people seek for more and more knowledge by themselves, not only in their own career areas, but also other fields. However, such education resources are highly disordered and dispersed on the Internet, which impedes people accessing them efficiently. To be specific, when people search for some topics in Google or some other general searching engines, the results may not satisfy their real demands. So it is important for us to design a more dedicated platform to extract meaningful information that fits in their demand. ","m1fname":"Yifan","projectname":"Study Buddy","m3fname":"Zidong"},{"m2lname":"Luo","m4lname":"Gao","m3uni":"zg2185","m1uni":"yy2495","m4uni":"hg2357","pid":"201412-38","m2uni":"yl3026","timestring":"Thu Dec 11 10:36:13 2014","m4fname":"Huan","language":"Java Spring Framework, JQuery, Neo4j ","m3lname":"Gao","dataset":"We query the Google Custom Search APIs to get the data and our system would generate the users' data gradually. 
","m1lname":"Yang","industry":"","analytics":"Term Frequency based Classification, graph database based recommendation, Natural Language Topic Extraction.","m2fname":"Yuanhui","description":"With the rapid development of the Internet, dozens of education resources are available on the Internet. It provides abundant information to educators and students. Meanwhile, people seek for more and more knowledge by themselves, not only in their own career areas, but also other fields. However, such education resources are highly disordered and dispersed on the Internet, which impedes people accessing them efficiently. To be specific, when people search for some topics in Google or some other general searching engines, the results may not satisfy their real demands. So it is important for us to design a more dedicated platform to extract meaningful information that fits in their demand. ","m1fname":"Yifan","projectname":"Study Buddy","m3fname":"Zidong"},{"m2lname":"Xu","m4lname":"Zhu","m3uni":"lk2578","m1uni":"zl2335","m4uni":"cz2311","pid":"201412-56","m2uni":"yx2242","timestring":"Thu Dec 11 13:25:31 2014","m4fname":"Cong","language":"JavaScript, Hadoop","m3lname":"Kuang","dataset":"We have collected browser data from 20 contributors at Columbia, 10 from EE and 10 from CS.
For privacy issue, we will not submit it to public. ","m1lname":"Li","industry":"","analytics":"Cluster users using kmeans method.
Label websites based on their cluster's label.
Make visualize web page based on d3.js","m2fname":"Yifei","description":"Our preliminary idea is to apply big data techniques to our daily life and provide you a fully visualization of your browser events and let you compare with peers so that improve your efficiency and rescue your time.
","m1fname":"Zheang","projectname":"User's Web Events Analysis Based on Browser Extension","m3fname":"Linjun"},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"hf2286","m4uni":"","pid":"201412-55","m2uni":"xs2232","timestring":"Thu Dec 11 14:43:00 2014","m4fname":"","language":"Mahout, JAVA","m3lname":"","dataset":"Reuters Archive;
Wikipedia Documents;","m1lname":"Fu","industry":"","analytics":"Latent Dirichlet Allocation","m2fname":"Xiuwen","description":"Document Classification;
Topic Exploration;
Document Recommendation;
Hot topic trend analysis;","m1fname":"Hao","projectname":"Document Analysis with Latent Dirichlet Allocation","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"aoa2124","m4uni":"","pid":"201412-48","m2uni":"","timestring":"Thu Dec 11 15:21:06 2014","m4fname":"","language":"Java, Javascript","m3lname":"","dataset":"The dataset is public and contains multivariate data types, with date-timestamps, and meter readings of a French household energy consumption over a 4 year period spanning 2006 to 2010.

It is provided by the university of California, Irvine.
","m1lname":"Aladesawe","industry":"","analytics":"Algorithms suggested will be:
Time-series analysis to spot trends
Plots of the data points to identify a pattern or class of models
Regression to learn the underlying generative model of the dataset
Prediction/generation of data from the learned models

Visualization Tools: Jfreechart Java API and Tableau","m2fname":"","description":"- Write a Time Series Analysis Yarn Application to run on Hadoop Clusters

- Use Regression to forecast monthly energy consumption in a typical French household

- Application should accept time range into the future to make forecasts

- The output is dumped to a file and is available for visualization using a tool like Tableau or JFreeChart

- The aim is to abstract away all the complexity of the implementation and give the average computer-literate user a means to do forecasting using very large datasets on any Hadoop cluster

- Provide an easy-to-use, end-user-friendly implementation of Time Series Analysis to run on a Hadoop cluster

- Provide a tool for statisticians to run Time Series examples on Hadoop clusters, just as we could for classification and clustering of newsgroup datasets and articles

- Leverage the ease of use of the tool to help energy traders make minimum-risk forecasts on datasets (Dataset used for testing)

- Convey the results in a easy to interact user interface, using sliders to generate additional predictions as the page advances to the right of the chart","m1fname":"Adeyemi","projectname":"Minimizing Risk for Energy Arbitrage","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"aoa2124","m4uni":"","pid":"201412-48","m2uni":"","timestring":"Thu Dec 11 15:51:51 2014","m4fname":"","language":"Java, Javascript","m3lname":"","dataset":"The dataset is public and contains multivariate data types, with date-timestamps, and meter readings of a French household energy consumption over a 4 year period spanning 2006 to 2010.

It is provided by the University of California, Irvine. ","m1lname":"Aladesawe","industry":"","analytics":"Algorithms suggested will be:
Time-series analysis to spot trends
Plots of the data points to identify a pattern or class of models
Regression to learn the underlying generative model of the dataset
Prediction/generation of data from the learned models

Visualization Tools: Jfreechart Java API and Tableau","m2fname":"","description":"- Write a Time Series Analysis Yarn Application to run on Hadoop Clusters

- Use Regression to forecast monthly energy consumption in a typical French household

- Application should accept time range into the future to make forecasts

- The output is dumped to a file and is available for visualization using a tool like Tableau or JFreeChart

- The aim is to abstract away all the complexity of the implementation and give the average computer-literate user a means to do forecasting using very large datasets on any Hadoop cluster

- Provide an easy-to-use, end-user-friendly implementation of Time Series Analysis to run on a Hadoop cluster

- Provide a tool for statisticians to run Time Series examples on Hadoop clusters, just as we could for classification and clustering of newsgroup datasets and articles

- Leverage the ease of use of the tool to help energy traders make minimum-risk forecasts on datasets (Dataset used for testing)

- Convey the results in an easy-to-use interactive interface, using sliders to generate additional predictions as the page advances to the right of the chart
","m1fname":"Adeyemi","projectname":"Minimizing Risk for Energy Arbitrage","m3fname":""},{"m2lname":"Han","m4lname":"","m3uni":"yd2302","m1uni":"hg2388","m4uni":"","pid":"201412-54","m2uni":"th2569","timestring":"Thu Dec 11 23:44:18 2014","m4fname":"","language":"Languages - PHP/MySQL/HTML/JAVA/Gremlin. Platforms - Linux/Apache/MySQL/Gremlin/Eclipse. ","m3lname":"Du","dataset":"MovieLens 1M dataset - including 1 million ratings from 6000 users on 4000 movies.
http://grouplens.org/datasets/movielens/","m1lname":"Guan","industry":"","analytics":"1. Movie recommendation algorithm implemented in graph database.
2. Online website hosted by an apache server. ","m2fname":"Tian","description":"Objective: Develop our own online movie shopping website/Implement movie recommendation functionality/use google analytics to do the user behavior analysis, which is important for sending advisement.

Innovations:
1. Using PHP/HTML to build an online movie website.
2. Using a Gremlin/Neo4j graph database to do movie recommendation.
3. Using MySQL to manage the movie and customer information.
4. Using Google Analytics to do user behavior analysis.

Importance:
1. Based on statistics, online recommendation functionality could increase sales by 30%.
2. The online video vender, Netflex announced \"The $1 million Netflix challenge\" to encourage the move recommendation algorithm development. ","m1fname":"Hang","projectname":"Google-Analytics, Gremlin - based Online Movie Recommendation System","m3fname":"Yifan"},{"m2lname":"Rajan","m4lname":"","m3uni":"","m1uni":"efj2106","m4uni":"","pid":"201412-2","m2uni":"asr2171","timestring":"Fri Dec 12 22:01:51 2014","m4fname":"","language":"R, Java, Hadoop, Mahout, Caffe","m3lname":"","dataset":"We used the 10-class Yahoo Image Classification Dataset.","m1lname":"Johnson","industry":"","analytics":"We applied multiple image classification and feature extraction techniques as well as utilizing Cafe from UC Berkley.","m2fname":"Anand","description":"With the advent of social media the number of mobile pictures being taken and uploaded is increasing exponentially. Although most photos are uploaded with some basic metadata: date, time, camera model, and possibly geo-location - a great deal of details are missing when they enter the cloud. Unless users physically go through and “tag” each image this can create a search nightmare!

In order to make image search more effective, it will be important to develop and utilize advanced algorithms to help auto-tag images. Doing so can help narrow down image search and improve the quality of search results.

Leveraging the Yahoo! Labs Flickr dataset, we plan to test and build upon feature extraction methods, utilizing a parallelized computing system to efficiently extract image characteristics.
Using these image characteristics we will train and test the image classification of these images and evaluate them based on precision. Going beyond this step we also plan to experiment with a GPU powered processing system to evaluate added benefits and performance benchmarks that might be had during the image analysis stage over a standard distributed system.

","m1fname":"Eric","projectname":"Image Classification in the Cloud and GPU (H-Classification & G-Classification)","m3fname":""},{"m2lname":"Zhu","m4lname":"","m3uni":"","m1uni":"to2232","m4uni":"","pid":"201412-4","m2uni":"kz2232","timestring":"Fri Dec 12 23:41:53 2014","m4fname":"","language":"Java, PHP, R, Pig, and Javascript ","m3lname":"","dataset":"We used the genomic and the clinical profiles from the 3,316 patients in TCGA data set. Since we transformed the data set into the common CSV format, our programs can take any RNAseq/microarrary-based genomic data if needed. ","m1lname":"Ou Yang","industry":"","analytics":"We accomplished the following deliverables:

1. Map-Reduce Concordance Index Algorithm
We created the map-reduce concordance index function by mapping all possible combinations of the predictions into pairs and checking whether each pair is valid and concordant with the response. In the reduce step, we collect the valid and concordant pairs.

We reduced the input size by 66% by hashing the patient's status and survival time into one floating-point number and then sorting the hashed responses in order of prediction. As a result, we transmit only an array of floating-point numbers instead of a matrix.

2. Patient-Based Treatment Recommendation Engine
We created our own treatment recommendation engine as follows. First, compute the similarity between the patients in the data set and the profile uploaded by the user. Second, sort the most similar profiles by expected survival time. Finally, return the treatment plan received by the patient with the longest expected survival.

The algorithm was validated using leave-one-out cross-validation (LOOCV) over the 3,316 cancer patients, which yielded an 87.12% true positive rate.

3. D3.js-based visualization for the Bayesian networks
We created the Bayesian network of the 30 genes using the bnlearn package in R and visualized it using D3.js on our website. The subnetworks were validated using a publicly available gene ontology analysis tool.

4. PHP-Based Interactive Visualization
We created an interactive facility using PHP and R. Because we created the REST API for our concordance index function and the recommendation engine, we demonstrate them as a web service.","m2fname":"Kaiyi","description":"The objective of our project is to analyze the massive cancer genome profiles to create the model for diagnosis and treatment suggestion. Because of the feature number and the size of the samples, the analysis could only be done with the Big Data Analytics tools.
In our project, we implemented the similarity functions used in biomedical research. Our innovations are:

1. We created the map-reduce concordance index function and we successfully reduced the input size by 66%.
2. We identified 30 genes that are related to patient outcomes using the toolkit and visualized their network. We also validated that they represent the cancer hallmarks.
3. We created a patient-based treatment recommendation engine using R, which returns the analysis and suggestion, along with a plot, within one second.
4. We created the web API and the web facility for the recommendation engine and the concordance index function.

This project is important because there was no diagnosis and treatment planning tools were created on the Pan-Cancer basis. And the tools we provided were not implemented on Hadoop platforms before. The project may not only shed a light on the molecular pathologies of the complicated malignancies and the potential therapeutic regimens but provide a tool for medical professionals to save lives. ","m1fname":"Tai-Hsien","projectname":"Network Analysis on the Big Cancer Genome Data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc3783","m4uni":"","pid":"201412-73","m2uni":"","timestring":"Wed Dec 17 01:12:08 2014","m4fname":"","language":"JAVA, Mahout, and OSX Mavericks were used","m3lname":"","dataset":"I used Yahoo Labs Movie ratings data from http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=4. It's a public data, and I received permissions to use them.
Another dataset I used is the MovieLens 100k version from http://grouplens.org/datasets/movielens/
","m1lname":"Choi","industry":"","analytics":"Algorithms : Collaborative filtering user based recommendation, Weighting algorithms for different content-based factors I developed.
System: A recommendation system in which collaborative filtering generates candidate recommendations, and factor layers based on user preferences weigh the candidates to make a final decision based on the weights.
Analytics : Various experiments were conducted. I used training datasets to generate recommendations from the new system, and made a comparison to test datasets. The percentage of matching records were calculated for different factors. Collaborative filtering results were used as a control group while other results with different factors were compared.","m2fname":"","description":" In today’s world, the recommendation system is prevalent in any e-commerce or search engine website. Users get inundated with product recommendations in such websites. Most websites, however, provide preliminary matching on previous browsing history. Machine learning technique is used in recommendation approach. Collaborative filtering based on relation between users and items is introduced but the collaborative filtering was not far from being perfect.
Making movie recommendation much more personalized way is the goal of this project. To generate accurate recommendation results, more data from each user is essential while not overwhelming users for such data collection. I designed an algorithm to account for user’s other rating data such as demographics and content in addition to collaborative filtering technique. With these enhancements, user-item similarities can be more accurate as more factors are considered while users can receive more personalized recommendations. ","m1fname":"Jimin","projectname":"Improving Movie Recommender System with User Behavior changes and Demographics","m3fname":""},{"m2lname":"Maharishi","m4lname":"Naganath","m3uni":"ag3202","m1uni":"anm2147","m4uni":"akn2114","pid":"201412-8","m2uni":"em2852","timestring":"Fri Dec 19 06:49:34 2014","m4fname":"Aditya","language":"MongoDB, Mongo MapReduce, Node.js, Amazon EC2, AWS Elastic Beanstalk, Objective-C ","m3lname":"Gangopadhyay","dataset":"Our app uses crowdsourced data to generate the best path from one starting location to and end location. We generated simulated data based on pre-populated beginning and end coordinates for our paths, which we stored in the database ","m1lname":"Mishra","industry":"","analytics":"Comparing two points
• Compute the Euclidean distance between the two points
• Use an epsilon radius to determine if the two points are essentially the same
• Latitudes and longitudes in a local area differ by ~10^-4. We thus set our epsilon cutoff to be an order of magnitude higher, i.e., 10^-3
Comparing two paths
• Paths are represented as a “bag of points”, where each point is a latitude-longitude coordinate
• Use the Jaccard index (Tanimoto similarity metric) to determine similarity between paths
• Paths should be considered the same if they have many shared points.
• We thus use a Tau cutoff of 0.98 for the determination of similarity of two paths
Path clustering algorithm
• Algorithm loosely derived from k-means clustering
• First, bucket paths based on the Tanimoto similarity metric
• Then, denote the largest bucket as the highest-voted path
E6893 Big Data Analytics – Final Project Presentation
© 2014 CY Lin, Columbia University
Big Data Algorithms
MapReduce algorithm
• Documents are stored in JSON in MongoDB in the form {
start : [ latitude, longitude ],
end : [ latitude, longitude ],
points : [ [lat,long], [lat,long] ... ]
}
• The map() function emits a concatenation of truncated start and end latitudes and longitudes as the key, and points array as the value
• The reduce() function computes clusters on the sets of points and outputs a random set (all are essentially equal) from the largest cluster
• MapReduce is useful to parallelize our computation of the buckets and implement scalability within our system
Determining a path from CoreLocation data
• Goal was to transform a continuous sequence of points into discrete paths
• Implemented a buffering mechanism based on whether the user was static or moving to delimit the start and end of a given path

For the system modules, we used a Node.js server and MongoDB to store the paths. AWS Elastic Beanstalk/EC2 is used for cloud deployment and load balancing. For the visualization, we used the iOS 8 SDK.
","m2fname":"Esha","description":"GoogleMaps, Waze, and other path-recommendation services use a limited and pre-defined set of attributes to determine the worth of a path
We wish to utilize the rich knowledge that individuals have of their familiar environments to recommend paths
We do not measure any attributes directly, but rather assume that a user taking a path is “voting” for that path as holistically better than any other
By dynamically updating our “knowledge” of an environment through end users’ choices, we can provide a nuanced, insider recommendation of the current optimal path between two locales

What did we do?
• Replace traditional centrally-orchestrated route suggestion methods (GoogleMaps, Waze) with crowdsourced route suggestions
• Our app considers a person taking a particular route as a “vote” for that route, and then suggests to users the route with the most “votes”
Rationale
• Locals in an area 1) constitute the majority of routes taken 2) know the area best
• Therefore, the highest number of votes should go to the route voted best by locals
What makes it unique?
• Crowdsourced knowledge of environments provides detailed and nuanced information
• Human senses and perception are far more holistic than any specific metrics
• Crowdsourced data provides real-time updates on environment status
• construction on road, fallen tree, snow/ice not yet been shoveled ","m1fname":"Abhinav ","projectname":"PeopleMaps","m3fname":"Anirban"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"eb2871","m4uni":"","pid":"201412-68","m2uni":"","timestring":"Fri Dec 19 14:23:57 2014","m4fname":"","language":"Java, Platform: Mac OS X, Linux, AMI (Amazon EMR).","m3lname":"","dataset":"I used Pascal VOC 2009 dataset.

Object Detection dataset from Microsoft, which contained labeled images (I used them for BOW training):
http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff91b301/

Also, I used some of my personal photos.","m1lname":"Barsoum","industry":"","analytics":"Algorithms:
------------
Gaussian: blur a set of images with a Gaussian filter; the Gaussian parameters are set at the command line.
Median: blur a set of images with a median filter; the median parameters are set at the command line.
Thumbnail: create a set of thumbnails from a set of images; the thumbnail size is set at the command line.
FaceDetector: detect all faces in each image in the sequence file and highlight them in the result.
Color2Gray: convert a set of color images into grayscale images.
Dilate: dilate all images in an HVision sequence file.
Erode: erode all images in an HVision sequence file.

Analytics:
----------
ImageSearch: content based image search.
ImageSearchTotal: content based image search using total order partition.
FaceStat: Provide a summary about how many images have 0, 1, 2, 3 or more faces.
BOW Classification Trainer: Run the SVM trainer using BOW descriptors from a BOW cluster model and a labeled HVision sequence file. The result is one model per label.

Visualization:
--------------
I wrote a number of tools that dump the result in a format that is easy to visualize.

Tools:
-----
1. Pack a number of images into an HVision sequence file.
2. Unpack an HVision sequence file into a number of images.
3. Dump the result of the face detector.
4. Pack a number of labeled images into an HVision sequence file (the label is stored in the metadata).
5. Create a Bag-of-Words (BOW) cluster model from a set of images.
Extract the result of BOW SVM classification into OpenCV compatible XML file.","m2fname":"","description":"With the huge amount of images captured every day and uploaded online, the need for a scalable Vision and Image Processing platform that can process, analyze and make sense of such data is becoming more important than ever.

Apache Hadoop MapReduce was designed to tackle a similar problem for text processing by splitting and distributing the workload across a number of clusters built on top of commodity hardware, with each node working on a small subset of the data. One of the benefits of Hadoop is its fault tolerance against unreliable hardware.

Although Hadoop provides a great platform for processing huge amounts of data, it was not designed for processing image data.

Hadoop Vision (HVision) is an open source platform built on top of Apache Hadoop MapReduce, with the goal of providing an easy way to use scalable Computer Vision (CV) algorithms and Image Processing functionalities for developers and researchers. In other words, it is OpenCV on Hadoop.
","m1fname":"Emad","projectname":"HVision (Hadoop Vision Platform)","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"eb2871","m4uni":"","pid":"201412-68","m2uni":"","timestring":"Fri Dec 19 14:30:29 2014","m4fname":"","language":"Java, Platform: Mac OS X, Linux, AMI (Amazon EMR). ","m3lname":"","dataset":"I used Pascal VOC 2009 dataset.

Object Detection dataset from Microsoft, which contained labeled images (I used them for BOW training):
http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff91b301/

Also, I used some of my personal photos. ","m1lname":"Barsoum","industry":"","analytics":"Analytics: Algorithms:
----------------------
Gaussian: blur a set of images with a Gaussian filter; the Gaussian parameters are set at the command line.
Median: blur a set of images with a median filter; the median parameters are set at the command line.
Thumbnail: create a set of thumbnails from a set of images; the thumbnail size is set at the command line.
FaceDetector: detect all faces in each image in the sequence file and highlight them in the result.
Color2Gray: convert a set of color images into grayscale images.
Dilate: dilate all images in an HVision sequence file.
Erode: erode all images in an HVision sequence file.

Analytics:
----------
ImageSearch: content based image search.
ImageSearchTotal: content based image search using total order partition.
FaceStat: Provide a summary about how many images have 0, 1, 2, 3 or more faces.
BOW Classification Trainer: Run the SVM trainer using BOW descriptors from a BOW cluster model and a labeled HVision sequence file. The result is one model per label.

Visualization:
--------------
I wrote a number of tools that dump the result in a format that is easy to visualize.

Tools:
-----
1. Pack a number of images into an HVision sequence file.
2. Unpack an HVision sequence file into a number of images.
3. Dump the result of the face detector.
4. Pack a number of labeled images into an HVision sequence file (the label is stored in the metadata).
5. Create a Bag-of-Words (BOW) cluster model from a set of images.
6. Extract the result of BOW SVM classification into an OpenCV-compatible XML file.
","m2fname":"","description":"With the huge amount of images captured every day and uploaded online, the need for a scalable Vision and Image Processing platform that can process, analyze and make sense of such data is becoming more important than ever.

Apache Hadoop MapReduce was designed to tackle a similar problem for text processing by splitting and distributing the workload across a number of clusters built on top of commodity hardware, with each node working on a small subset of the data. One of the benefits of Hadoop is its fault tolerance against unreliable hardware.

Although Hadoop provides a great platform for processing huge amounts of data, it was not designed for processing image data.

Hadoop Vision (HVision) is an open source platform on the top of Apache Hadoop MapReduce, with the goal of providing an easy way to use scalable Computer Vision (CV) algorithms and Image Processing functionalities for developers and researchers. In other word, it is OpenCV on Hadoop. ","m1fname":"Emad","projectname":"HVision","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"eb2871","m4uni":"","pid":"201412-68","m2uni":"","timestring":"Fri Dec 19 16:03:22 2014","m4fname":"","language":"Language: Java, Platform: Mac OS X, Linux, AMI (Amazon EMR).","m3lname":"","dataset":"Dataset: I used Pascal VOC 2009 dataset.

Object Detection dataset from Microsoft, which contained labeled images (I used them for BOW training):
http://research.microsoft.com/en-us/downloads/b94de342-60dc-45d0-830b-9f6eff91b301/

Also, I used some of my personal photos.","m1lname":"Barsoum","industry":"","analytics":"Algorithms:
-----------
Gaussian: blur a set of images with a Gaussian filter; the Gaussian parameters are set at the command line.
Median: blur a set of images with a median filter; the median parameters are set at the command line.
Thumbnail: create a set of thumbnails from a set of images; the thumbnail size is set at the command line.
FaceDetector: detect all faces in each image in the sequence file and highlight them in the result.
Color2Gray: convert a set of color images into grayscale images.
Dilate: dilate all images in an HVision sequence file.
Erode: erode all images in an HVision sequence file.

Analytics:
----------
ImageSearch: content based image search.
ImageSearchTotal: content based image search using total order partition.
FaceStat: Provide a summary about how many images have 0, 1, 2, 3 or more faces.
BOW Classification Trainer: Run an SVM trainer using BOW descriptors computed from a BOW cluster model and a labeled HVision sequence file. The result is one model per label.

Visualization:
--------------
I wrote a number of tools that dump the results in a format that is easy to visualize.

Tools:
-----
1. Pack a number of images into an HVision sequence file.
2. Unpack an HVision sequence file into a number of images.
3. Dump the result of the face detector.
4. Pack a number of labeled images into an HVision sequence file (the label is stored in the metadata).
5. Create a Bag-of-Words (BOW) cluster model from a set of images.
6. Extract the result of BOW SVM classification into OpenCV compatible XML file. ","m2fname":"","description":"With the huge amount of images captured every day and uploaded online, the need for a scalable Vision and Image Processing platform that can process, analyze and make sense of such data is becoming more important than ever.

Apache Hadoop MapReduce was designed to tackle a similar problem for text processing by splitting and distributing the workload across a number of clusters built on top of commodity hardware, with each node working on a small subset of the data. One of the benefits of Hadoop is its fault tolerance against unreliable hardware.

Although Hadoop provides a great platform for processing huge amounts of data, it was not designed for processing image data.

Hadoop Vision (HVision) is an open source platform on top of Apache Hadoop MapReduce, with the goal of providing an easy way to use scalable Computer Vision (CV) algorithms and Image Processing functionalities for developers and researchers. In other words, it is OpenCV on Hadoop. ","m1fname":"Emad","projectname":"HVision","m3fname":""},{"m2lname":"Terzis","m4lname":"Zhang","m3uni":"tw2229","m1uni":"ohz2101","m4uni":"jfz2107","pid":"201412-70","m2uni":"jt2514","timestring":"Sun Dec 21 10:38:44 2014","m4fname":"Jimmy","language":"R, Python, Mahout, Java, Spark, Git, AWS on Ubuntu ","m3lname":"Wu","dataset":"Purchased dataset of price quotes on equities, exchange traded futures, and market indices over the last 15 years at the 1 minute granularity level.

The software should be able to support any financial data pre-processed into a similar format","m1lname":"Zhou","industry":"","analytics":"K Means Clustering
Random Trees
R Visualization of the data","m2fname":"John","description":"Understanding volatility in financial markets has long been of interest to hedge and speculators. The objective of this project is to model volatility in the market and perform feature selection and regressions to identify relationships between market volatility and various factors.","m1fname":"Oliver","projectname":"Financial Market Volatility","m3fname":"Tim"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jrb2211","m4uni":"","pid":"201412-76","m2uni":"","timestring":"Sun Dec 21 10:52:33 2014","m4fname":"","language":"The primary Scratch 2.0 platform, implemented in ActionScript, offers a web-based Initial Learning Environment in which children and adults develop fluency in computational thinking by dragging and interlocking blocks representative of programming constructs. Scratch 2.0 files, of extension type .sb2, are compressed in binary format, but extract into a set of resources and human-readable JSON (JavaScript Object Notation). Scratch Analyzer, programmed in Java, produces outputs compatible with a variety of big data infrastructures including Hadoop, Mahout, and Neo4j.","m3lname":"","dataset":"The Scratch website (scratch.mit.edu) exposes many projects created and shared by \"Scratchers\" who frequently enjoy learning from one another by remixing each other's ideas and code. These files have proven sufficient data sources for completing the final project. 
The long-term research agenda involves analyzing thousands of Scratch projects archived in the Computer Clubhouse Village (www.clubhousevillage.org), as well as evaluating student project portfolios as learning progresses during innovative instructional units in order to evidence alignment (or advise realignment) with standards, curricula, and personalized goals.","m1lname":"Bender","industry":"","analytics":"Scratch Analyzer allows sets of Scratch projects to serve as data inputs to Hadoop Distributed File System, Mahout's collaborative filtering algorithms, and Neo4j's graph database and visualization infrastructure. Scratch's compatibility with these analytic tools enables researchers and practitioners to develop learning systems which: 1) empower students to discover recommended projects, peers and mentors; and 2) offer educators means to ascend the instructional value chain by automating the identification of gaps both in curricular design and individual learning via evaluative project clustering and classification.","m2fname":"","description":"For learning-system designers who aim to infuse computational thinking within K-12 curricula, Scratch Analyzer is an open-source software suite that distills Scratch projects into inputs fit for insightful educational data mining and learning analytics. When fully integrated within an innovative learning system, Scratch Analyzer positions Scratch, the eminent visual programming environment for novices, as a game-based learning platform interconnected with an attainable future in which students engage in properly-paced, enjoyable, adaptive, and personalized mastery of the computational thinking concepts driving economic and social progress. Architected modularly in order to maximize flexibility and extensibility, the system consists of three components: Scratch Extractor, Scratch Dispatcher, and Scratch Traverser. 
Scratch Extractor identifies Scratch blocks serialized as JSON and strips irrelevant syntax while preserving hierarchical structure. Scratch Dispatcher transforms extracted blocks into simple CSVs compatible with recommendation engines built using a variety of big data technologies. Scratch Traverser follows the hierarchical structure of extracted Scratch blocks and executes a user-supplied method for each block. In combination, these components enable Scratch Analyzer to automate time-consuming instructional and learning responsibilities by offering teachers, and their students, opportunities to access timely formative and summative feedback as they strive together to achieve durable learning.","m1fname":"Jeff","projectname":"Scratch Analyzer","m3fname":""},{"m2lname":"Aligbe","m4lname":"","m3uni":"","m1uni":"tkp2108","m4uni":"","pid":"201412-19","m2uni":"ma2799","timestring":"Mon Dec 22 01:39:24 2014","m4fname":"","language":"Java, Mahout","m3lname":"","dataset":"We use the free intra-day datasets provided at http://www.histdata.com.","m1lname":"Paine","industry":"","analytics":"We implement real time graphs of time series and various metrics measured from them. We also build a recommendation system which uses historical data to forecast price movement, and suggest BUY, SELL, or HOLD. ","m2fname":"Mark","description":"Every day, currency markets generate millions of data points at millisecond granularity. Traders use this data to make trading decisions, executing trades at points in time where they believe they can make a profit off the difference between the price their clients are willing to pay, and the price they receive. Independent traders can make money by exploiting trends in price movement. This trading requires the brokers to have access to not only the data itself, but also technical indicators and real time analytics which can give them insight. Thus, a large amount of processing must be done on large quantities of data, all in real time.

The problem of processing and analyzing these large quantities of data in real time is well served by the tools of big data analytics. We propose and implement an open source, real time streaming, visualization, and analytics engine for processing currency data. Our system is able to aggregate multiple streams of currency data, visualize it with updating graphs, and leverage Apache Mahout to identify currency trends and attempt to predict future price movement.
","m1fname":"Tim","projectname":"Currency Data Aggregation, Visualization, and Analytics Engine","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ta2408","m4uni":"","pid":"201412-72","m2uni":"","timestring":"Mon Dec 22 17:58:10 2014","m4fname":"","language":"Java, Neo4j","m3lname":"","dataset":"A very large dataset of over 1 billion records was used. The records consist of names, addresses, and other information from US households spanning more than 100 years. The data is not public and was provided by special permission.","m1lname":"Adams","industry":"","analytics":"Graph database, statistical software from Apache Commons.","m2fname":"","description":"Optical character recognition (OCR) systems used to capture historical information from archive documents face the specialized problem of generating datasets whose tokens may have no external ground truth reference, thereby making supervised learning methods inapplicable. A system that requires no ground truth references (e.g., lexicons or training sets) to correct misread tokens would be ideal for integration into the information extraction process.","m1fname":"Thomas","projectname":"Error Correction in Large Volume OCR Datasets","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"twk2113","m4uni":"","pid":"201412-69","m2uni":"","timestring":"Mon Dec 22 18:24:56 2014","m4fname":"","language":"HortonWorks HDP 2.1 Hadoop; pig; hive; R; Mahout","m3lname":"","dataset":"A collection of NMON (http://nmon.sourceforge.net/pmwiki.php) performance log files for 2030 Linux servers with a history of 90 days was made available by the author's current employer. The data set is applicable to other organizations, but can be generated by any organization using the open source NMON utility.","m1lname":"Kellogg","industry":"","analytics":"HoltWinters; K-Means, Fuzzy K-Means clustering, Mahout, R, igraph R package

Extract, Load, Transform (ELT) for NMON metric data to Hadoop with Pig log-to-table transformation.","m2fname":"","description":"A system consisting of a software infrastructure and methodology utilizing Hadoop, Pig, Hive, R and Mahout components is presented. The system is designed to facilitate a parallel processing identification of aberrant performance behavior found in time series data for large groups of servers. Cluster analysis is used for discovery of possible high interest servers or small groups of servers out of the very large initial server population.","m1fname":"Tad","projectname":"Using Big Data (Hadoop) for Identification of Aberrant Behavior Clusters in Server Performance Time Series Data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jap2216","m4uni":"","pid":"201412-71","m2uni":"","timestring":"Mon Dec 22 20:37:55 2014","m4fname":"","language":"Java Spring MVC, Javascript, Hadoop, HBase, Hive","m3lname":"","dataset":"The Kindergarten Class of 1998-99 (ECLS-K) data set used in this final project was retrieved from the U.S. Department of Education’s National Center for Education Statistics (NCES). NCES runs a program called the Early Childhood Longitudinal Study which includes three longitudinal studies that examine child development, school readiness, and early school experiences. The first of these studies is known as the Birth Cohort (ECLS-B) and it provides data on children born in 2001 and followed from birth through kindergarten entry. The Kindergarten Class of 2010-11 (ECLS-K:2011) study provides data on a kindergarten class of 2010-11 students from kindergarten through the fifth grade. The ECLS-K study selected for this final project provides data on a 1998-99 class of kindergarten students around the country through eighth grade.

The longitudinal data in ECLS-K allows researchers to study how a wide range of family, school, community, and individual factors are associated with school performance. The children in the study come from public and private schools and are from diverse socioeconomic and racial/ethnic backgrounds. The children’s parents, teachers, and school administrators also participated in the study. All participants provided information on children’s cognitive, social, emotional, and physical development. The data provided by the study also include information about the children’s home environment, home educational activities, school environment, and classroom environments, among others.","m1lname":"Pava","industry":"","analytics":"Mahout's Canopy clustering and k-means clustering are used to split the students into groups based on features selected by a user. Weighted averages are used to project a student's assessment scores in reading, mathematics, and science.

D3.js (also known as Data-Driven Documents) is an opensource JavaScript library for creating HTML, SVG, and CSS documents using data. D3.js is fast across modern browsers, and supports large datasets for interaction and animation. This library is used to create node graph and parallel coordinate visualizations.","m2fname":"","description":"Big data analytics has long enabled entities with vast computing power to systematically process large and complex data so that meaningful projections about the future can be made based on the past. With the introduction of open-source tools like Hadoop, however, vast computing power is no longer as big a constraint as it once was. Instead, clusters of commodity hardware may be used to distribute storage and processing so that costs may be kept low. One such entity that benefits from this lowered barrier to entry is the United States school system. Parents and teachers often find out too late that their students are at risk of low academic performance and may not have enough time to remedy the student’s chances of success for life after school. Big data analytics can help. This final project will propose a system that leverages longitudinal data provided by the Department of Education to project a student’s academic performance and enable stakeholders to identify changes that have helped similar students in the past to improve their academic performance.","m1fname":"Jairo","projectname":"Improving Education for At-Risk Students","m3fname":""},{"m2lname":"SHEN","m4lname":"CAO","m3uni":"fs2523","m1uni":"hz2361","m4uni":"yc2978","pid":"201412-52","m2uni":"ys2821","timestring":"Mon Dec 22 21:16:38 2014","m4fname":"YONGJIE","language":"hadoop mahout java c++ mysql","m3lname":"SONG","dataset":"Database is from Amazon movie review center, which contains 7911684 movie reviews. 
Each review contains movie name, movie rating score, date of the review and comments.","m1lname":"ZOU","industry":"","analytics":"Since we are targeting at both producer and viewer markets, we extract related information which could be utilized in the analysis such as key words extraction, cluster and recommendation. For key words extraction, we might want to pick up some of the most frequent informative words and cluster them to see what kind of categories they lie in. As for recommendation, we could give recommend movies through user-based/item-based recommendation algorithm. The keywords selection involves kmeans clustering algorithm. The further data processing use the functions of database. (MySQL)","m2fname":"YUZHE ","description":"The project could benefit the movie audience as well as the movie producers. For audience, according to the movie review data, the data system could generate and list the top rated movies once the audience specify their preferred category. Moreover, the movies could also be clustered and listed by audience-preferred keywords such as \"spy\", \"kungfu\". Then after users input their preference, more specific results will be output for users. For producers, the average rating score of the movies could be calculated with regard to diverse categories. Nevertheless, the data system could summarize the keywords in the review and list the keyword popularity from the keyword-related category. For instance, the system could calculate and assign the keyword \"flood\" a weighted score, and the movies that contain that keyword would be considered prediction scores with keywords scores. More keywords the producers choose in the system, theoretically, more accurate prediction the website will give. 
","m1fname":"HUI","projectname":"Mo","m3fname":"FENGYI"},{"m2lname":"SHEN","m4lname":"CAO","m3uni":"fs2523","m1uni":"hz2361","m4uni":"yc2978","pid":"201412-52","m2uni":"ys2821","timestring":"Mon Dec 22 21:30:57 2014","m4fname":"YONGJIE","language":"hadoop mahout java c++ mysql ","m3lname":"SONG","dataset":"Database is from Amazon movie review center, which contains 7911684 movie reviews. Each review contains movie name, movie rating score, date of the review and comments.
","m1lname":"ZOU","industry":"","analytics":"Since we are targeting at both producer and viewer markets, we extract related information which could be utilized in the analysis such as key words extraction, cluster and recommendation. For key words extraction, we might want to pick up some of the most frequent informative words and cluster them to see what kind of categories they lie in. As for recommendation, we could give recommend movies through user-based/item-based recommendation algorithm. The keywords selection involves kmeans clustering algorithm. The further data processing use the functions of database(MySQL). ","m2fname":"YUZHE","description":"The project could benefit the movie audience as well as the movie producers. For audience, according to the movie review data, the data system could generate and list the top rated movies once the audience specify their preferred category. Moreover, the movies could also be clustered and listed by audience-preferred keywords such as \"spy\", \"kungfu\". Then after users input their preference, more specific results will be output for users. For producers, the average rating score of the movies could be calculated with regard to diverse categories. Nevertheless, the data system could summarize the keywords in the review and list the keyword popularity from the keyword-related category. For instance, the system could calculate and assign the keyword \"flood\" a weighted score, and the movies that contain that keyword would be considered prediction scores with keywords scores. More keywords the producers choose in the system, theoretically, more accurate prediction the website will give. 
","m1fname":"HUI","projectname":"Movie Exploration","m3fname":"FENGYI"},{"m2lname":"Aligbe","m4lname":"","m3uni":"","m1uni":"tkp2108","m4uni":"","pid":"201412-19","m2uni":"ma2799","timestring":"Mon Dec 22 23:06:14 2014","m4fname":"","language":"Java, Mahout ","m3lname":"","dataset":"We use the free intra-day datasets provided at http://www.histdata.com. ","m1lname":"Paine","industry":"","analytics":"We implement real time graphs of time series and various metrics measured from them. We also build a recommendation system which uses historical data to forecast price movement, and suggest BUY, SELL, or HOLD. Details can be found in the slides, or on the report.","m2fname":"Mark","description":"Every day, currency markets generate millions of data points at millisecond granularity. Traders use this data to make trading decisions, executing trades at points in time where they believe they can make a profit off the difference between the price their clients are willing to pay, and the price they receive. Independent traders can make money by exploiting trends in price movement. This trading requires the brokers to have access to not only the data itself, but also technical indicators and real time analytics which can give them insight. Thus, a large amount of processing must be done on large quantities of data, all in real time.

The problem of processing and analyzing these large quantities of data in real time is well served by the tools of big data analytics. We propose and implement an open source, real time streaming, visualization, and analytics engine for processing currency data. Our system is able to aggregate multiple streams of currency data, visualize it with updating graphs, and leverage Apache Mahout to identify currency trends and attempt to predict future price movement. ","m1fname":"Tim","projectname":"Currency Data Aggregation, Visualization, and Analytics Engine","m3fname":""},{"m2lname":"Terzis","m4lname":"Zhang","m3uni":"tw2229","m1uni":"ohz2101","m4uni":"jfz2107","pid":"201412-70","m2uni":"jt2514","timestring":"Tue Dec 23 02:22:33 2014","m4fname":"Jimmy","language":"R, Python, Mahout, Java, Apache Spark, Git, AWS on Ubuntu, Bash","m3lname":"Wu","dataset":"Dataset: We have procured a massive dataset of price quotes on equities, exchange traded futures, futures, and market indices over the span of the last ten to fifteen years at the one minute granularity level. In addition to price quotes on specific instruments, our dataset features derivative indicators of price and volume activity.

It was purchased and is not available to the public. The project should be able to support datasets that have the same variables attached to the data.","m1lname":"Zhou","industry":"","analytics":"mapreduce, logistic regressions with Mahout (Stochastic Gradient Descent), k-means cluster with Rhadoop, clustering, regularized ridge regression ","m2fname":"John","description":"Understanding volatility in financial markets has long been of interest to hedgers and speculators. Empirical evidence has shown us that volatility is a highly nonlinear evolving process. Modeling this process using the Hadoop ecosystem can offer tremendous advantages over traditional econometric models that are limited to datasets which fit in main memory.
Financial firms need the ability to quickly and accurately assess volatility. Utilizing large historical datasets with various forms of big data processing methods provides them with the ability to effectively evaluate their exposure to risk.
This project seeks to apply various solutions to analyze transactional data with simulated market behavior to gain a better understanding of what factors influence risk and volatility at the symbol level. This will allow the user to gain insight into high frequency volatility and to better understand the associated risk factors in order to make more informed decisions.
","m1fname":"Oliver","projectname":"Financial Market Volatility","m3fname":"Tim"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"aoa2124","m4uni":"","pid":"201412-48","m2uni":"","timestring":"Mon Dec 29 15:54:02 2014","m4fname":"","language":"Java","m3lname":"","dataset":"The dataset is public and contains multivariate data types, with date-timestamps, and meter readings of a French household energy consumption over a 4 year period spanning 2006 to 2010.

It is provided by the University of California, Irvine. ","m1lname":"Aladesawe","industry":"","analytics":"Analytics: Algorithms suggested will be:
Time-series analysis to spot trends
Plot of the data-points to learn a pattern or class of models
Regression to learn of underlying generative model of the dataset
Predict/generate data from the models learned

Visualization Tools: Output is written to File","m2fname":"","description":"- Write a Time Series Analysis Application

- Use Regression to forecast seasonal or monthly energy consumption in a typical French household

- Application should accept time range into the future to make forecasts

- The output is dumped to file and available for visualization

- Provide an easy to use and end-user friendly implementation of Time Series Analysis

- Leverage the ease of use of the tool to help energy traders make minimum-risk forecasting on datasets (Dataset used for testing)","m1fname":"Adeyemi","projectname":"Minimizing Risk for Energy Arbitrage","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"fw2253","m4uni":"","pid":"201412-62","m2uni":"","timestring":"Mon Dec 29 23:42:59 2014","m4fname":"","language":"Java, Android SDK, Linux, Mahout","m3lname":"","dataset":"Yahoo Labs - Yahoo! Search Marketing Advertiser Bidding Data, version 1.0","m1lname":"Wang","industry":"","analytics":"Mahout Recommendation Algorithm, Android SDK graphics suite","m2fname":"","description":"This project aims to explore a concept of providing advertisers who are bidding for advertisement spaces more intelligence when they are bidding for a product that they are not familiar with. The implementation is done on an Android mobile based platform, in order to cater towards the busy lifestyle of modern professionals.","m1fname":"Kevin","projectname":"Suggestive Advertisement Bidding Mobile Platform","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"mw2917","m1uni":"jg3527","m4uni":"","pid":"201412-57","m2uni":"zw2291","timestring":"Tue Dec 30 00:53:51 2014","m4fname":"","language":"Python, Java, Eclipse","m3lname":"Wang","dataset":"We captured the data set from the Douban website, including the movie name, user id, and the rating each user gave the movie, using Python code written by our team member.
","m1lname":"Guo","industry":"","analytics":"We used item-based recommendation algorithm.","m2fname":"Zihao ","description":"With the rapid growth of movie industry, people are facing numerous choices of different kinds of movies. People are overwhelmed by the choices and it may take a lot of time to decide which movie to watch. Our program can recommend movies to the users.","m1fname":"Jing ","projectname":"Movie Recommendation and Analysis of This Application","m3fname":"Mingyuan"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jg3460","m4uni":"","pid":"201412-74","m2uni":"","timestring":"Tue Dec 30 01:05:21 2014","m4fname":"","language":"Python, VBA , Java, Hadoop pig","m3lname":"","dataset":"log file which is in csv file format contained in ETS Internal IBT database ( standard tests data warehouse and test taker behavior tracking system)
","m1lname":"Gu","industry":"","analytics":"log data cleaning and grouping
Statistical analysis (descriptive statistics and t-test)
Regression model analysis (logistic regression)
","m2fname":"","description":"Test takers often change their selections or choices of questions during a test. It is therefore interesting to examine the effects of this behavior through analysis of log data, in particular whether test takers gain more points when they change their choice. Moreover, we can also gain more insight into test taker behavior.

1. Helping test developers understand IBT test taker behavior regarding choice switching.
2. Helping test takers prepare for the standardized IBT test more effectively and efficiently.
3. Improving the reliability and validity of IBT adaptive test questions by analyzing the log data.
","m1fname":"Jiaming","projectname":"Big Data Analysis on Log Data of Standardized IBT Test Taker for Effects of Selection Changing","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ph2325","m4uni":"","pid":"201412-77","m2uni":"","timestring":"Tue Dec 30 10:47:14 2014","m4fname":"","language":"Apache Hive-Hive SQL/Apache Mahout-Python/Excel-VBA","m3lname":"","dataset":"In this project, the author used data provided by the USA Official Social Security Website, (http://www.ssa.gov/OACT/babynames/). The original datasets consist of 51 text files; each represents the 50 states and the District of Columbia, respectively. Each record in a file has the format: 2-digit state code, gender (M = male or F = female), 4-digit year of birth (starting with 1910), the 2-15 character name, and the number of occurrences of the name. Fields are delimited with a comma. Each file is sorted first on gender, then year of birth, and then on number of occurrences in descending order. When there is a tie on the number of occurrences names are listed in alphabetical order. Due to privacy concerns, the list of names is restricted to those with at least 5 occurrences. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.","m1lname":"Huang","industry":"","analytics":"Item-based recommender
User-based recommender
NearestNUser Neighbourhood
Threshold Neighbourhood
Pearson Correlation Similarity
Euclidean Distance Similarity
LogLikelihood Similarity
Tanimoto Coefficient Similarity
Hive SQL: Group by/Distribute/Sort/Distinct/Sum/Average
Excel Pivot Table","m2fname":"","description":"The basis of this project was to use various machine learning and data mining techniques learned in the Big Data Analytics class to solve a simple, but applicable social problem: How can parents pick “good” names for their new-born babies. The purpose of choosing this particular topic is to offer some alternative use cases for the knowledge and skills learned in this class, and to demonstrate how big data analytics, something that are purely technical and scientific, can be used to answer real-life questions.","m1fname":"Pei","projectname":"How to name your new-born baby(babies)?","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ma3246","m4uni":"","pid":"201412-39","m2uni":"","timestring":"Sun Jan 4 10:16:50 2015","m4fname":"","language":"Hadoop(Java and streaming PERL), MATLAB","m3lname":"","dataset":"Datasets used are located here:
http://legacy-www.swpc.noaa.gov/ftpmenu/warehouse.html

They include RSGA and SRS files which are tarred and zipped for each year. ","m1lname":"allen","industry":"","analytics":"Streaming PERL was used to re-format the files into 1-line files to facilitate processing via Hadoop. Then a number of Java map/reduce jobs were developed to parse the relevant data from the files and to perform analytics on that data (e.g. finding average values per month or calculating residuals between actual and predicted data). Then the data was imported into Matlab in order to further analyze the data, such as clustering and computing long term trends of the data.","m2fname":"","description":"The objective of this endeavor is to analyze historical space weather datasets in the hopes of finding patterns that will aid in future prediction of this data. Prediction of space atmospheric data is relevant to many aspects of satellite missions. Datasets covering a wide range of atmospheric phenomena dating as far back as 1966 are available at the National Oceanic and Atmospheric Administration (NOAA) website. In particular, the quality of predictions several days in the future will be analyzed in order to gauge the predictive capacity of current models.","m1fname":"matt","projectname":"Analysis of Historical Space Weather Data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ph2325","m4uni":"","pid":"201412-77","m2uni":"","timestring":"Sun Jan 4 17:56:22 2015","m4fname":"","language":"Apache Hive-Hive SQL/Apache Mahout-Python/Excel-VBA","m3lname":"","dataset":"In this project, the author used data provided by the USA Official Social Security Website, (http://www.ssa.gov/OACT/babynames/). The original datasets consist of 51 text files; each represents the 50 states and the District of Columbia, respectively. Each record in a file has the format: 2-digit state code, gender (M = male or F = female), 4-digit year of birth (starting with 1910), the 2-15 character name, and the number of occurrences of the name. 
Fields are delimited with a comma. Each file is sorted first on gender, then year of birth, and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. Due to privacy concerns, the list of names is restricted to those with at least 5 occurrences. If a name has fewer than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count. ","m1lname":"Huang","industry":"","analytics":"Item-based recommender
User-based recommender
NearestNUser Neighbourhood
Threshold Neighbourhood
Pearson Correlation Similarity
Euclidean Distance Similarity
LogLikelihood Similarity
Tanimoto Coefficient Similarity
Hive SQL: Group by/Distribute/Sort/Distinct/Sum/Average
Excel Pivot Table ","m2fname":"","description":"The basis of this project was to use various machine learning and data mining techniques learned in the Big Data Analytics class to solve a simple, but applicable social problem: How can parents pick “good” names for their new-born babies. The purpose of choosing this particular topic is to offer some alternative use cases for the knowledge and skills learned in this class, and to demonstrate how big data analytics, something that are purely technical and scientific, can be used to answer real-life questions.","m1fname":"Pei","projectname":"How to name your new born babies","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"kjw2146","m4uni":"","pid":"201412-1","m2uni":"","timestring":"Thu Jan 8 06:16:09 2015","m4fname":"","language":"Java code and 3rd party Java libraries are used. Apache Hadoop, Apache Lucene, and berkeleylm are the 3rd party libraries.","m3lname":"","dataset":"The data set is private. I tested language model generation with a data set containing transcripts from FAA controllers in the National Air Space. The software is not specific to transcripts of this nature, though there are some parameters in the Lucene text analysis and tokenization code that may need to be altered if text with annotations are used.","m1lname":"White","industry":"","analytics":"The data is used to extract a model of the language (i.e., word sequences seen in training data) for use in a speech recognizer. In a MapReduce job, the Apache Lucene library is utilized to perform transcript analysis and segmentation as the input to a language model estimation tool (i.e., berkeleylm). 
This is initial work toward creating the statistical models necessary for speech recognition in a distributed environment, and toward efficiently storing and accessing that data during runtime operations.","m2fname":"","description":"Performance of speech recognition systems is strongly dependent on the process of adapting general tools to a specific task. The adaptation process generally utilizes audio and transcription data from the domain of interest to bias the recognizer for strong performance in that domain. The amount of training data modern acoustic and language models can utilize to increase expected performance before diminishing returns set in has continued to grow. This growth in training data places additional demands on the speech system builder to store this data and use it to train statistical models. The speech analytics software library is proposed to aid the speech system builder in managing this data explosion in a distributed environment.","m1fname":"Kyle","projectname":"Speech Analytics Toolkit","m3fname":""},{"m2lname":"Yan","m4lname":"","m3uni":"","m1uni":"cz2346","m4uni":"","pid":"201505-18","m2uni":"jy2654","timestring":"Mon May 18 15:48:34 2015","m4fname":"","language":"Java, Python, PHP, HTML, CSS, JavaScript, MySQL, AmazonEC2","m3lname":"","dataset":"We collected the data ourselves through an Android-based smartphone.","m1lname":"Zhao","industry":"","analytics":"Naive Bayes, kNN, SVM

We apply these three widely-used machine learning algorithms to build our recognition model.","m2fname":"Junkai","description":"Develop an Android application to collect sensor data from the phone and send to the web
server.

Using the Android API to get the real-time activity information and send to the web server.
Collect data from the phone and use machine learning tools to create a suitable model to
recognize the current activity from sensor information.

Build a web server to receive data from the phone, predict the activity through the model and
display the result on the webpage in real time. We will focus on SIX activities: Standing,
Sitting, Jogging, Walking, Walking down stairs, Walking up stairs.

Compare the results of our own model and Android API. Calculate the accuracy of each
model. ","m1fname":"Chenze","projectname":"Real-time Human Activity Recognition","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"xz2408","m4uni":"","pid":"201505-20","m2uni":"dl2943","timestring":"Mon May 18 16:12:06 2015","m4fname":"","language":"Python","m3lname":"","dataset":"User information and tweets collected via Twitter API.
AFINN-111","m1lname":"Zhou","industry":"","analytics":"Our Algorithms can be divided into two parts:
Status analysis: Analyze the frequency and the sentiment of tweets, as well as their repeated words, punctuation, uppercase words and average length.
User analysis: Analyze the user’s profile and friends, including following/follower ratio, status count and account creation time.
Visualizations:
Word cloud and Friend network using D3.js","m2fname":"Di","description":"The goal of our project is to tell whether a person is reliable or not by analyzing his/her social media account based on a rating mechanism. We will design some algorithms to compute a score for each candidate person. A person with a higher score is more likely to be an authority. ","m1fname":"Xinyi","projectname":"Social Media Reliability Analysis","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"xz2408","m4uni":"","pid":"201505-20","m2uni":"dl2943","timestring":"Mon May 18 16:12:29 2015","m4fname":"","language":"Python","m3lname":"","dataset":"User information and tweets collected via Twitter API.
AFINN-111","m1lname":"Zhou","industry":"","analytics":"Our Algorithms can be divided into two parts:
Status analysis: Analyze the frequency and the sentiment of tweets, as well as their repeated words, punctuation, uppercase words and average length.
User analysis: Analyze the user’s profile and friends, including following/follower ratio, status count and account creation time.
Visualizations:
Word cloud and Friend network using D3.js","m2fname":"Di","description":"The goal of our project is to tell whether a person is reliable or not by analyzing his/her social media account based on a rating mechanism. We will design some algorithms to compute a score for each candidate person. A person with a higher score is more likely to be an authority. ","m1fname":"Xinyi","projectname":"Social Media Reliability Analysis","m3fname":""},{"m2lname":"Chuang","m4lname":"","m3uni":"","m1uni":"sl3833","m4uni":"","pid":"201505-3","m2uni":"hc2751","timestring":"Mon May 18 16:26:41 2015","m4fname":"","language":"Python, HTML, CSS, JavaScript","m3lname":"","dataset":"Yelp is providing all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.

Business Objects
Business objects contain basic information about local businesses.

Review Objects
Review objects contain the review text, the star rating, and information on votes Yelp users have cast on the review.

","m1lname":"Lin","industry":"","analytics":"We use Python for parsing the original data. D3.js, jQuery and JavaScript for visualization. We implement two interactive application of this dataset. First one is sunbrust diagram, showing the direct graph between schools, business, and number of reviews. The second one is map graph, showing the geographic view of distribution of these business, and the detail information of businesses.","m2fname":"Hao-Hsiang","description":"We want to provide an interesting and interactive virtualization application for the Yelp's Academic Dataset by using JavaScript with D3 (Data-Driven Document) library. \u000bTo let user directly access to the data, and pass clear informations to user. For us, it's a really good chance to practice combining web front-end and data analysis. Since the capabilities of browsers and devices are dramatically improved these years, front-end application has become more dominated. These could also become a good example for those who wants to do the combination of front-end and data analytics.","m1fname":"Sun-Yi","projectname":"Visualization of Yelp's Academic Dataset","m3fname":""},{"m2lname":"Cao","m4lname":"","m3uni":"","m1uni":"rk2797","m4uni":"","pid":"201505-23","m2uni":"lc2847","timestring":"Mon May 18 18:09:50 2015","m4fname":"","language":"Java, Eclipse, Maven, Twitter4j, AlchemyAPI, adminLTE","m3lname":"","dataset":"Twitter Data. We extracted the data as required","m1lname":"Kompella","industry":"","analytics":"1. Data extraction
2. Data filtering
3. Data parsing
4. adminLTE dashboard for visualizing","m2fname":"Lingyuan ","description":"Objective:

Using text obtained from tweets, the system should be able to sense the emotion behind each tweet, visualize it as an estimate of the sentiment, and offer suitable improvements over existing methods.

Tools:

Java, Maven, Eclipse, Twitter4j, AlchemyAPI, adminLTE

","m1fname":"Rama Subbaraya Kashyap","projectname":"Dynamic Sentiment Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"efj2106","m4uni":"","pid":"201505-10","m2uni":"","timestring":"Mon May 18 18:28:09 2015","m4fname":"","language":"pySpark, AlchemyAPI, Python NLTK Package, R, Hadoop","m3lname":"","dataset":"For headline analysis I web-scraped Yahoo! Finance using R and performed sentiment analysis using the Python NLTK package to analyze the sentiment and for the Twitter portion I used the AlchemyAPI at IBM within pySpark. For the historical options data and volatility/contract analysis I used a Bloomberg Terminal. ","m1lname":"Johnson","industry":"","analytics":"Insert","m2fname":"","description":"The objective of this project was to adapt and extend the concept of social sentiment analysis into an investment strategy. By using public sentiment (Tweets and Yahoo! Finance Headlines) as an signal for earnings performance and then optimizing the contract selection (Call or Put) based on the recent trading volume and the predicted change in stock price we could isolate which stocks are more accurately correlated with sentiment and what appropriate investment strategies should be taken. 
","m1fname":"Eric","projectname":"NLP of Public Sentiment for Earning Calls","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ta2408","m4uni":"","pid":"201505-26","m2uni":"","timestring":"Mon May 18 19:03:39 2015","m4fname":"","language":"Spark, Java","m3lname":"","dataset":"A dataset of over 1 billion records were tested, as provided by a private institution for research purposes.","m1lname":"Adams","industry":"","analytics":"Bayesian statistical inference, un-materialized sub-graph traversals.","m2fname":"","description":"Optical character recognition (OCR) systems used to capture historical information from archive documents face the specialized problem of generating datasets whose tokens may have no external ground truth reference, thereby making supervised learning methods inapplicable.

An ideal system would allow the generation, usage, and introspection of corpus-specific inference engines, capable of automatic correction of misread tokens and requiring no ground truth references (e.g., lexicons or training sets).
","m1fname":"Thomas","projectname":"Enhanced Error Correction in Large Volume OCR Datasets","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ss4609","m4uni":"","pid":"201505-22","m2uni":"","timestring":"Mon May 18 19:23:59 2015","m4fname":"","language":"Python, NLTK, Python-Twitter, Windows, Tableau","m3lname":"","dataset":"We used twitter data. We downloaded it using python-twitter which is google's wrapper around the rest api.","m1lname":"Shrivastava","industry":"","analytics":"TF-IDF
Tech-IB Scoring
AFINN
Tableau
Time Series Graph
Heat Map
Geo Map
Word Cloud","m2fname":"","description":"In the world of investment banking, analysts mainly look at company information to prepare their pitch books and do pricing and other investment banking related activities. Generally they look at historical data, which is generated monthly if not quarterly. This project aims to provide them with analysis and visualizations based on the Twitter feed for such companies so that they can leverage this information to make better-informed decisions than their competitors.

There are three main objectives of this project:
1. Create an application to download and generate a clean set of company tweets data.
2. Create a scoring algorithm specific to analyzing technology companies.
3. Develop interesting and actionable visualizations for investment banking analysts

The main innovation of this project is mining data from Twitter that can be used to generate actionable insights for investment bankers. I created a Tech-IB score methodology which helps with this specific analysis. I also created interactive visualizations so that analysts can drill down and filter information on the fly.

This project is important because it will provide investment bankers with client intelligence, pitch book materials, and on-the-fly company analysis based on real-time Twitter data, thus augmenting their company analysis.","m1fname":"Shreyas","projectname":"Company Analysis for Investment Banking using Twitter Data","m3fname":""},{"m2lname":"Chen","m4lname":"","m3uni":"","m1uni":"ym2491","m4uni":"","pid":"201505-4","m2uni":"xc2291","timestring":"Mon May 18 22:41:36 2015","m4fname":"","language":"Python, Twitter Streaming API","m3lname":"","dataset":"1.Real-time stream of tweets containing the keyword Uber, collected using the Twitter Streaming API

2.Sentiment score dataset generated in step 1","m1lname":"Ma","industry":"","analytics":"We used the Twitter Streaming API to stream real-time tweets containing the keyword Uber. After streaming for 3 days, we had a 300.4Mb raw tweet dataset. Since almost every tweet in the dataset has location information, we wrote a Python program to extract the tweets located in specific cities (e.g. New York, Los Angeles). We then extracted 4 datasets whose tweets are located in New York, Los Angeles, San Francisco, and Dallas. We obtain the average sentiment score toward Uber in each of these four cities by running our sentiment analysis Python code on the four datasets separately.
Our sentiment analysis code relies on a sentiment dictionary (Fig.3) obtained from the Internet, which contains almost all adjectives along with sentiment scores ranging from -6 to 6. The algorithm first finds all the sentiment words in each tweet’s text, then looks them up in the sentiment dictionary and computes the average sentiment score of each tweet, and finally computes the average sentiment score over all the tweets in the dataset.
","m2fname":"Xi","description":"1.Tweets represent people’s true feeling.
2.Public sentiment on an event reflects a trend.
3.Companies need to know what people think about their products.
4.Why not considering Tweets as part of Marketing Strategy.","m1fname":"Yunge ","projectname":"Products Marketing Strategy by Analyzing Twitter Keyword Sentiments","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"mt2994","m4uni":"","pid":"201505-14","m2uni":"","timestring":"Mon May 18 22:59:24 2015","m4fname":"","language":"python 3","m3lname":"","dataset":"Twitter dataset: from free Twitter API
Stock prices dataset: from free Yahoo Finance API
RSS dataset: from the Huffington website","m1lname":"Tian","industry":"","analytics":"The whole application is built in Python 3. Stock prices are pulled from the Yahoo Finance API, which is free and offers many time-interval options. The data are visualized using the matplotlib library. Similarly, Twitter data are pulled from the free Twitter API, and sentiment analysis is applied to them, mainly using the nltk library, to extract keywords. The knowledge base for sentiment analysis is stored in SQLite, a SQL database accessed from Python. The keywords are then used to select RSS feed data from the Huffington website. The GUI is built with tkinter. ","m2fname":"","description":"Getting into stocks and shares can be quite puzzling. To decide which stock to invest in, one needs to gather all kinds of information, which can be particularly tedious work. Thanks to big data technology, we can gather and filter useful information programmatically. This is the idea behind the Stock Market Assistant Application: automatically gathering information about stocks, which may save investors a lot of effort.
Instead of gathering a huge volume of data that may overwhelm users, my intention was to provide specifically “useful” information that will greatly contribute to users’ decision-making. By “useful” I mean information that may predict whether stock prices go up or down. Through some research, I found that messages on Twitter can affect stock prices. This led to the idea of applying big data methods to run sentiment analysis on Twitter data and using the results as keywords to select information for users.
There are basically two kinds of sentiment analysis: one classifies an opinion as positive, negative, or neutral; the other is subjectivity/objectivity identification, which classifies a given text into one of two classes: objective or subjective. I chose the latter for two reasons: first, it works better for the goal of finding keywords for information filtering, and second, it is hard to tell how sentiment will influence the stock market. For example, even though we may classify a certain opinion as negative, it won’t necessarily cause the stock price to go down, since that depends greatly on what kind of event the opinion comments on.","m1fname":"Mengying","projectname":"Stock Market Assistant Application","m3fname":""},{"m2lname":"Han","m4lname":"","m3uni":"","m1uni":"yd2302","m4uni":"","pid":"201505-7","m2uni":"th2569","timestring":"Tue May 19 01:22:09 2015","m4fname":"","language":"Languages - Python/PHP/MySQL/HTML/. Platforms - Spark/Linux/Apache/MySQL. ","m3lname":"","dataset":"MovieLens 1M dataset - including 1 million ratings from 6000 users on 4000 movies. http://grouplens.org/datasets/movielens/
Large Movie Review Dataset v1.0 - including 50,000 reviews along with their associated sentiment labels. http://ai.stanford.edu/~amaas/data/sentiment/
Rotten Tomatoes Movie Reviews Dataset - including 15,000 unlabeled reviews from critics for 2000 movies. Built by ourselves.","m1lname":"Du","industry":"","analytics":"1. Movie review sentiment analysis.
2. Natural language processing.
3. Naive Bayes classifier.
4. Feature extraction by detecting bigrams. ","m2fname":"Tian","description":"Objective: Develop our own online movie shopping website/Implement movie review sentiment analysis to improve user experience

Innovations:
1. Using PHP/HTML to build an online movie website.
2. Using MySQL to manage the movie and customer information.
3. Implementing the review sentiment analysis system with very high accuracy.
4. Building the movie dataset which contains movie basic information along with associated critics reviews.

Importance:
1. Improve the experience of users browsing the movie website by saving the time they spend reading reviews.
2. Provide a method to implement sentiment analysis with high accuracy, which not only can be performed for movie reviews, but also for other kinds of text processing. ","m1fname":"Yifan","projectname":"Movie Review Sentiment Analysis System for Movie Website","m3fname":""},{"m2lname":"Ji","m4lname":"","m3uni":"","m1uni":"zx2170","m4uni":"","pid":"201505-9","m2uni":"yj2345","timestring":"Tue May 19 01:47:07 2015","m4fname":"","language":"C++, JAVA, JS","m3lname":"","dataset":"We use the dataset from Meetup.com to simulate the preference of users. Meetup.com provide the user id and groups id which the user joined. We write a java program to pull data from meetup.com. Meetup.com server provide APIs for data collection including group ID, tag ID, event ID, user ID as well as the detailed information about event, groups and users. For our system, we hope to use the relationship between users and groups to simulate people’s fond for apartments and houses. So we remain the user ids and group ids for computing convenience. To get the detailed information about the users or groups, we can send url request to meet up API and get the response with detailed information.","m1lname":"Xu","industry":"","analytics":"We use collaborative filtering algorithm for recommend system. We use two kinds of different algorithms including neighborhood-based CF and item-based CF. For roommate matching, we use cosine-based similarity and pearson's correlation. For visualization, we use Neo4j to present the users and groups relationship network. The most important thing is that we apply our system to the real website for recommendation. ","m2fname":"Yuzhong","description":"The amount of information in the world is increasing far more quickly than our ability to process it. Technology has been developed to reduce the barriers to publishing and distributing information. In our former project, we built a website for house rental around universities like Columbia University or NYU. 
One of the features of our website is that everyone can collect information on houses or apartments and discuss it with others who are interested in the same apartment. This is really beneficial for students who want to share apartments with others. We all know that finding a good roommate is important and living with someone you don’t like is very annoying. So, building on the original idea, we want to take it further and recommend roommates and apartments to people who are looking for them. The project is particularly important for our apartment rental website. ","m1fname":"Zhicheng","projectname":"Roommates and Apartments Recommendation System","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"gvs2106","m4uni":"","pid":"201505-2","m2uni":"","timestring":"Tue May 19 02:10:23 2015","m4fname":"","language":"Python, Spark, Django","m3lname":"","dataset":"The dataset was acquired using the Riot Games API. It contains about 45000 ranked solo queue games and is about 1.5 GB. I had to write a script that uses the Riot API to generate the data. The data is not public, but with a development key can be similarly generated. As this project is specifically targeted at League of Legends, this is the only data it supports.","m1lname":"Saldanha","industry":"","analytics":"I used several different methods of classification: SVM with gradient descent, logistic regression with gradient descent, logistic regression with limited memory BFGS, decision trees, and random forests. These were used to predict which team would win based on their champion picks. The test error was compared to that of a baseline model, which used the average champion win rates of each team to predict the winner. I also wrote a simple Django web server to display the results and provide an interface for the system.","m2fname":"","description":"The goal of this project was to predict the winner of a ranked solo queue game of League of Legends. 
As millions of people play it every day, this software could be useful to a large number of people. There are over 150 trillion possible matchups; therefore no player has nearly enough experience to know whether his team or his opponent's is stronger compositionally. There is, however, a record of every single game ever played stored by Riot Games. A portion of this can be used to train classifiers that can predict which team comp is stronger. Although a stronger team comp does not necessarily equate to winning, it is certainly a large advantage. Furthermore, while teams are still choosing their champions, they can use the other part of this tool, which outputs the best champion to pick next - taking into account your team's other picks thus far and those of the opponent.","m1fname":"Gavin","projectname":"League of Legends Prediction","m3fname":""},{"m2lname":"Jiang","m4lname":"","m3uni":"","m1uni":"sz2476","m4uni":"","pid":"201505-19","m2uni":"yj2338","timestring":"Tue May 19 02:38:20 2015","m4fname":"","language":"Python, Java, Matlab, Twitter4j, Scikit","m3lname":"","dataset":"We used the Twitter Streaming API for Twitter data collection. The Yelp dataset is from the Yelp Dataset Challenge.","m1lname":"Zhang","industry":"","analytics":"Sentiment Feature Quantification:
One-dimensional quantification (Polarity)
Multi-dimensional quantification (Coordination)

Clustering and Recommendation:
K-means clustering
Cluster prediction: compute cluster centers and predict cluster index for each sample.
Matlab for clustering result visualization

","m2fname":"Yongchen","description":"Objectives: Twitter and Yelp data
Innovation: we propose an original recommendation method based on emotion feature quantification and make recommendations to users by clustering and cluster prediction. The idea of mapping of recommendation is also original. Another highlight is that the sentiment analysis is implemented in both one and multiple dimensions. The multi-dimensional sentiment analysis describes sentiment more precisely.","m1fname":"Siyuan ","projectname":"Emotion quantification and Recommendation","m3fname":""},{"m2lname":"Jiang","m4lname":"","m3uni":"","m1uni":"sz2476","m4uni":"","pid":"201505-19","m2uni":"yj2338","timestring":"Tue May 19 02:39:28 2015","m4fname":"","language":"Python, Java, Matlab, Twitter4j, Scikit","m3lname":"","dataset":"We used the Twitter Streaming API for Twitter data collection. The Yelp dataset is from the Yelp Dataset Challenge.","m1lname":"Zhang","industry":"","analytics":"Sentiment Feature Quantification:
One-dimensional quantification (Polarity)
Multi-dimensional quantification (Coordination)

Clustering and Recommendation:
K-means clustering
Cluster prediction: compute cluster centers and predict cluster index for each sample.
Matlab for clustering result visualization

","m2fname":"Yongchen","description":"Objectives: Twitter and Yelp data
Innovation: we propose an original recommendation method based on emotion feature quantification and make recommendations to users by clustering and cluster prediction. The idea of mapping of recommendation is also original. Another highlight is that the sentiment analysis is implemented in both one and multiple dimensions. The multi-dimensional sentiment analysis describes sentiment more precisely.","m1fname":"Siyuan ","projectname":"Emotion quantification and Recommendation","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"ph2439","m4uni":"","pid":"201505-8","m2uni":"xz2410","timestring":"Tue May 19 03:18:39 2015","m4fname":"","language":"Java, Matlab, Python","m3lname":"","dataset":"Types of Datasets:
(1)Movement: accelerometer, gyrometer
(2)Environment: temperature, light, location
(3)Usage Pattern:SMS, phone call
(4)Voice: speech [from Interactive Emotional Dyadic Motion Capture (IEMOCAP) database]","m1lname":"Hu","industry":"","analytics":"Tools: Android Studio, SQLite, Matlab(LibSVM 3.20, NNTools)

Algorithm: SVM, Neural Network, Linear regression, Logistic regression","m2fname":"Xuan","description":"Objectives: In this project, we aim to develop our own emotion prediction algorithm and a mobile app that lets users track their current emotion and evaluate their long-term emotion patterns. This can help them establish a healthy lifestyle and improve their performance at work and in their studies.

Innovations: emotion prediction model

Capabilities: predict users' current emotion and present their recent emotion trend

Why important:
Wide scope of applications: Mobile Therapies, Personalized Recommendations, and Mood Sharing.","m1fname":"Peiran","projectname":"EmoMining: An Emotion Detection System based on Speech and Multiple Sensors","m3fname":""},{"m2lname":"Chen","m4lname":"","m3uni":"","m1uni":"ml3662","m4uni":"","pid":"201505-12","m2uni":"dc3026","timestring":"Tue May 19 03:46:11 2015","m4fname":"","language":"Python, Javascript","m3lname":"","dataset":"Real-time Instagram, Foursquare data and Yelp API dataset.","m1lname":"Lin","industry":"","analytics":"We do two kinds of analysis: sentimental analysis and word semantic similarity analysis.
The word semantic similarity analysis is based on WordNet. Algorithms such as the Shortest Path Based Measure, Wu & Palmer’s Scaled Measure, Leacock & Chodorow’s Measure and Li’s Measures are used.
The sentiment analysis uses chi-squared test and classification.
The recommendation result is shown in a customized Google Map.","m2fname":"Duo","description":"With the popularization of personal computers and smartphones, there’s a growing demand for online travel information and recommendation services. Although a couple of applications and websites already offer such services, their deficiencies are obvious. We want to launch a customized website that gives users recommendations based on real-time social media streaming data. Users do not need to know where they are going; all they need to do is type whatever comes to mind. The website will recommend some places to hang out.","m1fname":"Miao","projectname":"My Travel Agent","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ly2324","m4uni":"","pid":"201505-17","m2uni":"","timestring":"Tue May 19 06:36:07 2015","m4fname":"","language":"Matlab / MAC","m3lname":"","dataset":"Oxford Building Image Set","m1lname":"Lin","industry":"","analytics":"Color region extraction

Voting Scoring
","m2fname":"","description":"With the fast-spread of social media and the explosively expansion of Internet, the information flowing over the Internet grows dramastically. And quite a large amount of the information is image data. So build a fast and reliable image query system is essential, and can be reused in many other fields.

One kind of image query system relies on tagging images. With tagged images, we can easily do text search based on their tags. It may be a fast solution, but we cannot always get what we want. In most cases, tags provide only a coarse description of an image with few details. For example, the tag “House” indicates that there is a house in the image, but tells us neither its type nor its color.

Another way is content-based image query. Unlike the tag-based approach, a content-based image query may be based on the color histogram, objects, or color regions (local features), which allows a more detailed and accurate search. I also believe it is a natural way to query images, much as humans do. It’s always cool to train computers to do things the way we do and then let them do so repeatedly.

So I try to build a content-based image query system, relying on features generated from color regions of images.
","m1fname":"Yuan","projectname":"A Content-based Image Query System","m3fname":""},{"m2lname":"Xu","m4lname":"","m3uni":"","m1uni":"lq2156","m4uni":"","pid":"201505-16","m2uni":"cx2177","timestring":"Tue May 19 10:06:19 2015","m4fname":"","language":"Python","m3lname":"","dataset":"Here we use .csv files download from the website and it contains personal information such as gender, age for 891 passengers.","m1lname":"Qi","industry":"","analytics":"Logistic Regression; Support Vector Machine; Random forest(in the future)","m2fname":"Chen","description":"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class passengers. We try to build a model to predict what sorts of people were likely to survive in this disaster and compare our result with the real one.
","m1fname":"Li","projectname":"Predict Survival on the Titanic","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"lw2582","m4uni":"","pid":"201505-15","m2uni":"","timestring":"Tue May 19 10:46:14 2015","m4fname":"","language":"Swift","m3lname":"","dataset":"Using Twitter API to collect the data","m1lname":"Wu","industry":"","analytics":"keyword based approach of sentiment analysis is tried, and using API that uses Maximum Entropy algorithm.","m2fname":"","description":"My motivation is to collect the most recent or popular 100 tweets within a 1.5 miles radius of the user’s current location, classify the sentiment of these tweets into three categories (positive, neutral, and negative), and visualize them on the map. To collect the tweets, I use the twitter framework. To make classifications, I introduce simple keyword-based approaches and a machine learning algorithms. To visualize the results, I develop an iOS application to display them on a map because when I collect the tweets I need to provide information about the user's current location, which makes the application most suitable to be applied on the mobile phone.","m1fname":"Linyin","projectname":"Tweet Sentiment Classification and iOS AppVisualization","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"lw2582","m4uni":"","pid":"201505-15","m2uni":"","timestring":"Tue May 19 10:48:14 2015","m4fname":"","language":"Swift","m3lname":"","dataset":"Using Twitter API to collect the data","m1lname":"Wu","industry":"","analytics":"keyword based approach of sentiment analysis is tried, and using API that uses Maximum Entropy algorithm.","m2fname":"","description":"My motivation is to collect the most recent or popular 100 tweets within a 1.5 miles radius of the user’s current location, classify the sentiment of these tweets into three categories (positive, neutral, and negative), and visualize them on the map. To collect the tweets, I use the twitter framework. 
To classify the tweets, I use simple keyword-based approaches and a machine learning algorithm. To visualize the results, I developed an iOS application that displays them on a map. Because collecting the tweets requires the user's current location, a mobile phone is the most suitable platform for the application.","m1fname":"Linyin","projectname":"Tweet Sentiment Classification and iOS AppVisualization","m3fname":""},{"m2lname":"Philip","m4lname":"","m3uni":"","m1uni":"ak3674","m4uni":"","pid":"201505-11","m2uni":"sp3174","timestring":"Tue May 19 12:02:01 2015","m4fname":"","language":"Python","m3lname":"","dataset":"We have two examples in our submission that use the NTRUEncryption Library. example_bobalice.py and example_enDom.py are two files with sample calculations. example_enDom.py performs a calculation also run by Dr. Jye-Ren Shieh. Unfortunately, we cannot make Dr. Shieh's calculations public, but our computations match his exactly.","m1lname":"Khan","industry":"","analytics":"NTRU Encryption and Decryption
Polynomial Addition, Multiplication, Division
Extended Euclidean Algorithm for Polynomials
Extended Euclidean Algorithm for Integers
Centered Lift
Modular Arithmetic Applied to Polynomial Coefficients","m2fname":"Sunil","description":"Our project is a Python library of functions that implement the algorithms necessary for NTRU encryption of polynomials, as described below.

Specifically, for the purposes of our project we will also test the library's functions on the recommendation domain encryption methodology described by JR Shieh, Lin, and Wu, to check whether single messages and full scores can be encrypted and decrypted without noise.","m1fname":"Abdus","projectname":"NTRU Python Library with Application to Encrypted Domain","m3fname":""},{"m2lname":"Philip","m4lname":"","m3uni":"","m1uni":"ak3674","m4uni":"","pid":"201505-11","m2uni":"sp3174","timestring":"Tue May 19 12:08:09 2015","m4fname":"","language":"Python","m3lname":"","dataset":"We have two examples in our submission that use the NTRUEncryption Library. example_bobalice.py and example_enDom.py are two files with sample calculations. example_enDom.py performs a calculation also run by Dr. Jye-Ren Shieh. Unfortunately, we cannot make Dr. Shieh's calculations public, but our computations match his exactly. ","m1lname":"Khan","industry":"","analytics":"NTRU Encryption and Decryption
Polynomial Addition, Multiplication, Division
Extended Euclidean Algorithm for Polynomials
Extended Euclidean Algorithm for Integers
Centered Lift
Modular Arithmetic Applied to Polynomial Coefficients
Modular Inversion","m2fname":"Sunil","description":"Our project is a Python library of functions that implement the algorithms necessary for NTRU encryption of polynomials, as described below.

Specifically, for the purposes of our project we will also test the library's functions on the recommendation domain encryption methodology described by JR Shieh, Lin, and Wu, to check whether single messages and full scores can be encrypted and decrypted without noise.
","m1fname":"Abdus","projectname":"NTRU Python Library with Application to Encrypted Domain","m3fname":""},{"m2lname":"Dai","m4lname":"","m3uni":"","m1uni":"jy2653","m4uni":"","pid":"201505-1","m2uni":"yd2300","timestring":"Tue May 19 15:50:40 2015","m4fname":"","language":"Java, Javascript, Tomcat 7.0 ","m3lname":"","dataset":"We use Twitter Streaming API to collect the real-time data concerning a topic. Twitter generously provide the data for academic research purpose for free. Once you have a Twitter account, you could apply for a pair of consumer key and access token. Once thing should be kept in mind is that when you listening on the hot topic, like \"Apple\", you may only be to get partition of the data. Twitter Streaming API: https://dev.twitter.com/streaming/overview","m1lname":"Yang","industry":"","analytics":"1. Real-time Emotional Index : we have proposed and implemented a index called real-time index, which is the overall sentiment value of public in a fix time window.
2. Sentiment Analysis: we use the online sentiment analysis API provided by Stanford NLP, which is implemented with a maximum entropy algorithm.
3. We use the morris.js and flot.js packages to display data in real time, and Ajax for communication between the back-end server and the front-end browser.","m2fname":"Yihong","description":"In October 2008, a rumor that Steve Jobs had suffered a major heart attack circulated on social media. Given the gloomy outlook for Apple without Jobs, the rumor had a substantial impact, wiping out almost $9 billion in market value. Social media appears to be a perfect platform for spreading rumors quickly. If such rumors are not detected quickly, they can damage a brand heavily by souring the public's sentiment towards a company.
However, we recognize that Twitter not only provides an effective channel for spreading rumors; it could also be an ideal platform for detecting and containing them. Since the information flowing through Twitter is publicly accessible, we can access it and make real-time inferences. Thus, when we detect abnormal trends in that information, alarms can be triggered and the necessary measures taken.
According to our observations, the ratio of negative tweets to positive tweets should be stable. If no exceptional events are going on, the real-time emotional index should converge to a real value in the range [-1, 1]. However, if a negative rumor appears and spreads widely among the public, the percentage of negative tweets could rise sharply, causing an obvious drop in the emotional index. We can therefore set a threshold for the real-time emotional index based on the ordinary situation (when no exceptional tweet events are going on), so that an abnormal trend can be signaled and detected.
","m1fname":"Jingwei","projectname":"A real-time rumor detecting system (Sentimental analysis based solution)","m3fname":""},{"m2lname":"Grimaud","m4lname":"","m3uni":"","m1uni":"tma2131","m4uni":"","pid":"201505-25","m2uni":"ag3017","timestring":"Tue May 19 16:59:16 2015","m4fname":"","language":"Languages: C/C++, CUDA, HTML5, Javascript, PHP, Java, Ruby, bash scripting; Platforms: Caffe, GPU, Jetson TK1, Hadoop map/reduce, Hadoop streaming, Opencv, ffmpeg. ","m3lname":"","dataset":"Open Connectome Project, Kasthuri11 dataset","m1lname":"Adams","industry":"","analytics":"caffe deep learning software, LeNet architecture, html5/javascript interface for ground-truthing EM patches, on-demand image classification, novel unsupervised sequence learning based on stationary ergodic process research.","m2fname":"Joseph-Alexandre","description":"The Big Connectome project developed a framework for applying deep learning techniques to the classification of electron microscopy patches. It built on the Brain Edge Detection project from E6893. It has both web services, backend processing and includes a big data framework (i.e. Hadoop, Spark). For E6895, we enhanced our Big Connectome ecosystem substantially to include deep learning using caffe on multiple hardware. We compare caffe training on a CPU only workstation, NVidia Jetson TK1 and AWS CPU and GPU-enabled nodes.

This research is important because it sets the groundwork for greatly improving the understanding of mammalian brain regions through state-of-the-art EM slice classification. We focus on a modified LeNet architecture and train a deep net for classification of vesicle regions in a brain. We use data provided by the Open Connectome Project. We also develop a new web-based interface for fast and accurate ground-truthing of EM patches. We use this tool to mark 200 training patches and 200 test patches. Then we train our deep net to obtain a training set accuracy of 0.995 and a testing set accuracy of 0.79. ","m1fname":"Terrence","projectname":"Big Connectome","m3fname":""},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"qs2147","m4uni":"","pid":"201505-27","m2uni":"ys2816","timestring":"Tue May 19 17:12:33 2015","m4fname":"","language":"Python","m3lname":"","dataset":"The dataset we use is from the Kaggle website. Click here or enter https://www.kaggle.com/c/acm-sf-chapter-hackathon-small/data to download the dataset. As mentioned in Section 3, our project will only use train.csv, test.csv and small_product_data.xml.

train.csv and test.csv contain information on what items users clicked on after making a search. Each line of train.csv describes a user’s click on a single item. It contains the following fields: (user, sku, category, query, click_time, query_time). small_product_data.xml contains information about Xbox products such as name, sku, release time, price and description. Only the description will be used in our content-based filtering method.

A note on the dataset: although test.csv is provided for predicting users' interests, it is insufficient for evaluating our algorithms because it lacks the sku attribute. To evaluate our algorithms, we therefore randomly split train.csv into a training part and a test part in a 9:1 ratio.
","m1lname":"Shen","industry":"","analytics":"Recommendation System, TF-IDF, Multiclassification, SVC, Spell Check, Content Based Filtering, Collaborative Filtering","m2fname":"Yun","description":"To meet the multiple needs of multiple customers, the man- ufacturer produces a wide variety of products. Now, cus- tomers have to process massive information before finding what they want. For the customers, the information is ex- tremely overloaded. One solution to this problem is recom- mendation systems. Besides, as what Steve Jobs said, it is not the customers’ job to know what they want. Hav- ing a better understanding of what the customers want be- fore they realize it themselves can help the sellers gain more profit. In this project, we developed a recommendation sys- tem which combines both content-based filtering and collab- orative filtering to predict which X-box game the customer will be most interested in based on the Best Buy mobile website data.
","m1fname":"Qiuyang","projectname":"Xbox Game Recommendation based on Best Buy Mobile Website Data","m3fname":""},{"m2lname":"Costa","m4lname":"","m3uni":"","m1uni":"sm3891","m4uni":"","pid":"201505-24","m2uni":"mfc2141","timestring":"Tue May 19 17:13:07 2015","m4fname":"","language":"Apache PySpark, Python (Google Custom Search API, Google apiclient, Bing Search API, Alchemy API, Alchemy apiclient, Webhose.io API, Webhose.io webhose, Microsoft pybing, NumPy, scipy, nltk, gensim) ","m3lname":"","dataset":"The goal of this project is facilitated by starting out with increasingly larger corpora of documents obtained with already optimized search engine APIs, namely Google, Bing and Webhose.io. To apply natural language processing techniques on these raw datasets, reference datasets have been obtained from wordfrequency.info (freely available 6,000 word subsample of the \"Corpus of Contemporary American English\" word frequency and genre dataset), and also through using python package NLTK (natural language toolkit).","m1lname":"Mourad","industry":"","analytics":"Initial preprocessing using hard-coded/selective pre-compiled regular expression pattern matching. Bag of Word document vector embeddings. K-means parallel clustering. Word2Vec (google). Gensim MultiCore LDA. Text segmentation based on semantic word embeddings.","m2fname":"Miguel","description":"The goal of this project is to build a \"semantic category\" information retrieval system that is able to get results in realtime. Search engines are widely used on a daily basis. 
This project aims to develop independent research in the area not biased towards results that have already been obtained, and has so far accomplished the initial step of machine-learning driven preprocessing of large Internet corpora.","m1fname":"Sami","projectname":"Categorical Optimized Information Retrieval","m3fname":""},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"qs2147","m4uni":"","pid":"201505-27","m2uni":"ys2816","timestring":"Tue May 19 17:13:21 2015","m4fname":"","language":"Python","m3lname":"","dataset":"The dataset we use is from the Kaggle website. Click here or enter https://www.kaggle.com/c/acm-sf-chapter-hackathon-small/data to download the dataset. As mentioned in Section 3, our project will only use train.csv, test.csv and small_product_data.xml.

train.csv and test.csv contain information on what items users clicked on after making a search. Each line of train.csv describes a user’s click on a single item. It contains the following fields: (user, sku, category, query, click_time, query_time). small_product_data.xml contains information about Xbox products such as name, sku, release time, price and description. Only the description will be used in our content-based filtering method.

A note on the dataset: although test.csv is provided for predicting users' interests, it is insufficient for evaluating our algorithms because it lacks the sku attribute. To evaluate our algorithms, we therefore randomly split train.csv into a training part and a test part in a 9:1 ratio.
","m1lname":"Shen","industry":"","analytics":"Recommendation System, TF-IDF, Multiclassification, SVC, Spell Check, Content Based Filtering, Collaborative Filtering","m2fname":"Yun","description":"To meet the multiple needs of multiple customers, the man- ufacturer produces a wide variety of products. Now, cus- tomers have to process massive information before finding what they want. For the customers, the information is ex- tremely overloaded. One solution to this problem is recom- mendation systems. Besides, as what Steve Jobs said, it is not the customers’ job to know what they want. Hav- ing a better understanding of what the customers want be- fore they realize it themselves can help the sellers gain more profit. In this project, we developed a recommendation sys- tem which combines both content-based filtering and collab- orative filtering to predict which X-box game the customer will be most interested in based on the Best Buy mobile website data.
","m1fname":"Qiuyang","projectname":"Xbox Game Recommendation based on Best Buy Mobile Website Data","m3fname":""},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"qs2147","m4uni":"","pid":"201505-27","m2uni":"ys2816","timestring":"Tue May 19 17:15:50 2015","m4fname":"","language":"Python","m3lname":"","dataset":"The dataset we use is from Kaggle website. Click here or en- ter https://www.kaggle.com/c/ acm-sf-chapter-hackathon- small/data to download the dataset. As mentioned in Sec- tion 3, our project will only use train.csv, test.csv and small_product_data.xml.

train.csv and test.csv contain information on what items users clicked on after making a search. Each line of train.csv describes a user’s click on a single item. It contains the following fields: (user, sku, category, query, click_time, query_time). small_product_data.xml contains information about Xbox products such as name, sku, release time, price and description. Only the description will be used in our content-based filtering method.

A note on the dataset: although test.csv is provided for predicting users' interests, it is insufficient for evaluating our algorithms because it lacks the sku attribute. To evaluate our algorithms, we therefore randomly split train.csv into a training part and a test part in a 9:1 ratio.
","m1lname":"Shen","industry":"","analytics":"Recommendation System, TF-IDF, Multiclassification, SVC, Spell Check, Content Based Filtering, Collaborative Filtering","m2fname":"Yun","description":"To meet the multiple needs of multiple customers, the manufacturer produces a wide variety of products. Now, customers have to process massive information before finding what they want. For the customers, the information is extremely overloaded. One solution to this problem is recommendation systems. Besides, as what Steve Jobs said, it is not the customers’ job to know what they want. Having a better understanding of what the customers want before they realize it themselves can help the sellers gain more profit. In this project, we developed a recommendation system which combines both content-based filtering and collaborative filtering to predict which X-box game the customer will be most interested in based on the Best Buy mobile website data.
","m1fname":"Qiuyang","projectname":"Xbox Game Recommendation based on Best Buy Mobile Website Data","m3fname":""},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"qs2147","m4uni":"","pid":"201505-27","m2uni":"ys2816","timestring":"Tue May 19 17:19:15 2015","m4fname":"","language":"Python","m3lname":"","dataset":"The dataset we use is downloaded from Kaggle website. Click here or enter https://www.kaggle.com/c/acm-sf-chapter-hackathon-small to download the dataset. train.csv and test.csv contain informa- tion on what items users clicked on after making a search. Each line of train.csv describes a user’s click on a single item. It contains the following fields:(user, sku, category, query, click_time, query_time). small_product_data.xml contains information about xbox products like name, sku, release time, price and description. Only the description will be used in our content based filtering method.

A note on the dataset: although test.csv is provided for predicting users' interests, it is insufficient for evaluating our algorithms because it lacks the sku attribute. To evaluate our algorithms, we therefore randomly split train.csv into a training part and a test part in a 9:1 ratio.
","m1lname":"Shen","industry":"","analytics":"Recommendation System, TF-IDF, Multiclassification, SVC, Spell Check, Content Based Filtering, Collaborative Filtering","m2fname":"Yun","description":"To meet the multiple needs of multiple customers, the manufacturer produces a wide variety of products. Now, customers have to process massive information before finding what they want. For the customers, the information is extremely overloaded. One solution to this problem is recommendation systems. Besides, as what Steve Jobs said, it is not the customers’ job to know what they want. Having a better understanding of what the customers want before they realize it themselves can help the sellers gain more profit. In this project, we developed a recommendation system which combines both content-based filtering and collaborative filtering to predict which X-box game the customer will be most interested in based on the Best Buy mobile website data.
","m1fname":"Qiuyang","projectname":"Xbox Game Recommendation based on Best Buy Mobile Website Data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"adc2171","m4uni":"","pid":"201505-21","m2uni":"","timestring":"Wed May 20 06:40:54 2015","m4fname":"","language":"Python, PHP, Mysql, Jquery","m3lname":"","dataset":"https://data.stackexchange.com/","m1lname":"Cunha","industry":"","analytics":"Pyspark","m2fname":"","description":"AskUbuntu is a question and answer site for students, professionals, and researchers that aims to make technological information easily accessible to everyone. It has a huge dataset containing millions of posts and comments about Ubuntu operational system, including its softwares and applications.

AskUbuntu offers a unique approach to different users. Its dataset is well known for providing students with comprehensive subject coverage without the information overload of a general search engine.

I decided to use this database because of its academic value, and I hope to contribute to AskUbuntu's mission, which is to share and grow the world's knowledge.

That being said, I aim to optimize AskUbuntu web platform by analyzing and improving its public available dataset. Based on specific key-words and tags, I will make a final feedback/recommendation systems containing the most relevant questions related to a given matter.","m1fname":"Andre","projectname":"AskUbuntu - Data Analytics","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cl300","m4uni":"","pid":"201412-89","m2uni":"","timestring":"Thu Dec 17 12:14:53 2015","m4fname":"","language":"Assembly","m3lname":"","dataset":"all","m1lname":"Lin","industry":"Information","analytics":"others","m2fname":"","description":"Great","m1fname":"CY","projectname":"This is My Project","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cl300","m4uni":"","pid":"201512-78","m2uni":"","timestring":"Thu Dec 17 13:35:38 2015","m4fname":"","language":"Assembly","m3lname":"","dataset":"all","m1lname":"Lin","industry":"Information","analytics":"others","m2fname":"","description":"Great","m1fname":"CY","projectname":"This is My Magic Project","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"yx2797","m4uni":"","pid":"201512-21","m2uni":"zz2361","timestring":"Thu Dec 17 14:35:10 2015","m4fname":"","language":"MATLAB OpenGL Spark","m3lname":"","dataset":"The image dataset with different resolution is created by us. We choose the original 8 images with resolution 720*1280 varying in content and down sample them into three different resolution:720*1280, 540*960, 360*640. Then we degrade them with blur, jpeg compression and noise with different parameters. The total number of images is 504. We collect the subject quality score of the dataset by calculate the average score of ten people. 
We split off a training set of 72 images, 24 for each resolution.","m1lname":"Zhang","industry":"Media","analytics":"Full-reference image quality assessment for different resolutions can be divided into 5 parts, which account for the quality loss due to degradation, the different assessment bases of observers caused by differences in resolution and content, and the Human Visual System.

1)DCT Transform
According to the energy concentration property of the DCT, the DCT coefficients of the reference image can be divided into a low frequency part, a high frequency part and a half high frequency part for calculation.

2)SSIM calculation of low frequency part of reference image and target image
The SSIM between the low frequency parts of the reference image and the target image measures the loss of quality caused by degradation. Multiplied by the Human Visual System parameter, which is derived from the distance from the observer to the screen, the height of the screen and the resolution of the image, the SSIM also reflects the influence of the viewing conditions.

3)The proportion of high frequency part of reference image to all alternating current coefficients
The characteristic of the high frequency part is described in 2.4. In detail, we compute the sum of squares of all high frequency coefficients of the reference image, which represents the energy lost due to the reduction in resolution. We then compute the sum of squares of all alternating current (AC) coefficients, which represents the total energy of the details in the reference image. Finally, we take the quotient of the first sum over the second, which represents the proportion of detail lost due to the reduction in resolution. This quotient is highly correlated with content and resolution. For images with the same resolution, an image rich in detail has a higher proportion than an image with less detail. For images with the same content, the higher the resolution, the lower the proportion. As a result, this proportion satisfies the requirement of extracting a parameter related to content and resolution.

4)The proportion of half high frequency part of reference image to all alternating current coefficients
The half high frequency part of the reference image is what remains of the reference image relative to an imaginary target image with half the height and width of the reference image. The calculation is similar to that of the proportion of the high frequency part to all alternating current coefficients. This proportion is related only to the content of the image.

Finally, three reasonable parameters related to image quality assessment are extracted. Input the three parameters into the trained model of regression, to get the objective quality score of the image.","m2fname":"Zhili","description":"Nowadays people use different devices to acquire visual information, so an algorithm to assess quality of images with different resolution is in need. In this article, we propose a full reference image quality assessment algorithm for different resolution and test its performance. Moreover, we have tested and analyzed the difference of choosing different regression training methods to the same parameters extracted. Hope to provide a reference for the choice of training method in future development of image quality assessment.","m1fname":"Youjia","projectname":"Full Reference Image Quality Assessment for Different Resolution","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"ys2843","m1uni":"zc2270","m4uni":"","pid":"201512-15","m2uni":"sw2962","timestring":"Thu Dec 17 14:45:16 2015","m4fname":"","language":"Python, MongoDB, Tableau, D3, Spark","m3lname":"Shi","dataset":"The dataset we tested is tweets from Nov 29 to Dec 05 focusing on San Bernardino Shooting. We got this dataset by using Twitter API.
We also used pre-classified tweets as training data for sentiment analysis, which can make our results fairer.
The dataset is here: http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/","m1lname":"Cai","industry":"Media","analytics":"Python was used to connect Twitter API to download the historical tweets data. MongoDB was employed for data cleaning and aggregation. Data Visualization was implemented in Tableau and D3 both. Finally we did sentiment analysis on the tweet content in Spark. ","m2fname":"Sitian","description":"This century happens a lot of events globally, either tragic or inspiring, e.g. Paris Attack or Zuckerberg's donation recently. Global news agency used to be the most efficient and accurate method for people to detect these events. However, because of the rapid development of social media, these magic websites and apps can spread news to everywhere of this planet more quickly. Our project aims to use Twitter as an example to prove the efficiency and accuracy of event detection through social media, comparing with three most popular new agencies in the world: CNN, Reuters and Xinhua news. The target event we chose is San Bernardino Shooting, which just happened several weeks ago. ","m1fname":"Zhuxi","projectname":"Cross-source Event Detection Through Social Media","m3fname":"Yi "},{"m2lname":"Kang","m4lname":"","m3uni":"","m1uni":"jh3534","m4uni":"","pid":"201512-34","m2uni":"wk2269","timestring":"Thu Dec 17 15:51:44 2015","m4fname":"","language":"Java, Yelp API, Advanced REST client, Android Studio, Android SDK, ","m3lname":"","dataset":"Yelp Challenge Dataset (public on http://www.yelp.com/dataset_challenge)

The academic dataset includes data from cities all around the world: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison in U.S.; Montreal and Waterloo in Canada; Edinburgh in U.K.; Karlsruhe in Germany.

1.6M reviews and 500K tips by 366K users for 61K businesses.","m1lname":"Hu","industry":"Retail","analytics":"Analytics: Hadoop, Hive

Map-reduce: a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Collaborative Filtering Recommender Algorithm: collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
Item-based recommendation: calculate the similarity between items and make recommendations. ","m2fname":"Wendan","description":"Yelp offers a platform for consumers to find restaurants through reviews and ratings. A typical search on Yelp requires the customer to enter keywords and then pick by hand from a long list of search results. However, when a customer travels to a new place and looks for a suitable restaurant, he or she may not want to try different keywords and read tons of reviews before making the right choice. This is where our customized recommender system comes in.

Our project is designed to analyze the Yelp open database – challenge academic dataset and provide item-similarity ranking using big data analytical tools. Then our system will recommend relevant restaurants using collaborative filtering recommender algorithm from all nearby restaurants grabbed by Yelp API. The recommender works based on the personal preference of specific customer and shows the result on an android app.
","m1fname":"Jing","projectname":"Yelp Dataset Analysis and Customized Recommender System","m3fname":""},{"m2lname":"Mahajan","m4lname":"","m3uni":"sq2168","m1uni":"bjs2135","m4uni":"","pid":"201512-12","m2uni":"mm4399","timestring":"Thu Dec 17 16:27:10 2015","m4fname":"","language":"Python; Java","m3lname":"Qian","dataset":"Tweets pulled from Twitter API. Users gathered from list of followers of each of 30 NBA teams. Up to 3,200 latest tweets pulled for each user.","m1lname":"Slakter","industry":"Media","analytics":"Clustering via Spark; Sentiment Analysis of Tweets via NLTK Toolkit; Recommendations via Mahout; Visualizations via SystemG","m2fname":"Mayank","description":"The goal of this project was to delve into the specific affinities of NBA followers with teams in the hopes of providing fans with recommendations to root for or against other teams in the NBA depending on the teams they follow currently and how they relate to other users in the dataset. This sort of analysis can likely be used by advertising firms to attempt to bring more fans to certain NBA teams.","m1fname":"Brian","projectname":"Sports Fandom","m3fname":"Sheng"},{"m2lname":"jiang","m4lname":"","m3uni":"zl2348","m1uni":"sd2810","m4uni":"","pid":"201512-36","m2uni":"zj2173","timestring":"Thu Dec 17 16:50:01 2015","m4fname":"","language":"Python, Js, Spark, Hadoop Hdfs","m3lname":"lv","dataset":"
Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.","m1lname":"dong","industry":"Media","analytics":"Collaborative Filtering ALS algorithm

Create RDDs for the Books table and the Ratings table, and separate the data into train and test sets. Train a model and evaluate it on the test data to tune parameters for later use.

Get ratings from the front end for every new user, merge them into the Ratings table, and use the complete data to train a new model for that user.

Use the new complete model to predict ratings on each book for the new user (excluding books the user has already rated).

Select the top 25 books, keeping only books that have been reviewed more than 10 times.
Return the recommendations to the user (to be rated further).
","m2fname":"zewei","description":"In this era of information explosion, have you ever thought about how many books are out there? For a reader, it is a real pain to find a good book to read. The search takes time, and the reader might spend a while on a poor book before realizing that it is not a good fit.

Here our app comes to the rescue. Our app aims to make the best book recommendations for users. In order to have enough knowledge to recommend properly, we take a huge database which consists of millions of books and user rating information from an Amazon data source. The data is huge, but the recommendation needs to be done in seconds, which is why this is a real-world big data computation challenge.","m1fname":"shiyu","projectname":"Target Your Next READING : Book Recommender","m3fname":"zixuan"},{"m2lname":"DeRosa","m4lname":"","m3uni":"jma2215","m1uni":"kj2347","m4uni":"","pid":"201512-27","m2uni":"kderosa","timestring":"Thu Dec 17 17:26:46 2015","m4fname":"","language":"Hadoop, HDFS, Mahout, Amazon Web Services S3/EC2/EMR, Python, Java, Javascript, Gensim, PyLDAvis","m3lname":"Adelson","dataset":"Yelp Dataset from http://www.yelp.com/dataset_challenge.

The software should be able to support any textual corpus to perform topic modeling.","m1lname":"Jayaraman","industry":"Information","analytics":"We implemented topic modeling using batch LDA (Latent Dirichlet Allocation), online LDA and online HDP (Hierarchical Dirichlet Process).

Batch LDA was implemented in Mahout while online LDA and online HDP were in Gensim.

We performed a quantitative evaluation using perplexity and a visual evaluation using pyLDAvis as a visualization tool. Modifications to pyLDAvis that allow topic models from Mahout to be visualized will be submitted to the pyLDAvis creators and made available as part of their open-source package; they are a significant contribution of this project.

","m2fname":"Kyle","description":"Use topic modeling to analyze the Yelp Reviews dataset

Find a list of differences in topics between high-rating and low-rating reviews for the same class of businesses.
This could be useful for business owners to see the factors that make customers like or dislike a particular category of business.

Examine the feasibility of reproducing the Yelp category hierarchy purely by analyzing the review text.
Yelp has a multi-level manually created category hierarchy to categorize businesses. Automating this could help simplify this process as well as categorizing businesses that could legitimately belong to multiple categories.

Examine the differences in results between various algorithms for topic modeling
Batch LDA using Mahout, online LDA and online HDP (Hierarchical Dirichlet Process) using Gensim. This is useful to see which algorithms/tools are effective for topic modeling over Big Data corpuses.
","m1fname":"Karthik ","projectname":"Analyzing the Yelp Review Dataset with Topic Modeling","m3fname":"Jon"},{"m2lname":"DeRosa","m4lname":"","m3uni":"jma2215","m1uni":"kj2347","m4uni":"","pid":"201512-27","m2uni":"kderosa","timestring":"Thu Dec 17 17:29:05 2015","m4fname":"","language":"Hadoop, HDFS, Mahout, Amazon Web Services S3/EC2/EMR, Python, Java, Javascript, Gensim, PyLDAvis","m3lname":"Adelson","dataset":"Yelp Dataset from http://www.yelp.com/dataset_challenge.

The software should be able to support any textual corpus to perform topic modeling.","m1lname":"Jayaraman","industry":"Information","analytics":"We implemented topic modeling using batch LDA (Latent Dirichlet Allocation), online LDA and online HDP (Hierarchical Dirichlet Process).

Batch LDA was implemented in Mahout while online LDA and online HDP were in Gensim.

We performed a quantitative evaluation using perplexity and a visual evaluation using pyLDAvis as a visualization tool.

Modifications to pyLDAvis that allow topic models from Mahout to be visualized will be submitted to the pyLDAvis creators and made available as part of their open-source package; they are a significant contribution of this project.

","m2fname":"Kyle","description":"Use topic modeling to analyze the Yelp Reviews dataset

Find a list of differences in topics between high-rating and low-rating reviews for the same class of businesses.
This could be useful for business owners to see the factors that make customers like or dislike a particular category of business.

Examine the feasibility of reproducing the Yelp category hierarchy purely by analyzing the review text.
Yelp has a multi-level manually created category hierarchy to categorize businesses. Automating this could help simplify this process as well as categorizing businesses that could legitimately belong to multiple categories.

Examine the differences in results between various algorithms for topic modeling
Batch LDA using Mahout, online LDA and online HDP (Hierarchical Dirichlet Process) using Gensim. This is useful to see which algorithms/tools are effective for topic modeling over Big Data corpuses.
","m1fname":"Karthik ","projectname":"Analyzing the Yelp Review Dataset with Topic Modeling","m3fname":"Jon"},{"m2lname":"DeRosa","m4lname":"","m3uni":"jma2215","m1uni":"kj2347","m4uni":"","pid":"201512-27","m2uni":"kderosa","timestring":"Thu Dec 17 17:31:53 2015","m4fname":"","language":"Hadoop, HDFS, Mahout, Amazon Web Services S3/EC2/EMR, Python, Java, Javascript, Gensim, PyLDAvis","m3lname":"Adelson","dataset":"Yelp Dataset from http://www.yelp.com/dataset_challenge.

The software should be able to support any textual corpus to perform topic modeling.","m1lname":"Jayaraman","industry":"Information","analytics":"We implemented topic modeling using batch LDA (Latent Dirichlet Allocation), online LDA and online HDP (Hierarchical Dirichlet Process).

Batch LDA was implemented in Mahout while online LDA and online HDP were in Gensim.

We performed a quantitative evaluation using perplexity and a visual evaluation using pyLDAvis as a visualization tool.

Modifications to pyLDAvis that allow topic models from Mahout to be visualized will be submitted to the pyLDAvis creators and made available as part of their open-source package; they are a significant contribution of this project.

","m2fname":"Kyle","description":"

Use topic modeling to analyze the Yelp Reviews dataset

Find a list of differences in topics between high-rating and low-rating reviews for the same class of businesses.
This could be useful for business owners to see the factors that make customers like or dislike a particular category of business.

Examine the feasibility of reproducing the Yelp category hierarchy purely by analyzing the review text.
Yelp has a multi-level manually created category hierarchy to categorize businesses. Automating this could help simplify this process as well as categorizing businesses that could legitimately belong to multiple categories.

Examine the differences in results between various algorithms for topic modeling
Batch LDA using Mahout, online LDA and online HDP (Hierarchical Dirichlet Process) using Gensim. This is useful to see which algorithms/tools are effective for topic modeling over Big Data corpuses.
","m1fname":"Karthik ","projectname":"Analyzing the Yelp Review Dataset with Topic Modeling","m3fname":"Jon"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"wz2297","m4uni":"","pid":"201512-7","m2uni":"","timestring":"Thu Dec 17 17:33:54 2015","m4fname":"","language":"Objective-C, Bash, R","m3lname":"","dataset":"Twitter Streaming API Dataset","m1lname":"Zhang","industry":"Information","analytics":"Algorithms: Maximum Entropy, Keyword Based Counting.
Visualization: MapKit in iOS","m2fname":"","description":"Objective:
Use big data collected by Twitter to generate a useful regional mood map.

Innovation:
Bring big data on mood expression to a mobile app, where people can check it easily.

Motivation:
(1) Help the government get a sense of regional mood;
(2) Help people decide which area is a \"happier\" area to live in;
(3) A template application for other areas.","m1fname":"Wenyu","projectname":"Regional Mood Assessment Application based on Tweets","m3fname":""},{"m2lname":"Bai","m4lname":"","m3uni":"ln2287","m1uni":"ly2331","m4uni":"","pid":"201512-45","m2uni":"hb2484","timestring":"Thu Dec 17 18:11:48 2015","m4fname":"","language":"Python, SQL, Flask, HTML, CSS, Javascript","m3lname":"Nan","dataset":"By requesting various datasets from Yelp APIs, general business information, check-ins, user information, and reviews are used as major datasets. The description page is at the URL: http://www.yelp.com/dataset_challenge","m1lname":"Yang","industry":"Social Science-Government","analytics":"Data storage is implemented using SparkSQL. The recommendation algorithm is implemented with the Spark MLlib library, which contains common machine learning algorithms. The visualization is a web interface built with Flask. Most programs rely on Python scripts.","m2fname":"Hongyang","description":"The objective of our research and project is to capture the cultural trends emerging in different cities and provide recommendations to users based on their historical preferences. By building front-end and back-end systems for visualization and data storage, our system provides a more user-friendly interface than other similar platforms.","m1fname":"Lan","projectname":"Yelp Data Analysis and Recommendation","m3fname":"Yazhuo"},{"m2lname":"Jia","m4lname":"","m3uni":"yc3079","m1uni":"xc2331","m4uni":"","pid":"201512-61","m2uni":"j2330","timestring":"Thu Dec 17 18:32:47 2015","m4fname":"","language":"Java, C++, System G, Matlab, Python, Hadoop","m3lname":"Chen","dataset":"The dataset of this project is downloaded from Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0. 
Many movie rating datasets are available on the Internet. We chose this one because, in addition to movie id, user id and rating, it provides more information about the users: their gender and age. This additional information gives us more ways to analyze the dataset. The dataset is tested both in its original form and after classification.
Here is a brief description of the dataset:
(1) \"ydata-ymovies-user-movie-ratings-train-v1_0.txt\" contains a small sample of Yahoo! users' ratings of movies, with the following
fields:
0 anonymized user_id
1 movie_id
2 rating(from 1(F) to 13(A+))
3 converted rating(from 1 to 5: A-,A, A+ will be converted to 5)
(2) \"ydata-ymovies-user-demographics-v1_0.txt\" contains user demographic information, with the following fields:
0 anonymized user_id
1 birthyear
2 gender
(3) \"ydata-ymovies-mapping-to-eachmovie-v1_0.txt\" contains a mapping from the movie ids used in this Yahoo! Movies dataset to the corresponding movies ids and titles used in the EachMovie dataset. The mapping may be incomplete or incorrect. The EachMovie dataset was created by the Digital Equipment Corporation's Systems Research Center and is not associated with Yahoo! or available via Yahoo!. The file contains the following fields:
0 yahoo_movie_id
1 movie title
2 eachmovie_movie_id","m1lname":"Cao","industry":"Media","analytics":"The dataset is almost ready to use, but we modified it for further analysis. We classified the data by both gender and age (in ten-year groups). Then a User-based Recommender and an Item-based Recommender are used.
Using the evaluation algorithms in Mahout, we tested several algorithms, such as Nearest-N Neighborhood, Threshold-based Neighborhood, Euclidean Distance Similarity and Pearson Correlation Similarity.
Finally, we chose the algorithm that performs best on accuracy and time efficiency. The System G toolkit is used to visualize the recommendation result.
","m2fname":"Tiancheng","description":"This project aims to recommend movies to users based on their previous ratings. Before recommendation, the users are divided into groups by age or gender. Compared to the original dataset, the grouped data is processed faster with equal (even slightly better) accuracy. By separating the data into groups, we can also find different taste patterns among users with specific characteristics. These properties are meaningful for recommendation systems that process big data, and they enhance interactivity.
","m1fname":"Xu","projectname":"Movie Recommendation and Analytics","m3fname":"Yanjing"},{"m2lname":"Jia","m4lname":"","m3uni":"yc3079","m1uni":"xc2331","m4uni":"","pid":"201512-61","m2uni":"tj2330","timestring":"Thu Dec 17 18:38:25 2015","m4fname":"","language":"Java, C++, System G, Neo4j, Python, Matlab, Hadoop","m3lname":"Chen","dataset":"The dataset of this project is downloaded from Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0. Many movie rating datasets are available on the Internet. We chose this one because, in addition to movie id, user id and rating, it provides more information about the users: their gender and age. This additional information gives us more ways to analyze the dataset. The dataset is tested both in its original form and after classification.
Here is a brief description of the dataset:
(1) \"ydata-ymovies-user-movie-ratings-train-v1_0.txt\" contains a small sample of Yahoo! users' ratings of movies, with the following
fields:
0 anonymized user_id
1 movie_id
2 rating(from 1(F) to 13(A+))
3 converted rating(from 1 to 5: A-,A, A+ will be converted to 5)
(2) \"ydata-ymovies-user-demographics-v1_0.txt\" contains user demographic information, with the following fields:
0 anonymized user_id
1 birthyear
2 gender
(3) \"ydata-ymovies-mapping-to-eachmovie-v1_0.txt\" contains a mapping from the movie ids used in this Yahoo! Movies dataset to the corresponding movies ids and titles used in the EachMovie dataset. The mapping may be incomplete or incorrect. The EachMovie dataset was created by the Digital Equipment Corporation's Systems Research Center and is not associated with Yahoo! or available via Yahoo!. The file contains the following fields:
0 yahoo_movie_id
1 movie title
2 eachmovie_movie_id
","m1lname":"Cao","industry":"Media","analytics":"The dataset is almost ready to use, but we modified it for further analysis. We classified the data by both gender and age (in ten-year groups). Then a User-based Recommender and an Item-based Recommender are used.
Using the evaluation algorithms in Mahout, we tested several algorithms, such as Nearest-N Neighborhood, Threshold-based Neighborhood, Euclidean Distance Similarity and Pearson Correlation Similarity.
Finally, we chose the algorithm that performs best on accuracy and time efficiency. The System G toolkit is used to visualize the recommendation result.
","m2fname":"Tiancheng","description":"This project aims to recommend movies to users based on their previous ratings. Before recommendation, the users are divided into groups by age or gender. Compared to the original dataset, the grouped data is processed faster with equal (even slightly better) accuracy. By separating the data into groups, we can also find different taste patterns among users with specific characteristics. These properties are meaningful for recommendation systems that process big data, and they enhance interactivity.
","m1fname":"Xu","projectname":"Movie Recommendation and Analytics","m3fname":"Yanjing"},{"m2lname":"Song","m4lname":"","m3uni":"ys2816","m1uni":"qs2147","m4uni":"","pid":"201512-20","m2uni":"ps2839","timestring":"Thu Dec 17 18:59:46 2015","m4fname":"","language":"The primary language we use is Python. Other libraries we use include scikit-learn, GraphLab Create, Keras, pandas, NumPy, etc.","m3lname":"Sun","dataset":"The dataset we use is the Otto Group product information, which can be downloaded from the Kaggle website (download link: https://www.kaggle.com/c/otto-group-product-classification-challenge/data). The dataset contains information on more than 200,000 products. Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. There are nine categories for all products. Each target category represents one of the most important product categories (like fashion, electronics, etc.). We randomly split the data into training and testing sets to create 10 datasets that help us validate our algorithms and models.

Our software can also support data in a similar format (e.g. id, feature1, feature2, ..., class_id) for classification.","m1lname":"Shen","industry":"Information","analytics":"Analytics:
Use log loss to evaluate the classification results of different models.

Algorithms:
(1) Random Forest
(2) Neural Network
(3) XGBoost
(4) Combination

System modules:
(1) Utility module: provides functions to generate 10 groups of datasets and evaluate the classification results.
(2) Model module: provides different models (e.g. random forest, neural network, xgboost, combination) to do classification.

Visualization:
Log loss table and chart for different models.","m2fname":"Peng","description":"The objective of our project is to automatically classify the products according to their features. Our innovations are trying different models and tuning their parameters to make the classification as accurate as possible.

We think automatic classification is very important for e-commerce companies, which usually have several thousand products needing to be added to their product line. There are 3 reasons. First, the consistent analysis of the products for these companies is crucial. Second, due to diverse global infrastructure, many identical products can be classified differently. Third, the quality of the product analysis depends heavily on the ability to cluster similar products accurately.
","m1fname":"Qiuyang","projectname":"Otto Group Product Classification","m3fname":"Yun"},{"m2lname":"Cao","m4lname":"","m3uni":"yc3079","m1uni":"tj2330","m4uni":"","pid":"201512-61","m2uni":"xc2331","timestring":"Thu Dec 17 19:02:12 2015","m4fname":"","language":"Java, C++, System G, Neo4j, Python, Matlab, Hadoop ","m3lname":"Chen","dataset":"The dataset of this project is downloaded from Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0. Many movie rating datasets are available on the Internet. We chose this one because, in addition to movie id, user id and rating, it provides more information about the users: their gender and age. This additional information gives us more ways to analyze the dataset. The dataset is tested both in its original form and after classification.
Here is a brief description of the dataset:
(1) \"ydata-ymovies-user-movie-ratings-train-v1_0.txt\" contains a small sample of Yahoo! users' ratings of movies, with the following
fields:
0 anonymized user_id
1 movie_id
2 rating(from 1(F) to 13(A+))
3 converted rating(from 1 to 5: A-,A, A+ will be converted to 5)
(2) \"ydata-ymovies-user-demographics-v1_0.txt\" contains user demographic information, with the following fields:
0 anonymized user_id
1 birthyear
2 gender
(3) \"ydata-ymovies-mapping-to-eachmovie-v1_0.txt\" contains a mapping from the movie ids used in this Yahoo! Movies dataset to the corresponding movies ids and titles used in the EachMovie dataset. The mapping may be incomplete or incorrect. The EachMovie dataset was created by the Digital Equipment Corporation's Systems Research Center and is not associated with Yahoo! or available via Yahoo!. The file contains the following fields:
0 yahoo_movie_id
1 movie title
2 eachmovie_movie_id ","m1lname":"Jia","industry":"Media","analytics":"The dataset is almost ready to use, but we modified it for further analysis. We classified the data by both gender and age (in ten-year groups). Then a User-based Recommender and an Item-based Recommender are used.
Using the evaluation algorithms in Mahout, we tested several algorithms, such as Nearest-N Neighborhood, Threshold-based Neighborhood, Euclidean Distance Similarity and Pearson Correlation Similarity.
Finally, we chose the algorithm that performs best on accuracy and time efficiency. The System G toolkit is used to visualize the recommendation result. ","m2fname":"Xu","description":"This project aims to recommend movies to users based on their previous ratings. Before recommendation, the users are divided into groups by age or gender. Compared to the original dataset, the grouped data is processed faster with equal (even slightly better) accuracy. By separating the data into groups, we can also find different taste patterns among users with specific characteristics. These properties are meaningful for recommendation systems that process big data, and they enhance interactivity. ","m1fname":"Tiancheng","projectname":"Movie Recommendation and Analytics","m3fname":"Yanjing"},{"m2lname":"Berard","m4lname":"","m3uni":"","m1uni":"ak3808","m4uni":"","pid":"201512-48","m2uni":"aab2227","timestring":"Thu Dec 17 19:14:03 2015","m4fname":"","language":"Python, Apache Hadoop, Apache Spark, Node.js, Highcharts and Highstock JavaScript libraries","m3lname":"","dataset":"Our platform is designed to be dataset agnostic, and can work with any financial time series in the comma separated value format. For the purposes of our demonstration, we used historical S&P 500 index stock tick data from Quantcode (). This dataset is free and publicly available.","m1lname":"Kakar","industry":"Finance","analytics":"In terms of analytics, we implemented algorithms to calculate the returns for a stock portfolio. We also wrote algorithms to compute statistics on the returns, such as the Mean, Standard Deviation, Sharpe Ratio, Maximum Drawdown etc.

We wrote custom algorithms to process the dataset using Spark Resilient Distributed Datasets (RDDs). We implemented a rudimentary server in node.js to serve our final web application. The visualizations in the web application were implemented using the HighCharts and HighStock JavaScript APIs.","m2fname":"Alice ","description":"Our project leverages big data technologies to realize a backtesting engine for algorithmic trading strategies. Backtesting is the most crucial step in the development cycle of a trading algorithm. In order to test a trading strategy thoroughly, a large amount of financial time series data is required. Storing, reading and processing these large datasets is a computationally intensive task.

The backtesting engine that we implemented uses a Hadoop data warehouse as its backbone. We then use Apache Spark to read these large amounts of data, process them according to user-implemented trading strategies, and calculate essential performance metrics. Finally, we use a JavaScript-based web application to visualize the performance of these algorithms versus the benchmark. Our system is able to handle arbitrarily complex trading strategies implemented in Python.

Algorithmic trading is a highly competitive sector of financial markets and our platform provides a flexible backtesting solution that can be scaled easily and can be integrated seamlessly with other tools and methods. ","m1fname":"Akshaan","projectname":"Map-Reduce for Algorithmic Trading","m3fname":""},{"m2lname":"DeRosa","m4lname":"","m3uni":"jma2215","m1uni":"kj2347","m4uni":"","pid":"201512-27","m2uni":"kderosa","timestring":"Thu Dec 17 19:24:01 2015","m4fname":"","language":"Hadoop, HDFS, Mahout, Amazon Web Services S3/EC2/EMR, Python, Java, Javascript, Gensim, PyLDAvis","m3lname":"Adelson","dataset":"Yelp Dataset from http://www.yelp.com/dataset_challenge.

The software should be able to support any textual corpus to perform topic modeling.","m1lname":"Jayaraman","industry":"Information","analytics":"We implemented topic modeling using batch LDA (Latent Dirichlet Allocation), online LDA and online HDP (Hierarchical Dirichlet Process).

Batch LDA was implemented in Mahout while online LDA and online HDP were in Gensim.

We performed a quantitative evaluation using perplexity and a visual evaluation using pyLDAvis as a visualization tool.

Modifications to pyLDAvis that allow topic models from Mahout to be visualized will be submitted to the pyLDAvis creators and made available as part of their open-source package; they are a significant contribution of this project.

","m2fname":"Kyle","description":"

Use topic modeling to analyze the Yelp Reviews dataset

Find a list of differences in topics between high-rating and low-rating reviews for the same class of businesses.
This could be useful for business owners to see the factors that make customers like or dislike a particular category of business.

Examine the feasibility of reproducing the Yelp category hierarchy purely by analyzing the review text.
Yelp has a multi-level manually created category hierarchy to categorize businesses. Automating this could help simplify this process as well as categorizing businesses that could legitimately belong to multiple categories.

Examine the differences in results between various algorithms for topic modeling
Batch LDA using Mahout, online LDA and online HDP (Hierarchical Dirichlet Process) using Gensim. This is useful to see which algorithms/tools are effective for topic modeling over Big Data corpuses.
","m1fname":"Karthik ","projectname":"Analyzing the Yelp Review Dataset with Topic Modeling","m3fname":"Jon"},{"m2lname":"Zhang","m4lname":"","m3uni":"yc3121","m1uni":"sr3254","m4uni":"","pid":" 201512-68","m2uni":"yz2869","timestring":"Thu Dec 17 19:45:05 2015","m4fname":"","language":"Java,Python, Hadoop, Mahout","m3lname":"Cui","dataset":"The dataset is fetched from https://www.eventbrite.com/d/ny--new-york/events/ with a crawler. It contains the events available around New York and some attributes of each event, such as time, date, and cost.
","m1lname":"Ren","industry":"Information","analytics":"Web crawler in Python
Bayesian model with Java and Hadoop
K-means clustering (Mahout)
","m2fname":"Yeran","description":"The numerous events available on Eventbrite are overwhelming for users to choose from.

Our project provides a recommendation system which can help users make their choice.

Our recommendation system is actually not item-based; it is a hybrid of content-based and model-based approaches. It makes recommendations based on the attributes of the events that the user prefers. One of these attributes is the clustering result of the event details, a text description of each event, and this is the major contribution of our work.","m1fname":"Shiwei","projectname":"Item-based Event Recommendation Based on User’s Preference","m3fname":"Yiqing"},{"m2lname":"Jia","m4lname":"","m3uni":"yc3079","m1uni":"xc2331","m4uni":"","pid":"201512-61","m2uni":"tj2330","timestring":"Thu Dec 17 20:00:13 2015","m4fname":"","language":"Java, C++, System G, Neo4j, Python, Hadoop, Matlab","m3lname":"Chen","dataset":"The dataset of this project is downloaded from Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0. Many movie rating datasets are available on the Internet. We chose this one because, in addition to movie id, user id and rating, it provides more information about the users: their gender and age. This additional information gives us more ways to analyze the dataset. The dataset is tested both in its original form and after classification.
Here is a brief description of the dataset:
(1) \"ydata-ymovies-user-movie-ratings-train-v1_0.txt\" contains a small sample of Yahoo! users' ratings of movies, with the following
fields:
0 anonymized user_id
1 movie_id
2 rating(from 1(F) to 13(A+))
3 converted rating(from 1 to 5: A-,A, A+ will be converted to 5)
(2) \"ydata-ymovies-user-demographics-v1_0.txt\" contains user demographic information, with the following fields:
0 anonymized user_id
1 birthyear
2 gender
(3) \"ydata-ymovies-mapping-to-eachmovie-v1_0.txt\" contains a mapping from the movie ids used in this Yahoo! Movies dataset to the corresponding movies ids and titles used in the EachMovie dataset. The mapping may be incomplete or incorrect. The EachMovie dataset was created by the Digital Equipment Corporation's Systems Research Center and is not associated with Yahoo! or available via Yahoo!. The file contains the following fields:
0 yahoo_movie_id
1 movie title
2 eachmovie_movie_id
","m1lname":"Cao","industry":"Media","analytics":"The dataset is almost ready to use, but we modified it for further analysis. We classified the data by both gender and age (in ten-year groups). Then a User-based Recommender and an Item-based Recommender are used.
Using the evaluation algorithms in Mahout, we tested several algorithms, such as Nearest-N Neighborhood, Threshold-based Neighborhood, Euclidean Distance Similarity and Pearson Correlation Similarity.
Finally, we chose the algorithm that performs best on accuracy and time efficiency. The System G toolkit is used to visualize the recommendation result.
","m2fname":"Tiancheng","description":"This project aims to recommend movies to users based on their previous ratings. Before recommendation, the users are divided into groups by age or gender. Compared to the original dataset, the grouped data is processed faster with equal (even slightly better) accuracy. By separating the data into groups, we can also find different taste patterns among users with specific characteristics. These properties are meaningful for recommendation systems that process big data, and they enhance interactivity.
","m1fname":"Xu","projectname":"Movie Recommendation and Analytics","m3fname":"Yanjing"},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:06:08 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NLTK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universität Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Force-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholarly publications embody the dynamic, growing human knowledge base. The publications and the interconnections between them reveal trends and topics of interest. Identifying new trends and tracing shifts in trends typically requires in-depth knowledge of the field and high literacy in it. Scholarly organizes publications visually, revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends and provides various publication analytics. Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"Yu Hsuan"},{"m2lname":"jiang","m4lname":"","m3uni":"zl2348","m1uni":"sd2810","m4uni":"","pid":"201512-36","m2uni":"zj2173","timestring":"Thu Dec 17 20:08:24 2015","m4fname":"","language":"Python, Js, Spark, Hadoop Hdfs","m3lname":"lv","dataset":"
Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.","m1lname":"dong","industry":"Media","analytics":"Collaborative Filtering ALS algorithm

Create RDDs for the Books and Ratings tables and split the data into training and test sets. Train a model and evaluate it on the test set to tune parameters for later use.

For every new user, collect ratings from the front end, merge them into the Ratings table, and use the complete data to train a new model for that user.

Use the new complete model to predict the new user's rating for each book (excluding books the user has already rated).

Select the top 25 books, keeping only books that have been reviewed more than 10 times.
Return the recommendations to the user (to be rated further).
","m2fname":"zewei","description":"In this era of information explosion, have you ever thought about how many books are out there? For a reader, finding a good book is a real pain. It takes time, and the reader may spend a while on a poor book before realizing that it is not a good fit.

Here our App comes to rescue. Our App aims to make the best recommendations on books for users. In order to have enough knowledge to recommend properly, we take a huge database which consists of millions of books and user rating information from Amazon data source. The data is huge, but the recommendation needs to be done in second, thats why this is a real world big data computation challenge.","m1fname":"shiyu","projectname":"Target Your Next READING : Book Recommender","m3fname":"zixuan"},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:09:22 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. 
Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"Yu Hsuan"},{"m2lname":"Patil","m4lname":"","m3uni":"","m1uni":"rrm2112","m4uni":"","pid":"201512-24","m2uni":"krp2135","timestring":"Thu Dec 17 20:09:31 2015","m4fname":"","language":"Python 2.x, PySpark, Spark, Mac","m3lname":"","dataset":"Open High Low Close Volume (OHLC) data from Yahoo was used to gather 14 years of stock prices. Estimize data for wallstreet predicted eps (eps estimate) as well as the actual eps (earnings per share) when a company reported. We also got this information from Zacks/Quandl. The earnings and especially earnings estimates data is hard to come by. Zacks sells this for $1800/per year. There's also a cost from Estimize, however we were able to convince both Zacks/Quantdl and Estimize to give this data to us for free for school purposes. ","m1lname":"Martin","industry":"Finance","analytics":"For data munging we used the Pandas (Python) library extensively. We used Python to interact with api's such as Estimize, and also to download data with urllib2. For model development, we used MLlib from Spark (PySpark). Initial model development was done in Scala, however, we realized the rich ecosystem that Python provided gave us more choices. For example, we used Scikit to some evaluation after getting back the results from Apache Spark.","m2fname":"Kedar","description":"The objective is to investigate how well we can predict the likelihood of a company beating earnings estimate. This information could be used as a signal in a trading system. This application is one step removed from predicting actual earnings, which we think is a more difficult task due to the various influences on a company's earnings. The system implicitly uses the knowledge of the Analysts, and we believe it can only improve with more data. 
We haven't seen any other application doing this. There's Estimize, but they're trying to predict the earnings; we try to predict a earnings surprise which is more valuable to investors.","m1fname":"Roberto","projectname":"Earnings Predictor: Predict whether Earnings will Beat Estimates","m3fname":""},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:17:05 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. 
Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"Yu Hsuan"},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:19:10 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. 
Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"Yu Hsuan"},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:19:52 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. 
Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"YuHsuan"},{"m2lname":"Kang","m4lname":"","m3uni":"","m1uni":"jh3534","m4uni":"","pid":"201512-34","m2uni":"wk2269","timestring":"Thu Dec 17 20:25:19 2015","m4fname":"","language":"Java, Yelp API, Advanced REST client, Android Studio, Android SDK","m3lname":"","dataset":"Yelp Challenge Dataset (public on http://www.yelp.com/dataset_challenge)

The academic dataset includes data from cities all around the world: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison in U.S.; Montreal and Waterloo in Canada; Edinburgh in U.K.; Karlsruhe in Germany.

The academic dataset contains 1.6M reviews and 500K tips by 366K users for 61K businesses. ","m1lname":"Hu","industry":"Retail","analytics":"Hadoop, Hive

Map-reduce: a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Collaborative Filtering Recommender Algorithm: collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
Item-based recommendation: calculate the similarity between items and make recommendations.","m2fname":"Wendan","description":"Yelp offers a platform for consumers to find restaurants through reviews and ratings. A typical search on Yelp requires the customer to enter keywords and then pick from a long list of search results by hand. However, a customer who travels to a new place and looks for a suitable restaurant may not want to try different keywords and read tons of reviews before making the right choice. This is where our customized recommender system comes in.

Our project analyzes the Yelp Challenge academic dataset and produces an item-similarity ranking using big data analytics tools. The system then applies a collaborative filtering recommender algorithm to the nearby restaurants returned by the Yelp API and recommends the relevant ones. The recommendations are based on the personal preferences of the specific customer and are shown in an Android app.
","m1fname":"Jing","projectname":"Yelp Dataset Analysis and Customized Recommender System","m3fname":""},{"m2lname":"Berard","m4lname":"","m3uni":"","m1uni":"ak3808","m4uni":"","pid":"201512-48","m2uni":"aab227","timestring":"Thu Dec 17 20:31:17 2015","m4fname":"","language":"Python, Apache Hadoop, Apache Spark, Node.js, Highcharts and Highstock JavaScript libraries","m3lname":"","dataset":"Our platform is designed to be dataset agnostic, and can work with any financial time series in the comma separated value format. For the purposes of our demonstration, we used historical S&P 500 index stock tick data from Quantcode (). This dataset is free and publicly available.","m1lname":"Kakar","industry":"Finance","analytics":"In terms of analytics, we implemented algorithms to calculate the returns for a stock portfolio. We also wrote algorithms to compute statistics on the returns, such as the Mean, Standard Deviation, Sharpe Ratio, Maximum Drawdown etc.

We wrote custom algorithms to process the dataset using Spark Resilient Distributed Datasets (RDDs). We implemented a rudimentary server in Node.js to serve our final web application. The visualizations in the web application were implemented using the Highcharts and Highstock JavaScript APIs.","m2fname":"Alice ","description":"Our project leverages big data technologies to realize a backtesting engine for algorithmic trading strategies. Backtesting is the most crucial step in the development cycle of a trading algorithm. In order to test a trading strategy thoroughly, a large amount of financial time series data is required. Storing, reading and processing these large datasets is a computationally intensive task.

The backtesting engine that we implemented uses a Hadoop data warehouse as its backbone. We then use Apache Spark to read this data, process it according to user-implemented trading strategies, and calculate essential performance metrics. Finally, a JavaScript-based web application visualizes the performance of these strategies against the benchmark. Our system can handle arbitrarily complex trading strategies implemented in Python.

Algorithmic trading is a highly competitive sector of the financial markets, and our platform provides a flexible backtesting solution that scales easily and integrates seamlessly with other tools and methods.
","m1fname":"Akshaan","projectname":"Map Reduce for Algorithmic Trading","m3fname":""},{"m2lname":"Du","m4lname":"","m3uni":"qj2131","m1uni":"jl4350","m4uni":"","pid":"201512-54","m2uni":"cd2789","timestring":"Thu Dec 17 20:33:18 2015","m4fname":"","language":"Language: Python; Platform: Mac and Ubuntu","m3lname":"Jin","dataset":"We use the amazon product data from Julian UCSD by sending a request to him (The description page for the data set is http://jmcauley.ucsd.edu/data/amazon/). This dataset contains product reviews and metadata from Amazon, including 143.7 million reviews regarding 1.2 million products spanning May 1996 - July 2014.
","m1lname":"Li","industry":"Retail","analytics":"Machine learning algorithms:
Logistic Regression: A regression/classification model that estimates the probabilities of a categorical target variable by applying the logistic function to a linear combination of all input features.
Naive Bayes: A family of algorithms based on the naive assumption that features are independent of each other given the class. Specifically, we use the multinomial naive Bayes classifier in this project, which is a naive Bayes algorithm for multinomially distributed data.
Gradient Boosted Decision Trees: A regression/classification model based on gradient boosting with decision trees as base learners. Gradient boosting resembles other boosting methods but generalizes them by allowing optimization of an arbitrary differentiable loss function.

Hyper-parameter optimization:
Sequential model-based global optimization using the Tree-structured Parzen Estimator
","m2fname":"Chengcheng","description":"Amazon allows customers to evaluate reviews by voting on whether each one is helpful. However, new reviews are probably more important but have no helpfulness votes yet, while old helpful reviews are convincing but out of date. If we could sort recent reviews by helpfulness, or show only the helpful ones, we could give customers high-quality recent reviews. Since recent reviews have not yet been evaluated by other people, we need a model to predict their helpfulness.
In this project, we predict the helpfulness of Amazon reviews using several techniques. We designed many features and conducted experiments to decide which features and model achieve the highest PRC-AUC value for the task.
This model can be integrated into Amazon as a ranker or filter for the most recent reviews.","m1fname":"Jianhao","projectname":" Product Review Helpfulness Prediction on Amazon Dataset","m3fname":"Qiurui"},{"m2lname":"Pugliese","m4lname":"","m3uni":"","m1uni":"mar2260","m4uni":"","pid":"201512-26","m2uni":"jp3571","timestring":"Thu Dec 17 20:43:33 2015","m4fname":"","language":"Python, Apache Hadoop, HDFS, Apache Spark","m3lname":"","dataset":"We scraped player's historical data from rotoguru1.com which is an online fantasy sports reference and statistics archive.","m1lname":"Raimi","industry":"Media","analytics":"We created a roster prediction algorithm which is a combination of clustering and vector normalization, implemented through Spark's RDD action and transformation abilities. ","m2fname":"Justin","description":"Fantasy sport gaming participation has been rapidly increasing over the past several years in several forms. The latest adaptation, Daily Fantasy, has seen a sharp spike in popularity as several gaming platforms have enabled players to compete for cash.

The promise of a cash reward has caused several government agencies to start investigating the amount of skill involved with succeeding in Daily Fantasy Sports. If no skill is required and success is only a matter of luck, these games could be considered gambling and would be subject to the rules and regulations of each state.

Leveraging a dataset of NBA players past game performances, we plan to develop a recommendation system that provides an end-user with the optimal team for any given day. If the recommended team performs well enough that the end-user is successful then we will have shown that Daily Fantasy is not a game predicated on luck.","m1fname":"Michael","projectname":"Predicting Optimal Daily Fantasy Basketball Rosters","m3fname":""},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":" 201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:44:41 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP http://dblp.uni-trier.de/, a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"Yanez","industry":"Life Science","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level. Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"MIchelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. 
Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly","m3fname":"YuHsuan"},{"m2lname":"Pugliese","m4lname":"","m3uni":"","m1uni":"mar2260","m4uni":"","pid":"201512-26","m2uni":"jp3571","timestring":"Thu Dec 17 20:46:32 2015","m4fname":"","language":"Python, Apache Hadoop, HDFS, Apache Spark","m3lname":"","dataset":"We scraped player's historical data from rotoguru1.com which is an online fantasy sports reference and statistics archive.","m1lname":"Raimi","industry":"Media","analytics":"We created a roster prediction algorithm which is a combination of clustering and vector normalization, implemented through Spark's RDD action and transformation abilities. ","m2fname":"Justin","description":"Fantasy sport gaming participation has been rapidly increasing over the past several years in several forms. The latest adaptation, Daily Fantasy, has seen a sharp spike in popularity as several gaming platforms have enabled players to compete for cash.

The promise of a cash reward has caused several government agencies to start investigating the amount of skill involved with succeeding in Daily Fantasy Sports. If no skill is required and success is only a matter of luck, these games could be considered gambling and would be subject to the rules and regulations of each state.

Leveraging a dataset of NBA players past game performances, we plan to develop a recommendation system that provides an end-user with the optimal team for any given day. If the recommended team performs well enough that the end-user is successful then we will have shown that Daily Fantasy is not a game predicated on luck.","m1fname":"Michael","projectname":"Predicting Optimal Daily Fantasy Basketball Rosters","m3fname":""},{"m2lname":"Tadmor","m4lname":"","m3uni":"ys2898","m1uni":"may2114","m4uni":"","pid":"201512-39","m2uni":"mdt2125","timestring":"Thu Dec 17 20:46:36 2015","m4fname":"","language":"Neo4j, Python, Flask, D3js, Py2neo, NTLK, iGraph, Digital Ocean droplet","m3lname":"Shih","dataset":"The dataset we used originates from DBLP (http://dblp.uni-trier.de/), a computer science bibliography website hosted at Universitat Trier in Germany. The content of this data includes paper information, paper citation, author information and author collaboration. In total the corpus contains over 2 million papers, 8 million citations, and close to 2 million authors. ","m1lname":"A. Yanez","industry":"Information","analytics":"Paper community clustering: Girvan-Newman algorithm, Multi-Level.
Visualization: Forced-Directed Graph, Bubble Chart, Word Cloud","m2fname":"Michelle","description":"Scholar publications embody the dynamic growing human knowledge base. The publications and interconnections between publications reveal trends and topics of interest. Identifying new trends, tracing shifts in trends typically requires in depth knowledge of the field and high literacy in the field. Scholarly visually organizes publications revealing trends graphically. Furthermore, Scholarly computationally identifies emerging research trends as well as various publication analytics. Scholarly is highly usable via a web interface and allows interactive traversal and manipulation of thousands of publications at a time.","m1fname":"Miguel","projectname":"Scholarly: Academic Data Visualization and Analysis","m3fname":"YuHsuan"},{"m2lname":"Zhao","m4lname":"","m3uni":"yy2641","m1uni":"xl2523","m4uni":"","pid":"201512-10","m2uni":"jz2685","timestring":"Thu Dec 17 20:53:30 2015","m4fname":"","language":"Python,Java,Javascript,SQL,AWS","m3lname":"Yang","dataset":"(1)Stanford Artificial Intelligence Laboratory “Large Movie Review Dataset”
This dataset was used to build a movie review sentiment classifier
(2)Latest Twitter data collected through the Twitter API
Used the Twitter API to collect 149606 tweets relevant to movies
Applied the classifier mentioned above to these tweets and calculated the rating for each movie

","m1lname":"Lan","industry":"Media","analytics":"Algorithms
The scikit-learn package was used to build:
(1)Naive Bayes Classifier
(2)Linear SVM Classifier
Applied the test data in the “Large Movie Review Dataset” to evaluate the models
The accuracy of the Linear SVM is 10% higher
So the Linear SVM Classifier was selected to classify the tweets
Two vectorization algorithms were used to vectorize the words
(1)Count vectorization
(2)TF-IDF vectorization
TF-IDF vectorization performs better
For the Linear SVM classifier
Tried soft margin parameter C from 0.1 to 100
The model performs best when C=0.5

Visualization
Use Raphael.JS to visualize the results by date and by area. The web page is hosted on AWS.
You are welcome to visit our website:
http://twittermovierating.elasticbeanstalk.com/home.jsp","m2fname":"Jingmei","description":"Project Goals Description:
The commercial movie market has been growing larger and larger.
But there are so many movies that it is sometimes hard to choose which ones to watch. Our program helps people by providing real-time ratings of movies on social media in their location, even for movies that have not been released yet!
First, the “Large Movie Review Dataset” was used to train a Linear SVM model to classify positive and negative movie reviews.
Meanwhile, tweets relevant to movies were collected with geotag and date through the Twitter API and stored in RDS.
Secondly, the tweets for each movie are classified and their ratings calculated by area and by date.
Finally, the results are visualized on the website so that more people can benefit.
There are four highlights of our project:
(1) It can predict ratings for movies that have not been released yet, helping people select a movie.
(2) Since people from different areas may like different movies, it predicts the rating for a target area.
(3) The ratings are updated in real time, so people can see the trend of a movie's rating.
(4) The data is visualized on the website. Everybody has free access to the website and can use our results to choose a movie to watch.
In fact, I currently use the results of our project to choose movies myself.
Try our website:
http://twittermovierating.elasticbeanstalk.com/home.jsp

","m1fname":"Xing","projectname":"Twitter Based Movie Recommendation System","m3fname":"Yao"},{"m2lname":"Chen","m4lname":"","m3uni":"xl2519","m1uni":"qx2155","m4uni":"","pid":"201512-35","m2uni":"cc3701","timestring":"Thu Dec 17 20:55:09 2015","m4fname":"","language":"Python, Java, Javascript, Mahout, Django, D3.js","m3lname":"Li","dataset":"Yahoo Webscope Search Marketing Advertiser-phrase Bipartite Graph Database
Anonymized graph reflecting the pattern of connectivity between advertisers and some of the search keyword phrases they bid on.
Total nodes: 653,260:
459,678 anonymous phrases,
193,582 anonymous advertiser ids,
2,278,448 edges, representing the act of an advertiser bidding on a phrase.
","m1lname":"Xu","industry":"Media","analytics":"Format the raw data by Python;
Use Mahout + Java to produce bidding recommendations;
Django + Python serves as the web framework for our website;
Data visualization is done with the D3.js toolkit.
","m2fname":"Chen","description":"Many recommender systems target users based on the phrases they search for and click on. We want to design recommendations for another kind of user: the advertisers. Based on the keyword phrases they already bid on, we hope to recommend several appropriate additional keyword phrases. In addition, we would like to build a platform that gives advertisers easy access to this information.
","m1fname":"Qi","projectname":"Auction Recommendation for Advertiser","m3fname":"Xiaowen"},{"m2lname":"Mani","m4lname":"","m3uni":"","m1uni":"ar3390","m4uni":"","pid":"201512-71","m2uni":"sm3906","timestring":"Thu Dec 17 21:00:09 2015","m4fname":"","language":"Spark, Python, SQL","m3lname":"","dataset":"Dataset
Rich US gov dataset on UG colleges and future student performance
Exhaustive data for over 20 years
1.6 GB
Implementation
Spark
Python
Apache Mahout
Algorithms

","m1lname":"Raghupathi","industry":"Information","analytics":"Recommendation algorithm
Clustering analysis
Filtering","m2fname":"Senthil Krishna","description":"Draw a more direct correlation between college acceptances and student background
One of the biggest frustrations for students is understanding “fit” for a university holistically
Undergraduate universities and their admission rates, statistics, and criteria were analysed, and a recommendation engine based on user input was created
“Find your Fit” uses quantitative methods and data analytics to build a recommendation engine to achieve this
","m1fname":"Ashwin","projectname":"Find Your Fit: University Edition; Match the School to the Student ","m3fname":""},{"m2lname":"Han","m4lname":"","m3uni":"","m1uni":"rw2611","m4uni":"","pid":"20150512-18","m2uni":"sh3447","timestring":"Thu Dec 17 21:18:31 2015","m4fname":"","language":"Python ","m3lname":"","dataset":"Statistics of 448 soccer games","m1lname":"Wang","industry":"Media","analytics":"Pearson correlation coefficient
Decision tree C5.0
Linear regression
","m2fname":"Shuaiyu","description":"Using data analysis to find important factors in a soccer game and how the factors influence a game.","m1fname":"Rui","projectname":"Data analysis on soccer team performance","m3fname":""},{"m2lname":"Yan","m4lname":"","m3uni":"yy2632","m1uni":"xw2400","m4uni":"","pid":"201512-47","m2uni":"jy2728","timestring":"Thu Dec 17 21:19:12 2015","m4fname":"","language":"Mongodb, python, HTML","m3lname":"Yu","dataset":"Data retrieved from basketball-reference.com
Covers all NBA basketball stats in 31,686 NBA regular season games from 1985-86 season to 2013-2014 season
","m1lname":"Wang","industry":"Information","analytics":"MongoDB’s built-in aggregation framework

Each stage of the aggregation pipeline operates on the results of preceding stage

Each stage can filter and transform the individual documents

$match stage, $unwind stage, $group stage, $sort stage, $limit stage, $project, $cond, $sum, $group, $gte.
","m2fname":"Jiadong","description":"Explore individual sporting interests
Find out the critical factors that lead to winning NBA games: rebounds, assists, points, blocks, etc.
Help coaches use the results to strengthen their teams
","m1fname":"Xuhui","projectname":"Factors Lead to Win NBA Games","m3fname":"Yuantuo"},{"m2lname":"He","m4lname":"","m3uni":"ys2819","m1uni":"jy2732","m4uni":"","pid":"201512-29","m2uni":"zh2255","timestring":"Thu Dec 17 21:22:25 2015","m4fname":"","language":"Java, Hadoop MapReduce, Python/Flask, Javascript, D3.js, jQuery/AJAX, Matlab","m3lname":"Shen","dataset":"We used the MNIST handwritten digit dataset provided by LeCun et al., available at http://yann.lecun.com/exdb/mnist/.

The dataset is also available as convenient .csv files at http://pjreddie.com/projects/mnist-in-csv/.

Our software can process arbitrary input data from files assuming that each row represents a data point, with the first element representing an integer class, followed by N numerical attributes (in this case, pixel values). ","m1lname":"Yuan","industry":"Information","analytics":"A parallel Feed-Forward Neural Network was implemented in MapReduce, using mini-batch gradient descent in each of the Mappers. Implementation of the Random Forest is still ongoing.

A Python/Flask web server was implemented, which displays a D3.js plot of the sum of squared error at a given iteration of the Neural Network's backpropagation training process. This plot is automatically refreshed as the MapReduce job progresses.

Ongoing is an effort to create D3.js graph displays, showing representative inputs and outputs of the mappers and reducers in each of our parallel implementations.","m2fname":"Ziyu","description":"While learning to use existing machine learning tools such as Mahout on Hadoop, we were frequently frustrated by the lack of meaningful output regarding the performance and state of the models while they were being trained. For example a user might run a job to completion, wasting a lot of time, only to realize that due to improper parameter settings the job produced poor quality or undesired output. This could be circumvented, for example, by displaying the error of the model while it is being trained, giving users immediate feedback and the option to terminate the job early.

Our goal is to first devise parallel machine learning algorithms, namely a Random Forest and Feed-Forward Neural Network, and implement them using MapReduce in Hadoop. Secondly, we would like to provide real-time visualizations of the training performance of these parallel algorithms and display these visualizations in a web browser.

For a possible cost in performance, this project will provide a much more transparent view of what is going on inside of the implemented MapReduce algorithms, which will serve as a convenient debugging and teaching tool. We also hope to demonstrate that our parallel designs for these machine learning algorithms are feasible to implement in a MapReduce framework.","m1fname":"Jie","projectname":"Visualization of Machine Learning Algorithms in MapReduce","m3fname":"Yubin"},{"m2lname":"Liu","m4lname":"","m3uni":"wh2333","m1uni":"zs2262","m4uni":"","pid":"201512-28","m2uni":"hl2906","timestring":"Thu Dec 17 21:26:07 2015","m4fname":"","language":"Linux; C++; Shell; Matlab; Java; CSS; Javascript; HTML.","m3lname":"Hu","dataset":"UCF101 - Action Recognition Data Set. It is public and available at http://crcv.ucf.edu/data/UCF101.php. Our system can support all kinds of video datasets for action recognition and event detection.","m1lname":"Shou","industry":"Media","analytics":"Our implementation consists of three parts. (1) Video representation: motivated by the great success of deep learning approaches, we use a 3D CNN to extract features for representing video. A 3D CNN can capture appearance and motion information simultaneously. For each video, we use a 3D CNN model pre-trained on the Sports1M dataset and then fine-tuned on the UCF101 dataset to extract fc7 activations as the feature vector describing videos in the UCF101 dataset. (2) Hash: we use LSH to learn a hashing function on the UCF101 dataset based on the video feature vectors extracted by the 3D CNN, and generate a binary code for each video. (3) Retrieval: a video retrieval system based on Hamming distance and a corresponding visualization interface are developed. The effectiveness of our system is demonstrated by the experimental results on UCF101, which is a large scale video dataset. When we set the number of videos retrieved by the system to 5, our system achieves 82.36% mean precision. We also implement a visualization website demo.
Through the website, users can select a query video and see the retrieved videos of high similarity.
","m2fname":"Hongyi","description":"The number of videos uploaded to the Internet is growing at an astonishing speed. In this project, we aim to develop a robust and effective system for large scale video search and retrieval. Given a query video, our system identifies several of its nearest neighbors, which are the relevant videos returned and shown to the user.

This system can benefit many everyday applications. Automated surveillance systems in public areas such as airports and subway stations usually capture many videos. Our system can help police automatically analyze these videos to detect abnormal and suspicious activities, by using videos that contain abnormalities as queries to search for and retrieve similar abnormalities in the captured surveillance footage. Another application is advertisement recommendation. For example, YouTube could treat the video watched by a user as the query instance and search the advertisement video dataset to retrieve videos that are likely to interest the user.","m1fname":"Zheng","projectname":"Large Scale Video Search and Retrieval via CNN","m3fname":"Weiye"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"kmg2165","m4uni":"","pid":"201512-2","m2uni":"","timestring":"Thu Dec 17 21:26:35 2015","m4fname":"","language":"Scala, Apache Spark, HDFS, KML/Google Earth","m3lname":"","dataset":"New York City Taxi & Limousine Commission Trip Record Dataset
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml ","m1lname":"Graney","industry":"Transportation","analytics":"KML visualizations were used for the origin and destination points of trip clusters. k-Means clustering was used to group geographically similar trips.","m2fname":"","description":"Can we classify typical trips that New Yorkers take in taxis?
Which trips are the most popular?","m1fname":"Kevin","projectname":"Classifying NYC Taxi Trips","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"kmg2165","m4uni":"","pid":"201512-2","m2uni":"","timestring":"Thu Dec 17 21:30:10 2015","m4fname":"","language":"Scala, Apache Spark, HDFS, KML/Google Earth","m3lname":"","dataset":"New York City Taxi & Limousine Commission Trip Record Dataset
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml ","m1lname":"Graney","industry":"Transportation","analytics":"KML visualizations were used for the origin and destination points of trip clusters. k-Means clustering was used to group geographically similar trips.","m2fname":"","description":"Can we classify typical trips that New Yorkers take in taxis?
Which trips are the most popular?","m1fname":"Kevin","projectname":"Classifying NYC Taxi Trips","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"cl3418","m1uni":"ln2334","m4uni":"","pid":"201512-58","m2uni":"bw2491","timestring":"Thu Dec 17 21:30:30 2015","m4fname":"","language":"Java SE, Matlab, Bash, Ubuntu Linux, SQLite, Eclipse, Apache Hadoop, MapReduce APIs, Android APIs, Amazon EC2","m3lname":"Liu","dataset":"RSS positioning fingerprints. Training RSS is collected in advance. Testing RSS is collected and applied.","m1lname":"Niu","industry":"Information","analytics":"K-Means clustering, Euclidean distance similarity, kNN classification, MapReduce","m2fname":"Bin","description":"With the development of wireless networks, many indoor location-related technologies and applications have emerged, particularly context-aware applications. Distributed WLAN-based indoor location systems are used not only for their low cost, but also because of the non-registered 2.4GHz ISM band and the free wireless license for the 802.11 b/g protocol. In an indoor WLAN environment, the fingerprint location algorithm runs on existing facilities, without adding extra equipment. What’s more, upgrading and maintaining it has little impact on clients, which allows a wide range of applications. In this project, we use WLAN fingerprint features to achieve high-accuracy positioning. Given the size of the fingerprint sample dataset we collected (more than 16,000), we introduce Big Data analytics tools throughout the data analysis process. All the fingerprint location algorithms rely on the standardized WLAN RSS values computed by Hadoop MapReduce and on typical RSS-based fingerprinting location algorithms.
","m1fname":"Li","projectname":"Decentralized Indoor Positioning Based on WLAN Fingerprints","m3fname":"Chang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ng2565","m4uni":"","pid":"201512-14","m2uni":"","timestring":"Thu Dec 17 21:32:32 2015","m4fname":"","language":"Python","m3lname":"","dataset":"The dataset is downloaded from Kaggle website.

https://www.kaggle.com/c/acm-sf-chapter-hackathon-small","m1lname":"Gupta","industry":"Retail","analytics":"Recommendation System, TF-IDF, SVC, Spell Check, Content Based Filtering, Collaborative Filtering
","m2fname":"","description":"Before buying any product online, one must do intensive research on the product’s reviews, ratings, the number of people who rated it, and ratings from recent users. Users’ ratings and reviews are very important factors that increase the chances of a product being sold online. So, in this project, I tried to build a product recommendation system based on the user's search query and algorithms such as collaborative filtering and content-based filtering.
","m1fname":"Neha","projectname":"Product recommendation using customers’ search or click behavior","m3fname":""},{"m2lname":"QIAN","m4lname":"","m3uni":"jc4260","m1uni":"ts2957","m4uni":"","pid":"201512-59","m2uni":"cq2171","timestring":"Thu Dec 17 21:34:19 2015","m4fname":"","language":"Python, R","m3lname":"CHEN","dataset":"We use the dataset from the Yelp Dataset Challenge in 2016, including user, business, and review information.
","m1lname":"SHEN","industry":"Information","analytics":"We first implement two algorithms, a similarity algorithm and a weight function, to filter out potentially unreliable reviews, and then create a classification model to find reliable and useful reviews for users.","m2fname":"CHEN","description":"People rely heavily on opinionated social media, such as reviews on Yelp, to make decisions.
However, the overwhelming number of reviews on Yelp sometimes makes it hard for users to make wise decisions.
Our project aims to provide reliable and useful reviews to users for their decision making.
","m1fname":"TIANHE","projectname":"Reliable Reviews Recommendation","m3fname":"JIAQI"},{"m2lname":"Guo","m4lname":"","m3uni":"xh2251","m1uni":"lq2156","m4uni":"","pid":"201512-67","m2uni":"jg3639","timestring":"Thu Dec 17 21:36:54 2015","m4fname":"","language":"Python","m3lname":"Hu","dataset":"We download our dataset through the Steam Web API. It is completely public and anyone can download it by applying for a unique API key. It is structured as JSON.","m1lname":"Qi","industry":"Media","analytics":"We use logistic regression and K-nearest neighbors as the main algorithms in our model.","m2fname":"Jiaqi ","description":"Dota2 is a free-to-play multiplayer online battle arena video game developed by Valve Corporation. In recent years the prize pool of Dota2 competitions has become larger, so it is useful to find the relationship between hero selection and winning the game. Choosing heroes with a higher winning probability according to our model will help a team win the game.","m1fname":"Li","projectname":"Predict Dota2 Game Outcome","m3fname":"Xinyuan"},{"m2lname":"Lu","m4lname":"","m3uni":"dy2307","m1uni":"yl3406","m4uni":"","pid":"201512-31","m2uni":"ml3806","timestring":"Thu Dec 17 21:43:39 2015","m4fname":"","language":"Python, Matlab, Pig, Hadoop, Mahout, NetworkX, InetSoft","m3lname":"Yao","dataset":"The datasets we used in this study, \"ATC pedestrian tracking dataset\" and \"Pedestrian tracking with group annotations\", were obtained from a project whose primary purpose is enabling mobile social robots to work in public spaces, funded by JST/CREST in Japan [1]
The data was collected between October 24, 2012 and November 29, 2013. The data collection was done every week on Wednesday and Sunday, from morning until evening (9:40-20:20). The dataset consists of 92 days in total.
The data is provided as CSV files, each row in a CSV file corresponds to a single tracked person at a single instant, and it contains the following fields:
time [ms] (unixtime + milliseconds/1000), person id, position x [mm], position y [mm], position z (height) [mm], velocity [mm/s], angle of motion [deg], facing angle [deg]
For group data set : PEDESTRIAN_ID GROUP_SIZE PARTNER_ID_1 ... (list of ids of all other pedestrians in group) NUMBER_OF_INTERACTING_PARTNERS INTERACTION_PARTNER_ID_1 ... (list of all socially interacting partners)
* The two datasets share the same ids","m1lname":"Lu","industry":"Retail","analytics":"Our analysis can be separated into 3 parts. The first part is flow rate analysis, in which we used Pig to filter and process the raw data. For instance, since the original dataset sorts rows by time, we used Pig to re-sort them by person id and group the data for each person. After the analysis, we used InetSoft to plot our results.

The second part is distribution analysis and visualization. We first used Python to pre-process the dataset and divide the monitoring area into grids. We also calculated many parameters, such as people intensity, using Python. We then used MATLAB to plot a total of 42 days of data, focusing on different characteristics. Pig performed filtering of the data. We used Spark to cluster the data based on different categories (for example x, y coordinates and velocity).

The third part involves group behavior analysis and visualization. We first used Python to pre-process the dataset, selecting those pedestrians who appeared in the group dataset area. We then appended another column to the end of each row indicating whether the pedestrian is part of a group (this information comes from the group dataset). Next, we used the Mahout logistic regression algorithm to train and test models, using datasets from different dates. Finally, we used NetworkX to plot the group data, with the Fruchterman-Reingold force-directed algorithm to lay out the plot.
","m2fname":"Mengzhuo","description":"We found many interesting results by analysing pedestrian flow intensity by time and position, taking facing angle, velocity, angle of motion, and height into consideration. These include:

1. There are more people on weekends, but mainly in the afternoon.
2. Some areas are popular only when an event takes place (which is a waste of space when there is no event).
3. There is an \"invisible path\" that more people tend to walk along on weekdays.
4. People follow public order in narrow areas, and people tend to be more polite on weekends.
5. The distribution of children tends to be highly centralised.
6. There are more children and group shoppers on weekends.

We also perform classification to distinguish group shoppers from individual shoppers.

The analysis is important because it helps the mall uncover facts hidden in otherwise unremarkable pedestrian flow data and come up with useful suggestions and strategies.

The program we designed can also be used for other sensor tracking scenarios such as traffic tracking, body movement, etc.","m1fname":"Yan","projectname":"Pedestrian Tracking for ATC Shopping Mall","m3fname":"Dingyu"},{"m2lname":"Zhang","m4lname":"","m3uni":"yc3121","m1uni":"sr3254","m4uni":"","pid":"201512-68","m2uni":"yz2869","timestring":"Thu Dec 17 21:48:51 2015","m4fname":"","language":"Java, Python, Hadoop, Mahout","m3lname":"Cui","dataset":"We fetched the events available on Eventbrite around New York with a crawler. Each event has attributes such as time, date, cost, address, etc.
We also manually labeled some data for training, because accessing a user's data online requires that user's authentication.","m1lname":"Ren","industry":"Information","analytics":"Web crawler used to fetch data on the Internet

Bayesian model trained with MapReduce procedure which is the core of our recommendation system

K-means clustering method in Mahout used to get the cluster labels for event details

","m2fname":"Yeran","description":"There are numerous events available on Eventbrite. It can be overwhelming for users to choose from them.

Our project provides a recommendation system which can help users make their choice.

This is a content-based recommendation system. It makes recommendations for users based on the attributes of the events they prefer. One of the attributes of an event is its detailed description. We use the cluster label of this description as one feature of the event, which makes our method both model-based and content-based.","m1fname":"Shiwei","projectname":"Content-based Event Recommendation Based on User’s Preference","m3fname":"Yiqing"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cd2698","m4uni":"","pid":"201512-78","m2uni":"","timestring":"Thu Dec 17 21:49:38 2015","m4fname":"","language":"Python, OSX, PostgreSQL","m3lname":"","dataset":"I used the dataset provided with the python package nfldb. This package creates a database of statistics of plays from every NFL game since 2009 and an API for manipulating the data.","m1lname":"Darmetko","industry":"Media","analytics":"KMeans clustering, modified knapsack algorithm, scikit-learn, numpy, various statistical measurements such as variance, standard deviation, quartiles, etc.","m2fname":"","description":"Fantasy football has become very popular in recent years with the advent of Daily Fantasy Sports and the continued huge television ratings of National Football League games and championships. It has spawned numerous million dollar companies that give away millions of dollars to people who compete in their fantasy sports events.

A common issue in fantasy football is understanding the value of players categorized as 'Boom or Bust': players who either score a high number of points in a game or score barely any points. Are they worth the same as someone whose average score is the same, but who doesn’t have as high a maximum range? Is a team full of these types of players better than a team of consistent players over the course of a season due to the chance of any player reaching their ceiling and the height of the ceiling?

This project will attempt to identify, quantify and analyze the worth of these players compared to their counter type: consistent players. It will also attempt to determine the optimal balance between consistent and Boom or Bust players on a fantasy team.","m1fname":"Craig","projectname":"Boom or Bust: The Fantasy Gold Rush","m3fname":""},{"m2lname":"Perez Sanchez","m4lname":"","m3uni":"jc4499","m1uni":"ar3579","m4uni":"","pid":"201512-16","m2uni":"pp2550","timestring":"Thu Dec 17 21:51:52 2015","m4fname":"","language":"Python, JavaScript, Spark","m3lname":"Colomer","dataset":"1. Weather Underground NYC Weather Dataset - We crawled the website to extract the data
http://www.wunderground.com/history/airport/KNYC/\" + year + \"/\" + month + \"/\" + day + \"/DailyHistory.html?format=1

2. NYC OPEN DATA: NYPD Motor Vehicle Collisions - We downloaded it from the below website:
https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95

Our application can support any dataset that contains weather and accident collisions data (after a few modifications)","m1lname":"Roy","industry":"Information","analytics":" 1. Data Preprocessing
a. Filter data: remove irrelevant columns using Python
b. Unify datasets (date timezones) using Python
c. Merge datasets using Python and SQL queries
2. Use Spark to get accident statistics for weather and location
m[x, y, d, w] = # of accidents that happened on location (x,y) on the dth day of the week, with weather condition w.
3. Upload aggregated results to MySQL
4. Feed the data into CartoDB
5. Create web server using Flask (Python)
6. For every request, use the Google Maps API and Wunderground API","m2fname":"Pedro","description":"Our project highlights the areas of the city one should avoid given the weather conditions for today and the next 10 days. To do so, we correlate weather conditions with the probability of a traffic accident happening. This tool helps any NYC resident make a more informed decision about the route to take around the city. Instead of just picking the fastest route, they now have the option to pick the safest one.","m1fname":"Abhijit","projectname":"Accident Prediction System","m3fname":"Juan Pablo"},{"m2lname":"Li","m4lname":"","m3uni":"st2972","m1uni":"mm4700","m4uni":"","pid":"201512-43","m2uni":"tl2493","timestring":"Thu Dec 17 22:00:15 2015","m4fname":"","language":"Spark, Python, HTML5, CSS, Javascript, Flask, Google Place API, Chrome, Firefox.","m3lname":"Tan","dataset":"Realtime chat message and speech recognition words, sentiment list.","m1lname":"Mei","industry":"Life Science","analytics":"Sentiment Algorithm. Recommendation.","m2fname":"Tianlong","description":"The food industry is one of the largest and most vital industries in the world. Thus, food recommendation is one of the most important and popular problems addressed by big data applications. However, the many food recommendation applications now available may not provide integrated information about restaurants. What’s more, people often talk about what they ate or what they want to eat, so we can collect these food-related chats for analysis. The outcome of this analysis can indicate that a person may be interested in a certain food.
Making full use of conversations between people yields valuable information for improving our food recommendations, which motivated us to set up this project.","m1fname":"Mei","projectname":"TalkyFoodie","m3fname":"Shanqing"},{"m2lname":"Yakar","m4lname":"","m3uni":"ksb2153","m1uni":"zj2187","m4uni":"","pid":"201512-57","m2uni":"cy2364","timestring":"Thu Dec 17 22:00:17 2015","m4fname":"","language":"LARSA, OpenBrIM, R, MATLAB, VB.NET, Java, ParamML","m3lname":"Bayer","dataset":"We generated a dataset using LARSA, a software suite for the analysis of bridges. LARSA structures can then be simulated using Finite Element Analysis to model the behavior of the structures in response to an external stimulus, like earthquakes, wind loads, or standard use. Our dataset was generated using a simulated 'train' traveling over the bridge at 60 miles per hour.
The same simulation was run 5863 times for different parameter sets, giving the same number of simulated results, each containing 8 points of interest on the bridge with acceleration values taken at 20 Hz over 5 seconds. The result for 5863 states came to 67.1 MB, and took 2 days on 5 Amazon Web Servers to compute. The dataset will be substantially larger as the number of parameters and parameter variations increase. ","m1lname":"Jin","industry":"Transportation","analytics":"Data Visualization: colormaps generated using Matlab
OpenBrIM for bridge parameterization and visualization
Recommendation Algorithms using Mahout
LARSA for Finite Element Model Visualization and Finite Element Analysis (dataset generation)","m2fname":"Cihat Cagin","description":"In the interests of preventing catastrophic failure in aging bridges, we seek to develop a method to measure and/or monitor the health of a real-world bridge by analyzing its vibration response and comparing it with a simulated response which is generated by Finite Element Analysis.
The current state of the project can generate color maps which illustrate the features of a bridge which are most important to its health. This will allow future researchers to focus in on those aspects, thereby increasing their understanding of bridge safety.
","m1fname":"Ziyue","projectname":"Bridge Health Monitoring","m3fname":"Karl Scribante"},{"m2lname":"Zhao","m4lname":"","m3uni":"yy2641","m1uni":"xl2523","m4uni":"","pid":"201512-10","m2uni":"jz2685","timestring":"Thu Dec 17 22:01:28 2015","m4fname":"","language":"Python,Java,Javascript,SQL,AWS","m3lname":"Yang","dataset":"(1)Stanford Artificial Intelligence Laboratory “Large Movie Review Dataset”
This dataset was used to build up a movie review sentiment classifier
(2)Latest twitter data collected by twitter API
Utilized twitter API to collect 149606 tweets relevant to movies
Applied these tweets to the above mentioned classifier and calculated the rating for each movie
","m1lname":"Lan","industry":"Media","analytics":"Algorithms
Scikit-learn package was used to build:
(1)Naive Bayes Classifier
(2)Linear SVM Classifier
Applied the test data in the “Large Movie Review Dataset” to evaluate the model
The accuracy of Linear SVM is 10% higher
So the Linear SVM Classifier was selected to classify the tweets
Two vectorization algorithms were used to vectorize the words
(1)Count Vectorize algorithm
(2)TF-IDF Vectorize algorithm
TF-IDF Vectorize algorithm performs better
For the Linear SVM classifier
Tried soft margin parameter C from 0.1 to 100
When C=0.5, the model performs best

Visualization
Use Raphael.JS to visualize the result by date and by area. Construct webpage based on AWS
You are welcome to visit our website:
http://twittermovierating.elasticbeanstalk.com/home.jsp
","m2fname":"JIngmei","description":"Project Goals Description:
The commercial movie market has recently been growing larger and larger.
But there are so many movies that it is sometimes hard to choose which ones to watch. Our program benefits many people by providing real-time ratings of movies on social media in their location, even for movies not yet showing!
First, the “Large Movie Review Dataset” was used to train a Linear SVM model to classify positive and negative movie reviews.
Meanwhile, we collect tweets relevant to movies, with geo tag and date, through the twitter API and store the data in RDS.
Secondly, we classify the tweets for each movie and calculate their rating by area and by date.
Finally, we visualize the result on the website to benefit more people.
There are four highlights of our project
(1) Can predict ratings for movies not yet showing, helping people select a movie.
(2) Since people from different areas may like different movies, predicts the rating for a target area
(3) The rating is updated in real time and people can see the trend of a movie's rating
(4) Visualizes the data on the website. Everybody has free access to the website and can use our results to choose a movie to watch.
Actually, I am currently using the results of our project to choose movies.
Try our website:
http://twittermovierating.elasticbeanstalk.com/home.jsp
","m1fname":"Xing","projectname":"Twitter Based Movie Recommendation System","m3fname":"Yao"},{"m2lname":"Yang","m4lname":"","m3uni":"zl2471","m1uni":"pz2209","m4uni":"","pid":"201512-49","m2uni":"ty2313","timestring":"Thu Dec 17 22:05:30 2015","m4fname":"","language":"Java, Mahout, Matlab and R","m3lname":"Luo","dataset":"We downloaded the data from the official Netflix web site.","m1lname":"Zhao","industry":"Media","analytics":"We use controlled-variable analysis to compare different clustering methods and to measure the influence of weights on clustering.
Clustering methods: k-means, fuzzy k-means, canopy clustering and spectral clustering
We use Matlab to generate visual graphs.","m2fname":"Tianchun","description":"In this era of information explosion, data processing and analytics are playing an increasingly important role in companies based on web services.
Netflix prize is a data analytics prize held in 2009, aimed at selecting an efficient user-based movie score prediction model. The Netflix movie rating dataset gives a comprehensive view of audiences' movie preferences before 2005. Clustering and recommendation research on this dataset is highly convincing.
Unlike the customary clustering methods, our project aims to build a more precise clustering model for this Netflix movie dataset. More specifically, a weighted clustering model scales all vector features to the same order of magnitude using an algorithm we created ourselves.
Based on the clustering result, we build a movie recommender which is more efficient than the Mahout recommender models. With a simple application of the OMDB API, the recommender is both efficient and user-friendly.","m1fname":"Pengyuan","projectname":"Movie clustering and recommendation based on Netflix movie rating data","m3fname":"Ziyi "},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"xt2167","m4uni":"","pid":"201512-75","m2uni":"","timestring":"Thu Dec 17 22:20:03 2015","m4fname":"","language":"scikit-learn, C++, Python","m3lname":"","dataset":"The dataset comprises the Form 13F filings from the past year, investor properties, and security fundamentals. All data is sourced from Bloomberg. ","m1lname":"Tan","industry":"Finance","analytics":"- MiniBatchKMeans for clustering, used to discover investment trends from filings data
- Stochastic Gradient Descent for classification of holders vs non-holders for a security, Used to find potential new investors","m2fname":"","description":"There are so many financial instruments and participants in the markets that it's hard for a financial professional to sift through all of the data. Also the data is always changing, creating new trends and opportunities. This situation is ideal for big data analytics since it provides fast analysis of large amounts of data. Below are the two problems that are tackled in this project.

Trend Analysis
- Analysts who cover a sector will do research and report on trends for the quarter or year
- combination of analyzing numbers and reading industry reports
Project: Why not use the investors’ (Institutions, Funds) own filings and infer trends directly?

Investor Search
- When looking for investors for a company, generally look at other companies who invest in similar securities but not the target company
Project: Use the existing investors’ profiles to search for similar investors.

","m1fname":"Xin Luan","projectname":"Investor Trend Analysis and Investor Search using Big Data Analytics","m3fname":""},{"m2lname":"Dang","m4lname":"","m3uni":"","m1uni":"cz2393","m4uni":"","pid":"201512-38","m2uni":"wd2265","timestring":"Thu Dec 17 22:20:53 2015","m4fname":"","language":"Hadoop, Pig Latin, System G, Matlab","m3lname":"","dataset":"We got our dataset from UN dataset directly from the website and it is all public.","m1lname":"Zhan","industry":"Social Science-Government","analytics":"Linear regression
Pig Latin – Filtering, file matching
Various Matlab plotting functions
","m2fname":"Weipeng","description":"Nowadays studying abroad is a global phenomenon - an increasing number of students around the world are studying overseas to pursue higher education, especially in recent years. Since it is part of social life, we want to examine the evolution of this phenomenon and how it is affected by the global and national economies. In addition, we want to forecast the future development of this phenomenon in upcoming years.","m1fname":"Chuan","projectname":"Analysis Between Economy and Students Studying Abroad","m3fname":""},{"m2lname":"jiang","m4lname":"","m3uni":"zl2348","m1uni":"sd2810","m4uni":"","pid":"201512-36","m2uni":"zj2173","timestring":"Thu Dec 17 22:32:26 2015","m4fname":"","language":"Python, Js, Spark, Hadoop Hdfs","m3lname":"lv","dataset":"
Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.","m1lname":"dong","industry":"Media","analytics":"Collaborative Filtering ALS algorithm

Create RDDs for the Books table and the Ratings table, and separate the data into train and test sets. Train a model and evaluate it on the test data to tune parameters for later use.

Get the new user's ratings from the front end, merge them into the Ratings table, and use the complete data to train a new model for that user.

Use the new complete model to predict the new user's rating for each book (excluding books the user has already rated).

Select the top 25 books, keeping only books that have been reviewed more than 10 times.
Return the recommendations to the user (to be rated further).
","m2fname":"zewei","description":"In this era of information explosion, have you ever thought about how many books out there? For a reader, it is actually a pain in back to find a good book to read. This would take time and might end up reading a not good book for a while till realized that it is not a good fit.

Here our App comes to the rescue. It aims to make the best book recommendations for users. To have enough knowledge to recommend properly, we use a huge database consisting of millions of books and user rating information from the Amazon data source. The data is huge, but a recommendation needs to be produced within seconds, which is why this is a real-world big data computation challenge.","m1fname":"shiyu","projectname":"Target Your Next READING : Book Recommender","m3fname":"zixuan"},{"m2lname":"jiang","m4lname":"","m3uni":"zl2348","m1uni":"sd2810","m4uni":"","pid":"201512-36","m2uni":"zj2173","timestring":"Thu Dec 17 22:34:44 2015","m4fname":"","language":"Python, Js, Spark, Hadoop Hdfs","m3lname":"lv","dataset":"
Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books.

BX-Users
Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values.

BX-Books
Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site.

BX-Book-Ratings
Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.","m1lname":"dong","industry":"Media","analytics":"Collaborative Filtering ALS algorithm

Create RDDs for the Books table and the Ratings table, and separate the data into train and test sets. Train a model and evaluate it on the test data to tune parameters for later use.

Get the new user's ratings from the front end, merge them into the Ratings table, and use the complete data to train a new model for that user.

Use the new complete model to predict the new user's rating for each book (excluding books the user has already rated).

Select the top 25 books, keeping only books that have been reviewed more than 10 times.
Return the recommendations to the user (to be rated further).
","m2fname":"zewei","description":"In this era of information explosion, have you ever thought about how many books out there? For a reader, it is actually a pain in back to find a good book to read. This would take time and might end up reading a not good book for a while till realized that it is not a good fit.

Here our App comes to the rescue. It aims to make the best book recommendations for users. To have enough knowledge to recommend properly, we use a huge database consisting of millions of books and user rating information from a public data source. The data is huge, but a recommendation needs to be produced within seconds, which is why this is a real-world big data computation challenge.","m1fname":"shiyu","projectname":"Target Your Next READING : Book Recommender","m3fname":"zixuan"},{"m2lname":"Prumachuk","m4lname":"","m3uni":"sg2665","m1uni":"jcc2267","m4uni":"","pid":"201512-66","m2uni":"jp3495","timestring":"Thu Dec 17 22:36:15 2015","m4fname":"","language":"Ubuntu 15.10 server running on Microsoft Azure cloud virtual machine. Development also done on Mac and Windows PCs.","m3lname":"Guleff","dataset":"Dataset extracted from https://angel.co/api to get data on startup companies.

Crunchbase data hub https://data.crunchbase.com/ could also be used. Any list of people could be used (for example American Medical Association or American Bar Association membership lists).

Data also extracted from LinkedIn profiles using Google search/screen scraping and from non-profit charity homepages which could be found through http://www.charitynavigator.org/ and http://www.guidestar.org/Home.aspx.

Latitude/Longitude from text for geographic display determined using: http://www.findlatitudeandlongitude.com","m1lname":"Correa","industry":"Social Science-Government","analytics":"Platforms:
Neo4J – graph database
Python’s NLTK package – text analytics: top-N word stems
bottle.py – web server interface for Python
PhantomJS/CasperJS – headless browser: web scraping
Excel – csv file formatting and column manipulation

Languages:
Cypher – graph database queries
Python – API data extraction, web server scripting, screen scraping
Javascript – screen scraping events and processing, visualization
Java – test REST API, text file creation

Algorithms:
Building a knowledge graph
Shortest path (to find referrals)
NLTK (to match charity web site keywords with LinkedIn ‘cares about’ causes)

Visualization:
D3
Neo4j data browser
Mapbox – JS based mapping tool for graph visualization
Popoto.js – graph visualization
Chart.js / D3 - Charting
","m2fname":"Janet","description":"Social networking exploration framework for Non-Profits.

Useful for Non-Profits to identify potential business partners for co-marketing, and to identify potential donors from pool of new venture startup founders.

Use Neo4j graph database to analyze social networks to identify nodes, relationships and properties of interest. Queries include: List all people with a particular charitable interest (data extracted from LinkedIn 'cares about'); find shortest path from one person to another (we know one person, how can we use that information to get a referral to another person); what are the top 25 universities for company founders (using LinkedIn education)","m1fname":"John","projectname":"Social/Business Network Analysis for Charitable Fundraising","m3fname":"Sam"},{"m2lname":"Bai","m4lname":"","m3uni":"gl2520","m1uni":"st2957","m4uni":"","pid":"201512-22","m2uni":"hb2479","timestring":"Thu Dec 17 22:39:03 2015","m4fname":"","language":"python, Spark","m3lname":"Liang","dataset":"The dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set.
Data fields:
Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude","m1lname":"Tan","industry":"Social Science-Government","analytics":"Algorithms: Log odds ratio calculation, Artificial Neural Networks, Support Vector Machine, Random Forest, Gradient Boosting Decision Tree.
Modules: PySpark, pandas, scikit-learn, seaborn","m2fname":"Haoyue","description":"This project has two objectives: first, to gain insight into the intrinsic relationships between temporal and geographical information and the category of crimes, using various analytics approaches; second, to apply mainstream open-source data analytics tools in practice in order to discover the relative pros and cons of each tool.

The value of this project lies in three aspects: first, it demonstrates a solution to yet another facet of big data: the complexity of the data, rather than its sheer amount; second, it shows the immaturity of Spark as a general-purpose machine learning library through comparative machine learning experiments; third, it extracts empirical information from visualization as well as insights from machine learning, both of which would be helpful for predicting crime occurrence in San Francisco in the future. ","m1fname":"Sirui","projectname":"San Francisco Crime Classification","m3fname":"Guihao"},{"m2lname":"Zhang","m4lname":"","m3uni":"ly3318","m1uni":"hj2405","m4uni":"","pid":"201512-37","m2uni":"mz2499","timestring":"Thu Dec 17 22:50:59 2015","m4fname":"","language":"C++, Matlab, OpenCV, Python on Ubuntu System","m3lname":"Liu","dataset":"We used a video dataset privately provided by Prof. Wen H. Peng. The dataset we used contains about 700 videos, roughly 50GB in total.","m1lname":"Jin","industry":"Information","analytics":"Core Algorithms:
1. Topic Model (Online LDA)
2. Retrieve-Based Online Clustering
3. CDVS

System Modules:
1. Frame Extractor
2. Descriptor Database
3. Clustering Module
4. Document Parsing Module
5. LDA Training Module","m2fname":"Moning","description":"Our project aims to find an effective and efficient descriptor for video, extending image search to video search. By exploiting the similarity between the structure of videos and text, we use a topic model to summarize the features of a video. Furthermore, we make our algorithm capable of dealing with the scalability issues encountered by real-world online applications. Ultimately, we are able to compare the similarity between different videos.","m1fname":"Hongyi","projectname":"Large-Scale Visual Search","m3fname":"Yang"},{"m2lname":"Li","m4lname":"","m3uni":"yc3113","m1uni":"yo2265","m4uni":"","pid":"201512-23","m2uni":"kl2831","timestring":"Thu Dec 17 22:54:58 2015","m4fname":"","language":"Python, Mahout, Java, MySQL, HTML, CSS","m3lname":"Cao","dataset":"Yelp Dataset Challenge: mainly users, business, and reviews.
Other datasets we can use: any reviews, though they will need some data preprocessing steps.","m1lname":"Ou","industry":"Information","analytics":"Python lda, Python gensim, Stanford CoreNLP, Mahout cvb","m2fname":"Ke","description":"Recommendation is more and more important in modern society.

Review analysis has become a critical reference in recommendation and
business strategies nowadays. Exploring user feedback
can grant us incredible insights.

Given such untapped treasure of resources, we aim at harnessing the
fusion of the review analysis and recommendation, and try to extract
valuable advice for business management.
","m1fname":"Yufei","projectname":"Recommendation based on review analysis","m3fname":"Ye"},{"m2lname":"Prumachuk","m4lname":"","m3uni":"sg2665","m1uni":"jcc2267","m4uni":"","pid":"201512-66","m2uni":"jp3495","timestring":"Thu Dec 17 22:56:09 2015","m4fname":"","language":"Ubuntu 15.10 server running on Microsoft Azure cloud virtual machine. Development also done on Mac and Windows PCs.","m3lname":"Guleff","dataset":"Dataset extracted from https://angel.co/api to get data on startup companies.

Crunchbase data hub https://data.crunchbase.com/ could also be used. Any list of people could be used (for example American Medical Association or American Bar Association membership lists).

Data also extracted from LinkedIn profiles using Google search/screen scraping and from non-profit charity homepages which could be found through http://www.charitynavigator.org/ and http://www.guidestar.org/Home.aspx.

Latitude/Longitude from text for geographic display determined using: http://www.findlatitudeandlongitude.com","m1lname":"Correa","industry":"Social Science-Government","analytics":"Platforms:
Neo4J – graph database
Python’s NLTK package – text analytics: top-N word stems
bottle.py – web server interface for Python
PhantomJS/CasperJS – headless browser: web scraping
Excel – csv file formatting and column manipulation

Languages:
Cypher – graph database queries
Python – API data extraction, web server scripting, screen scraping
Javascript – screen scraping events and processing, visualization
Java – test REST API, text file creation

Algorithms:
Building a knowledge graph
Shortest path (to find referrals)
NLTK (to match charity web site keywords with LinkedIn ‘cares about’ causes)

Visualization:
D3
Neo4j data browser
Mapbox – JS based mapping tool for graph visualization
Popoto.js – graph visualization
Chart.js / D3 - Charting
","m2fname":"Janet","description":"Social networking exploration framework for Non-Profits.

Useful for Non-Profits to identify potential business partners for co-marketing, and to identify potential donors from pool of new venture startup founders.

Use Neo4j graph database to analyze social networks to identify nodes, relationships and properties of interest. Queries include: List all people with a particular charitable interest (data extracted from LinkedIn 'cares about'); find shortest path from one person to another (we know one person, how can we use that information to get a referral to another person); what are the top 25 universities for company founders (using LinkedIn education)","m1fname":"John","projectname":"Social/Business Network Analysis for Charitable Fundraising","m3fname":"Sam"},{"m2lname":"Kong","m4lname":"","m3uni":"js4567","m1uni":"gj2249","m4uni":"","pid":"201512-30","m2uni":"xk2122","timestring":"Thu Dec 17 23:12:00 2015","m4fname":"","language":"pig, spark, hadoop, matlab, javascript, python, jquery","m3lname":"Shen","dataset":"2015-01 NYC yellow cab trip data.
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","m1lname":"Jing","industry":"Transportation","analytics":"K-means Clustering","m2fname":"Xianglu","description":"Cluster and visualize NYC taxi trip data.","m1fname":"Guochen","projectname":"Analysis and Visualization of NYC Taxi Trip Data","m3fname":"Junfei"},{"m2lname":"Wu","m4lname":"","m3uni":"qw2180","m1uni":"zw2289","m4uni":"","pid":"201512-62","m2uni":"yw2682","timestring":"Thu Dec 17 23:13:55 2015","m4fname":"","language":"Python","m3lname":"Wang","dataset":"Yelp data provided by yelp.com. It provides names, locations, reviews, ratings of different shops and users of yelp.com","m1lname":"Wu","industry":"Information","analytics":"Weighted recommendation algorithm,
Natural language processing,
IOS app development","m2fname":"Yi","description":"Many people rely on Yelp to explore new restaurants.

But Yelp always ‘surprises’ us with bad recommendations.

We found that traditional rating and recommendation have the following limits:
No personalization.
Dummy users and fake reviews.
Extreme opinions have too much influence.

We decide to come up with a new rating and recommendation system that solves the problems above and gives more accurate advice.

We look through the past experience and ratings of Yelp users and find people with similar taste. Those people's advice is valued more than other people's.
We also look into people's past reviews to find out what they really care about, and make recommendations based on that.
","m1fname":"Zuyi","projectname":"Personalized Restaurant Recommendation based on Yelp data","m3fname":"Qianbo"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ks3184","m4uni":"","pid":"201512-52","m2uni":"","timestring":"Thu Dec 17 23:14:40 2015","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"Data was obtained from a Chinese website: http://www.zhaopin.com/
","m1lname":"Shen","industry":"Information","analytics":"CRF, HMM, DP, LDA, GBM","m2fname":"","description":"1. Most job search websites are using keyword searching to help find desired job
2. Users are becoming more and more impatient when they visit websites and cannot find the desired jobs in an immediate way
3. Find a way to recommend suitable jobs for a given applicant based on resume information
","m1fname":"Ke","projectname":"Job Recommandation System","m3fname":""},{"m2lname":"Phadke","m4lname":"","m3uni":"","m1uni":"pat70","m4uni":"","pid":"201512-72","m2uni":"mp3212","timestring":"Thu Dec 17 23:19:55 2015","m4fname":"","language":"Languages : Scala, Java, JavaScript . Platforms : Spark Streaming, Kafka, Vert.x, Hadoop{CDH}, AngularJS UI Grid .","m3lname":"","dataset":"Twitter streaming API filtered by test holdings and their associated components.
Simulated Trade and Market data based on Yahoo data set.","m1lname":"Thatte","industry":"Finance","analytics":"Stanford-CoreNLP's model for RNN (Recursive Neural Network) classification algorithms.
Custom Model for Scoring Sentiments.
","m2fname":"Manjiri","description":"We have built a live ticking UI and compute platform using open-source software and commodity infrastructure to handles the three Vs of Big Data Analytics.

This single calculation platform runs operations in parallel on high-volume information such as Transactional and Market data and News feeds for all hosted portfolios and user accounts.

It ingests high-velocity data such as Twitter feeds and Trade or Market Data events, runs calculations on the compute engine, and emits delta notifications keyed for targeting peers with registered interest. Events are routed to a live grid capable of ticking data changes and flexible enough to dynamically add data points to its columnar layout.

It uses a custom model for classifying high-variety micro-batches of tweets and adjusting component weights based on whether a holding is driving a wider sentiment, bucking it, or will be affected by the wider sector or market it belongs to.
","m1fname":"Paresh","projectname":"Active Sentiment Analysis And Calculations For The Live Portfolio","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"yx2318","m4uni":"","pid":"201512-6","m2uni":"","timestring":"Thu Dec 17 23:20:25 2015","m4fname":"","language":"Java ,OS X","m3lname":"","dataset":"
The dataset I am using is the Book-Crossing Dataset, published by a German university.
It is public and can be obtained from:
http://www2.informatik.uni-freiburg.de/~cziegler/BX/
","m1lname":"Xu","industry":"Retail","analytics":"Item-based collaborative filtering, User-based collaborative filtering
Singular Value Decomposition
Content-based collaborative filtering using TF-IDF ","m2fname":"","description":"Objectives: Build a book recommender system.
Innovations: Apart from basic functionalities and basic algorithms, I wrote a TF-IDF algorithm from scratch which analyses book titles and recommends books based on natural-language similarity.
","m1fname":"Yingtao","projectname":"Byou: Book for you Recommender System","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"sl3773","m1uni":"xl2493","m4uni":"","pid":"201512-44","m2uni":"zl2438","timestring":"Thu Dec 17 23:22:19 2015","m4fname":"","language":"Python, C++, Pig, OpenCV","m3lname":"Lin","dataset":"Our dataset basically can be any set of pictures with location labels as their filenames. For our test purpose, we constructed a dataset including several dozens of photos of places of interest all over the world. For the practical use, we used existent services like Google Street View to obtains varieties of pictures with location labels.
","m1lname":"Liu","industry":"Media","analytics":"Image Preprocessing:
For each picture in database, we compress it into the size of 200*200 pixels using thumbnail function in Python Imaging Library.

Image Abstraction:
OpenCV is used to generate the features of each picture. For each feature, we use 20 hash functions to hash the feature into 20 different hash keys. We restrict the number of features per picture to at most 100 to avoid meaningless features. Suppose a picture has 95 features; the picture is then represented in our database file (database.csv) as 95*20 hash keys.
We abstract the pictures uploaded by the user in the same way and store the result in query.csv.

Image Match:
We implemented the search process by matching pictures with a Pig Latin script.
After extracting features for each picture in our database and for the pictures uploaded by the user, we let Pig process database.csv and query.csv to find the matching pictures.
Pig first generates hash tables for the database and the query. To do this, Pig groups the images by hash key, so that images with the same hash key are all put into the same bin. Since we hash each feature with 20 hash functions, we have 20 hash tables for both the database and the query. We then join the query with the database to find pictures that share features, and sort these pictures by the number of shared features. The top three pictures are returned as the most similar results.
","m2fname":"Zhengrong","description":"Our app aims at recommending some places to visit based on user’s photo album. The information of photo album is useful. It reveals user’s travel reference (Classic, Modern, Natural, Arts...) For example, if a user has a lot of pictures about mountains, then he/she might also want to visit other famous mountains. Or if a user has a lot of pictures about domes, then he/she might also want to visit places like United States Capitol, or Taj Mahal. Our app uses photo similarity to find pictures similar to user’s album and then recommend new places based on the new pictures found.
","m1fname":"Xingying","projectname":"Trip Advisor Based on Image Recognition","m3fname":"Sen"},{"m2lname":"Dosoudil","m4lname":"","m3uni":"wz2333","m1uni":"jgw2128","m4uni":"","pid":"201512-1","m2uni":"jd3225","timestring":"Thu Dec 17 23:24:20 2015","m4fname":"","language":"Java (visualizer, API caller), node.js (Job Automator backend), HTML/Javascript (Job Automator Front End), mongoDB (Job Automator database), Hive (StubHub data database)","m3lname":"Zhou","dataset":"We used the stubhub API to pull ticket price information for events listed in the city. The API will pull all recent data, so the data set is constantly changing. You can find the documentation here:

(need to make an account)

https://developer.stubhub.com/store/

Our code can take any data that reports prices in the same format as the StubHub API.

The Job Automator can run any command line job at all, making it very general-purpose and powerful.","m1lname":"Wood","industry":"Retail","analytics":"Visualization was done with java built in GUI table packages.

SQL sorting algorithms were used to sort through the table.

node.js along with express, socket.io and mongodb libraries were used for the job automator backend.

Hive was used for the big data storage (stub hub data)","m2fname":"Jake ","description":"Problems Faced:

Ticket Pricing:
---------------
-The price of all event tickets in the world is a big data problem
-Many common big data platforms offer little in the way of data visualization
-There are few services that allow you to find sudden drops in price for all tickets from any vendor

Job Automation:
------------------
-There are very few general, powerful command line job automators
-Linux's Cron only allows jobs to be run at certain times (not as dependent on other jobs)
-Linux's Cron lacks a user interface
-Oozie has MANY compatibility issues
-Oozie can only run Hadoop jobs and is difficult to configure

Our Solution:
--------------
-Code that references the StubHub API to pull ticket price information for all events in the city (can pull from infinitely many events, but we kept it smaller due to limited computing power)
-Uses Hive to store the data
-Visualizer allows searching and sorting amongst results with a sudden price drop
-A job automator that can set up these jobs on the command line to run at regular intervals
-The job automator can run ANY kind of jobs and offers an easy to use UI","m1fname":"Jake","projectname":"Automated Ticket Price Drop Reporting","m3fname":"Weiyi (Anne)"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ll2698","m4uni":"","pid":"201512-5","m2uni":"","timestring":"Thu Dec 17 23:26:21 2015","m4fname":"","language":"Python, MongoDB, JavaScript, HTML, CSS","m3lname":"","dataset":"This project aims to discover insights from the college scorecard project data provided by the US department of education. While the workflow is mostly specifically oriented towards this data set, the tools and the framework are applicable to any projects that call for interactive web-based exploration on a large data set.","m1lname":"Liu","industry":"Social Science-Government","analytics":"MongoDB is used as the database due to its support of JSON-like documents that allow missing values and lack of a schema. Python is used to perform the data analysis and machine learning tasks, with the Numpy, Pandas and Scikit-learn libraries providing fast computations. Flask, another Python library, is used to construct the web application. D3.js, crossfilter.js and dc.js are JavaScript libraries used to enable interactive data exploration. The Bootstrap framework is used to structure the web application.

In terms of machine learning, spectral clustering with a self-implemented similarity measure (Gower dissimilarity) is used to find relationships between colleges using mixed types of data. Adaptive Boosting with decision stumps performs the regression task.","m2fname":"","description":"A college education is pivotal to one's professional and personal development in the US. However, it can also be expensive and remains a difficult decision for students and their families. This project aims to provide a web-based interactive exploration tool to understand the cost and return of college education using public data. Various statistics on historical data are presented and a tool to discover similar colleges and predict cost, debt, and earnings when attending colleges with specific features is also accessible on the web application. ","m1fname":"Lian","projectname":"Cost and Return of College Education in the US","m3fname":""},{"m2lname":"Yu","m4lname":"","m3uni":"zj2195","m1uni":"cx2179","m4uni":"","pid":"201512-51","m2uni":"yy2636","timestring":"Thu Dec 17 23:26:52 2015","m4fname":"","language":"Python, System G","m3lname":"Jiang","dataset":"We use Scrapy, an open source web crawler framework, to write the crawler and crawl the link data from the website. The crawler is written in Python.

The crawler starts crawling from the homepage of the university (http://www.columbia.edu/) and saves the link information into a csv file.
","m1lname":"Xu","industry":"Information","analytics":"Centralities(degree, closeness and betweenness)

Pagerank

Communities(Clustering coefficient)
","m2fname":"Yue","description":"The IBM System G is a comprehensive set of graph computing tools. Its key feature compared with the traditional analytic systems is that it is designed to deal with the data linked with each other. And the data on the Internet especially the links on a website fit this feature very well.

We think using a new tool to do analytics on the university’s own website is fun. As few data analytics have been done on it, we can show the structure of the university's website and learn more about our university this way.
","m1fname":"Chen","projectname":"Data Visualization and Analytics of Columbia’s Website Based on IBM System G","m3fname":"Zhongzhu"},{"m2lname":"Kong","m4lname":"","m3uni":"js4567","m1uni":"gj2249","m4uni":"","pid":"201512-30","m2uni":"xk2122","timestring":"Thu Dec 17 23:30:51 2015","m4fname":"","language":"pig, hadoop, spark, javascript, jquery, matlab","m3lname":"Shen","dataset":"nyc 2015-01 yellow cab trip data.
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","m1lname":"Jing","industry":"Transportation","analytics":"K-means clustering","m2fname":"Xianglu","description":"Analyze and visualize nyc taxi trip data.
To make recommendation to nyc taxi drivers about popular places to pick up passengers on the coming new year's day.","m1fname":"Guochen","projectname":"ANALYSIS AND VISUALIZATION OF NYC TAXI TRIP DATA","m3fname":"Junfei"},{"m2lname":"Vysyaraju","m4lname":"","m3uni":"pm2824","m1uni":"sc3973","m4uni":"","pid":"201512-56","m2uni":"scv2114","timestring":"Thu Dec 17 23:31:54 2015","m4fname":"","language":"MySQL, Node JS, Express Framework, JavaScript, CSS ,Spark, Hadoop, Python, HTML, Amcharts","m3lname":"Matey","dataset":"We extracted around 50000 tweets relevant to US Presidential Elections 2016. We trained our classifier using well known set of pre-classified sentiment data.

The dataset itself is not published, since Twitter does not explicitly allow developers to release tweet data to the public.

Our system supports any tweets in JSON form.","m1lname":"Choudhary","industry":"Information","analytics":"Analytics:
1) Naive Bayes Classifier has been used for generating the sentiment for each tweet.
2) Content based tweet classification has been performed by normalising each feature data over each class and then the normalised feature parameters are used to establish the score for each tweet by finding the aggregate of the 6 features. In order to identify the anomalies/peculiarities, a hard threshold of 0.85 has been established on the score for each tweet.
Algorithms:
1) Naive Bayes Classification
Visualization:
1) We visualised aggregate sentiment for each class in a bar chart format.
2) For generating Heat Maps and Location based visualisation, we have used Google Maps API.
3) We use amcharts as a visualization tool for the tweeting temporal behavior. ","m2fname":"Sarat Chandra","description":"Objective: Our idea is to integrate spatial (location based) and temporal (time based) trends to establish a correlation between the two classes of analysis. A lot of these trends originate from a single source which are often important to localize and are an interesting problem to investigate.

Innovations: We used features relevant to each tweet and tried to establish a score reflecting the anomaly/peculiarity about it so as to identify the source of these trends.

Capabilities: The project designed is independent of the topic of analysis and can be easily used to perform analysis on any trending topic. Though we have used Naive Bayes Classifier for sentiment analysis it can be easily swapped with any other API/classifier. Since our database is distributed (on Amazon EC2) it can be accessed easily for analysis. The project design is modular, so our user interface does not depend on backend which it only uses for query.","m1fname":"Shivam","projectname":"Visualization of Spatial Temporal Patterns of Tweeting Behavior","m3fname":"Palash Sushil"},{"m2lname":"Li","m4lname":"","m3uni":"yc3113","m1uni":"yo2265","m4uni":"","pid":"201512-23","m2uni":"kl2831","timestring":"Thu Dec 17 23:34:37 2015","m4fname":"","language":"Python, Mahout, Java, MySQL, HTML, CSS","m3lname":"Cao","dataset":"Yelp Dataset Challenge: mainly users, business, and reviews.
Other datasets we can use: any reviews, though they will need some data preprocessing steps.","m1lname":"Ou","industry":"Information","analytics":"Python lda, Python gensim, Stanford CoreNLP, Mahout cvb","m2fname":"Ke","description":"Recommendation is more and more important in modern society.
Review analysis has become a critical reference in recommendation and business strategies nowadays. Exploring user feedback can grant us incredible insights.
Given such untapped treasure of resources, we aim at harnessing the fusion of the review analysis and recommendation, and try to extract valuable advice for business management.","m1fname":"Yufei","projectname":"Recommendation based on review analysis","m3fname":"Ye"},{"m2lname":"Li","m4lname":"","m3uni":"cl3391","m1uni":"xd2169","m4uni":"","pid":"201512-11","m2uni":"yl3390","timestring":"Thu Dec 17 23:38:06 2015","m4fname":"","language":"Java, Pig Latin, Matlab, SQL, Python, Google Cloud Platform","m3lname":"Liu","dataset":"Our data set is combined of three parts with the source dataset chosen from Open Data NYC.

(1) The first part is:
Green Taxi Trip (https://data.cityofnewyork.us/view/n4kn-dy2y)
Yellow Taxi Trip Data(https://data.cityofnewyork.us/view/ba8s-jw6u)

(2) The second part is:
Assistance_Trained_Data(https://data.cityofnewyork.us/Transportation/Medallion-Drivers-Passenger-Assistance-Trained/td5q-ry6d )

(3) The third part is:
Databases we created after applying filtering, clustering and recommendation to the data from the first two parts.","m1lname":"Duan","industry":"Transportation","analytics":"(1) Tools
Hadoop, Mahout, Eclipse, Pig, Matlab, Google Cloud, Sitebuilder, Tableau

(2) Algorithms
*Recommendation: item-based similarity measurement
*Filter: Collaborative Filtering
*Clustering: k-means
*Query : Google BigQuery","m2fname":"Yunzhe","description":"NYC is a highly trafficked city where people value time and efficiency, and taxis play a vital role in it. We therefore want to provide useful suggestions and analysis to help both passengers and drivers make their trips more efficient and convenient.

Through big data analysis, our project provides the following functions:
(1) Driver-Based
Driver license with Disabled Service expiration date reminder
Time based popular pick up location recommendation
Maximize driver’s profit

(2) Passenger-Based
Trip fare and time estimation
Tip amount recommendation
Popular boarding location recommendation
Disabled Service request
Maximize the probability of getting a taxi in a hurry

(3) Highlight
Real-time analytics for taxi drivers and passengers based on tweets
Customized visualization analytics results about taxi trips for passengers and drivers
Establishing multi-angle, all around online open platform for public to lead a better life in NYC","m1fname":"Xiaonan","projectname":"Passenger-and-Driver-Based Analytics of NYC Taxi Database","m3fname":"Changtai"},{"m2lname":"Jing","m4lname":"","m3uni":"js4567","m1uni":"xk2122","m4uni":"","pid":"201512-30","m2uni":"gj2249","timestring":"Thu Dec 17 23:39:40 2015","m4fname":"","language":"Pig, Spark, Javascript, JQuery, Python","m3lname":"Shen","dataset":"http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","m1lname":"Kong","industry":"Transportation","analytics":"K-means","m2fname":"Guocheng","description":"Find the peak traffic hour in a day using Pig. Visualize the NYC taxi trip in order to find the most popular pick-up and drop-off location. And we use Spark to programmatically calculate the clusters and compare the results with the visual analysis.","m1fname":"Xianglu","projectname":"Analysis and Visualization of NYC Taxi Trip Data","m3fname":"Junfei"},{"m2lname":"Ge","m4lname":"","m3uni":"pz2210","m1uni":"xz2461","m4uni":"","pid":"201512-41","m2uni":"jg3635","timestring":"Thu Dec 17 23:40:49 2015","m4fname":"","language":"Python, HTML/CSS, Javascript, Google Map Javascript API","m3lname":"Zhou","dataset":"We used the NYPD Motor Vehicle Collisions dataset from NYC Open Data website: https://data.cityofnewyork.us/Public-Safety/collision/bpv4-gfc4","m1lname":"Zhang","industry":"Transportation","analytics":"Algorithms: Variational Inference(VI) for Gaussian Mixture Model for numerical data analysis and Latent Dirichlet allocation for topic modeling.

Visualization: We used the Google Maps JavaScript API to create an interactive traffic accident map that pinpoints where and when accidents happened and flags particularly dangerous areas. The map is shown on our website, where users can check different accident types and detailed information.

","m2fname":"Jimin","description":"Objectives: Our objective is to analyze the NYC Motor Vehicle Collision dataset to cluster the types of traffic accidents and create a crash map that helps people learn about traffic accidents and explore their causes. Because the dataset is large, the analysis is best done with Big Data analytics tools.
Innovations: Traffic accident information is usually presented as statistical graphs or files. We loaded the processed dataset into Google Maps so that users can interact with it and learn about the causes of accidents.
Capabilities: By creating a traffic accident map that pinpoints where and when accidents happen and flags particularly dangerous areas, our system could help people learn about and prevent potential traffic accidents.
The Importance: Our project could help drivers learn about traffic accidents and help government explore the causes of collisions, to avoid potential losses.
","m1fname":"Xiaowen","projectname":"Analysis of Motor Vehicle Accident in NYC","m3fname":"Peiran"},{"m2lname":"Albahar","m4lname":"","m3uni":"hsp2112","m1uni":"syk2133","m4uni":"","pid":"201512-70","m2uni":"haa2139","timestring":"Thu Dec 17 23:41:08 2015","m4fname":"","language":"Scala, Pig Latin / Platforms: Mac OS X Yosemite","m3lname":"Powar","dataset":"https://data.cityofnewyork.us/Environment/Public-Recycling-Bins/sxx4-xhzg)
• DSNY's Refuse and Recycling Disposal Networks- For each Community District, the name and address of the location where Refuse, Paper, and Metal/Glass/Plastic collected in that district are disposed of under normal operating circumstances (https://data.cityofnewyork.us/City-Government/DSNY-s-Refuse-and-Recycling-Disposal-Networks/kzmz-ivhb)
• Special Waste Drop-off Sites- Addresses and coordinates of each of 5 special waste drop-off sites where New York City residents can dispose of automotive batteries, motor oil, oil filters, passenger car tires, transmission fluids, fluorescent light bulbs, thermostats, household batteries, and latex paint (https://data.cityofnewyork.us/Environment/Special-Waste-Drop-off-Sites/a34j-ihvy)
• Recycling Diversion and Capture Rates- For each Community District, its Recycling Diversion Rate (percentage of total municipal solid waste collected by the Department of Sanitation (DSNY) that is disposed of by recycling) and Capture Rate (percentage of total Paper or Metal/Glass/Plastic in the waste stream that is disposed of by recycling) (https://data.cityofnewyork.us/Environment/Recycling-Diversion-and-Capture-Rates/gaq9-z3hz)
• Population by community district in NYC: https://data.cityofnewyork.us/City-Government/New-York-City-Population-By-Community-Districts/xi7c-iiu2
• DSNY's record of monthly waste tonnage: https://data.cityofnewyork.us/City-Government/DSNY-Monthly-Tonnage-Data/ebb7-mvp5
","m1lname":"Kumar","industry":"Social Science-Government","analytics":"Tools: Apache Hadoop, Apache Spark, Texas A&M GeoServices, Microsoft Excel

Visualization: Tableau,CartoDB, GoogleMaps API

Analytics: K-means clustering

Project shall be uploaded on Weebly (bigdatawastemanagement.weebly.com)
","m2fname":"Hadeel","description":"Waste Management has always been a topic of discussion since the outburst of technology ensued by population rise and improper infrastructure. In this project, we are trying to advance New York City’s goals to reduce landfill waste by encouraging every New Yorker to audit their waste to green their homes and workplaces.

This shall be achieved by the following:
• Providing heat maps to identify areas which are saturated with recycling bins and those which are underserved and could be improved
• Analyze the density of recycling bins relative to the population in each area
• Analyze the datasets over various parameters in any region of NYC to determine which areas typically generate a lot of waste and which areas recycle which kind of waste (paper, reuse, MGP) more effectively
• Visualize the clustering results using choropleth heat maps overlaid on Google Maps
• Suggest to DSNY some ways to achieve the goal of zero waste
","m1fname":"Shreya","projectname":"Waste Management using Big Data","m3fname":"Harnoor Singh"},{"m2lname":"Chen","m4lname":"","m3uni":"hs2874","m1uni":"sz2540","m4uni":"","pid":"201512-46","m2uni":"gc2665","timestring":"Thu Dec 17 23:43:10 2015","m4fname":"","language":"Python, MySQL","m3lname":"Sun","dataset":"The dataset is from Reddit and Amazon. It is a txt file containing comments separated by lines. The dataset is uploaded to MySQL database.","m1lname":"Zhao","industry":"Information","analytics":"First, we apply the Cosine Similarity algorithm to each sentence pair in the dataset to measure the overlap. We use this algorithm to tokenize word in the sentences. Stopwords are filtered first (not in phrasal overlap) and common words have low weight in measuring the overlap. Square root is used to reduce the effect of the long-sentence to the whole distribution, and the length of target text is controlled by identifying the direct.

Second, we use the PageRank algorithm to determined the “importance” of each sentences in the dataset, i.e., its overlap with all the other sentences. We take sentences as “pages” in the PageRank algorithm, and similarity is used to weight the “link” (overlap) between each sentence pair in the iterating computation of the PageRank score (the level of “importance”) of each sentence. The top sentences (“important”) share more overlap with more sentences in the dataset.","m2fname":"Guangshi","description":"The goal of this project is to extract hot topics from social network. There are lots of companies that do business with or use social network to make profits. As trillions of comments produced everyday, this software could be useful to a large portion of companies. Social network companies such as Twitter, Facebook, and Reddit, are generating tons of topics everyday. This software is able to rank the hot topics from big data. A portion of data can be used to train classifiers that can predict which issue is more important. ","m1fname":"SIhan","projectname":"Hot Issue Extractor","m3fname":"Haitian"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sg662","m4uni":"","pid":" 201512-76 ","m2uni":"","timestring":"Thu Dec 17 23:44:24 2015","m4fname":"","language":"C#, Scala, Java","m3lname":"","dataset":"The live production data set is proprietary to a client. However, a test data set will be uploaded and anyone can freely insert any other dataset. The only requirement is that each document (email, post, etc.) be in a separate file and that all files reside in one directory.","m1lname":"Gabor","industry":"Information","analytics":"Spark MLib (k-means classification) was used for this project. In addition, text preprocessing algorithms for lemmatization and stop word removal were utilized.","m2fname":"","description":"Background Information:

A typical customer service organization can receive many free-form support requests emailed to a designated mailbox. Free-form requests received in this manner need to be reviewed, categorized and prioritized by support staff. The process can be very time consuming for a service organization which routinely receives hundreds or thousands of emailed requests per day.

Proposed Solution:

Implement a support-request classification system that uses machine learning algorithms to automatically classify incoming requests for swifter handling by specialized groups of support professionals. This divide-and-conquer approach can help reduce an organization’s single large support request queue to many smaller queues for faster processing by distributed groups of support technicians.
","m1fname":"Sam","projectname":"Achieving Greater Efficiency Using Machine Classification of Support E-Mails ","m3fname":""},{"m2lname":"Perry","m4lname":"","m3uni":"tra2116","m1uni":"ab3955","m4uni":"","pid":"201512-42","m2uni":"cp2824","timestring":"Thu Dec 17 23:47:22 2015","m4fname":"","language":"Javascript, HTML, Google Distance Matrix API","m3lname":"Ali","dataset":"We tested the Inpatient Prospective Payment System Provider Summary for the Top 100 DRGs. We found this dataset through searching the web. Our platform can support larger datasets that follow the same format as this one.","m1lname":"Bhargava","industry":"Life Science","analytics":"We used sorting and recommendation algorithms.","m2fname":"Caleb","description":"We created a tool for patients to know their medical expenses prior to receiving care. Our platform sorts hospitals by cost, analyzes cost for diagnosis-related groups and lists the distance for a selected hospital. We used Javascript and HTML to design our project.","m1fname":"Anubha","projectname":"Hospital Charge Data Analysis","m3fname":"Turab"},{"m2lname":"Liu","m4lname":"","m3uni":"sw3024","m1uni":"rw2612","m4uni":"","pid":"201512-63","m2uni":"yl3399","timestring":"Thu Dec 17 23:47:56 2015","m4fname":"","language":"Java, Python, JavaScript, Eclipse","m3lname":"Wang","dataset":"The dataset we use is Yelp dataset challenge. And I have download it from Yelp website.","m1lname":"Wang","industry":"Retail","analytics":"User-based recommendation;
KMeans clustering;
Google Map visualization","m2fname":"Yuyang","description":"1. The traditional recommendation system in Yelp is based simply on ratings and is not tailored to specific customers. Our project is designed to give recommendations based on customers’ own preferences.
2. Our project visualizes the recommendation results on a map, using map APIs to make them clearer and more engaging.

","m1fname":"Ruoqi","projectname":"Map-based Restaurant Recommendation","m3fname":"Siyu"},{"m2lname":"Cheng","m4lname":"","m3uni":"","m1uni":"hd2342","m4uni":"","pid":"201512-33","m2uni":"bc2651","timestring":"Thu Dec 17 23:48:35 2015","m4fname":"","language":"hadoop,mahout,spark,matlab","m3lname":"","dataset":"The heights,weights,gpa,age,gender of columbia students.
Collected through a sample survey and through social media.
Any numeric dataset can be supported.","m1lname":"Du","industry":"Information","analytics":"Expectation Maximization Algorithm, Variational Inference, K-means, Probit Regression.","m2fname":"Baokun","description":"To gain a better and deeper understanding of the algorithms we learned in class.
To implement these algorithms ourselves.
To analyze and predict the figures of Columbia University students, and to find correlations between parameters.
","m1fname":"Hanyi","projectname":"ANALYZING AND PREDICTION OF COLUMBIA STUDENTS' FIGURES","m3fname":""},{"m2lname":"Cheng","m4lname":"","m3uni":"","m1uni":"hd2342","m4uni":"","pid":"201512-33","m2uni":"bc2651","timestring":"Thu Dec 17 23:48:44 2015","m4fname":"","language":"hadoop,mahout,spark,matlab","m3lname":"","dataset":"The heights,weights,gpa,age,gender of columbia students.
Collected through a sample survey and through social media.
Any numeric dataset can be supported.","m1lname":"Du","industry":"Information","analytics":"Expectation Maximization Algorithm, Variational Inference, K-means, Probit Regression.","m2fname":"Baokun","description":"To gain a better and deeper understanding of the algorithms we learned in class.
To implement these algorithms ourselves.
To analyze and predict the figures of Columbia University students, and to find correlations between parameters.
","m1fname":"Hanyi","projectname":"ANALYZING AND PREDICTION OF COLUMBIA STUDENTS' FIGURES","m3fname":""},{"m2lname":"Morgan","m4lname":"","m3uni":"lnv2107","m1uni":"sp3290","m4uni":"","pid":"201512-53","m2uni":"jem2268","timestring":"Thu Dec 17 23:49:30 2015","m4fname":"","language":"Apache Spark, OpenCV, Python","m3lname":"Valdivia","dataset":"-The Color Facial Recognition Technology (FERET) Database. Used to obtain Frontal faces training sample.

-Caltech Background Dataset. Used to obtain non-face (negative) training samples.

-CMU/VASC Image Database. Image Files for Test Set C/Image Files for the Rotated Test Set were used for testing purposes.

-Labeled Faces In the Wild. A database of face photographs designed for studying the problem of unconstrained face recognition. The data set contains more than 13,000 images of faces collected from the web. Each face has been labeled with the name of the person pictured. Used to train eigenfaces and recognition on well-known people.","m1lname":"Paterakis","industry":"Information","analytics":"1. Viola Jones Implementation: Weak learners based on Haar type features, used as part of an AdaBoost Cascade Classifier

2. Convolutional Neural Networks: A single, multi layered neural network using local connectivity, weight sharing and pooling.

3. Principal Component Analysis - Eigenface dimensionality reduction.

4. Support Vector Machines: Classification for the purpose of image recognition (input is the output of PCA).","m2fname":"Justine Elizabeth","description":"With the widespread use of smartphones, millions of photos are uploaded every day to cloud storage.
Commercial applications that are using face detection for the purposes of access control, identification systems, surveillance, social media \"tagging\" etc., have so far only scratched the surface of these data sets.

The objective of our project is to explore two algorithms with fundamentally different approaches to face detection and examine the tradeoff between computational efficiency and robustness to variations in pose, angle and lighting.

The Viola Jones implementation is the benchmark for real time facial detection. Meanwhile, Neural Networks have a very high degree of expressiveness to allow for the aforementioned challenges. Performance of both of these algorithms is highly correlated to the size of data on which they are trained. We expanded this project further by combining detection with a recognition algorithm using PCA for dimension reduction and an SVM for classification.","m1fname":"Stamatios","projectname":"Face Detection","m3fname":"Lauren Nicole"},{"m2lname":"Hsu","m4lname":"","m3uni":"mc4107","m1uni":"lt2590","m4uni":"","pid":"20151225","m2uni":"ch3141","timestring":"Thu Dec 17 23:50:59 2015","m4fname":"","language":"Python, HTML5, CSS, Google App Engine ","m3lname":"Chu","dataset":"Yelp Dataset Challenge
•1.6M reviews and 500K tips by 366K users for 61K businesses
•481K business attributes, e.g., hours, parking availability, ambience.
•Social network of 366K users for a total of 2.9M social edges.
•Aggregated check-ins over time for each of the 61K businesses","m1lname":"Tsai","industry":"Information","analytics":"Tool:
- Backend : Google App Engine
- Frontend : HTML5, CSS
- Third party API: Yelp Search API
- Big Data Tool : Spark
- Package : lda, gensim, nltk, WordNet (Princeton University)
Algorithm:
- LDA for topic modeling
- Map Reduce for word count
- WordNet Augmented Keyword Matching
","m2fname":"Chia-Hao","description":"Find out what user really cares about from their low rating reviews. We use topic modeling (LDA) to define many topics words and use Map Reduce to find high frequency words to match the important words in the text. And use an inverse mapping of these keywords to generate searchable keyword to recommend restaurant to the user.","m1fname":"Liang-Chun","projectname":"Reverse Recommendation on Yelp","m3fname":"Ming-Ching"},{"m2lname":"Gopisetty","m4lname":"Chopda","m3uni":"sa3205","m1uni":"jsc2226","m4uni":"jjsc2253","pid":"201512-13","m2uni":"ssg2147","timestring":"Thu Dec 17 23:51:17 2015","m4fname":"Jayni","language":"JavaScript, Node.js, HTML5, CSS3, R, SQL, AWS (DB, SQS)","m3lname":"Ahmed","dataset":"Twitter Streaming API. With both geo tagged and keyword search (tracks both words in the tweet and hashtag). Used node module to implement the above and novel code to extract it. ","m1lname":"Chhatwal","industry":"Social Science-Government","analytics":"Tools: D3.js, WordCloud, HeatMap, Co-occurence graphs, amCharts
Algorithms: Semantic, sentiment and graph analysis","m2fname":"Sanjana","description":"To analyze the ongoing trends using Twitter Data
Observe the popularity of candidates at different geographical locations
Determine public opinions through sentiment analysis
Categorise the user profiles tweeting about political elections
Identify popular hashtags and phrases/words used in tweets about a particular candidate
Measure the relevance of phrases/words used in tweets about candidates
Visualise how often the candidates are talked about together
Determine the concepts discussed about each candidate on social media

Importance:
Can help campaign runners figure out the public opinions for candidates
Monetization based on insights from general public
Helps campaign runners to identify the most frequently used words for their campaign
Will allow us to identify results based on our location of interest
Improve usability through an enhanced look and feel of the interface
","m1fname":"Jivtesh","projectname":"Twitter Analytics on the 2016 US Presidential Elections","m3fname":"Saad"},{"m2lname":"Pan","m4lname":"","m3uni":"kc2980","m1uni":"xl2494","m4uni":"","pid":"201512-19","m2uni":"hp2414","timestring":"Thu Dec 17 23:52:47 2015","m4fname":"","language":"Java, Python, Mahout","m3lname":"Chen","dataset":"API from Twitter by authorization, LinkedIn profiles","m1lname":"Li","industry":"Information","analytics":"K-Means
Spark
TF-IDF
System-G map","m2fname":"Haowen","description":"This project is the foundation of an algorithm for better social mobile applications. For years every of the apps has certain disadvantages on its own. One might be good at linking people together with Facebook, but not anything after a match; the other may be able to match people with similar hobbies and tendency, but none of the real information. This project involves one's profile in his/her will from LinkedIn and Twitter, thus pairing or clustering people in similar business, education level, and interest group via their professional profiles and Tweets. This is the algorithm of one of a kind, which is versatile enough to cover every detail in characteristic and professional track that best links people together.

With this function people do not need necessarily download multiple apps for social life, it can thusly be a very basic tool on device.","m1fname":"Xuran ","projectname":"Find People Just Like You!","m3fname":"Kun"},{"m2lname":"Hsu","m4lname":"","m3uni":"mc4107","m1uni":"lt2590","m4uni":"","pid":"201512-25","m2uni":"ch3141","timestring":"Thu Dec 17 23:54:17 2015","m4fname":"","language":"Python, HTML5, CSS, Google App Engine ","m3lname":"Chu","dataset":"Yelp Dataset Challenge
•1.6M reviews and 500K tips by 366K users for 61K businesses
•481K business attributes, e.g., hours, parking availability, ambience.
•Social network of 366K users for a total of 2.9M social edges.
•Aggregated check-ins over time for each of the 61K businesses","m1lname":"Tsai","industry":"Information","analytics":"Tool:
- Backend : Google App Engine
- Frontend : HTML5, CSS
- Third party API: Yelp Search API
- Big Data Tool : Spark
- Package : lda, gensim, nltk, WordNet (Princeton University)
Algorithm:
- LDA for topic modeling
- Map Reduce for word count
- WordNet Augmented Keyword Matching
","m2fname":"Chia-Hao","description":"Find out what user really cares about from their low rating reviews. We use topic modeling (LDA) to define many topics words and use Map Reduce to find high frequency words to match the important words in the text. And use an inverse mapping of these keywords to generate searchable keyword to recommend restaurant to the user.","m1fname":"Liang-Chun","projectname":"Reverse Recommendation on Yelp","m3fname":"Ming-Ching"},{"m2lname":"Jin","m4lname":"","m3uni":"cx2178","m1uni":"mc4081","m4uni":"","pid":"201512-69","m2uni":"lj2379","timestring":"Thu Dec 17 23:54:19 2015","m4fname":"","language":"Hadoop Mahout, Python (sklearn), Matlab","m3lname":"Xu","dataset":"TLC Trip Record Data at http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
But the raw data has to be cleansed and extracted (self-written utility https://github.com/ecsark/ubermax/blob/master/src/extract_pk_loc.py)

Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data is in .csv files split by month with roughly 2.5 gigabytes for green and yellow cabs in all per month.
","m1lname":"Cheng","industry":"Transportation","analytics":"We first clustered locations using Mini-batch K-Means (a variant that generally runs faster for big data). We then match each data point to one of the cluster and aggregate the data points. We then incrementally evaluated the route benefits and designed a heuristic algorithm to search the best routes, whose top results are recommended to the user. We have also implemented an assessment framework using Matlab to tune the parameters in our system.

For the front-end, we used Google Map API integrated with AngularJS. Specifically, clustered-marker API is especially useful to gain a quick overview of the data on the map.","m2fname":"Lingqiu","description":"Unlike taxi drivers, many Uber drivers work part time. Sometimes they just want to take few trips on his way home or work, while accepting arbitrary ride requests blindly could waste them hours of time but earn very limited profit. Our application Uber Max would advise them on selecting passengers (and their destinations) to increase their total expected revenue. The application will recommend drivers based on their current location and their preset information of their final destination with the next destination that could bring them the greatest expected overall revenue. Uber Max is definitely the first one in its kind, as a tool to help people profit in the Uberized economy in a scientific way.","m1fname":"Munan","projectname":"Uber Max","m3fname":"Chuwen"},{"m2lname":"Zhu","m4lname":"","m3uni":"zw2327","m1uni":"yw2770","m4uni":"","pid":"201512-50","m2uni":"jz2664","timestring":"Thu Dec 17 23:55:01 2015","m4fname":"","language":"Python; Java","m3lname":"Wan","dataset":"Yelp dataset;
Amazon dataset;
","m1lname":"Wang","industry":"Information","analytics":"Recommendation; Text analysis;","m2fname":"Jingtao","description":"I. Using the review text data to analyze on the words, especially word frequency. Mining for the useful and exciting purposes(Cluster, Classifier);
II, Provide more accurate interest recommendation for customers;","m1fname":"Yaxin","projectname":"Yelp Recommendation & Lexical Analysis","m3fname":"Zhibo"},{"m2lname":"Zhang","m4lname":"","m3uni":"rd2704","m1uni":"tl2693","m4uni":"","pid":"201512-17","m2uni":"sz2539","timestring":"Thu Dec 17 23:55:36 2015","m4fname":"","language":"Nodejs, python and R. It is cross-platform.","m3lname":"Duan","dataset":"Amazon Review data provided by Prof Julian McAuley from USCD. They are only for research use.
Twitter data crawled via the Twitter API.
Best Buy review data crawled via the Best Buy API.
","m1lname":"Li","industry":"Information","analytics":"Similar Matching, Sentiment Analysis, etc.","m2fname":"Shengtong","description":"To help users handle multiple sources of products, and to enable them to search products based on images, we built this evaluation system.
According to the input image, the application finds similar products from sources, such as Amazon, BestBuy, Twitter and NY Times. Using the reviews, ratings, and comments of these similar products, the application gives out a review summary to evaluate the product in the input image","m1fname":"Tiezheng","projectname":"Yet Another Evaluation System","m3fname":"Ruiqi"},{"m2lname":"Zhu","m4lname":"","m3uni":"zw2327","m1uni":"yw2770","m4uni":"","pid":"201512-50","m2uni":"jz2664","timestring":"Thu Dec 17 23:56:30 2015","m4fname":"","language":"Python; Java","m3lname":"Wan","dataset":"Yelp dataset;
Amazon dataset;
","m1lname":"Wang","industry":"Information","analytics":"Recommendation; Text analysis;","m2fname":"Jingtao","description":"I. Using the review text data to analyze on the words, especially word frequency. Mining for the useful and exciting purposes(Cluster, Classifier);
II, Provide more accurate interest recommendation for customers;","m1fname":"Yaxin","projectname":"Yelp Recommendation & Lexical Analysis","m3fname":"Zhibo"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201512-74","m2uni":"","timestring":"Thu Dec 17 23:56:45 2015","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python, Linux Shell Scripting, MongoDB, Spark, Google Charts API, Javascript, HTML","m3lname":"","dataset":"Twitter data via Twitter Streaming API with track terms set to candidate name and candidate handle. Training dataset with approx. 1.6 million tweets manually classified as positive/negative available at http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip. Location data was derived using 2015 U.S. Gazetteer Files available at https://www.census.gov/geo/maps-data/data/gazetteer2015.html.","m1lname":"Alshewski","industry":"Social Science-Government","analytics":"The following classification algorithms are used for this project – Naïve Bayes and Support Vector Machines. The tweets are classified as positive or negative and denote whether a user will be voting for or against a candidate. The tweets are aggregated by state for each candidate to derive estimated probability of candidate winning each state. Finally, these probabilities are used to run a Monte Carlo simulation to predict the results of election. All steps are implemented in Python.","m2fname":"","description":"Based on the US Constitution, in order to win the presidential election, the candidate needs to receive more than half of Electoral College votes. There are 538 EC votes total. Thus, the winner is the one who gets at least 270. If both get 269, we have a tie.
The nation’s popular vote has no direct impact on the election result. For example, in 2000, Al Gore won the popular vote by more than half a million votes, but George W. Bush became President. What matters is winning individual states, and to win a state a candidate needs a plurality of that state’s popular vote, which can be less than 50%. Each state is assigned a number of EC votes equal to its number of members in the House of Representatives plus its number of members in the Senate. For historical reasons, every state except two small ones (ME (4) and NE (5)) uses a winner-takes-all method to distribute its EC votes. Thus, a candidate who wins the popular vote in such a state receives ALL of the EC votes from that state.
Therefore, running an election campaign is not only an art but also a science. By strategically choosing states with more EC votes and a higher likelihood of winning, a candidate increases his/her chance of winning the election while using funding in the most efficient way. One way to determine target states is to know the public sentiment toward a candidate in each state. Knowing the sentiment allows the campaign manager to maximize the effectiveness of the campaign. The campaign manager may pinpoint locations where additional effort and funding are required to win the state’s EC votes. In addition, the campaign manager may re-allocate resources and funding from states where a candidate is trailing by a lot to a “battlefield” state, that is, a state where the gap between candidates is very small.
In addition, the prediction can be monetized. There are markets set up where one can buy shares of options that pay off depending on who wins the US Presidential Election. One can purchase this type of option at Iowa Electronic Markets http://tippie.uiowa.edu/iem/markets/pres16.html
The objective of this project is to assess public sentiment using Twitter messages. Compared with regular polls, this method is more cost-efficient, draws on a larger sample of the population, takes less time, and responds more quickly to events.
","m1fname":"Kirill","projectname":"Predicting The United States Presidential Election Results Using Twitter sentiment","m3fname":""},{"m2lname":"Chen","m4lname":"Wang","m3uni":"lw2589","m1uni":"jz2612","m4uni":"cw2826","pid":"201512-65","m2uni":"cc3757","timestring":"Thu Dec 17 23:56:58 2015","m4fname":"Changchang","language":"AWS EC2, S3, Cognito & SQS , OpenCV, Tesseract, Spark, Swift","m3lname":"Wu","dataset":"images of lecture notes (Topic: Bayesian Models for Machine Learning)
20 news group","m1lname":"Zhong","industry":"Information","analytics":"SVM, TF-IDF, OCR (please see the architecture design in the slides)","m2fname":"Chang","description":"Most companies do not allow their employees or visitors to take photos of confidential materials, such as documents, whiteboards, or screenshots, using their personal devices. We have therefore designed and implemented an app that helps companies immediately detect whether a photo just taken contains confidential information.
","m1fname":"Jialu","projectname":"Image recognition with a huge dataset on iOS devices ","m3fname":"Liang"},{"m2lname":"Hsu","m4lname":"","m3uni":"mc4107","m1uni":"lt2590","m4uni":"","pid":"201512-25","m2uni":"ch3141","timestring":"Thu Dec 17 23:57:42 2015","m4fname":"","language":"Python, HTML5, CSS, Google App Engine ","m3lname":"Chu","dataset":"Yelp Dataset Challenge
•1.6M reviews and 500K tips by 366K users for 61K businesses
•481K business attributes, e.g., hours, parking availability, ambience.
•Social network of 366K users for a total of 2.9M social edges.
•Aggregated check-ins over time for each of the 61K businesses","m1lname":"Tsai","industry":"Information","analytics":"Tool:
- Backend : Google App Engine
- Frontend : HTML5, CSS
- Third party API: Yelp Search API
- Big Data Tool : Spark
- Package : lda, gensim, nltk, WordNet (Princeton University)
Algorithm:
- LDA for topic modeling
- Map Reduce for word count
- WordNet Augmented Keyword Matching
","m2fname":"Chia-Hao","description":"Find out what users really care about from their low-rating reviews. We use topic modeling (LDA) to extract topic words and MapReduce on Spark to find high-frequency words, which together identify the important words in the text. We then apply an inverse mapping of these keywords to generate searchable keywords and use the Yelp Search API to recommend 5 restaurants to the user.","m1fname":"Liang-Chun","projectname":"Reverse Recommendation on Yelp","m3fname":"Ming-Ching"},{"m2lname":"ZHOU","m4lname":"","m3uni":"","m1uni":"cz2351","m4uni":"","pid":"201512-55","m2uni":"cz2342","timestring":"Thu Dec 17 23:57:52 2015","m4fname":"","language":"Pig-Latin, Python, Tableau, Mac OS","m3lname":"","dataset":"https://www.kaggle.com/c/sf-crime/data

This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training and test sets alternate by week: weeks 1, 3, 5, 7, ... belong to the test set, and weeks 2, 4, 6, 8, ... belong to the training set.

Data fields
Dates - timestamp of the crime incident
Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict.
Descript - detailed description of the crime incident (only in train.csv)
DayOfWeek - the day of the week
PdDistrict - name of the Police Department District
Resolution - how the crime incident was resolved (only in train.csv)
Address - the approximate street address of the crime incident
X - Longitude
Y - Latitude","m1lname":"ZHENG","industry":"Social Science-Government","analytics":"K-means algorithm, Euclidean Distance, pie chart/line chart/histogram to visualize the data.","m2fname":"CHONG","description":"According to the San Francisco Police Department and my own experience, San Francisco has one of the highest crime rates in America among communities of all sizes, from the smallest towns to the very largest cities. Within California, more than 98% of communities have a lower crime rate than San Francisco. Our project therefore aims to analyze questions such as which areas of SF have a high crime rate, how crime rate, crime type and location are related in SF, and which crime categories are relatively easy to solve. Given this information, people, especially travelers, will know which areas are relatively safe to visit, when the worst time to go out is, and which kinds of crime they need to watch out for.","m1fname":"CHEN","projectname":"San Francisco Crime Data Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tw2516","m4uni":"","pid":"201512-9","m2uni":"","timestring":"Thu Dec 17 23:58:15 2015","m4fname":"","language":"Java, Python, Spark, NLTK","m3lname":"","dataset":"It's tested on computers with 32GB RAM in Fudan University

Dataset is collected from http://www.ncbi.nlm.nih.gov/mesh through Pubmed","m1lname":"WANG","industry":"Life Science","analytics":"Models:
(1)Linear Models:
Logistic Regression Model, Support Vector Machine(SVM)
(2)Naïve Bayes Models:
Bernoulli, Multinomial

F1 Measurement","m2fname":"","description":"More and more documents are being added to MEDLINE at the U.S. National Library of Medicine, and labeling them with MeSH terms is hard work. With over 25 million existing documents and their MeSH labels, we can try different methods to assign MeSH labels to future MEDLINE documents automatically. This is a good example of applying machine learning in the medical field, and it could easily be extended to other areas and applied to other databases.","m1fname":"TAO","projectname":"MeSH indexing with Spark","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"zl2406","m4uni":"","pid":"201512-40","m2uni":"xw2341","timestring":"Thu Dec 17 23:58:22 2015","m4fname":"","language":"Python, Spark, Cypher, D3","m3lname":"","dataset":"http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede","m1lname":"Liang","industry":"Information","analytics":"- Text mining and feature extraction: NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Classification algorithms: Logistic regression, Random Forest
- Building the graph database in Neo4j
- Visualization of the output using the D3 toolkit","m2fname":"Xinli","description":"Q&A platforms are increasingly important for students, engineers and scientists who want to share their knowledge and get their questions answered. Piazza and Stack Exchange are two popular forums for us.

As users, we are interested in:
Which topics are hotly discussed
How easily users get their problems solved using such platforms

As developers, we are interested in:
The problems users are facing and how they can take such information to improve their products and documentation.
","m1fname":"Zhen","projectname":"Delving into the Q&A network – text mining and graph analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"wz2295","m4uni":"","pid":"201512-3","m2uni":"","timestring":"Thu Dec 17 23:58:48 2015","m4fname":"","language":"Python","m3lname":"","dataset":"Yago","m1lname":"Zhang","industry":"Information","analytics":"Graph Algo","m2fname":"","description":"Relationship Explainer","m1fname":"Wangda","projectname":"RelEx: Relationship Explainer Using Knowledge Base","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ma3411","m4uni":"","pid":"201512-8","m2uni":"","timestring":"Thu Dec 17 23:58:53 2015","m4fname":"","language":"Linux, Spark, Spark SQL, Python, Hive, Spark MLlib","m3lname":"","dataset":"Yahoo! Search Marketing advertiser bidding data, version 1.0.

I downloaded it from the Yahoo dataset.
","m1lname":"Allouah","industry":"Media","analytics":"I have implemented different algorithms based on the available libraries:

-Support Vector Machine
-Random Forest
-Neural Network","m2fname":"","description":"The goal of this project is to find the bidding strategy of an advertiser during a search word auction. The project uses machine learning techniques combined with large-scale data tools: Spark, SQL, MLlib...","m1fname":"Amine","projectname":"Search Word Auction: Bidding Strategy","m3fname":""},{"m2lname":"Qie","m4lname":"","m3uni":"bz2269","m1uni":"sc3919","m4uni":"","pid":"201512-32","m2uni":"sq2179","timestring":"Fri Dec 18 00:00:07 2015","m4fname":"","language":"Python","m3lname":"Zheng","dataset":"This project analyzes a traffic accident dataset, which can be downloaded from the following link
http://www.wnyc.org/story/nyc-opens-traffic-crash-data-finally/","m1lname":"Chang","industry":"Transportation","analytics":"Naive Bayesian Classifier
Self modified Naive Bayesian Classifier
Decision Tree Classifier
","m2fname":"Sheng","description":"Analyze the correlations among the factors summarized in the dataset
Predict the desired factor based on training data
The predictions can serve as advisory instructions that help drivers reduce the danger of being involved in traffic accidents in NYC, and as evidence to support policy decisions on traffic issues","m1fname":"Shuo","projectname":"Analysis of Traffic accidents in NY City","m3fname":"Baochan"},{"m2lname":"Chen","m4lname":"Wang","m3uni":"lw2589","m1uni":"jz2612","m4uni":"cw2826","pid":"201512-65","m2uni":"cc3757","timestring":"Fri Dec 18 00:00:52 2015","m4fname":"Changchang","language":"AWS EC2, S3, Cognito & SQS , OpenCV, Tesseract, Spark, Swift","m3lname":"Wu","dataset":"images of lecture notes (Topic: Bayesian Models for Machine Learning)
20 news group","m1lname":"Zhong","industry":"Information","analytics":"SVM, TF-IDF, OCR (please see the architecture design in the slides)","m2fname":"Chang","description":"Most companies do not allow their employees or visitors to take photos of confidential materials, such as documents, whiteboards, or screenshots, using their personal devices. We have therefore designed and implemented an app that helps companies immediately detect whether a photo just taken contains confidential information.
","m1fname":"Jialu","projectname":"Image recognition on iOS devices ","m3fname":"Liang"},{"m2lname":"Chen","m4lname":"Wang","m3uni":"lw2589","m1uni":"jz2612","m4uni":"cw2826","pid":"201512-65","m2uni":"cc3757","timestring":"Fri Dec 18 00:01:26 2015","m4fname":"Changchang","language":"AWS EC2, S3, Cognito & SQS , OpenCV, Tesseract, Spark, Swift","m3lname":"Wu","dataset":"images of lecture notes (Topic: Bayesian Models for Machine Learning)
20 news group","m1lname":"Zhong","industry":"Information","analytics":"SVM, TF-IDF, OCR (please see the architecture design in the slides)","m2fname":"Chang","description":"Most companies do not allow their employees or visitors to take photos of confidential materials, such as documents, whiteboards, or screenshots, using their personal devices. We have therefore designed and implemented an app that helps companies immediately detect whether a photo just taken contains confidential information.
","m1fname":"Jialu","projectname":"Confidential Document Detection on iOS devices ","m3fname":"Liang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"wz2295","m4uni":"","pid":"201512-3","m2uni":"","timestring":"Fri Dec 18 00:04:00 2015","m4fname":"","language":"Python, JavaScript, Neo4j","m3lname":"","dataset":"Yago","m1lname":"Zhang","industry":"Information","analytics":"Analytics: analyzing graph relationships

Algorithms: graph pattern matching, graph mining

System Modules: Neo4j, Py2neo, jQuery & Bootstrap

Visualization: D3.js","m2fname":"","description":"Relationship Explainer","m1fname":"Wangda","projectname":"RelEx: Relationship Explainer Using Knowledge Base","m3fname":""},{"m2lname":"Chen","m4lname":"Wang","m3uni":"lw2589","m1uni":"jz2612","m4uni":"cw2826","pid":"201512-65","m2uni":"cc3757","timestring":"Fri Dec 18 00:05:58 2015","m4fname":"Changchang","language":"AWS EC2, S3, Cognito & SQS, OpenCV, Tesseract, Spark, Swift","m3lname":"Wu","dataset":"20 news group","m1lname":"Zhong","industry":"Information","analytics":"SVM, OCR, TF-IDF","m2fname":"Chang","description":"Most companies do not allow their employees or visitors to take photos of confidential materials, such as documents, whiteboards, or screenshots, using their personal devices. We have therefore designed and implemented an app that helps companies immediately detect whether a photo just taken contains confidential information.
","m1fname":"Jialu","projectname":"Confidential Document Detection on iOS device","m3fname":"Liang"},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"zl2406","m4uni":"","pid":"201505-40","m2uni":"xw2341","timestring":"Fri Dec 18 00:11:03 2015","m4fname":"","language":"Python, Spark, Cypher","m3lname":"","dataset":"Stack Exchange \"data dumps\" of its publicly available content, http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede","m1lname":"Liang","industry":"Information","analytics":"- Text mining and feature extraction:NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Classification algorithms: Logistic regression, Random Forest
- Building the graph database in Neo4j
- Visualizations from D3.js toolkit for implementing a visualization to represent the output","m2fname":"Xinli","description":"- Text mining and feature extraction:NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Classification algorithms: Logistic regression, Random Forest
- Building the graph database in Neo4j
- Visualizations from D3.js toolkit for implementing a visualization to represent the output","m1fname":"Zhen","projectname":"Delving into the Q&A network –text mining and graph analysis","m3fname":""},{"m2lname":"Deshmukh","m4lname":"","m3uni":"ku2151","m1uni":"ar3539","m4uni":"","pid":"201512-64","m2uni":"ad3293","timestring":"Fri Dec 18 00:16:33 2015","m4fname":"","language":"MATLAB, Java, Hadoop, Excel","m3lname":"Upadhyay","dataset":"Datasets:
- New York State Load Data for 2011 - Obtained from the NYISO website
- New York State Solar Insolation Data - NREL
- New York State Wind Speed Data - NREL

The software supports any hourly demand profile and weather data set given in .csv format.
","m1lname":"Ramakrishnan","industry":"Life Science","analytics":"Multi-Objective Optimization using Genetic Algorithm, Multi-Objective Optimization using Brute-Force, MapReduce","m2fname":"Ankita","description":"Many states in the USA have a renewable portfolio standard (RPS), under which they set goals to meet a certain percentage of their energy demand with renewable energy. The goal of this project is to determine the energy mix required to supply a specified percentage of the annual energy demand for a given electricity load profile. We consider the load profile of the state of New York for the entire year of 2011 and output the optimal combination of wind turbine generators and photovoltaic arrays required to supply 50% of the annual energy demand. Optimization is done with the objective of minimizing the capital cost of the generation resources under the constraints of meeting 50% of the load and wasting less than 20% of the generated energy. These results will allow policy makers to make informed decisions on which energy providers to contract and which technologies to focus on in a given geographical location. ","m1fname":"Akhilesh","projectname":"Optimal Hybrid Renewable Energy Capacity for Target Grid Penetration","m3fname":"Kaustubh"},{"m2lname":"WANG","m4lname":"","m3uni":"zw2327","m1uni":"jz2664","m4uni":"","pid":"201512-50","m2uni":"yw2770","timestring":"Fri Dec 18 00:19:06 2015","m4fname":"","language":"Python, Mahout, Java, HTML, CSS, JavaScript","m3lname":"WAN","dataset":"Yelp Dataset Challenge: mainly users, business, and reviews.
We filter out: 1. users who always give five stars or one star, or who have only a few reviews; 2. reviews that contain only a few words; 3. businesses that have only a few reviews.
The resulting dataset has these features: users who write vivid reviews and whose star ratings have an acceptable deviation; reviews with no fewer than 20 words; businesses with no fewer than 5 reviews.

","m1lname":"ZHU","industry":"Information","analytics":"Probabilistic Model in NLP;
Collaborative Filtering Recommender Algorithm: collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating).
Item-based recommendation: Calculate similarity between items and make recommendations.","m2fname":"YAXIN","description":"I. Analyze the words in the review text, especially word frequency, and mine them for useful and interesting purposes (clustering, classification);
II. The information in the Yelp dataset is highly redundant, and some reviews may be useless; we try to filter out the BAD information and provide more accurate interest-based recommendations for customers. We provide three kinds of recommendation: 1. Find nice places for visitors new to a city:
Input: name of a city
Output: Best restaurants, hair salons, medical cares...of the city
2.Recommend stores based on user interests:
Input: userID
Output: Business store ID
3.Recommendation evaluation
3.","m1fname":"JINGTAO","projectname":"Yelp Recommendation & Lexical Analysis","m3fname":"ZHIBO"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc4295","m4uni":"","pid":"201512-73","m2uni":"","timestring":"Fri Dec 18 00:19:38 2015","m4fname":"","language":"PHP, HTML, JAVA, Mahout, Eclipse","m3lname":"","dataset":"Instagram API. Google Map API","m1lname":"Chen","industry":"Media","analytics":"The input geographic position was provided by google map API, which converts a location into latitude and longitude.
The tag #nationalpark is passed to the Instagram API to find all related images; this involves pagination with a do-while loop, since Instagram returns only 33 images per page.
The images are filtered by the input location so that only images within 200 miles are displayed.
The images are then filtered so that only the most-liked image is shown.
After the most popular national park near one’s location is found, that park can be used as the input to look for related hiking trails.
The hiking trails are rated as easy, moderate, or hard. Item-based recommendation is implemented here to provide the most reasonable trails.
","m2fname":"","description":"Hiking is an outdoor activity in which people can exercise and explore nature. Some existing applications don’t provide up-to-date data and lack the connection between national parks and trails. The goal of this project is to develop a web application that helps users find the most popular national parks based on where they live and recommends the most reasonable trails by difficulty rating: easy, moderate and hard.
","m1fname":"Jing","projectname":"Recommendation: Hikes in the National Parks","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc4295","m4uni":"","pid":"201512-73","m2uni":"","timestring":"Fri Dec 18 01:25:14 2015","m4fname":"","language":"PHP, HTML, JAVA, Mahout, Eclipse","m3lname":"","dataset":"Instagram API. Google Map API, Individual National Park Hiking Trails' webpage","m1lname":"Chen","industry":"Media","analytics":"The input geographic position is provided by the Google Maps API, which converts a location into latitude and longitude. The tag #nationalpark is passed to the Instagram API to find all related images; this involves pagination with a do-while loop, since Instagram returns only 33 images per page. The images are filtered by the input location so that only images within 200 miles are displayed. The images are then filtered so that only the most-liked image is shown. After the most popular national park near one’s location is found, that park can be used as the input to look for related hiking trails. The hiking trails are rated as easy, moderate, or hard. Item-based recommendation is implemented here to provide the most reasonable trails.","m2fname":"","description":"Hiking is an outdoor activity in which people can exercise and explore nature. Some existing applications don’t provide up-to-date data and lack the connection between national parks and trails. 
The goal of this project is to develop a web application to help find the most popular national parks based on where users live, and to recommend the most reasonable trails by difficulty ratings: easy, moderate and hard.","m1fname":"Jing ","projectname":"Recommendation: Hikes in the National Parks","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"zhang.wangda","m4uni":"","pid":"201512-3","m2uni":"","timestring":"Fri Dec 18 03:01:47 2015","m4fname":"","language":"Python, JavaScript, Neo4j","m3lname":"","dataset":"Yago: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/","m1lname":"Zhang","industry":"Information","analytics":"The system mainly uses graph traversal algorithms such as shortest path and graph pattern matching. Other algorithms including random walks are used for analyzing and ranking path patterns.

The system uses jQuery and Bootstrap as the front-end, D3.js for graph visualization, Bottle as the web framework, Py2neo as the persistence access library, and Neo4j for knowledge graph storage.","m2fname":"","description":"Most existing search engines only provide keyword search or basic question answering services, and are unable to answer relationship queries. Sparql queries used in knowledge bases are difficult for users without specialized training, and the query results are not presented straightforwardly. To make relationship explanation more intuitive, this project develops a system specially for explaining relationships between two arbitrary objects from general domains. The relationships between objects are obtained from a large knowledge base and then visualized so that they can be easily understood by the users. The system also learns representative path patterns and uses them to speed up query performance.","m1fname":"Wangda","projectname":"RelEx: Relationship Explainer Using Knowledge Base","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tw2516","m4uni":"","pid":"201512-9","m2uni":"","timestring":"Fri Dec 18 03:12:32 2015","m4fname":"","language":"Java, Python, Spark, NLTK","m3lname":"","dataset":"It's tested on computers with 32GB RAM in Fudan University

Dataset is collected from http://www.ncbi.nlm.nih.gov/mesh through Pubmed","m1lname":"WANG","industry":"Life Science","analytics":"Models:
(1)Linear Models:
Logistic Regression Model, Support Vector Machine(SVM)
(2)Naïve Bayes Models:
Bernoulli, Multinomial

F1 Measurement","m2fname":"","description":"More and more documents are being added to MEDLINE at the U.S. National Library of Medicine, and labeling them with MeSH terms is hard work. With over 25 million existing documents and their MeSH labels, we can try different methods to assign MeSH labels to future MEDLINE documents automatically. This is a good example of applying machine learning in the medical field, and it could easily be extended to other areas and applied to other databases.","m1fname":"TAO","projectname":"MeSH indexing with Spark","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cy2403","m4uni":"","pid":"201512-4","m2uni":"","timestring":"Fri Dec 18 05:51:55 2015","m4fname":"","language":"Python, Hadoop, MySQL, Microsoft Azure ","m3lname":"","dataset":"The Riot API is used to collect the dataset. In this project, the match-v2.2 API is the most useful, because it provides data such as Champions, Player Rank, and all details of each game. A Python script was written to fetch the data.","m1lname":"Yuan","industry":"Information","analytics":"K-means Clustering, Support Vector Machine(SVM) Classification","m2fname":"","description":"League of Legends is one of the most popular multiplayer online battle arena games. The decisive factors in a game’s result include players’ performance, objective control, team strategy and team composition.

This project aims to analyze some pre-match factors, examine their influence on the game’s outcome, and provide a potential pre-match prediction method.

The analysis could provide online insight into past matches, help players plan better before each game, and help pro teams with strategy making.","m1fname":"Chenli","projectname":"League of Legends Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tw2516","m4uni":"","pid":"201512-9","m2uni":"","timestring":"Fri Dec 18 08:10:55 2015","m4fname":"","language":"Java,Python,Spark,NLTK","m3lname":"","dataset":"Dataset is collected from http://www.ncbi.nlm.nih.gov/mesh through Pubmed
Apply four classifiers to this dataset and get results separately.","m1lname":"WANG","industry":"Life Science","analytics":"Models:
(1)Linear Models:
Logistic Regression Model, Support Vector Machine(SVM)
(2)Naïve Bayes Models:
Bernoulli, Multinomial

F1 Measurement","m2fname":"","description":"More and more documents are being added to MEDLINE at the U.S. National Library of Medicine, and labeling them with MeSH terms is hard work. With over 25 million existing documents and their MeSH labels, we can try different methods to assign MeSH labels to future MEDLINE documents automatically. This is a good example of applying machine learning in the medical field, and it could easily be extended to other areas and applied to other databases.","m1fname":"TAO","projectname":"MeSH indexing","m3fname":""},{"m2lname":"li","m4lname":"","m3uni":"ys2824","m1uni":"zh2220","m4uni":"","pid":"201512-60","m2uni":"pl2556","timestring":"Fri Dec 18 08:29:00 2015","m4fname":"","language":"python, spark, GPU","m3lname":"song","dataset":"The data come from a Kaggle competition and were downloaded from https://www.kaggle.com/c/datasciencebowl/data","m1lname":"huang","industry":"Life Science","analytics":"CNN, neural network, various techniques in deep learning (momentum, dropout, initialization, etc.)
image preprocessing, ","m2fname":"pan","description":"Image recognition for the plankton photo.","m1fname":"ziheng","projectname":"plankton image recognition","m3fname":"yifang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"car2228","m4uni":"","pid":"201512-77","m2uni":"","timestring":"Fri Dec 18 10:09:18 2015","m4fname":"","language":"R and C++","m3lname":"","dataset":"The dataset used is from the Center for Research on Security Prices (CRSP). The extract that I took includes daily closing prices for every company that was a constituent of the S&P 500 Index over the period from 2010-2011. I use the 2010 data as the training sample and the 2011 data as the test sample. Within each sample, I restricted the data to stocks for which price data exist for every trading day. Only stock tickers and stock returns were used in this analysis.

These data are not public; however, many universities including Columbia purchase subscriptions to them. Any interested researcher can request access from the reference librarians in the Business School. I obtained the data by requesting access and then downloading from the CRSP website (linked from the Columbia library website).","m1lname":"Rohlfs","industry":"Finance","analytics":"Much of the coding and run-time involved preparing the data, which was performed in R and run on my home desktop system.

I looked into using Mahout for the classification algorithms, but given the large number of continuous regressors in my model, it seemed to be more efficient for me to simply use my own code. I used two methods: Logistic regression (Newton's method rather than SGD) and K-nearest neighbor (Euclidean distance). I also tested a Perceptron algorithm, but the run-time (even on a 5% sample) was prohibitively long. I wrote the logistic regression procedure in R and the k-nearest neighbor procedure in C++.","m2fname":"","description":"Pairs trading is a well-established statistical arbitrage technique that involves identifying \"mean reverting\" pairs of stocks. The correlation between the two stocks' returns is estimated over the recent history. The trader supposes that any sharp movement that departs from this historical relationship is attributable to a temporary mispricing that will be corrected in the market. Hence, if stock x rises relative to y, then the trader shorts x and buys y until the prices return to their stable relationship.

The question of how to identify correlated pairs represents a major gap in the literature on pairs trading. Historically, researchers simply pick the stocks of companies with similar characteristics, such as Conoco-Phillips and Exxon-Mobil, or they search for stocks with high correlations. There is little evidence, however, to suggest that correlated stock pairs are the ones for which pairs trading strategies are the most profitable.

To address this limitation in the literature, I construct a new dataset in which the level of observation is the stock pair and the outcome of interest is the profitability of a simple pairs trading strategy using those two stocks. I then use Machine Learning techniques to identify which pairs of stocks would lead to the most profitable trading strategies.","m1fname":"Christopher","projectname":"Identifying Correlated Stock Pairs","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"zl2406","m4uni":"","pid":"201512-40","m2uni":"xw2341","timestring":"Fri Dec 18 10:09:47 2015","m4fname":"","language":"Python, Spark, Cypher, d3.js","m3lname":"","dataset":"- Q&A data from Stack Exchange Data Explorer (SEDE), we choose the data science categories to conduct our analysis.

https://archive.org/details/stackexchange","m1lname":"Liang","industry":"Information","analytics":"- Text mining and feature extraction:NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Classification algorithms: Logistic regression, Random Forest
- Building the graph database in Neo4j
- Visualization of the output using the D3.js toolkit","m2fname":"Xinli","description":"Q&A platforms are increasingly important for students, engineers and scientists who want to share their knowledge and get their questions answered. Piazza and Stack Exchange are two popular forums for us.

As users, we are interested in:
- Which topics are hotly discussed
- How easily users get their problems solved using such platforms

As developers, we are interested in:
-The problems users are facing and how they can take such information to improve their products and documentation.

Our project addresses such problems by
- Extracting the topics from a large number of posts, along with the topic distribution of each document
- Predicting good-quality answers by building a predictive model
- Visualizing the “network” of questions to understand the trends and relationships among the discussed topics","m1fname":"Zhen","projectname":"Delving into the Q&A network – text mining and graph analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rw2611","m4uni":"","pid":"rw2611","m2uni":"","timestring":"Fri Dec 18 10:25:51 2015","m4fname":"","language":"Python ","m3lname":"","dataset":"soccer team analyze","m1lname":"Wang","industry":"Media","analytics":"classification","m2fname":"","description":"analyze soccer team","m1fname":"Rui","projectname":"Data analysis on soccer team performance","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rw2611","m4uni":"","pid":"rw2611r","m2uni":"","timestring":"Fri Dec 18 10:29:57 2015","m4fname":"","language":"Python ","m3lname":"","dataset":"soccer team analyze","m1lname":"Wang","industry":"Media","analytics":"soccer team analyze","m2fname":"","description":"soccer team analyze","m1fname":"Rui","projectname":"soccer team analyze","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rw2611","m4uni":"","pid":"201512-18","m2uni":"","timestring":"Fri Dec 18 10:32:25 2015","m4fname":"","language":"Python ","m3lname":"","dataset":"soccer team analyze","m1lname":"Wang","industry":"Information","analytics":"soccer team analyze","m2fname":"","description":"soccer team analyze","m1fname":"Rui","projectname":"soccer team analyze","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"car2228","m4uni":"","pid":"201512-77","m2uni":"","timestring":"Fri Dec 18 10:38:13 2015","m4fname":"","language":"Data cleaning and construction performed in R. Classification algorithms coded in R and C++.","m3lname":"","dataset":"The data that I used are from the Center for Research in Security Prices (CRSP). These data include daily closing prices for different stocks. I selected all companies that were constituents of the S&P 500 Index over the period from 2010 to 2011. 
I used 2010 as the training period and 2011 as the test period. For each year, I included only companies with data for every trading day. Tickers (to identify company identities) and stock returns were the only data sources used in this project.

These data are not public, but many universities including Columbia purchase subscriptions to them. Interested researchers can request access from the reference librarians in the Business School. I downloaded them from the CRSP website, where one can name the relevant stocks or simply request all of the S&P 500 Constituents. I actually named the ones I wanted in my data request based upon public lists of the S&P 500 constituents.

The algorithms I use can be applied to any dataset of stock pairs, which can be constructed from any time-series on multiple stocks.
","m1lname":"Rohlfs","industry":"Finance","analytics":"Much of the coding involved in this project was to construct the dataset of profitability for each stock pair---which involved running an algorithm on a year of data each pair of stocks in the data (roughly 115,000 for each year). This data cleaning was performed in R.

I looked into using Mahout, but given the large number of continuous regressors in my model, Naive Bayes didn't make sense, so I focused on two alternative methods: Logistic regression and K-nearest neighbor estimation. Rather than SGD, I used Newton's method for the Logistic regression, which I coded in R. For the K-nearest neighbor estimation, I used a Euclidean distance metric, which I coded in C++. I also looked into using the Perceptron algorithm (coded in C++), but the program was prohibitively slow, so I did not obtain results from that method.","m2fname":"","description":"Pairs trading is a well-established statistical arbitrage technique that involves identifying \"mean reverting\" pairs of stocks. The correlation between the two stocks' returns is estimated over the recent history. The trader supposes that any sharp movement that departs from this historical relationship is attributable to a temporary mispricing that will be corrected in the market. Hence, if stock x rises relative to y, then the researcher shorts x and buys y until the prices return to their stable relationship.

The question of how to identify correlated pairs represents a major gap in the literature on pairs trading. Historically, researchers simply pick the stocks of companies with similar characteristics, such as Conoco-Phillips and Exxon-Mobil, or they search for stocks with high correlations. There is little evidence, however, to suggest that correlated stock pairs are the ones for which pairs trading strategies are the most profitable.

To address this limitation in the literature, I construct a new dataset in which the level of observation is the stock pair and the outcome of interest is the profitability of a simple pairs trading strategy using those two stocks. I then use Machine Learning techniques to identify which pairs of stocks would lead to the most profitable trading strategies.","m1fname":"Christopher","projectname":"Identifying Correlated Stock Pairs","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sg662","m4uni":"","pid":"201512-76","m2uni":"","timestring":"Fri Dec 18 19:03:30 2015","m4fname":"","language":"C#, Scala, Java","m3lname":"","dataset":"A proprietary production dataset was used, but a test data set will be provided that was used to validate the classification model.","m1lname":"Gabor","industry":"Information","analytics":"K-means clustering and classification, lemmatization, stop-word removal, Stanford NLP text utilities, Advanced Spark Analytics.","m2fname":"","description":"Background Information:

A typical customer service organization can receive many free-form support requests emailed to a designated mailbox. Free-form requests received in this manner need to be reviewed, categorized and prioritized by support staff. The process can be very time consuming for a service organization which routinely receives hundreds or thousands of emailed requests per day.

Proposed Solution:

Implement a support-request classification system that uses machine learning algorithms to automatically classify incoming requests for swifter handling by specialized groups of support professionals. This divide-and-conquer approach can help reduce an organization’s single large support request queue to many smaller queues for faster processing by distributed groups of support technicians.
","m1fname":"Sam","projectname":"Achieving Greater Efficiency Using Machine Classification of Support E-Mails ","m3fname":""},{"m2lname":"Zhao","m4lname":"","m3uni":"yy2641","m1uni":"xl2523","m4uni":"","pid":"201512-10","m2uni":"jm2685","timestring":"Fri Dec 18 19:03:53 2015","m4fname":"","language":"Python,Java,Javascript,SQL,AWS ","m3lname":"Yang","dataset":"(1)Stanford Artificial Intelligence Laboratory “Large Movie Review Dataset”
This dataset was used to build up a movie review sentiment classifier
(2)Latest twitter data collected by twitter API
Utilized the Twitter API to collect 149606 tweets relevant to movies
Applied these twitters to the above mentioned classifier and calculate the rating for each movie ","m1lname":"Lan","industry":"Media","analytics":"Scikit-learn package was used to build:
(1)Naive Bayes Classifier
(2)Linear SVM Classifier
Applied the test data in the “Large Movie Review Dataset” to evaluate the model
The accuracy of Linear SVM is 10% higher
So the Linear SVM Classifier was selected to classify the tweets
Two vectorization algorithms were used to vectorize the words
(1)Count Vectorize algorithm
(2)TF-IDF Vectorize algorithm
TF-IDF Vectorize algorithm performs better
For the Linear SVM classifier
Tried soft margin parameter C from 0.1 to 100
When C=0.5, the model performs best

Visualization
Use Raphael.JS to visualize the result by date and by area. Construct webpage based on AWS
Welcome to visit our website
http://twittermovierating.elasticbeanstalk.com","m2fname":"Jingmei","description":"Recently the commercial market of movies grows larger and larger.
But there are so many movies that it is sometimes hard to choose which ones to watch. Our program helps people by providing real-time ratings of movies from social media for their location, even for movies that have not been released yet!
First, the “Large Movie Review Dataset” was used to train a Linear SVM model to classify movie reviews as positive or negative.
Meanwhile, tweets relevant to movies, with geo tag and date, were collected through the Twitter API and stored in RDS.
Second, the tweets for each movie were classified and a rating was calculated for each area and each date.
Finally, the results were visualized on the website to benefit more people.
There are four highlights of our project:
(1) It can predict ratings for movies that have not been released yet, helping people select a movie.
(2) Since people from different areas may like different movies, it predicts the rating for a target area.
(3) The rating is updated in real time, and people can see the trend of a movie's rating.
(4) The data is visualized on the website. Everybody has free access to the website and can use our results to choose a movie to watch.
Actually, I am currently using the results of our project to choose movies.
Try our website:
http://twittermovierating.elasticbeanstalk.com/home.jsp
","m1fname":"Xing","projectname":"Twitter Based Movie Recommendation System","m3fname":"Yao"},{"m2lname":"Zhao","m4lname":"","m3uni":"yy2641","m1uni":"xl2523","m4uni":"","pid":"201512-10","m2uni":"jz2685","timestring":"Fri Dec 18 19:45:45 2015","m4fname":"","language":"Python,Java,Javascript,SQL,AWS, AmazonEC2","m3lname":"Yang","dataset":"(1)Stanford Artificial Intelligence Laboratory “Large Movie Review Dataset”
This dataset was used to build up a movie review sentiment classifier
(2)Latest twitter data collected by twitter API
Utilized the Twitter API to collect 149606 tweets relevant to movies
Applied these twitters to the above mentioned classifier and calculate the rating for each movie ","m1lname":"Lan","industry":"Media","analytics":"Scikit-learn package was used to build:
(1)Naive Bayes Classifier
(2)Linear SVM Classifier
Applied the test data in the “Large Movie Review Dataset” to evaluate the model
The accuracy of Linear SVM is 10% higher
So the Linear SVM Classifier was selected to classify the tweets
Two vectorization algorithms were used to vectorize the words
(1)Count Vectorize algorithm
(2)TF-IDF Vectorize algorithm
TF-IDF Vectorize algorithm performs better
For the Linear SVM classifier
Tried soft margin parameter C from 0.1 to 100
When C=0.5, the model performs best

Visualization
Use Raphael.JS to visualize the result by date and by area. Construct webpage based on AWS
Welcome to visit our website
http://twittermovierating.elasticbeanstalk.com","m2fname":"Jingmei","description":"Recently the commercial market of movies grows larger and larger.
But there are so many movies that it is sometimes hard to choose which ones to watch. Our program helps people by providing real-time ratings of movies from social media for their location, even for movies that have not been released yet!
First, the “Large Movie Review Dataset” was used to train a Linear SVM model to classify movie reviews as positive or negative.
Meanwhile, tweets relevant to movies, with geo tag and date, were collected through the Twitter API and stored in RDS.
Second, the tweets for each movie were classified and a rating was calculated for each area and each date.
Finally, the results were visualized on the website to benefit more people.
There are four highlights of our project:
(1) It can predict ratings for movies that have not been released yet, helping people select a movie.
(2) Since people from different areas may like different movies, it predicts the rating for a target area.
(3) The rating is updated in real time, and people can see the trend of a movie's rating.
(4) The data is visualized on the website. Everybody has free access to the website and can use our results to choose a movie to watch.
Actually, I am currently using the results of our project to choose movies.
Try our website:
http://twittermovierating.elasticbeanstalk.com
","m1fname":"Xing","projectname":"Twitter Based Movie Recommendation System","m3fname":"Yao"},{"m2lname":"Deshmukh","m4lname":"","m3uni":"ku2151","m1uni":"ar3539","m4uni":"","pid":"201512-64","m2uni":"ad3293","timestring":"Fri Dec 18 20:44:35 2015","m4fname":"","language":"MATLAB, Java, Hadoop","m3lname":"Upadhyay","dataset":"- New York State Load Data for 2011 - Obtained from the NYISO website
- New York State Solar Insolation Data - NREL
- New York State Wind Speed Data - NREL

The software can support any set of hourly demand profile and weather data given in .csv format. ","m1lname":"Ramakrishnan","industry":"Life Science","analytics":" Multi-Objective Optimization using Genetic Algorithm, Multi-Objective Optimization using Brute-Force Method, Mapreduce","m2fname":"Ankita","description":"All states in the USA must have a renewable portfolio standard (RPS) where the set goals to meet a certain percentage of their energy demand with renewable energy. The goal of this project is to determine the energy mix required to supply a specified percentage of the annual energy demand for a given electricity load profile. We consider the load profile of the state of New York for the entire year of 2011 and output the optimal combination of wind turbine generators and photovoltaic arrays required to supply 50% of the annual energy demand. Optimization is done with the objective of minimizing the capital cost of the generation resources while under the constraints of meeting 50% of the load and wasting less than 20% of the generated energy. These results will allow policy makers to make informed decisions on which energy providers to contract and which technologies to focus on in a given geographical location. ","m1fname":"Akhilesh","projectname":"Hybrid Renewable Capacities for Target Grid Penetration","m3fname":"Kaustubh"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"car2228","m4uni":"","pid":"201512-77","m2uni":"","timestring":"Fri Dec 18 22:10:55 2015","m4fname":"","language":"Data cleaning performed in R. Classification algorithms coded in R and C++.","m3lname":"","dataset":"The data that I used are from the Center for Research on Stock Prices (CRSP). These data include daily closing prices for different stocks. I selected all companies that were constituents of the S&P 500 Index over the period from 2010 to 2011. I used 2010 as the training period and 2011 as the test period. 
For each year, I included only companies with data for every trading day. Tickers (to identify company identities) and stock returns were the only data sources used in this project.

These data are not public, but many universities including Columbia purchase subscriptions to them. Interested researchers can request access from the reference librarians in the Business School. I downloaded them from the CRSP website, where one can name the relevant stocks or simply request all of the S&P 500 Constituents. I actually named the ones I wanted in my data request based upon public lists of the S&P 500 constituents.

The algorithms I use can be applied to any dataset of stock pairs, which can be constructed from any time-series on multiple stocks.","m1lname":"Rohlfs","industry":"Finance","analytics":"Much of the coding involved in this project was to construct the dataset of profitability for each stock pair---which involved running an algorithm on a year of data for each pair of stocks in the data (roughly 115,000 for each year). This data cleaning was performed in R.

I looked into using Mahout, but given the large number of continuous regressors in my model, Naive Bayes didn't make sense, so I focused on two alternative methods: Logistic regression and K-nearest neighbor estimation. Rather than SGD, I used Newton's method for the Logistic regression, which I coded in R. For the K-nearest neighbor estimation, I used a Euclidean distance metric, which I coded in C++. I also looked into using the Perceptron algorithm (coded in C++), but the program was prohibitively slow, so I did not obtain results from that method.","m2fname":"","description":"Pairs trading is a well-established statistical arbitrage technique that involves identifying \"mean reverting\" pairs of stocks. The correlation between the two stocks' returns is estimated over the recent history. The trader supposes that any sharp movement that departs from this historical relationship is attributable to a temporary mispricing that will be corrected in the market. Hence, if stock x rises relative to y, then the researcher shorts x and buys y until the prices return to their stable relationship.

The question of how to identify correlated pairs represents a major gap in the literature on pairs trading. Historically, researchers simply pick the stocks of companies with similar characteristics, such as Conoco-Phillips and Exxon-Mobil, or they search for stocks with high correlations. There is little evidence, however, to suggest that correlated stock pairs are the ones for which pairs trading strategies are the most profitable.

To address this limitation in the literature, I construct a new dataset in which the level of observation is the stock pair and the outcome of interest is the profitability of a simple pairs trading strategy using those two stocks. I then use Machine Learning techniques to identify which pairs of stocks would lead to the most profitable trading strategies.","m1fname":"Christopher","projectname":"Identifying Correlated Stock Pairs","m3fname":""},{"m2lname":"Perez Sanchez","m4lname":"","m3uni":"jpc2192","m1uni":"ar3579","m4uni":"","pid":"201512-16","m2uni":"pp2550","timestring":"Sat Dec 19 17:50:35 2015","m4fname":"","language":"Python, JavaScript, Spark","m3lname":"Colomer","dataset":"1. Weather Underground NYC Weather Dataset - We crawled the website to extract the data
http://www.wunderground.com/history/airport/KNYC/\" + year + \"/\" + month + \"/\" + day + \"/DailyHistory.html?format=1

2. NYC OPEN DATA: NYPD Motor Vehicle Collisions - We downloaded it from the below website:
https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95

Our application can support any dataset that contains weather and accident collisions data (after a few modifications)","m1lname":"Roy","industry":"Information","analytics":"1. Data Preprocessing
a. Filter data: remove irrelevant columns using Python
b. Unify datasets (date timezones) using Python
c. Merge datasets using Python and SQL queries
2. Use Spark to get accident statistics for weather and location
m[x, y, d, w] = # of accidents that happened on location (x,y) on the dth day of the week, with weather condition w.
3. Upload aggregated results to MySQL
4. Feed the data into CartoDB
5. Create web server using Flask (Python)
6. For every request, using Google Maps API and Wunderground API","m2fname":"Pedro","description":"Our project highlights the areas of the city one should avoid for today’s and next 10 day’s weather conditions. For that, we correlate weather conditions with the probability of a traffic accident happening. This tool helps any NYC resident to make a more informed decision about the route to take around the city. Instead of just picking the fastest route, they now have the option to pick the safest one.","m1fname":"Abhijit","projectname":"Accident Prediction System","m3fname":"Juan Pablo"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ks3184","m4uni":"","pid":" 201512-52","m2uni":"","timestring":"Wed Dec 23 05:26:13 2015","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"Data was scraped from Zhaopin (www.zhaopin.com), which is a leading career platform in China. ","m1lname":"Shen","industry":"Information","analytics":"TF-IDF, HMM, Dynamic Programming, LDA, Gradient Boosting Model","m2fname":"","description":"Job seeking and recurring websites have been experiencing a striking rise, but most of those websites are still using key-word matching to help search for jobs. In this paper, we introduce a recommendation system that combines both content-based filtering and classification to produce a score for the given resume that how well it fits the given industry category, which helps to match the right candidate with the right job.","m1fname":"Ke","projectname":"Recommendation System for Job Seeking and Recruiting","m3fname":""},{"m2lname":"zheng","m4lname":"","m3uni":"","m1uni":"cz2342","m4uni":"","pid":"201512-56","m2uni":"cz2351","timestring":"Wed Dec 23 21:04:16 2015","m4fname":"","language":"Python, pig, tableau","m3lname":"","dataset":"No other data","m1lname":"zhou","industry":"Social Science-Government","analytics":"Kmeans, Python pygmaps, tableau","m2fname":"chen","description":"The topic of this paper is about San Francisco crime rate. 
In this paper, we will discuss the relationship between crime rate, crime type and location in SF. Also, we can find where is the most high crime rate areas in SF and how to distribute police manpower.","m1fname":"chong","projectname":"San Francisco Crime Rate Analysis","m3fname":""},{"m2lname":"Han","m4lname":"","m3uni":"","m1uni":"rw2611","m4uni":"","pid":"201512-18","m2uni":"sh3447","timestring":"Wed Dec 23 21:30:34 2015","m4fname":"","language":"Python SPSS","m3lname":"","dataset":"500 soccer games","m1lname":"Wang","industry":"Social Science-Government","analytics":"(1) Pearson Correlation Coefficient

(2)Classification

(3)Linear Regression","m2fname":"Shuaiyu","description":"As one of the most popular sports, soccer is always a hot topic in people's lives. The statistics of a soccer game can reflect the performance of a professional team. Most fans believe that possession percentage and shots on goal are the most important factors in a game. In this project, we analyzed a dataset of 500 games to find which factor is most important in determining the result of a game. The results mostly match conventional expectations, but there are also some special cases that differ from what we used to think.
","m1fname":"Rui","projectname":"Data analysis on soccer team performance","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"wz2295","m4uni":"","pid":"201512-3","m2uni":"","timestring":"Wed Dec 23 23:19:11 2015","m4fname":"","language":"Python, JavaScript, Cypher, Neo4j ","m3lname":"","dataset":"Yago Knowledge Base: http://www.mpi-inf.mpg.de/yago-naga/yago

The project implemented a customized TSV file parser and importer, so as long as the RDF graph file is in TSV format, it can be used in this system.","m1lname":"Zhang","industry":"Information","analytics":"The system mainly uses graph traversal algorithms such as shortest path and graph pattern matching. Other algorithms including random walks are used for analyzing and ranking path patterns.

The system uses jQuery and Bootstrap as the front-end, D3.js for graph visualization, Bottle as the web framework, Py2neo as the persistence access library, and Neo4j for knowledge graph storage. ","m2fname":"","description":"Most existing search engines only provide keyword search or basic question answering services, and are unable to answer relationship queries. Sparql queries used in knowledge bases are difficult for users without specialized training, and the query results are not presented straightforwardly. To make relationship explanation more intuitive, this project develops a system specially for explaining relationships between two arbitrary objects from general domains. The relationships between objects are obtained from a large knowledge base and then visualized so that they can be easily understood by the users. The project also analyzes representative path patterns and uses them to speed up query performance.","m1fname":"Wangda","projectname":"RelEx: Relationship Explainer Using Knowledge Base","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc4295","m4uni":"","pid":"201512-73","m2uni":"","timestring":"Thu Dec 24 00:20:14 2015","m4fname":"","language":"PHP, HTML, JAVA (Eclipse), Apache Mahout","m3lname":"","dataset":"Instagram API. Google Map API, Individual National Park Hiking Trails' webpage","m1lname":"Chen","industry":"Media","analytics":"The input geographic position was provided by Google Map API, which converts a location into latitude and longitude. Input tag #nationalpark in Instagram API to find all related images, this also involves with pagination using do-while loop, since Instagram only allows to display 33 images per pages. The location of each image is filtered by input location so that only images within 200 miles will be displayed. The images will then be filtered so that only the most popular liked image will be shown. 
After finding the most popular national park near one’s location, the national park could be the input to look for the related hiking trails. The hiking trails are rated as easy, moderate, and strenuous. Item-based Recommendation is implemented here to provide the most reasonable","m2fname":"","description":"As the amount of data grows around the world, many of them could be utilized and help establish an easy path to explore the world. There are many search engines out there to make information searching convenient. Recommendation, which is about predicting patterns of taste, has been utilized extensively to provide good suggestions of things. Hiking is one of the outdoor activities where people could do body exercise and explore the nature.The amount of hikers has been gradually increasing and the demand for finding a good hiking trail is growing. However, there are many existing applications that don’t provide up-to-date data or in-depth analytics of a trail. The goal of this project is to first develop a web application to help discover and recommend the most popular national parks based users’ geographic position and their popularity on social media. And then to recommend the most preferential hiking trails based difficulty ratings: easy, moderate and strenuous.","m1fname":"Jing","projectname":" Recommendation: Hikes in the National Parks","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc4295","m4uni":"","pid":"201512-73","m2uni":"","timestring":"Thu Dec 24 00:33:07 2015","m4fname":"","language":"PHP, HTML, JAVA (Eclipse), ApacheMahout","m3lname":"","dataset":"Instagram API. Google Map API, Individual National Park Hiking Trails' webpage","m1lname":"Chen","industry":"Information","analytics":"The input geographic position was provided by Google map API, which converts a input location into latitude and longitude.
The tag #nationalpark was passed to the Instagram API to find all related images; this also involved pagination using a do-while loop, since Instagram only allows 33 images to be displayed per page.
Each National Park image was filtered by its distance from the input location so that only images within 200 miles were displayed.
The images were then filtered so that only the most-liked images were shown.
After finding the most popular national park near one’s location, the national park could be the input to look for the related hiking trails via Apache Mahout. Log-likelihood similarity was utilized to compute the similarity of each hiking trail. The hiking trails were categorized as easy, moderate, and strenuous.
","m2fname":"","description":"As the amount of data grows around the world, many of them could be utilized and help establish an easy path to explore the world. There are many search engines out there to make information searching convenient. Recommendation, which is about predicting patterns of taste, has been utilized extensively to provide good suggestions of things. Hiking is one of the outdoor activities where people could do body exercise and explore the nature. The amount of hikers has been gradually increasing and the demand for finding a good hiking trail is growing. However, there are many existing applications that don’t provide up-to-date data or in-depth analytics of a trail. The goal of this project is to first develop a web application to help discover and recommend the most popular national parks based users’ geographic position and their popularity on social media. And then to recommend the most preferential hiking trails based difficulty ratings: easy, moderate and strenuous.","m1fname":"Jing","projectname":"Recommendation: Hikes in the National Parks","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"zl2406","m4uni":"","pid":"201512-40","m2uni":"xw2341","timestring":"Thu Dec 24 02:06:37 2015","m4fname":"","language":"Python, Spark, Scala, Cypher, d3.js","m3lname":"","dataset":"Q&A data from Stack Exchange Data Dump, we choose the data science categories to conduct our analysis.","m1lname":"Liang","industry":"Information","analytics":"- Text mining and feature extraction: NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Sentiment analysis using AlchemyAPI
- Classification algorithms: Random Forest
- Building the graph database in Neo4j
- Visualizations from D3.js toolkit for implementing a visualization to represent the output
","m2fname":"Xinli","description":"Stack Exchange is a Q&A platform where software engineers, scientists, students share knowledge and get questions answered.

As users, we are interested in:
- Which topics are hotly discussed
- How to filter the best answers from all the given answers

As developers, we are interested in:
-The problems users are facing and how they can take such information to improve their products and documentation.

Our project addresses such problems by
- Extracting topics out of a large number of posts, along with the topic distribution of each document
- Predicting the best answers by building a classification model
- Visualizing the “network” of questions to reveal the trends and relationships among discussed topics
","m1fname":"Zhen","projectname":"Delving into the Q&A network – text mining and graph analysis","m3fname":""},{"m2lname":"Liang","m4lname":"","m3uni":"","m1uni":"xw2341","m4uni":"","pid":"201512-40","m2uni":"zl2406","timestring":"Thu Dec 24 02:59:39 2015","m4fname":"","language":"Python, Spark, Scala, Cypher, d3.js, AlchemyAPI ","m3lname":"","dataset":"Q&A data from Stack Exchange Data Dump, we choose the data science categories to conduct our analysis. ","m1lname":"Wang","industry":"Information","analytics":"- Text mining and feature extraction: NLTK and text mining tools in

Python, Spark LDA API with Python and Scala

- Sentiment analysis using AlchemyAPI

- Classification algorithms: Random Forest

- Building the graph database in Neo4j

- Visualizations from D3.js toolkit for implementing a visualization to represent the

output","m2fname":"Zhen","description":"Stack Exchange is a Q&A platform where software engineers, scientists, students

share knowledge and get questions answered.

As users, we are interested in:

- Which topics are hotly discussed

- How to filter the best answers from all the given answers

As developers, we are interested in:

-The problems users are facing and how they can take such information to improve

their products and documentation.

Our project addresses such problems by

- Extracting topics out of large amount of posts and the topic distribution of each

document

- Predicting the best answers by building a classification model

- Visualizing the “network” of questions, to know what’s the trends and relationships

among discussed topics","m1fname":"Xinli","projectname":"DELVING INTO THE Q&A NETWORK –TEXT MINING AND GRAPH ANALYSIS","m3fname":""},{"m2lname":"Liang","m4lname":"","m3uni":"","m1uni":"xw2341","m4uni":"","pid":"201512-40","m2uni":"zl2406","timestring":"Thu Dec 24 03:04:09 2015","m4fname":"","language":"Python, Spark, Scala, Cypher, d3.js, AlchemyAPI ","m3lname":"","dataset":"Q&A data from Stack Exchange Data Dump, we choose the data science categories to conduct our analysis. ","m1lname":"Wang","industry":"Information","analytics":"- Text mining and feature extraction: NLTK and text mining tools in Python, Spark LDA API with Python and Scala

- Sentiment analysis using AlchemyAPI

- Classification algorithms: Random Forest

- Building the graph database in Neo4j

- Visualizations from D3.js toolkit for implementing a visualization to represent the output","m2fname":"Zhen","description":"Stack Exchange is a Q&A platform where software engineers, scientists, students share knowledge and get questions answered.

As users, we are interested in:

- Which topics are hotly discussed

- How to filter the best answers from all the given answers

As developers, we are interested in:

-The problems users are facing and how they can take such information to improve

their products and documentation.

Our project addresses such problems by

- Extracting topics out of a large number of posts, along with the topic distribution of each document

- Predicting the best answers by building a classification model

- Visualizing the “network” of questions, to know what’s the trends and relationships among discussed topics","m1fname":"Xinli","projectname":"DELVING INTO THE Q&A NETWORK –TEXT MINING AND GRAPH ANALYSIS","m3fname":""},{"m2lname":"Liang","m4lname":"","m3uni":"","m1uni":"xw2341","m4uni":"","pid":"201512-40","m2uni":"zl2406","timestring":"Thu Dec 24 03:05:29 2015","m4fname":"","language":"Python, Spark, Scala, Cypher, d3.js, AlchemyAPI ","m3lname":"","dataset":"Q&A data from Stack Exchange Data Dump, we choose the data science categories to conduct our analysis. ","m1lname":"Wang","industry":"Information","analytics":"- Text mining and feature extraction: NLTK and text mining tools in Python, Spark LDA API with Python and Scala
- Sentiment analysis using AlchemyAPI
- Classification algorithms: Random Forest
- Building the graph database in Neo4j
- Visualizations from D3.js toolkit for implementing a visualization to represent the output","m2fname":"Zhen","description":"Stack Exchange is a Q&A platform where software engineers, scientists, students share knowledge and get questions answered.

As users, we are interested in:
- Which topics are most heavily discussed
- How to filter the best answers from all the given answers

As developers, we are interested in:
- The problems users are facing and how to use that information to improve their products and documentation.

Our project addresses such problems by
- Extracting topics from a large number of posts, along with the topic distribution of each document
- Predicting the best answers by building a classification model
- Visualizing the “network” of questions, to know what’s the trends and relationships among discussed topics","m1fname":"Xinli","projectname":"DELVING INTO THE Q&A NETWORK –TEXT MINING AND GRAPH ANALYSIS","m3fname":""},{"m2lname":"Vysyaraju","m4lname":"","m3uni":"pm2824","m1uni":"sc3973","m4uni":"","pid":"201512-56","m2uni":"scv2114","timestring":"Thu Dec 24 11:07:03 2015","m4fname":"","language":"Python,JavaScript, Node.JS, MySQL,Spark,Hadoop","m3lname":"Matey","dataset":"We generated the dataset ourselves your Twitter's API. ","m1lname":"Choudhary","industry":"Information","analytics":"Naive Bayesian Algorithm, Clustering","m2fname":"Sarat Chandra","description":"Objectives: The objective of this project was to establish Temporal(Time) and Spatial (Geographical) correlation between the two. Also the designed can validate where a particular trend is true or is a rumor ","m1fname":"Shivam ","projectname":"Visualization of Spatial Temporal Patterns of Tweeting Behavior","m3fname":"Palash Sushil"},{"m2lname":"zheng","m4lname":"","m3uni":"","m1uni":"cz2342","m4uni":"","pid":"201512-55","m2uni":"cz2351","timestring":"Thu Dec 24 15:18:15 2015","m4fname":"","language":"Python, pig tableau","m3lname":"","dataset":"No other data","m1lname":"zhou","industry":"Social Science-Government","analytics":"Kmeans, tableau","m2fname":"chen","description":"The topic of this paper is about San Francisco crime rate. In this paper, we will discuss the relationship between crime rate, crime type and location in SF. Also, we can find where is the most high crime rate areas in SF and how to distribute police manpower. The data is downloaded by kaggle.com provided by SFPD, including 1 million numbers and 10 variables. In our project, we use Pig Latin script to process the data then convert them into tableau for the behavior analysis. 
After that, we use Kmeans to classify the total data to 10 clusters based on the location of crime incidents occurred.","m1fname":"chong","projectname":"San Francisco Crime Rate","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201512-74","m2uni":"","timestring":"Fri Dec 25 10:40:13 2015","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python, Linux Shell Scripting, MongoDB, Spark, Google Charts API, Javascript, HTML.","m3lname":"","dataset":"Twitter data via Twitter Streaming API with track terms set to candidate name and candidate handle. Training dataset with approx. 1.6 million tweets manually classified as positive/negative available at http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip. Location data was derived using 2015 U.S. Gazetteer Files available at https://www.census.gov/geo/maps-data/data/gazetteer2015.html.","m1lname":"Alshewski","industry":"Social Science-Government","analytics":"I used Naïve Bayes classification algorithm for this project. The tweets are classified as positive or negative and denote whether a user will be voting for or against a candidate. The tweets are aggregated by state for each candidate to derive estimated probability of candidate winning each state. Finally, these probabilities are used to run a Monte Carlo simulation to predict the results of election. All steps are implemented in Python.","m2fname":"","description":"Based on the US Constitution, in order to win the presidential election, the candidate needs to receive more than half of Electoral College votes. There are 538 EC votes total. Thus, the winner is the one who gets at least 270. If both get 269, we have a tie.
The nation’s popular vote has no impact on the election results. For example, in 2000, Al Gore won the popular vote by more than half a million votes, but George W. Bush became President. What matters is winning individual states. To win a state, a candidate needs to win the popular vote there, which can be done with less than 50%. Each state is assigned a number of EC votes equal to its number of members in the House of Representatives plus its number of members in the Senate. For historical reasons each state (except two small states, ME (4) and NE (5)) uses a winner-takes-all method to distribute its EC votes. Thus, a candidate who wins the popular vote in a state receives ALL of the EC votes from that state.
Therefore, running an election campaign is not only an art but also a science. By strategically choosing states with more EC votes and a higher likelihood of being won, a candidate increases his/her chance of winning the election while using funding in the most efficient way. One way to determine target states is to know the public sentiment toward a candidate in each state. Knowing the sentiment allows a campaign manager to maximize the effectiveness of the campaign. The campaign manager may pinpoint locations where additional effort and funding are required to win the state’s EC votes. In addition, the campaign manager may reallocate resources and funding from states where a candidate is trailing badly to a “battlefield” state, that is, a state where the gap between candidates is very small.
In addition, the prediction can be monetized. There are markets set up where one can buy shares of options that give payoffs depending on who wins the USA Presidential Election. One can purchase these options at the Iowa Electronic Markets http://tippie.uiowa.edu/iem/markets/pres16.html.
The objective of this project is to assess public sentiment by using Twitter messages. This method is more cost-efficient, draws on a larger sample of the population, can be done in less time, and responds more quickly to events than regular polls.
","m1fname":"Kirill","projectname":"Predicting The United States Presidential Election Results Using Twitter Sentiment","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201512-74","m2uni":"","timestring":"Fri Dec 25 10:44:52 2015","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python, Linux Shell Scripting, MongoDB, Spark, Google Charts API, Javascript, HTML.","m3lname":"","dataset":"Twitter data via Twitter Streaming API with track terms set to candidate name and candidate handle. Training dataset with approx. 1.6 million tweets manually classified as positive/negative available at http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip. Location data was derived using 2015 U.S. Gazetteer Files available at https://www.census.gov/geo/maps-data/data/gazetteer2015.html.","m1lname":"Alshewski","industry":"Social Science-Government","analytics":"I used Naïve Bayes classification algorithm for this project. The tweets are classified as positive or negative and denote whether a user will be voting for or against a candidate. The tweets are aggregated by state for each candidate to derive estimated probability of candidate winning each state. Finally, these probabilities are used to run a Monte Carlo simulation to predict the results of election. All steps are implemented in Python.","m2fname":"","description":"Based on the US Constitution, in order to win the presidential election, the candidate needs to receive more than half of Electoral College votes. There are 538 EC votes total. Thus, the winner is the one who gets at least 270. If both get 269, we have a tie.
The nation’s popular vote has no impact on the election results. For example, in 2000, Al Gore won the popular vote by more than half a million votes, but George W. Bush became President. What matters is winning individual states. To win a state, a candidate needs to win the popular vote there, which can be done with less than 50%. Each state is assigned a number of EC votes equal to its number of members in the House of Representatives plus its number of members in the Senate. For historical reasons each state (except two small states, ME (4) and NE (5)) uses a winner-takes-all method to distribute its EC votes. Thus, a candidate who wins the popular vote in a state receives ALL of the EC votes from that state.
Therefore, running an election campaign is not only an art but also a science. By strategically choosing states with more EC votes and a higher likelihood of being won, a candidate increases his/her chance of winning the election while using funding in the most efficient way. One way to determine target states is to know the public sentiment toward a candidate in each state. Knowing the sentiment allows a campaign manager to maximize the effectiveness of the campaign. The campaign manager may pinpoint locations where additional effort and funding are required to win the state’s EC votes. In addition, the campaign manager may reallocate resources and funding from states where a candidate is trailing badly to a “battlefield” state, that is, a state where the gap between candidates is very small.
In addition, the prediction can be monetized. There are markets set up where one can buy shares of options that give payoffs depending on who wins the USA Presidential Election. One can purchase these options at the Iowa Electronic Markets http://tippie.uiowa.edu/iem/markets/pres16.html.
The objective of this project is to assess public sentiment by using Twitter messages. This method is more cost-efficient, draws on a larger sample of the population, can be done in less time, and responds more quickly to events than regular polls.
","m1fname":"Kirill","projectname":"Predicting The United States Presidential Election Results Using Twitter Sentiment","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"zhang.wangda","m4uni":"","pid":"201512-3","m2uni":"","timestring":"Tue Jan 5 16:52:34 2016","m4fname":"","language":"Python, JavaScript, Cypher, Neo4j ","m3lname":"","dataset":"Yago Knowledge Base: http://www.mpi-inf.mpg.de/yago-naga/yago

The project implemented a customized TSV file parser and importer, so as long as the RDF graph file is in TSV format, it can be used in this system.
","m1lname":"Zhang","industry":"Information","analytics":"The system mainly uses graph traversal algorithms such as shortest path and graph pattern matching. Other algorithms including random walks are used for analyzing and ranking path patterns.

The system uses jQuery and Bootstrap as the front-end, D3.js for graph visualization, Bottle as the web framework, Py2neo as the persistence access library, and Neo4j for knowledge graph storage.
","m2fname":"","description":"Most existing search engines only provide keyword search or basic question answering services, and are unable to answer relationship queries. Sparql queries used in knowledge bases are difficult for users without specialized training, and the query results are not presented straightforwardly. To make relationship explanation more intuitive, this project develops a system specially for explaining relationships between two arbitrary objects from general domains. The relationships between objects are obtained from a large knowledge base and then visualized so that they can be easily understood by the users. The project also analyzes representative path patterns and uses them to speed up query performance.
","m1fname":"Wangda","projectname":"RelEx: Relationship Explainer Using Knowledge Base","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cl300","m4uni":"","pid":"201605-48","m2uni":"","timestring":"Wed May 11 14:52:09 2016","m4fname":"","language":"Assembly","m3lname":"","dataset":"Secret","m1lname":"Lin","industry":"Information","analytics":"Brain network","m2fname":"","description":"This is a great project.","m1fname":"Ching-Yung","projectname":"A test project","m3fname":""},{"m2lname":"Phadke","m4lname":"","m3uni":"","m1uni":"pat70","m4uni":"","pid":"201605-45","m2uni":"mp3212","timestring":"Wed May 11 17:49:52 2016","m4fname":"","language":"GPU/CUDA/OpenCL1.2 (pyCUDA), Python, Alchemy API, HighCharts, Java/Scala/JavaScript/Vertx/Spark/Kafka/AngularJS ","m3lname":"","dataset":"• Twitter Streaming

• Yahoo Finance

• Google Finance ","m1lname":"Thatte","industry":"Finance","analytics":"• Stats Analysis (StdDev) run on GPU/OpenCL1.2 using pyOpenCL

• Alchemy Sentiment Analysis

• HighCharts (Angular JS)

Twitter/Kafka streamed to RDDs:
• windowed reduction for price updates
• keyed joins for sentiment weighting","m2fname":"Manjiri","description":"Portfolio Management tools are hand-rolled by individual teams for their calculations and platforms. We previously built a single calculation, scalable platform that gave fund managers a ticking view of their holdings (we had implemented a basic version of this in the lower level course (E6893) in Fall 2015).

In this project we have iterated on the same idea and enhanced it using the tools and technologies used in this course.
The calculators perform meaningful operations on real data as we learnt in this course. All operations are performed live and results are presented in real time –
• Alchemy scoring of Twitter streaming data.
• Actual prices from Google and Yahoo Finance.
• Price predictions run on GPU Grid using pyOpenCL.

Visualization techniques address information overload and provide meaningful reporting rather than dumping data to user.","m1fname":"Paresh","projectname":"Stateful algorithms/UI to process streaming stock activity and news.","m3fname":""},{"m2lname":"Sihag","m4lname":"","m3uni":"","m1uni":"aab2234","m4uni":"","pid":"201605-34","m2uni":"gs2835","timestring":"Wed May 11 19:59:57 2016","m4fname":"","language":"Python/Jupyter Docker","m3lname":"","dataset":"The data set with its many attributes was provided by Expedia for a professional competition on Kaggle. The training data set is approximately 3.5GB with almost 37 million unique data points. Link: https://www.kaggle.com/c/expedia-hotel-recommendations/data","m1lname":"Bagri","industry":"Retail","analytics":"Analytics:
1. Room count Analytics
2. Search Span
3. Mobile Analytics
4. International Industry Analytics

Algorithms:
1. PCA – Feature Generation
2. Mini-Batch K-means
3. Support Vector Machines, RandomForestClassifier
4. Neural Network Model","m2fname":"Gautam","description":"It is our goal to accurately analyze the large data set provided by Expedia containing attributes relevant to the hotel industry and thus find interesting and illuminating insights which are valuable not only to hotel owners to build strategies to increase revenue or help with marketing campaigns but also to potential customers looking for making a booking. Through various correlations of the attributes, we generate graphs and charts that point to extremely relevant and perhaps unconventional results. We are currently working on Machine Learning algorithms to provide hotel recommendations for the customers. By putting our project to use in the hotel industry, we wish to claim that hotel owners and customers gained a better understanding of the trends and hence both benefitted from the data analytics carried out.","m1fname":"Aditya","projectname":"Expedia Hotel Analytics","m3fname":""},{"m2lname":"Yu","m4lname":"","m3uni":"","m1uni":"pw2394","m4uni":"","pid":"201605-23","m2uni":"gy2226","timestring":"Wed May 11 20:10:19 2016","m4fname":"","language":"Python, R, SQL","m3lname":"","dataset":"Source 1: R ggplot2movie package dataset, with variables such as movie ratings, production year, budgets, MPAA rate, movie votes and corresponding distribution etc.

Source 2: IMDB top 250 movie reviews collected through IMDBpie in Python.

Source 3: Over 2000 tweets about two recent movies, collected through R.","m1lname":"Wu","industry":"Media","analytics":"SQL language in R is used to clean the data and choose useful predictive features.
Imputation for missing values is done using proximity in Random Forests.
We use several statistical learning algorithms (Random Forests, SVM, K-means and DBSCAN) for movie rating prediction and for clustering similar movies. Sentiment analysis and word clouds are used for visualization.","m2fname":"Guanshun","description":"This project primarily aims at selecting important predictive features to predict movie ratings and at clustering movies. Sentiment analysis and keyword analysis of good movies are our secondary goal.

The important features in the statistical learning models can be an easy indicator of whether a movie is good or not. Audiences can benefit from looking at the \"important features\" of a movie. Producers and directors can also consider these factors.

Clustering results could also be a good starting point for a movie recommendation system. Such a system can make recommendations by comparing each movie’s cluster with the clusters of a user’s previously preferred movies. We can even use cluster results to roughly predict movie ratings. Therefore, in future work, users’ previous ratings and preferences might be a major data source.

As we are more interested in good movies (especially top movies), a sentiment analysis would be useful for us to better understand the users’ needs.
","m1fname":"Peng","projectname":"Magic Movie Predictors","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rtm2129","m4uni":"","pid":"201605-32","m2uni":"","timestring":"Wed May 11 21:10:30 2016","m4fname":"","language":"Python, Apache Spark, Selenium, BeautifulSoup, Pandas, TextBlob, SciKit-Learn, R","m3lname":"","dataset":"Gathered data using a combination of wget, API calls, and web scraping.

- Wikipedia Page Views
- Open Movie Database (OMDb) API
- IMDB User Reviews
","m1lname":"Munoz","industry":"Media","analytics":"- NaiveBayes Classifier from TextBlob for sentiment analysis
- Binary classifiers:
1. -regularized logistic regression (“logit”),
2. Support vector machines (SVMs) using a radial basis function (RBF) kernel (“svm-rbf”),
3. SVMs using a linear kernel (“svm-linear”),
4. SVMs using a polynomial kernel (“svm-poly”), and
5. Boosted decision trees using the Adaboost algorithm (“adaboost”).
","m2fname":"","description":"For my final report, I investigated the ability to predict the winner of the Best Picture Oscar Award from the nominees for ceremonies celebrating years 2008 through 2015. Thus, the goal of this project was to use the following in order to create and evaluate binary classifiers for predicting the winner of the Best Picture Award at the Oscars:

1. Wikipedia page views for a particular movie,
2. Critic ratings,
3. Audience ratings, and
4. Sentiment analysis of long-form audience reviews.

This project is an extension of a project in Haughton et al. (2015) where the authors show how to analyze sentiment from Internet Movie Database (IMDb) movie reviews and Twitter data for nine movies [1]. They suggest one future direction to be investigating the prediction ability for a larger set of movies.","m1fname":"Richard","projectname":"Oscar Winner Prediction based on User Reviews and Wikipedia Activity","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sl3953","m4uni":"","pid":"201605-44","m2uni":"","timestring":"Wed May 11 22:03:11 2016","m4fname":"","language":"Python","m3lname":"","dataset":"Tweets collected via Twitter API
","m1lname":"Liu","industry":"Social Science-Government","analytics":"Algorithm:
Text mining, a subfield of cluster analysis, is the analysis of large collections of text to find patterns between documents.
Consensus clustering: in the case of text mining, the consensus matrix is used in place of the term-document matrix when clustering again (applied under time constraints).
","m2fname":"","description":"General Purpose
Track tweets with keywords about the 2016 election to analyze which candidate is
more popular on the web. Make a brief prediction of the 2016 election from one month
of data tracked from Twitter.com

Why is this important
The 2016 election might be the most important election of our lifetimes because the Republican
and Democratic candidates hold different positions on policies regarding
climate change and immigration, on which side will win more seats on the Supreme Court,
and on many more issues that make this year’s election important.
","m1fname":"Shuhang","projectname":"Twitter Based 2016 Election Predictions and Candidates` Popularity Check","m3fname":""},{"m2lname":"Shah","m4lname":"","m3uni":"","m1uni":"dp2781","m4uni":"","pid":"201605-12","m2uni":"ss4936","timestring":"Wed May 11 22:24:37 2016","m4fname":"","language":"Python","m3lname":"","dataset":"The data has been obtained from professor Julian McAuley specifically for this project. More information can be found at: http://jmcauley.ucsd.edu/data/amazon/

This dataset is not publicly available.","m1lname":"Parekh","industry":"Retail","analytics":"Classification models used: Naive Bayes, Logistic Regression, Decision Tree and Gradient Boosting","m2fname":"Sahil","description":"This project aims to predict the helpfulness of Amazon product reviews, specifically in the category of Grocery and Gourmet food. We train various classifier models using a training set of preprocessed reviews. The classifier predicts if the review is helpful or not.

On purchase of a product, Amazon invites the buyer to rate the product and write a review. It also asks users to vote on whether a review is helpful or not. Amazon uses the number of helpful votes when deciding the order of the top reviews. Herein lies the drawback of the review helpfulness system. Older product reviews would have been read by more people over time and probably would receive more helpful votes than a newer review which might be equally helpful. Since the older review would have more votes, it would be shown before newer reviews.
","m1fname":"Dhrumil","projectname":"Predicting Amazon Product Review Helpfulness","m3fname":""},{"m2lname":"Shen","m4lname":"","m3uni":"","m1uni":"bw2491","m4uni":"","pid":"201605-3","m2uni":"xs2259","timestring":"Wed May 11 22:26:48 2016","m4fname":"","language":"Python, Java, Javascript, Flask Framework, MySQL, Twitter Streaming API, AlchemyAPI, Linux","m3lname":"","dataset":"We used twitter data in this project, and the data can be obtained by our self-made crawler. But our software can also support other datasets!","m1lname":"Wang","industry":"Retail","analytics":"Crawling data using Twitter Streaming API;
Clustering the data using Spark and Kmeans Clustering Algorithm;
Doing sentiment analysis on the data using AlchemyAPI;
Injecting all the records after processing into MySQL database;
Rendering final result using Flask framework and Google Chart APIs.","m2fname":"Xinyi","description":"The main goal of our project, a web application, is to help a customer investigate the product he would like to buy without viewing all the tweets or help a company to obtain the opinion of its customer toward its product as soon as possible. According to our surveys, a customer, if he is not familiar with the product he would like to buy, his desire of buying this product would be influenced by the existed tweets of his friends. So for a company it’s necessary to obtain the opinions of its customers toward its product as soon as possible. But it is very time consuming and sometimes even misleading to view all the positive and negative tweets at the same time. But now our project can help these companies! The expected result of web application can analysis all the existing tweets regarding a product and extracting the sentiments of these tweets and render the final distribution of the sentiments on the user interface using various charts. ","m1fname":"Bin","projectname":"Social Sentimental Comment Detecting System","m3fname":""},{"m2lname":"Zhao","m4lname":"","m3uni":"","m1uni":"yx2318","m4uni":"","pid":"201605-4","m2uni":"yz2877","timestring":"Wed May 11 22:42:39 2016","m4fname":"","language":"Python, JavaScript, HTML, CSS","m3lname":"","dataset":"We used the Twitter Streaming API to help us to get the real-time data","m1lname":"Xu","industry":"Information","analytics":"We used Flask to support our front-end design.
Tweepy to obtain real-time data.
TF-IDF is the core algorithm to recommend accounts for user.","m2fname":"Yuechen","description":"Our motivation to implement this application is due the fact that many people like sharing their ideas on Twitter. There are so many interesting data can be obtained from the tweets. However, many people rely on electronics so much that they are not that easy to get close to friends. Thus, we think it would be a good idea to make a combination, namely, use Twitter to help people find friends who have similar ideas and who are also very close. ","m1fname":"Yingtao","projectname":"Emate: Friend Finder Application","m3fname":""},{"m2lname":"Zhao","m4lname":"","m3uni":"","m1uni":"yx2318","m4uni":"","pid":"201605-4","m2uni":"yz2877","timestring":"Wed May 11 22:49:07 2016","m4fname":"","language":"Python, Flask, JavaScript, HTML, CSS","m3lname":"","dataset":"We used the Twitter Streaming API to help us to get the real-time data","m1lname":"Xu","industry":"Information","analytics":"We used Flask to support our front-end design
We used Tweepy to obtain real-time data.
TF-IDF is the core algorithm to recommend accounts for user.","m2fname":"Yuechen","description":"Our motivation to implement this application is due the fact that many people like sharing their ideas on Twitter. There are so many interesting data can be obtained from the tweets. However, many people rely on electronics so much that they are not that easy to get close to friends. Thus, we think it would be a good idea to make a combination, namely, use Twitter to help people find friends who have similar ideas and who are also very close. ","m1fname":"Yingtao","projectname":"Emate: Friend Finder Application","m3fname":""},{"m2lname":"Zhao","m4lname":"","m3uni":"","m1uni":"yx2318","m4uni":"","pid":"201605-4","m2uni":"yz2877","timestring":"Wed May 11 22:52:34 2016","m4fname":"","language":"Python, Flask, JavaScript, HTML, CSS","m3lname":"","dataset":"We used the Twitter Streaming API to help us to get the real-time data","m1lname":"Xu","industry":"Information","analytics":"We used Flask to support our front-end design
We used Tweepy to obtain real-time data.
TF-IDF is the core algorithm to recommend accounts for user.","m2fname":"Yuechen","description":"Our motivation to implement this application is due the fact that many people like sharing their ideas on Twitter. There are so many interesting data can be obtained from the tweets. However, many people rely on electronics so much that they are not that easy to get close to friends. Thus, we think it would be a good idea to make a combination, namely, use Twitter to help people find friends who have similar ideas and who are also very close. ","m1fname":"Yingtao","projectname":"Emate: Friend Finder Application","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"up2131","m4uni":"","pid":"201605-21","m2uni":"","timestring":"Thu May 12 00:30:46 2016","m4fname":"","language":"Python, GPU, Tensorflow","m3lname":"","dataset":"Leafsnap","m1lname":"Parikh","industry":"Life Science","analytics":"Convolutional Neural Network
- rectified linear unit
- convolution
- Local Response Normalization
- L2 Norm
- Max pooling","m2fname":"","description":"My main objective was to Identify tree species by analyzing leaf images using Convolutional Neural Networks. Convolutional Neural Network based classification has been show to reach up to 97% accuracy on CIFAR and MNIST datasets. I wanted to see if I can reach the same level of accuracy on leaf dataset. I chose to use Tensorflow for this experiment simply because it is relatively easy to set up, it supports GPU computing and has a robust interface to visualize the neural network. The leaf database was sparse so to expand the dataset to the level that is required by a convolutional neural network I applied random \"mutations\" to the images. The neural network is steadily converging. ","m1fname":"Urvish","projectname":"Leafsnap2","m3fname":""},{"m2lname":"Dai","m4lname":"","m3uni":"","m1uni":"zl2442","m4uni":"","pid":"201605-16","m2uni":"yd2349","timestring":"Thu May 12 00:35:43 2016","m4fname":"","language":"Python, Html, CSS, Javascript","m3lname":"","dataset":"The data set are gather from some trip review websites.
The raw data are saved in a txt file, so most editing software can open it.
And the analyzed data are saved in cvs file so I recommend to use Excel.","m1lname":"Lin","industry":"Life Science","analytics":"Mainly based on the item-based Collaborative filtering using tags and sentiment analytics.
And the result will be shown on the website.","m2fname":"Yihan","description":"This project aims to develop a searching platform, by which tourists in road trip can receive the recommendations of interest of places based on their input information. This platform supports 5 different kinds of input boxes, where people can choose any of them as the computing standard of the platform. The five distinct categories are the preferred places people want to travel, tourism time, budget, soft services and characters of tourists (teenagers prefer playground or shopping mall).

Existing trip websites only provide accommodation information and reviews of tourist attractions; they cannot make decisions for users. A user who needs to choose one destination from several has to do a lot of research alone, which takes time. We therefore developed this web application to make that judgement for users.

We will consider the factors of trip like weather, what users want and the accommodations to predict the best trip destination for them. And also return some tourist's attraction that they may be interested in. With the help of our website, users can quickly decide the place to go and to schedule their trip.","m1fname":"Zikai","projectname":"Analysis of Tourism Destination Prediction","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201605-46","m2uni":"","timestring":"Thu May 12 00:39:08 2016","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python/pySpark, Linux Shell Scripting, SQLite.","m3lname":"","dataset":"Stock price data obtained from Netfonds Bank and Google Finance websites.","m1lname":"Alshewski","industry":"Finance","analytics":"1) Modified version of Dynamic Time Warping algorithm with Manhattan and Euclidean distances was used for identification of patterns in stock price movements.
2) Random Forest regression was used for predicting the future price of the stock.
3) Charts visualization was implemented using Python's matplotlib package.","m2fname":"","description":"The methods used to analyze securities and make investment decisions fall into two very broad categories: fundamental analysis and technical analysis. While fundamental analysis involves analyzing the characteristics of a company in order to estimate its value, technical analysis takes a completely different approach. Technical analysis is only concerned with the price movements in the market. Technical analysis combines the laws of supply and demand with the “psychological” part of trading. The main focus of technical analysis is to identify patterns of stock movements. These patterns, in turn, will give a trader a hint of what is the future direction of price movement will be. One way to identify stock price patterns is to visually inspect the stock price chart. However, there are major drawbacks with this approach – there are thousands of stocks available and there are various chart time frequencies in which a certain pattern may form. Therefore, it is very cumbersome and time consuming process to visually identify a stock from a large pool of stocks. Moreover, especially when searching for patterns on small frequency chart, by the time the pattern is identified it may be too late to execute a trade since the expected stock movement already happened.
The purpose of this project is to programmatically identify stock patterns of interest by scanning through the large pool of stocks. Once the stock(s) are identified the program is going to run regression analysis to predict future price of selected stock(s) in order to confirm the direction of the price. The idea is to have two factors affecting the decision whether to execute a trade on a particular stock. For example, a certain pattern may indicate that the stock price will go up, but regression may produce the opposite result. In this case it may be better to put this stock on the watchlist instead of making a trade. In case, when the pattern and regression indicate the price movement in the same direction it is safer to make a trade.","m1fname":"Kirill","projectname":"Identifying patterns in stock price movements and predicting future price","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"kks2142","m4uni":"","pid":"201605-30","m2uni":"","timestring":"Thu May 12 02:37:10 2016","m4fname":"","language":"Spark, Python (pandas, NumPy, scikit-learn, statsmodels)","m3lname":"","dataset":"• LendingClub’s data is freely available here (https://www.lendingclub.com/info/download-data.action)

I use all data for approved loans from 2013-2015, available in 4 .csvs. This includes 56 variables such as characteristics of the loan {interest rate, term, amount, purpose} and facts about the borrower like {income, occupation, location}. There are 884,633 loans in the dataset, of which 190,720 contain all the variables I want and have either fully paid off or been charged off (the borrower defaulted).

• U.S. state-level economic indicators from the St. Louis FRED database (https://research.stlouisfed.org/fred2/)

I use monthly, quarterly, and annual data from 2013-2015. For each U.S. state, I pull a time series of Civilian Unemployment Rate, House Price Index, and Real Median Household Income. I apply lag and difference transformations to these three macroeconomic indicators and match to each borrower’s state.

• News data from the Global Database of Events, Language, and Tone (GDELT, http://gdeltproject.org/)

GDELT is an enormous web scraper that reads the news in almost every country, in all major languages, in real time. Any time an article is published in a major news source, it appears soon thereafter in GDELT. There are two databases: the Event Database and the Global Knowledge Graph. I use .csvs from the latter (GKG). Daily GKG data is available from April 2013 onward, so I restrict my sample for all three datasets to this period. I used Spark to process every day of available data in this three-year sample, which totaled 100 GB in size.
","m1lname":"Sattar","industry":"Finance","analytics":"I used Spark to preprocess GDELT data and pandas to preprocess LendingClub and FRED data. I matched each loan in the LendingClub data to regional macro indicators in FRED and to sentiment scores in GDELT. I then used a logit model to determine a set of relevant features, and then did a grid search of machine learning models to minimize classification error.","m2fname":"","description":"Peer-to-peer lenders are gaining larger and larger shares of the credit market, and their loans are being securitized and moved throughout the global financial system. P2P loans are a new and fundamentally different asset class underwritten on different criteria from traditional consumer debt. My project uses machine learning and natural language processing to study financial risk in a quickly growing industry.","m1fname":"Kaivan","projectname":"LendingClub Loan Performance","m3fname":""},{"m2lname":"Parekh","m4lname":"","m3uni":"","m1uni":"rk2845","m4uni":"","pid":"201605-2","m2uni":"prp2121","timestring":"Thu May 12 03:10:06 2016","m4fname":"","language":"Java, Python, IntelliJ, Apache Spark","m3lname":"","dataset":"The list of songs was compiled from the website -
http://www.rrindex.com/azsong.htm

Lyrics for each song were collected programmatically using the “pyLyrics” API for Python.

These lyrics were stored in a structured manner in individual files.
","m1lname":"Kulkarni","industry":"Media","analytics":"Clustering was done using : Latent Dirichlet Allocation algorithm
Visualization was done using : HighCharts API","m2fname":"Parth","description":"Song clustering and recommendation is an important feature in music streaming applications. Accuracy of this feature is imperative for retaining and attracting users. Song clustering can be of 2 types – Audio based and content (lyrics) based. For users who want to listen to songs having a similar meaning (might be of different genre), content based clustering would be the solution to achieve better user experience.
","m1fname":"Rohan","projectname":"Clustering and Recommendation of Songs Based On Lyrics.","m3fname":""},{"m2lname":"Parekh","m4lname":"","m3uni":"","m1uni":"rk2845","m4uni":"","pid":"201605-2","m2uni":"prp2121","timestring":"Thu May 12 03:26:52 2016","m4fname":"","language":"Java, Python, Apache Spark","m3lname":"","dataset":"The list of songs was compiled from the website -
http://www.rrindex.com/azsong.htm

Lyrics for each song were collected programmatically using the “pyLyrics” API for Python.

These lyrics were stored in a structured manner in individual files.
","m1lname":"Kulkarni","industry":"Media","analytics":"Clustering : Latent Dirichlet Allocation algorithm for Topic Modelling
Visualization : HighCharts","m2fname":"Parth","description":"Song clustering and recommendation is an important feature in music streaming applications
Accuracy of this feature is imperative for retaining and attracting users
Song clustering can be of 2 types – Audio based and content (lyrics) based
For users who want to listen to songs having a similar
meaning (might be of different genre), content based clustering would be the solution to achieve better user experience","m1fname":"Rohan","projectname":"Clustering and Recommendation of Songs based on Lyrics","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"tl2699","m4uni":"","pid":"201605-37","m2uni":"kl2855","timestring":"Thu May 12 06:19:24 2016","m4fname":"","language":"Python and Scala","m3lname":"","dataset":"Yahoo Finance","m1lname":"Lin","industry":"Finance","analytics":"Monte Carlo, Variance-Covariance, Spark","m2fname":"Keying","description":"We are to estimate the Value at Risk for investment.

Value at risk is used by risk managers in order to measure and control
the level of risk which the firm undertakes. It is a simple measure of an
investment’s risk that tries to provide a reasonable estimate of maximum
probable loss over a particular time period. ","m1fname":"Tao","projectname":"VaR Estimation on Spark","m3fname":""},{"m2lname":"Jiao","m4lname":"","m3uni":"","m1uni":"ql2268","m4uni":"","pid":"201605-29","m2uni":"zj2203","timestring":"Thu May 12 06:32:12 2016","m4fname":"","language":"python","m3lname":"","dataset":"We collect the data from Steam Web api, including user and game information. ","m1lname":"Li","industry":"Media","analytics":"SVD, User-based and Item-based Collaborative Filtering","m2fname":"Zihan","description":"We build a game recommendation system for steam user based on games they played. We use play time as the user's preference on a game. Based on Collaborative Filtering algorithm, we recommend both games and friends for a user. ","m1fname":"Qianwen","projectname":"Game recommendation on Steam","m3fname":""},{"m2lname":"Phadke","m4lname":"","m3uni":"","m1uni":"pat70","m4uni":"","pid":"201605-45","m2uni":"mp3212","timestring":"Thu May 12 07:59:13 2016","m4fname":"","language":"GPU/CUDA/OpenCL1.2 (pyCUDA), Python, Alchemy API, HighCharts, Java/Scala/JavaScript/Vertx/Spark/Kafka/AngularJS ","m3lname":"","dataset":"• Twitter Streaming
• Yahoo Finance
• Google Finance
","m1lname":"Thatte","industry":"Finance","analytics":"Stats Analysis (StdDev) run on GPU/OpenCL1.2 using pyOpenCL

Alchemy Sentiment Analysis

HighCharts (Angular JS)

Twitter/Kafka streamed to RDDs:
• windowed reduction for price updates
• keyed joins for sentiment weighting","m2fname":"Manjiri","description":"Portfolio Management tools are hand-rolled by individual teams for their calculations and platforms. We previously built a single calculation, scalable platform that gave fund managers a ticking view of their holdings (we had implemented a basic version of this in the lower level course (E6893) in Fall 2015).

In this project we have iterated on the same idea and enhanced it using the tools and technologies used in this course.
The calculators perform meaningful operations on real data as we learnt in this course. All operations are performed live and results are presented in real time –
• Alchemy scoring of Twitter streaming data.
• Actual prices from Google and Yahoo Finance.
• Price predictions run on GPU Grid using pyOpenCL.

Visualization techniques address information overload and provide meaningful reporting rather than dumping data on the user.
","m1fname":"Paresh","projectname":"Stateful algorithms/UI to process streaming stock activity and news.","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"vl2351","m4uni":"","pid":"201605-43","m2uni":"","timestring":"Thu May 12 09:17:46 2016","m4fname":"","language":"R","m3lname":"","dataset":"All the data was scraped off Google Scholar's webpages ( painfully ).","m1lname":"Le Breton","industry":"Social Science-Government","analytics":"For the regression part: The linear regression, regression trees with gradient boosting and generalized additive model.

To compute the different graph-related metrics, we used the \"igraph\" R package.

The gender of authors was inferred from their first names using the Genderize.io API.","m2fname":"","description":"The goal is to predict the h-index of authors given personal information as well as information related to the author's position in the network of publications/co-authorships.

The h-index is a commonly used measure of an author's impact in its field, we want to study how this impact can be explained by the author's position in the network.","m1fname":"Victor","projectname":"Networks and fame for researchers","m3fname":""},{"m2lname":"Jiang","m4lname":"","m3uni":"","m1uni":"ys2867","m4uni":"","pid":"201605-15","m2uni":"mj2716","timestring":"Thu May 12 10:09:01 2016","m4fname":"","language":"AWS platform and Python","m3lname":"","dataset":"The dataset we use is The Japanese Female Facial Expression
(JAFFE) database. There are 213 images with 7 expressions.
http://www.kasrl.org/jaffe.html

Other images in square size can be accepted","m1lname":"Song","industry":"Information","analytics":"The algorithm we use is Principal Components Analysis (PCA)
PCA can help reduce the dimensions of the image matrix.
","m2fname":"Meng","description":"Face recognition is commonly used in many fields.
The first step of face recognition is to compute eigenfaces.
Eigenfaces form the basis used to reconstruct a facial image.
Every face can be represented as a weighted sum of eigenfaces.
Eigenfaces bring out the most prominent features shared across faces.
They can give us a better understanding of our faces.
","m1fname":"Yao","projectname":"Eigenface Calculator","m3fname":""},{"m2lname":"Bhargava","m4lname":"","m3uni":"","m1uni":"gtm2122","m4uni":"","pid":"201605-9","m2uni":"ab3955","timestring":"Thu May 12 10:38:30 2016","m4fname":"","language":"Python, JavaScript, Ubuntu 14.04, Apache Spark, Yelp API, GoogleMaps API, VMWARE workstation player 12","m3lname":"","dataset":"NYPD 7 Major Felony dataset was tested, accessible through data.gov for the public. Also daily alerts from spotcrime.com was used to update dataset daily. This was downloaded using imaplib which helps to access email.","m1lname":"Maliakal","industry":"Transportation","analytics":"Gaussian Mixture Model was learned with 4 clusters using Apache Spark, Google Maps API was used for visualization. Crime locations were plotted. The \"safe\" routes were plotted and juxtaposed against Google Maps' recommendation and crime plots. ","m2fname":"Anubha","description":"In this project, the objective was to plot the safest route taking into consideration historical crime data.
Gaussian mixture models with 4 clusters were used to model the distribution of crime throughout New York. 24-hour shops were assumed to be safe areas and were used to form checkpoints.
The coordinates of each 24-hour shop were input into the Gaussian mixture model; the lower the resulting value, the less the location belongs to the crime clusters, making it \"safe\".

This is important because we all know that New York is very unsafe and has a high crime rate. These models could help in avoiding areas with high chance of a crime happening.","m1fname":"Gabriel","projectname":"Safest Route Prediction in New York City","m3fname":""},{"m2lname":"Thomas","m4lname":"","m3uni":"","m1uni":"ab3955","m4uni":"","pid":"201605-9","m2uni":"gtm2122","timestring":"Thu May 12 10:38:35 2016","m4fname":"","language":"Javascript, HTML, Python, Spark, Yelp API, Google Maps API","m3lname":"","dataset":"The NYPD 7 Major Felony Incidents dataset was tested. It was accessible off data.gov and available to the public. We also used Spotcrime.com's live data feed of crime alerts.","m1lname":"Bhargava","industry":"Transportation","analytics":"We used a Gaussian mixture model to fit the crime locations and a mixed Gaussian Multivariate Distribution to determine the locations of the 24 hour shops to check the safety. We used these locations as waypoints in plotting the route. For visualization, we used the Google Maps API.","m2fname":"Gabriel","description":"This platform will predict the safest walking route in New York City using a NYPD crime dataset and live feed from Spotcrime.com. It determines the path where the least crime occurs and where there are the most 24 hour shops. Our algorithm used a Gaussian Mixture Model to predict this.","m1fname":"Anubha","projectname":"Safest Route Prediction in New York City","m3fname":""},{"m2lname":"Sun","m4lname":"","m3uni":"","m1uni":"pz2210","m4uni":"","pid":"201605-7","m2uni":"hs2874","timestring":"Thu May 12 11:06:04 2016","m4fname":"","language":"Python, Postgre SQL, HTML, CSS, JS","m3lname":"","dataset":"1. Yelp Dataset - Business Division; Yelp official website reviews.
2. Collected through the Yelp developer API and web scraping.
3. Any other life science relevant dataset, our system can support.","m1lname":"Zhou","industry":"Life Science","analytics":"Analytics: Machine Learning Algorithm, like: LDA, GMM.

Algorithms: Web Scraping, Latent Dirichlet allocation, Gaussian Mixture Model.

System Modules: Data Collecting, Data Analysis (Filter, Parse), Data Storing, Data Visualization.

Visualization: We build a front-end and back-end visualization system to further illustrate our project. The web lets the user to choose each category and input the search key. our system could provide relevant information to the user.","m2fname":"Haitian","description":"Objectives: Nightlife never ends in NYC, it’s an essential and indispensable part of people’s daily life. There is always an increasingly demand on Nightlife market. It’s meaningful for us to take a deeper research on it. So, our objective is to provide the best NYC Nightlife recommendations based on user’s preferences (key words).

Innovations: Used regular expressions, stop words, and Alchemy to filter data. Implemented LDA and GMM to cluster the dataset. Stored the data on Microsoft Cloud Service and visualized it through an HTML/CSS-based website.

Importance: As noted in Objectives, nightlife is an indispensable part of New Yorkers' daily life, and it is a promising market to develop. Machine learning algorithms are well suited to processing and analyzing big datasets.
","m1fname":"Peiran ","projectname":"The Best NYC Nightlife","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rtm2129","m4uni":"","pid":"201605-32","m2uni":"","timestring":"Thu May 12 11:06:40 2016","m4fname":"","language":"Python, Apache Spark, Selenium, BeautifulSoup, Pandas, TextBlob, SciKit-Learn, R","m3lname":"","dataset":"Gathered data using a combination of wget, API calls, and web scraping.

- Wikipedia Page Views
- Open Movie Database (OMDb) API
- IMDB User Reviews","m1lname":"Munoz","industry":"Media","analytics":"- Naive Bayes Classifier from TextBlob for sentiment analysis
- Binary classifiers:
1. -regularized logistic regression (“logit”),
2. Support vector machines (SVMs) using a radial basis function (RBF) kernel (“svm-rbf”),
3. SVMs using a linear kernel (“svm-linear”),
4. SVMs using a polynomial kernel (“svm-poly”), and
5. Boosted decision trees using the Adaboost algorithm (“adaboost\"). ","m2fname":"","description":"For my final report, I investigated the ability to predict the winner of the Best Picture Oscar Award from the nominees for ceremonies celebrating years 2008 through 2015. Thus, the goal of this project was to use the following in order to create and evaluate binary classifiers for predicting the winner of the Best Picture Award at the Oscars:

1. Wikipedia page views for a particular movie,
2. Critic ratings,
3. Audience ratings, and
4. Sentiment analysis of long-form audience reviews.

This project is an extension of a project in Haughton et al. (2015) where the authors show how to analyze sentiment from Internet Movie Database (IMDb) movie reviews and Twitter data for nine movies [1]. They suggest one future direction to be investigating the prediction ability for a larger set of movies.","m1fname":"Richard","projectname":"Oscar Winner Prediction based on User Reviews and Wikipedia Activity","m3fname":""},{"m2lname":"WANG","m4lname":"","m3uni":"","m1uni":"sx2172","m4uni":"","pid":"201605-6","m2uni":"yw2810","timestring":"Thu May 12 11:13:36 2016","m4fname":"","language":"Python, MATLAB, alchemyapi, tweepy","m3lname":"","dataset":"There are three data sources used:
Small Area Health Insurance Estimates (SAHIE) program
Small Area Income and Poverty Estimates (SAIPE)
Tweets on health insurance from different locations

The first two are from the US Census API, which gives an official overview of how the uninsured rate is or is not related to the poverty rate.
The last one is used for sentiment analysis of public opinion on local health insurance.

Any, as long as it is well processed and in a structured format (csv, json).","m1lname":"XING","industry":"Retail","analytics":"K-means clustering
Linear regression
PPMCC
Sentiment analysis
Visualisations of data in U.S. map using python with Pandas and Vincent","m2fname":"YINGQI","description":"Our project provides a guideline for a health insurance company to decide what are the best counties in U.S. for it to launch or promote certain health insurance plans.

Our research is important because a health insurance company can learn from our results the uninsured rate and the poverty rate of every county in the U.S. We provide them with the first 10% of the counties of a state that are low in poverty rate and high in uninsured rate. This means that a large share of people there are capable of paying the insurance premium but don't yet realise its importance. We see these counties as the potential market.","m1fname":"SHAN","projectname":"Analysis on Health Insurance Coverage in U.S.","m3fname":""},{"m2lname":"Gao","m4lname":"","m3uni":"","m1uni":"zh2255","m4uni":"","pid":"201605-20","m2uni":"hg2412","timestring":"Thu May 12 11:21:54 2016","m4fname":"","language":"Python, javascript","m3lname":"","dataset":"Topic data crawled from Wikipedia portal; Raw channel and video data from Youtube API; Social media data from Twitter Search API","m1lname":"He","industry":"Media","analytics":"LDA topic modeling, Sentiment Analysis, Random Forest, Cascaded Random Forest, PCA, K-Means, Item-based recommendation","m2fname":"Haoxiang","description":"In our project, we carry out a comprehensive data-driven study of influential factors on YouTube channel popularity. To our knowledge, most academic studies in this area have focused on analytics of individual videos rather than channels. If we consider a channel as our target, we can introduce data from other sources such as social media, which will be interesting. The unstructured and multi-typed raw data from different sources requires us to implement different data preprocessing approaches. The core of our study will be a proper machine learning model which can predict channel popularity quantified by labels we discovered from clustering analysis, and we will use all the features extracted from raw data. Such a model can also indicate the relative importance of different influential factors and the video topics that tend to be more popular.
We also intend to implement visualization of channel recommender.","m1fname":"Ziyu","projectname":"Data Analytics for Video Popularity","m3fname":""},{"m2lname":"Chang","m4lname":"","m3uni":"","m1uni":"qj2133","m4uni":"","pid":"20160512","m2uni":"jc4267","timestring":"Thu May 12 11:26:47 2016","m4fname":"","language":"Python, R","m3lname":"","dataset":"Dataset is obtained from twitter streaming API and searchAPI.","m1lname":"Qiu","industry":"Media","analytics":"Tweepy API, AlchemyAPI, Shiny","m2fname":"Jonathan","description":"Objective: To predict whether Donald Trump will win the 2016 Presidential Election.

We are curious whether we can extract valuable knowledge from social media to predict who will be the next President of the United States.

We wanted to see how sentiment/cognitive analysis on social media can be used to predict election results, and how accurate the prediction is. Since most voters do not have the opportunity to meet the candidates in person, how a candidate is perceived by the general public through online news information is important.

","m1fname":"Jing","projectname":"Social Media Analysis on 2016 Presidential Election","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"zl2438","m4uni":"","pid":"201605-19","m2uni":"zz2361","timestring":"Thu May 12 11:32:24 2016","m4fname":"","language":"Python, Javascript","m3lname":"","dataset":"Twitter streaming API","m1lname":"Liu","industry":"Social Science-Government","analytics":"Markov Chain Inference","m2fname":"Zhili","description":"Predict presidential election 2016","m1fname":"Zhengrong","projectname":"Presidential election prediction based on sentimental analysis of tweets","m3fname":""},{"m2lname":"Si","m4lname":"","m3uni":"","m1uni":"jy2736","m4uni":"","pid":"201605-14","m2uni":"ys2887","timestring":"Thu May 12 11:36:03 2016","m4fname":"","language":"Language: python (sklearn, theano), Platforms: Amazon AWS, jupyter","m3lname":"","dataset":"Dataset: The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset.
https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/3m9u-ws8e

The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified file contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.

","m1lname":"Yuan","industry":"Life Science","analytics":"Algorithms we used in
1. Analytics:
Ridge regression
Random forest regression

2. Prediction:
Ridge regression
Linear regression
Tree
Neural networks (MLP)","m2fname":"Yuqi","description":"1. Discuss the various types of hospitals that comprise the hospital industry: General hospital and Specialty hospital, specially hip and knee replacement surgeries.
2. Determine the correlation between patient attributes and charges for hip and knee replacement.
3. Estimate and predict charges from the patient information provided.
","m1fname":"Jingyi","projectname":"Can General Hospitals Compete Specialty Hospitals?","m3fname":""},{"m2lname":"Van Loon","m4lname":"","m3uni":"","m1uni":"pm2824","m4uni":"","pid":"201605-41","m2uni":"mhv2109","timestring":"Thu May 12 11:40:01 2016","m4fname":"","language":"Python 2.7, Spark, SQLLite3, Amazon Ec2, Amazon S3, Amazon DynamoDb, Boto3, PRAW, scipy","m3lname":"","dataset":"We have mined Reddit and Twitter for our data. The Twitter dataset serves as a baseline for comparing and contrasting our results. We have extensively mined Reddit(since the stream is public) to populate our dataset with over 6Gb of information.

Our code can support/mine any kind of internet comment thread. ","m1lname":"Matey","industry":"Social Science-Government","analytics":"We gauge the general reading level of the users, get an overall idea of the sentiment of users in a particular sub-reddit. We can extract other popular topics in the same thread. We can get an accurate representation of users and their affinity to certain 'topics', by analyzing the other sub-reddits they post in.

Algorithms :
For sentiment Analysis:
Used a Voted Classifier, which works on the principle of collecting a majority vote from the classifiers in the pool (in this case 7). Using this classifier and only classifying sentences with over 80% confidence, we get the sentiment score.
For Reading level Score :
This test rates text on a U.S. schoolgrade level. For example, a score of 8.0 means that an eighth grader can understand the document. For most documents, aim for a score of approximately 7.0 to 8.0.
For Common topics in a thread :
The natural language processing toolkit, exploring all the facets including lemmatizing, stemming, stop_words, tokenize, chunking and chinking. Using chunking & chinking coupled with Named Entity Recognition.
SQL queries

System modules: All the algorithms have been made into simple modules that can be imported into a program and run on the command line.

","m2fname":"Marshall ","description":"Objectives : One of the most important objectives for us was to analyze a big data set, to that effect we have mined for Reddit data and populated datasets with over 6Gb of information. In order to analyze this dataset we have used Spark clusters, learning and understanding SparkSQL. Instead of evaluating the content that people post we wanted to interpret the people who make those comments and analyze common traits.

Innovations : To this effect we first filtered out users based on a certain popular 'topic', performed sentiment analysis to gauge the general mood of the conversation, and extended that to a reading level analysis. This way we can analyze the level of conversation of the users participating in that thread. Further, in the same conversation we extract specific details, such as which organization/person is being spoken about the most (other than the topic itself).
We also have a way to map each user and all the sub-reddits they are active in. In this manner we form a correlation between the user and their affinity to certain 'topics'.

Capabilities : We can gauge the general reading level of the users, get an overall idea of the sentiment of users in a particular sub-reddit. We can extract other popular topics in the same thread. We can get an accurate representation of users and their affinity to certain 'topics', by analyzing the other sub-reddits they post in.

This research can be used to market products to users who follow certain celebrities. For example, if a person actively participates in the 'Kim Kardashian' subreddit and is active on the 'perfumes' subreddit, we could potentially market a perfume by 'Kim Kardashian'.
We can generate a general user profile. User1 : Active on threads 'Bernie Sanders', 'Donald Trump', 'Manchester United' and 'Mark Ruffalo'. User2 : Active on threads 'Bernie Sanders', 'Manchester United', and 'memes'. Furthermore, we can find the intersection between the two Users and determine if posting on threads related to 'Mark Ruffalo' had a impact on them supporting 'Bernie Sanders'.","m1fname":"Palash Sushil","projectname":"Behavioural Mapping","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rds2174","m4uni":"","pid":"201605-47","m2uni":"","timestring":"Thu May 12 11:41:39 2016","m4fname":"","language":"Python","m3lname":"","dataset":"Yahoo Finance for past year","m1lname":"Shah","industry":"Finance","analytics":"Monte Carlo simulations for optimization","m2fname":"","description":"Optimize Portfolios to generate maximum returns for a given level of risk appetite","m1fname":"Rushabh ","projectname":"Portfolio Optimizationn","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"","m1uni":"sw3013","m4uni":"","pid":"201605-17","m2uni":"yw2786","timestring":"Thu May 12 11:42:10 2016","m4fname":"","language":"Python flask, html, javascript,css","m3lname":"","dataset":"We collected the real time datasets from API including: Real time Uber Surge multiplier using Uber Official API,Real time weather and temperature using Yahoo Weather API, Real time incident and traffic using Maprequest API, Weather and temperature forcast using Weather Underground API and Historical taxi data from NYC Taxi Dataset.
","m1lname":"Wu","industry":"Transportation","analytics":"Lasso regression ,cross validation and dynamic linear model, combination of distance, time and surge multiplier. Javascript Google Heatmap visualization and Javascript Google map.","m2fname":"Yitong","description":"We want to explore Uber’s concept of surge pricing mechanism in response to imbalanced supply and demand.
We want to glean more insights from the data and do exploratory analysis of which factors affect surge pricing the most, and to make surge predictions for 10 and 30 minutes ahead in order to find a better price for customers.
We want to make recommendations to Uber drivers to help them earn more money.
","m1fname":"Sihan","projectname":"Uber Surge Prediction and Analysis Web Application","m3fname":""},{"m2lname":"Xu","m4lname":"","m3uni":"","m1uni":"hp2414","m4uni":"","pid":"201605-22","m2uni":"cx2179","timestring":"Thu May 12 11:43:35 2016","m4fname":"","language":"Big Query, Python, R, Javascript, HTML","m3lname":"","dataset":"The dataset are from Reddit database.
It's accessible at https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_01","m1lname":"Pan","industry":"Media","analytics":"Logistic Classification, Linear Regression, Quadratic Regression, K-nearest neighbor, Radar Chart, Sankey Diagram, heat map, circular chart","m2fname":"Chen","description":"Good comment can create big discussion in favor of any websites.
Develop methods to classify a comment.
Develop methods to estimate score of a comment.","m1fname":"Haowen","projectname":"Reddit Comment “Catfight” Generator","m3fname":""},{"m2lname":"Hao","m4lname":"","m3uni":"","m1uni":"ll2985","m4uni":"","pid":"201605-39","m2uni":"bh2562","timestring":"Thu May 12 11:44:03 2016","m4fname":"","language":"Meteor.js, Node.js","m3lname":"","dataset":"Self-collected tweets","m1lname":"Liu","industry":"Information","analytics":"LSTM for sentiment analysis (Deep learning),

Real-time visualization using Meteor.js","m2fname":"Boya","description":"Visualize people’s opinions towards candidates on Google Maps
Get to know the candidates’ personality through their tweets
See people’s reaction towards candidates’ speech by topics
","m1fname":"Ling","projectname":"Social Media Based Election Analysis Web App","m3fname":""},{"m2lname":"Jiang","m4lname":"","m3uni":"","m1uni":"yh2791","m4uni":"","pid":"201605-13","m2uni":"gj2258","timestring":"Thu May 12 11:44:25 2016","m4fname":"","language":"language: Swift, Objective-C, C++, platform: OpenCV, Xcode","m3lname":"","dataset":"Stanford dogs dataset:
http://vision.stanford.edu/aditya86/ImageNetDogs/

Our software can support any dog pictures.","m1lname":"Huang","industry":"Information","analytics":"algorithm: LBPH","m2fname":"Guanyu","description":"Objective: develop an iOS app with OpenCV

Innovation: Dogs come in all shapes and sizes, and frequently without pedigrees to describe their heritage. The breeds of dogs with unknown or mixed-breed lineages are frequently guessed based on their physical appearance, but it is not known how accurate these visual breed assessments are.

Capabilities: recognize the breed of a dog given its face picture via face recognition algorithm.

Why important: While many people like to know “What kind of dog is that?” just to satisfy their curiosity, dog breed designations have also been used in an attempt to predict future behavior, match pets to families, find lost dogs, and even to restrict the ownership of certain types of dogs.","m1fname":"Yibei","projectname":"HiSnoopy – dog breeds recognition app","m3fname":""},{"m2lname":"XIAO","m4lname":"","m3uni":"","m1uni":"ts2957","m4uni":"","pid":"201605-5","m2uni":"yx2329","timestring":"Thu May 12 11:45:34 2016","m4fname":"","language":"python, NumPy, scikit-learn","m3lname":"","dataset":"We use the resumes on www.indeed.com whose database has over 60 million resumes. We do web scraping on that website. ","m1lname":"SHEN","industry":"Information","analytics":"PDF parse, TF-IDF, weighting algorithm","m2fname":"YAO","description":"Many job seekers do not know which companies may give an interview opportunity according to their resumes. They are suffering from finding companies that have a relatively bigger chance to get a job offer from. Some of them may send out resume to as many companies as possible, use the ‘shotgun’ method for conducting a job search.
It would be much more efficient if candidates knew which companies are relatively more likely to offer them an interview. Our project, therefore, is to find out which aspects a company really cares about, and then provide job seekers with a ‘sniper’ to target their best-chance company according to their resumes.
Our innovation is that we analyze each company’s taste for each category of the resume, such as ‘education background’ and ‘experience’, and calculate the normalized weight of each category for each company. Then, we optimize the recommendation by combining the category weight and TF-IDF similarity.

","m1fname":"TIANHE","projectname":"Recommendation System ‘Job Hunter’","m3fname":""},{"m2lname":"Gurunath","m4lname":"","m3uni":"","m1uni":"kp2652","m4uni":"","pid":"201605-11","m2uni":"rg2997","timestring":"Thu May 12 11:49:04 2016","m4fname":"","language":"Python, NodeJS, AWS , Keras library","m3lname":"","dataset":"Dataset:
Crimes- 2001 to present from https://data.cityofchicago.org/Public-Safety

How to get it:
Downloaded the dataset from Jan 2008 to March 2016 for these 5 categories.
Parsed the data to retain relevant columns like Date, Crime Type, Location Description, Community Area etc.

What other data?:
Program can run on any similar crime dataset.
","m1lname":"Premkumar","industry":"Social Science-Government","analytics":"Neural Networks using Keras library
Apriori algorithm
Google Heat map visualization using NodeJS","m2fname":"Rohit","description":"Objectives:
Chicago is infamous for being the crime capital due to the prevalence of the Italian Mafia in the city.
Large open dataset on Chicago crimes available.
Meaningful insights on correlation between place, time and type of crime to make people more aware.

Innovations and Capabilities:
Classification of crime types with probability of occurrence
Association analysis to identify correlation
Visualizations through heat maps","m1fname":"Kavya","projectname":"Chicago Crime Analysis","m3fname":""},{"m2lname":"Rao","m4lname":"","m3uni":"","m1uni":"sl4017","m4uni":"","pid":"201605-8","m2uni":"cr2832","timestring":"Thu May 12 11:52:11 2016","m4fname":"","language":"Python, Caffe, HTML, Flask, Jinja2 Platform: Mac OS ","m3lname":"","dataset":"MNIST
Handwritten digits: a training set of 60,000 examples, and a test set of 10,000 examples

Cifar-10/100
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool [Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012)]
","m1lname":"Lu","industry":"Media","analytics":"Multilayer perceptron
Convolutional neural networks

LeNet: a layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [ LeNet ]

AlexNet: a layered model composed of convolution, subsampling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [ AlexNet ]

We have developed a website that lets users upload images. The website will display image classification results to users.
","m2fname":"Chenxi","description":"Given an image, we humans can recognize what it is immediately. Today’s computers still cannot recognize images with 100% accuracy. The intent of image classification is to categorize a digital image into one or several pre-specified classes or objects. It is a classic problem in computer vision.

From application perspective:

People have many images today, and it becomes harder to find a particular one. People rely on their own eyes to find an image in a large dataset. To facilitate the search process, it is useful to design an algorithm that can tag images automatically. In this way, people can find an image based on its content rather than relying only on the visual information.

From research perspective:
Deep learning neural networks have many hyper-parameters that can be tweaked during training: learning rate, weight, bias, etc. Researchers are developing different network structures to solve different problems. Image classification is one of them. It is important to test and fine-tune networks designed for image classification so that we can gain more knowledge in this field.

Our goal is to develop a web service that can store, classify(tag) and search users’ images.

To classify images automatically, we decided to use a deep neural network (DNN). The network learns image features from data rather than using pre-defined ones. Images contain a large number of features, which makes it very difficult and time-consuming to find and define them one by one. However, neural networks can be trained to learn those features.

To store and search images, we decided to develop a website where users can upload their images to our server so that we can classify them using our algorithms.
","m1fname":"Shanqi","projectname":"Image Classification As A Cloud Service","m3fname":""},{"m2lname":"Chang","m4lname":"","m3uni":"","m1uni":"qj2133","m4uni":"","pid":"201605-12","m2uni":"jc4267","timestring":"Thu May 12 11:53:16 2016","m4fname":"","language":"Python, R","m3lname":"","dataset":"Dataset is obtained from Twitter streamingAPI and searchAPI","m1lname":"Qiu","industry":"Media","analytics":"Tweepy API, Alchemy API, Shiny (viz)","m2fname":"Jonathan","description":"Objective
To predict whether Donald Trump will win the 2016 Presidential Election.

We are curious whether we can extract valuable knowledge from social media to predict who will be the next President of the United States. We wanted to see how sentiment/cognitive analysis on social media can be used to predict election results, and how accurate the prediction is. Since most voters do not have the opportunity to meet the candidates in person, how a candidate is perceived by the general public through online news information is important.

","m1fname":"Jing","projectname":"Social Media Analysis on 2016 Presidential Election","m3fname":""},{"m2lname":"Sihag","m4lname":"","m3uni":"","m1uni":"aab2234","m4uni":"","pid":"201605-34","m2uni":"gs2835","timestring":"Thu May 12 11:54:34 2016","m4fname":"","language":"Python with Jupyter instance running on Dockers, Kaggle instance, git","m3lname":"","dataset":"The data set with its many attributes was provided by Expedia for a professional competition on Kaggle. The training data set is approximately 3.5GB with almost 37 million unique data points. Link: https://www.kaggle.com/c/expedia-hotel-recommendations/data ","m1lname":"Bagri","industry":"Retail","analytics":"Analytics:
1. Room count Analytics
2. Search Span
3. Mobile Analytics
4. International Industry Analytics

Algorithms:
1. PCA – Feature Generation
2. Mini-Batch K-means
3. Support Vector Machine
4. Binary Classifier, RandomForestClassifier
5. Leakage Solution
6. Neural Network Model","m2fname":"Gautam","description":"It is our goal to accurately analyze the large data set provided by Expedia containing attributes relevant to the hotel industry and thus find interesting and illuminating insights which are valuable not only to hotel owners to build strategies to increase revenue or help with marketing campaigns but also to potential customers looking for making a booking. Through various correlations of the attributes, we generate graphs and charts that point to extremely relevant and perhaps unconventional results. We implemented algorithms to predict a returning users hotel cluster selection, and gave the top 5 predictions. We are currently working on Machine Learning algorithms to provide hotel recommendations for the customers. By putting our project to use in the hotel industry, we wish to claim that hotel owners and customers gained a better understanding of the trends and hence both benefitted from the data analytics carried out.","m1fname":"Aditya","projectname":"Expedia Hotel Analytics","m3fname":""},{"m2lname":"Singh","m4lname":"","m3uni":"","m1uni":"dhv2108","m4uni":"","pid":"201605-24","m2uni":"as4916","timestring":"Thu May 12 11:54:39 2016","m4fname":"","language":"Spark, Python, D3.js, EMR","m3lname":"","dataset":"NYC Open Data - Motor Vehicle Collisions - https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95
Weather Data - http://w2.weather.gov/climate/index.php?wfo=okx#show-last-Point","m1lname":"Vanvari","industry":"Information","analytics":"Spark SQL Queries
D3.js, C3.js
Amazon EMR and S3
Scikit, MatplotLib","m2fname":"Anshuman ","description":"Objectives:
Analyzing this accident data can help discover accident hotspots.
These hotspots and the causes of accidents at these hotspots can
then be relayed to the concerned authorities.
This data will enable them to take the necessary action to avert
these situations.
Not only does this benefit the city directly with fewer
accidents but also indirectly with less pressure on the already
overloaded infrastructure (Emergency Rooms, Ambulances,
Police Department).

","m1fname":"Diksha","projectname":"Analysis of NYC Motor Vehicle Collision Data","m3fname":""},{"m2lname":"Purbey","m4lname":"","m3uni":"","m1uni":"ai2336","m4uni":"","pid":"201605-18","m2uni":"np2544","timestring":"Thu May 12 11:54:43 2016","m4fname":"","language":"Python, Flask ","m3lname":"","dataset":"
We used the public Yelp Academic Dataset

business

{
'type': 'business',
'business_id': (encrypted business id),
'name': (business name),
'neighborhoods': [(hood names)],
'full_address': (localized address),
'city': (city),
'state': (state),
'latitude': latitude,
'longitude': longitude,
'stars': (star rating, rounded to half-stars),
'review_count': review count,
'categories': [(localized category names)]
'open': True / False (corresponds to closed, not business hours),
'hours': {
(day_of_week): {
'open': (HH:MM),
'close': (HH:MM)
},
...
},
'attributes': {
(attribute_name): (attribute_value),
...
},
}

review

{
'type': 'review',
'business_id': (encrypted business id),
'user_id': (encrypted user id),
'stars': (star rating, rounded to half-stars),
'text': (review text),
'date': (date, formatted like '2012-03-14'),
'votes': {(vote type): (count)},
}

user

{
'type': 'user',
'user_id': (encrypted user id),
'name': (first name),
'review_count': (review count),
'average_stars': (floating point average, like 4.31),
'votes': {(vote type): (count)},
'friends': [(friend user_ids)],
'elite': [(years_elite)],
'yelping_since': (date, formatted like '2012-03'),
'compliments': {
(compliment_type): (num_compliments_of_this_type),
...
},
'fans': (num_fans),
}

check-in

{
'type': 'checkin',
'business_id': (encrypted business id),
'checkin_info': {
'0-0': (number of checkins from 00:00 to 01:00 on all Sundays),
'1-0': (number of checkins from 01:00 to 02:00 on all Sundays),
...
'14-4': (number of checkins from 14:00 to 15:00 on all Thursdays),
...
'23-6': (number of checkins from 23:00 to 00:00 on all Saturdays)
}, # if there was no checkin for a hour-day block it will not be in the dict
}

tip

{
'type': 'tip',
'text': (tip text),
'business_id': (encrypted business id),
'user_id': (encrypted user id),
'date': (date, formatted like '2012-03-14'),
'likes': (count),
}

photos (from the photos auxiliary file)

This file is formatted as a JSON list of objects.

[
{
\"photo_id\": (encrypted photo id),
\"business_id\" : (encrypted business id),
\"caption\" : (the photo caption, if any),
\"label\" : (the category the photo belongs to, if any)
},
{...}
]","m1lname":"Iyer","industry":"Life Science","analytics":" LDA (Latent Dirichlet Allocation)for training
Flask to make a rudimentary web app to demonstrate functionality
Alchemy api for sentiment analysis
","m2fname":"Niharika","description":"We hope to identify what users care about the most when writing their reviews, and ultimately determine what certain restaurants are doing right and wrong in order to receive these ratings.
This will help restaurant owners make decisions to increase their revenue.
It will also help other users as the tags would provide more context for a restaurant.
","m1fname":"Aishwarya","projectname":"Categorization and Analysis of Yelp reviews","m3fname":""},{"m2lname":"Jain","m4lname":"","m3uni":"","m1uni":"gss2147","m4uni":"","pid":"201605-25","m2uni":"bkj2111","timestring":"Thu May 12 11:54:44 2016","m4fname":"","language":"Python, Jupyter Notebook","m3lname":"","dataset":"Citibike System Trip Data and Station Data
Almanac.com - Daily New York City Weather data","m1lname":"Sadekar","industry":"Transportation","analytics":"Analysis:
Dataset querying with Python Pandas, wrote code for creating aggregate stats
Web Scraping for weather data
Used the StreetEasy API for area/neighborhood information

Algorithms:
Regression Models using Statsmodels / Scikit-Learn

Visualization:
Matplotlib - Bar plots
Google Heat Maps (gmaps Ipython widget)","m2fname":"Bahul","description":"CitiBike provides this daily data so that one can analyze how people use CitiBike, where they ride to and from, which areas are most popular biking destinations, and many such questions.
This data is very rich and ripe for analysis, to show location based trends and patterns in NYC. It can give insights into how tourists and casual bike riders like to move about in the city, and can be turned into useful and actionable knowledge to improve intra-city tourism and transportation services.

Analyzed the trips taken by CitiBike users to find the most frequented CitiBike stations in New York City.
Detected the most popular neighborhoods and areas of CitiBike riders.
Correlated actual popularity of areas with Citibike popularity using map visualizations.
Estimated demand and supply of bikes in neighborhoods based on usage statistics and station docking data.
Created a Regression Model to correlate weather (daily average temperature, precipitation, snow) impact on CitiBike activity, to see if there is a direct relation between the two.
","m1fname":"Gaurang","projectname":"Citibike Trip Data Analysis","m3fname":""},{"m2lname":"Haricharan","m4lname":"","m3uni":"","m1uni":"jsc2226","m4uni":"","pid":"201605-28","m2uni":"sh3451","timestring":"Thu May 12 11:55:44 2016","m4fname":"","language":"Python, Javascript, amCharts, Jython, Angular.js, Bootstrap","m3lname":"","dataset":"The dataset used was the Medicare Hospital Compare dataset. It is public, accessible at https://data.medicare.gov/data/archives/hospital-compare.
Downloaded from the online link.
","m1lname":"Chhatwal","industry":"Life Science","analytics":"Linear Ridge Regression
amCharts USA State Wise visualization
Bar Graphs for disease trends prediction
Google Maps API for nearby recommended hospitals
","m2fname":"Siri","description":"State-wise spending prediction on healthcare
Disease trends prediction
Hospital recommendation based on location, disease and affordability

It makes sense to have an unbiased recommendation system for the best healthcare provider based on a patient’s condition, current location, cost of treatment and reviews by earlier patients about the institution, individual doctors","m1fname":"Jivtesh","projectname":"Analytics on the Medicare Hospital dataset","m3fname":""},{"m2lname":"Mehra","m4lname":"","m3uni":"","m1uni":"zo2131","m4uni":"","pid":"201605-33","m2uni":"mm4694","timestring":"Thu May 12 11:55:49 2016","m4fname":"","language":"Python, Theano, Lasagne, Nolearn, Ubuntu, GPU programming, AWS","m3lname":"","dataset":"1. Facescrub Dataset: http://vintage.winklerbros.net/facescrub.html
2. Kaggle Emotion Detection Dataset: https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge","m1lname":"Onodi-Szucs","industry":"Information","analytics":"1. Convolutional Neural Networks
2. Haar classification based facial detection","m2fname":"Maanit","description":"1. Facial Emotion Detection
2. Celebrity Face-match

Emotion Detection finds application in a variety of marketing opportunities, social interaction and medical research","m1fname":"Zoltan","projectname":"Facial Recognition using Convolutional Neural Networks on Large Scale Image Datasets","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rm3330","m4uni":"","pid":"201605-26","m2uni":"","timestring":"Thu May 12 11:56:59 2016","m4fname":"","language":"Python, Spark, HTML, JavaScript","m3lname":"","dataset":"Yelp Restaurant Reviews Google Places Coordinates and Yelp Ratings using:
San Francisco Public Parks Open Data
Google API
Yelp API
","m1lname":"Mathur","industry":"Information","analytics":"Classification, Data Scraping, K-Means","m2fname":"","description":"San San aims to find the closest restrooms to places and find the places with ambience that most closely matches user preferences in the area. It uses multiple sources to establish a better map of service areas that are open to the public and private areas. San Francisco serves as a good model city for NYC as it lacks any sort of map for such areas for people. Also, it can improve upon NYC’s current map by gathering more data
","m1fname":"Rikin","projectname":"San San","m3fname":""},{"m2lname":"Gaba","m4lname":"","m3uni":"","m1uni":"tpt2109","m4uni":"","pid":"201605-1","m2uni":"vhg2105","timestring":"Thu May 12 11:58:26 2016","m4fname":"","language":"Python, Javascript,HTML,CSS","m3lname":"","dataset":"Link to Dataset :

NYC taxi TLC data : http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
Uber Data: https://github.com/fivethirtyeight/uber-tlc-foil-response

","m1lname":"Thaly","industry":"Information","analytics":"Scikit
Matplotlib
ggplot
AmCharts
Google Fusion Tables
C3.js
Google BigQuery
Apache Spark on Amazon Elastic MapReduce Clusters

Algorithms Used:
Decision Trees
Random Forests
Support Vector Machines
k-Nearest Neighbours","m2fname":"Vinay","description":"Our main aim was to figure out any conclusions that we can, given a large archive of data, which was mainly just numerical values spanning over a large range of dates.
The intention was to make sense of the seemingly arbitrary data in a way which can be easily understood by the layman.
Uber and Yellow Cabs are one of the most important travel lifelines of the city. We have simplified the comprehension of the analysis of the data through an aesthetically appealing and an uncomplicated representation of our results on our webpage.
The research and toolkits that we used provided us a lot of information and helped us learn new skills","m1fname":"Tanaya","projectname":"Analysis of NYC Cab Rides","m3fname":""},{"m2lname":"Barona","m4lname":"","m3uni":"","m1uni":"am4417","m4uni":"","pid":"201605-38","m2uni":"jab2397","timestring":"Thu May 12 11:58:55 2016","m4fname":"","language":"python, spark","m3lname":"","dataset":"The data set (~2.6 million records) was obtained from kaggle. It was death records. It originally contained 38 features out of which 13 were selected for our project.","m1lname":"Murthy","industry":"Life Science","analytics":"Random Forest for classification","m2fname":"Jerry","description":"This project aims to describe the death records in the United States. It attempts to predict the manner of death of a person given specific information of the person.","m1fname":"Adarsh ","projectname":"Death Records Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"mo2499","m4uni":"","pid":"201605-27","m2uni":"","timestring":"Thu May 12 12:00:02 2016","m4fname":"","language":"Python, R","m3lname":"","dataset":"The data was collected from NBA.com. Although what is viewable on the website is limited, actual JSON data is much richer than what is displayed. The data consist of x-y coordinates on the court for made and missed field goal attempts from every match (1230 games) in the 2014-2015 NBA regular season.
I also collected data that include the closest defender at the time of shooting for each shot and the distance between the defender and the shooter. I also connect this data set with metadata which include some useful information, such as the number of dribbles the shooter took before shooting, the duration of time that the shooter touched the ball before a shot, shot clock, game clock, shot types, etc.","m1lname":"Oh","industry":"Information","analytics":"Mixed-effects modelling, EM, Rejection Sampling","m2fname":"","description":"In basketball, all players compete on both offense and defense, and the core strategies of basketball revolve around scoring points on offense and preventing points on defense. Every shot event in a basketball match is the result of a shooter's action under the influence of defense of the opponent team -- whether it is defense by a single opponent player or multiple opponent players (or none when a shooter is wide open). While it may not seem difficult to empirically characterize shooting abilities of shooters by simply observing field goal percentages at different locations on the court, it is much more challenging to account for how defenders affect shooting. Furthermore, different players clearly possess distinct shooting propensity.
I construct a mixed-effects model that incorporate both fixed-effects parameters, which apply to the entire population of players, and random effects, which apply to specific shooters and defenders. Given a shot attempt, I model the probability that the shot is made as a function of fixed effects of match situations and random effect of the offensive player's shooting skills, the defender at the time of the shot, the distance of that defender to the shooter, and where the shot was taken.
","m1fname":"Min-hwan","projectname":"Mixed Effect Model for Learning Latent Shooting Ability and Defensive Ability in Basketball","m3fname":""},{"m2lname":"Nanda","m4lname":"","m3uni":"","m1uni":"pkd2108","m4uni":"","pid":"201605-42","m2uni":"an2706","timestring":"Thu May 12 12:01:07 2016","m4fname":"","language":"R, Spark","m3lname":"","dataset":"We used a dataset available on Mashable and Twitter tweets were collected. ","m1lname":"Dutta","industry":"Media","analytics":"SVM, KNN","m2fname":"Ashish","description":"Viral news is a large business now. Individuals are spending large amounts of money to get their product endorsed by viral means.

As such, a business opportunity has opened. We provide a dash board of various analytic techniques to predict how viral a given piece of news will be and to understand the public's opinions. ","m1fname":"Preetam","projectname":"Demystifying Viral News: Exploring Information Flow Network, Content Based Classification, Feature Selection","m3fname":""},{"m2lname":"Suryanarayanan","m4lname":"","m3uni":"","m1uni":"kk3098","m4uni":"","pid":"201605-31","m2uni":"ss4951","timestring":"Thu May 12 12:01:58 2016","m4fname":"","language":"Python, theano, AWS GPU instance","m3lname":"","dataset":"The dataset was obtained from Kaggle. (https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data)

The software supports any image of dimension 48 * 48","m1lname":"Kesarla Shantharam","industry":"Information","analytics":"Convolutional neural network was implemented to predict the mood of the person in the image

Angular js was used for front end development

Python Flask","m2fname":"Sharan","description":"Determine the mood of a person from the selfie.
Recommend songs based on the mood predicted

Implementation

A three layer convolutional neural network was trained on a set of 35000 pre classified training images.

The training images used were pre classified into one of - Angry, Disgust, Fear, Happy, Sad, Surprise, or Neutral.

The testing and the validation set each consisted of 3500 images.

The convolutional neural network was implemented in Theano. (Theano integrates tightly with GPUs and can perform data-intensive calculations up to 140x faster than on a CPU.)

The neural network was trained on a GPU instance on AWS.

The network was trained for approximately 30 minutes and a test accuracy of close to 55% was obtained on the test dataset.

","m1fname":"Kushwanth Ram","projectname":"Selfie Based Song Recommadations","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jc4267","m4uni":"","pid":"201605-36","m2uni":"","timestring":"Thu May 12 12:13:47 2016","m4fname":"","language":"Python, R, Shiny","m3lname":"","dataset":"50,000 tweets from Twitter’s streaming API related to any of the running candidates in the primaries are stored in a table. For each state, delegate percentages for each party for each candidate are recorded. The primary election results stored in a python dictionary for use in the prediction algorithm.","m1lname":"Chang","industry":"Media","analytics":"A custom algorithm based on percentage differenced of tweet mentions and primary delegate results was created. This results were visualized using R Shiny on map showing if the state is predicted to vote R or D. ","m2fname":"","description":"The primary aim of this project is to predict which presidential candidate will win the election. Primary election data, combined with Twitter data and sentiment analysis are considered in this report. An algorithm to predict election winners was developed based on past primary election data.","m1fname":"Jonathan","projectname":"Election Prediction Model based on Twitter Mention Count and Sentiment Analysis correlated with Primary Election Results","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"cy2403","m4uni":"","pid":"201605-40","m2uni":"","timestring":"Thu May 12 12:14:37 2016","m4fname":"","language":"Python, AWS","m3lname":"","dataset":"MTA API will be used in this project to fetch real time traffic data. Programming language is Python. http://datamine.mta.info/","m1lname":"Yuan","industry":"Transportation","analytics":"Classification","m2fname":"","description":"This project aims to provide a solution to find the optimal route between two locations, for example to find out whether switching to a express train will be faster. 
It then builds a machine learning model from the collected data for real-time analysis and suggestions.

Currently most people rely on apps such as Google Maps and NYC Subway to find the best routes for them. The problem with those apps is it is usually rough estimation, and no real time traffic data is taken into consideration.","m1fname":"Chenli","projectname":"NYC subway analysis","m3fname":""},{"m2lname":"Lin","m4lname":"","m3uni":"","m1uni":"kl2855","m4uni":"","pid":"201605-37","m2uni":"tl2699","timestring":"Thu May 12 13:23:30 2016","m4fname":"","language":"Scala","m3lname":"","dataset":"Yahoo finance. ","m1lname":"Li","industry":"Finance","analytics":"Monte Carlo Simulation. Multi-variant normal distribution. ","m2fname":"Tao","description":"Compute VaR using Spark, exploit the power of spark in doing parallel monte carlo simulation. ","m1fname":"Keying ","projectname":"Estimating VaR using Apache Spark","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ly2352","m4uni":"","pid":"201605-10","m2uni":"","timestring":"Thu May 12 14:31:47 2016","m4fname":"","language":"Python, R","m3lname":"","dataset":"Nasdaq.com real time data
Bloomberg historical data","m1lname":"Yang","industry":"Finance","analytics":"AWS, Kafka, MariaDB, SparkStreaming","m2fname":"","description":"Numerous stocks and high frequency of price changing, hard to monitor them all together
Use backend service continuously monitoring the live price stream
User defined indicator of price change and situation
Recommended strategies based on different stocks and its price movement, and estimate risks and gain of each strategies.
","m1fname":"Lu","projectname":"Real-time stock tick data analysis","m3fname":""},{"m2lname":"Shah","m4lname":"","m3uni":"","m1uni":"dp2781","m4uni":"","pid":"201605-12","m2uni":"ss4936","timestring":"Thu May 12 15:08:46 2016","m4fname":"","language":"python","m3lname":"","dataset":"The dataset was obtained from Julian McAuley at UCSD. It is not public","m1lname":"Parekh","industry":"Retail","analytics":"Naive Bayes
Logistic Regression
Decision Tree Classifier
Gradient Boosting","m2fname":"Sahil","description":"This project aims to predict the helpfulness of Amazon product reviews, specifically in the category of Grocery and Gourmet food. We train various classifier models using a training set of preprocessed reviews. The classifier predicts if the review is helpful or not. The performance of the classifiers is tested on a test set which has about 250,000 reviews. Judicious feature selection from lexical and metadata along with the Gradient boosting classifier can achieve an accuracy rate of about 78%.

On purchase of a product, Amazon invites the buyer to rate the product and write a review. It also asks users to vote on whether a review is helpful or not. Amazon uses the number of helpful votes when deciding the order of the top reviews. Herein lies the drawback of the review helpfulness system: older product reviews have been read by more people over time and have probably received more helpful votes than a newer review that might be equally helpful.

In this project we would like to automatically classify a review as helpful or not without needing the user to vote on it. This enables the system to present the most recent helpful review to the user.","m1fname":"Dhrumil","projectname":"Predicting Amazon Product Review Helpfulness","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201605-46","m2uni":"","timestring":"Thu May 12 15:36:45 2016","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python/pySpark, Linux Shell Scripting, SQLite.","m3lname":"","dataset":"Stock price data obtained from Netfonds Bank and Google Finance websites. ","m1lname":"Alshewski","industry":"Finance","analytics":"1) Modified version of Dynamic Time Warping algorithm with Manhattan and Euclidean distances was used for identification of patterns in stock price movements.
2) Random Forest regression was used for predicting the future price of the stock.
3) Charts visualization was implemented using Python's matplotlib package.","m2fname":"","description":"The methods used to analyze securities and make investment decisions fall into two very broad categories: fundamental analysis and technical analysis. While fundamental analysis involves analyzing the characteristics of a company in order to estimate its value, technical analysis takes a completely different approach. Technical analysis is only concerned with the price movements in the market. Technical analysis combines the laws of supply and demand with the “psychological” part of trading. The main focus of technical analysis is to identify patterns of stock movements. These patterns, in turn, will give trader a hint of what the future direction of price movement will be. One way to identify stock price patterns is to visually inspect the price chart. However, there are major drawbacks with this approach – there are thousands of stocks available and there are various chart time frequencies in which a certain pattern may form. Therefore, it is very cumbersome and time consuming process to visually identify a stock from a large pool of stocks. Moreover, especially when searching for patterns on small frequency chart, by the time the pattern is identified it may be too late to execute a trade since the expected stock movement already happened.
The purpose of this project is to programmatically identify stock patterns of interest by scanning through a large pool of stocks. Once the stocks are identified, the program runs a regression analysis to predict the future price of the selected stocks in order to confirm the direction of the price. The idea is to have two factors informing the decision whether to execute a trade on a particular stock. For example, a certain pattern may indicate that the stock price will go up, but the regression may produce the opposite result. In this case it may be better to put the stock on a watchlist instead of making a trade. When the pattern and the regression indicate price movement in the same direction, it is safer to make a trade.
","m1fname":"Kirill","projectname":"Identifying Patterns in Stock Price Movements and Predicting Future Price","m3fname":""},{"m2lname":"Purbey","m4lname":"","m3uni":"","m1uni":"ai2336","m4uni":"","pid":"201605-18","m2uni":"np2544","timestring":"Thu May 12 16:10:00 2016","m4fname":"","language":"--","m3lname":"","dataset":"--","m1lname":"Iyer","industry":"Life Science","analytics":"--","m2fname":"Niharika","description":"--","m1fname":"Aishwarya","projectname":"Categorization and Analysis of Yelp reviews","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rtm2129","m4uni":"","pid":"201605-32","m2uni":"","timestring":"Thu May 12 17:35:51 2016","m4fname":"","language":"Python, Apache Spark, Selenium, BeautifulSoup, Pandas, TextBlob, SciKit-Learn, R","m3lname":"","dataset":"Gathered data using a combination of wget, API calls, and web scraping.

- Wikipedia Page Views
- Open Movie Database (OMDb) API
- IMDB User Reviews ","m1lname":"Munoz","industry":"Media","analytics":" - Naive Bayes Classifier from TextBlob for sentiment analysis
- Binary classifiers:
1. -regularized logistic regression (“logit”),
2. Support vector machines (SVMs) using a radial basis function (RBF) kernel (“svm-rbf”),
3. SVMs using a linear kernel (“svm-linear”),
4. SVMs using a polynomial kernel (“svm-poly”), and
5. Boosted decision trees using the Adaboost algorithm (“adaboost\").","m2fname":"","description":"For my final report, I investigated the ability to predict the winner of the Best Picture Oscar Award from the nominees for ceremonies celebrating years 2008 through 2015. Thus, the goal of this project was to use the following in order to create and evaluate binary classifiers for predicting the winner of the Best Picture Award at the Oscars:

1. Wikipedia page views for a particular movie,
2. Critic ratings,
3. Audience ratings, and
4. Sentiment analysis of long-form audience reviews.

This project is an extension of a project in Haughton et al. (2015) where the authors show how to analyze sentiment from Internet Movie Database (IMDb) movie reviews and Twitter data for nine movies [1]. They suggest one future direction to be investigating the prediction ability for a larger set of movies.","m1fname":"Richard","projectname":"Oscar Winner Prediction based on User Reviews and Wikipedia Activity","m3fname":""},{"m2lname":"Suryanarayanan","m4lname":"","m3uni":"","m1uni":"kk3098","m4uni":"","pid":"201605-31","m2uni":"ss4951","timestring":"Thu May 12 17:54:01 2016","m4fname":"","language":"Python, Theano, Angular Js, Dynamo DB, HTML and CSS","m3lname":"","dataset":"The dataset used for training, testing and validation is all public and was obtained from Kaggle
link for the dataset - https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data","m1lname":"Kesarla Shantharam","industry":"Information","analytics":"A 3 layer convolutional neural network was implemented using theano (theano was used as it is capable of using the parallel processing capabilities of the GPU).

Filters of size 16, 512, and 20 were used in the first, second, and third convolutional layers, respectively.

Theano was used to take advantage of the parallel processing capabilities of GPU

Angular js functionalities was used to display the image snapped and to play the youtube songs recommended","m2fname":"Sharan","description":"Motivation
Music fuels the mind and thus fuels creativity. A creative mind has the ability to make discoveries and create innovations.
Studies show that, depending on one's state of mind, listening to a particular kind of music can have a significant impact on a person's behaviour and productivity.
People are spoilt for choice: there are tons of songs available, and they don't know which one would improve their present situation.
Selfie Based Song Recommendation is the solution to this problem.

Product Description
A selfie of a person is taken in real time, and their mood is determined by passing the captured image through a convolutional neural network.
Based on the predicted mood, appropriate song recommendations are made.

Implementation
A three-layer convolutional neural network was trained on a set of 35,000 pre-classified training images.
The training images were pre-classified into one of seven classes: Angry, Disgust, Fear, Happy, Sad, Surprise, or Neutral.
The test and validation sets each consisted of 3,500 images.

The convolutional neural network was implemented in Theano. (Theano integrates tightly with the GPU and can perform data-intensive calculations up to 140x faster than on a CPU.)
The neural network was trained on a GPU instance on AWS.
The network was trained for approximately 30 minutes, and an accuracy of close to 55% was obtained on the test dataset.

","m1fname":"Kushwanth Ram","projectname":"Selfie Based Song Recommendation","m3fname":""},{"m2lname":"Suryanarayanan","m4lname":"","m3uni":"","m1uni":"kk3098","m4uni":"","pid":"201605-31","m2uni":"ss4951","timestring":"Thu May 12 17:57:54 2016","m4fname":"","language":"Python, Theano, Angular Js, Dynamo DB, HTML and CSS","m3lname":"","dataset":"The dataset used for training, testing and validation is all public and was obtained from Kaggle
link for the dataset - https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data","m1lname":"Kesarla Shantharam","industry":"Information","analytics":"A 3 layer convolutional neural network was implemented using theano (theano was used as it is capable of using the parallel processing capabilities of the GPU).

Filters of size 16, 512, and 20 were used in the first, second, and third convolutional layers, respectively.

Theano was used to take advantage of the parallel processing capabilities of GPU

Angular js functionalities was used to display the image snapped and to play the youtube songs recommended","m2fname":"Sharan","description":"Motivation
Music fuels the mind and thus fuels creativity. A creative mind has the ability to make discoveries and create innovations.
Studies show that, depending on one's state of mind, listening to a particular kind of music can have a significant impact on a person's behaviour and productivity.
People are spoilt for choice: there are tons of songs available, and they don't know which one would improve their present situation.
Selfie Based Song Recommendation is the solution to this problem.

Product Description
A selfie of a person is taken in real time, and their mood is determined by passing the captured image through a convolutional neural network.
Based on the predicted mood, appropriate song recommendations are made.

Implementation
A three-layer convolutional neural network was trained on a set of 35,000 pre-classified training images.
The training images were pre-classified into one of seven classes: Angry, Disgust, Fear, Happy, Sad, Surprise, or Neutral.
The test and validation sets each consisted of 3,500 images.

The convolutional neural network was implemented in Theano. (Theano integrates tightly with the GPU and can perform data-intensive calculations up to 140x faster than on a CPU.)
The neural network was trained on a GPU instance on AWS.
The network was trained for approximately 30 minutes, and an accuracy of close to 55% was obtained on the test dataset.

","m1fname":"Kushwanth Ram","projectname":"Selfie Based Song Recommendation","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rds2174","m4uni":"","pid":"201605-48","m2uni":"","timestring":"Thu May 12 20:02:27 2016","m4fname":"","language":"Python, Spark, Pandas, Statsmodel","m3lname":"","dataset":"Yahoo finance ","m1lname":"Shah","industry":"Finance","analytics":"Use Modern Portfolio Theory. Monty Carlo SImulations for optmization","m2fname":"","description":"To maximize expected returns from a portfolio for a given risk appetite","m1fname":"Rushabh","projectname":"Portfolio Optimizationn","m3fname":""},{"m2lname":"King","m4lname":"","m3uni":"","m1uni":"sjb2198","m4uni":"","pid":"201605-35","m2uni":"PJK2129","timestring":"Thu May 12 20:02:44 2016","m4fname":"","language":"Python, and AWS ubuntu instance with Nvidia Tesla GPUs","m3lname":"","dataset":"We used the adult dataset found here:

http://www.cs.toronto.edu/~delve/data/adult/desc.html","m1lname":"Brown","industry":"Social Science-Government","analytics":"CuDNN, Chainer, and PyCuda were all used.","m2fname":"Patrick","description":"Classify individuals into high income earners and low income earners based on census data.","m1fname":"Samuel","projectname":"Neural Network for Income classification","m3fname":""},{"m2lname":"Qiu","m4lname":"","m3uni":"","m1uni":"jc4267","m4uni":"","pid":"201605-36","m2uni":"qj2133","timestring":"Thu May 12 20:06:53 2016","m4fname":"","language":"Python, R, Shiny ","m3lname":"","dataset":"50,000 tweets from Twitter’s streaming API related to any of the running candidates in the primaries are stored in a table. For each state, delegate percentages for each party for each candidate are recorded. The primary election results stored in a python dictionary for use in the prediction algorithm. ","m1lname":"Chang","industry":"Media","analytics":"A custom algorithm based on percentage differenced of tweet mentions and primary delegate results was created. This results were visualized using R Shiny on map showing if the state is predicted to vote R or D.","m2fname":"Jing","description":"The primary aim of this project is to predict which presidential candidate will win the election. Primary election data, combined with Twitter data and sentiment analysis are considered in this report. 
An algorithm to predict election winners was developed based on past primary election data.","m1fname":"Jonathan","projectname":"Election Prediction Model based on Twitter Mention Count and Sentiment Analysis correlated with Primary Election Results","m3fname":""},{"m2lname":"Van Loon","m4lname":"","m3uni":"","m1uni":"pm2824","m4uni":"","pid":"201605-41","m2uni":"mhv2109","timestring":"Thu May 12 20:17:51 2016","m4fname":"","language":"Python 2.7","m3lname":"","dataset":"The Dataset that we have collected is over 6GB in size over two databases(NoSQL-DynamoDb and SQLlite3) hosted on Amazon S3 bucket.
The dataset is essentially :
Number of Authors : 594,866
Number of Threads : 23,990
Number of Comments : 2,778,314
Number of Subreddits : 2005
The dataset is comprehensive and is available at the following link:
https://s3.amazonaws.com/eecse6895-1-g37/Reddit2.db

Our code can support/mine any kind of internet comment thread.
","m1lname":"Matey","industry":"Social Science-Government","analytics":"For sentiment Analysis:
Used a voted classifier, which collects a majority vote from the classifiers in its pool (in this case 7). Using this classifier, and classifying only sentences with over 80% confidence, we get the sentiment score.
For Reading level Score :
This test rates text on a U.S. school grade level. For example, a score of 8.0 means that an eighth grader can understand the document. For most documents, aim for a score of approximately 7.0 to 8.0.
For Common topics in a thread :
The natural language processing toolkit, exploring all the facets including lemmatizing, stemming, stop_words, tokenize, chunking and chinking. Chunking and chinking are coupled with Named Entity Recognition.
Effective SQL queries ","m2fname":"Marshall ","description":"Objectives : One of the important objectives for us was to analyze a big data set, to that effect we have mined for Reddit data and populated datasets with over 6Gb of information. In order to analyze this dataset we have used Spark clusters, learning and understanding SparkSQL. Instead of evaluating the content that people post we wanted to interpret the people who make those comments and analyze common traits.

Innovations : To this end, we first filter people by a certain popular 'topic', perform sentiment analysis to gauge the general mood of the conversation, and extend that with a reading-level analysis. This way we can analyze the level of conversation of the people participating in that thread. Further, in the same conversation we extract specific details, such as which organization or person is being spoken about the most (other than the topic itself).
We also have a way to map each user and all the sub-reddits they are active in. In this manner we form a correlation between the user and their affinity to certain 'topics'.

Capabilities : We can gauge the general reading level of the users, get an overall idea of the sentiment of users in a particular sub-reddit. We can extract other popular topics in the same thread. We can get an accurate representation of users and their affinity to certain 'topics', by analyzing the other sub-reddits they post in.

This research can be used to market products to users who follow certain celebrities. For example, if a person actively participates in the Kim Kardashian subreddit and is also active on the perfumes subreddit, we could potentially market a perfume by Kim Kardashian.

We can generate a general user profile. User1 : Active on threads 'Bernie Sanders', 'Donald Trump', 'Manchester United' and 'Mark Ruffalo'. User2 : Active on threads 'Bernie Sanders', 'Manchester United', and 'memes'. Furthermore, we can find the intersection between the two users and determine if posting on threads related to 'Mark Ruffalo' had an impact on them supporting 'Bernie Sanders'.
","m1fname":"Palash Sushil","projectname":"Behavioral Mapping","m3fname":""},{"m2lname":"Van Loon","m4lname":"","m3uni":"","m1uni":"pm2824","m4uni":"","pid":"201605-41","m2uni":"mhv2109","timestring":"Thu May 12 20:35:36 2016","m4fname":"","language":"Python 2.7","m3lname":"","dataset":"The Dataset that we have collected is over 6GB in size over two databases(NoSQL-DynamoDb and SQLite3) hosted on Amazon S3 bucket.
The dataset is essentially :
Number of Authors : 594,866
Number of Threads : 23,990
Number of Comments : 2,778,314
Number of Subreddits : 2005
The dataset is comprehensive and is available at the following link:
https://s3.amazonaws.com/eecse6895-1-g37/Reddit2.db
","m1lname":"Matey","industry":"Social Science-Government","analytics":"Algorithms :
For sentiment Analysis:
Used a voted classifier, which collects a majority vote from the classifiers in its pool (in this case 7). Using this classifier, and classifying only sentences with over 80% confidence, we get the sentiment score.
For Reading level Score :
This test rates text on a U.S. school grade level. For example, a score of 8.0 means that an eighth grader can understand the document. For most documents, aim for a score of approximately 7.0 to 8.0.
For Common topics in a thread :
The natural language processing toolkit, exploring all the facets including lemmatizing, stemming, stop_words, tokenize, chunking and chinking. Chunking and chinking are coupled with Named Entity Recognition.
SQL queries

System modules: All the algorithms have been made into simple modules, that can be imported into a program and run on the command line. ","m2fname":"Marshall ","description":"Objectives : One of the important objectives for us was to analyze a big data set, to that effect we have mined for Reddit data and populated datasets with over 6Gb of information. In order to analyze this dataset we have used Spark clusters, learning and understanding SparkSQL. Instead of evaluating the content that people post we wanted to interpret the people who make those comments and analyze common traits.

Innovations : To this end, we first filter people by a certain popular 'topic', perform sentiment analysis to gauge the general mood of the conversation, and extend that with a reading-level analysis. This way we can analyze the level of conversation of the people participating in that thread. Further, in the same conversation we extract specific details, such as which organization or person is being spoken about the most (other than the topic itself).
We also have a way to map each user and all the sub-reddits they are active in. In this manner we form a correlation between the user and their affinity to certain 'topics'.

Capabilities : We can gauge the general reading level of the users, get an overall idea of the sentiment of users in a particular sub-reddit. We can extract other popular topics in the same thread. We can get an accurate representation of users and their affinity to certain 'topics', by analyzing the other sub-reddits they post in.

This research can be used to market products to users who follow certain celebrities. For example, if a person actively participates in the Kim Kardashian subreddit and is also active on the perfumes subreddit, we could potentially market a perfume by Kim Kardashian.

We can generate a general user profile. User1 : Active on threads 'Bernie Sanders', 'Donald Trump', 'Manchester United' and 'Mark Ruffalo'. User2 : Active on threads 'Bernie Sanders', 'Manchester United', and 'memes'. Furthermore, we can find the intersection between the two Users and determine if posting on threads related to 'Mark Ruffalo' had a impact on them supporting 'Bernie Sanders'.","m1fname":"Palash Sushil","projectname":"Behavioural Mapping","m3fname":""},{"m2lname":"Hao","m4lname":"","m3uni":"","m1uni":"ll2985","m4uni":"","pid":"201605-39","m2uni":"bh2562","timestring":"Thu May 12 20:43:12 2016","m4fname":"","language":"Meteor.js, Node.js, Python","m3lname":"","dataset":"Self collected and labeled tweets.
Twitter streaming API, RESTful API","m1lname":"Liu","industry":"Media","analytics":"Sentiment classification using Naivebayes and LSTM (deep learning), Apache Spark, Meteor.js, Bootstrap.js","m2fname":"Boya","description":"Visualize people’s opinions towards candidates on Google Maps
Get to know the candidates’ personality through their tweets
See people’s reaction towards candidates’ speech by topics ","m1fname":"Ling","projectname":"Social Media Based Election Analysis Web App","m3fname":""},{"m2lname":"Patrick","m4lname":"","m3uni":"","m1uni":"Brown","m4uni":"","pid":"","m2uni":"King","timestring":"201605-35","m4fname":"","language":"Predict if an individual's annual income exceeds $50,000 based on census data. From the UCI repository of machine learning databases. http://www.cs.toronto.edu/~delve/data/datasets.html","m3lname":"","dataset":"In this paper we demonstrate the use of “big data” tools to train a neural network classification engine. Our classification engine takes as input 42,225 individuals collected from 1996 U.S. census records and trains the network to determine whether an individual’s income is above a given threshold. The input data contains 15 attributes for each of the 42,225 individuals – one of which is income. Through our efforts, we attempt to train the neural network to determine which combination of the remaining 14 attributes will determine whether an individual earns above the threshold. In this case, we are interested in training the neural network to determine whether an individual’s income is above $50,000 based solely upon which of the 14 attributes are presented. ","m1lname":"Samuel","industry":"Python, Cuda, six, chainer, Adam","analytics":"PyCuda","m2fname":"SJB2198","description":"A Neural Network Classification Engine","m1fname":"Thu May 12 20:53:02 2016","projectname":"","m3fname":"pjk2129"},{"m2lname":"Nanda","m4lname":"","m3uni":"","m1uni":"pkd2108","m4uni":"","pid":"201605-42","m2uni":"an2706","timestring":"Thu May 12 21:25:42 2016","m4fname":"","language":"R, Spark, Python","m3lname":"","dataset":"We used a dataset available on Mashable and Twitter tweets were collected.
","m1lname":"Dutta","industry":"Media","analytics":"SVM, KNN, Naive Bayes, RF","m2fname":"Ashish","description":"Viral news is a large business now. Individuals are spending large amounts of money to get their product endorsed by viral means.

As such, a business opportunity has opened. We provide a dash board of various analytic techniques to predict how viral a given piece of news will be and to understand the public's opinions. ","m1fname":"Preetam","projectname":"Demystifying Viral News: Exploring Information Flow Network, Content Based Classification, Feature Selection","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sl3953","m4uni":"","pid":"201605-44","m2uni":"","timestring":"Thu May 12 21:58:20 2016","m4fname":"","language":"Python","m3lname":"","dataset":"Tweets collected via Twitter API","m1lname":"Liu","industry":"Information","analytics":"Algorithm:
Text mining, a subfield of cluster analysis, is the analysis of large collections of text to find patterns between documents.
Consensus clustering: in the case of text mining, the consensus matrix is used in place of the term-document matrix when clustering again (apply for time constraints).
","m2fname":"","description":"General Purpose
Track tweets containing keywords about the 2016 election to analyze which candidate is
more popular on the web. Make a brief prediction of the 2016 election using one month
of data tracked from Twitter.com.

Why is this important
The 2016 election might be the most important election of our lifetimes because the Republican
and Democratic candidates hold different positions on climate change and immigration policy,
because it will decide which side gains more seats on the Supreme Court,
and for many other reasons that make this year's election important.
","m1fname":"Shuhang","projectname":"Twitter Based 2016 Election Predictions and Candidates` Popularity Check","m3fname":""},{"m2lname":"Phadke","m4lname":"","m3uni":"","m1uni":"pat70","m4uni":"","pid":"201605-45","m2uni":"mp3212","timestring":"Fri May 13 08:52:52 2016","m4fname":"","language":"GPU/CUDA/OpenCL1.2 (pyCUDA), Python, Alchemy API, HighCharts, Java/Scala/JavaScript/Vertx/Spark/Kafka/AngularJS ","m3lname":"","dataset":"• Twitter Streaming
• Yahoo Finance
• Google Finance
","m1lname":"Thatte","industry":"Finance","analytics":"Stats Analysis (StdDev) run on GPU/OpenCL1.2 using pyOpenCL

Alchemy Sentiment Analysis

HighCharts (Angular JS)

Twitter/Kafka streamed to RDDs:
• windowed reduction for price updates
• keyed joins for sentiment weighting","m2fname":"Manjiri","description":"Portfolio Management tools are hand-rolled by individual teams for their calculations and platforms. We previously built a single calculation, scalable platform that gave fund managers a ticking view of their holdings (we had implemented a basic version of this in the lower level course (E6893) in Fall 2015).

In this project we have iterated on the same idea and enhanced it using the tools and technologies covered in this course.
The calculators perform meaningful operations on real data as we learnt in this course. All operations are performed live and results are presented in real time –
• Alchemy scoring of Twitter streaming data.
• Actual prices from Google and Yahoo Finance.
• Price predictions run on GPU Grid using pyOpenCL.

Visualization techniques address information overload and provide meaningful reporting rather than dumping data on the user.
","m1fname":"Paresh","projectname":"Stateful algorithms/UI to process streaming stock activity and news.","m3fname":""},{"m2lname":"Iyer","m4lname":"","m3uni":"","m1uni":"np2544","m4uni":"","pid":"201605-18","m2uni":"ai2366","timestring":"Fri May 13 09:30:01 2016","m4fname":"","language":"Python, Flask","m3lname":"","dataset":"Yelp Academic Dataset. Data was in json format","m1lname":"Purbey","industry":"Life Science","analytics":"Algorithm used: Latent Dirichlet Allocation
We used python/flask for visualization.","m2fname":"Aishwarya","description":"We hope to identify what users care about the most when writing their reviews and ultimately determine what certain restaurants are doing right and wrong in order to receive these ratings.
This will help restaurant owners make decisions to increase their revenue.
It will also help other users make decisions about a restaurant as the tags would provide more context for a restaurant.
","m1fname":"Niharika","projectname":"Categorization and Analysis of Yelp restaurant reviews","m3fname":""},{"m2lname":"Kulkarni","m4lname":"","m3uni":"","m1uni":"prp2121","m4uni":"","pid":"201605-2","m2uni":"rk2845","timestring":"Tue May 17 20:27:46 2016","m4fname":"","language":"Python, Java, HTML, CSS, Javascript, Ubuntu, IOS","m3lname":"","dataset":"Songs compiled from websites. Lyrics using PyLyrics Python API","m1lname":"Parekh","industry":"Information","analytics":"LDA Algorithm, Mateiralize and Highcharts for visualization, Indico API ","m2fname":"Rohan Dattatraya","description":"Clusetering and recommendation of songs based on lyrics. Implemented using Latent Dirichlet Allocation Algorithm.","m1fname":"Parth Rajesh","projectname":"Myusic - Clustering and Recommendation of Songs based on Lyrics","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ka2513","m4uni":"","pid":"201605-46","m2uni":"","timestring":"Wed May 18 00:03:09 2016","m4fname":"","language":"Ubuntu Server 14.04 LTS on Amazon EC2, Python/pySpark, Linux Shell Scripting, SQLite.","m3lname":"","dataset":"Stock price data obtained from Google Finance website. ","m1lname":"Alshewski","industry":"Finance","analytics":"1) Modified version of Dynamic Time Warping algorithm with Euclidean distance was used for identification of patterns in stock price movements.
2) Random Forest regression was used for predicting the future price of the stock.
3) Charts visualization was implemented using Python's matplotlib package.","m2fname":"","description":"The methods used to analyze securities and make investment decisions fall into two very broad categories: fundamental analysis and technical analysis. While fundamental analysis involves analyzing the characteristics of a company in order to estimate its value, technical analysis takes a completely different approach. Technical analysis is only concerned with the price movements in the market. Technical analysis combines the laws of supply and demand with the “psychological” part of trading. The main focus of technical analysis is to identify patterns of stock movements. These patterns, in turn, will give trader a hint of what the future direction of price movement will be. One way to identify stock price patterns is to visually inspect the price chart. However, there are major drawbacks with this approach – there are thousands of stocks available and there are various chart time frequencies in which a certain pattern may form. Therefore, it is very cumbersome and time consuming process to visually identify a stock from a large pool of stocks. Moreover, especially when searching for patterns on small frequency chart, by the time the pattern is identified it may be too late to execute a trade since the expected stock movement already happened.
The purpose of this project is to programmatically identify stock patterns of interest by scanning through the large pool of stocks. Once the stock(s) are identified the program is going to run regression analysis to predict future price of selected stock(s) in order to confirm the direction of the price. The idea is to have two factors affecting the decision whether to execute a trade on a particular stock. For example, a certain pattern may indicate that the stock price will go up, but regression may produce the opposite result. In this case it may be better to put this stock on the watchlist instead of making a trade. In case, when the pattern and regression indicate the price movement in the same direction it is safer to make a trade. ","m1fname":"Kirill","projectname":"Identifying Patterns in Stock Price Movements and Predicting Future Price","m3fname":""},{"m2lname":"Lin","m4lname":"","m3uni":"","m1uni":"sg3303","m4uni":"","pid":"201612-36","m2uni":"ll2948","timestring":"Mon Dec 12 20:39:11 2016","m4fname":"","language":"Mahout, Hadoop, Java ,Angular JS, ElasticSearch","m3lname":"","dataset":"We collected 10000+ free text travel blogs from websites travelblog.org. The data was collected by scraping through content using Python beautiful soup and Java

","m1lname":"Goyal","industry":"Media","analytics":"Clustering , Topic modelling","m2fname":"Ling ","description":"People usually spend hours researching about places through articles and blogs all over internet. In this project , we are providing one platform where people can find useful information from travel blogs .","m1fname":"Sugandha","projectname":"Holiday Planner","m3fname":""},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Tue Dec 13 14:49:21 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, sklearn, numpy","m3lname":"","dataset":"We would be working with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. In this dataset, Intracranial EEG was recorded from dogs (~40GB) with naturally occurring epilepsy using an ambulatory monitoring system.In addition, datasets from patients (~60GB) with epilepsy undergoing intracranial EEG monitoring to identify a region of brain that can be resected to prevent future seizures are included.","m1lname":"Guerra Marin","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes into consideration the temporal information is necessary. EEG data has been analyzed in many projects and we are currently considering either applying Hidden Markov Models.","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entailed high-volume, high-velocity and high-variety information dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. 
More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

We therefore decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it occurs as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Our project entails understanding all the different pattern of the 4 states of brain activity and identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. ","m1fname":"Jorge","projectname":"Seizure forecasting analysis of EEG data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201612-95","m2uni":"","timestring":"Tue Dec 13 22:00:44 2016","m4fname":"","language":"IBM System G Graph Tools, Amazon AWS EC2, gShell, R, Python, Bash, Ubuntu 14.04, Mac OSX El Capitan, Windows 7","m3lname":"","dataset":"Yelp Academic Dataset Challenge Round 8:
- 2.7 Million Reviews
- 648k Tips
- 86k Businesses
- 10 Cities
- 687k Users
- 4.2 M Social Edges
","m1lname":"Hasbamrer","industry":"Information","analytics":"Created:
- Graph visualization of yelp businesses and users
- Edges show connection between users and businesses
- Showed relative popularity of businesses with node size based on PageRank
","m2fname":"","description":"Previous work on Yelp Academic Dataset:
- Review text sentiment analysis.
- Restaurant recommendations with ML.
- Circle graph and heat map visualization.
- However, no visualization of business and user connection.

Current work: Graph visualization
- Show how businesses and users are connected.
- Help businesses identify consumer influence.
- Identify business competition.
- Help consumers quickly identify business popularity.
- Easier to look at a graph than read a data table.
","m1fname":"Nond","projectname":"Graph Visualization of the Yelp Academic Dataset (CVN presentation, video link in description)","m3fname":""},{"m2lname":"Shi","m4lname":"","m3uni":"","m1uni":"rs3569","m4uni":"","pid":"201612-96","m2uni":"ys2901","timestring":"Wed Dec 14 00:00:38 2016","m4fname":"","language":"Python, theano, lasagne, tensorflow, spark","m3lname":"","dataset":"Thanks Erik Bernhardsson for collecting this amazing dataset!
You can download data by typing:
wget https://s3.amazonaws.com/erikbern/fonts.hdf5
in your terminal,
warning: the file is about 14GB","m1lname":"Shi","industry":"Media","analytics":"We trained a generative network recognizing 56k different fonts styles by learning a lower dimensional embedding space
Visualized the font embeddings in 3 dimensions (using t-SNE) with the Google Embedding Projector
Wrote a script to read the HDF5 file into Spark

System specification:
We used service from Amazon AWS
Instance: P2.xlarge
Ubuntu 14.04 system
High Frequency Intel Xeon E5-2686v4 (Broadwell) Processors and 61 GiB memory
High-performance NVIDIA Tesla K80 GPUs
2,496 parallel processing cores
12 GiB of GPU memory
","m2fname":"Yuchen","description":"Nowadays, artificial intelligence technology (especially deep learning) can achieve human-level performance, or even beat humans, in many complicated tasks such as image recognition and playing Go.

Meanwhile, it is still very challenging to teach a machine to create art, compose music, or write an article.

In this project, we challenge ourselves to teach a machine to write in its own style.

We propose to create a “font vector”, a vector in latent space that “defines” a certain font. That way, we embed all fonts in a space where similar fonts have similar vectors.

By leveraging the font dataset available online, we plan to train a generative neural network that can map from our embedding space to different font styles.

Going beyond this, we can ask the computer to write in a random style or to mix styles from different fonts to generate its own.

Potential marketplace:
Help senior (disabled) people write in their style
Calligraphy education

","m1fname":"Ruixiong","projectname":"Analyzing 56k fonts using deep neural networks","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"cs3488","m1uni":"sc4097","m4uni":"","pid":"201612-17","m2uni":"ll3057","timestring":"Wed Dec 14 12:34:59 2016","m4fname":"","language":"Spark, Python, Javascript, MySQL","m3lname":"Shan","dataset":"lastfm 360k, public dataset","m1lname":"Chi","industry":"Media","analytics":"ALS recommendation algorithm","m2fname":"Li","description":"Recommend music for users using ALS algorithm","m1fname":"Shijian","projectname":"Smart Music Recommendation","m3fname":"Chuanjun"},{"m2lname":"Alvarado","m4lname":"","m3uni":"jj2807","m1uni":"mz2517","m4uni":"","pid":"201612-14","m2uni":"jaa2220","timestring":"Wed Dec 14 15:56:01 2016","m4fname":"","language":"PHP, HTML, Java, Mahout, IBM System G, Python, Plotly","m3lname":"Jacobson","dataset":"We used the UMLS (Unified Medical Language System) dataset from the US National Library of Medicine.","m1lname":"Zaryab","industry":"Life Science","analytics":"To make this possible, we relied on Mahout classification (naive bayes - written in Java) to determine what the user is looking for. For example, the word \"sinus\" is classified as a \"sign or symptom\" where as \"sinus advil\" is classified as a clinical drug. We wish to show the user relevant information based on the class. This is crucial when the terms are not found in the original dataset and we need to find the best possible match. A word such as \"Cold\" can return 3000+ matches when ideally we want to depict only what is important.

Search queries are performed using MySQL. The information retrieved is sent to IBM System G (which contains the entire graph of all classes and concepts in the dataset). We slim it down to an egonet of connections to the searched query. This information is visualized with Plotly.

All integration is done through PHP and HTML.","m2fname":"Jose","description":"The core objective was to take a dense dataset of medical information and provide it in a meaningful way. In this case, we would like to make it accessible to people through an online search engine (currently hosted using Amazon Web Services).

The user enters their query, for example \"asthma.\" In return, they see a classification of the term, a definition, and relationships to other concepts - both text and visual representations. To slim down exactly what the user wants, they are welcome to choose from a drop down menu.

The idea was proposed for its usefulness to the New York Presbyterian Hospital.

","m1fname":"Mohammad","projectname":"Biomedical Search Engine","m3fname":"Josh"},{"m2lname":"Smajlaj","m4lname":"","m3uni":"nn2250","m1uni":"bjz2107","m4uni":"","pid":"201612-25","m2uni":"ss3912","timestring":"Wed Dec 14 18:23:30 2016","m4fname":"","language":"Python, Pandas, Hadoop, Pyspark, Scikitlearn, Java, Flask","m3lname":"Zhao","dataset":"Twitter Dumps for a few months. Public data set, online
https://archive.org/details/twitterstream

Quandl stock prices, self-parsed for information","m1lname":"Zhu","industry":"Finance","analytics":"Logistic Regression, Classification, visualization in Flask of stock market price movement, predictive analysis of the behavior of EoD price deltas based on a \"bag of words\" ML model","m2fname":"Sabina","description":"The EoD price for a company condenses millions of microscopic changes in the entire market into a single number
> 500 million tweets a day; each a microcosm of public sentiment for a company
Aggregating social media messages should reliably predict trend for company market behavior
Leverage interests in finance
Twitter is one of the most diverse, public data sets
Experiment with Hadoop, Spark, utilize our skills in machine learning and data analysis
Predict EoD Price behavior real-time
","m1fname":"Ben ","projectname":"Tweet Based EoD Stock Market Price Predictor","m3fname":"Nan"},{"m2lname":"Chen","m4lname":"","m3uni":"zw2364","m1uni":"jz2748","m4uni":"","pid":"53","m2uni":"jc4648","timestring":"Wed Dec 14 18:24:28 2016","m4fname":"","language":"Python, Spark","m3lname":"Wang","dataset":"We used Python script to scrape the data from \"http://www.nba-reference.com\".","m1lname":"Zhong","industry":"Media","analytics":"Algorithm: Multilayer Perceptron Classifier
","m2fname":"Junbo","description":"Objectives: Based on the relation between NBA players' performance in NCAA and NBA, use rookies' performance in NCAA to predict their performance in future NBA career.

The selection of rookies in NBA has great impacts on the whole team, so it is important for team managers to evaluate rookies' future performance so that they can make correct choices.

Expected Outcome: Rookies' future performance in their NBA career.","m1fname":"Jing","projectname":"NBA Rookies' Performance Prediction","m3fname":"Zebin"},{"m2lname":"Chen","m4lname":"","m3uni":"zw2364","m1uni":"jz2748","m4uni":"","pid":"53","m2uni":"jc4648","timestring":"Wed Dec 14 18:24:34 2016","m4fname":"","language":"Python, Spark","m3lname":"Wang","dataset":"We used Python script to scrape the data from \"http://www.nba-reference.com\".","m1lname":"Zhong","industry":"Media","analytics":"Algorithm: Multilayer Perceptron Classifier
","m2fname":"Junbo","description":"Objectives: Based on the relation between NBA players' performance in NCAA and NBA, use rookies' performance in NCAA to predict their performance in future NBA career.

The selection of rookies in NBA has great impacts on the whole team, so it is important for team managers to evaluate rookies' future performance so that they can make correct choices.

Expected Outcome: Rookies' future performance in their NBA career.","m1fname":"Jing","projectname":"NBA Rookies' Performance Prediction","m3fname":"Zebin"},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Wed Dec 14 18:24:48 2016","m4fname":"","language":"Spark, Python","m3lname":"","dataset":"http://jmcauley.ucsd.edu/data/amazon/
I contacted the author asking for the full data.
Any book dataset that includes user (customer) IDs, book IDs, book ratings and book titles.","m1lname":"Fu","industry":"Information","analytics":"ALS algorithm (alternating least squares)","m2fname":"Yanglu","description":"Tons of books are published every year worldwide. However, in many circumstances, previews of books are not as self-explanatory as movie trailers, food ingredients, or the appearance of outfits.
Some sites base their recommendations on “also bought with” information. A better way to make recommendations is to use the ratings given by customers.
","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"Zhou","m4lname":"","m3uni":"ty2345","m1uni":"kc3031","m4uni":"","pid":"201612-48","m2uni":"rz2361","timestring":"Wed Dec 14 18:56:09 2016","m4fname":"","language":"Spark, Python, R, javascript, jQuery, Bootstrap","m3lname":"Yang","dataset":"In order to achieve our ultimate goal of disease prediction, we leveraged the hospital inpatient discharge 2013 dataset which can be downloaded from the New York state ny.gov website. The dataset contains about 1.8 million records of hospital stays in New York state and non-clinical data elements such as patient demographics, total charges and length of stay for each visit. We also utilized hospital inpatient discharge 2010 and 2014 as a compared dataset.","m1lname":"Chen","industry":"Life Science","analytics":"We used K-means clustering in PySpark to train the clusters, and construct random forest, gradient boosting, and decision tree regression models for newborn’s weight prediction. In addition, we used R and Javascript for disease and newborn data analysis and visualization.","m2fname":"Rong","description":"Our study seeks to identify the likelihood of having a disease and thus improve the public health through disease risk management.
By incorporating machine learning techniques, we are able to predict most possible diseases for a given individual. The prediction can benefit many diverse users, including individual patients, health care organizations, insurance companies and policymakers.
Ultimately, we hope disease prediction can be widely used to address issues of public health and safety as well as create a better and healthier community.","m1fname":"Kaili","projectname":"Disease Risk Prediction in New York State","m3fname":"Tianyu"},{"m2lname":"Wang","m4lname":"","m3uni":"xc2358","m1uni":"zw2389","m4uni":"","pid":"201612-19","m2uni":"zw2376","timestring":"Wed Dec 14 18:57:03 2016","m4fname":"","language":"Python, R , Spark","m3lname":"Chen","dataset":"Yelp Data: https://www.yelp.com/dataset_challenge.
Description: original data is about 8.32G. We extracted about 2.4G raw data
containing business, user and review information for our project.","m1lname":"Wang","industry":"Information","analytics":"modified word count algorithm, random forest, linear regression","m2fname":"Zhenyu","description":"Analyze Yelp review data. Yelp provides a good connection between customers and business owners by
allowing users to rate and write comments for the business. Review data is a powerful source for business owners to identify their business model and make improvements based on the analysis of the data.
","m1fname":"Zhirui","projectname":"Exploring the Power of Yelp Review for Business Popularity","m3fname":"Xikai"},{"m2lname":"Cheng","m4lname":"","m3uni":"qh2174","m1uni":"ds3516","m4uni":"","pid":"201605-82","m2uni":"pc2756","timestring":"Wed Dec 14 19:03:03 2016","m4fname":"","language":"Python, R, Spark, System G","m3lname":"Hu","dataset":"Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j.

The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes).","m1lname":"Sui","industry":"Retail","analytics":"We used User-based Collaborative Filtering Recommendation, based on Alternating Least Square algorithm in pyspark. Also we implemented the Item to Item neighbor recommendation in System G (gShell) by Colfilter Query.","m2fname":"Panpan","description":"Combination of User-based Collaborative Filtering and Item to Item Recommendation Process
Pursuing a personalized and abundant recommendation set to open the market
Develop the recommendation chain for individuals","m1fname":"Danning","projectname":"E-commercial Product Recommendations","m3fname":"Qiong"},{"m2lname":"Li","m4lname":"","m3uni":"qc2217","m1uni":"km3194","m4uni":"","pid":"201612-15","m2uni":"xl2601","timestring":"Wed Dec 14 19:32:01 2016","m4fname":"","language":"Python, Spark, Gephi","m3lname":"Chen","dataset":"Amazon product co-purchasing network metadata from summer 2006
The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs, and VHS video tapes).","m1lname":"Ma","industry":"Information","analytics":"Classification: Naive Bayes & Decision Tree
Clustering","m2fname":"Xinwei","description":"Predicting co-purchased products based on customers' previous order history can help online retailers recommend suitable products to customers. Successful prediction of co-purchased products can help online retailers increase revenue.

Title, rating, and category similarities between items can be critical factors that influence customers when they have co-purchasing options.
Naïve Bayes and Decision Tree models can be employed to perform appropriate classification and prediction.

We use datasets of different sizes to find out whether the accuracy of our model's predictions depends on the size of the data.

We exclude different factors (e.g. title-similarity) respectively from our analysis to determine the most important one among possible factors.
","m1fname":"Ke","projectname":"Amazon Co-Purchasing Network Analysis and Prediction","m3fname":"Qi"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201612-95","m2uni":"","timestring":"Wed Dec 14 19:36:49 2016","m4fname":"","language":"IBM System G Graph Tools, Amazon AWS EC2, gShell, R, Python, Bash, Ubuntu 14.04, Mac OSX El Capitan, Windows 7 ","m3lname":"","dataset":"Yelp Academic Dataset Challenge Round 8:
- 2.7 Million Reviews
- 648k Tips
- 86k Businesses
- 10 Cities
- 687k Users
- 4.2 M Social Edges","m1lname":"Hasbamrer","industry":"Information","analytics":"Created:
- Graph visualization of yelp businesses and users
- Edges show connection between users and businesses
- Showed relative popularity of businesses with node size based on PageRank","m2fname":"","description":"Previous work on Yelp Academic Dataset:
- Review text sentiment analysis.
- Restaurant recommendations with ML.
- Circle graph and heat map visualization.
- However, no visualization of business and user connection.

Current work: Graph visualization
- Show how businesses and users are connected.
- Help businesses identify consumer influence.
- Identify business competition.
- Help consumers quickly identify business popularity.
- Easier to look at a graph than read a data table.

Video link:
https://youtu.be/dez0VBqrNWc","m1fname":"Nond","projectname":"Graph Visualization of the Yelp Academic Dataset (CVN presentation, 5min video link https://youtu.be/dez0VBqrNWc)","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201612-95","m2uni":"","timestring":"Wed Dec 14 19:43:45 2016","m4fname":"","language":"IBM System G Graph Tools, Amazon AWS EC2, gShell, R, Python, Bash, Ubuntu 14.04, Mac OSX El Capitan, Windows 7","m3lname":"","dataset":"Yelp Academic Dataset Challenge Round 8:
- 2.7 Million Reviews
- 648k Tips
- 86k Businesses
- 10 Cities
- 687k Users
- 4.2 M Social Edges ","m1lname":"Hasbamrer","industry":"Information","analytics":"Created:
- Graph visualization of yelp businesses and users
- Edges show connection between users and businesses
- Showed relative popularity of businesses with node size based on PageRank","m2fname":"","description":"Video link:
https://youtu.be/dez0VBqrNWc

Previous work on Yelp Academic Dataset:
- Review text sentiment analysis.
- Restaurant recommendations with ML.
- Circle graph and heat map visualization.
- However, no visualization of business and user connection.

Current work: Graph visualization
- Show how businesses and users are connected.
- Help businesses identify consumer influence.
- Identify business competition.
- Help consumers quickly identify business popularity.
- Easier to look at a graph than read a data table. ","m1fname":"Nond","projectname":"https://youtu.be/dez0VBqrNWc - Graph Visualization of the Yelp Academic Dataset (CVN, 5min video)","m3fname":""},{"m2lname":"Ji","m4lname":"","m3uni":"ll3078","m1uni":"wl2573","m4uni":"","pid":"201612-27","m2uni":"hj2436","timestring":"Wed Dec 14 19:54:13 2016","m4fname":"","language":"Python; Java; Javascript","m3lname":"Long","dataset":"We included two separate datasets in our project. One is crime/incident report data from selected major cities, and the other contains tweets collected using the Twitter APIs. The former is open data that is usually collected and maintained by police departments and/or government, and can be downloaded from each city’s open data website under the “public safety” category. As for tweets, we combined the Twitter APIs with tweepy, a Python library, to fetch streaming Twitter posts with customized options. The library is designed specifically to read and write Twitter data.","m1lname":"Li","industry":"Media","analytics":"We use unigrams for feature extraction. First, we apply the extraction algorithm to our training data set, which consists of sentences and their corresponding labels (positive/negative). We then train decision trees in Spark on this training data. Next, we use the feature generation function to transform any new incoming tweets into feature vectors and use the decision trees to predict the result. Finally, we transform our prediction results into graphs.","m2fname":"Heng","description":"Tweets have become a very efficient way to reflect people’s moods. We will extract features from tweets and classify each tweet as positive or negative.
Use the collected data to reflect the order of a specific area. Find the correlation between tweets and local safety. Accordingly, predict the local safety factor and recommend police force.
","m1fname":"Wanheng","projectname":"Tweets Analysis and Area Safety Prediction","m3fname":"Long"},{"m2lname":"Dong","m4lname":"","m3uni":"zj2213","m1uni":"yw2812","m4uni":"","pid":"201612-4","m2uni":"cd2847","timestring":"Wed Dec 14 20:02:51 2016","m4fname":"","language":"R, Python, Spark","m3lname":"Jia","dataset":"The dataset is from New York Times Article Search API through nytimesarticle package in Python. We have 798 csv files in total and the size is 2.92GB.","m1lname":"Wang","industry":"Finance","analytics":"1. Unigram TF-IDF is used to generate features.
2. Try some machine learning algorithms to build a predictive model.
3. Employ sparsity algorithms such as LASSO and Elastic Net Regularization for the exploratory model.","m2fname":"Chunfeng","description":"Nowadays, there is an interesting phenomenon emerging on the stock market: as more economic scholars start paying attention to the use of social media data, many hedge fund companies have abandoned their traditional forecasting techniques and adopted new ways to make more profit by predicting stock markets through data from social media platforms. Triggered by this idea, we decided to examine how well we could forecast the S&P 500 based specifically on NY Times data. Multiple related research papers have illustrated how to forecast stock prices through tweets as reflections of people’s mood, but seldom does a paper collect text data from mainstream media, and this is the reason that we decided to explore text data from the NY Times.

Rather than point forecasting, we make directional forecasts of the S&P 500 price change: we mark “+1” if the predicted price goes up and “-1” if it goes down. By comparing these with the actual price increases and decreases in the S&P 500, we can measure the prediction accuracy and conclude whether the forecasting technique is precise enough.","m1fname":"Yusen","projectname":"Forecasting S&P 500 Index Using NY Times Data","m3fname":"Ze"},{"m2lname":"Chen","m4lname":"","m3uni":"zw2364","m1uni":"jz2748","m4uni":"","pid":"53","m2uni":"jc4648","timestring":"Wed Dec 14 20:16:14 2016","m4fname":"","language":"Python, Spark","m3lname":"Wang","dataset":"We used NBA players' performance data in both their NCAA career and NBA career as training data, and use the model to predict current NCAA player's future performance in NBA using their performance data in NCAA career.
We wrote some Python code to scrape the data from \"http://www.nba-reference.com\" . The data should have \"libsvm\" format.","m1lname":"Zhong","industry":"Media","analytics":"Algorithm: Multilayer perceptron classifier.
Analytics: Classification","m2fname":"Junbo","description":"The selection of rookies in NBA has great impacts on the whole team, so it is important for team managers to evaluate rookies' future performance so that they can make correct choices.

Objectives: Based on the relation between NBA players' performance in NCAA and NBA, use rookies' performance in NCAA to predict their performance in future NBA career.

Expected Outcome: Rookies' future performance in their NBA career.","m1fname":"Jing","projectname":"NBA Rookies' Performance Prediction","m3fname":"Zebin"},{"m2lname":"Lin","m4lname":"","m3uni":"","m1uni":"sg3303","m4uni":"","pid":"201612-36","m2uni":"ll2948","timestring":"Wed Dec 14 20:19:51 2016","m4fname":"","language":"Mahout, Hadoop, Java ,Angular JS, ElasticSearch","m3lname":"","dataset":"We collected 10000+ free text travel blogs from websites travelblog.org. The data was collected by scraping through content using Python beautiful soup and Java","m1lname":"Goyal","industry":"Media","analytics":" Clustering , Topic modelling","m2fname":"Ling ","description":"People usually spend hours researching about places through articles and blogs all over internet. In this project , we are providing one platform where people can find useful information from travel blogs .","m1fname":"Sugandha","projectname":"Holiday Planner","m3fname":""},{"m2lname":"Chen","m4lname":"","m3uni":"zw2364","m1uni":"jz2748","m4uni":"","pid":"201612-53","m2uni":"jc4648","timestring":"Wed Dec 14 20:37:53 2016","m4fname":"","language":"Python, Spark, MySQL","m3lname":"Wang","dataset":"We used NBA players' performance data in both their NCAA career and NBA career as training data, and use the model to predict current NCAA player's future performance in NBA using their performance data in NCAA career.
We wrote some Python code to scrape the data from \"http://www.nba-reference.com\" . The data should have \"libsvm\" format.","m1lname":"Zhong","industry":"Media","analytics":"Algorithm: Multilayer perceptron classifier.
Analytics: Classification","m2fname":"Junbo","description":"The selection of rookies in NBA has great impacts on the whole team, so it is important for team managers to evaluate rookies' future performance so that they can make correct choices.

Objectives: Based on the relation between NBA players' performance in NCAA and NBA, use rookies' performance in NCAA to predict their performance in future NBA career.

Expected Outcome: Rookies' future performance in their NBA career.","m1fname":"Jing","projectname":"NBA Rookies' Performance Prediction","m3fname":"Zebin"},{"m2lname":"Qiu","m4lname":"","m3uni":"","m1uni":"yt2558","m4uni":"","pid":"201612-11","m2uni":"cq2192","timestring":"Wed Dec 14 20:40:19 2016","m4fname":"","language":"Python, R, Scala, Spark, Hadoop, PostgreSQL, PostGIS, Google Cloud Platform","m3lname":"","dataset":"NYC TLC Taxi Trip Dataset","m1lname":"Tanaka","industry":"Transportation","analytics":"Random Forest
Gaussian Mixture Model","m2fname":"Congying","description":"Objectives:
1. Tip prediction for potential business application
2. GPS noise modeling for highly-accurate GPS data

Innovations:
1. Novel objective variable (whether hourly tip is more than $12 or not)
2. Proposed a way to estimate true avenue membership for each data point","m1fname":"Yasutaka","projectname":"Tip Prediction and GPS Noise Modeling on NYC Taxi Dataset","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"yw2920","m1uni":"jw3447","m4uni":"","pid":"201605-40","m2uni":"dw2726","timestring":"Wed Dec 14 20:44:00 2016","m4fname":"","language":" Spark; Shiny; R; Python; IBM System G","m3lname":"Wu","dataset":"Dataset: The data are obtained from the Kaggle website:
https://www.kaggle.com/devinanzelmo/dota-2-matches

This dataset contains more than 900k Dota 2 matches and relevant player information.

And it also has a test data set to evaluate our prediction accuracy.
","m1lname":"Wei","industry":"Information","analytics":"Algorithms:
Basic statistical methods
Classification: Logistic Regression
Visualization & Graph Analysis","m2fname":"Dejian","description":"Dota 2 is a hot Multiplayer Online Battle Arena (MOBA) and can take up thousands of hours of our life. It has a huge number of followers around the world.

In this project, we have achieved the following:

1.Exploratory Data Analysis(Create Shiny App for users)
1.1:Player Performance Analysis:
Radar Chart
Win Rate
1.2:Recommendation for Hero Selection

2.Match Outcome Prediction:
Make prediction by Logistic Regression, using player and hero information as features.

3.Graph Analysis Between Heroes:
3.1:Analyze the relationship between heroes: synergy & suppression
3.2:Identify the most significant heroes and the combination of heroes
","m1fname":"Jihan","projectname":"Dota 2: Exploratory Data Analysis & Prediction","m3fname":"Yinxiang"},{"m2lname":"Wang","m4lname":"","m3uni":"yw2920","m1uni":"jw3447","m4uni":"","pid":"201612-40","m2uni":"dw2726","timestring":"Wed Dec 14 20:45:06 2016","m4fname":"","language":" Spark; Shiny; R; Python; IBM System G","m3lname":"Wu","dataset":"Dataset: The data are obtained from the Kaggle website:
https://www.kaggle.com/devinanzelmo/dota-2-matches

This dataset contains more than 900k Dota 2 matches and relevant player information.

And it also has a test data set to evaluate our prediction accuracy.
","m1lname":"Wei","industry":"Information","analytics":"Algorithms:
Basic statistical methods
Classification: Logistic Regression
Visualization & Graph Analysis","m2fname":"Dejian","description":"Dota 2 is a hot Multiplayer Online Battle Arena (MOBA) and can take up thousands of hours of our life. It has a huge number of followers around the world.

In this project, we have achieved the following:

1.Exploratory Data Analysis(Create Shiny App for users)
1.1:Player Performance Analysis:
Radar Chart
Win Rate
1.2:Recommendation for Hero Selection

2.Match Outcome Prediction:
Make prediction by Logistic Regression, using player and hero information as features.

3.Graph Analysis Between Heroes:
3.1:Analyze the relationship between heroes: synergy & suppression
3.2:Identify the most significant heroes and the combination of heroes
","m1fname":"Jihan","projectname":"Dota 2: Exploratory Data Analysis & Prediction","m3fname":"Yinxiang"},{"m2lname":"Han","m4lname":"","m3uni":"","m1uni":"hy2457","m4uni":"","pid":"201612-105","m2uni":"xh2257","timestring":"Wed Dec 14 20:46:40 2016","m4fname":"","language":"Java/Python/Hadoop/Spark","m3lname":"","dataset":"http://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php
http://homepages.lboro.ac.uk/~cogs/datasets/ucid/ucid.html","m1lname":"Yan","industry":"Media","analytics":"KMeans","m2fname":"Xi","description":"Use Spark and Hadoop to set up a large-scale online image searching engine.
Expected outcome: when you upload an image, the system will return similar images in the database back to you. And we hope the latency is below 1 second.
Importance: If you want to buy something online, you might not know the merchandise name, then you can search by image.","m1fname":"Hang","projectname":"Large scale real-time similar image search","m3fname":""},{"m2lname":"Suzuki","m4lname":"","m3uni":"","m1uni":"tt2573","m4uni":"","pid":"201612-2","m2uni":"hs2865","timestring":"Wed Dec 14 20:47:03 2016","m4fname":"","language":"Python, HTML, jQuery,Bootstrap, MySQL, IBM System G","m3lname":"","dataset":"We used the Yelp Data set from the Yelp Data Challenge. The total size of the data was 3GB.
","m1lname":"Tan","industry":"Media","analytics":"We used IBM System G for preliminary visualizations, k-nearest neighbors for returning the most popular restaurants, topic modeling and sentiment analysis for keyword search, and time series methods for rate prediction.

We used the following packages: scikit-learn, nltk, gensim, statsmodels.tsa","m2fname":"Hiroaki","description":"Recently, recommendation sites have become more popular with users, mostly because recommenders simplify a user’s search process by suggesting locations, people and organizations, such as restaurants, hotels, jobs, even dating partners. But current systems focus their recommendations on the features available at the time of search, and not at the time of the expected actual usage or engagement, such as future events for birthday celebrations, and hotel rooms for future travel. It is well documented that proximity and ratings are key determinants for users in making their selection, and that ratings are frequently updated by new user reviews.

Yet, how can the user be sure that the recommendations given on the day of the search will be of the same predictive quality at the time the user actually experiences the recommended site or business?

We propose a recommendation system that includes the recommended businesses as well as their predicted future ratings. Including predicted ratings gives users an extra layer of information about the businesses with which to make an informed decision about their selection. Users may search for a restaurant and find out that the ratings are predicted to be low on the particular future day they plan on going, thus allowing them to decide to go to another restaurant where predicted ratings are expected to be higher on that day. Our recommendation design is divided into three parts: Showing popular restaurants, Keyword search, and User rate prediction. ","m1fname":"Tian","projectname":"Recommendations With Predicted Future Ratings","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"yw2920","m1uni":"jw3447","m4uni":"","pid":"201612-40","m2uni":"dw2726","timestring":"Wed Dec 14 20:48:04 2016","m4fname":"","language":" Spark; Shiny; R; Python; IBM System G","m3lname":"Wu","dataset":"Dataset: The data are obtained from the Kaggle website:
https://www.kaggle.com/devinanzelmo/dota-2-matches

This dataset contains more than 900k Dota 2 matches and relevant player information.

It also has a test data set to evaluate our prediction accuracy.
","m1lname":"Wei","industry":"Information","analytics":"Algorithms:
Basic statistical methods
Classification: Logistic Regression
Visualization & Graph Analysis","m2fname":"Dejian","description":"Dota 2 is a hot Multiplayer Online Battle Arena (MOBA) and can take up thousands of hours of our life. It has a huge number of followers around the world.

In this project, we have achieved the following:

1. Exploratory Data Analysis (Create Shiny App for users)
1.1: Player Performance Analysis:
Radar Chart
Win Rate
1.2: Recommendation for Hero Selection

2. Match Outcome Prediction:
Make predictions by Logistic Regression, using player and hero information as features.

3. Graph Analysis Between Heroes:
3.1: Analyze the relationship between heroes: synergy & suppression
3.2: Identify the most significant heroes and the combination of heroes
","m1fname":"Jihan","projectname":"Dota 2: Exploratory Data Analysis & Prediction","m3fname":"Yinxiang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ra2659","m4uni":"","pid":"201612-44","m2uni":"","timestring":"Wed Dec 14 20:48:10 2016","m4fname":"","language":"Python and Spark and Apache Website","m3lname":"","dataset":"Two datasets were used containing hundreds of files of annotated songs. Each file (song) has three columns with the begin time frame, end time frame and human annotated chord for that duration.
Each file also needed the matching audio file, which was taken from YouTube. The YouTube music videos were converted to MP3 format using: http://convert2mp3.net/en/
1. Isophonics Dataset from the Centre for Digital Music http://www.isophonics.net/datasets
2. Billboard Hot 100 (years) Dataset from the McGill Billboard Project http://ddmal.music.mcgill.ca/research/billboard","m1lname":"Abramson","industry":"Media","analytics":"Python library LIBROSA used to create chromagrams (in graphs and in number form) from the pitches extracted from the audio files. The numbers from the chromagrams became the feature vectors for the annotated chords (labels).
Spark over Hadoop for Python (library pyspark.mllib) used to create a model to classify audio time windows into one of 35 basic chords
Tried many classification algorithms from pyspark but chose the Random Forest Classifier for implementation
Created Python-powered website which can classify (annotate) new songs using the saved model","m2fname":"","description":"Having an ear for music is a genetic gift. For the rest of us, playing songs we like on our instruments can be frustrating. The chords we use are often incorrect.
Through machine learning, a program or an app can learn to recognize chords using training examples manually annotated with chords by people with an ear for music. Then, given a new song, the program can give the player the correct chords to play!
This is also very useful for tuning instruments and for Music Information Retrieval (MIR), the field concerned with retrieving information from music.","m1fname":"Reva","projectname":"Machine Learning for Chord Recognition","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"yw2920","m1uni":"jw3447","m4uni":"","pid":"201612-40","m2uni":"dw2726","timestring":"Wed Dec 14 20:48:24 2016","m4fname":"","language":" Spark; Shiny; R; Python; IBM System G","m3lname":"Wu","dataset":"Dataset: The data are obtained from the Kaggle website:
https://www.kaggle.com/devinanzelmo/dota-2-matches

This dataset contains more than 900k Dota 2 matches and relevant player information.

It also has a test data set to evaluate our prediction accuracy.
","m1lname":"Wei","industry":"Information","analytics":"Algorithms:
Basic statistical methods
Classification: Logistic Regression
Visualization & Graph Analysis","m2fname":"Dejian","description":"Dota 2 is a hot Multiplayer Online Battle Arena (MOBA) and can take up thousands of hours of our life. It has a huge number of followers around the world.

In this project, we have achieved the following:

1. Exploratory Data Analysis (Create Shiny App for users)
1.1: Player Performance Analysis:
Radar Chart
Win Rate
1.2: Recommendation for Hero Selection

2. Match Outcome Prediction:
Make predictions by Logistic Regression, using player and hero information as features.

3. Graph Analysis Between Heroes:
3.1: Analyze the relationship between heroes: synergy & suppression
3.2: Identify the most significant heroes and the combination of heroes
","m1fname":"Jihan","projectname":"Dota 2: Exploratory Data Analysis & Prediction","m3fname":"Yinxiang"},{"m2lname":"Lin","m4lname":"","m3uni":"","m1uni":"sg3303","m4uni":"","pid":"201612-36","m2uni":"ll2948","timestring":"Wed Dec 14 20:48:44 2016","m4fname":"","language":"Mahout, Hadoop, Java ,Angular JS, ElasticSearch","m3lname":"","dataset":"We collected 10000+ free text travel blogs from websites travelblog.org. The data was collected by scraping through content using Python beautiful soup and Java","m1lname":"Goyal","industry":"Media","analytics":"Clustering , Topic modelling","m2fname":"Ling ","description":"People usually spend hours researching about places through articles and blogs all over internet. In this project , we are providing one platform where people can find useful information from travel blogs .","m1fname":"Sugandha","projectname":"Holiday Planner","m3fname":""},{"m2lname":"Huang","m4lname":"","m3uni":"","m1uni":"xw2401","m4uni":"","pid":"201612-16","m2uni":"mh3560","timestring":"Wed Dec 14 20:59:11 2016","m4fname":"","language":"python, android, spark, MySQL","m3lname":"","dataset":"The dataset is public and the link is http://web.mta.info/developers/download.html
","m1lname":"Wang","industry":"Transportation","analytics":"In this project, we designed an android app in which the user can enter the starting stop ID / destination stop ID.
In response, the app shows the user a route on Google Maps with an estimated time of arrival at the destination.

Technology used:
Spark, Flask server, MySQL, ngrok
Android, MTA bus time API , Google Map API, Google Location service, Google Direction service

","m2fname":"Meng","description":"MTA bus service is an important part of the New York City public transportation service. Many people rely on it for work, school and other purposes.

Unlike subway and railway service, buses can run well off schedule because of weather or accidents. Bus trips can also take much longer during peak hours than at other times.

We therefore want to gather historical MTA bus arrival data and build a prediction model based on it. When a user inputs a specific time, we can estimate how long the bus will take to arrive.
","m1fname":"Xucan","projectname":"MTA Bus Time Prediction","m3fname":""},{"m2lname":"Lin","m4lname":"","m3uni":"","m1uni":"sg3303","m4uni":"","pid":"201612-36","m2uni":"ll2948","timestring":"Wed Dec 14 21:00:08 2016","m4fname":"","language":"Mahout, Hadoop, Java ,Angular JS, ElasticSearch","m3lname":"","dataset":"We collected 10000+ free text travel blogs from websites travelblog.org. The data was collected by scraping through content using Python beautiful soup and Java","m1lname":"Goyal","industry":"Media","analytics":"Clustering , Topic modelling","m2fname":"Ling ","description":"People usually spend hours researching about places through articles and blogs all over internet. In this project , we are providing one platform where people can find useful information from travel blogs .","m1fname":"Sugandha","projectname":"Holiday Planner","m3fname":""},{"m2lname":"Smajlaj","m4lname":"","m3uni":"nz2250","m1uni":"bjz2107","m4uni":"","pid":"201612-25","m2uni":"ss3912","timestring":"Wed Dec 14 21:13:38 2016","m4fname":"","language":"Python, Pandas, Hadoop, Pyspark, Scikitlearn, Java, Flask","m3lname":"Zhao","dataset":"Twitter Dumps for a few months. Public data set, online
https://archive.org/details/twitterstream

Quandl Stock prices, self parsed for information ","m1lname":"Zhu","industry":"Finance","analytics":" Logistic Regression, Classification, visualization in Flask of stock market price movement, predictive analysis of behavior of EoD price deltas, TF-IDF bag of words ML ","m2fname":"Sabina","description":"EoD Price for a company condenses millions of microscopic changes in the entire market into a single number
> 500 million tweets a day; each a microcosm of public sentiment for a company
Aggregating social media messages should reliably predict trend for company market behavior
Leverage interests in finance
Twitter is one of the most diverse, public data sets
Experiment with Hadoop, Spark, utilize our skills in machine learning and data analysis
Predict EoD Price behavior real-time ","m1fname":"Ben","projectname":"Tweet Based EoD Stock Market Price Predictor","m3fname":"Nan"},{"m2lname":"Lu","m4lname":"","m3uni":"xl2600","m1uni":"yc3171","m4uni":"","pid":"201612-71","m2uni":"hl2967","timestring":"Wed Dec 14 21:17:00 2016","m4fname":"","language":"Python, SQL, JavaScript","m3lname":"Li","dataset":"1. 3 months data of the yellow (~1.6G each) and green (~230M each) taxi trip records in 2016 on The New York City Taxi & Limousine Commission:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

2. National Climatic Data:
https://www.ncdc.noaa.gov/cdo-web/","m1lname":"Cen","industry":"Transportation","analytics":"Big Data Tools: Hive, Hadoop

Machine Learning: scikit-learn, XGBoost","m2fname":"Houliang","description":"In New York City, people take taxis far more frequently than in any other city. There are millions of taxi rides in New York City each year. Instead of being booked by phone a day ahead of time, New York taxis pick up passengers on the street, so it can be hard to get a taxi during rush hours. It is therefore necessary to develop an understanding of taxi supply and demand.","m1fname":"Yue","projectname":"Analysis of NYC Taxi Data","m3fname":"Xinyi"},{"m2lname":"Shen","m4lname":"","m3uni":"wz2363","m1uni":"ck2749","m4uni":"","pid":"201612-78","m2uni":"ys2840","timestring":"Wed Dec 14 21:22:20 2016","m4fname":"","language":"Spark/Scala/Python/Matlab/Excel","m3lname":"Zhang","dataset":"Climatology Data from NOAA
https://www.nodc.noaa.gov/access/index.html
Global Surface Temperature Data from NASA
http://data.giss.nasa.gov/gistemp/
Sea Level Trends Data from NOAA
http://tidesandcurrents.noaa.gov/sltrends/sltrends.html
Sea Level Trends Data From PSMSL
http://www.psmsl.org/data/obtaining/
","m1lname":"Kanungo","industry":"Social Science-Government","analytics":"Data Filtering using Python and MATLAB
ML-LIB
Regression with Stochastic Gradient Descent
Exponential Least Square Fitting
MATLAB 2-D/3-D Visualization
NCBrowser Visualization
","m2fname":"Yizhou","description":"The issue of Global Warming has been a controversial topic over the past decades. President Trump even claimed it is a made-in-China topic. The impact of global warming can truly be devastating to our planet. For example, 40% of the population in the Netherlands are exposed to the risk of drowning. Rising sea levels resulting from global warming can submerge urban land such as Manhattan. We used evidence from big geographical data and evaluated the impact of global warming. We predicted the global temperature and the resulting sea level trend around the whole world.
","m1fname":"Chandan","projectname":"The Impact of Global Warming from Big Geographical Data","m3fname":"Wei"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ik2338","m4uni":"","pid":"201612-38","m2uni":"","timestring":"Wed Dec 14 21:23:02 2016","m4fname":"","language":"Python ","m3lname":"","dataset":"Million Song Database -Free collection of audio features and metadata for a million contemporary popular music tracks
Developed by Columbia University’s LabRosa and The Echo Nest
Data is accessible using both SQLite databases or HDF5 files
Metadata includes the song’s name, artist’s name, similar artists, artist tags
","m1lname":"Keren","industry":"Media","analytics":"System G","m2fname":"","description":"We have developed a locally hosted system that provides recommended playlists off of user inputs or user activity. This system will be open-source, lightweight, and optimized for large datasets. We will then use System G as a visualization tool to present the music library along with data from our recommendation system. This tool would display artist \"bubbles\" with colors and XY coordinates based off of data from our algorithm.","m1fname":"Itay","projectname":"RecoSonic: Self-Hosted Music Recommendation System","m3fname":""},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Wed Dec 14 21:31:40 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, sklearn, numpy","m3lname":"","dataset":"We would be working with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. In this dataset, Intracranial EEG was recorded from dogs (~40GB) with naturally occurring epilepsy using an ambulatory monitoring system to identify a region of brain that can be resected to prevent future seizures are included.","m1lname":"Guerra Marin","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes into consideration the temporal information is necessary. EEG data has been analyzed in many projects and we are currently considering either applying Hidden Markov Models.","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entailed high-volume, high-velocity and high-variety information dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. 
More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

We therefore decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it happens as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Our project entails understanding all the different pattern of the 4 states of brain activity and identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. ","m1fname":"Jorge","projectname":"Seizure forecasting analysis of EEG data","m3fname":""},{"m2lname":"Yang","m4lname":"","m3uni":"sl4039","m1uni":"sf2794","m4uni":"","pid":"201612-84","m2uni":"gy2237","timestring":"Wed Dec 14 21:32:43 2016","m4fname":"","language":"python","m3lname":"Liu","dataset":"Data on past NBA game (each game, each team, each player)","m1lname":"Fang","industry":"Information","analytics":"naive Bayes","m2fname":"Guang","description":"Prediction for NBA game result and final table. Based on the performance of each player in the past games.","m1fname":"Shu","projectname":"Data Analysis For Basketball Matches Outcomes Prediction ","m3fname":"Shutian"},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Wed Dec 14 21:36:44 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, sklearn, numpy","m3lname":"","dataset":"We would be working with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. In this dataset, Intracranial EEG was recorded from dogs (~40GB) with naturally occurring epilepsy using an ambulatory monitoring system to identify a region of brain that can be resected to prevent future seizures are included.","m1lname":"Guerra Marin","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes into consideration the temporal information is necessary. 
EEG data has been analyzed in many projects and we are currently considering either applying Hidden Markov Models.","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entailed high-volume, high-velocity and high-variety information dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

We therefore decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it happens as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Our project entails understanding all the different pattern of the 4 states of brain activity and identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. ","m1fname":"Jorge","projectname":"Seizure Forecasting Analysis of EEG Data","m3fname":""},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Wed Dec 14 21:38:12 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, sklearn, numpy","m3lname":"","dataset":"We would be working with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. In this dataset, Intracranial EEG was recorded from dogs (~40GB) with naturally occurring epilepsy using an ambulatory monitoring system to identify a region of brain that can be resected to prevent future seizures are included.","m1lname":"Guerra Marin","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes into consideration the temporal information is necessary. EEG data has been analyzed in many projects and we are currently considering either applying Hidden Markov Models.","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entailed high-volume, high-velocity and high-variety information dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. 
More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

We therefore decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it happens as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Our project entails understanding all the different pattern of the 4 states of brain activity and identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. ","m1fname":"Jorge","projectname":"Seizure Forecasting Analysis of EEG Data","m3fname":""},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Wed Dec 14 21:38:33 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, Sklearn, Numpy","m3lname":"","dataset":"We would be working with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. In this dataset, Intracranial EEG was recorded from dogs (~40GB) with naturally occurring epilepsy using an ambulatory monitoring system to identify a region of brain that can be resected to prevent future seizures are included.","m1lname":"Guerra Marin","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes into consideration the temporal information is necessary. EEG data has been analyzed in many projects and we are currently considering either applying Hidden Markov Models.","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entailed high-volume, high-velocity and high-variety information dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. 
More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

We therefore decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it happens as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). Our project entails understanding all the different pattern of the 4 states of brain activity and identify a preictal state that can be differentiated from the interictal, ictal, and postictal state. ","m1fname":"Jorge","projectname":"Seizure Forecasting Analysis of EEG Data","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"jy2803","m1uni":"cr2826","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Wed Dec 14 21:41:45 2016","m4fname":"","language":"Python, Django, Mahout, HTML, CSS, Node.Js, Bootstrap, Git I/O, AWS RDS","m3lname":"Yu","dataset":"1. OKCupid Profile Data - We got this dataset by crawling through the OKCupid API. This dataset has 50k user profile data","m1lname":"Ren","industry":"Information","analytics":"Kmeans - recommendation
Microsoft Face API - Recognize face in each photo and find top k similar faces to user’s
photo","m2fname":"Lyujia","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation of this project is the potential of recommending people by their facial characteristics.","m1fname":"Chuqiao","projectname":"MateFinder: The Next Generation of Dating Recommender System","m3fname":"Jinyang"},{"m2lname":"Zhang","m4lname":"","m3uni":"jy2803","m1uni":"cr2826","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Wed Dec 14 21:43:53 2016","m4fname":"","language":"Python, Django, Mahout, HTML, CSS, Node.Js, Bootstrap, Git I/O, AWS RDS ","m3lname":"Yu","dataset":"OKCupid Profile Data - We got this dataset by crawling through the OKCupid API. This dataset has 50k user profile data ","m1lname":"Ren","industry":"Social Science-Government","analytics":"Kmeans - recommendation
Microsoft Face API - Recognize face in each photo and find top k similar faces to user’s
photo","m2fname":"Lyujia","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation of this project is the potential of recommending people by their facial characteristics.","m1fname":"Chuqiao","projectname":"MateFinder: The Next Generation of Dating Recommender System","m3fname":"Jinyang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"ajc2243","m4uni":"","pid":"201612-104","m2uni":"","timestring":"Wed Dec 14 21:43:55 2016","m4fname":"","language":"Languages: Matlab (for image proccessing toolbox), C. Matlab was chosen because it had the capability of working with images, and it was used in combination with the hadoop/spark toolbox to handle large scale data:https://www.mathworks.com/discovery/matlab-hadoop-and-spark.html","m3lname":"","dataset":"
https://www.transportation.gov/data (government data warehouse for traffic information, some of which is used for similar applications already. It is my hope that incorporating other types of data will increase the value of the supplied data.)

https://developers.google.com/maps/ (APIs for Google Maps. I can perform computationally intensive image processing on critical faults, and traffic analysis on greater areas.)

Draper Laboratory Image Bank - I have access to daily images of parts of the US. I have worked with these images in the past and can import them easily, but I cannot release them.
","m1lname":"Cunningham","industry":"Transportation","analytics":"This system currently uses a combination of algorithms and big data structures to store images onto a local Ubuntu machine where they are run through a MATLAB script, reduced to raw numbers, and stored based on their metrics alongside (longitude, latitude) into a database which is visualized in various ways.

","m2fname":"","description":"The US spends over 400 billion USD per year on maintaining its roads and infrastructure, including the oldest large-scale highway system in the world. Even with that investment, the road infrastructure in the United States has a D+ rating from the society of civil engineers, and is struggling to keep up with the increased traffic of a growing economy.

I plan to develop a system to help identify issues with the roads, allowing government entities to identify where these road faults exist. Rather than rely on manual surveys, an automated system could collect data on faults that appear in satellite imagery, common traffic jams on Google Maps, cars collecting at intersections, and any other existing available data.

With this large scale dataset, I can extract meaningful actionable data including:
-Identify locations with issues/faults
-Ranking of issues for prioritization
-Socio/economic data on locations/communities with failing roads and infrastructure

Dataset: I have several sources of data for this, including:

I plan to display the data in a way that conveys quantitative road condition information to the government. To achieve this with the available data, I will use a combination of the methods below:

- trending or clustering of traffic information. Try to graphically visualize locations of common traffic jams.

- image processing of known jam locations after they are identified above. Look for potholes, debris, broken pavement, etc. This will separate the road design issues from the issues that simply require repair. I have skills in image processing, and will only need to perform this computationally expensive algorithm on a small dataset.

Ultimately I would like to display a map of known issues. ","m1fname":"Andrew","projectname":"Finding Faults in Road Systems","m3fname":""},{"m2lname":"Zheng","m4lname":"","m3uni":"mz2591","m1uni":"tw2568","m4uni":"","pid":"201612-24","m2uni":"zz2406","timestring":"Wed Dec 14 21:44:23 2016","m4fname":"","language":"Python, Javascript; Spark","m3lname":"Zhou","dataset":"The Yelp dataset is provided by the Yelp Data Challenge. The URL of the source is: https://www.yelp.com/dataset_challenge","m1lname":"Wu","industry":"Information","analytics":"Analytics: City business trend, meals ordered most frequently, most crowded hours of the restaurants, etc.
Algorithm: Clustering, Support Vector Machine, Bayes Classifier, Random Forest, CNN
Visualization: Review Stars distribution, Review Texts Word-cloud, Review text sentiment analysis","m2fname":"Zhi ","description":"The overall objective is to explore and visualize the yelp dataset, predict yelp review Star categories from yelp reviews as well as visualize the sentiment analysis of the dataset. ","m1fname":"Tongyun","projectname":"Yelp Reviews Exploration & Visualization","m3fname":"Ming"},{"m2lname":"Ren","m4lname":"","m3uni":"jy2803","m1uni":"lz2467","m4uni":"","pid":"201612-62","m2uni":"cr2826","timestring":"Wed Dec 14 21:46:05 2016","m4fname":"","language":"Python, Django, Mahout, HTML, CSS, Node.Js, Bootstrap, Git I/O, AWS RDS ","m3lname":"Yu","dataset":" OKCupid Profile Data - We got this dataset by crawling through the OKCupid API. This dataset has 50k user profile data
","m1lname":"Zhang","industry":"Social Science-Government","analytics":"Kmeans - recommendation
Microsoft Face API - Recognize face in each photo and find top k similar faces to user’s
photo","m2fname":"Chuqiao","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation of this project is the potential of recommending people by their facial characteristics.","m1fname":"Lyujia","projectname":"MateFinder: The Next Generation of Dating Recommender System","m3fname":"Jinyang"},{"m2lname":"Shi","m4lname":"","m3uni":"","m1uni":"rs3569","m4uni":"","pid":"201612-96","m2uni":"ys2901","timestring":"Wed Dec 14 21:47:14 2016","m4fname":"","language":"Python, theano, lasagne, tensorflow, spark ","m3lname":"","dataset":"Thanks Erik Bernhardsson for collecting this amazing dataset!
You can download data by typing:
wget https://s3.amazonaws.com/erikbern/fonts.hdf5
in your terminal,
warning: the file is about 14GB","m1lname":"Shi","industry":"Media","analytics":"We train a generative network recognizing 56k fonts by learning a lower dimensional embedding space
Visualize font embeddings in 3 dimensions (using t-SNE) with the Google Embedding Projector
Write a script to read hdf5 file into spark

System specification:
We used service from Amazon AWS
Instance: P2.xlarge
Ubuntu 14.04 system
High Frequency Intel Xeon E5-2686v4 (Broadwell) Processors and 61 GiB memory
High-performance NVIDIA Tesla K80 GPUs
2,496 parallel processing cores
12GiB of GPU memory ","m2fname":"Yuchen","description":"Nowadays, we have seen that artificial intelligence technology (especially deep learning) can achieve human level performance or even beat human in many complicated tasks such as image recognition, playing Go games.

At the same time, we find that it is still very challenging to teach a machine art creation, music composition, or article writing.

In this project, we challenge ourselves to teach a machine to write in its own style.

In order to do that:

1. We try to create a “font vector” in latent space that “defines” a certain font. We can then embed all fonts in one space where similar fonts have similar vectors.
2. Leveraging the online font dataset, we plan to train a generative neural network that can map from our embedding space to different font styles.
3. Going beyond this, we can ask the computer to write in a random style or to mix styles from several fonts to generate its own.

Potential marketplace:
1,Help senior (disabled) people write in their style
2,Calligraphy education ","m1fname":"Ruixiong","projectname":"analyzing 56k fonts using deep neural networks","m3fname":""},{"m2lname":"Suzuki","m4lname":"","m3uni":"ssl2153","m1uni":"tt2573","m4uni":"","pid":"201612-2","m2uni":"hs2865","timestring":"Wed Dec 14 21:48:51 2016","m4fname":"","language":"Python, HTML, jQuery,Bootstrap, MySQL, IBM System G","m3lname":"Lim","dataset":"We used the Yelp Data set from the Yelp Data Challenge. The total size of the data was 3GB.","m1lname":"Tan","industry":"Media","analytics":"We used IBM System G for preliminary visualizations, k-nearest neighbors for returning the most popular restaurants, topic modeling and sentiment analysis for keyword search, and time series methods for rate prediction.

We used the following packages: scikit-learn, nltk, gensim, statsmodels.tssa","m2fname":"Hiroaki","description":"Recently, recommendation sites have become more popular with users, mostly because recommenders simplify a user’s search process by suggesting locations, people and organizations, such as restaurants, hotels, jobs, even dating partners. But current systems focus their recommendations on the features available at the time of search, and not at the time of the expected actual usage or engagement, such as future events for birthday celebrations, and hotel rooms for future travel. It is well documented that proximity and ratings are key determinants for users in making their selection, and that ratings are frequently updated by new user reviews.

Yet, how can the user be sure that the recommendations given in the day of search will be of the same predictive quality at the time the user actually experiences the recommended site or business?

We propose a recommendation system that includes the recommended businesses as well as their predicted future ratings. Including predicted ratings allow users an extra layer of information about the businesses with which to make an informed decision about their selection. Users may search for a restaurant and find out that the ratings are predicted to be low on the particular future day they plan on going, thus allowing them to decide to go to another restaurant where predicted ratings are expected to be higher on that day. Our recommendation design is divided into three parts: Showing popular restaurants, Keyword search, and User rate prediction.","m1fname":"Tian ","projectname":"Recommendations With Predicted Future Ratings","m3fname":"Shiemi"},{"m2lname":"Shi","m4lname":"","m3uni":"","m1uni":"rs3569","m4uni":"","pid":"201612-96","m2uni":"ys2901","timestring":"Wed Dec 14 21:51:09 2016","m4fname":"","language":"Python, theano, lasagne, tensorflow, spark ","m3lname":"","dataset":"Thanks Erik Bernhardsson for collecting this amazing dataset!
You can download data by typing:
wget https://s3.amazonaws.com/erikbern/fonts.hdf5
in your terminal,
warning: the file is about 14GB","m1lname":"Shi","industry":"Media","analytics":"We train a generative network recognizing 56k fonts by learning a lower dimensional embedding space
Visualize the font embeddings in 3 dimensions (using t-SNE) with the Google Embedding Projector
Write a script to read the HDF5 file into Spark

System specification:
We used service from Amazon AWS
Instance: P2.xlarge
Ubuntu 14.04 system
High Frequency Intel Xeon E5-2686v4 (Broadwell) Processors and 61 GiB memory
High-performance NVIDIA Tesla K80 GPUs
2,496 parallel processing cores
12GiB of GPU memory ","m2fname":"Yuchen","description":"Nowadays, we have seen that artificial intelligence technology (especially deep learning) can achieve human level performance or even beat human in many complicated tasks such as image recognition, playing Go games.

At the same time, we find that it is still very challenging to teach a machine art creation, music composition, or article writing.

In this project, we challenge ourselves to teach a machine to write in its own style.

In order to do that:

1. We try to create a “font vector” in latent space that “defines” a certain font. We can then embed all fonts in one space where similar fonts have similar vectors.
2. Leveraging the online font dataset, we plan to train a generative neural network that can map from our embedding space to different font styles.
3. Going beyond this, we can ask the computer to write in a random style or to mix styles from several fonts to generate its own.

Potential marketplace:
1,Help senior (disabled) people write in their style
2,Calligraphy education ","m1fname":"Ruixiong","projectname":"analyzing 56k fonts using deep neural networks","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"kz2246","m1uni":"yt2495","m4uni":"","pid":"201612-13","m2uni":"pw2406","timestring":"Wed Dec 14 21:56:48 2016","m4fname":"","language":"Python, Pyspark, Opencv","m3lname":"Zhang","dataset":"Eight target categories are available in this dataset: Albacore tuna, Bigeye tuna, Yellowfin tuna, Mahi Mahi, Opah, Sharks, Other (meaning that there are fish present but not in the above categories), and No Fish (meaning that no fish is in the picture).
Each image has only one fish category, except that there are sometimes very small fish in the pictures that are used as bait. ","m1lname":"Teng","industry":"Information","analytics":"Convolutional Neural Network, Random Forest, Classifier Chains, GBDT
SIFT features, Bag of Words method","m2fname":"Pengfei","description":"The Conservancy is looking to the future by using cameras to dramatically scale the monitoring of fishing activities to fill critical science and compliance monitoring data gaps. And what we do is to develop algorithms to automatically detect and classify species of tunas, sharks and more that fishing boats catch, which will accelerate the video review process.","m1fname":"Yueying ","projectname":"Multiclass Image Classification with Fisheries Monitoring Data","m3fname":"Kewen"},{"m2lname":"Zhang","m4lname":"","m3uni":"cr2826","m1uni":"jy2803","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Wed Dec 14 22:08:08 2016","m4fname":"","language":"pyt","m3lname":"Ren","dataset":"hehehe","m1lname":"Yu","industry":"Social Science-Government","analytics":"wet","m2fname":"Lyujia","description":"lalala","m1fname":"Jinyang","projectname":"MateFinder","m3fname":"Chuqiao "},{"m2lname":"Zhou","m4lname":"","m3uni":"hl2915","m1uni":"yl3407","m4uni":"","pid":"201612-10","m2uni":"lz2484","timestring":"Wed Dec 14 22:09:04 2016","m4fname":"","language":"Python, jupyter, pyspark","m3lname":"Li","dataset":"We used Raw NYC Taxi Trip Data that has been available to public
We processed around 4.1 million records from March 2015 to May 2015, and May 2016
","m1lname":"Li","industry":"Transportation","analytics":"Random Forest Algorithm,
Linear Regression,
Carto db
","m2fname":"Liutong","description":"Save $$ money and time on travel
Recommend optimal traveling plans for users
Predict traveling costs based on model","m1fname":"Yiting","projectname":"Travel Planing System","m3fname":"Hongyi"},{"m2lname":"Jiang","m4lname":"","m3uni":"nh2503","m1uni":"ys2867","m4uni":"","pid":"201612-66","m2uni":"mj2716","timestring":"Wed Dec 14 22:11:30 2016","m4fname":"","language":"Hadoop, Hive, Python, Java","m3lname":"Hao","dataset":"Dataset we used is Yelp Dataset.
The files containing data are ‘business.csv’ and ‘reviews.csv’, which are 2GB in total.

The two files have the following characteristics:
1. 2.7M reviews and 649K tips by 687K users for 86K businesses
2. 566K business attributes, e.g., hours, parking availability, ambience.
3. Social network of 687K users for a total of 4.2M social edges.
4. Aggregated check-ins over time for each of the 86K businesses

http://www.ee.columbia.edu/~cylin/course/bigdata/submitprojectinfo.html
","m1lname":"Song","industry":"Information","analytics":"1. Classification: classify the restaurants from yelp dataset into different categories and perform the visualization of the restaurants.

2. Recommendation: based on users' daily dining behavior, recommend restaurants in the same categories that the users may be fond of.

Analytics: Recommendation:
Item based recommendation algorithm;
Collaborative filtering;

Clustering:
TFIDF, Kmeans.","m2fname":"Meng","description":"Objectives:
Current Yelp recommendation results are not good, so we want to improve Yelp's recommendation system to customize the results for each user.

Innovations:
Use user reviews to cluster the restaurants. Build a simple website to show recommendations based on users' favorites.

Capabilities:
Make use of Hadoop, Hive and Python.

Why:
It can recommend similar restaurants and cluster different restaurants based on user's favorites. This can give out some customized results rather than same results provided by current Yelp.","m1fname":"Yao","projectname":"Restaurants Classification and Recommendation System Based on Yelp Dataset and Yelp API","m3fname":"Nongxiao"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nj2315","m4uni":"","pid":"201612-8","m2uni":"","timestring":"Wed Dec 14 22:18:06 2016","m4fname":"","language":"Python with for web scraping and toolkit implementation, PySpark for machine learning algorithms","m3lname":"","dataset":"Dataset: 700MB-1GB of Craigslist listings collected via web-scraping between 12/1/2016 and 12/22/2016

","m1lname":"Jones","industry":"Social Science-Government","analytics":"Spark - MapReduce, K-Means Clustering, and Random Forests","m2fname":"","description":"Objectives: Create a toolkit for detecting irregularities in online apartment listings using machine-learning-based anomaly detection, determine which algorithms are most effective for this purpose.

Innovations: Use of anomaly detection for the purpose of real estate fraud detection, examining which techniques are most effective for analyzing real estate data and why.

Capabilities: This toolkit provides an interface for outlier detection using univariate analysis, clustering, and random forests in the context of apartment listings.

Importance: This toolkit could be used by apartment listing websites or by law enforcement agencies to detect and prevent potential real estate fraud. It could be especially useful for apartment listing sites, as preventing fraud would protect the site's reputation for reliability and accuracy.
","m1fname":"Nathaniel","projectname":"Real Estate Fraud Detection","m3fname":""},{"m2lname":"Dong","m4lname":"","m3uni":"zj2213","m1uni":"yw2812","m4uni":"","pid":"201612-4","m2uni":"cd2847","timestring":"Wed Dec 14 22:26:01 2016","m4fname":"","language":"R, Python, Spark","m3lname":"Jia","dataset":"The dataset is from New York Times Article Search API through nytimesarticle package in Python. We have 798 csv files in total and the size is 2.92GB.","m1lname":"Wang","industry":"Finance","analytics":"1. Unigram TF-IDF is used to generate features.
2. Try some machine learning algorithms to build a predictive model.
3. Employ sparsity algorithms thus as LASSO and Elastic Net Regularization for the exploratory model.","m2fname":"Chunfeng","description":"Nowadays, there is an interesting phenomenon emerging on the stock market: with more economic scholars start to paying attention to the use of social media data, many hedge funds companies abandoned their traditional forecasting techniques and applied the new ways to make more profit by predicting stock markets through data from social media platforms. Triggered by this idea, we decide to examine how well could we forecast S&P 500 based on NY Times data in specific. Multiple related research papers have illustrated how to forecast stock price through tweets as reflections of people’s mood, but seldom does a paper collect text data from mainstream media, and this is the reason that we decide to explore text data from NY Times.

Different from point forecasting, we are going to predict the price change in S&P 500 by making directional forecasting, which in specific, corresponding to if the predicted stock price going up or down, we mark “+1” and “-1” as the result respectively. By comparing with the actual price increase or decrease in S&P 500, we could examine the correct rate of prediction and conclude if the forecasting technique is precise enough.","m1fname":"Yusen","projectname":"Forecasting S&P 500 Index Using NY Times Data","m3fname":"Ze"},{"m2lname":"Xu","m4lname":"","m3uni":"yz2837","m1uni":"xj2191","m4uni":"","pid":"201612-26","m2uni":"qx2154","timestring":"Wed Dec 14 22:27:42 2016","m4fname":"","language":"R, Python, databricks","m3lname":"Zhong","dataset":"The dataset that used in our project is provided by www.quandl.com. It includes the date, time, high price, low price, open price, close price and trading volume of Chinese company stocks from 2006 to 2016.","m1lname":"Jia","industry":"Finance","analytics":" Random Forest, Gradient-boosted tree, Logistic Regression","m2fname":"Qing","description":"Predicting price trends of particular stocks based on historical movement.
Comparing the prediction results of different classification algorithms, including Random Forest, Gradient-boosted tree, Logistic Regression, etc.

Prediction of stock price is a hard mission because it's related to many factors, such as company news and performance, interest rates, economic outlook. Also it could provide valuable information to investigators or peer companies. Our project is simply based on the Chinese company stock prices from 2006 to 2016. ","m1fname":"Xinli","projectname":"Predicting stock price movement using machine learning algorithms","m3fname":"Yitong"},{"m2lname":"Ma","m4lname":"","m3uni":"sy2615","m1uni":"tl2710","m4uni":"","pid":"201612-75","m2uni":"qm2124","timestring":"Wed Dec 14 22:28:08 2016","m4fname":"","language":"Python, R; Spark, Jupyter, SystemG","m3lname":"Yin","dataset":"We used 2 main datasets.

The first is a set of 200K records of Pokémon that appeared in the game, together with the location, time, weather, and various other features. It is available from Kaggle:
https://www.kaggle.com/semioniy/predictemall

The second is an open dataset detailing the 311 complaints received in NYC. It has all the records from 2010 to 2016.
https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/data

Both datasets are available online to public.","m1lname":"Lu","industry":"Information","analytics":"For data visualization, we performed
Python to build network graph data; SystemG queries for visualization; Python for map plotting; R (leaflet) for map plotting

For classification algorithms, we performed:
Random forest, K Nearest Neighbor, Logistic Regression, Support Vector Machine

We also did various descriptive analytics using Python and R, for example we summarized the hourly frequency of different pokemons in terms of their rarity","m2fname":"Qing","description":"PokemonGo is a mobile augmented reality game developed by Niantic inc. for iOS, Android, and Apple Watch devices. It was initially released in selected countries in July 2016. In the game, players use a mobile device's GPS capability to locate, capture, battle, and train virtual creatures, called Pokémon, who appear on the screen as if they were in the same real-world location as the player.

One of the challenges faced by players is capturing a specific type of Pokémon. There are 151 types in total, and the rare ones have a low chance of occurrence (~0.01% on average). We want to use historical data recording the occurrences of each Pokémon, together with features (such as time, weather, and wind speed), to predict where and when to catch specific Pokémon. We will use various machine learning algorithms to tackle this problem.
","m1fname":"Tianhao","projectname":"PokemonGo Predict Them All!","m3fname":"Shengzhong"},{"m2lname":"Liu","m4lname":"","m3uni":"sf2794","m1uni":"gy2237","m4uni":"","pid":"201612-84","m2uni":"sl4039","timestring":"Wed Dec 14 22:29:19 2016","m4fname":"","language":"Python, spark","m3lname":"Fang","dataset":"We use python to write web crawler
URL:http://www.basketball-reference.com/
• Data for each Game
• Points of each quarter (overtime if available)
• Total score
• Win/Loss
• Data for each team (per season)
• Win/Loss rate
• Field Goals (& attempts, & percentage)
• 3-point Field Goals (& attempts, & percentage)
• 2-point Field Goals (& attempts, & percentage)
• Free Throws (& attempts, & percentage)
• Rebounds (Offensive & Defensive)
• Assists, Steals, Blocks, Turnover, Fouls
• Total points
• Data for each player
• (Clauses same with each team)","m1lname":"Yang","industry":"Information","analytics":"
Our prediction uses a naive Bayes classification model.
We extract useful feature vectors for naive Bayes (Rebounds, Assists, Steals, Blocks, ...) and modify the vectors based on each team's historical performance and the newest game report.
The classification label is the result of each game.
We use python on Spark to realize the classification.","m2fname":"Shutian","description":"Sport result prediction is nowadays very popular which makes the problem of predicting the results of sporting events, a new and interesting challenge.
-Help sports commentators with their work
-Help the team assess whether they have reached their expected performance
-Help advertisers make decisions
Previous papers concentrated on classification results from the original game data, or from data modified only by simple feature extraction methods.
We want to do more:
We want to establish a dynamic model based on Naive Bayesian model. And We think that we should modified the feature vector based on daily match result. And also, we want to consider the effect of team member changing.","m1fname":"Guang","projectname":"Data Analysis for Basketball Matches Outcomes Prediction ","m3fname":"Shu"},{"m2lname":"Dong","m4lname":"","m3uni":"zj2213","m1uni":"yw2812","m4uni":"","pid":"201612-4","m2uni":"cd2847","timestring":"Wed Dec 14 22:30:36 2016","m4fname":"","language":"R, Python, Spark ","m3lname":"Jia","dataset":"The dataset is from New York Times Article Search API through nytimesarticle package in Python. We have 798 csv files in total and the size is 2.92GB.
","m1lname":"Wang","industry":"Finance","analytics":"1. Unigram TF-IDF is used to generate features.
2. Try some machine learning algorithms to build a predictive model.
3. Employ sparsity algorithms thus as LASSO and Elastic Net Regularization for the exploratory model.","m2fname":"Chunfeng","description":"Nowadays, there is an interesting phenomenon emerging on the stock market: with more economic scholars start to paying attention to the use of social media data, many hedge funds companies abandoned their traditional forecasting techniques and applied the new ways to make more profit by predicting stock markets through data from social media platforms. Triggered by this idea, we decide to examine how well could we forecast S&P 500 based on NY Times data in specific. Multiple related research papers have illustrated how to forecast stock price through tweets as reflections of people’s mood, but seldom does a paper collect text data from mainstream media, and this is the reason that we decide to explore text data from NY Times.

Different from point forecasting, we are going to predict the price change in S&P 500 by making directional forecasting, which in specific, corresponding to if the predicted stock price going up or down, we mark “+1” and “-1” as the result respectively. By comparing with the actual price increase or decrease in S&P 500, we could examine the correct rate of prediction and conclude if the forecasting technique is precise enough.","m1fname":"Yusen","projectname":"Forecasting S&P 500 Index Using NY Times Data","m3fname":"Ze"},{"m2lname":"Liu","m4lname":"","m3uni":"jl4564","m1uni":"yx2316","m4uni":"","pid":"201612-42","m2uni":"yl3380","timestring":"Wed Dec 14 22:33:12 2016","m4fname":"","language":"R, Python, Spark","m3lname":"Liu","dataset":"The dataset will be collected from a Kaggle competition: State Farm Distracted Driver Detection. It includes two files: imgs.zip, a zipped folder of all (train/test) images, and driver_imgs_list.csv, a list of training images, their subject (driver) id, and class/label id.","m1lname":"Xie","industry":"Social Science-Government","analytics":"Face Recognition: OpenCV Haar Cascade Classifier

Feature extraction: Caffe deep features + OpenCV SIFT features

Feature decomposition: PCA

Classification model: Neural Network, KNN, Random Forest
(With Cross validation + Fine-tuning)

Interface: R Shiny

","m2fname":"Yuhang","description":"Objectives: Enable dashboard cameras to automatically detect drivers engaging in distracted behaviors to avoid traffic accidents.

Innovations: The system first runs face recognition to avoid unnecessary feature extraction and classification; once the driver's face is recognized, the system extracts deep features from the photo using neural networks. The previously trained model is then applied to classify the driver's behavior. If a distraction is detected, the system warns the driver.

Capabilities: The model can function as long as a photo of the driver is provided. The system is implemented on the iOS platform and requires several tools/packages: R, Python, Spark, Caffe, OpenCV.

Why are these research/toolkits important: According to the CDC motor vehicle safety division, one in five car accidents is caused by a distracted driver. This translates to 425,000 people injured and 3,000 people killed by distracted driving every year. This tool can help us detect distracted driving behavior in time and potentially prevent accidents from happening.
","m1fname":"Yaqing","projectname":"Drive and Arrive - Distracted Driver Detection","m3fname":"Jiaming"},{"m2lname":"Owens","m4lname":"","m3uni":"ka2604","m1uni":"kcl2143","m4uni":"","pid":"201612-3","m2uni":"ao2595","timestring":"Wed Dec 14 22:33:57 2016","m4fname":"","language":"Python/Spark, SparkSQL","m3lname":"Andoni","dataset":"We leveraged existing Census Bureau data across multiple data sets and connected them with driving distances from Web APIs (Bing Maps). Educational accounts are limited 125,000 calls per year and the final data set contains over 111,000 routes. The routes contain driving distances, traffic, and driving times (as well as full directions). The original 110,000,000 combinations we narrowed down by eliminating small metro areas and pre-filtering based on the minimum straight line distance.

We compared the new dataset to the 2009 National Household Travel Survey which contains a lot of information on travel and fuel consumption to verify our dataset.","m1lname":"Lauritzen","industry":"Transportation","analytics":"Developed custom code utilizing SparkSQL to conduct the analytics. Visualizations were done in Plotly. The data was put into System G, but it is not well represented by a graph.","m2fname":"Adam","description":"In our project, we looked at the geographic and financial impact of a carbon tax across the nation. Specifically, we focused on commuting and the large geographic spread of people from city centers within the United States. Unlike other sources of carbon utilization, commuting distances are much harder to change (most people cannot move their house) and will rely on new technology or dramatic societal changes to overcome. Moreover, in many cases the extended suburbs (exurbs) are populated by people with lower incomes than those closer to the city center.

We created a new dataset that combines data from the Census Bureau with driving distances between metro areas and the surrounding areas. We gathered the driving distances through a Web API (Bing Maps). We then built a model of commuting leverages driving distance and estimates carbon usage and carbon tax cost (set at $140). The carbon tax was looked at in absolute terms and relative to the median income. The impact varies across geography, with carbon taxes exceeding over 1% of median income in over half the county (just for commuting).","m1fname":"Keir","projectname":"Geographic and Financial Impact of Carbon Tax in the United States","m3fname":"Kosta"},{"m2lname":"Zheng","m4lname":"","m3uni":"jd3304","m1uni":"sf2785","m4uni":"","pid":"201612-59","m2uni":"sz2607","timestring":"Wed Dec 14 22:35:58 2016","m4fname":"","language":"mainly python, pycuda, pyspark","m3lname":"Du","dataset":"Data sets are public, including pollutant data and economic data","m1lname":"Feng","industry":"Life Science","analytics":"Multinominal naive bayes, Adaboost, Tableau","m2fname":"Shuyan","description":"Economic data are some time sensitive, so we could use pollutant data to predict economic data","m1fname":"Shuwei","projectname":"Relationship between Environment and Economic Development","m3fname":"Jinze"},{"m2lname":"Wang","m4lname":"","m3uni":"xy2306","m1uni":"jg3752","m4uni":"","pid":"201612-20","m2uni":"yw2875","timestring":"Wed Dec 14 22:36:00 2016","m4fname":"","language":"Java, SystemG, Eclipse","m3lname":"Yuan","dataset":"Yahoo! Movies Ratings and Information, v1.0a
Amazon Movie Data Reviews","m1lname":"Gao","industry":"Information","analytics":"Hybrid recommender systems
User-based Model
Content-based Model","m2fname":"Yizhou","description":"Build more effective movie recommendation system.
Highly used in both movie theaters and movie production companies.","m1fname":"Jiechao","projectname":"Hybrid Movie Recommender","m3fname":"Xing"},{"m2lname":"Shi","m4lname":"","m3uni":"","m1uni":"ya2366","m4uni":"","pid":"201612-5","m2uni":"hs2917","timestring":"Wed Dec 14 22:39:45 2016","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"It is the tick by tick market price data, of four types of futures: IFB1: CSI 300 XU1: FTSE China A50 HI1:Hang Seng HC1: H-shares. The data is from Wind financial Terminal, and it is public.","m1lname":"An","industry":"Finance","analytics":"Using logistic regression, decision tree and Neural Network for classification.","m2fname":"Huafeng","description":"Find possible trading strategies to make profit in Chinese Futures market. Our project will try to predict the price movement, and given the prediction result to make \"buy\" or \"sell\" decisions. Machine learning algorithm may include information that traditional financial models cannot include.","m1fname":"Yuting","projectname":"Trading Strategies in Chinese Future Market","m3fname":""},{"m2lname":"Halpert","m4lname":"","m3uni":"vss2113","m1uni":"yl3394","m4uni":"","pid":"201612-45","m2uni":"cph2133","timestring":"Wed Dec 14 22:40:03 2016","m4fname":"","language":"Hadoop, Spark, Python, R, CSS/HTML, AWS","m3lname":"Scherbich","dataset":"Yelp Dataset:
Round 8 (Nov - Dec 2016) of the Yelp Dataset Challenge:
https://www.yelp.com/dataset_challenge

1. 2.7M reviews and 649K tips by 687K users for 86K businesses
2. 566K business attributes, e.g., hours, parking availability, ambience.
3. Social network of 687K users for a total of 4.2M social edges.
4. Aggregated check-ins over time for each of the 86K businesses
5. 200,000 pictures from the included businesses ","m1lname":"Li","industry":"Retail","analytics":"1. Image Captioning: NeuralTalk2
2. Text-mining on Reviews: N-gram Document Term Matrix, topic modeling
3. Prediction Models on stars
4. Search Engine ","m2fname":"Christopher Paul","description":"Identify which factors lead restaurants to succeed or fail over time, and which main factors influence consumers’ reviews. The final output will be a data product based on our performance prediction analysis.
","m1fname":"Yanjin","projectname":"Personalized Predictions Yelp Business ","m3fname":"Vladislav Sergeevich "},{"m2lname":"Zhang","m4lname":"","m3uni":"jy2803","m1uni":"cr2826","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Wed Dec 14 22:41:15 2016","m4fname":"","language":"Mahout, Spark, Jyson, Python, Django, Bootstrap, HTML, CSS, Node.js, Git I/O, AWS RDS","m3lname":"Yu","dataset":"OKCupid Profile Data: 50k user profile data","m1lname":"Ren","industry":"Social Science-Government","analytics":"k-means: recommendation phase
Microsoft face API: Recognize face and find top-k similar faces based on user's photo","m2fname":"Lyujia","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation in this project is to integrate the facial recognition to the recommendation system.","m1fname":"Chuqiao","projectname":"MateFinder: The Next Generation of Dating Recommender System. ","m3fname":"Jinyang"},{"m2lname":"Zhao","m4lname":"","m3uni":"rz2357","m1uni":"nh2531","m4uni":"","pid":"69","m2uni":"yz2996","timestring":"Wed Dec 14 22:45:50 2016","m4fname":"","language":"Python & Flask & javascript","m3lname":"Zhang","dataset":"NYPD Complaint Data Historic and New York Historic Temperature
The complaint data includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of last year (2015).

NYC open data
Historical temperature records for NYC were added to the dataset and aggregated to the week level. The temperature data for New York was obtained from Prof. John Kissock’s website at the University of Dayton.
","m1lname":"Huang","industry":"Social Science-Government","analytics":"RandomForest
Pyspark
MLlib
Google Api
AWS
","m2fname":"Yi Han","description":" As we live in an international city,we all care about the crime rate because our safeties are very important.So we need to know some kind of facts like weather that maybe effect the crime rate.It will also provide a reference for people.
According to New York Times reports, weather may influence violence.
In a paper published in the journal Science, we assembled 60 of the best studies on this topic from fields as diverse as archaeology, criminology, economics, geography, history, political science and psychology. Typically, these were studies that compared, in a given population, levels of violence during periods of normal climate with levels of violence during periods of extreme climate.
We found that higher temperatures and extreme rainfall led to large increases in conflict: for each one standard deviation change in climate toward warmer temperatures or more extreme rainfall, the median effect was a 14 percent increase in conflict between groups, and a 4 percent increase in conflict between individuals.

","m1fname":"Neng","projectname":"Criminal Almanac","m3fname":"Ruo Meng"},{"m2lname":"Dong","m4lname":"","m3uni":"zj2213","m1uni":"yw2812","m4uni":"","pid":"201612-4","m2uni":"cd2847","timestring":"Wed Dec 14 22:46:20 2016","m4fname":"","language":"R, Python, Spark","m3lname":"Jia","dataset":"The dataset is from New York Times Article Search API through nytimesarticle package in Python. We have 798 csv files in total and the size is 2.92GB. ","m1lname":"Wang","industry":"Finance","analytics":"1. Unigram TF-IDF is used to generate features.
2. Try some machine learning algorithms to build a predictive model.
3. Employ sparsity-inducing methods such as LASSO and Elastic Net regularization for the exploratory model.","m2fname":"Chunfeng","description":"An interesting phenomenon is emerging in the stock market: as more economists pay attention to social media data, many hedge funds have abandoned their traditional forecasting techniques and adopted new ways to profit by predicting stock markets from social media data. Motivated by this idea, we decided to examine how well we could forecast the S&P 500 based specifically on NY Times data. Several research papers have shown how to forecast stock prices using tweets as reflections of people’s mood, but papers seldom collect text data from mainstream media, which is why we decided to explore text data from the NY Times.

Different from point forecasting, we are going to predict the price change in S&P 500 by making directional forecasting, which in specific, corresponding to if the predicted stock price going up or down, we mark “+1” and “-1” as the result respectively. By comparing with the actual price increase or decrease in S&P 500, we could examine the correct rate of prediction and conclude if the forecasting technique is precise enough.","m1fname":"Yusen","projectname":"Forecasting S&P 500 Index Using NY Times Data","m3fname":"Ze"},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"ls3301","m4uni":"","pid":"201612-1","m2uni":"qz2246","timestring":"Wed Dec 14 22:49:03 2016","m4fname":"","language":"Python, Spark, Tableau","m3lname":"","dataset":"Million song dataset","m1lname":"SHI","industry":"Information","analytics":"hierarchical clustering , PCA, feature selection","m2fname":"Qianyun","description":"Music clustering and music recommendation system","m1fname":"LUYUAN","projectname":"Music Time","m3fname":""},{"m2lname":"Halpert","m4lname":"","m3uni":"vss2113","m1uni":"yl3394","m4uni":"","pid":"201612-45","m2uni":"cph2133","timestring":"Wed Dec 14 22:50:06 2016","m4fname":"","language":"Hadoop, Spark, Python, R, CSS/HTML, AWS ","m3lname":"Scherbich","dataset":"Yelp Dataset:
Round 8 (Nov - Dec 2016) of the Yelp Dataset Challenge:
https://www.yelp.com/dataset_challenge

1. 2.7M reviews and 649K tips by 687K users for 86K businesses
2. 566K business attributes, e.g., hours, parking availability, ambience.
3. Social network of 687K users for a total of 4.2M social edges.
4. Aggregated check-ins over time for each of the 86K businesses
5. 200,000 pictures from the included businesses ","m1lname":"Li","industry":"Retail","analytics":"1. Image Captioning: NeuralTalk2
2. Text-mining on Reviews: N-gram Document Term Matrix, topic modeling
3. Prediction Models on stars
4. Search Engine","m2fname":"Christopher Paul","description":"Identify which factors lead restaurants to succeed or fail over time, and which factors most influence consumers’ reviews. The final output will be a data product based on our performance prediction analysis.
","m1fname":"Yanjin","projectname":"Personalized Predictions Yelp Business ","m3fname":"Vladislav Sergeevich "},{"m2lname":"Li","m4lname":"","m3uni":"xd2163","m1uni":"xc2364","m4uni":"","pid":"201612-94","m2uni":"sl4063","timestring":"Wed Dec 14 22:51:32 2016","m4fname":"","language":"Hadoop, Spark, Python, Matlab, Beautiful Soup","m3lname":"Dong","dataset":"We wrote a web crawler using the Beautiful Soup Python library to gather online news from The New York Times, Bloomberg and Business Insider.","m1lname":"Chen","industry":"Information","analytics":"We use Natural Language Processing techniques to extract information from the online news dataset.
We also use Support Vector Machine regression for modeling and prediction.
We use Polynomial Regression for model evaluation.","m2fname":"Sijia","description":"Novelty
• Traditional economic and business forecasting has relied on statistics gathered by government agencies, annual reports, and financial statements, with little or no regard to the impact of human behavior.
• Sentiment analysis of online news and search engine query data has made it possible to monitor and quantify human behaviors in a systematic way.

Business Value
• In this project, we’d like to examine how big data concerning the above two factors can be used to predict the real estate market trend.","m1fname":"Xiaofei","projectname":"Real estate market forecasting model based on search engine query data and NLP sentiment analysis","m3fname":"Xu"},{"m2lname":"Song","m4lname":"","m3uni":"yt2549","m1uni":"hx2208","m4uni":"","pid":"201612-56","m2uni":"zs2324","timestring":"Wed Dec 14 22:55:48 2016","m4fname":"","language":"Spark, python","m3lname":"Tie","dataset":"Yelp data.
First we use the Yelp API to get business IDs, then we wrote a Scrapy spider that fetches the data we want by making requests built directly from each business ID.
After our IP was banned, we combined the data we had scraped with the Yelp dataset we downloaded from Yelp Dataset Challenge round 8. ","m1lname":"Xu","industry":"Information","analytics":"Scrapy to get data on Yelp website.
Collaborative filtering
K-NN
linear regression","m2fname":"Zehao","description":"Yelp is an excellent application for finding restaurants. People share their dining experiences on Yelp, and Yelp recommends good restaurants based on people’s rating history. However, Yelp currently cannot recommend people with similar eating habits to dine together.
This is a new functionality that Yelp does not support, if implemented well, there would be a great business potentials and could attract more users.","m1fname":"Hao","projectname":"Eating Mate Recommendation","m3fname":"Yutong"},{"m2lname":"Yang","m4lname":"","m3uni":"sl4039","m1uni":"sf2794","m4uni":"","pid":"201612-84","m2uni":"gy2237","timestring":"Wed Dec 14 22:56:10 2016","m4fname":"","language":"python on spark","m3lname":"Liu","dataset":"Data on past NBA game (each game, each team, each player)","m1lname":"Fang","industry":"Information","analytics":"naive bayes","m2fname":"Guang","description":"Prediction for NBA game result and final table. Based on the performance of each player in the past games.","m1fname":"Shu","projectname":"Data Analysis For Basketball Matches Outcomes Prediction ","m3fname":"Shutian"},{"m2lname":"Zhao","m4lname":"","m3uni":"rz2357","m1uni":"nh2531","m4uni":"","pid":"201612-69","m2uni":"yz2996","timestring":"Wed Dec 14 22:58:20 2016","m4fname":"","language":"Python & Flask & javascript","m3lname":"Zhang","dataset":"NYPD Complaint Data Historic(1.4G): includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of last year (2015).

New York Historic Temperature: Historical temperature records for the NYC were augmented to the dataset. The temperature records were aggregated up to the week level. The temperature data for New York was obtained from Prof. John Kissock’s website at the University of Dayton.
","m1lname":"Huang","industry":"Social Science-Government","analytics":"Random Forest
Pyspark
MLlib
AWS
Flask
HTML
Javascript
sparksql","m2fname":"Yi Han","description":" As we live in an international city, we all care about the crime rate because our safety is very important. So we need to know which factors, such as weather, may affect the crime rate. The results will also provide a reference for the public.
According to New York Times reports, weather may influence violence.
In a paper published in the journal Science, we assembled 60 of the best studies on this topic from fields as diverse as archaeology, criminology, economics, geography, history, political science and psychology. Typically, these were studies that compared, in a given population, levels of violence during periods of normal climate with levels of violence during periods of extreme climate.
We found that higher temperatures and extreme rainfall led to large increases in conflict: for each one standard deviation change in climate toward warmer temperatures or more extreme rainfall, the median effect was a 14 percent increase in conflict between groups, and a 4 percent increase in conflict between individuals.
","m1fname":"Neng","projectname":"Criminal Almanac","m3fname":"Ruo Meng"},{"m2lname":"Zhao","m4lname":"","m3uni":"jc4609","m1uni":"yz2978","m4uni":"","pid":"201612-30","m2uni":"yz3007","timestring":"Wed Dec 14 23:01:09 2016","m4fname":"","language":"pySpark","m3lname":"Chen","dataset":"We mainly utilize the dataset from Kaggle (with the size of 4.1GB) for our recommendation system. It is used for training and testing the recommendation model.","m1lname":"Zheng","industry":"Information","analytics":"The functionality of algorithm is recommending five most suitable cluster for the user based on the information provided. Intuitionally, we find that some factors play a more significant role in modeling a user’s taste than rest of variables. Therefore, we prioritize different features and set up five steps based on accuracy(you can see in detail in our slides).
","m2fname":"Yufei","description":"Objective:
When you are planning a trip, choosing a hotel can be overwhelming. We therefore want to design a recommendation system that helps users find suitable hotels.

Innovation:
We incorporate hotel markets, the user’s current location, and the destination into a recommendation model on the Spark platform. Using our algorithm, we can recommend the top 5 hotel types/clusters to each user according to their particular needs.
","m1fname":"Yu","projectname":"Expedia Hotel Exploration","m3fname":"Jingshi"},{"m2lname":"Yu","m4lname":"","m3uni":"","m1uni":"ql2260","m4uni":"","pid":"201612-12","m2uni":"jy2731","timestring":"Wed Dec 14 23:01:30 2016","m4fname":"","language":"Python: Nltk enchant sklearn, etc packages javascript, AWS Web Server Hadoop, mahout","m3lname":"","dataset":"The data for classfier training process:http://help.sentiment140.com/for-students/
It's a raw Twitter text dataset with sentiment labels, 128M, provided by Sentiment140 (an online tool for Twitter sentiment analysis)
test data: Twitter streaming API ~1G
","m1lname":"Li","industry":"Information","analytics":"Python: sklearn package
Mahout: Built-in function for training classifier
Logistic Regression
Logistic Regression is a type of probabilistic statistical classification model.
It is used to predict a binary response, that is, the outcome of a
categorical dependent variable (i.e., a class label),
based on one or more predictor variables (features).
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic classifiers based
on applying Bayes' theorem with strong (naive) independence assumptions
between the features.
Linear SVM
Similar to SVC with parameter kernel=’linear’, but implemented in terms of
liblinear rather than libsvm, so it has more flexibility in the choice of penalties
and loss functions and should scale better to large numbers of samples.
","m2fname":"JiaYing","description":"music is such an important part of our daily life, besides its function to help us release stress, it's also a great topic for daily conversation. We spend lots of time streaming through music to stay in trend, but it's hard to detect the hottest song or artist quickly before it's too late. Although we have access to all kinds of music charts from main stream meida, but their opinion may differ from the public's view. Today we'll build a music recommender website using twitter streaming data to deliver the newest music trend, you'll be able to know the most-discussed music and people's real attitude about the music.","m1fname":"Qian","projectname":"Music recommender based on Twitter Stream","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"yw2864","m4uni":"","pid":"201612-93","m2uni":"","timestring":"Wed Dec 14 23:02:50 2016","m4fname":"","language":"java hadoop","m3lname":"","dataset":"Using all the novels of Jinyong and the character name in these novels.
1. http://www.txthj.com/post/195
2. http://wenku.baidu.com/view/ea10a8c34028915f804dc22e.html?re=view (word)
3. http://www.jinyongwang.com/data/renwu/
Typically, this program can support any Chinese novel by changing the name list and the text.","m1lname":"wang","industry":"Media","analytics":"Word co-occurrence; PageRank Algorithm; Label propagation algorithm","m2fname":"","description":" If a series of novels has too many characters, finding the relationships between the characters and the different factions is difficult to do by hand.
To show this relationship easily, we developed this program.","m1fname":"yangming","projectname":"THE CHARACTER RELATIONSHIP IN SERIES CHINESE NOVEL","m3fname":""},{"m2lname":"Jain","m4lname":"","m3uni":"ha2434","m1uni":"av2674","m4uni":"","pid":"201612-49","m2uni":"bj2346","timestring":"Wed Dec 14 23:03:24 2016","m4fname":"","language":"Python, Spark, Tableau","m3lname":"Arora","dataset":"The Yelp Dataset provided by the Yelp Dataset challenge was used","m1lname":"Vasudevan","industry":"Information","analytics":"Topic Modeling - LDA
Word2vec - Word Vectorization
Textblob - Sentiment Analysis
Tableau - Visualization","m2fname":"Bhawika","description":"The objective of the project was to use Text Analytics to understand the topics people talk about and the sentiments people express in the Yelp Dataset. We hope that these techniques can be implemented so that businesses can better understand the customers' needs and the customer can better understand the businesses' strengths and weaknesses. ","m1fname":"Avinesh","projectname":"Restaurant Performance Quantification","m3fname":"Himani"},{"m2lname":"Zhu","m4lname":"","m3uni":"zz2371","m1uni":"bt2414","m4uni":"","pid":"201612-22","m2uni":"yz2868","timestring":"Wed Dec 14 23:13:39 2016","m4fname":"","language":"PySpark, Flask, CherryPy and wordle ","m3lname":"Zhang","dataset":"1. Description
The data is from GroupLens's MovieLens dataset. We mainly used four CSV files to build our system:
(1) movies.csv(all movies list; recommendation and clustering features)
(2) ratings.csv(all users' ratings to movies; recommendation and clustering features)
(3) genome-tags.csv(all tags list; visualization)
(4) genome-score.csv(all movies' related tags with calculated relevance; visualization)

2. DataSets Access
complete_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
The complete dataset is 1G and the small dataset is 3.1MB(Used for testing).

3. Records
In total, the dataset records 40,110 movies and 24,404,096 ratings.
There are 1,130 tags in total.
Each movie can have many tags; across all movies there are 12,040,273 movie-tag attachments.
","m1lname":"Tong","industry":"Information","analytics":"Algorithm: ALS Recommendation Model, KMeans Recommendation Model
Visualization: Word Cloud For Movies","m2fname":"Yunxuan","description":"Many previous projects built movie recommendation systems, and they all used the Hadoop MapReduce programming model combined with Apache Mahout to build their engines. However, Hadoop MapReduce is not an efficient model for building recommendation systems, because recommendation engines rely on many iterative machine learning algorithms, and MapReduce may repeat a lot of work when running such algorithms. We therefore use Apache Spark’s in-memory processing and pipeline features to build a more efficient recommendation system.
Moreover, traditional movie recommendation systems simply return the movies their ML algorithms predict will be highly rated. This does not help customers understand the recommended movies in any depth. We therefore visualize previous customers’ comments on the movies to improve the customer experience.
","m1fname":"Benjie","projectname":"Visualized Movie Recommendation System","m3fname":"Zhiwang"},{"m2lname":"An","m4lname":"","m3uni":"hs2917","m1uni":"sz2629","m4uni":"","pid":"201612-5","m2uni":"ya2366","timestring":"Wed Dec 14 23:16:25 2016","m4fname":"","language":"Python, Spark ","m3lname":"Shi","dataset":" It is the tick by tick market price data, of four types of futures: IFB1: CSI 300 XU1: FTSE China A50 HI1:Hang Seng HC1: H-shares. The data is from Wind financial Terminal, and it is public. ","m1lname":"ZHOu","industry":"Finance","analytics":"Using logistic regression, decision tree and Neural Network for classification.","m2fname":"Yuting","description":"Find possible trading strategies to make profit in Chinese Futures market. Our project will try to predict the price movement, and given the prediction result to make \"buy\" or \"sell\" decisions. Machine learning algorithm may include information that traditional financial models cannot include.
","m1fname":"Sitong","projectname":"Trading Strategies in Chinese Future Market","m3fname":"Huafeng"},{"m2lname":"Zhang","m4lname":"","m3uni":"xs2278","m1uni":"yc3228","m4uni":"","pid":"201612-79","m2uni":"wz2348","timestring":"Wed Dec 14 23:17:35 2016","m4fname":"","language":"Eclipse(Java), Spark(Python), Hadoop","m3lname":"Sun","dataset":"MovieLens http://grouplens.org/datasets/movielens/latest","m1lname":"Chai","industry":"Information","analytics":"1. Collaborative filtering
Collaborative filtering is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.
2. Content-based filtering
Content-based filtering tries to recommend items that are similar to those that a user liked in the past (or is examining in the present).
3. Hybrid recommender systems
Hybrid recommender systems combine collaborative filtering and content-based filtering.
Collaborative filtering","m2fname":"Wanjia","description":"Nowadays, there are more and more new movies are made every year, and they involve various categories. Many websites will let people post their reviews and ratings about the movie, and tag these movies. System can recommend people the movies they are interested in according to these reviews and ratings.

Expected Outcome: Recommend movies people may be interested in based on their ratings of other movies.

This system can help people find movies they want easily and save their time.","m1fname":"Yunzi","projectname":"Movie Recommendation","m3fname":"Xuecheng"},{"m2lname":"Zhao","m4lname":"","m3uni":"rz2357","m1uni":"nh2531","m4uni":"","pid":"201612-69","m2uni":"yz2996","timestring":"Wed Dec 14 23:19:31 2016","m4fname":"","language":"Python & Flask & javascript","m3lname":"Zhang","dataset":"NYPD Complaint Data Historic(1.4G): includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) from 2006 to the end of last year (2015).

New York Historic Temperature: Historical temperature records for the NYC were augmented to the dataset. The temperature records were aggregated up to the week level. The temperature data for New York was obtained from Prof. John Kissock’s website at the University of Dayton.
","m1lname":"Huang","industry":"Social Science-Government","analytics":"Random Forest
Pyspark
MLlib
AWS
Flask
HTML
Javascript
sparksql","m2fname":"Yi Han","description":" As we live in an international city, we all care about the crime rate because our safety is very important. So we need to know which factors, such as weather, may affect the crime rate. The results will also provide a reference for the public.
According to New York Times reports, weather may influence violence.
In a paper published in the journal Science, we assembled 60 of the best studies on this topic from fields as diverse as archaeology, criminology, economics, geography, history, political science and psychology. Typically, these were studies that compared, in a given population, levels of violence during periods of normal climate with levels of violence during periods of extreme climate.
We found that higher temperatures and extreme rainfall led to large increases in conflict: for each one standard deviation change in climate toward warmer temperatures or more extreme rainfall, the median effect was a 14 percent increase in conflict between groups, and a 4 percent increase in conflict between individuals.
","m1fname":"Neng","projectname":"Criminal Almanac","m3fname":"Ruo Meng"},{"m2lname":"Cheng","m4lname":"","m3uni":"qh2174","m1uni":"ds3516","m4uni":"","pid":"201612-82","m2uni":"pc2756","timestring":"Wed Dec 14 23:19:56 2016","m4fname":"","language":"Python, R, Spark, System G","m3lname":"Hu","dataset":"The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes).","m1lname":"Sui","industry":"Retail","analytics":"User-based Collaborative Filtering
Alternating Least Squares
Item to Item similarity
Breadth-First Search","m2fname":"Panpan","description":"Combination of User-based Collaborative Filtering and Item to Item Recommendation Process
Pursuing a personalized and abundant recommendation chain to open the market","m1fname":"Danning","projectname":"E-commercial Product Recommendations","m3fname":"Qiong"},{"m2lname":"Zhang","m4lname":"","m3uni":"yw2910","m1uni":"bg2567","m4uni":"","pid":"201612-35","m2uni":"jz2793","timestring":"Wed Dec 14 23:21:05 2016","m4fname":"","language":" Python, SQL/Hadoop, Spark, Hive","m3lname":"Wang","dataset":"Motor Vehicle Collisions datasets from public or government websites.

https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data
https://data.ny.gov/Transportation/Motor-Vehicle-Crashes-Case-Information-Three-Year-/e8ky-4vqe
https://data.cityofnewyork.us/Transportation/2015-Green-Taxi-Trip-Data/gi8d-wdg5

","m1lname":"Gao","industry":"Transportation","analytics":"K-means clustering to find the most likely locations of collisions.
Visualization of summary statistics from the datasets.","m2fname":"Jiajun","description":"Our goal is to draw conclusions about collisions from the datasets.
We would like to find:
1. the district and street with the highest likelihood of a collision
2. the times when collisions are most likely
3. the types of motor vehicle most likely to cause a collision
4. the most common cause of collisions
5. the outcome of the collision (how many people were injured or killed)

These conclusions can help people take care themselves at some place for some time and also help police arrange their policemen well, which would help reduce the number of motor vehicle collisions. ","m1fname":"Borui","projectname":"Research on Motor Vehicle Collisions in NYC","m3fname":"Yufu"},{"m2lname":"Zhang","m4lname":"","m3uni":"hl2919","m1uni":"zy2233","m4uni":"","pid":"201612-61","m2uni":"yz2866","timestring":"Wed Dec 14 23:21:42 2016","m4fname":"","language":"Python, Spark, Hadoop, System G","m3lname":"Li","dataset":" http://jmcauley.ucsd.edu/data/amazon/
This dataset contains product reviews and metadata from Amazon. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).","m1lname":"Yang","industry":"Media","analytics":"Recommendation (Mahout)
Natural Language Processing (Spark)
Classification (Spark)
Visualization (System G and Flask)
","m2fname":"Yunfan","description":"A web application realizing three kinds of recommendation:
1. Recommend unread books to users.
2. Recommend users with similar taste to users.
3. Recommend potential helpful reviews to users.

","m1fname":"Zhuangfei","projectname":"Book Recommendation Engine","m3fname":"Huashu"},{"m2lname":"Qiu","m4lname":"","m3uni":"","m1uni":"rz2364","m4uni":"","pid":"201612-60","m2uni":"qy2207","timestring":"Wed Dec 14 23:21:57 2016","m4fname":"","language":"python Spark","m3lname":"","dataset":"We get the dataset from a related research
and here are the datasets.
1. The Harvard-Haskins Database of Regularly-Timed Speech http://www.nsi.edu/~ani/download.html
2. VoxForge Speech Corpus http://www.repository.voxforge1.org/downloads/SpeechCorpus/Trunk/Audio/Main/8kHz_16bit/
3. Festvox CMU_ARCTIC Speech Database at Carnegie Mellon University http://festvox.org/cmu_arctic/","m1lname":"Zhang","industry":"Information","analytics":"Naive Bayes, Random Forest, Gradient-Boosted Tree","m2fname":"Yunlei","description":"Objectives
Use acoustic characteristics (mean frequency, skew, etc.) to determine the gender of a speaker, and compare different classifiers (Random Forest, Naive Bayes, Logistic Regression).

Reason
1. Gender recognition is a key part of identification
2. The results of gender recognition can be used in many other fields (gender studies, etc.)
3. Knowledge of acoustic characteristics is very helpful for processing and restoring data

","m1fname":"Ruixuan","projectname":"Gender Recognition by Voice","m3fname":""},{"m2lname":"Chen","m4lname":"","m3uni":"","m1uni":"pw2435","m4uni":"","pid":"201612-9","m2uni":"xc2363","timestring":"Wed Dec 14 23:22:40 2016","m4fname":"","language":"R, Python, Spark","m3lname":"","dataset":"Yahoo Finance& Stocktwists

Yahoo Finance data is downloaded with the R quantmod package
StockTwits data is downloaded with Node.js

","m1lname":"Wang","industry":"Media","analytics":"1. Classification Method in Yahoo Finance
2. Sentiment Analysis on StockTwits Review Data
3. Compare Two Results
4. Time series models","m2fname":"Xuanyu","description":"Stock review websites are surrounded by masses of unstructured data (social media, blogs, etc.).
Big data analysis is increasingly being used to provide deep insight into, and predictive analysis of, stock market movements and individual investment behaviors. Those able to harness the power of this disruptive force in markets will benefit by being smarter, faster and more efficient.
","m1fname":"Pingyuan","projectname":"Social Media and Investment","m3fname":""},{"m2lname":"Ji","m4lname":"","m3uni":"","m1uni":"xy2282","m4uni":"","pid":"201612-98","m2uni":"xj2178","timestring":"Wed Dec 14 23:23:19 2016","m4fname":"","language":"Hadoop, PIG, Spark, Python, D3.js, AngularJs, Google map API","m3lname":"","dataset":"Dataset:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
","m1lname":"Yu","industry":"Transportation","analytics":"ALGORITHMS
Clustering (GMM)
Random Forest
Cross Validation

VISUALIZATION
D3.js
AngularJs
Google map API","m2fname":"Xiangbing","description":"Traditionally, taxi drivers pick up customers at random. Our project aims to build a web application that shows drivers the real-time demand for taxis at a specific location.

The results can help the government better allocate NYC taxi traffic, help drivers increase their revenue, and shorten customers' average waiting time.

Main Functionalities:
1. Calculate the importance of each factor that may influence taxi demand.
2. Plot a real-time density map showing taxi demand.
3. Use a Gaussian mixture model to find several high-demand clusters for taxis and display them with the Google Maps API.
","m1fname":"Xinzhe","projectname":"NYC Taxi Data Analysis","m3fname":""},{"m2lname":"Mudgal","m4lname":"","m3uni":"ss5136","m1uni":"vm2486","m4uni":"","pid":"201612-83","m2uni":"am4590","timestring":"Wed Dec 14 23:24:42 2016","m4fname":"","language":" Python(numpy, pandas, scikit-learn), Spark, Databricks, SystemG","m3lname":"Singh","dataset":"The data represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, files are separated by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was ta
","m1lname":"Mahajan","industry":"Social Science-Government","analytics":" Naive bayes, random forest, Gradient boosting, Multi-layer Perceptron","m2fname":"Aayush","description":"To bring down the cost of manufacturing it is imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. We are faced with the task of predicting the internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable the production company to bring quality products at lower costs to the end user.
The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).","m1fname":"Vibhuti","projectname":"Reducing Manufacturing Failures ","m3fname":"Sheallika"},{"m2lname":"Wang","m4lname":"","m3uni":"jz2784","m1uni":"jz2776","m4uni":"","pid":"201612-70","m2uni":"aw3001","timestring":"Wed Dec 14 23:25:21 2016","m4fname":"","language":"Python, Spark","m3lname":"Zhang","dataset":"Dataset: The restaurant review text from the 8th Round Yelp Dataset Challenge

2.7M reviews and 649K tips by 687K users for 86K businesses
566K business attributes, e.g., hours, parking availability, ambiance
Social network of 687K users for a total of 4.2M social edges","m1lname":"Zhang","industry":"Information","analytics":"Algorithm:
1. Developed Multi-aspect Sentiment Analysis
2. Developed Word Embedding Collaborative Filtering
3. Developed Distributed Stochastic Gradient Descent

Tools: Gensim (Word2Vec), TextBlob, Spark","m2fname":"An","description":"Traditional recommender systems look only at numeric ratings and discard users' review text.

However, the review text justifies the ratings and therefore contains more information about users’ preferences as well as the properties of items.

We want to build a more accurate recommender system by interpreting the information in the review text using multi-aspect sentiment analysis.
","m1fname":"Jinyi","projectname":"Word Embedding Collaborative Filtering Model","m3fname":"Jia"},{"m2lname":"Zhang","m4lname":"","m3uni":"yw2910","m1uni":"bg2567","m4uni":"","pid":"201612-35","m2uni":"jz2793","timestring":"Wed Dec 14 23:26:06 2016","m4fname":"","language":" Python, SQL/Hadoop, Spark, Hive","m3lname":"Wang","dataset":"Our data of motor vehicle collisions is from https://data.ny.gov/ & https://nycopendata.socrata.com/.
The data contains information such as date, time, location (latitude, longitude), borough, and kinds of conditions.

","m1lname":"Gao","industry":"Transportation","analytics":"K-means clustering to find the most likely locations of collisions.
Visualization algorithm for statistical results of the datasets.","m2fname":"Jiajun","description":"Our goal is to draw conclusions about collisions from the databases.
We would like to get conclusions such as:
1. the district and street where a collision is most likely
2. the time when collisions are most likely
3. the most common reason for the collision

These conclusions can help people take extra care at particular places and times, and help police deploy officers effectively, which would help reduce the number of motor vehicle collisions. ","m1fname":"Borui","projectname":"Research on Motor Vehicle Collisions in NYC","m3fname":"Yufu"},{"m2lname":"Bao","m4lname":"","m3uni":"yz2831","m1uni":"ha2399","m4uni":"","pid":"201612-21","m2uni":"wb2304","timestring":"Wed Dec 14 23:26:51 2016","m4fname":"","language":"Spark, Python, R","m3lname":"Zhang","dataset":"We downloaded our data from https://www.kaggle.com/c/santander-product-recommendation/data; the uncompressed data size is over 2.3 GB.
Our software could support both categorical and numeric data. ","m1lname":"An","industry":"Finance","analytics":"Collaborative Filtering
Weighted RFM
Hybrid Filtering
K-means
Association rules","m2fname":"Wenhang","description":"Our project is trying to explore an effective recommendation algorithm to predict which bank product a consumer will be most likely to purchase in the following month based on their past behavior and that of similar customers.

We used advanced model to calculate the similarities between users, which significantly changed the way to customize the recommendation system. ","m1fname":"Huilong","projectname":"Santander Products Recommendation ","m3fname":"Yifan"},{"m2lname":"Mohsin","m4lname":"","m3uni":"","m1uni":"nt2320","m4uni":"","pid":"201612-76","m2uni":"sbm2164","timestring":"Wed Dec 14 23:27:07 2016","m4fname":"","language":"Scala, Python, IBM BlueMix, Spark, IBM Watson Tone Analyzer, iPython ","m3lname":"","dataset":"The dataset tested was a streamed set of tweets (for 24 hours). This was acquired using spark's streaming toolkit. ","m1lname":"Tuncer","industry":"Social Science-Government","analytics":"The streamed tweets were categorized using popular political keywords. The IBM Watson Tone Analyzer was used to calculate sentiment values. iPython was used to create a visualization of the results (bar graphs). ","m2fname":"Syed","description":"There were many mixed emotions regarding the 2016 US election results. Our project aims to analyze the spectrum of reactions through Twitter - an outlet used by millions everyday - to create an emotional response profile. This research is important to better understand how the world is reacting and to create a meaning from the wide range of emotions. ","m1fname":"Nazli","projectname":"Post-Election Politics","m3fname":""},{"m2lname":"Zhu","m4lname":"","m3uni":"dz2337","m1uni":"sf2788","m4uni":"","pid":"201612-55","m2uni":"kz2250","timestring":"Wed Dec 14 23:30:38 2016","m4fname":"","language":"language: Java, python tools:Spark, Flask, WebSocketd, AngularJs, D3.js,","m3lname":"Zhou","dataset":"https://www.quandl.com/browse
https://www.google.com/finance
http://finance.yahoo.com/quote/GOOG/history?p=GOOG ","m1lname":"Fu","industry":"Finance","analytics":"ML clustering/classification algorithms such as Decision tree, random forest, or logistic regression, neural networks, K-means if we have enough time, etc
","m2fname":"Kan","description":"The stock market goes up and down everyday, in order to yield significant profit for investors, this project plan to design a web that can predict the value of a company stock or futures. For the stock market, there are two analysis, the fundamental analysis and technical analysis. The fundamental analysis evaluate a company's past performance and the credibility of its account, the purpose is to check whether the price of the stock now can reasonably satisfy long-term development. And the technical analysis seeks to predict change of stock value in short-term based solely on the potential trends of the past price. The goal of our project is focus on the technical analysis, aiming at providing more short-term information of stock and help investors make right decisions.","m1fname":"Shiq","projectname":"Stock Portfolio Prediction and Recommendation","m3fname":"Duoying"},{"m2lname":"Zhu","m4lname":"","m3uni":"zz2371","m1uni":"bt2414","m4uni":"","pid":"201612-22","m2uni":"yz2868","timestring":"Wed Dec 14 23:30:41 2016","m4fname":"","language":"PySpark, Flask, CherryPy, HTML, JQuery, JavaScript, Jinja, Boostrap Theme and wordle","m3lname":"Zhang","dataset":"
1. Description
The data is from GroupLens's MovieLens dataset. We mainly used four CSV files in building our system:
(1) movies.csv(all movies list; recommendation and clustering features)
(2) ratings.csv(all users' ratings to movies; recommendation and clustering features)
(3) genome-tags.csv(all tags list; visualization)
(4) genome-score.csv(all movies' related tags with calculated relevance; visualization)

2. DataSets Access
complete_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest.zip'
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
The complete dataset is 1 GB and the small dataset is 3.1 MB (used for testing).

3. Records
There are 40,110 movies and 24,404,096 ratings in total recorded in the dataset.
There are 1,130 tags in total.
Each movie could have a lot of tags. And the attachment number for all the movies is 12,040,273.","m1lname":"Tong","industry":"Information","analytics":"Algorithms: Recommendation, KMeans Clustering
Visualization: Word Cloud","m2fname":"Yunxuan","description":"Many previous projects built movie recommendation systems, and they all used the Hadoop MapReduce programming model combined with Apache Mahout to build their engines. However, Hadoop MapReduce is not an efficient model for building recommendation systems, because recommendation engines rely on many iterative machine learning algorithms, and MapReduce may do a lot of duplicated work when running those kinds of algorithms. Thus, we use Apache Spark’s in-memory processing and pipeline features to build a more efficient recommendation system.
Moreover, traditional movie recommendation systems just return their predicted high-rated movies (from ML algorithms) to their customers. This may not be intuitive enough for customers to deeply understand the recommended movies. Thus, we want to visualize all the previous customers’ comments on movies to improve the customer experience.
","m1fname":"Benjie","projectname":"Visualized Movie Recommendation System","m3fname":"Zhiwang"},{"m2lname":"Qiu","m4lname":"","m3uni":"","m1uni":"yt2558","m4uni":"","pid":"201612-11","m2uni":"cq2192","timestring":"Wed Dec 14 23:31:41 2016","m4fname":"","language":"Python, R, Scala, Spark, Hadoop, PostgreSQL, PostGIS, Google Cloud Platform ","m3lname":"","dataset":"NYC TLC Taxi Trip Dataset ","m1lname":"Tanaka","industry":"Information","analytics":"Random Forest
Gaussian Mixture Model","m2fname":"Congying","description":"Objectives:
1. Tip prediction for potential business application
2. GPS noise modeling for highly-accurate GPS data

Innovations:
1. Novel objective variable (whether hourly tip is more than $12 or not)
2. Proposed a way to estimate true avenue membership for each data point
","m1fname":"Yasutaka","projectname":"Tip Prediction and GPS Noise Modeling on NYC Taxi Dataset","m3fname":""},{"m2lname":"Zhu","m4lname":"","m3uni":"yg2430","m1uni":"ht2438","m4uni":"","pid":"201612-54","m2uni":"yz3021","timestring":"Wed Dec 14 23:32:07 2016","m4fname":"","language":"Java (Eclipse) and Python (Theano)","m3lname":"Guo","dataset":"In this project, we use the amazon dataset(rating only) download from http://jmcauley.ucsd.edu/data/amazon/links.html
This dataset includes 24 different categories separate rating data and one total data, the total dataset is 3.2gb.
","m1lname":"Tu","industry":"Retail","analytics":"Hadoop Distributed File System, Mahout item-based recommendation, Mahout user-based recommendation, Theano","m2fname":"Yanjia","description":"Online-shopping has become extremely common in our daily life and recommendation system is an indispensable part of that. Thus the performance of recommendation is very significant. However, the recommendation accuracy suffers from the lack of different rule for different item categories, like books, videogames, cellphones. So we want to try different recommendation schema on the different categories's rating data and compare the evaluation results with each other to have a better understanding of the link between item category and recommendation performance. Eventually, we hope this finding would lead to enhance of recommendation research on electronic merchandise.","m1fname":"Huaiyuan","projectname":"Recommendation Schema Performance Comparison on Amazon Rating Dataset","m3fname":"Yixuan"},{"m2lname":"Zhang","m4lname":"","m3uni":"kw2628","m1uni":"cl3390","m4uni":"","pid":"201612-39","m2uni":"hz2400","timestring":"Wed Dec 14 23:33:53 2016","m4fname":"","language":"SparkR, Tableau, Rstudio","m3lname":"Wang","dataset":"Dataset:
Available over Kaggle
Only the training set (4.1 G) is used, because the outcome variable (hotel_cluster) is not available in the test dataset.
https://www.kaggle.com/c/expedia-hotel-recommendations/data","m1lname":"Liu","industry":"Retail","analytics":"Analytics & Visualization:
1. Exploratory analysis (Tableau)

Algorithms:
1. Random Forest
2. Self-designed recommender","m2fname":"Haoyan","description":"Objective:
Make recommendations for users who would like to book hotels on the Expedia website.

Innovations:
Self-produced algorithm to provide recommendation

Expected Outcome:
Top ten likely potential hotel clusters for each user out of 100 scored hotel clusters.

Why:
Hotel booking websites such as Expedia would like to improve the quality of their recommendations. \"Nothing is better than being greeted by your favorite drink just as you walk through the door of the corner cafe.\"","m1fname":"Chang","projectname":"Expedia Hotel Recommendations","m3fname":"Kaisheng"},{"m2lname":"Zhang","m4lname":"","m3uni":"kw2628","m1uni":"cl3390","m4uni":"","pid":"201612-39","m2uni":"hz2400","timestring":"Wed Dec 14 23:34:05 2016","m4fname":"","language":"SparkR, Tableau, Rstudio","m3lname":"Wang","dataset":"Dataset:
Available over Kaggle
Only the training set (4.1 G) is used, because the outcome variable (hotel_cluster) is not available in the test dataset.
https://www.kaggle.com/c/expedia-hotel-recommendations/data","m1lname":"Liu","industry":"Retail","analytics":"Analytics & Visualization:
1. Exploratory analysis (Tableau)

Algorithms:
1. Random Forest
2. Self-designed recommender","m2fname":"Haoyan","description":"Objective:
Make recommendations for users who would like to book hotels on the Expedia website.

Innovations:
Self-produced algorithm to provide recommendation

Expected Outcome:
Top ten likely potential hotel clusters for each user out of 100 scored hotel clusters.

Why:
Hotel booking websites such as Expedia would like to improve the quality of their recommendations. \"Nothing is better than being greeted by your favorite drink just as you walk through the door of the corner cafe.\"","m1fname":"Chang","projectname":"Expedia Hotel Recommendations","m3fname":"Kaisheng"},{"m2lname":"BANSAL","m4lname":"","m3uni":"","m1uni":"sav2125","m4uni":"","pid":"201612-97","m2uni":"sb3766","timestring":"Wed Dec 14 23:35:39 2016","m4fname":"","language":"Python, Bokeh, MatplotLib","m3lname":"","dataset":"These files contain complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the \"present\" contains complete loan data for all loans issued through the previous completed calendar quarter.","m1lname":"VARSHNEY","industry":"Finance","analytics":"Random Forests,
GBT,
Regressions, etc.","m2fname":"SHUBHAM","description":"The objective is to model interest rates depending upon user ratings as well as other metadata. It's a booming field in Fintech with huge amounts of data to process. It makes an ideal choice for BDA project. ","m1fname":"SIDDHARTH AMAN","projectname":"Modelling P2P loans Interests","m3fname":""},{"m2lname":"Jain","m4lname":"","m3uni":"jy2799","m1uni":"cx2187","m4uni":"","pid":"201612-64","m2uni":"pj2313","timestring":"Wed Dec 14 23:36:49 2016","m4fname":"","language":"Python, D3.js","m3lname":"Ying","dataset":"1 - Uber Pickups in New York. We searched several datasets online and found this most suitable for our project.
2 - Taxi Trip Data. We searched several datasets online and found this most suitable for our project.","m1lname":"Xiong","industry":"Transportation","analytics":"We will use D3.js for data visualization, Python libraries for statistical analysis. Besides, we used Spark to do mapreduce for locationId counting for each borough in New York","m2fname":"Pulkit","description":"We wish to be able to quantify the effects of Uber and other online taxi sharing services on the previously operated Yellow and Green Taxi services in New York.

Expected outcome: We expect to find a drastic change in user travel patterns and taxi usage before and after the arrival of Uber in NYC. In a broad sense, taxi usage should have dropped as Uber usage increased over time.

Importance: It is important to determine this effect as it gives us insight into how the advent of uber has changed the way people travel and the Taxi Industry in general
","m1fname":"Chuhan","projectname":"Comparing Uber with its competition in similar online services and older taxis ","m3fname":"Jiefu"},{"m2lname":"Zhu","m4lname":"","m3uni":"dz2337","m1uni":"sf2788","m4uni":"","pid":"201612-55","m2uni":"kz2250","timestring":"Wed Dec 14 23:37:58 2016","m4fname":"","language":"language: Java, python tools:Spark, Flask, WebSocketd, AngularJs, D3.js,","m3lname":"Zhou","dataset":"Yahoo finance database, Google finance database etc","m1lname":"Fu","industry":"Finance","analytics":"ML clustering/classification algorithms such as Decision tree, random forest, or logistic regression, neural networks, K-means if we have enough time, etc ","m2fname":"Kan","description":"The stock market goes up and down everyday, in order to yield significant profit for investors, this project plan to design a web that can predict the value of a company stock or futures. For the stock market, there are two analysis, the fundamental analysis and technical analysis. The fundamental analysis evaluate a company's past performance and the credibility of its account, the purpose is to check whether the price of the stock now can reasonably satisfy long-term development. And the technical analysis seeks to predict change of stock value in short-term based solely on the potential trends of the past price. The goal of our project is focus on the technical analysis, aiming at providing more short-term information of stock and help investors make right decisions.","m1fname":"Shiqi","projectname":"Stock portfolio prediction and recommendation","m3fname":"Duoying"},{"m2lname":"Rana","m4lname":"","m3uni":"mml2204","m1uni":"jz2733","m4uni":"","pid":"201612-23","m2uni":"rr3087","timestring":"Wed Dec 14 23:39:11 2016","m4fname":"","language":"Python, MongoDB, Spark, Gephi","m3lname":"Lobo","dataset":"The dataset was painstakingly scraped from the internet website allrecipes.com for over 1 week.

Unfortunately, there was no publicly available dataset to meet our needs. So we looked for recipes that have been tested by a large number of users, have detailed ingredient descriptions and user ratings.

Our software requires a very specific kind of data with specific field information in order to make predictions correctly and sensibly. Publicly available data, or any other kind of data, cannot meet the needs of the software application.","m1lname":"Zhang","industry":"Life Science","analytics":"Natural Language Processing for data cleaning.

Computed pointwise mutual information for network construction.

Centralities such as degree and betweenness were calculated for learning the popularity of items. In addition, SVD was used to select important features from the PMI matrix to describe ingredient relationship.

Stochastic Gradient Boosted Trees were used to train a rating prediction model and evaluate candidate recipes for recommendation.

Generated new recipes based on network structure such as cliques and shortest paths.

Visualizations were implemented using Gephi for displaying the whole network graph, and a few ego networks with nodes and title names being displayed with sizes varying according to the degree of the nodes.
","m2fname":"Rahul","description":"To be able to generate recipes for a user based on the user input of ingredients available with them.

In current times, a basic knowledge of cooking has become an essential part of an individual’s life. People are also becoming health conscious but are constrained by busy schedules and a rudimentary knowledge of cooking. Choosing a good recipe with the limited ingredients one possesses at home at the given time is a herculean task.

That is where our software can help them. In churning out a likeable recipe based on the limited ingredients present with the user to be used along with a few other ingredients. ","m1fname":"Jiajia","projectname":"Recipe Generation Based on Ingredient Network","m3fname":"Macrina"},{"m2lname":"Zhang","m4lname":"","m3uni":"ys2840","m1uni":"ck2749","m4uni":"","pid":"201612-78","m2uni":"wz2363","timestring":"Wed Dec 14 23:39:15 2016","m4fname":"","language":"Spark/Scala/Python/Matlab/Excel","m3lname":"Shen","dataset":"Climatology Data from NOAA
https://www.nodc.noaa.gov/access/index.html
Global Surface Temperature Data from NASA
http://data.giss.nasa.gov/gistemp/
Sea Level Trends Data from NOAA
http://tidesandcurrents.noaa.gov/sltrends/sltrends.html
Sea Level Trends Data From PSMSL
http://www.psmsl.org/data/obtaining/","m1lname":"Kanungo","industry":"Social Science-Government","analytics":"Data Filtering using Python and MATLAB
ML-LIB
Regression with Stochastic Gradient Descent
Exponential Least Square Fitting
MATLAB 2-D/3-D Visualization
NCBrowser Visualization","m2fname":"Wei","description":"The issue of Global Warming has been a controversial topic over the past decades. President Trump even claimed it is a made-in-China topic. The impact of global warming can truly be devastating to our planet. For example, 40% of the population in Netherlands are exposed to the risk of drowning. Growing sea level resulted from global warming can lead to submerging city’s land like Manhattan. We used evidence from big geographical data and evaluated the impact of global warming. We predicted the global temperature and the resulting sea level trend around the whole world.","m1fname":"Chandan","projectname":"The Impact of Global Warming from Big Geographical Data","m3fname":"Yizhou"},{"m2lname":"Zhang","m4lname":"","m3uni":"","m1uni":"sz2606","m4uni":"","pid":"201612-86","m2uni":"rz2368","timestring":"Wed Dec 14 23:39:52 2016","m4fname":"","language":"Python, Pyspark, Mahout","m3lname":"","dataset":"General twitter datasets contains more than 3,000,000 classified tweets, each row is marked as 1 for positive sentiment and 0 for negative sentiment.
Airline twitter comment data was scraped from Twitter and contributors were asked to classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
","m1lname":"Zhang","industry":"Media","analytics":"Logistic regression, Naive Bayes classifier, K-means, SVM","m2fname":"Renyuan","description":"In this project, we will do sentiment analysis for twitter tweets. By classifying the text, we want to obtain a prediction model for the sentiment of tweets and testing them in airline contexts.
In addition, we will evaluate the difference between a general model and a specific model for airlines. ","m1fname":"Shang","projectname":"Sentiment Analysis for Twitter & US Airline","m3fname":""},{"m2lname":"Bagri","m4lname":"","m3uni":"ss4974","m1uni":"gc2662","m4uni":"","pid":"201612-51","m2uni":"aab2234","timestring":"Wed Dec 14 23:41:16 2016","m4fname":"","language":"Python, Spark, Hadoop/SQL","m3lname":"Srinivasan","dataset":"We are using the Traffic Violations from New York State dataset from data.gov. It is a public data set on the government website. There are a large number of interesting attributes in the dataset
We can combine them to find insightful results and predictions.
","m1lname":"Charitos","industry":"Transportation","analytics":"Visualization via Python's matplotlib,
Binary decision tree algorithm,
• Binary Classification
• Linear SVM – large-scale classification
• SVM with SGD optimization – Stochastic Gradient Descent
• SVM trained through L2 regularization","m2fname":"Aditya","description":"Through this project research, we aim to analyze and display Traffic Violation consequences and study the geographic and demographic aspects of the violation to acquire insightful information. We wish to predict possible traffic violations and create a system that helps avoid such violations in the future.","m1fname":"Georgios","projectname":"Traffic Violation Analysis & Prediction","m3fname":"Srinidhi"},{"m2lname":"Kong","m4lname":"","m3uni":"yj2425","m1uni":"yx2385","m4uni":"","pid":"201612-91","m2uni":"zk2202","timestring":"Wed Dec 14 23:41:50 2016","m4fname":"","language":"Hadoop/Pig, Mahout, Matlab, Apache+PHP+Mysql+Echarts , Shell scripts","m3lname":"Ji","dataset":"Main Dataset 1961-2013 Food Balance Sheets for 42 selected
countries (and updated regional aggregates)
Data Source Food and Agriculture Organization of the United
Nations_Statistics Division
Food balance sheets (FBS) present a comprehensive picture of the pattern of a country’s food supply during a specified reference period by showing sources of supply and their utilization.
FBS measures food consumption from a food supply perspective.
In FBS, per capita supplies are derived from the total supplies available for human consumption by dividing the quantities of food by the total population actually partaking of the food supplies of the reference period.
Other Datasets
Global Database on Body Mass Index (World Health Organization)
Applied in studying the relationship between diet pattern and obesity.
World Development Indicators (World Bank Group)
Applied in studying the possible influence of country’s economic performance on diet pattern.
","m1lname":"Xiong","industry":"Life Science","analytics":"Preprocess dataset:
Use Pig&Hadoop to do query, get:
Data includes food supply (kcal/capita/day) of various kinds of food in countries in several years.
Data includes food supply (kcal/capita/day) of selected countries from 1961 to 2013.
Data includes BMI index of countries and regions in several years.
Data includes GDP per capita index of countries and regions several years.
Cluster data:
Use Mahout to cluster countries and regions. Feature space: {Cereals, Veg&Fruits, Roots, Protein, Sugar&Oil} (%).
Identified clusters by the average GDP per capita in each cluster.
Visualize results:
Use Matlab to dump the 5 cluster center points into pie chart, as well as average BMI structure in each cluster.
Use Apache+PHP+Mysql+Echarts to build a website and show the distribution of 5 clusters on a world map while showing the results of analysis at the same time.
","m2fname":"Zhuo ","description":"There may be a relationship between lifestyle, as indicated by food consumption, and the potential risk of chronic diseases.
A diet high in fruits and vegetables appears to decrease the risk of cardiovascular disease and death.
An unhealthy diet is a major risk factor for a number of chronic diseases including: high blood pressure, diabetes, abnormal blood lipids, overweight/obesity, cardiovascular diseases, and cancer.
The WHO estimates that 2.7 million deaths are attributable to a diet low in fruits and vegetables every year.
Food consumption pattern varies by country as a result of diversity in food culture, agriculture and geography, economy, and population.
","m1fname":"Yuwei ","projectname":"Study of Worldwide Food Consumption Pattern","m3fname":"Yilan"},{"m2lname":"Ren","m4lname":"","m3uni":"","m1uni":"jl4753","m4uni":"","pid":"201612-29","m2uni":"yr2301","timestring":"Wed Dec 14 23:42:15 2016","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"We are using the dataset from US Environmental Protection Agency. The original data was summarized on an annual basis and grouped by parameters. We gathered the data of five major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide, Ozone and PM 2.5), as well as the weather condition (Temperature, Wind Speed, Wind Direction and Barometric Pressure) for the past three years.
","m1lname":"Liu","industry":"Life Science","analytics":"Linear Regression with elastic net","m2fname":"Yidong","description":"To predict air pollution based on historic data, both air pollutant data and related weather data such as temperature and wind etc.
This topic is becoming more and more important with the development of industrialization. Air pollution has become a global issue that may cause harm or discomfort to humans and other living organisms, or damage the natural environment. Even though it is already under control in the United States, air pollution remains a severe issue in quickly developing countries such as China and India. Therefore, air pollution prediction and forecasting are essential to help us minimize the harm of air pollution.

","m1fname":"Jingyue","projectname":"Air Pollution Prediction","m3fname":""},{"m2lname":"Chahar","m4lname":"","m3uni":"aa3766","m1uni":"nc2663","m4uni":"","pid":"201612-99","m2uni":"ac3946","timestring":"Wed Dec 14 23:44:44 2016","m4fname":"","language":"Spark, Python, AWS, S3, D3, Hadoop","m3lname":"Arora","dataset":"We will work on the Lending Club loan dataset with roughly 887383 rows and 75 columns. We got this dataset from Lending Club itself. Also we intend to use Prosper dataset in future.","m1lname":"Chauhan","industry":"Finance","analytics":"Naïve Bayes
Logistic Regression
SGDClassifier
Random Forest","m2fname":"Aman","description":"As an investor in peer to peer(P2P) lending marketplace, it's difficult to predict the expected return on the loan portfolio.
The main problem here is to predict the loan default rate.
Through this project, we intend to develop and implement machine learning models to predict the loan default rate based on attributes of the loan like loan amount, loan status, term, et al.","m1fname":"Nitesh","projectname":"P2P Loan Default Rate Prediction","m3fname":"Arushi"},{"m2lname":"Wang ","m4lname":"","m3uni":"wl2575","m1uni":"gx2127","m4uni":"","pid":"20161268","m2uni":"yw2768","timestring":"Wed Dec 14 23:44:50 2016","m4fname":"","language":"python, pyspark, hive","m3lname":"Li","dataset":"Black Friday dataset from Analytics Vidhya
Link:
https://datahack.analyticsvidhya.com/contest/black-friday/","m1lname":"Xu","industry":"Retail","analytics":"Algorithm: K-Means Clustering Algorithm","m2fname":"Yuntong ","description":"Motivation and Importance:
Black Friday is one of the most discussed days of the year. From the view of e-commerce platforms and offline stores, we want to predict what kinds of products we should promote and advertise to customers according to their previous purchase records.

Using clustering results, sellers can get a clear picture of how much of each kind of merchandise to keep in store in order to maximize profit.

Tools:
Algorithm: K-Means Clustering Algorithm

Language & platform: Python, hive, Spark
","m1fname":"Guowei","projectname":"Black Friday Merchandise Analytics and Prediction ","m3fname":"Woye "},{"m2lname":"Kamdem","m4lname":"","m3uni":"","m1uni":"bcs2149","m4uni":"","pid":"201612-6","m2uni":"kfk2113","timestring":"Wed Dec 14 23:46:09 2016","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"The dataset is the universe of all loans as provided by Kiva.","m1lname":"Speice","industry":"Social Science-Government","analytics":"We used Spark SQL to perform all of our data processing and augmentation, Scikit-Learn was used to train many different classes of models in order to estimate the generalization error.","m2fname":"Karl-Loic","description":"Our goal is to develop a machine learning method which can predict whether microfinance loans will go bad. This research has not been attempted yet, and so our group is the first to pioneer these methods. What makes microfinance prediction difficult is that it is not governed by standard economic indicators; there is no credit score to take into account. Instead, we have to look to other factors to actually \"learn\" what makes a quality loan. The research we are doing is important to help all people involved in microfinance; The loan servicers can monitor the loans in an automated fashion, the loan originators can spot poor loans from the beginning, and loan investors have an easy way to do risk analysis. All this improves the lifecycle of microfinance ensuring a safer market for all in this important industry.","m1fname":"Bradlee","projectname":"Making Microfinance Safer","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"rz2368","m1uni":"qw2208","m4uni":"","pid":"201612-86","m2uni":"sz2606","timestring":"Wed Dec 14 23:48:32 2016","m4fname":"","language":"Python, Pyspark, Mahout","m3lname":"Zhang","dataset":"Dataset: General twitter datasets contains 1,578,627 classified tweets, each row is marked as 1 for positive sentiment and 0 for negative sentiment.
Airline twitter comment data was scraped from Twitter and contributors were asked to classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).","m1lname":"Wu","industry":"Media","analytics":"Logistic regression, Naive Bayes classifier, K-means","m2fname":"Shang","description":"In this project, we will do sentiment analysis for twitter tweets. By classifying the text, we want to obtain a prediction model for the sentiment of tweets and testing them in airline contexts.
In addition, we will evaluate the difference between a general model and a specific model for airlines. ","m1fname":"Qingwei","projectname":"Sentiment Analysis for Twitter & US Airline","m3fname":"Renyuan"},{"m2lname":"Gong","m4lname":"","m3uni":"ch3212","m1uni":"mz2584","m4uni":"","pid":"201612-46","m2uni":"xg2244","timestring":"Wed Dec 14 23:48:53 2016","m4fname":"","language":"Python, PySpark","m3lname":"Han","dataset":"The dataset we use is downloaded from the website. The size of the dataset is 2 Gigabytes. As shown, the dataset contains bonds, exchange-traded funds and stocks in every major market in the world, such as the US, Hong Kong and Japan.","m1lname":"Zuo","industry":"Finance","analytics":"Parametric VaR, ES","m2fname":"Xiangfeng","description":"Value at Risk is used to estimate the level of financial risk of a company over a certain period of time. It determines the potential loss in the chosen company and the probability of occurrence of the loss. It's important because financial institutions can know whether they have enough reserves to cover the potential loss.
Through this research, we will get the chosen companies' VaR (or CVaR).
","m1fname":"Maolin","projectname":"Investment Assistant: Calculations of Risks","m3fname":"Cong"},{"m2lname":"Jain","m4lname":"","m3uni":"jy2799","m1uni":"cx2187","m4uni":"","pid":"201612-64","m2uni":"pj2313","timestring":"Wed Dec 14 23:50:08 2016","m4fname":"","language":"Python, D3.js","m3lname":"Ying","dataset":"1 - Uber Pickups in New York. We searched several datasets online and found this most suitable for our project.
2 - Taxi Trip Data. We searched several datasets online and found this most suitable for our project.","m1lname":"Xiong","industry":"Transportation","analytics":"We will use D3.js for data visualization, and Python libraries for stats analytics. Besides, we used Spark to do mapreduce for locationId counting w.r.t diff boroughs in New York.","m2fname":"Pulkit","description":"Objectives: We wish to be able to quantify the effects of Uber and other online taxi sharing services on the previously operated Yellow and Green Taxi services in New York.

Expected outcome: We expect to find a drastic change in user travel patterns and taxi usage before and after the arrival of Uber in NYC. In a broad sense, taxi usage should have dropped as Uber usage increased over time.

Importance: It is important to determine this effect as it gives us insight into how the advent of uber has changed the way people travel and the Taxi Industry in general","m1fname":"Chuhan","projectname":"Comparing Uber with its competition in similar online services and older taxis ","m3fname":"Jiefu"},{"m2lname":"Wang","m4lname":"","m3uni":"xg2218","m1uni":"yz2990","m4uni":"","pid":"201612-33","m2uni":"ww2420","timestring":"Wed Dec 14 23:50:55 2016","m4fname":"","language":"Spark, System G, Python, d3 js","m3lname":"Gao","dataset":"The Kmart data is collected from Walmart and its main subsidiaries from 04/16/2013 to 10/17/2013. This dataset is downloaded from Global Garment Supply Chain section of datahub website.","m1lname":"Zhu","industry":"Retail","analytics":"K-means Clustering, LDA","m2fname":"Wenqi","description":"Our research objective is to summarize Kmart's current supply chain. This is a descriptive analysis so that we can help Kmart understand which kinds of products are supplied by certain suppliers in certain regions. By better understanding the suppliers' distribution,Kmart is able to identify potential backup if a supplier is unable to supply certain product under emergency, thus running a better supply chain management. ","m1fname":"Yuanxu","projectname":"Supply Chain Management for Walmart","m3fname":"Xuefei"},{"m2lname":"Xu","m4lname":"","m3uni":"sz2531","m1uni":"cz2350","m4uni":"","pid":"201612-32","m2uni":"lx2201","timestring":"Wed Dec 14 23:51:11 2016","m4fname":"","language":"Python, C++","m3lname":"Zhang","dataset":"Election is a hot topic these days, so it’s easy for us to collect the data and construct dataset. For the training data, we focus on those messages that contain certain indicators via the Twitter API. Those indicators could be diversified, such as names or topics like #DonaldTrump, #HillaryClinton, and #election; or emotions like :) :( and emojis; or certain feature words like “fuck”, “hurts”, “damn”. 
The test data was manually. We are going to collect tweets from different states in a week. A web interface tool was built to aid in the manual classification task. We will select a text analysis software to assess emotional, cognitive, and structural components of text samples. Finally we would rank the trend by tweet volume and get a proper summary of public views of the presidential election.","m1lname":"Zhang","industry":"Social Science-Government","analytics":"Natural Language Processing, probabilistic Context Free Grammar for parsing, Hidden Markov Model for tagging, Naive Bayes classifier or other classification algorithms for Document classification.","m2fname":"Lingqing","description":"With the end of 2016 election, every single person of America has been expressing different emotions toward the outcome of the election. We are interested in the public views to the result and what do they think of the future of two candidates and also of this country. To achieve this goal, we intend to use Twitter to get access to the opinions. Finally, by classifying the emotions, we expect to analyze the reaction of the whole country.
","m1fname":"Ciyuan","projectname":"Analyzing Twitter Sentiment of 2016 Presidential Election","m3fname":"Shihao"},{"m2lname":"Zhou","m4lname":"","m3uni":"cw2962","m1uni":"gl2548","m4uni":"","pid":"201612-85","m2uni":"jz2792","timestring":"Wed Dec 14 23:52:15 2016","m4fname":"","language":"Tool: Spark, MySQL, Python, Flask, GoogleMap Api","m3lname":"Wang","dataset":"Airports, air routes dataset from http://openflights.org/data.html
Air carrier statistics from http://www.transtats.bts.gov/","m1lname":"Li","industry":"Transportation","analytics":"Analytic methods:
We process the airport, airline and flight history datasets with various SparkSQL queries and operations to construct the graph that best represents airline reachability between cities. Transfers are taken into consideration to make it a complete graph. A greedy algorithm, a genetic algorithm and simulated annealing are implemented and evaluated according to their running time and effectiveness. We use the simulated annealing algorithm for real-time route generation, with parameter settings chosen to balance optimality and running time. The results and the user-defined route generator are presented in a webpage built with Flask and an embedded Google map.","m2fname":"Jin","description":"Motivation:
Almost every one of us dreams of traveling around the world, just like Marco Polo and Magellan. With the help of widespread airline networks and easy access to travel information, people have started to think about how to complete a trip around the world with minimum total mileage, on an airline they like, and with more fun. Therefore, we want to use a variety of airline datasets to plan the multi-city travel itinerary for the modern Marco Polo.
Objectives:
Find an efficient and effective algorithm to build an itinerary that visits most of the capital cities in the world with minimum flying mileage; build flight routes using a specific airline for fans of American Airlines, Air France and so on; enable itinerary planning with a user-defined city list to maximize the planner’s flexibility.","m1fname":"Gongqian","projectname":"Flying Marco Polo","m3fname":"Chong"},{"m2lname":"An","m4lname":"","m3uni":"","m1uni":"xh2301","m4uni":"","pid":"201612-72","m2uni":"ya2345","timestring":"Wed Dec 14 23:52:23 2016","m4fname":"","language":"Java, Python","m3lname":"","dataset":"The dataset is from IBM. It contains 10,000 images of various road conditions.","m1lname":"Hua","industry":"Transportation","analytics":"Faster R-CNN, Caffe","m2fname":"Yu","description":"Our objective is to use the Faster R-CNN algorithm to train a model that is able to detect pedestrians and vehicles on the road. We hope it can be used in driverless cars. ","m1fname":"Xiang","projectname":"Pedestrian & Vehicle Recognition","m3fname":""},{"m2lname":"Zheng","m4lname":"","m3uni":"th2668","m1uni":"yw2928","m4uni":"","pid":"201612-28","m2uni":"mz2597","timestring":"Wed Dec 14 23:52:55 2016","m4fname":"","language":"Python, Hadoop, Spark, HBase, Scikit-learn, Tableau","m3lname":"Huang","dataset":"It is a data sample of user activities over several months at Yahoo webpages, which includes user interactions with pages, ads, and search results for a training period of 90 days and labels from a test period of 2 weeks.","m1lname":"Wu","industry":"Media","analytics":"Principal Component Analysis (PCA);
Support Vector Machine (SVM);
Multinomial Naive Bayes classification algorithm;
Logistic Regression algorithm.
","m2fname":"Minghong","description":"Objective: Modeling user behaviors based on past user activities and generating accurate prediction of future user behaviors.

Expected Outcome: A classifier model mapping past user activities to possible future behavior.

Why: Predicting user behaviors has always been one of the trending topics. By predicting accurately, we can make recommendations more effective, improve user experience and thus increase our profits and popularity. Such researches are not just a double win between us and users. It's also a effective and practical way to develop and test data analytics/machine learning algorithms.","m1fname":"Avery","projectname":"User Behavior Modeling","m3fname":"Tsung-Yi"},{"m2lname":"Shankar","m4lname":"","m3uni":"as5147","m1uni":"mp3542","m4uni":"","pid":"201612-81","m2uni":"as5171","timestring":"Wed Dec 14 23:53:22 2016","m4fname":"","language":"Hadoop, mahout, Spark, Python, Java, Android Studio ","m3lname":"Sinha","dataset":"Sr. Function Datasets Size
1 Movies Movielens 20 Million 632 MB
2 Music Yahoo! Music 2.1 GB
3 News NEWS Scraping
5 E-Com Deals Walmart.com
6 Fitness Management Users Input ","m1lname":"Patel","industry":"Information","analytics":"Content-Based Recommendation, Collaboration Based filtering, K-Means Clustering, ALS, Andriod Application for visualization ","m2fname":"Aman ","description":"Pocket Buddy: Stocks, News, Movies, Music, Fitness Quotient & E-commerce recommendations.

All in one recommender system to help users to make decisions quickly and efficiently. Involves recommendation from all paths and spheres of the life. ","m1fname":"Mohneesh ","projectname":"POCKET BUDDY: ALL IN ONE RECOMMENDATION SYSTEM","m3fname":"Anul Kumar"},{"m2lname":"Netto","m4lname":"","m3uni":"hl2997","m1uni":"gc2708","m4uni":"","pid":"2016-12-31","m2uni":"rn2388","timestring":"Wed Dec 14 23:53:43 2016","m4fname":"","language":"python spark, linux","m3lname":"","dataset":"Extracted directly from Twitter website.
Retrieved \"fullname\",\"id\",\"text\",\"timestamp\",\"user\" of the tweets in json format.","m1lname":"Cho","industry":"Social Science-Government","analytics":"LDA algorithm and word cloud for visualization","m2fname":"Richa","description":"Analyse Twitter data to obtain tweets based on relevant hashtags and periods of time during the election.
Visualise the words clustered on the basis of importance within the topic using Word Clouds.
","m1fname":"GwonJae","projectname":"Analysis of Social Media Data to Draw Conclusions about the Presidential Election 2016","m3fname":""},{"m2lname":"Xue","m4lname":"","m3uni":"lz2494","m1uni":"zl2513","m4uni":"","pid":"201612-73","m2uni":"xx2241","timestring":"Wed Dec 14 23:55:44 2016","m4fname":"","language":"Languages: Python","m3lname":"Zhang","dataset":"Dataset: Dataset: Yelp dataset challenge.
Description: The dataset contains information about business, review, user, check-in, tip, photos (from the photos auxiliary file) including the user id, business id, business type, rating star, etc.","m1lname":"Liu","industry":"Information","analytics":"Analytics: User-based recommendation.
Natural language processing algorithms.","m2fname":"Xun","description":"The recommendation system provided by Yelp does not make it easy for a user to find the business that suits them best.
Generally, the recommendation results are sorted by business rating, which cannot be personalized for individual users.
Starting from a set of items recommended by a basic recommendation algorithm, we go a step further and find the most effective personalized ordering based on text similarity matching.

","m1fname":"Zhangyu","projectname":"Personallized Yelp Recommendation System","m3fname":"Lingyu"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jmz2131","m4uni":"","pid":"201612-88","m2uni":"","timestring":"Wed Dec 14 23:57:01 2016","m4fname":"","language":"Spark, Java, PostgreSQL, Ruby","m3lname":"","dataset":"Citi Bike publishes real-time station data (updated several times a minute) at https://feeds.citibikenyc.com/stations/stations.json. Unfortunately, this data is not available for download in aggregate. In order to obtain my data set, I polled and stored the real-time station data once every 10 minutes for about 6 weeks.","m1lname":"Zhao","industry":"Transportation","analytics":"Spark MapReduce, clustering","m2fname":"","description":"Citi Bike is a popular bike-share program with stations throughout NYC that a rider can pick up a bike from, or return a bike to. Previous projects on Citi Bike have analyzed popular stations and trip start/end locations, but have not addressed issues with availability and capacity of each station. Particularly at more popular locations, such as in Midtown Manhattan, there are sometimes no available bikes during popular time periods. Conversely, there are sometimes no available docking stations (for bikes to be returned to) at less popular stations. For potential Citi Bike users that are considering getting a Citi Bike membership, as well as for current Citi Bike members, it would be extremely useful to be able to predict their chances of getting a bike and/or finding an available docking station, given a day of week/time of day.","m1fname":"Jessica","projectname":"Citi Bike Availability Predictor","m3fname":""},{"m2lname":"Netto","m4lname":"","m3uni":"hl2997","m1uni":"gc2708","m4uni":"","pid":"201612-31","m2uni":"rn2388","timestring":"Wed Dec 14 23:57:05 2016","m4fname":"","language":"python spark, linux","m3lname":"","dataset":"Extracted directly from Twitter website.
Retrieved \"fullname\",\"id\",\"text\",\"timestamp\",\"user\" of the tweets in json format.","m1lname":"Cho","industry":"Social Science-Government","analytics":"LDA algorithm and word cloud for visualization","m2fname":"Richa","description":"Analyse Twitter data to obtain tweets based on relevant hashtags and periods of time during the election.
Visualise the words clustered on the basis of importance within the topic using Word Clouds.
","m1fname":"GwonJae","projectname":"Analysis of Social Media Data to Draw Conclusions about the Presidential Election 2016","m3fname":""},{"m2lname":"Zheng","m4lname":"","m3uni":"yw2882","m1uni":"yz2993","m4uni":"","pid":"201612-67","m2uni":"qz2271","timestring":"Wed Dec 14 23:57:24 2016","m4fname":"","language":"Python, Spark, MATLAB","m3lname":"Wang","dataset":"The dataset we use is from Quandl. It is an open source.","m1lname":"Zhou","industry":"Finance","analytics":"The analytics and algorithms we use are moving average, k-means clustering and linear regression.

Moving Average for basic trend analysis of each stock

K-means Clustering for recommendation

Linear Regression for prediction of stock market

In addition, we use MATLAB to plot graphs for visualization.","m2fname":"Qianwen","description":"The objectives for this project are:

1. Analyze the stock market using the available dataset.
2. Provide recommendations for users.
3. Predict stock market trends.

Because of devaluation, people may invest in the stock market, but the stock market itself is full of risk. In this case, good stock analysis and recommendations can give people more opportunities to make money. Stock data is truly Big Data, and tools and algorithms are needed to handle such a large dataset. Both short-term traders and long-term investors need a clear and precise overview of the stock market.

The innovation in our project is using clustering to make recommendations. Based on each cluster and the stocks it contains, we can make recommendations for a user; e.g., if the user chooses to buy stock 1, and this stock is in cluster A, we can recommend the rest of the stocks in cluster A to the user.
","m1fname":"Yimeng","projectname":"Stock Analysis and Recommendation","m3fname":"Ye"},{"m2lname":"Zheng","m4lname":"","m3uni":"mz2591","m1uni":"tw2568","m4uni":"","pid":"201612-24","m2uni":"zz2406","timestring":"Wed Dec 14 23:57:24 2016","m4fname":"","language":"Python, Javascript, R; Spark","m3lname":"zhou","dataset":"Yelp dataset are provided from Yelp Data Challenge. The URL of the source is: https://www.yelp.com/dataset_challenge
","m1lname":"Wu","industry":"Information","analytics":"Analytics: City business trend, meals ordered most frequently, most crowded hours of the restaurants, etc.
Algorithm: Clustering, Support Vector Machine, Bayes Classifier, Random Forest, CNN
Visualization: Review Stars distribution, Review Texts Word-cloud, Review text sentiment analysis","m2fname":"Zhi","description":"The overall objective is to explore and visualize the Yelp dataset, predict Yelp review star categories from Yelp reviews, and visualize the sentiment analysis of the dataset.
","m1fname":"Tongyun","projectname":"Yelp Reviews Exploration & Visualization","m3fname":"Ming"},{"m2lname":"Tian","m4lname":"","m3uni":"kc3051 ","m1uni":"hw2507","m4uni":"","pid":"201612-102","m2uni":"jt2867","timestring":"Wed Dec 14 23:57:41 2016","m4fname":"","language":" Python Java spark R","m3lname":"Chen","dataset":"The data we use is Daily weather historical records in Weather Underground.

The main website is https://www.wunderground.com.

We scrape the data from the website.

And the factors we use are Temperature, Precipitation, Humidity, Wind Speed…
","m1lname":"Wu","industry":"Information","analytics":"Algorithms
PCA to reduce the number of variables
K-means clustering
Slope One algorithm to complete the user dataset

System modules: sparkR, beautifulsoup

Visualization: shiny","m2fname":"Jiani","description":"Weather factor is important for people when they choose where to live or travel. This tool gives people suggestions about their dream place to live based on their preference on climate conditions.","m1fname":"Haowei ","projectname":"Weather Taste – city recommendation tool","m3fname":"Kuan-Sheng"},{"m2lname":"MIAO","m4lname":"","m3uni":"LL3001","m1uni":"ml3810","m4uni":"","pid":"201612-18","m2uni":"mg2666","timestring":"Wed Dec 14 23:57:48 2016","m4fname":"","language":"python, spark,hive","m3lname":"LIU","dataset":"public jmcauley.ucsd.edu/data/amazon
product information: scraped from amazon.com

Size: 1.48 GB
Instance: 60k products with over 1.5 million reviews
Features: 9 features","m1lname":"LIU","industry":"Retail","analytics":"algorithm: bi-normal separation, maxent model, lda
visualization: ggplot","m2fname":"GUANXIONG","description":"By analyzing product reviews, we want to summarize each product's positive and negative features. For products with dramatic rating changes, we would like to explore the reasons behind them and output a compact summary that users can easily understand

For customers, it will be easier and faster to distill thousands of reviews into simple “pros and cons”
For sellers, it will be possible to see what people like and dislike about their products, and to understand customers’ preferences
","m1fname":"MENGYING","projectname":"Amazon electronic product reviews analysis","m3fname":"LIU"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jmz2131","m4uni":"","pid":"201612-88","m2uni":"","timestring":"Wed Dec 14 23:58:10 2016","m4fname":"","language":"Spark, Java, PostgreSQL, Ruby","m3lname":"","dataset":"Citi Bike publishes real-time station data (updated several times a minute) at https://feeds.citibikenyc.com/stations/stations.json. Unfortunately, this data is not available for download in aggregate. In order to obtain my data set, I polled and stored the real-time station data once every 10 minutes for about 6 weeks. I ended up with about 400 MB of data.","m1lname":"Zhao","industry":"Transportation","analytics":"Spark MapReduce, clustering","m2fname":"","description":"Citi Bike is a popular bike-share program with stations throughout NYC that a rider can pick up a bike from, or return a bike to. Previous projects on Citi Bike have analyzed popular stations and trip start/end locations, but have not addressed issues with availability and capacity of each station. Particularly at more popular locations, such as in Midtown Manhattan, there are sometimes no available bikes during popular time periods. Conversely, there are sometimes no available docking stations (for bikes to be returned to) at less popular stations. 
For potential Citi Bike users that are considering getting a Citi Bike membership, as well as for current Citi Bike members, it would be extremely useful to be able to predict their chances of getting a bike and/or finding an available docking station, given a day of week/time of day.","m1fname":"Jessica","projectname":"Citi Bike Availability Predictor","m3fname":""},{"m2lname":"Shi","m4lname":"","m3uni":"yn2289","m1uni":"fy2207","m4uni":"","pid":"201612-77","m2uni":"qs2158","timestring":"Wed Dec 14 23:58:33 2016","m4fname":"","language":"Spark (SQL, MLlib), Python (Flask), Javascript","m3lname":"Nian","dataset":"We use datasets from Gowalla, a location-based social networking website where users share their locations by checking in. They are provided by the SNAP group at Stanford University (http://snap.stanford.edu/data/index.html).

The dataset consists of two parts.
Nodes data: Each node contains user information, including user ID, check-in location (latitude and longitude), location ID, and check-in time
Edges data: This dataset describes the social network of each user (bidirectional).
","m1lname":"Yang","industry":"Social Science-Government","analytics":"Social Network Exploration:
1. Find out all the locations visited by a specific user
2. Given two users, find the locations that both have visited in the same time slot
Recommendation:
3. Find the locations visited most frequently by one’s friends
4. Based on locations visited in common, recommend new friends for the user
5. Use the visit-time distribution of each location to make recommendations at different times of day
","m2fname":"Qiner","description":"We focus on research into users’ social networks and aim to develop web-based interactive tools to visualize the network and provide recommendations for users. Sometimes visualization is more important than analytics.
","m1fname":"Fan","projectname":"Visualization of User Movement in Location-based Social Networks","m3fname":"Yiqun"},{"m2lname":"Zhu","m4lname":"","m3uni":"yg2430","m1uni":"ht2438","m4uni":"","pid":"201612-54","m2uni":"yz3021","timestring":"Wed Dec 14 23:58:41 2016","m4fname":"","language":"Java (Eclipse) and Python (Theano)","m3lname":"Guo","dataset":"In this project, we use the amazon dataset(rating only) download from http://jmcauley.ucsd.edu/data/amazon/links.html,
This dataset includes separate rating data for 24 different categories from Amazon.com; the total dataset is 3.2 GB.
","m1lname":"Tu","industry":"Retail","analytics":"Hadoop Distributed File System, Mahout item-based recommendation, Mahout user-based recommendation, Theano","m2fname":"Yanjia","description":"Online shopping has become extremely common in our daily lives, and recommendation systems are an indispensable part of it. The performance of these recommendations therefore matters a great deal.
However, recommendation accuracy suffers from the lack of category-specific rules for different item categories, such as books, video games, and cellphones. So we want to try different recommendation schemas on the rating data of different categories and compare the evaluation results, to better understand the link between item category and recommendation performance.
Eventually, we hope this finding will lead to improvements in recommendation research on electronic merchandise.

","m1fname":"Huaiyuan","projectname":"Recommendation Schema Performance Comparison on Amazon Rating Dataset","m3fname":"Yixuan"},{"m2lname":"Cong","m4lname":"","m3uni":"qc2201","m1uni":"lea2142","m4uni":"","pid":"201612-52","m2uni":"axc2105","timestring":"Wed Dec 14 23:58:49 2016","m4fname":"","language":"MATLAB, Python, Spark, Keras, Scikit learn","m3lname":"Chen","dataset":"MIRFLICKR-25000 (25k Flickr images) Publicly available on http://press.liacs.nl/mirflickr. Our software can technically support all types of images as long as we can preprocess it into the appropriate colorspace and dimensions.","m1lname":"Aguilar","industry":"Media","analytics":"convolutional neural networks, image processing techniques","m2fname":"Amery","description":" Grayscale images are extremely commonplace due to being more computationally inexpensive than full color pictures or a lack of image capturing resources. However, estimating the true colors of the images can be a valuable asset in obtaining more information from a photo. We plan on applying the big data manipulation techniques acquired this semester to train a model off of a large image dataset in order to accurately colorize similar grayscale images. CNNs are good at learning patterns from pictures and associate them with object classes. Color is often correlated with these features.
Some features are better learned and colorized than others. We will explore if better techniques or different CNN architectures can improve results. ","m1fname":"Luis","projectname":"Image Colorization Using Convolutional Neural Networks","m3fname":"Qipeng"},{"m2lname":"Wang","m4lname":"","m3uni":"rl2836","m1uni":"tz2297","m4uni":"","pid":"201612-57","m2uni":"yw2867","timestring":"Wed Dec 14 23:59:00 2016","m4fname":"","language":"Python, Flask","m3lname":"Li","dataset":"Yahoo Webscope Music Rating datasets
MovieLens
Other public APIs and datasets with rating information","m1lname":"Zhao","industry":"Media","analytics":"Alternating Least Squares (ALS) algorithm in Spark
","m2fname":"Yisheng","description":"Music is one of the most common forms of entertainment
people enjoy every day, and people's great need and their
frequent access to the Internet has brought both the importance
and the possibility of the application of music recommendation.
In this project, we propose a web-based recommendation
system which is expected to run online in our website, where
users' tastes and preference are analyzed when they visit the
website and give their own ratings to presented songs.","m1fname":"Tiange","projectname":"Web-based Music Recommendation System","m3fname":"Ran"},{"m2lname":"Zheng","m4lname":"","m3uni":"rz2331","m1uni":"sz2547","m4uni":"","pid":"201612-101","m2uni":"jz2672","timestring":"Wed Dec 14 23:59:02 2016","m4fname":"","language":"Python, Spark, AWS","m3lname":"Zhong","dataset":"The TLC Yellow Taxi dataset
Pickup & drop-off locations, trip time & date, fare, tips
10+ million rides per month
From Jul 2015 to Jun 2016 and Apr 2014 to Sep 2014

Uber trip data
Date/time, pickup coordinates
Incomplete: Apr 2014 to Sep 2014

Historical weather data for each day
Fetched from http://weathersource.com/
Collect each day’s weather data into a dictionary

","m1lname":"Zhang","industry":"Information","analytics":"ML Algorithms: Features and outcome variables
Objectives: predict 3 continuous outcome variables
Total number pickup rides (pickup_counts)
Expected (average) trip fare (after-tax but before-tip)
Expected (average) tip fare
Based on 3 categorical and 2 continuous (numerical) features:
Neighborhood of pickup location (categorical; 18 neighborhoods in Manhattan below West 110th Street and East 95th Street)
Pickup time (numerical; grouped hourly into 24 categories)
Weather (categorical; 3 categories: sunny, rain, or snow)
Business day or weekend/federal holiday (categorical; binary)
Temperature in Fahrenheit (numerical; with 10F interval i.e. 10, 20, …, 90)

ML Algorithms: Random Forest
Since all features are categorical, we use Random Forest, a “panacea” for data scientists!
Algorithm Pros:
Can handle both categorical & numerical features
Can rank importance of features
Non-parametric → no assumptions on raw data
Maintains accuracy when a lot of data is missing
Handles unbalanced data & non-linearity well
As a decision-tree algorithm, easy to interpret
","m2fname":"Jialei","description":"1. Uber vs. Yellow Taxi in Manhattan?
Analyze and compare relationship between ride amount of yellow taxi and Uber, at different locations and time
2. Predict demands for Yellow Taxi in specified area at specific time
Given pickup location, time, weather condition, etc., apply machine learning algorithms (e.g. clustering, regression, time series models, or neural network, etc.) to predict:
Ride amount requested by passengers
Average distance of trip
Average tip amount
3. Develop a web app to inform Yellow Taxi drivers of real-time predicted ride amount, distance, and tip, so that riders could make better pick-up decisions","m1fname":"Shengjia","projectname":"Analyzing 1 Billion+ NYC Yellow Taxi and Uber Rides for Taxi Drivers","m3fname":"Ruilin"},{"m2lname":"feng","m4lname":"","m3uni":"","m1uni":"gs2885","m4uni":"","pid":"201612-50","m2uni":"kf2508","timestring":"Wed Dec 14 23:59:04 2016","m4fname":"","language":"python spark hadoop","m3lname":"","dataset":"We get the dataset from a related research
and here are the datasets:

http://www.basketball-reference.com
http://stats.nba.com
","m1lname":"sheng","industry":"Media","analytics":"MapReduce
Linear Regression","m2fname":"kaiyan","description":"Analyze how well an NBA player's performance matches his salary, based on comprehensive performance statistics. Suggest salaries for current players.

Both of us are huge NBA fans; we often watch NBA games together and really love the game. We really hope NBA players perform in line with their salaries and don't let their fans down. Of course, we could also give teams good advice on finding suitable players given their salary space.

Giving each player an appropriate salary is really meaningful for the whole league.","m1fname":"guanpeng","projectname":"Cost efficiency analysis for NBA players","m3fname":""},{"m2lname":"Xin","m4lname":"","m3uni":"yw2902","m1uni":"tz2307","m4uni":"","pid":"201612-47","m2uni":"qx2156","timestring":"Wed Dec 14 23:59:07 2016","m4fname":"","language":"Python, Hadoop, Spark","m3lname":"Wang","dataset":"The dataset contains 1.5 years of customer behavior data from Santander Bank, used to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as \"credit card\", \"savings account\", etc. We will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind_(xyz)_ult1, which are the columns #25 - #48 in the training data. The test and train sets are split by time, and public and private leaderboard sets are split randomly. Obtained from Kaggle.","m1lname":"Zhou","industry":"Finance","analytics":"Logistic Regression, K-means Clustering, Collaborative Filtering","m2fname":"Qi","description":"Santander Bank offers a lending hand to their customers through personalized product recommendations.

Under their current system, a small number of Santander’s customers receive many recommendations while many others rarely see any, resulting in an uneven customer experience. We are going to build an application to predict which products their existing customers will use in the next month based on their past behavior and that of similar customers.
","m1fname":"Tian ","projectname":"Santander Product Recommendation","m3fname":"Yaqing"},{"m2lname":"Keren","m4lname":"","m3uni":"","m1uni":"mjc2261","m4uni":"","pid":"201612-38","m2uni":"ik2338","timestring":"Wed Dec 14 23:59:23 2016","m4fname":"","language":"Python 2.7 ; Ubuntu ","m3lname":"","dataset":"We utilized two main datasets:
1) Million Song Database – Free collection of audio features and metadata for a million contemporary popular music tracks that was developed by Columbia University’s LabRosa and The Echo Nest. This data is accessible using either SQLite databases or HDF5 files and includes metadata such as the song’s name, artist’s name, similar artists, and artist tags.

2) Personal music library, which included genres such as Rock, Hip-Hop, Classical, Pop, and Instrumental. This data set was 1 TB.
","m1lname":"Colacot","industry":"Media","analytics":"System G and Gremlin","m2fname":"Itay","description":"Objectives:
We developed a locally hosted system that provides recommended playlists based on user inputs or user activity. Our system is open-source, lightweight, and optimized for large datasets. We also used System G to create visualizations of our recommendations.

Why are these research/toolkits important?:
Currently most recommendation services are either cloud based services that thrive on streaming music(Pandora, Spotify,etc) or local tools that are not optimized for large music libraries (iTunes). Our project will provide the recommendation tools of the cloud based systems for your local library. There is a fairly large (and growing) community of self-hosting and open-source application users who do not feel comfortable using cloud-based services or do not want to go through the hassle of uploading their libraries to the cloud. Currently there are not many good recommendation services for self-hosting users, so our application gives them a better alternative while staying on a self-hosted platform and keeping the data on a \"personal cloud\".

","m1fname":"Manu ","projectname":"RecoSonic: Self-Hosted Music Recommendation System","m3fname":""},{"m2lname":"Gu","m4lname":"","m3uni":"sy2628","m1uni":"yl3395","m4uni":"","pid":"201612-58","m2uni":"yg2466","timestring":"Thu Dec 15 00:00:02 2016","m4fname":"","language":"Python, Spark, AWS EMR, AWS DynamoDB, AWS Data Pipeline, AWS Athena, AWS EC2, AWS S3","m3lname":"Yue","dataset":"Uber surge data for UberX, UberXL, UberBLACK in NYC. We crawled on an EC2 through Uber API.
Lyft surge data for Lyft and Lyft Plus in NYC. We crawled it on an EC2 instance through the Lyft API. (Yes, we are glad to share the above data for free.)
NYC weather data from NYC government website.","m1lname":"Luo","industry":"Transportation","analytics":"Custom Algorithm - Sliding Window
- Originally used in Weather Forecast
- A comparison of similar spans across previous days is used
- Original accuracy was pretty high
","m2fname":"Yu","description":"Our objective is to develop an easily accessible and reliable way of predicting surge prices for a given time frame.

One of the major problems with apps like Uber and Lyft these days is that surge prices significantly increase our transportation expenses, and there is almost no way of knowing when the surge will end. Uber, for example, will indicate that the surge may change in 2 minutes, which, sadly enough, is never the case.

Our application, however, predicts when the surge will end and what the surge price will be at a given time.

The importance of this transportation application is quite obvious: with it, we no longer have to undergo the excruciating pain of waiting indefinitely for the surge to end. All we need to do is check the application and figure out when the surge will end or ‘where’ the surge will end (where there are no surges), and act accordingly: either enjoy a tasteful afternoon at Starbucks or walk a few blocks down the road and catch an Uber, free of surge!
","m1fname":"Yiwen ","projectname":"NYC Uber Surge/Lyft Prime Prediction","m3fname":"Siyan"},{"m2lname":"feng","m4lname":"","m3uni":"","m1uni":"gs2885","m4uni":"","pid":"201612-50","m2uni":"kf2508","timestring":"Thu Dec 15 00:01:10 2016","m4fname":"","language":"python spark hadoop","m3lname":"","dataset":"We get the dataset from a related research
and here are the datasets:

http://www.basketball-reference.com
http://stats.nba.com
","m1lname":"sheng","industry":"Media","analytics":"MapReduce
Linear Regression","m2fname":"kaiyan","description":"Analyze how an NBA player matches his salary based on comprehensive performances and statistics. Give suggestions on salaries of current players.

Both of us are avid NBA fans; we often watch NBA games together and really love the game. We hope NBA players perform in line with their salaries and don't let their fans down. Of course, we could also give teams good advice on finding suitable players for their salary space.

Give the player an appropriate salary is really meaningful for the whole league.","m1fname":"guanpeng","projectname":"Cost efficiency analysis for NBA players","m3fname":""},{"m2lname":"Qi","m4lname":"","m3uni":"rp2815","m1uni":"mz2594","m4uni":"","pid":"201612-63","m2uni":"yq2211","timestring":"Thu Dec 15 00:02:49 2016","m4fname":"","language":"Java, Python, Hadoop, Mahout, IBM System G","m3lname":"Peng","dataset":"Dataset: Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.
website: https://www.kaggle.com/stackoverflow/stacksample ","m1lname":"Zheng","industry":"Information","analytics":"Java, Python, Hadoop, Mahout, IBM System G","m2fname":"Yi","description":"Objectives: Create a Q&A assistant for Stack Overflow community

Aim: Shorten question solving time and provide data analysis for the community.
Outcome:
1. Realize clustering for users according to their areas of expertise.
2. Predicting question solving time and score according to the context of the question.
3. Recommend users for questions not answered.
4. Set up Q&A relation graph for data visualization. ","m1fname":"Mingyang","projectname":"Stack Overflow Q&A assistant","m3fname":"Ruxue"},{"m2lname":"Keren","m4lname":"","m3uni":"","m1uni":"mjc2261","m4uni":"","pid":"201612-38","m2uni":"ik2338","timestring":"Thu Dec 15 00:03:44 2016","m4fname":"","language":"Python 2.7 ; Ubuntu","m3lname":"","dataset":"We utilized two main datasets:
1)Million Song Database – Free collection of audio features and metadata for a million contemporary popular music tracks that was developed by Columbia University’s LabRosa and The Echo Nest. This data is accessible using both SQLite databases or HDF5 files and includes metadata such as the song’s name, artist’s name, similar artists, artist tags

2)Personal Music library which included genres such as Rock,Hip-Hop,Classical,Pop, and Instrumental. This data set was 1 TB ","m1lname":"Colacot","industry":"Media","analytics":"System G & Gremlin","m2fname":"Itay","description":"Objectives:
We developed a locally hosted system that provides recommended playlists off of user inputs or user activity. Our system is open-source, lightweight, and optimized for large datasets. We also used System G to create visualizations of our recommendations

Why are these research/toolkits important?:
Currently most recommendation services are either cloud based services that thrive on streaming music(Pandora, Spotify,etc) or local tools that are not optimized for large music libraries (iTunes). Our project will provide the recommendation tools of the cloud based systems for your local library. There is a fairly large (and growing) community of self-hosting and open-source application users who do not feel comfortable using cloud-based services or do not want to go through the hassle of uploading their libraries to the cloud. Currently there are not many good recommendation services for self-hosting users, so our application gives them a better alternative while staying on a self-hosted platform and keeping the data on a \"personal cloud\". ","m1fname":"Manu","projectname":"RecoSonic: Self-Hosted Music Recommendation System","m3fname":""},{"m2lname":"Xie","m4lname":"","m3uni":"wg2297","m1uni":"qy2179","m4uni":"","pid":"201612-43","m2uni":"yx2382","timestring":"Thu Dec 15 00:10:40 2016","m4fname":"","language":"SQL,java","m3lname":"Guo","dataset":"1.NYPD Motor Vehicle Collisions data(data.cityofnewyork.us)

2.Geo data
postcode geo coordinates:
http://data.beta.nyc/dataset/nyc-zip-code-tabulation-areas

borough geo coordinates:
http://data.beta.nyc/dataset/nyc-borough-boundaries

3.Software
System G
Eclipse
MySQL

","m1lname":"Yang","industry":"Transportation","analytics":"PApplet, Processing, UnfoldingMap
System G queries
Search and sort

","m2fname":"Yucong","description":"Our project aims to provide useful and helpful information about accidents to reduce the number of crashes. We analyze the area, time and reason for high incidence of motor accidents in New York City according to the dataset so as to caution people against traffic accidents.","m1fname":"Qingyu","projectname":"Analysis of Motor Vehicle Collisions of NYC","m3fname":"Wenjing"},{"m2lname":"Liu","m4lname":"","m3uni":"sf2794f","m1uni":"gy2237","m4uni":"","pid":"201612-08","m2uni":"sl4039","timestring":"Thu Dec 15 00:11:46 2016","m4fname":"","language":"python, ","m3lname":"Fang","dataset":"We use python to write web scraper to download the datasets.
URL:http://www.basketball-reference.com/
• Data for each Game
• Points of each quarter (overtime if available)
• Total score
• Win/Loss
• Data for each team (per season)
• Win/Loss rate
• Field Goals (& attempts, & percentage)
• 3-point Field Goals (& attempts, & percentage)
• 2-point Field Goals (& attempts, & percentage)
• Free Throws (& attempts, & percentage)
• Rebounds (Offensive & Defensive)
• Assists, Steals, Blocks, Turnover, Fouls
• Total points
• Data for each player
• (Same categories as for each team)","m1lname":"Yang","industry":"Information","analytics":"• Our prediction uses a naive Bayes classification model.
• We extract useful feature vectors for naive Bayes (Rebounds, Assists, Steals, Blocks, ...) and modify each vector based on the team's historical performance and the newest game report.
• The classification label is the result of each game.
• We use Python on Spark to implement the classification.","m2fname":"Shutian ","description":"Sports result prediction is very popular nowadays, which makes predicting the results of sporting events a new and interesting challenge.
-Help sports commentator's work
-Help the team assess whether they have reached their expected performance
-Help advertisers make decisions
Previous papers concentrated on classifying the original game data, or data modified only with simple feature extraction methods.

We want to do more:
We want to establish a dynamic model based on Naive Bayesian model. And We think that we should modified the feature vector based on daily match result. And also, we want to consider the effect of team member changing.","m1fname":"Guang","projectname":"Data Analysis for Basketball Matches Outcomes Prediction ","m3fname":"Shu"},{"m2lname":"Wang","m4lname":"","m3uni":"shp2135","m1uni":"cj2452","m4uni":"","pid":"201612-7","m2uni":"jw3316","timestring":"Thu Dec 15 00:12:05 2016","m4fname":"","language":"Python, Flask, PySpark, sqlite, html, JavaScript, Hadoop hdfs","m3lname":"Park","dataset":"data source: Data Attained From the Yelp Dataset Challenge
Consists of Business Data, Review Data, User Data, Check-in Data, Tip Data, and Photos

Business Data
56 Features
Over 80,000 Businesses

Review Data
7 Features
Over 2 GB worth of reviews

We directly downloaded it from Yelp Dataset Challenge Website","m1lname":"Ji","industry":"Retail","analytics":"We trained classification models saved to hdfs, used SQL queries for sqlite database. We used flask to build a web app that implemented google map API.","m2fname":"Jiayi","description":"Objectives:
For business owners, it's important to determine whether a certain location for their business will be profitable.

A profitable location can determine the success or failure of a business due to many factors.
These factors include what type of business they're trying to open in that location, whether the target market is there, what competing businesses there are, and whether other nearby businesses can make the business more profitable or harm it.

Capabilities
Utilizing Big Data, one can do a couple of things.
Determine the success of a business in a certain area utilizing ratings
You can also recommend the type of business that is the most profitable business of a certain location

Innovations:
New business idea for an app

","m1fname":"Chenlu","projectname":"Cooked to Location","m3fname":"Sam"},{"m2lname":"Fang","m4lname":"","m3uni":"sl4039","m1uni":"gy2237","m4uni":"","pid":"201612-84","m2uni":"sf2794","timestring":"Thu Dec 15 00:14:39 2016","m4fname":"","language":"python, spark","m3lname":"Liu","dataset":"We use python to write web scraper to download the data
URL:http://www.basketball-reference.com/
• Data for each Game
• Points of each quarter (overtime if available)
• Total score
• Win/Loss
• Data for each team (per season)
• Win/Loss rate
• Field Goals (& attempts, & percentage)
• 3-point Field Goals (& attempts, & percentage)
• 2-point Field Goals (& attempts, & percentage)
• Free Throws (& attempts, & percentage)
• Rebounds (Offensive & Defensive)
• Assists, Steals, Blocks, Turnover, Fouls
• Total points
• Data for each player
• (Same categories as for each team)","m1lname":"Yang","industry":"Information","analytics":"• Our prediction uses a naive Bayes classification model.
• We extract useful feature vectors for naive Bayes (Rebounds, Assists, Steals, Blocks, ...) and modify each vector based on the team's historical performance and the newest game report.
• The classification label is the result of each game.
• We use Python on Spark to implement the classification.","m2fname":"Shu","description":"Sports result prediction is very popular nowadays, which makes predicting the results of sporting events a new and interesting challenge.
Help sports commentator's work
Help the team assess whether they have reached their expected performance
Help advertisers make decisions

Previous papers concentrated on classifying the original game data, or data modified only with simple feature extraction methods.
We want to do more:
We want to establish a dynamic model based on Naive Bayesian model. And We think that we should modified the feature vector based on daily match result. And also, we want to consider the effect of team member changing.","m1fname":"Guang","projectname":"Data Analysis for Basketball Matches Outcomes Prediction ","m3fname":"Shutian"},{"m2lname":"Xin","m4lname":"","m3uni":"yw2902","m1uni":"tz2307","m4uni":"","pid":"201612-47","m2uni":"qx2156","timestring":"Thu Dec 15 00:27:20 2016","m4fname":"","language":"Python, Hadoop, Spark, AWS","m3lname":"Wang","dataset":"The datasets are provided with 1.5 years of customers behavior data from Santander bank to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as \"credit card\", \"savings account\", etc. We will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind_(xyz)_ult1, which are the columns #25 - #48 in the training data. We will predict what a customer will buy in addition to what they already had at 2016-05-28. The test and train sets are split by time, and public and private leaderboard sets are split randomly. Obtained from kaggle.","m1lname":"Zhou","industry":"Finance","analytics":"Logistic Regression
K-means Clustering
Collaborative Filtering","m2fname":"Qi","description":"To support needs for a range of financial decisions, Santander Bank offers a lending hand to their customers through personalized product recommendations.

Under their current system, a small number of Santander’s customers receive many recommendations while many others rarely see any resulting in an uneven customer experience. We are going to build an application to predict which products their existing customers will use in the next month based on their past behavior and that of similar customers.","m1fname":"Tian","projectname":"Santander Product Recommendation","m3fname":"Yaqing"},{"m2lname":"Fu","m4lname":"","m3uni":"yy2694","m1uni":"xx2243","m4uni":"","pid":"34","m2uni":"wf2223","timestring":"Thu Dec 15 00:46:50 2016","m4fname":"","language":"Sparks","m3lname":"yang","dataset":"Using crawler to collect from http://www.basketball-reference.com/players/
","m1lname":"Xiao","industry":"Media","analytics":"Linear Regression
MLP","m2fname":"Wenyu","description":"According to the policy changes. The league improved the salary cap level of all teams this year, as a result, some of new players has bigger contract than some famous players already signed which means the performance of some players are not qualified for their salary level.
So it is necessary to find the relationship between performance and salary, which can be a reliable basis for determining the fair salary of an NBA player.
","m1fname":"Xuanyu","projectname":"Salary prediction for NBA players","m3fname":"Yang"},{"m2lname":"Fu","m4lname":"","m3uni":"yy2694","m1uni":"xx2243","m4uni":"","pid":"201612-34","m2uni":"wf2223","timestring":"Thu Dec 15 00:48:02 2016","m4fname":"","language":"Sparks","m3lname":"yang","dataset":"Using crawler to collect from http://www.basketball-reference.com/players/
","m1lname":"Xiao","industry":"Media","analytics":"Linear Regression
MLP","m2fname":"Wenyu","description":"According to the policy changes. The league improved the salary cap level of all teams this year, as a result, some of new players has bigger contract than some famous players already signed which means the performance of some players are not qualified for their salary level.
So it is necessary to find the relationship between performance and salary, which can be a reliable basis for determining the fair salary of an NBA player.
","m1fname":"Xuanyu","projectname":"Salary prediction for NBA players","m3fname":"Yang"},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Thu Dec 15 01:17:42 2016","m4fname":"","language":"Spark, Python","m3lname":"","dataset":"http://jmcauley.ucsd.edu/data/amazon/
I have contacted the author for permission.
Any dataset with book id, book title, user id and rating can be supported.","m1lname":"Fu","industry":"Information","analytics":"Alternating Least Squares Algorithm","m2fname":"Yanglu","description":"As everyone may known, there are tons of books published word-widely every year. However, previews of books are not as self-explanatory as trailers of movies, ingredients of foods and appearances of outfits in many circumstances.
There are places that make their recommendations based on the “also bought with” information
A better idea is to base the recommendation on the ratings given by customers.
","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rb3074","m4uni":"","pid":"201612-103","m2uni":"","timestring":"Thu Dec 15 01:47:56 2016","m4fname":"","language":"Python, OpenCL, Spark Python API, and Spark SQL","m3lname":"","dataset":"The data set used for this project is the wikipedia library. This choice lends itself to both relational and graph databases and will allow performance comparisons between the two.","m1lname":"Barghouti","industry":"Information","analytics":"So far, the work completed is relational algebra routines in OpenCL. Depending on what can be completed in the next day or so, the analytics algorithm will be determined.","m2fname":"","description":"OLAP is typically done in cluster computing environments that require sophisticated distributed setups and equally sophisticated software frameworks. As opposed to trasactional processing in which data records are changing constantly and query response times must be as close to real-time as possible, OLAP jobs are performed on very large data sets that are rarely updated (with updates occurring as file-appends only) and often take minutes, even hours to complete

As seen by the success of recent research projects (MapD @ MIT) and other commercial efforts, there is a real need in the data mining community to simplify and accelerate OLAP.

This project aims to use general purpose graphics processing unit (GPGPU) programming techniques to demonstrate a generic SQL-compliant simulation of OLAP on a single node. The goal is to understand the feasibility of such an approach, including answering some of the following questions: (1) what sort of performance gains can be obtained; (2) are these gains large enough so as to make interactive OLAP a reality; (3) what size data sets can be processed with reasonable response times; and (4) are the response times small enough to justify a nonfault-tolerant approach.","m1fname":"Rashad","projectname":"Single-Node, GPU-Accelerated SQL Compliant Online Analytics Processing (OLAP), with Subjective Comparisons to System G. Traversals","m3fname":""},{"m2lname":"Yang","m4lname":"","m3uni":"yj2415","m1uni":"jz2778","m4uni":"","pid":"201612-100","m2uni":"ry2310","timestring":"Thu Dec 15 05:52:10 2016","m4fname":"","language":"Hadoop, Hive, Spark, Python, Matlab, CUDA","m3lname":"Jiang","dataset":"We scrape the data from Internet by ourselves.
We can support any well-formatted data as long as it contains sufficient information.","m1lname":"Zhao","industry":"Information","analytics":"1. Data Scraping
2. Artificial Neural Network
3. Parallel Computing","m2fname":"Runqing","description":"Predictions about the Grammy is a hot topic around December every year, we are willing to give our own predictions about Grammy based on big data analytics. Besides, Winning the Grammy Award may bring huge profit, mainly about advertising revenue. Studying the potential pattern of the awards like “Best New Artist” can help agents contact new artists in advance.

The main contributions of this project are: 1. a comprehensive prediction about 2017 Grammy Awards, especially Song of the Year. 2. Using Parallel Computing on GPU to accelerate processing speed.","m1fname":"Jianqiao","projectname":"Predictions about the 2017 Grammy Awards","m3fname":"Yi"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jmz2131","m4uni":"","pid":"201612-88","m2uni":"","timestring":"Thu Dec 15 06:51:59 2016","m4fname":"","language":"Spark, Java, PostgreSQL, Ruby","m3lname":"","dataset":"Citi Bike publishes real-time station data (updated several times a minute) at https://feeds.citibikenyc.com/stations/stations.json. Unfortunately, this data is not available for download in aggregate. In order to obtain my data set, I polled and stored the real-time station data once every 10 minutes for about 6 weeks. I ended up with about 400 MB of data.","m1lname":"Zhao","industry":"Transportation","analytics":"Spark MapReduce, clustering","m2fname":"","description":"Citi Bike is a popular bike-share program with stations throughout NYC that a rider can pick up a bike from, or return a bike to. Previous projects on Citi Bike have analyzed popular stations and trip start/end locations, but have not addressed issues with availability and capacity of each station. Particularly at more popular locations, such as in Midtown Manhattan, there are sometimes no available bikes during popular time periods. Conversely, there are sometimes no available docking stations (for bikes to be returned to) at less popular stations. 
For potential Citi Bike users that are considering getting a Citi Bike membership, as well as for current Citi Bike members, it would be extremely useful to be able to predict their chances of getting a bike and/or finding an available docking station, given a day of week/time of day.","m1fname":"Jessica","projectname":"Citi Bike Availability Predictor","m3fname":""},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Thu Dec 15 10:01:26 2016","m4fname":"","language":"PySpark, R","m3lname":"","dataset":"http://jmcauley.ucsd.edu/data/amazon/
I contacted the author for permission","m1lname":"Fu","industry":"Information","analytics":"ALS algorithm","m2fname":"Yanglu","description":"Tons of books are published every year. However, previews of books are not as self-explanatory as trailers of movies, ingredients of foods and appearances of outfits in many circumstances.
There are places that make their recommendations based on the “also bought with” information
A better idea is to base the recommendation on the ratings given by customers.
","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Thu Dec 15 10:05:54 2016","m4fname":"","language":"PySpark, R","m3lname":"","dataset":"http://jmcauley.ucsd.edu/data/amazon/
I asked the author for permission to use this dataset
","m1lname":"Fu","industry":"Information","analytics":"ALS","m2fname":"Yanglu","description":"Tons of books are published every year. However, previews of books are not as self-explanatory as trailers of movies, ingredients of foods and appearances of outfits in many circumstances.
There are places that make their recommendations based on the “also bought with” information
A better idea is to base the recommendation on the ratings given by customers.","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Thu Dec 15 10:13:30 2016","m4fname":"","language":"PySpark, R","m3lname":"","dataset":"http://jmcauley.ucsd.edu/data/amazon/
I contacted Julian McAuley for permission to use his dataset.
Any dataset with user id, book id, book titles and ratings can be supported.
","m1lname":"Fu","industry":"Information","analytics":"ALS","m2fname":"Yanglu","description":"Tons of books are published every year. However, previews of books are not as self-explanatory as trailers of movies, ingredients of foods and appearances of outfits in many circumstances.
There are places that make their recommendations based on the “also bought with” information
A better idea is to base the recommendation on the ratings given by customers.
","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"Saxena","m4lname":"","m3uni":"avi2111","m1uni":"dgm2138","m4uni":"","pid":"201612-89","m2uni":"hs2873","timestring":"Thu Dec 15 10:14:56 2016","m4fname":"","language":"Spark, IBM System G, Hadoop","m3lname":"Iyappan","dataset":"We will be using publicly available dataset from openpayments.cms.gov

The dataset that we will be using for this project consists of:
1. Payments paid directly to physicians and teaching hospitals
2. Payments indirectly to physicians and teaching hospitals through an intermediary such as medical specialty society.
3. Designated by physicians or teaching hospitals to be paid to another party.
","m1lname":"Motwani","industry":"Information","analytics":"k-means clustering.
MLlib in Spark
Hadoop for processing
D3 for visualization","m2fname":"Harshit ","description":"As per government regulations, the Centers for Medicare and Medicaid Services are required to collect and display information reported by manufacturers and group purchasing organizations about payments and other transfers of value to physicians and teaching hospitals.

In this project we will be using this dataset to cluster the various sources of payment for hospitals and other medical organizations.

The dataset consists of reporting period for past 3 years and we will be comparing the change in data and its parameters over the period of three years. ","m1fname":"Dhruv","projectname":"Relationship Between Phyicians and Drug Manufacturers","m3fname":"Abhinav"},{"m2lname":"Zhou","m4lname":"","m3uni":"cw2962","m1uni":"gl2548","m4uni":"","pid":"201612-85","m2uni":"jz2792","timestring":"Thu Dec 15 12:16:00 2016","m4fname":"","language":"Tool: Spark, MySQL, Python, Flask, GoogleMap Api","m3lname":"Wang","dataset":"Dataset:
Airports, air routes dataset from http://openflights.org/data.html
Air carrier statistics from http://www.transtats.bts.gov/","m1lname":"Li","industry":"Transportation","analytics":"Analytic methods:
We process the airports, airline and flight history datasets with various SparkSQL queries and operations to construct the graph which can best represent the airline reachablility between cities. Transfers are taken into consideration to make it a complete graph. Greedy algorithm, Genetic algorithm and simulated annealing are implemented and evaluated according to its running time and effectiveness. We utilize simulated annealing algorithm to realize the real-time route generation with an appropriate parameter settings to gain the balance between optimality and running time. The results and user-defined route generator are presented in a webpage, built with Flask and embedded with google map.","m2fname":"Jin","description":"Motivation:
Almost everyone of us has the dream of traveling around the world, just like Marco Polo and Magellan. With the help of the widespread airline networks and easy access to the travel info, people start to think how to finish the global traveling with minimum total mileage, with a satisfactory airline and with more fun. Therefore, we want to utilize a variety of airline datasets to plan the multi-city travel itinerary for the modern Marco Polo.
Objectives:
Find an efficient and effective algorithm to build an itinerary to visit most of the capital cities in the world with minimum flying mileage; build the flight routes using a specific airline for the big fan of American Airlines, Air France and so on; enable the itinerary programming with the user-defined city-list to maximize the planner’s flexibility.","m1fname":"Gongqian","projectname":"Flying Marco Polo","m3fname":"Chong"},{"m2lname":"AN","m4lname":"","m3uni":"","m1uni":"xh2301","m4uni":"","pid":"201612-72","m2uni":"ya2345","timestring":"Thu Dec 15 12:38:09 2016","m4fname":"","language":"Java, Python ","m3lname":"","dataset":"The dataset is from IBM. It contains 10,000 images of various roads conditions. ","m1lname":"HUA","industry":"Transportation","analytics":" Faster R-CNN, Caffe","m2fname":"YU","description":"Our objective is to use Faster R-CNN algorithm to train a model that is able to detect pedestrians and vehicles on the road. We hope it can be used on driver-less cars. ","m1fname":"XIANG","projectname":"Pedestrian & Vehicle Recognition ","m3fname":""},{"m2lname":"AN","m4lname":"","m3uni":"","m1uni":"xh2301","m4uni":"","pid":"201612-72","m2uni":"ya2345","timestring":"Thu Dec 15 12:39:46 2016","m4fname":"","language":"Java, Python ","m3lname":"","dataset":"The dataset is from IBM. It contains 10,000 images of various roads conditions. ","m1lname":"HUA","industry":"Transportation","analytics":" Faster R-CNN, Caffe","m2fname":"YU","description":"Our objective is to use Faster R-CNN algorithm to train a model that is able to detect pedestrians and vehicles on the road. We hope it can be used on driver-less cars. 
","m1fname":"XIANG","projectname":"Pedestrian & Vehicle Recognition ","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"jmz2131","m4uni":"","pid":"201612-88","m2uni":"","timestring":"Thu Dec 15 14:08:15 2016","m4fname":"","language":"Spark, Java, PostgreSQL, Ruby","m3lname":"","dataset":"Citi Bike publishes real-time station data (updated several times a minute) at https://feeds.citibikenyc.com/stations/stations.json. Unfortunately, this data is not available for download in aggregate. In order to obtain my data set, I polled and stored the real-time station data once every 10 minutes for about 6 weeks. I ended up with about 400 MB of data.","m1lname":"Zhao","industry":"Transportation","analytics":"Spark, nearest neighbor, clustering","m2fname":"","description":"Citi Bike is a popular bike-share program with stations throughout NYC that a rider can pick up a bike from, or return a bike to. Previous projects on Citi Bike have analyzed popular stations and trip start/end locations, but have not addressed issues with availability and capacity of each station. Particularly at more popular locations, such as in Midtown Manhattan, there are sometimes no available bikes during popular time periods. Conversely, there are sometimes no available docking stations (for bikes to be returned to) at less popular stations. For potential Citi Bike users that are considering getting a Citi Bike membership, as well as for current Citi Bike members, it would be extremely useful to be able to predict their chances of getting a bike and/or finding an available docking station, given a day of week/time of day.","m1fname":"Jessica","projectname":"Citi Bike Availability Predictor","m3fname":""},{"m2lname":"Jain","m4lname":"","m3uni":"jy2799","m1uni":"cx2187","m4uni":"","pid":"201612-64","m2uni":"pj2313","timestring":"Thu Dec 15 14:59:11 2016","m4fname":"","language":"Python, D3.js","m3lname":"Ying","dataset":"1 - Uber/FHV Pickups in New York. 
We searched several datasets online and found this most suitable for our project.
2 - Taxi Trip Data. We searched several datasets online and found this most suitable for our project.
","m1lname":"Xiong","industry":"Transportation","analytics":"We will use D3.js for data visualization, and Python Math libraries for statistical analysis. Besides, we used Spark to do mapreduce for counting how many services in a specific borough/district in New York.","m2fname":"Pulkit","description":"Objectives: We wish to be able to quantify the effects of Uber and other online taxi sharing services, and FHV on the previously operated Yellow Taxi services in New York.

Innovations: There are many projects that do comparisons between Uber and Taxi, but we add comparisons between FHV and Uber.

Expected outcome: We expect to find a drastic change in user travel patterns and taxi usage before and after the arrival of Uber in NYC. Broadly, taxi usage should have dropped as Uber usage increased over time.

Importance: Determining this effect gives us insight into how the advent of Uber has changed the way people travel and the taxi industry in general.
","m1fname":"Chuhan","projectname":"Comparing Uber with its competition in similar online services and older taxis ","m3fname":"Jiefu"},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"td2488","m4uni":"","pid":"90","m2uni":"cl3406","timestring":"Thu Dec 15 15:28:18 2016","m4fname":"","language":"Python/Java Platform:hadoop/Spark","m3lname":"","dataset":"The data set was made on the basis of discharge YASP 3.5 Million Data Dump (http://academictorrents.com/details/5c5deeb6cfe1c944044367d2e7465fd8bd2f4acf) replays games Dota 2 site yasp.co (http://yasp.co/) . During unloading thank Albert Cui and Howard Chung and Nicholas Hanson -Holtry. The license for the discharge of: CC BY-SA 4.0. ","m1lname":"Ding","industry":"Media","analytics":"Analytics: Statistical methods(R, Spark, Mahout, Python)
Logistic regression
SVM
Bayesian analysis
Deep learning (MATLAB)
Neural networks
Hidden Markov Model
Recommendation Model (Mahout, Spark, Hadoop)
Winners/Losers may share common features
Computer studies the data and “recommends” winners ","m2fname":"Chi","description":"Dota 2 - multiplayer computer game MOBA genre. Players play each other matches. In each match, as a rule, involved 10 people. Matches are generated from a queue, taking into account the level of the game all players. Before the game, players are automatically divided into two five-man teams. One team plays for the bright side (The Radiant), the other - for the dark (The Dire). The goal of each team to destroy the main building of the base of the enemy - the throne.

Each player controls one character, the hero, and the same hero cannot appear twice in one match. Players take turns picking heroes, following a tricky procedure (Captains Mode).

Each match starts from scratch, i.e., a player's history does not directly affect the current match. Players also cannot make the game easier for themselves by paying real money. Thus, after the heroes are selected, the outcome of the match depends on the actions and experience of all players.

Heroes differ in their characteristics and abilities, and a team's success depends largely on the combination of selected heroes. As part of the event you are invited to predict the winning team (Radiant or Dire), knowing only the set of heroes that the players chose.

Heroes running around the map, fight, etc. You can watch matches on the visualization Wisdota service. ","m1fname":"Tianhong","projectname":"Dota 2: Win Probability Prediction","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"td2488","m4uni":"","pid":"201605-90","m2uni":"cl3406","timestring":"Thu Dec 15 15:44:42 2016","m4fname":"","language":" Language:Python/Java Platform:hadoop/Spark","m3lname":"","dataset":"Dataset: The feature csv file from the Kaggle competition website.

The data set is based on the YASP 3.5 Million Data Dump (http://academictorrents.com/details/5c5deeb6cfe1c944044367d2e7465fd8bd2f4acf) of Dota 2 match replays from the site yasp.co (http://yasp.co/). Thanks to Albert Cui, Howard Chung and Nicholas Hanson-Holtry for the dump. License of the dump: CC BY-SA 4.0.

The original dump of matches was cleaned; the provided set contains only matches:
played from 05/01/2015 to 17/12/2015
lasting at least 15 minutes
with complete information (matches with incomplete information, e.g. missing player information, were removed)
A random 15% of the dataset records were assigned to the test set.
To discourage contestants on Kaggle from reaching high places by cheating (for example, by downloading the original data set and peeking at the answers for the test matches), we applied minimal obfuscation to the data, i.e., slightly scrambled the dataset:
changed match identifiers
shifted the start of each match by a normally distributed random value with a standard deviation of 1 day","m1lname":"Ding","industry":"Media","analytics":"Analytics: Statistical methods(R, Spark, Mahout, Python)
Logistic regression
SVM
Bayesian analysis
Deep learning (MATLAB)
Neural networks
Hidden Markov Model
Recommendation Model (Mahout, Spark, Hadoop)
Winners/Losers may share common features
Computer studies the data and “recommends” winners ","m2fname":"Chi","description":"Dota 2 - multiplayer computer game MOBA genre. Players play each other matches. In each match, as a rule, involved 10 people. Matches are generated from a queue, taking into account the level of the game all players. Before the game, players are automatically divided into two five-man teams. One team plays for the bright side (The Radiant), the other - for the dark (The Dire). The goal of each team to destroy the main building of the base of the enemy - the throne.

Each player controls one character, the hero, and the same hero cannot appear twice in one match. Players take turns picking heroes, following a tricky procedure (Captains Mode).

Each match starts from scratch, i.e., a player's history does not directly affect the current match. Players also cannot make the game easier for themselves by paying real money. Thus, after the heroes are selected, the outcome of the match depends on the actions and experience of all players.

Heroes differ in their characteristics and abilities, and a team's success depends largely on the combination of selected heroes. As part of the event you are invited to predict the winning team (Radiant or Dire), knowing only the set of heroes that the players chose.

Heroes running around the map, fight, etc. You can watch matches on the visualization Wisdota service. ","m1fname":"Tianhong","projectname":"Dota 2: Win Probability Prediction","m3fname":""},{"m2lname":"Li","m4lname":"","m3uni":"","m1uni":"td2488","m4uni":"","pid":"201612-90","m2uni":"cl3406","timestring":"Thu Dec 15 15:47:45 2016","m4fname":"","language":"Python/Java Platform:hadoop/Spark","m3lname":"","dataset":"The data set was made on the basis of discharge YASP 3.5 Million Data Dump (http://academictorrents.com/details/5c5deeb6cfe1c944044367d2e7465fd8bd2f4acf) replays games Dota 2 site yasp.co (http://yasp.co/) . During unloading thank Albert Cui and Howard Chung and Nicholas Hanson -Holtry. The license for the discharge of: CC BY-SA 4.0.
","m1lname":"Ding","industry":"Media","analytics":"Analytics: Statistical methods(R, Spark, Mahout, Python)
Logistic regression
SVM
Bayesian analysis
Deep learning (MATLAB)
Neural networks
Hidden Markov Model
Recommendation Model (Mahout, Spark, Hadoop)
Winners/Losers may share common features
Computer studies the data and “recommends” winners ","m2fname":"Chi","description":"Dota 2 - multiplayer computer game MOBA genre. Players play each other matches. In each match, as a rule, involved 10 people. Matches are generated from a queue, taking into account the level of the game all players. Before the game, players are automatically divided into two five-man teams. One team plays for the bright side (The Radiant), the other - for the dark (The Dire). The goal of each team to destroy the main building of the base of the enemy - the throne.

Each player controls one character, the hero, and the same hero cannot appear twice in one match. Players take turns picking heroes, following a tricky procedure (Captains Mode).

Each match starts from scratch, i.e., a player's history does not directly affect the current match. Players also cannot make the game easier for themselves by paying real money. Thus, after the heroes are selected, the outcome of the match depends on the actions and experience of all players.

Heroes differ in their characteristics and abilities, and a team's success depends largely on the combination of selected heroes. As part of the event you are invited to predict the winning team (Radiant or Dire), knowing only the set of heroes that the players chose.

Heroes run around the map, fight, and so on. You can watch matches on the Wisdota visualization service.
","m1fname":"Tianhong ","projectname":"Dota 2: Win Probability Prediction","m3fname":""},{"m2lname":"Han","m4lname":"","m3uni":"","m1uni":"hy2457","m4uni":"","pid":"201612-105","m2uni":"xh2256","timestring":"Thu Dec 15 16:34:49 2016","m4fname":"","language":"Java/Python/Hadoop/Spark","m3lname":"","dataset":"Dataset: Any image datasets can do.","m1lname":"Yan","industry":"Media","analytics":"KMeans","m2fname":"Xi","description":"Use Spark and Hadoop to set up a large-scale online image searching engine.
Expected outcome: when you upload an image, the system returns similar images from the database. We hope to keep the latency below 1 second.
Importance: If you want to buy something online, you might not know the merchandise name, then you can search by image.","m1fname":"Hang","projectname":"Large scale real-time similar image search","m3fname":""},{"m2lname":"Qi","m4lname":"","m3uni":"rp2815","m1uni":"mz2594","m4uni":"","pid":"201612-63","m2uni":"yq2211","timestring":"Thu Dec 15 16:42:37 2016","m4fname":"","language":"Java, Python, Hadoop, Mahout, IBM System G","m3lname":"Peng","dataset":"Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.

This is organized as three tables:

Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
Tags contains the tags on each of these questions

website: https://www.kaggle.com/stackoverflow/stacksample ","m1lname":"Zheng","industry":"Information","analytics":"K-means Clustering, Classification, Item-Based Recommendation, Data Visualization
","m2fname":"Yi","description":"Objectives: Create a Q&A assistant for Stack Overflow community

Outcome:
1. Realize Recommendation for users according to their areas of expertise.
2. Achieve auto classification for questions and identify tags from question text.
3. Cluster questions according to tags or title.
4. Set up Q&A relation graph for data visualization. ","m1fname":"Mingyang","projectname":"Stack Overflow Q&A assistant","m3fname":"Ruxue"},{"m2lname":"Singh","m4lname":"","m3uni":"am4590","m1uni":"vm2486","m4uni":"","pid":"201605-83","m2uni":"ss516","timestring":"Thu Dec 15 19:18:58 2016","m4fname":"","language":"Python(numpy, pandas, scikit-learn), Spark, Databricks, SystemG, Ubuntu 14.04","m3lname":"Mudgal","dataset":"The data represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, files are separated by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken.","m1lname":"Mahajan","industry":"Social Science-Government","analytics":"Naive bayes, random forest, Gradient boosting, Multi-layer Perceptron","m2fname":"Sheallika","description":"To bring down the cost of manufacturing it is imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. We are faced with the task of predicting the internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable the production company to bring quality products at lower costs to the end user.
The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).","m1fname":"Vibhuti","projectname":"Reducing Manufacturing Failures","m3fname":"Aayus"},{"m2lname":"Singh","m4lname":"","m3uni":"am4590","m1uni":"vm2486","m4uni":"","pid":"201612-83","m2uni":"ss516","timestring":"Thu Dec 15 19:29:28 2016","m4fname":"","language":"Python(numpy, pandas, scikit-learn), Spark, Databricks, SystemG, Ubuntu 14.04","m3lname":"Mudgal","dataset":"The data represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, files are separated by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken","m1lname":"Mahajan","industry":"Social Science-Government","analytics":"Naive bayes, random forest, Gradient boosting, Multi-layer Perceptron","m2fname":"Sheallika","description":"To bring down the cost of manufacturing it is imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. We are faced with the task of predicting the internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable the production company to bring quality products at lower costs to the end user.
The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).","m1fname":"Vibhuti","projectname":"Reducing Manufacturing Failures","m3fname":"Aayush"},{"m2lname":"Singh","m4lname":"","m3uni":"am4590","m1uni":"vm2486","m4uni":"","pid":"201612-83","m2uni":"ss5136","timestring":"Thu Dec 15 19:30:56 2016","m4fname":"","language":"Python(numpy, pandas, scikit-learn), Spark, Databricks, SystemG, Ubuntu 14.04","m3lname":"Mudgal","dataset":"The data represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939.

On account of the large size of the dataset, files are separated by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken","m1lname":"Mahajan","industry":"Social Science-Government","analytics":"Naive bayes, random forest, Gradient boosting, Multi-layer Perceptron","m2fname":"Sheallika","description":"To bring down the cost of manufacturing it is imperative to ensure that the recipes for the production of its advanced mechanical components are of the highest quality and safety standards. We are faced with the task of predicting the internal failures using thousands of measurements and tests made for each component along the assembly line. This would enable the production company to bring quality products at lower costs to the end user.
The goal is to predict which parts will fail quality control (represented by a 'Response' = 1).","m1fname":"Vibhuti","projectname":"Reducing Manufacturing Failures","m3fname":"Aayush"},{"m2lname":"Wang ","m4lname":"","m3uni":"wl2575","m1uni":"gx2127","m4uni":"","pid":"201612-68","m2uni":"yw2768","timestring":"Mon Dec 19 18:52:10 2016","m4fname":"","language":"Hadoop, Spark, Mahout, python","m3lname":"Lin","dataset":"We use a dataset on Black Friday purchase records from Analytics Vidhya.
https://datahack.analyticsvidhya.com/contest/black-friday/","m1lname":"Xu","industry":"Retail","analytics":"Collaborative Filtering Recommendation
","m2fname":"Yuntong ","description":"Our objective is to analyze the previous Black Friday purchase records and dig out some interesting trends or features. These conclusions could be useful for merchants to have a better idea of what kind of merchandises are more popular in certain groups of people and help the sellers to reach maximum Black Friday profits.","m1fname":"Guowei","projectname":"Black Friday Merchandise Analytics and Prediction ","m3fname":"Woye "},{"m2lname":"Li","m4lname":"","m3uni":"yl3407","m1uni":"lz2484","m4uni":"","pid":"201612-10","m2uni":"hl2915","timestring":"Mon Dec 19 22:59:30 2016","m4fname":"","language":"Language: Python, jupyter, pyspark , Matlab","m3lname":"Li","dataset":"Dataset: We used Raw NYC Taxi Trip Data that has been available to public
We processed around 4.1 million records from March 2015 to May 2015, and May 2016
","m1lname":"Zhou","industry":"Transportation","analytics":"Analytics: Random Forest Algorithm,
Linear Regression,
Carto db ","m2fname":"Hongyi","description":"Save $$ money and time on travel
Recommend optimal traveling plans for users
Predict traveling costs based on model

Meaning:
Users could use the prediction to avoid taking a taxi during rush hours at their specific location
The Police Department could use this prediction on a daily basis to deploy officers to ease traffic
Users could use the cost prediction model to estimate the cost of traveling at a specific time
Taxi companies could use the prediction model to develop long-term policies for taxi distribution

","m1fname":"Liutong","projectname":"Travel Planing System","m3fname":"Yiting "},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201612-95","m2uni":"","timestring":"Wed Dec 21 10:32:01 2016","m4fname":"","language":"IBM System G Graph Tools, Amazon AWS EC2, gShell, R, Python, Bash, Ubuntu 14.04, Mac OSX El Capitan, Windows 7","m3lname":"","dataset":"Yelp Academic Dataset Challenge Round 8:
- 2.7 Million Reviews
- 648k Tips
- 86k Businesses
- 10 Cities
- 687k Users
- 4.2 M Social Edges ","m1lname":"Hasbamrer","industry":"Information","analytics":"Created:
- Graph visualization of yelp businesses and users.
- Edges show connection between users and businesses.
- Showed relative popularity of businesses with node size based on PageRank.","m2fname":"","description":"Yelp users traditionally interact with business listings by looking at star ratings and reading other user’s reviews. However, manually scanning through pages of ratings and reviews isn’t scalable. A city like Pittsburgh can contain over thousands of business listings and tens of thousands of reviews. I propose a network graph visualization of the Yelp social recommender network that uses PageRank relative node size to illustrate business importance and influence. To demonstrate the benefits of graph visualization, I created sample graphs using data from the Yelp Academic Dataset Round 8.

Previous work on Yelp Academic Dataset:
- Review text sentiment analysis.
- Restaurant recommendations with ML.
- Circle graph and heat map visualization.
- However, no visualization of business and user connection.

Current work: Graph visualization
- Show how businesses and users are connected.
- Help businesses identify consumer influence.
- Identify business competition.
- Help consumers quickly identify business popularity.
- Easier to look at a graph than read a data table. ","m1fname":"Nond","projectname":"Graph Visualization and PageRank of the Yelp Social Recommender Network","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rch2130","m4uni":"","pid":"201612-41","m2uni":"","timestring":"Thu Dec 22 14:03:04 2016","m4fname":"","language":"System G, Jupyter iPython notebooks with Spark, Mathematica","m3lname":"","dataset":"The text corpora of 11 19th-century American politicians were tested for the presence ofALL 7-word n-grams from 4 foundational texts, leading to a total dataset of approximately 1GB.

The American politicians and orators were: Alexander Stephens, Abraham Lincoln, Charles Sumner, Daniel Webster, Edward Everett, William Seward, John Calhoun, Henry Clay, Wendell Phillips, Andrew Jackson, and Frederick Douglass. The text files were from Archive.org (.txt files from .pdf scans, so not 'curated') and Gutenberg.org (with text files 'cleaned' and curated). ","m1lname":"Hill","industry":"Social Science-Government","analytics":"Search algorithms operating in parallel over network with Mathematica, similar algorithms with Spark on Jupyter, and visualization using System G.","m2fname":"","description":"Scholars often seek to measure and model the influence of
earlier texts on later texts. They usually do so in qualitative,
selective ways, leading to, for example, selection bias in
choice of data. Put simply, in the humanities data is limited
to the amount of information that a single scholar can process
‘in memory’- on his/her own, without the aid of computation.

On the other hand, scholarly ingenuity is near-limitless. The
result is that there are too many degrees of freedom, and
scholarly work in the humanities often has the characteristics
of ‘overfitting’- poor performance outside of the training set,
lack of model-portability to other contexts, and unnecessary
complexity in the models produced.

To be blunt, these overfitting side-effects, familiar to
those of us with experience in statistics or machine learning,
have three attendant effects:

1) results and interpretations are seen as idiosyncratic and
untrustworthy, lacking predictive power;

2) scholars appear overly specialized and narrow in their
interests, since their models do not transfer to other contexts;

3) humanities scholarship seems purposefully opaque and
needlessly mysterious to outsiders, because models are overly
complex.

The answer to many of these issues, according to
scholars in the emerging ‘digital humanities’ fields and,
perhaps more importantly, is a computational approach to
humanistic work.

However, scholars in history, literature, classics, art, etc. rarely
have rigorous technical training. The result: much work in the
digital humanities focuses only on digitization, rather than
analysis. Much work is amateurish.

This project aims to:

Create a ‘sandbox’ for work in the digital humanities, setting
up Jupyter notebooks running Spark kernels, in order to
demonstrate the potential of digital methods for humanists.

Employ Big Data approaches, like parallelization/distributed
computing, to solve the ‘influence’ problem mentioned above,
rigorously identifying and quantifying quotation between texts.

Use IBM System G to visualize networks of shared influence,
here concentrating on leading political figures of the 19th-
century for whom large text corpora are readily available.

","m1fname":"Robert","projectname":"Building a Big Data Digital Humanities Workspace: 19th-century American history with System G","m3fname":""},{"m2lname":"Xu","m4lname":"","m3uni":"yt2549","m1uni":"zs2324","m4uni":"","pid":"201612-56","m2uni":"hx2208","timestring":"Thu Dec 22 15:48:26 2016","m4fname":"","language":"Spark, Python","m3lname":"Tie","dataset":"Firstly we use Yelp Api to get different Business id, then we write a scarpy to get the data we want by making request which combined with business id directly.
After our IP was banned, we combined the data we had scraped with Scrapy with another Yelp dataset that we downloaded from the Yelp Dataset Challenge round 8.
We collected only 500 MB from Yelp.
To satisfy the requirement of 1G dataset, we use Yelp","m1lname":"Song","industry":"Social Science-Government","analytics":"Collaborative filtering with ALS

Linear Regression

K-Nearest Neighbor

WordCloud2
","m2fname":"Hao","description":"Yelp has been more and more prevalent in making recommendations for people’s daily life, like where to eat or where to shop, however, little has been done regards making recommendations for who they could go with. We proposed that eating mate could be recommended based on the review history of each user. Therefore, we collected review data from Yelp to make eating mate recommendations, which included a collaborative filtering of the review data, linear regression on a single user and business data, and finding K nearest neighbors of each user. The experiment on data points confirms that it is a good way to make recommendations for eating mate. Since the current experiment is based on review data on partial cities in the United States, more tests are needed for other cities, and in order to have a better representation for recommendation, a User Interface is needed in the future.

Yelp is a platform for sharing reviews and making recommendations. On the one hand, Yelp users can submit reviews of shops and restaurants using a one- to five-star rating. On the other hand, Yelp makes recommendations for users based on their review history. McNichol states Yelp has become “one of the most important sites on the Internet” [1]. According to statistics provided by Yelp in 2016, it had about 102 million total reviews and a monthly average of 92 million unique visitors in the second quarter of 2016 [2].
Though Yelp is strong at helping people find a place to eat, it offers little for finding an eating mate. This is a great business loss, since in most circumstances people would rather not eat alone, no matter how good the business is.
Given this situation, this project aims to recommend people with similar preferences to have a meal together, based on each individual’s review history. We aim to dig into the tastes hidden behind each user’s
reviews and, hopefully, to analyze the taste distribution of each city.
Finally, the application should work as follows: 1) A new user answers some questions so that we can learn their preferences and recommend eating mates based on the answers. 2) For an existing user, we make recommendations directly from their rating history. The final output would be several eating mates and corresponding restaurants.
","m1fname":"Zehao","projectname":"Eating Mate Recommendation for Yelp Users","m3fname":"Yutong"},{"m2lname":"Xu","m4lname":"","m3uni":"yt2549","m1uni":"zs2324","m4uni":"","pid":"201612-56","m2uni":"hx2208","timestring":"Thu Dec 22 16:47:59 2016","m4fname":"","language":"Spark, Python","m3lname":"Tie","dataset":"Firstly we use Yelp Api to get different Business id, then we write a scarpy to get the data we want by making request which combined with business id directly.
After our IP was banned, we combined the data we had scraped with Scrapy with another Yelp dataset that we downloaded from the Yelp Dataset Challenge round 8.
We collected only 500 MB from Yelp.
To satisfy the requirement of 1G dataset, we use Yelp Official Data Challenge Data Set as supplement, which is more than 1G. ","m1lname":"Song","industry":"Social Science-Government","analytics":"Collaborative filtering with ALS

Linear Regression

K-Nearest Neighbor

WordCould2 ","m2fname":"Hao","description":"Yelp has been more and more prevalent in making recommendations for people’s daily life, like where to eat or where to shop, however, little has been done regards making recommendations for who they could go with. We proposed that eating mate could be recommended based on the review history of each user. Therefore, we collected review data from Yelp to make eating mate recommendations, which included a collaborative filtering of the review data, linear regression on a single user and business data, and finding K nearest neighbors of each user. The experiment on data points confirms that it is a good way to make recommendations for eating mate. Since the current experiment is based on review data on partial cities in the United States, more tests are needed for other cities, and in order to have a better representation for recommendation, a User Interface is needed in the future.

Yelp is a platform for sharing reviews and making recommendations. On the one hand, Yelp users can submit reviews of shops and restaurants using a one- to five-star rating. On the other hand, Yelp makes recommendations for users based on their review history. McNichol states Yelp has become “one of the most important sites on the Internet” [1]. According to statistics provided by Yelp in 2016, it had about 102 million total reviews and a monthly average of 92 million unique visitors in the second quarter of 2016 [2].
Though Yelp is strong at helping people find a place to eat, it offers little for finding an eating mate. This is a great business loss, since in most circumstances people would rather not eat alone, no matter how good the business is.
Given this situation, this project aims to recommend people with similar preferences to have a meal together, based on each individual’s review history. We aim to dig into the tastes hidden behind each user’s
reviews and, hopefully, to analyze the taste distribution of each city.
Finally, the application should work like: 1). For a new user, we would like this user answer some question so we can know about his preference and make eating mate recommendation based on his answers. 2). For an existed user, we would make recommendation directly based on his rating history. The final out would be several eating mates and corresponding restaurant. ","m1fname":"Zehao","projectname":"Eating Mate Recommendation for Yelp Users","m3fname":"Yutong"},{"m2lname":"Yuan","m4lname":"","m3uni":"xy2282","m1uni":"xj2178","m4uni":"","pid":"201612-98","m2uni":"jy2736","timestring":"Thu Dec 22 17:09:51 2016","m4fname":"","language":"Hadoop, Python, Spark, Matlab, html/css, Jquery, D3.js, AngularJS, Google map API","m3lname":"Yu","dataset":"Our dataset is extracted from the NYC government website, the link is below: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
","m1lname":"Ji","industry":"Transportation","analytics":"ALGORITHMS
Clustering (GMM)
Random Forest
Cross Validation

VISUALIZATION
D3.js
AngularJs
Google map API ","m2fname":"Jingyi","description":"Traditionally, Taxi drivers pick up customers randomly. Our project built a web application telling the drivers real-time demands of Taxi cars in a specific location.

The results can help the government better allocate the NYC Taxi car flow, help the drivers increase their revenue and shrink the customer's average waiting time.

Main Functionalities:
1. Compute the feature importance of each factor that may influence taxi demand using the Random Forest algorithm.
2. Plot a real-time density map on Google Maps to visualize taxi demand.
3. Use a Gaussian mixture model to compute several high-demand clusters for taxis and visualize the clusters with the Google Maps API.
4. Build a web application that predicts taxi demand in the vicinity based on the timezone and location.
5. Use D3.js building interactive UI to visualize the overall trend of taxi demands changing with time and location. ","m1fname":"Xiangbing","projectname":"NYC Taxi Data(green car) Analysis","m3fname":"Xinzhe"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201612-95","m2uni":"","timestring":"Thu Dec 22 19:02:45 2016","m4fname":"","language":"IBM System G Graph Tools, Amazon AWS EC2, gShell, R, Python, Bash, Ubuntu 14.04, Mac OSX El Capitan, Windows 7 ","m3lname":"","dataset":"Yelp Academic Dataset Challenge Round 8:
- 2.7 Million Reviews
- 648k Tips
- 86k Businesses
- 10 Cities
- 687k Users
- 4.2 M Social Edges
- JSON format","m1lname":"Hasbamrer","industry":"Information","analytics":"Created:
- Graph visualization of yelp businesses and users.
- Edges show connection between users and businesses.
- Showed relative popularity of businesses with node size based on PageRank. ","m2fname":"","description":"Yelp users traditionally interact with business listings by looking at star ratings and reading other users’ reviews. However, manually scanning through pages of ratings and reviews isn’t scalable. A city like Pittsburgh can contain thousands of business listings and tens of thousands of reviews. I propose a network graph visualization of the Yelp social recommender network that uses PageRank-based relative node size to illustrate business importance and influence. To demonstrate the benefits of graph visualization, I created sample graphs using data from the Yelp Academic Dataset Round 8.

Yelp data volume and understanding:
-115 million reviews as of Q3 2016.
-174 million unique visitors per month.
-Ex. Mon Ami Gabi – 6,300 reviews.
-Manual scanning doesn’t scale.

Previous work on Yelp Academic Dataset:
-Review text sentiment analysis.
-Restaurant recommendations with ML.
-Word cloud, circle graph, and heat map visualization.
-However, no visualization of business and user connection.

Current work: Graph visualization
-Show how businesses and users are connected.
-Help businesses identify consumer influence.
-Identify business competition.
-Help consumers quickly identify business popularity.
-Easier to look at a graph than read a data table.","m1fname":"Nond","projectname":"Graph Visualization and PageRank of the Yelp Social Recommender Network","m3fname":""},{"m2lname":"Zhang","m4lname":"","m3uni":"jy2803","m1uni":"cr2826","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Thu Dec 22 19:31:43 2016","m4fname":"","language":"Spark, Python, Django, Bootstrap, HTML, CSS, Git I/O, AWS RDS, MySQL database","m3lname":"Yu","dataset":"OKCupid Profile Data: 50k user profile data
Chicago Face Dataset: over 100 photos
Class Profile photo: over 900 photos","m1lname":"Ren","industry":"Social Science-Government","analytics":"k-means: recommendation phase
Microsoft face API: Recognize face and find top-k similar faces based on user's photo","m2fname":"Lyujia","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation in this project is to integrate the facial recognition to the recommendation system.","m1fname":"Chuqiao ","projectname":"MateFinder: The Next Generation of Dating Recommender System","m3fname":"Jinyang"},{"m2lname":"Zhang","m4lname":"","m3uni":"jy2803","m1uni":"cr2826","m4uni":"","pid":"201612-62","m2uni":"lz2467","timestring":"Thu Dec 22 19:35:40 2016","m4fname":"","language":"Spark, Python, Django, Bootstrap, HTML, CSS, Git I/O, AWS RDS, MySQL database ","m3lname":"Yu","dataset":"OKCupid Profile Data: 50k user profile data
Chicago Face Dataset: over 100 photos
Class Profile photo: over 900 photos ","m1lname":"Ren","industry":"Social Science-Government","analytics":"k-means: recommendation phase
Microsoft face API: Recognize face and find top-k similar faces based on user's photo","m2fname":"Lyujia","description":"The face is the index of the mind --- Chinese proverb
People are constantly, both consciously and unconsciously, looking for their future partners. This application will expedite this process by analyzing and matching people's face to those in the database, and recommending the \"correct\" ones to the clients. By analyzing clients' photos (aka their faces), we are able to find the right one(s) for them to date. The innovation in this project is to integrate the facial recognition to the recommendation system.","m1fname":"Chuqiao ","projectname":"MateFinder: The Next Generation of Dating Recommender System","m3fname":"Jinyang"},{"m2lname":"Bhatt","m4lname":"","m3uni":"","m1uni":"jjg2188","m4uni":"","pid":"201612-92","m2uni":"tb2658","timestring":"Thu Dec 22 20:10:27 2016","m4fname":"","language":"Ubuntu 14.04, Python, hmmlearn, Pyspark, Scipy, Sklearn, Numpy, h5py","m3lname":"","dataset":"We work with a dataset provided by Kaggle from the ‘American Epilepsy Society Seizure Prediction Challenge’. This dataset includes intracranial EEG recordings (~15GB) from dogs with naturally occurring epilepsy, made with an ambulatory monitoring system, to identify a region of the brain that can be resected to prevent future seizures.","m1lname":"Guerra","industry":"Life Science","analytics":"In order to properly classify and forecast this kind of dataset, an algorithm that takes the temporal information into consideration is necessary. EEG data has been analyzed in many projects; however, we decided to implement Hidden Markov Models, initializing the prior, transition and emission matrices using k-means algorithms from Pyspark. ","m2fname":"Tulika","description":"For this project, we aimed to fulfill all the requirements for a ‘Big Data’ dataset. These requirements entail a high-volume, high-velocity and high-variety dataset. In addition, we are interested in the development of machine learning algorithms with an emphasis on applications of data acquisition, processing, understanding, and learning. 
More specifically, we would like to work on the development of systems that incorporate sequential decision-making, reasoning, and inference, in an efficient and secure manner to improve healthcare applications.

As a result, we decided to concentrate on seizure forecasting using EEG-based information. The seizure forecasting results would be used for the prevention of epilepsy. Epilepsy affects nearly 1% of the world’s population, and it happens as a result of spontaneous and sequential seizures. Machine learning algorithms can be used to forecast the presence of seizures and, as a result, avoid epilepsy by taking the right measures.

Based on clinical data, the temporal dynamics of brain activity can be classified into 4 states: Interictal (between seizures, or baseline), Preictal (prior to seizure), Ictal (seizure), and Post-ictal (after seizures). By implementing seizure forecasting, new therapeutic strategies for epilepsy, such as providing patient warnings and delivering preemptive therapy, can be utilized. Epilepsy prevention can be reached through the understanding, learning and forecasting of seizures from electroencephalogram (EEG) signals. In this paper, a proposed methodology utilizing feature extraction and a Hidden Markov Model (HMM) is used to classify EEG time series data. The proposed methodology predicts whether a signal is preictal or interictal with up to 84% accuracy. K-means in Pyspark, a framework learnt in Big Data Analytics (EECS E6893), is used to initialize the probability matrices of the HMM. ","m1fname":"Jorge","projectname":"Seizure Forecasting Analysis of EEG Data","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"tnn2109","m4uni":"","pid":"201612-87","m2uni":"","timestring":"Thu Dec 22 21:06:50 2016","m4fname":"","language":"Spark","m3lname":"","dataset":"The complete data set, consisting of address information and solar panel metrics from Project Sunroof, was created by using the purchased address list and the Content Grabber 2 web scraper. Creating this dataset took much of the time, as most free web scrapers are difficult to use and the best scrapers are very expensive. 10 days of communication with the makers of Content Grabber 2, Sequentum Pty Ltd, secured a temporary trial key for this project.","m1lname":"Nguyen","industry":"Life Science","analytics":"The tools used most often were web scrapers. Several were tried with varying results, eventually settling on Content Grabber 2. Below are screenshots of various scrapers and their drawbacks. 
Also shown are various Content Grabber 2 screens showing the configuration of the scraper and its customization to extract Solar metrics for the Project Sunroof Site (https://www.google.com/get/sunroof)","m2fname":"","description":"The goal of this project is to use Spark’s MapReduce to sift through the data and return a recommendation on the affordability of using solar energy at a site in Indiana.","m1fname":"Trien","projectname":"Is Solar Energy Worthwhile for My Home?","m3fname":""},{"m2lname":"Wang","m4lname":"","m3uni":"xg2218","m1uni":"yz2990","m4uni":"","pid":"201612-33","m2uni":"ww2420","timestring":"Thu Dec 22 23:49:12 2016","m4fname":"","language":"Spark, System G, Python, nltk, sklearn, d3.js ","m3lname":"Gao","dataset":"The Walmart import data was collected from Walmart and its main subsidiaries from 04/16/2014 to 10/17/2014. This dataset was downloaded from the Global Garment Supply Chain section of the datahub website. ","m1lname":"Zhu","industry":"Retail","analytics":"K-means Clustering, LDA ","m2fname":"Wenqi","description":"Our research objective is to summarize Walmart's current supply chain. This is a descriptive analysis that helps Walmart understand which kinds of products are supplied by certain suppliers in certain regions. By better understanding the suppliers' distribution, Walmart is able to identify potential backups if a supplier is unable to supply a certain product in an emergency, thus improving its supply chain management.
","m1fname":"Yuanxu","projectname":"Supply Chain Management for Walmart","m3fname":"Xuefei"},{"m2lname":"Piao","m4lname":"","m3uni":"","m1uni":"jf3030","m4uni":"","pid":"201612-37","m2uni":"yp2419","timestring":"Thu Dec 22 23:50:21 2016","m4fname":"","language":"Python, Spark","m3lname":"","dataset":"we find the Amazon book dataset from the UCSD researcher Julian McAuley’s website, http://jmcauley.ucsd.edu/data/amazon/.

Any data with user id, book id, book title, and book rating can be supported","m1lname":"Fu","industry":"Information","analytics":"Collaborative Filtering
Alternating Least Squares","m2fname":"Yanglu","description":"More than 6,000,000 books are published each year. Facing the huge number of books on the market, readers may find it difficult to figure out which books they would be interested in reading and buying. Admittedly, books all have previews that introduce their contents to potential readers. However, previews are more like advertisements, so readers may feel they are misleading. In this situation, it is helpful if readers can get third-party recommendations. Nowadays, the majority of websites that sell books offer their own recommendations to their customers. The problem is that most of the time the recommendations given by those websites are based on “what customers also bought with this book”. This kind of recommendation can be problematic, as buying two books together does not necessarily indicate any similarity between the two books. For a not very popular cookbook, a few customers may have bought it with a top-selling novel, but there is no way to draw the logical conclusion that recommending the top-selling novel to future cookbook buyers will satisfy them. Facing this problem, we believe we can apply what we have learned in the class to make better recommendations. In particular, we use the ratings given by previous customers to make our book recommendations.","m1fname":"Jiayi","projectname":"Book Recommender","m3fname":""},{"m2lname":"Tian","m4lname":"","m3uni":"","m1uni":"tw2565","m4uni":"","pid":"201705-6","m2uni":"yt2545","timestring":"Thu May 11 12:35:36 2017","m4fname":"","language":"Java, Android","m3lname":"","dataset":"This dataset was created for the Paper 'From Group to Individual Labels using Deep Features', Kotzias et al., KDD 2015
Please cite the paper if you want to use it :)

It contains sentences labelled with positive or negative sentiment.

=======
Format:
=======
sentence score

=======
Details:
=======
Score is either 1 (for positive) or 0 (for negative)
The sentences come from three different websites/fields:

imdb.com
amazon.com
yelp.com

For each website, there exist 500 positive and 500 negative sentences. Those were selected randomly from larger datasets of reviews.
We attempted to select sentences that have a clearly positive or negative connotation; the goal was for no neutral sentences to be selected","m1lname":"Wang","industry":"Information","analytics":"Our proposed bag-of-words sentiment analysis algorithm, Naive Bayes classifier, Averaged Perceptron classifier, Microsoft Azure Emotion API, Microsoft Azure Face API, iflytech API","m2fname":"Yuan","description":"The “information overload” embedded in massively abundant data has become a critical challenge in this era of big data explosion. Speech recognition and sentiment analysis are both popular approaches that can help people efficiently get information and extract value from large amounts of data. Our team designed and implemented an Android app that combines the tasks of emotion recognition and speech recognition.
","m1fname":"Tianrou","projectname":"VoiceDay Based on Speech Recognition","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"iuu1","m4uni":"","pid":"201705-2","m2uni":"","timestring":"Thu May 11 12:43:15 2017","m4fname":"","language":"Java, C++, Android NDK","m3lname":"","dataset":"Object detection model comes from here and is a part of a GitHub repo the project is based on:

https://www.dropbox.com/s/0i2fr9krb8wv8mp/phone_data.tar?dl=1

This URL is listed in the tools.py script of the parent project.","m1lname":"Ukpo","industry":"Media","analytics":"Object detection via Fast RCNN (Region-based Convolutional Networks) and Selective Search.","m2fname":"","description":"The Concept-”You know what it looks like, but you don’t know what it’s called!”. The goal of this project is to build something that can help users identify objects in the world by image alone.
","m1fname":"Ihimu","projectname":"Visual Search Engine","m3fname":""},{"m2lname":"Yuan","m4lname":"","m3uni":"","m1uni":"wf2223","m4uni":"","pid":"201705-5","m2uni":"xy2306","timestring":"Thu May 11 13:10:06 2017","m4fname":"","language":"Python, Swift","m3lname":"","dataset":"We use several e-book with .txt format as the source data, generate our original text image from the source, and then clip those ground truth image to make them as our input.

We set the color mode of the ground truth images to binary -- each pixel is 0 or 1 -- because all we need is the exact location of words rather than any extra information. For the clipped input images, the color mode is grayscale, because real-world images can have all kinds of “noise” – shadow, blurry text, or dirt.","m1lname":"Fu","industry":"Information","analytics":"Apply a 6-layer Convolutional Neural Network with TensorFlow, build a server on AWS EC2, and display the result with an iOS application","m2fname":"Xing","description":"Goal: Build a real product that is usable for image recognition

Procedure:
Build an AWS EC2 instance to deploy the neural network
Build an iOS application as the front end
Upload the target image and wait for the recognition result","m1fname":"Wenyu","projectname":"Text Recognition","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"asf2182","m4uni":"","pid":"201704-12","m2uni":"","timestring":"Thu May 11 13:11:06 2017","m4fname":"","language":"Python (Tensorflow, Keras), PHP","m3lname":"","dataset":"CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html","m1lname":"Suprem","industry":"Information","analytics":"Algorithms: Convolutional Neural networks, Gradient Ascent with regularization, Computational graph creation and optimization

Visualizations: Gradient ascent to get optimal inputs
","m2fname":"","description":"Convolutional neural nets have seen significant usage over the past few years since their record-breaking implementation in the ImageNet competition 2012, where a CNN implementation (AlexNet) achieved an error rate of 17% (down from 28.2% - the previous state-of-the-art result). Since then, CNNs have been applied to a variety of problems, from extended image recognition, where the current top-5 error is 3.6% by ResNet, a Microsoft deep net architecture with 152 layers, to language processing to speech classification to general pattern recognition efforts in graph analytics and other non-visual/audio media. However, their results reside in a black-box – the functional process is not well known, even though it can be easily represented mathematically. The key issue is the nonlinear activation of the neurons – unlike linear functions, the CNN is a non-linear convolved function of its inputs, which makes reverse-engineering difficult, if not impossible. In this project, we present some work on visualizing various components of deep nets to obtain better understanding of various feature maps and representations in both a layer-by-layer and holistic basis.","m1fname":"Abhijit","projectname":"Explainable Machine Learning","m3fname":""},{"m2lname":"Chen","m4lname":"","m3uni":"","m1uni":"lz2494","m4uni":"","pid":"201705-3","m2uni":"xc2360","timestring":"Thu May 11 13:29:29 2017","m4fname":"","language":"Python，Caffe","m3lname":"","dataset":"Facescrub, EmotiW and self-constructed dataset","m1lname":"Zhang","industry":"Information","analytics":"MTCNN face detection, VGG-net face recognition and facial expression estimation, HOG feature template matching, correlation tracker","m2fname":"Xucheng","description":"Our goal is to create a smart platform that is very similar to the AI in TV series: Person of Interest. 
However, such strong AI, with exceptional understanding of human activities and high accuracy in human recognition, is far beyond current technology in this area. Although achieving actual artificial intelligence is still a dream, we are able to construct a rather smart platform by combining cutting-edge technologies like deep learning. With the hope of building great artificial intelligence, we built PRPOF, a person recognition platform that can conduct automated detection and tracking of a targeted person.
We first learn features of the target person's face and detect that person among the mass of people in the multiple video cameras of a surveillance system. Then we infer the whole-body bounding box from the face location and start tracking that person. Since we are tracking the whole body, we can keep tracking even if the person only gives us a back view. We also switch to face detection mode once the person disappears, so that we never lose the person we want. Additionally, with the person tracked, we also estimate the person's gender and facial expression to do some analysis.","m1fname":"Lingyu","projectname":"Never-lost: Multicamera Face detection and tracking","m3fname":""},{"m2lname":"Hao","m4lname":"","m3uni":"","m1uni":"qz2273","m4uni":"","pid":"201705-8","m2uni":"zh2282","timestring":"Thu May 11 13:32:13 2017","m4fname":"","language":"python","m3lname":"","dataset":"Our financial data comes from an API.
We crawled social media data on two bitcoin forums: https://forum.bitcoin.com/
https://bitcointalk.org/","m1lname":"Zhou","industry":"Finance","analytics":"We use TFlearn to implement neural networks. Also we designed an equation to combine several outputs to get a more precise prediction. ","m2fname":"Zijun","description":"Given the financial data and social media data, we use different kinds of Recurrent Neural Networks to learn the previous pattern and make a prediction of future bitcoin price.","m1fname":"Qing","projectname":"Market Intelligence Analysis III Bitcoin price Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"asf2182","m4uni":"","pid":"201704-12","m2uni":"","timestring":"Thu May 11 13:42:27 2017","m4fname":"","language":"Python (Tensorflow, Keras), PHP","m3lname":"","dataset":"CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html","m1lname":"Suprem","industry":"Information","analytics":"Algorithms: Convolutional Neural networks, Gradient Ascent with regularization, Computational graph creation and optimization

Visualizations: Gradient ascent to get optimal inputs
","m2fname":"","description":"Convolutional neural nets have seen significant usage over the past few years since their record-breaking implementation in the ImageNet competition 2012, where a CNN implementation (AlexNet) achieved an error rate of 17% (down from 28.2% - the previous state-of-the-art result). Since then, CNNs have been applied to a variety of problems, from extended image recognition, where the current top-5 error is 3.6% by ResNet, a Microsoft deep net architecture with 152 layers, to language processing to speech classification to general pattern recognition efforts in graph analytics and other non-visual/audio media. However, their results reside in a black-box – the functional process is not well known, even though it can be easily represented mathematically. The key issue is the nonlinear activation of the neurons – unlike linear functions, the CNN is a non-linear convolved function of its inputs, which makes reverse-engineering difficult, if not impossible. In this project, we present some work on visualizing various components of deep nets to obtain better understanding of various feature maps and representations in both a layer-by-layer and holistic basis.","m1fname":"Abhijit","projectname":"Explainable Machine Learning","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"","m4uni":"","pid":"201705-14","m2uni":"","timestring":"Thu May 11 13:42:37 2017","m4fname":"","language":"Linux/Windows. python/javascript","m3lname":"","dataset":"mnist handwritten digits","m1lname":"","industry":"Information","analytics":"It is using python scikit-learn package to run SVM on mnist handwritten digits, using THREE.js and d3 to display the results on the front end","m2fname":"","description":"With our lives are greatly influenced by machine learning, Explainable Artificial Intelligence (XAI) becomes more and more needed.
This project builds software that lets users visually inspect machine learning results.
","m1fname":"","projectname":"Visual Analytics of Interactive Machine Learning","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rz2357","m4uni":"","pid":"201705-04","m2uni":"","timestring":"Thu May 11 13:52:15 2017","m4fname":"","language":"Python, C++","m3lname":"","dataset":"The audio training dataset is from Berlin Database of Emotional Speech. As a part of the DFG funded research project SE462/3-1 in 1997 and 1999 recorded a database of emotional utterances spoken by actors. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. Director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.[6]
","m1lname":"Zhang","industry":"Information","analytics":"SVM

Random Forest

Naive Bayes","m2fname":"","description":"Using pre-trained model to find faces in movies and recognize the expressions.

Using pre-trained model to analyze the speech emotions.

Combining two features to make the movie emotional timeline.

","m1fname":"Ruomeng","projectname":"Movie Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rz2357","m4uni":"","pid":"201705-04","m2uni":"","timestring":"Thu May 11 13:52:23 2017","m4fname":"","language":"Python, C++","m3lname":"","dataset":"The audio training dataset is from Berlin Database of Emotional Speech. As a part of the DFG funded research project SE462/3-1 in 1997 and 1999 recorded a database of emotional utterances spoken by actors. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. Director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.[6]
","m1lname":"Zhang","industry":"Information","analytics":"SVM

Random Forest

Naive Bayes","m2fname":"","description":"Using pre-trained model to find faces in movies and recognize the expressions.

Using pre-trained model to analyze the speech emotions.

Combining two features to make the movie emotional timeline.

","m1fname":"Ruomeng","projectname":"Movie Analysis","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"rz2357","m4uni":"","pid":"201705-4","m2uni":"","timestring":"Thu May 11 13:55:26 2017","m4fname":"","language":"Python, C++","m3lname":"","dataset":"The audio training dataset is from Berlin Database of Emotional Speech. As a part of the DFG funded research project SE462/3-1 in 1997 and 1999 recorded a database of emotional utterances spoken by actors. The recordings took place in the anechoic chamber of the Technical University Berlin, department of Technical Acoustics. Director of the project was Prof. Dr. W. Sendlmeier, Technical University of Berlin, Institute of Speech and Communication, department of communication science. Members of the project were mainly Felix Burkhardt, Miriam Kienast, Astrid Paeschke and Benjamin Weiss.[6]
","m1lname":"Zhang","industry":"Information","analytics":"SVM

Random Forest

Naive Bayes","m2fname":"","description":"Using pre-trained model to find faces in movies and recognize the expressions.

Using pre-trained model to analyze the speech emotions.

Combining two features to make the movie emotional timeline.

","m1fname":"Ruomeng","projectname":"Movie Analysis Based on Face Expression Recognition and Speech Emotion Detection","m3fname":""},{"m2lname":"Alvarado","m4lname":"Ryan","m3uni":"sjk2218","m1uni":"ik2338","m4uni":"gr2547","pid":"201705-17","m2uni":"jaa2220","timestring":"Thu May 11 13:59:53 2017","m4fname":"Gabriel","language":"Java, MySQL, Neo4j, Microsoft Hololens, C#, Networkx, Postgres, REST","m3lname":"Karpate","dataset":"180 days physician shared patient dataset (2015): List of Medicare provider Physicians and the number of patients they shared during that time. Approximately 65.7 million records.
National Plan and Provider Enumeration System (NPPES): Provider information (name, address, taxonomy ID). Approximately 5.1 million records
Health Care Provider Taxonomy Code: Taxonomy information (translate taxonomy ID into provider specialty description).","m1lname":"Keren","industry":"Information","analytics":"Used Markov Clustering Algorithm to analyze the shared patient graph database.

Used graph libraries with visual DFS, BFS, Force Directed Graph, Heatmap, and sizing based on properties. Also implemented interactivity via Gaze, Touch, Voice, and Text to Speech.

","m2fname":"Jose","description":"There are many advantages of using graph databases over relational databases, and there is a growing need for graph databases. However, there are no friendly migration tools on the market.
Need to take a very large, highly-interconnected graph database, extract and present useful information - extremely challenging to do in RDBMS. Using clustering, we can create more meaningful subgraphs.
Need to present graph data in an immersive and interactive way, to be more accessible to regular users.
","m1fname":"Itay","projectname":"Prometheus: Relational to Graph Migration, Visualization, and Analytics","m3fname":"Sarang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sg2665","m4uni":"","pid":"201705-18","m2uni":"","timestring":"Thu May 11 14:02:14 2017","m4fname":"","language":"Java, MySQL, Neo4j, Microsoft Hololens, C#, Networkx, Postgres, REST","m3lname":"","dataset":"180 days physician shared patient dataset (2015): List of Medicare provider Physicians and the number of patients they shared during that time. Approximately 65.7 million records.
National Plan and Provider Enumeration System (NPPES): Provider information (name, address, taxonomy ID). Approximately 5.1 million records
Health Care Provider Taxonomy Code: Taxonomy information (translate taxonomy ID into provider specialty description).
","m1lname":"Guleff","industry":"Information","analytics":"Used Markov Clustering Algorithm to analyze the shared patient graph database.
Used graph libraries with visual DFS, BFS, Force Directed Graph, Heatmap, and sizing based on properties. Also implemented interactivity via Gaze, Touch, Voice, and Text to Speech.

","m2fname":"","description":"There are many advantages of using graph databases over relational databases, and there is a growing need for graph databases. However, there are no friendly migration tools on the market.
Need to take a very large, highly-interconnected graph database, extract and present useful information - extremely challenging to do in RDBMS. Using clustering, we can create more meaningful subgraphs.
Need to present graph data in an immersive and interactive way, to be more accessible to regular users.","m1fname":"Sam","projectname":"Prometheus: Relational to Graph Migration, Visualization, and Analytics","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nmp2139","m4uni":"","pid":"201705-10","m2uni":"","timestring":"Thu May 11 14:06:44 2017","m4fname":"","language":"Python, C, Cython","m3lname":"","dataset":"All types of data, useful for Graphical Analysis. ","m1lname":"Paranjape","industry":"Information","analytics":"The following modules were implemented using a variety of approaches to showcase the C-Python interface:
1. Maximum flow Algorithm (Ford Fulkerson).
2. Fibonacci Series implementation
3. Looping with run time O(n^2)","m2fname":"","description":"With the increasing need to analyze rapidly growing data, it is essential to have tools that fulfill the speed requirements. Python, the most widely used language for data analytics, is accurate but often deemed slow. We look at ways to enhance the performance of Python by interfacing it with other languages. We take into consideration several factors, including speed as well as ease of implementation and means of switching over to other approaches.","m1fname":"Nachiket","projectname":"Python for High Performance Graph Analytics","m3fname":""},{"m2lname":"jia","m4lname":"","m3uni":"","m1uni":"tp2522","m4uni":"","pid":"201705-7","m2uni":" jj2860","timestring":"Thu May 11 14:15:07 2017","m4fname":"","language":"We used Python, flask, mongoDB, Yahoo-finance API, and Scikit-learn","m3lname":"","dataset":"We tested using the StockTwits dataset. We obtained this dataset by getting access to StockTwits' API, querying the API, and storing the related information in our database. With little modification, our software can support various other social media data such as Twitter and forum messages.","m1lname":"peng","industry":"Finance","analytics":"We implemented the following things:
1. StockTwits Mongo database, which parses StockTwits messages in real time, processes the raw messages, and stores the related information in the database.
2. StockTwits Database Python API. We created a Python API to easily and quickly retrieve all the analyzed data from our database.
3. StockTwits Popularity and Sentiment analysis algorithms. We created an algorithm that uses the data from our database to calculate daily popularity scores and sentiment scores.
4. StockTwits Machine Learning module. We also used all the analyzed information to train machine learning models for stock trend prediction.
5. StockTwits Web visualization. In order for users to check our analysis, we created a website where people can see daily updated popularity and sentiment analysis. They can also see related messages. Also we provided different plots like prices trend, weekly sentiment changes, and weekly popularity changes on our website.","m2fname":"ji","description":"The goal of our project is collect and analyze StockTwits dataset for stock prediction. Stock market prediction has been a really important part of the market analysis. There has been a lot of research done related to stock market prediction. For our research, we use sentiment analysis of social media data to extract the public sentiment. And then we analysis the extracted sentiment to get a popularity score and a sentiment score for each stock. People can use these scores as a reference for predicting the future stock market movement. Also, we used this information to train machine learning model for predicting future stock trend.

Our research project has four main parts. The first is designing and building a database for StockTwits data. The second is analyzing the information stored in our database. The third is creating a standard user interface for people to check our predictions: we use Flask to build a website that provides daily updated stock market analysis. For the fourth, our final project, we used the information collected in our database together with historical data to create a machine learning model for stock prediction.

This research and the tools we created are important because our results show correlations between public sentiment extracted from StockTwits and the stock market trend. This suggests that the StockTwits dataset has the potential to be used in future stock prediction research. Other researchers can use our API and tools to perform various stock market analyses. It would also be interesting to combine the Twitter dataset with the StockTwits dataset for training machine learning models. We hope to achieve better accuracy than models that used only the Twitter dataset. It would also be interesting to incorporate other types of data, such as news, into our system. We see a lot of potential uses of the StockTwits dataset for predicting and analyzing the stock
market.","m1fname":"tianrui","projectname":"Market Intelligence Analysis: Stock Prediction using StockTwits Dataset","m3fname":""},{"m2lname":"jia","m4lname":"","m3uni":"","m1uni":"tp2522","m4uni":"","pid":"201705-7","m2uni":" jj2860","timestring":"Thu May 11 14:17:07 2017","m4fname":"","language":"We used Python, flask, mongoDB, Yahoo-finace API, and Scikit-learn","m3lname":"","dataset":"We tested using the StockTwits dataset. I got this dataset by getting access to StockTwits' API, querying the API, and storing the related information into our database. With little modification, our software can support various other social media data such as Twitter and forum messages.","m1lname":"peng","industry":"Finance","analytics":"We implemented the following things:
1. StockTwits Mongo database, which parses StockTwits messages in real time, processes the raw messages, and stores the related information in the database.
2. StockTwits Database Python API. We created a Python API to easily and quickly retrieve all the analyzed data from our database.
3. StockTwits Popularity and Sentiment analysis algorithms. We created an algorithm that uses the data from our database to calculate daily popularity scores and sentiment scores.
4. StockTwits Machine Learning module. We also used all the analyzed information to train machine learning models for stock trend prediction.
5. StockTwits Web visualization. In order for users to check our analysis, we created a website where people can see daily updated popularity and sentiment analysis. They can also see related messages. Also we provided different plots like prices trend, weekly sentiment changes, and weekly popularity changes on our website.","m2fname":"ji","description":"The goal of our project is collect and analyze StockTwits dataset for stock prediction. Stock market prediction has been a really important part of the market analysis. There has been a lot of research done related to stock market prediction. For our research, we use sentiment analysis of social media data to extract the public sentiment. And then we analysis the extracted sentiment to get a popularity score and a sentiment score for each stock. People can use these scores as a reference for predicting the future stock market movement. Also, we used this information to train machine learning model for predicting future stock trend.

Our research project has four main parts. The first is designing and building a database for StockTwits data. The second is analyzing the information stored in our database. The third is creating a standard user interface for people to check our predictions: we use Flask to build a website that provides daily updated stock market analysis. For the fourth, our final project, we used the information collected in our database together with historical data to create a machine learning model for stock prediction.

This research and the tools we created are important because our results show correlations between public sentiment extracted from StockTwits and the stock market trend. This suggests that the StockTwits dataset has the potential to be used in future stock prediction research. Other researchers can use our API and tools to perform various stock market analyses. It would also be interesting to combine the Twitter dataset with the StockTwits dataset for training machine learning models. We hope to achieve better accuracy than models that used only the Twitter dataset. It would also be interesting to incorporate other types of data, such as news, into our system. We see a lot of potential uses of the StockTwits dataset for predicting and analyzing the stock
market.","m1fname":"tianrui","projectname":"Market Intelligence Analysis: Stock Prediction using StockTwits Dataset","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"asf2182","m4uni":"","pid":"201704-12","m2uni":"","timestring":"Thu May 11 14:51:53 2017","m4fname":"","language":"Python (Tensorflow, Keras), PHP","m3lname":"","dataset":"CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html","m1lname":"Suprem","industry":"Information","analytics":"Algorithms: Convolutional Neural networks, Gradient Ascent with regularization, Computational graph creation and optimization

Visualizations: Gradient ascent to get optimal inputs
","m2fname":"","description":"Convolutional neural nets have seen significant usage over the past few years since their record-breaking implementation in the ImageNet competition 2012, where a CNN implementation (AlexNet) achieved an error rate of 17% (down from 28.2% - the previous state-of-the-art result). Since then, CNNs have been applied to a variety of problems, from extended image recognition, where the current top-5 error is 3.6% by ResNet, a Microsoft deep net architecture with 152 layers, to language processing to speech classification to general pattern recognition efforts in graph analytics and other non-visual/audio media. However, their results reside in a black-box – the functional process is not well known, even though it can be easily represented mathematically. The key issue is the nonlinear activation of the neurons – unlike linear functions, the CNN is a non-linear convolved function of its inputs, which makes reverse-engineering difficult, if not impossible. In this project, we present some work on visualizing various components of deep nets to obtain better understanding of various feature maps and representations in both a layer-by-layer and holistic basis.","m1fname":"Abhijit","projectname":"Explainable Machine Learning","m3fname":""},{"m2lname":"Mao","m4lname":"","m3uni":"","m1uni":"lc3201","m4uni":"","pid":"201705-16","m2uni":"sm4206","timestring":"Thu May 11 14:54:21 2017","m4fname":"","language":"Python, C","m3lname":"","dataset":"Image Set:
The Microsoft COCO dataset is used for testing. It is a public dataset that can be acquired online.
VOC07 is used to train the YOLO model. Public dataset.
Text Set:
Wikipedia. Public dataset.
Captions in the VIST dataset. Public dataset.
Microsoft COCO dataset for testing. Public dataset.
Video:
Demo on YouTube videos

Generally, this system could support the test for dataset which contain images/videos, along with the corresponding descriptions or captions. ","m1lname":"Chen","industry":"Information","analytics":"Milestone1:
* Propose the idea and read papers to survey the state of the art and identify candidate methods
Milestone2:
* Use the raw pictures and a transfer learning method to extract picture features
* Use K-means to cluster these pictures
* Run word embedding to get the word representations
* Use these embeddings to infer labels, then vote to determine the label of each cluster
Milestone3:
* Run regression on the extracted features of the raw picture to locate the primary object in the picture
* Crop the original picture to the resulting bounding box and extract features again to get the representation of the new picture
* Use a larger database to obtain more reasonable word embeddings
* Design a probabilistic model to learn the label distribution of an image and produce better label inference
Final Project:
* Rather than writing the model ourselves as before, we use a state-of-the-art model, YOLO, for object localization; the other parts are unchanged
* Implement GloVe (Global Vectors for Word Representation), a better word representation method, to get the label distribution
* Extend to multiple-object images, and use the probabilistic model to predict the labels for each image
* Apply our method to predict the objects occurring in video
","m2fname":"Sun","description":"The quick development of machine learning and deep learning in recent years enables data-driven artificial intelligence in various fields, including natural language processing, computer vision, speech recognition and data mining, to reduce manual work or support better decisions. However, most machine learning or deep learning applications that have been designed so far are established upon supervised machine learning, which requires a large set of annotated data as training set. Annotating a large amount of data could be extremely expensive, time consuming, and might be inconsistent if the annotators have different background or knowledge level. Thus, supervised machine learning methods are not able to implement the vast amount of unlabeled data in the world, and the results could be biased due to the limited resources of data. To solve this gap of knowledge and make full use of generally existing unlabeled data, autonomous learning, which implements statistical and unsupervised learning methods to achieve automated learning and predicting based on reference, rather than human interactions, has gradually attracted the attention of researchers and become a promising direction in artificial intelligence.

In this project, we implement several unsupervised machine learning and deep learning methods to detect the category of objects in images by analyzing the images and their corresponding captions. More specifically, we use transfer learning to extract feature representations of images and word embeddings to obtain representations of words, and finally build an unsupervised mapping system to predict the categories of images.

Ideally, once we feed the system raw images with text, it learns the mapping relations between them. Later, when we feed it another picture, it can tell us what kind of information the image conveys, and when we give it some text, it can help us find a related cluster of pictures.
If we feed the system video, it can draw the bounding box and find the cluster it belongs to.
","m1fname":"Luoxin","projectname":"Autonomous Learning: From Text to Vision","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"asf2182","m4uni":"","pid":"201705-20","m2uni":"","timestring":"Thu May 11 14:54:44 2017","m4fname":"","language":"Python (Tensorflow, Keras), PHP, Javascript","m3lname":"","dataset":"CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
","m1lname":"Suprem","industry":"Information","analytics":"Algorithms: Convolutional Neural networks, Gradient Ascent with regularization, Computational graph creation and optimization
Visualizations: Gradient ascent to get optimal inputs
","m2fname":"","description":"Convolutional neural nets have seen significant usage over the past few years since their record-breaking implementation in the ImageNet competition 2012, where a CNN implementation (AlexNet) achieved an error rate of 17% (down from 28.2% - the previous state-of-the-art result). Since then, CNNs have been applied to a variety of problems, from extended image recognition, where the current top-5 error is 3.6% by ResNet, a Microsoft deep net architecture with 152 layers, to language processing to speech classification to general pattern recognition efforts in graph analytics and other non-visual/audio media. However, their results reside in a black-box – the functional process is not well known, even though it can be easily represented mathematically. The key issue is the nonlinear activation of the neurons – unlike linear functions, the CNN is a non-linear convolved function of its inputs, which makes reverse-engineering difficult, if not impossible. In this project, we present some work on visualizing various components of deep nets to obtain better understanding of various feature maps and representations in both a layer-by-layer and holistic basis.
","m1fname":"Abhijit","projectname":"Explainable ML: Visualization of Training Process of Deep Learning","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"asf2182","m4uni":"","pid":"201705-9","m2uni":"","timestring":"Thu May 11 14:55:34 2017","m4fname":"","language":"Python (Tensorflow, Keras), PHP, Javascript","m3lname":"","dataset":"CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
","m1lname":"Suprem","industry":"Information","analytics":"Algorithms: Convolutional Neural networks, Gradient Ascent with regularization, Computational graph creation and optimization
Visualizations: Gradient ascent to get optimal inputs
","m2fname":"","description":"Convolutional neural nets have seen significant usage over the past few years since their record-breaking implementation in the ImageNet competition 2012, where a CNN implementation (AlexNet) achieved an error rate of 17% (down from 28.2% - the previous state-of-the-art result). Since then, CNNs have been applied to a variety of problems, from extended image recognition, where the current top-5 error is 3.6% by ResNet, a Microsoft deep net architecture with 152 layers, to language processing to speech classification to general pattern recognition efforts in graph analytics and other non-visual/audio media. However, their results reside in a black-box – the functional process is not well known, even though it can be easily represented mathematically. The key issue is the nonlinear activation of the neurons – unlike linear functions, the CNN is a non-linear convolved function of its inputs, which makes reverse-engineering difficult, if not impossible. In this project, we present some work on visualizing various components of deep nets to obtain better understanding of various feature maps and representations in both a layer-by-layer and holistic basis.
","m1fname":"Abhijit","projectname":"Explainable ML: Visualization of Training Process of Deep Learning","m3fname":""},{"m2lname":"Patel","m4lname":"","m3uni":"","m1uni":"rr3087","m4uni":"","pid":"201705-11","m2uni":"mp3542","timestring":"Thu May 11 16:35:44 2017","m4fname":"","language":"Python, JavaScript, HTML, CSS, d3, Gephi","m3lname":"","dataset":"The first one is bitcoin blockchain dataset and other 2 are for social network analysis - FB Group Posts network and FB Page Like Network. All these datasets were acquired using APIs and own scripts and are not publicly available.","m1lname":"Rana","industry":"Information","analytics":"Outlier Detection,
Fraud Detection,
Pattern Understanding,
Knowledge Spreading,
PageRank Algorithm,
Eigenvector Centralities,
Community Detection,
Degree Distributions

Network graph visualization was implemented using d3 and Gephi.
","m2fname":"Mohneesh","description":"To gather Data for building knowledge graphs and visualizing and performing analysis on the networks.

This is really important for getting insights into networks which would not have been possible with traditional approaches, like trends and outlier detections. The importance can also be understood from our presentation.","m1fname":"Rahul ","projectname":"DATA ACQUISITION FOR KNOWLEDGE GRAPHS - FROM BITCOINS TO SOCIAL NETWORK ANALYSIS","m3fname":""},{"m2lname":"Patel","m4lname":"","m3uni":"","m1uni":"rr3087","m4uni":"","pid":"201705-11","m2uni":"mp3542","timestring":"Thu May 11 16:35:55 2017","m4fname":"","language":"Python, JavaScript, HTML, CSS, d3, Gephi","m3lname":"","dataset":"The first one is bitcoin blockchain dataset and other 2 are for social network analysis - FB Group Posts network and FB Page Like Network. All these datasets were acquired using APIs and own scripts and are not publicly available.","m1lname":"Rana","industry":"Information","analytics":"Outlier Detection,
Fraud Detection,
Pattern Understanding,
Knowledge Spreading,
PageRank Algorithm,
Eigenvector Centralities,
Community Detection,
Degree Distributions

Network graph visualization was implemented using d3 and Gephi.
","m2fname":"Mohneesh","description":"To gather Data for building knowledge graphs and visualizing and performing analysis on the networks.

This is really important for getting insights into networks which would not have been possible with traditional approaches, like trends and outlier detections. The importance can also be understood from our presentation.","m1fname":"Rahul ","projectname":"DATA ACQUISITION FOR KNOWLEDGE GRAPHS - FROM BITCOINS TO SOCIAL NETWORK ANALYSIS","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"as4948","m4uni":"","pid":"201705-12","m2uni":"","timestring":"Thu May 11 16:50:09 2017","m4fname":"","language":"CUDA C, Pyhton","m3lname":"","dataset":"Main Dataset used in this project is anonymized Facebook Network from Stanford Network (SNAP) with approximately 1600 nodes and 67K edges.

It is public and can be downloaded from https://snap.stanford.edu/data/egonets-Facebook.html","m1lname":"Saliev","industry":"Information","analytics":"Visualization of the Facebook Network was done using Gephi Software.

The average clustering coefficient, number of communities, and density of the network graph are also computed using Gephi.

Breadth-first search, which starts at the root node and visits all of its neighboring nodes before moving to the next level, is run using Python with the Gunrock libraries.

A quasirandom (low-discrepancy) sequence, such as a Sobol sequence, is also run as an example. The algorithm itself comes from the CUDA sample libraries.
","m2fname":"","description":"Main goal of this project is to demonstrate the performance improvement of running big data on GPU (or CPU+GPU) over traditional CPU only platform.

I am planning to achieve this goal by running various algorithms on two different platforms.

As a toolkit, I will be using mainly CUDA C/C++ (by NVIDIA) for this purpose. Also, I will use Python, and C/C++ with Gunrock libraries.

All the computation is run in IBM's Minsky Cluster with NVLink capabilities. ","m1fname":"Azizjon","projectname":"GPU-Based Graph Analytics","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"pk2532","m4uni":"","pid":"201705-13","m2uni":"","timestring":"Thu May 11 16:54:01 2017","m4fname":"","language":"TODO","m3lname":"","dataset":"TODO","m1lname":"Ke","industry":"Transportation","analytics":"TODO","m2fname":"","description":"TODO","m1fname":"Pu ","projectname":"NYC Parking","m3fname":""},{"m2lname":"Alvarado","m4lname":"Ryan","m3uni":"sjk2218","m1uni":"ik2338","m4uni":"gr2547","pid":"201705-17","m2uni":"jaa2220","timestring":"Thu May 11 17:43:53 2017","m4fname":"Gabriel","language":"Java, MySQL, Neo4j, Microsoft Hololens, C#, Networkx, Python, Postgres, REST API","m3lname":"Karpate","dataset":"180 days physician shared patient dataset (2015): List of Medicare provider Physicians and the number of patients they shared during that time. Approximately 65.7 million records.
National Plan and Provider Enumeration System (NPPES): Provider information (name, address, taxonomy ID). Approximately 5.1 million records.
Health Care Provider Taxonomy Code: Taxonomy information (translate taxonomy ID into provider specialty description).","m1lname":"Keren","industry":"Information","analytics":"Used Markov Clustering Algorithm to analyze the shared patient graph database.
Used graph libraries with visual DFS, BFS, Force Directed Graph, Heatmap, and sizing based on properties. Also implemented interactivity via Gaze, Touch, Voice, and Text to Speech.","m2fname":"Jose","description":"There are many advantages of using graph databases over relational databases, and there is a growing need for graph databases. However, there are no friendly migration tools on the market.
We need to take a very large, highly interconnected graph database and extract and present useful information, which is extremely challenging to do in an RDBMS. Using clustering, we can create more meaningful subgraphs.
Need to present graph data in immersive and interactive way, to be more accessible to regular users.","m1fname":"Itay","projectname":"Prometheus: Relational to Graph Migration, Visualization, and Analytics","m3fname":"Sarang"},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sg2665","m4uni":"","pid":"201705-18","m2uni":"","timestring":"Thu May 11 17:45:36 2017","m4fname":"","language":"Java, MySQL, Neo4j, Microsoft Hololens, C#, Networkx, Python, Postgres, REST API","m3lname":"","dataset":"180 days physician shared patient dataset (2015): List of Medicare provider Physicians and the number of patients they shared during that time. Approximately 65.7 million records.
National Plan and Provider Enumeration System (NPPES): Provider information (name, address, taxonomy ID). Approximately 5.1 million records.
Health Care Provider Taxonomy Code: Taxonomy information (translate taxonomy ID into provider specialty description).","m1lname":"Guleff","industry":"Information","analytics":"Used Markov Clustering Algorithm to analyze the shared patient graph database.
Used graph libraries with visual DFS, BFS, Force Directed Graph, Heatmap, and sizing based on properties. Also implemented interactivity via Gaze, Touch, Voice, and Text to Speech.","m2fname":"","description":"There are many advantages of using graph databases over relational databases, and there is a growing need for graph databases. However, there are no friendly migration tools on the market.
We need to take a very large, highly interconnected graph database and extract and present useful information, which is extremely challenging to do in an RDBMS. Using clustering, we can create more meaningful subgraphs.
Need to present graph data in immersive and interactive way, to be more accessible to regular users.","m1fname":"Sam","projectname":"Prometheus: Relational to Graph Migration, Visualization, and Analytics","m3fname":""},{"m2lname":"Ryan","m4lname":"","m3uni":"","m1uni":"sjk2218","m4uni":"","pid":"201705-18","m2uni":"gr2547","timestring":"Thu May 11 20:53:20 2017","m4fname":"","language":"Java, MySQL, Neo4j, Microsoft Hololens, C#, Networkx, Python, Postgres, REST API","m3lname":"","dataset":"180 days physician shared patient dataset (2015): List of Medicare provider Physicians and the number of patients they shared during that time. Approximately 65.7 million records.
National Plan and Provider Enumeration System (NPPES): Provider information (name, address, taxonomy ID). Approximately 5.1 million records.
Health Care Provider Taxonomy Code: Taxonomy information (translate taxonomy ID into provider specialty description)","m1lname":"Karpate","industry":"Information","analytics":"Used Markov Clustering Algorithm to analyze the shared patient graph database.
Used graph libraries with visual DFS, BFS, Force Directed Graph, Heatmap, and sizing based on properties. Also implemented interactivity via Gaze, Touch, Voice, and Text to Speech.","m2fname":"Gabriel","description":"There are many advantages of using graph databases over relational databases, and there is a growing need for graph databases. However, there are no friendly migration tools on the market.
We need to take a very large, highly interconnected graph database and extract and present useful information, which is extremely challenging to do in an RDBMS. Using clustering, we can create more meaningful subgraphs.
We also need to present graph data in an immersive and interactive way, to make it more accessible to regular users.
","m1fname":"Sarang","projectname":"Prometheus: Relational to Graph Migration, Visualization, and Analytics","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nam2169","m4uni":"","pid":"201705-15","m2uni":"","timestring":"Thu May 11 21:01:44 2017","m4fname":"","language":"Python, Can be used in any Unix platform supporting Phython 2.7.","m3lname":"","dataset":"Yahoo finance API was used to collect the historical stock prices. Catergory information was collected from the NASDAC website.","m1lname":"Mitra","industry":"Finance","analytics":"A form of Markowitz optimization portfolio was used to do the optimization.","m2fname":"","description":"The goal of this project is to build a part of a robo-advisor to optimize personal investment strategy based on user input. The input of the system are what is the amount for investment, what is the desired return, what is the timeframe for this return. Data set consists of information about existing stocks, bonds, index funds is used to calculate yearly return of each stock and future price prediction is done using linear regression. Then Markowitz portfolio optimization theory is implemented in python to achieve the optimization task.

Investing in the stock market is a game of return and risk. An investor always wants to maximize the return while keeping the risk minimized. An optimization tool that helps achieve that is very relevant, given recent stock market volatility.

The project deals with a large number of stocks and does the optimization in real time. Parallelization was used to speed up the computation.

The project delivers a overall system that stores, predicts stock prices and provides the user an optimize portfolio with a web interface.","m1fname":"Nandita","projectname":"Optimized Personal Investment Strategy (I)","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201705-19","m2uni":"","timestring":"Thu May 11 21:34:42 2017","m4fname":"","language":"Swift 3.1, iOS 10.3.1 on iPhone 6s, Xcode 8.3.2 (on macOS Sierra)","m3lname":"","dataset":"The dataset used to train and test the classifier were collected using the data-logger and exporter built into Swift-HAR.","m1lname":"Hasbamrer","industry":"Information","analytics":"In this project, I demonstrate an HAR application written in Swift 3.1 for the iPhone 6s. Data logs gathered from the CoreMotion API feeds into a Feed Forward Neural Network created using the Swift-AI framework. Both training and inference from the neural network is executed locally in the iPhone 6s. The Swift-HAR classifier can expand to accommodate a variable number of actions. The sample neural network successfully classifies user activity as standing still, walking, or doing bicep curls.","m2fname":"","description":"Human activity recognition (HAR) is the use of sensors and algorithms to classify human action. Applications of HAR in healthcare involve remote monitoring of patient physical activity, detecting falls in elderly patients, and helping classify movement in patients with motion disorders. HAR can also be used in wearable electronics that track user exercise and activity. 
The goal of this project is to create an HAR application that can collect data as well as train a classifier locally on an iPhone 6s.","m1fname":"Nond","projectname":"Swift-HAR: Human Activity Recognition on iPhone 6s","m3fname":""},{"m2lname":"alvarez","m4lname":"","m3uni":"","m1uni":"as5147","m4uni":"","pid":"201705-1","m2uni":"aea2161","timestring":"Thu May 11 21:51:21 2017","m4fname":"","language":"python, keras","m3lname":"","dataset":"The dataset used was cornell movie dialogue dataset.
https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html","m1lname":"sinha","industry":"Information","analytics":"Neural networks, Keras, Tkinter, LSTM, word2vec, Gensim embeddings","m2fname":"anthony","description":"To create a chat bot with emotions.
It is based on a generative model, which generates output based on emotions.

It is related to natural language processing which is a complex field and a understanding of such field would take us closer to AGI. ","m1fname":"anul kumar","projectname":"Di-feelBot - The Emotional Chatbot","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"sg2665","m4uni":"","pid":"201705-20","m2uni":"","timestring":"Thu May 11 23:37:27 2017","m4fname":"","language":"C#, Unity, Windows 10, Hololens","m3lname":"","dataset":"* Northwind sample database from Neo4j sample datasets https://neo4j.com/developer/guide-importing-data-and-etl/#_northwind_introduction.

* Mixed-species cat brain GraphML model http://awesome.cs.jhu.edu/data/static/graphs/cat/mixed.species_brain_1.graphml

* Any Neo4j Cypher query

* Any GraphML formatted file","m1lname":"Guleff","industry":"Information","analytics":"Touch and Gaze management within Hololens
Communications with Neo4j via REST by passing Cypher queries
Parsing of XML based GraphML files
Graph: Force Directed, Breadth First Search, Depth First Search
Text to Speech and Voice Manager for command recognition
","m2fname":"","description":"Objectives:

* Provide an interactive library to load graph data from typical graph stores (e.g. Neo4j, SystemG, GraphML, csv files)
* Render a representation of the graph in 3D space
* Provide rudimentary graph algorithms: Breadth First Search, Depth First Search, Force Directed Graphs
* Provide interactivity with the graph: Gaze, Touch, Voice, Text To Speech.
* Provide rendering of graph properties in Location, Size, and Color tied to numeric properties

Innovations:

* A first-of-its-kind library that renders graph models in a 3D environment and allows interactive analysis of graphs by verbally asking for scaling and heat maps of numeric properties, with touch interaction on nodes and text to speech to read properties back to the user.

Capabilities:

* Loading of graphs from Neo4j Cypher Queries
* Loading of graphs from GraphML standard format
* Touch, Gaze, Speech to Text interactivity
* In-memory graph model allowing for traversal and quick lookups.
* Remote logging to Postgres via REST API
","m1fname":"Samuel","projectname":"Visualization of Large Graph in Immersive Environments","m3fname":""},{"m2lname":"Sinha","m4lname":"","m3uni":"","m1uni":"aea2161","m4uni":"","pid":"201705-1","m2uni":"as5147","timestring":"Fri May 12 02:11:24 2017","m4fname":"","language":"Python","m3lname":"","dataset":"Cornell Movie Dialogue Corpus - https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html","m1lname":"Alvarez","industry":"Information","analytics":"LSTM,
Reinforcement Learning,
Word Embeddings,
One-Hot representation,
Tkinter UI
","m2fname":"Anul","description":"Develop a chat bot that has an internal emotional state and responds according to it.","m1fname":"Anthony","projectname":"Emotional Chat Bot","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"nh2518","m4uni":"","pid":"201705-19","m2uni":"","timestring":"Tue May 16 00:05:17 2017","m4fname":"","language":"Swift 3.1, iOS 10.3.1 on iPhone 6s, Xcode 8.3.2 (on macOS Sierra) ","m3lname":"","dataset":"The dataset used to train and test the classifier was collected using the data-logger and exporter built into Swift-HAR. ","m1lname":"Hasbamrer","industry":"Information","analytics":"In this project, I demonstrate an HAR application written in Swift 3.1 for the iPhone 6s. Data logs gathered from the CoreMotion API feed into a Feed Forward Neural Network created using the Swift-AI framework. Both training and inference are executed locally on the iPhone 6s. The Swift-HAR classifier can expand to accommodate a variable number of actions. The sample neural network successfully classifies user activity as standing still, walking, or doing bicep curls. ","m2fname":"","description":"Human activity recognition (HAR) is the use of sensors and algorithms to classify human action. Applications of HAR in healthcare involve remote monitoring of patient physical activity, detecting falls in elderly patients, and helping classify movement in patients with motion disorders. HAR can also be used in wearable electronics that track user exercise and activity. 
The goal of this project is to create an HAR application that can collect data as well as train a classifier locally on an iPhone 6s.","m1fname":"Nond","projectname":"Swift-HAR: Human Activity Recognition on iOS (iPhone 6s)","m3fname":""},{"m2lname":"","m4lname":"","m3uni":"","m1uni":"kc2980","m4uni":"","pid":"201705-21","m2uni":"","timestring":"Tue May 16 02:50:52 2017","m4fname":"","language":"Python, keras, tensorflow","m3lname":"","dataset":"imagenet","m1lname":"chen","industry":"Information","analytics":"YOLO system","m2fname":"","description":"Detecting objects quickly and accurately is the key challenge in vehicle object recognition. In this project, I implement a high-speed object detection technique called YOLO, which fits this scenario well. ","m1fname":"kun","projectname":"Vehicle Object Recognition","m3fname":""},{"projectname":"Multi-Agent-Diagnose-System","timestring":"Tue May 13 21:06:50 2025","m1uni":"yk3108","m2lname":"Wu","m1fname":"Yunfei","m4fname":"","m1lname":"Ke","m3fname":"Ziyi","description":"Project Goals Description: Objectives, Innovations, and Capabilities

Our project aims to develop a Multi-Agent Clinical Diagnosis System that enhances diagnostic accuracy and reasoning explainability through structured debate among large language models (LLMs). Inspired by Socratic dialogue, the system simulates a multi-round deliberation between agents holding opposing views (Pro and Con), followed by an impartial consensus agent to determine the final diagnosis.

Objectives
Improve clinical reasoning performance on medical benchmarks such as MedQA and symptom-to-disease prediction datasets.
Demonstrate how multi-agent collaboration can outperform single-agent LLMs (e.g., DeepSeek, GPT-4) in both accuracy and reasoning depth.
Create a reusable and extensible framework for medical question answering, diagnosis generation, and reasoning visualization.

Innovations
New framework: implementation of a multi-agent debate framework for medical AI, using multiple specialized agents (Pro, Con, Judge, Consensus).
Integration of multi-round structured reasoning with final aggregation via a consensus model.
Generation of Mermaid flowcharts to trace reasoning paths, enabling transparency and interpretability.
Incorporation of bias testing and ablation studies to verify robustness, fairness, and model interactions.

Capabilities
Supports multiple backend models including Mistral, DeepSeek, and GPT via API integration.
Handles both multiple-choice and free-text symptom inputs, supporting real-world clinical tasks.
Provides a web-based frontend for interactive reasoning and visualization.

Why is this important?

LLMs are becoming powerful tools in medical AI, but single-agent outputs are limited by tunnel vision, hallucination, or lack of justification. Our debate-based system addresses these limitations by encouraging structured confrontation, diverse viewpoints, and reasoning refinement. It pushes toward more trustworthy, transparent, and clinically meaningful AI systems—an essential step for deploying LLMs in real-world healthcare settings.","uni":"yk3108","language":"Python, Python environments, with compatibility for local machines, Jupyter notebooks, or cloud platforms","pid":"202505-12","m4uni":"","analytics":"Our system implements a multi-agent debate framework for clinical diagnosis, inspired by Socratic reasoning. The key components include:
Agent Modules:
Pro Agent and Con Agent: Two large language model (LLM)-based agents provide competing diagnoses and justifications.
Judge Agent: Optionally evaluates and critiques both sides in each round to guide deeper reasoning.
Consensus Agent: Makes the final diagnostic decision after reviewing all arguments.

Multi-Round Debate Algorithm:
Structured reasoning unfolds over three rounds, with agents iteratively refining their positions.
The consensus agent considers all prior reasoning to generate a final answer.

Evaluation Modules:
Accuracy measurement against gold-standard datasets (MedQA and Kaggle Symptom-to-Disease).
Ablation studies including role swapping and agent diversity tests.
Bias analysis to verify the neutrality of the consensus agent.

Visualization:
We generate Mermaid flowcharts to visualize the reasoning process.
An interactive Flask-based web interface allows users to explore debate transcripts and diagnosis trees in real-time.

These components work together to produce interpretable, rigorous diagnostic predictions beyond traditional single-agent LLM prompting.","m4lname":"","industry":"Information","m3lname":"Xin","dataset":"We evaluate our multi-agent clinical reasoning system on two datasets:

1. MedQA (USMLE-style multiple choice)
Description: A benchmark dataset designed to assess clinical reasoning based on real United States Medical Licensing Examination (USMLE) questions. It contains high-quality, expert-level multiple-choice questions covering internal medicine, surgery, pediatrics, and more.
Language Coverage: English, Simplified Chinese, Traditional Chinese.
Subset Used: We evaluate on the first 315 English questions for consistent benchmarking across models.
Source: Publicly available at https://paperswithcode.com/dataset/medqa-usmle

2. Kaggle Disease Symptom Prediction Dataset
Description: A publicly available dataset containing medical symptoms and corresponding disease labels. Each row consists of a disease name and a list of associated patient symptoms.
Use Case: We use this dataset for free-form diagnosis generation, evaluating the system’s ability to produce correct or more specific disease names given natural symptom descriptions.
Source: Kaggle repository: https://www.kaggle.com/datasets/itachi9604/disease-symptom-description-dataset

Other Data Compatibility

Our framework is designed to be model-agnostic and data-flexible. It can support:
Any multiple-choice clinical QA dataset (e.g., MedMCQA, PubMedQA)
Any free-text symptom-to-disease dataset with natural language input
Future support for multi-label or comorbidity prediction datasets, EMRs, and real clinical notes via minor formatting changes.","m2uni":"qw2438","m2fname":"Qinyun","m3uni":"zx2504"},{"projectname":"House Value in Contiguous United States","timestring":"Sat Dec 22 04:12:34 2018","m1uni":"mg3847","m2lname":"Zhang","m1fname":"Minglei","m4fname":"","m1lname":"Gu","m3fname":"Zichen","description":"This project provides recommendations for investors about which real estate is worth investing in. We used regression models to evaluate housing values in the contiguous United States. The ROIC (return on invested capital) of house value over time gives investors a clear signal. We conducted a horizontal comparison of house prices across features such as area, number of rooms, MHI, and average days on market, as well as a vertical comparison (e.g., ROIC over time), to predict future house prices. To do so, we implemented and tested several regression models. The output is displayed in a reader-friendly form on our website.
","uni":"mg3847","language":"python, java script ","pid":"201812-4","m4uni":"","analytics":"ROIC calculation, linear regression, generalized linear regression, decision tree regression, gradient boosted tree regression, random forest regression.
Extensive SVG drawing. Pan and zoom for SVG. SVG coloring based on data. 'mouseover', 'mouseout', and 'click' event handlers. Input boxes and check boxes. Line charts.
","m4lname":"","industry":"Finance","m3lname":"Liu","dataset":"Zillow Research","m2uni":"hz2558","m2fname":"Haopeng","m3uni":"zl2668"},{"projectname":"Intelligent Volatility-Driven Stock Insight System","timestring":"Fri Dec 19 03:59:11 2025","m1uni":"sb5181","m2lname":"Munjal","m1fname":"Sreenivas","m4fname":"","m1lname":"Bandi","m3fname":"Harissh","description":"Project Goals Description
This project, titled \"Intelligent Volatility-Driven Stock Insight System\", aims to bridge the gap between rigorous, institutional-grade quantitative finance and accessible, user-friendly AI assistance. By deploying a swarm of specialized AI agents, the system automates the complex workflow of a financial analyst, providing real-time, statistically grounded market insights.

Objectives
Democratize Quantitative Analysis: To make advanced financial modeling (typically reserved for institutional \"Quants\" with expensive tools) accessible to individual investors and non-specialists.
Neuro-Symbolic Integration: To combine the deterministic precision of classical econometrics (ARIMA/GARCH models) with the probabilistic reasoning and natural language capabilities of Large Language Models (LLMs) like Google Gemini.
End-to-End Automation: To create a fully autonomous pipeline that handles data ingestion, cleaning, statistical modeling, visualization, and report generation without human intervention.
Transparent Self-Validation: To implement a \"Walk-Forward Backtesting\" mechanism where the system validates its own performance (calculating RMSE) and transparently communicates its error margins to the user, fostering trust.

Innovations
Neuro-Symbolic Architecture: Unlike standard \"Black Box\" deep learning models (like LSTMs) that are hard to interpret, this system uses a \"White Box\" approach. It uses statsmodels and arch for the math (Symbolic) and CrewAI/LLMs for the explanation (Neural), ensuring that every prediction is mathematically traceable.
Agentic Swarm Orchestration: The system utilizes the CrewAI framework to orchestrate a team of specialized agents (MarketDataAgent, QuantAgent, VisualizationAgent, PredictionAgent). This allows for \"Chain-of-Thought\" reasoning where agents can pass structured data (DTOs) and critique each other's outputs.
Dynamic Volatility Modeling with GARCH: Instead of assuming constant market risk, the system implements GARCH (Generalized Autoregressive Conditional Heteroskedasticity) to model \"volatility clustering.\" This allows the AI to understand that market risk changes over time and adjust its confidence intervals accordingly.

Capabilities
Autonomous Data Retrieval: Fetches and sanitizes high-fidelity, split-adjusted OHLCV data from Polygon.io with a 2-year lookback period.
Statistical Forecasting:
ARIMA: Predicts the conditional mean of asset prices (Price Target).
GARCH: Predicts the conditional variance (Risk/Volatility).
Anomaly Detection: Identifies price movements that deviate significantly from expected trends or volatility profiles using statistically generated confidence intervals.
Interactive Visualization: Automatically generates dual-panel charts showing Price Trends (with forecast error bars) and Volatility Regimes.
Contextual Reporting: Synthesizes complex numerical outputs (RMSE, Alpha/Beta coefficients, SMA trends) into a plain-English investment report (Buy/Sell/Hold assessment).

Why are these toolkits important?
Interpretability in Finance: In high-stakes regulatory or investment environments, \"explainability\" is paramount. This system provides clear reasons for its predictions, unlike opaque neural networks.
Risk Management: By explicitly modeling volatility and exposing the model's own error rate (RMSE), the system prevents users from blindly following AI predictions. It highlights when the model is uncertain, which is critical for capital preservation.
Efficiency: It reduces the time required for a comprehensive technical and fundamental analysis—including chart drawing and report writing—from hours to mere seconds, allowing for faster decision-making in volatile markets.","uni":"sb5181","language":"Language / Platform Description Programming Language: Python 3.8+ The entire backend and frontend logic is written in Python, chosen for its dominance in data science, easy integration with financial libraries (statsmodels, arch), and robust support for AI agent frameworks. Web Framework: Streamlit Used to create the interactive web interface. It serves as the frontend platform where users input stock tickers and view the generated reports and charts. Orchestration Platform: CrewAI The \"operating system\" for the agents. It manages the lifecycle, context window, and task delegation between the specialized agents (Retriever and Analyst). LLM Provider: Google Gemini (via google-generativeai) The cognitive engine powering the reasoning capabilities of the agents, translating numerical data into natural language validation and advice. Data Platform: Polygon.io The external backend service used for fetching real-time and historical institutional-grade financial data.","pid":"202512-29","m4uni":"","analytics":"Analytics Algorithms Description
Financial Time-Series Algorithms:
ARIMA (AutoRegressive Integrated Moving Average): Used to model the conditional mean of the stock price with order (p, d, q) = (1, 1, 0), i.e. AR=1, I=1, MA=0. It captures linear trends and short-term momentum.
GARCH (Generalized Autoregressive Conditional Heteroskedasticity): Used to model the \"volatility clustering\" of the market (i.e., the risk). Specifically, a GARCH(1,1) model is fitted on the residuals of the ARIMA model to generate dynamic confidence intervals.
SMA (Simple Moving Average): A 50-day Simple Moving Average is calculated to classify the \"Market Regime\" (Bull vs. Bear) before deeper analysis.
Validation & Error Metrics:
Walk-Forward Backtesting: A recursive algorithm that simulates \"trading\" over the past 20 days. It re-trains the model at every step (t, t+1, ... t+n) to strictly prevent data leakage.
RMSE (Root Mean Squared Error): Computed dynamically for every forecast to quantify the model's recent accuracy.
System Modules:
MarketDataAgent: The ETL (Extract, Transform, Load) module for sanitizing and chronologically sorting OHLCV data.
QuantAgent: The computational core containing the statsmodels and arch implementations.
VisualizationAgent: The rendering engine that produces the PNG assets for the UI.
Visualization:
Library: Matplotlib (Style: 'bmh').
Dual-Panel Charts:
Price Trend Panel: Overlays the \"Actual Price\" (Grey) against the \"Predicted Price\" (Blue Dashed) and visualizes the \"Forecast Interval\" (Red Error Bar).
Volatility Panel: Displays the Conditional Volatility derived from the GARCH model, effectively visualizing the \"Fear Gauge\" of the specific stock.","m4lname":"","industry":"Finance","m3lname":"Anbarasan Sudhamathi","dataset":"Project Dataset Description
Tested Dataset
The core validation of the \"Intelligent Volatility-Driven Stock Insight System\" was performed using high-fidelity, split-adjusted market data.

Subjects Tested:
High-Beta / Major Tech Stocks: NVIDIA Corp (NVDA), Tesla Inc (TSLA), Alphabet Inc (GOOGL), and Apple Inc (AAPL) were selected to stress-test the system's ability to model extreme volatility and rapid price shifts (checking the GARCH model's responsiveness).
Data Type: Daily OHLCV (Open, High, Low, Close, Volume) aggregates.
Timeframe: A rolling 2-year window (approximately 504 trading days) was used for each test run to ensure statistical significance for the ARIMA-GARCH models.
Data Source: Polygon.io (specifically the Aggregates/Bars API endpoint). This source was chosen for its institutional-grade reliability, guaranteed uptime, and automatic handling of stock splits (preventing false \"crash\" anomalies).
Dataset Availability
The dataset is dynamically fetched via API and is not a static file.
Source: Polygon.io
Description: The system requires a valid Polygon.io API key. It queries the v2/aggs/ticker/{stocksTicker}/range/1/day/{from}/{to} endpoint.
Public Access: While the specific historical data requires a subscription (a free tier is available for basic testing), the data schema is public and standard for financial time series.
Other Supported Data
The system's modular architecture (MarketDataAgent) is designed to be agnostic to the underlying asset class, as long as it is supported by the Polygon.io API. Technically, the software can support:

Cryptocurrencies: By inputting crypto tickers formatted for Polygon (e.g., X:BTCUSD for Bitcoin), the system can retrieve and model crypto volatility, which is particularly well-suited for the implemented GARCH methodology.
Forex (Foreign Exchange): Major currency pairs (e.g., C:EURUSD) can be analyzed, provided the user's API key has access to the Forex subscription.
Penny Stocks / OTC: Any equity listed on US exchanges (NYSE, NASDAQ, AMEX, and OTC) is supported out-of-the-box.","m2uni":"mm6840","m2fname":"Manav","m3uni":"ha2771"},{"projectname":"AI Academic Advisor Chatbot for Columbia University","timestring":"Thu Jan 1 01:03:09 2026","m1uni":"jz3850","m2lname":"Zhang","m1fname":"Junfeng","m4fname":"","m1lname":"Zou","m3fname":"","description":"This project presents an intelligent academic advising system for Columbia University that combines Retrieval-Augmented Generation (RAG), semantic search, and hybrid intent detection to provide personalized course recommendations. The system addresses the challenge of navigating Columbia’s extensive course catalog of 8,120+ courses by implementing a conversational AI interface that understands natural language queries and generates context-aware recommendations using local language models. We developed a novel hybrid intent detection approach that combines regex-based pattern matching with intelligent parameter extraction, achieving 100% accuracy across our comprehensive test suite. The system demonstrates significant practical improvements with response times under 5 seconds and zero hallucination rate through strict prompt engineering. Our evaluation shows that the hybrid approach provides reliable, deterministic intent classification while maintaining the flexibility needed for natural language understanding. The system successfully handles instructor queries, specific course lookups, level-filtered searches, and topic-based exploration with student profile personalization.","uni":"jz3850","language":"Programmed using Python and deployed on Google Cloud Platform","pid":" 202512-4","m4uni":"","analytics":"Our project implements a full-stack AI-driven academic advising system that integrates multiple analytics techniques, algorithms, system modules, and visual components.
From an analytics and algorithmic perspective, we implemented semantic text embedding for course descriptions, vector similarity search using FAISS, and a hybrid retrieval strategy that combines semantic similarity with structured filtering over course metadata stored in MongoDB. A ranking and de-duplication algorithm was applied to prioritize relevant courses and remove redundant entries. We further adopted a retrieval-augmented generation (RAG) framework to ground large language model (LLM) responses in retrieved data and reduce hallucination.
From a system design perspective, the system consists of a data preprocessing and storage module, a semantic retrieval and hybrid search module, an LLM reasoning module based on LLaMA 3.2 (1B), and a backend orchestration layer that manages query processing, retrieval, and response generation. A web-based user interface module enables interactive, multi-turn academic advising.
For visualization, we implemented a chat-based web interface that displays structured course results, instructor information, schedules, and advisor-style explanations generated by the LLM. The UI supports real-time interaction, result comparison, and interpretable presentation of recommendations.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Our system integrates three primary data sources from Columbia University’s course catalog. The course data contains 8,120 courses with attributes including call number (unique identifier), course code (department and level such as \"EECS E6895\"), title, instructor name, department, credit points, full course description, prerequisites, and academic term. The enrollment data contains 8,120 enrollment records with historical enrollment information for future enhancements. The instructor data includes 14,699 instructor records with faculty names and department affiliations.
So far, our platform supports only JSON-format files.
https://github.com/soid/columbia-catalog-data
","m2uni":"yz4843","m2fname":"Yangyang","m3uni":""},{"projectname":"Stock Price Prediction with BERT and XGBoost using Twitter Data","timestring":"Tue May 19 23:32:36 2020","m1uni":"ch3470","m2lname":"Xie","m1fname":"Chenyu","m4fname":"","m1lname":"Huang","m3fname":"","description":"In recent years, many scholars are using methods based on machine learning or deep learning to predict stock price movement using web-based social data. However, the growing volume of opinionated text and complexity of the market caused by chaotic event interactions, makes it almost impossible to come up with a precise strategy for decision making in the stock market. So as to fix this problem, we proposed an event aggregation model based on BERT and Sentiment Analysis to acquire a better feature representation of the stock movement. By eliminating the redundancy of features and the necessity of iterative computation, our model is evaluated to perform better than several traditional models.","uni":"ch3470","language":"Python, JavaScript","pid":"202005-9","m4uni":"","analytics":"BERT, XGBoost, Sentiment Analysis, d3.js, flask","m4lname":"","industry":"Finance","m3lname":"","dataset":"Twitter Data collected from 2019/02/06 - 2020/02/06 containing the keyword $GOOGL
Stock Price dataset of Alphabet collected from Yahoo! Finance from 2019/02/06 - 2020/02/06","m2uni":"sx2257","m2fname":"Shangzi","m3uni":""},{"projectname":"AI Music Agent Studio","timestring":"Wed May 13 02:41:36 2026","m1uni":"jt3645","m2lname":"Xu","m1fname":"Jucnehn","m4fname":"","m1lname":"Teng","m3fname":"Ranyi","description":"AI Music Agent Studio is an interactive music intelligence system for audio analysis, real-song recommendation, playlist planning, and image-based BGM suggestion. Users can upload audio, describe music needs in natural language, or upload an image, and the system returns structured music insights and real song recommendations. The main innovation is combining local audio perception with an AI agent layer for conversational, multimodal music discovery.","uni":"jt3645","language":"Python, FastAPI, Next.js, React, TypeScript, CSS, PyTorch, Transformers, librosa, pandas, NumPy, scikit-learn, Pillow, imageio-ffmpeg, Git LFS, Windows.","pid":"202605-8","m4uni":"","analytics":"The system uses an AST-based audio classifier with multi-crop clip aggregation for genre prediction. Librosa is used to extract tempo, RMS energy, spectral centroid, zero-crossing rate, and MFCC features. Text requests are mapped to music-style targets for retrieval and playlist planning. The agent module combines audio perception, text intent, image mood analysis, playlist structure, and LLM-based real-song recommendation. Visualizations include genre probability bars, sound profile cards, playlist timelines, energy curves, and song recommendation cards.","m4lname":"","industry":"Media","m3lname":"Dong","dataset":"The project uses the public GTZAN Music Genre Dataset for local audio perception and testing. GTZAN contains 1,000 audio clips across 10 genres, including blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock. The system extracts tempo, energy, brightness, texture, and genre information from audio. 
It can also support arbitrary uploaded audio files, text music requests, playlist goals, and image inputs.","m2uni":"xx2511","m2fname":"Xinge","m3uni":"rd3217"},{"projectname":"C9: Virtual Doctor -- Conversations","timestring":"Fri May 3 19:17:08 2024","m1uni":"yz4665","m2lname":"Chen","m1fname":"Yawen","m4fname":"","m1lname":"Zhou","m3fname":"","description":"We are committed to making our remote robot doctor achieve the following goals:

1. Immediate health consultation and support: Provide users with 24/7 health consultation services, allowing access to medical advice and information anytime, anywhere.
2. Disease prevention and health management: Help users prevent diseases and improve lifestyles through health education and self-management advice.
3. Primary screening and guidance: Capable of conducting preliminary symptom analysis and providing advice on seeking medical attention or guidelines for emergencies.
4. Mental health support: Offer mental health consultation and stress management advice, helping users deal with emotional and psychological issues.

By achieving these goals, we aim to improve access to health care in remote or resource-limited areas. Another benefit is reduced pressure on the healthcare system: self-service reduces unnecessary hospital visits and lets doctors focus on more urgent and complex cases. Finally, personalized health management through virtual doctors can also promote health awareness and education.
","uni":"yz4665","language":"python","pid":"202405-13","m4uni":"","analytics":"We implemented a series of data processing and machine learning modules to process and analyze disease-related text data. First, we process text data using TfidfVectorizer and SentenceTransformer to generate two types of features: TF-IDF features and BERT embedding vectors. These features are combined for subsequent machine learning tasks. We applied the KMeans clustering algorithm to group the data, and used IsolationForest for anomaly detection to identify abnormal symptoms in the data. In addition, we have developed a similarity search function based on cosine similarity to help users find known symptoms that are most similar to the entered symptoms. Finally, we integrate a module that uses BertForSequenceClassification and pipeline for disease prediction and treatment recommendations. This series of tools and methods provides powerful support for disease diagnosis and information retrieval.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The datasets tested include a list of diseases from A to Z along with their symptoms and treatments. This information was sourced from the NHS (National Health Service, UK) and IDPH (Illinois Department of Public Health) websites. This information is intended for use in a virtual doctor conversation application to help diagnose and suggest potential treatments for various health conditions based on symptoms described by a user. The datasets were obtained through a combination of web scraping and manual entry. Web scraping involves using Python to automatically extract data from websites, while manual entry involves manually inputting or correcting data.
Our software can potentially support a variety of medical and health-related data depending on its design. Possible types of data could include:
Patient medical history: Information about past medical conditions, surgeries, and treatments.
Medication data: Information about medications, dosages, and possible side effects.
Lifestyle data: Data related to patients' lifestyle choices such as diet, exercise, and sleep patterns.
Real-time health data: Data from health monitoring devices like heart rate, blood pressure monitors, etc.
Research and clinical trial data: For more advanced applications, integrating findings from recent medical research or ongoing clinical trials might be useful.","m2uni":"wc2855","m2fname":"Wenhe","m3uni":""},{"projectname":"Intelligent Recommendation and Indexing System for Deep Learning Papers","timestring":"Thu Dec 19 20:09:25 2024","m1uni":"sc5534","m2lname":"Chen","m1fname":"Sihan","m4fname":"","m1lname":"Chen","m3fname":"","description":"The primary objective of our project is to simplify the exploration of deep learning literature by creating a personalized learning platform that integrates a domain-specific knowledge graph, semantic embeddings, and retrieval-augmented generation (RAG). Our system is designed to address the growing challenges posed by the rapid expansion of interdisciplinary research and dense citation networks, making it easier for users to navigate and engage with complex academic content. The ultimate goal is to bridge the gap between the overwhelming volume of information and the unique needs of learners and researchers.

Our project introduces several key innovations to achieve this goal. Central to the system is a domain-specific knowledge graph that captures semantic relationships between research papers, methods, and tasks. This knowledge graph provides a foundation for advanced paper search capabilities and serves as a backbone for personalized recommendations. Enhanced by large language model embeddings and external knowledge sources, the system offers tools for contextualized search, paper summarization, and detailed question answering. A Graph Attention Network (GAT) powers the recommendation engine, which dynamically predicts user preferences and achieves a high level of accuracy.

Our system is capable of delivering tailored suggestions by identifying and recommending research papers aligned with individual interests. It supports efficient information retrieval, combining structured and unstructured data to produce accurate and relevant results. Users can ask open-ended questions and receive contextually aware responses grounded in reliable sources. Additionally, a paper summarization feature distills key contributions, methodologies, and applications, making complex research more accessible.

Our work is important in today’s academic landscape where the rapid growth of deep learning has led to information overload. By organizing dense citation networks into an intuitive knowledge graph and simplifying intricate methodologies through automated summarization, our project ensures that key innovations and insights are not lost in the flood of new publications. Furthermore, our platform democratizes access to knowledge by tailoring the learning experience to users of varying expertise levels, ultimately fostering greater engagement and understanding.
","uni":"sc5534","language":"TypeScript, SCSS, HTML(Angular), Python, Java(SpringBoot, Spring Data)","pid":"202412-14","m4uni":"","analytics":"Analytics and Algorithms
1. Graph Attention Network (GAT):
Our recommendation system is powered by a two-layer GAT model that predicts user-paper interactions with a high degree of accuracy (AUC of 0.91). GAT uses attention mechanisms to selectively weigh the importance of relationships between nodes in the knowledge graph, dynamically updating node embeddings for improved predictions.

2. Retrieval-Augmented Generation (RAG):
The RAG module combines vector-based retrieval with large language model (LLM) generation. This approach ensures responses to open-ended queries are both contextually relevant and grounded in factual knowledge, mitigating common issues like hallucination in LLMs.

3. Vector Retrieval:
Our system employs semantic similarity-based document retrieval using high-dimensional vector embeddings. The embeddings are generated from fine-tuned language models and stored in a dedicated vector database for efficient similarity search. Techniques such as approximate nearest neighbor (ANN) search algorithms optimize retrieval speed.

4. Knowledge Graph Construction:
A heterogeneous knowledge graph represents relationships between entities such as research papers, methods, and tasks. Built using Neo4j, this graph supports advanced citation-based queries and structural learning.

5. Heuristic Subgraph Search:
A heuristic algorithm narrows the search space for candidate papers by analyzing user embeddings and employing subgraph expansion strategies, ensuring efficient and targeted recommendations.

6. Paper Summarization:
The summarization module extracts concise descriptions of research papers, including methodologies, key innovations, and applications, making dense content more accessible to users.

System Modules
1. Data Acquisition and Preprocessing:
APIs from ArXiv, Papers with Code, and Semantic Scholar were used to gather metadata, citation relationships, and bibliographic details. The data is preprocessed to ensure quality and standardization.

2. Knowledge Graph Storage and Management:
The Neo4j database stores the heterogeneous graph, with a dual-database approach incorporating MySQL for auxiliary data management, such as user preferences and structured queries.

3. Recommendation Engine:
Built on the GAT model, this engine dynamically updates and personalizes suggestions based on user interactions and graph relationships.

4. Backend Services:
Spring Boot and Spring Data frameworks form the backend, enabling efficient data processing and API communication. Neo4j handles graph queries, while MySQL supports relational data operations.

5. Frontend Interface:
Implemented using Angular, the interface provides an interactive user experience with features such as vectorized search, dynamic graph visualizations, and rich content display using Angular Material and Tailwind CSS.

Visualization
1. Graph Visualizations:
Neo4j-powered interactive visualizations allow users to explore relationships between papers, such as citation chains and shared methodologies. These visualizations use D3.js for a rich, interactive experience.

2. Vectorized Search Results:
Search results are displayed with relevance scores and contextual highlights, making it easier for users to identify key research.

3. Paper Insights:
For individual papers, detailed views include abstracts, associated tasks, methods, and links to related works or implementations, offering comprehensive insights.

4. Open-Ended QA Responses:
Results for user queries are presented in an intuitive format, combining generated text with supporting citations and references.

5. Training Analytics:
During model training, metrics such as loss trends across epochs and the ROC curve are visualized to demonstrate the system's performance and convergence.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Tested Datasets
1. ArXiv Metadata: A collection of bibliographic information such as titles, abstracts, and publication dates. This dataset is sourced from the publicly available ArXiv API and provides identifiers (e.g., \"1512.03385\") to track publication details.
2. Papers with Code: This dataset focuses on metadata specific to tasks and methods in machine learning research. It includes structured taxonomies that categorize research papers by domain and subdomain.
3. Semantic Scholar: This dataset contributes citation relationships and additional bibliometric data, enriching the metadata to ensure comprehensive bibliographic coverage.

How These Datasets Were Acquired
The data acquisition process involved querying publicly available APIs:
1. ArXiv API: For retrieving metadata, including publication titles, abstracts, and identifiers.
2. Papers with Code API: To gather structured information on tasks, methods, and implementations.
3. Semantic Scholar API: For extracting citation relationships and validating bibliographic data.
","m2uni":"hc3515","m2fname":"Haowei","m3uni":""},{"projectname":"Predicting Lending Club Loan Status","timestring":"Sat Dec 22 23:44:32 2018","m1uni":"es3573","m2lname":"Tian","m1fname":"Erik","m4fname":"","m1lname":"Su","m3fname":"Rishabh ","description":"Our objective in this project is to build a classifier that is able accurately to predict the loan status of an applicant from their Lending Club loan application. The loan status is an indication of whether or not the loan will be fulfilled and thus if investments turn into fruition. This is important information to be able to predict for both the investor and applicant and we aim to reduce the overhead involved in prediction.

We build upon previous studies of this dataset by applying different preprocessing criteria, stacking models, and balancing the data before training our models.

This research is important for both the investor and applicant since our findings suggest that most fields of the application are not important for predicting the loan status. Our model can be used to accurately predict the loan status of an applicant with fewer fields to fill out in an application, further reducing the complexity of dealing with loans.

","uni":"es3573","language":"Python, R","pid":" 201812-40","m4uni":"","analytics":"Variational inflation factor, general linear models, and pearson correlation heat maps were used in preprocessing through the seaborn, sci-kit, and R packages. The machine learning algorithms were from pySpark's machine learning library along with its evaluation packages. The stacked model implementation was a custom out-of-fold prediction based algorithm. Undersampling was a simple majority random ratio-based sampling while oversampling was performed using the SMOTE package. Finally, visualizations were done with matplotlib. ","m4lname":"","industry":"Finance","m3lname":"Jain","dataset":"The dataset used was found on Kaggle but can also be found on the Lending Club site. It is approximately 480 MB with 890,000 observations and 75 features.

These files contain complete loan data for all loans issued from 2007 through 2015; a data dictionary is provided in a separate file.

https://www.kaggle.com/wendykan/lending-club-loan-data","m2uni":"ht2459","m2fname":"Hangyu","m3uni":"rj2511"},{"projectname":"US Craigslist Car Sale Data Analysis and price modeling","timestring":"Sat Dec 17 03:40:06 2022","m1uni":"np2839","m2lname":"Liu","m1fname":"Napasorn","m4fname":"","m1lname":"Phongphaew","m3fname":"Zachary","description":" Existing challenge - AutoTrader, KBB, and any car valuation website has tried to tackle this challenge, which can be subjective
","uni":"np2839","language":"Python, R, Html","pid":"202212-20","m4uni":"","analytics":"Input data from all CSV and web scrapers:

Gathered existing Craigslist dataset
Scraped existing AutoTrader dataset
Created custom webscraper of Cars.com & AutoTrader.com in real-time

SPARK → GCP into RDD for cleaning data:
EDA to determine distributions & correlation
Dropped all NaN columns
Reclassified not-so-common make/models into “other”

ML models - TensorFlow & R:
Preliminary R regression models
TensorFlow Keras Dense MLPs
SKLearn Decision Trees

Visualization charts & Price Prediction program:
PyPlot
Current price prediction from different models
Live AutoTrader.com & Cars.com listings

Front-end - Website:
HTML D3 with Flask back-end
CSV & h5 data repository","m4lname":"","industry":"Retail","m3lname":"Burpee","dataset":"- Craigslist
- Autotrader.com
- cars.com","m2uni":"ml4802","m2fname":"Ming","m3uni":"zcb2110"},{"projectname":"Emotional Nutritionist Chatbot with Hybrid Retrieval System","timestring":"Tue May 13 03:20:08 2025","m1uni":"ym3068","m2lname":"Han","m1fname":"Yigang","m4fname":"","m1lname":"Meng","m3fname":"Ziyao","description":"This project creates an emotionally intelligent nutrition chatbot that combines scientific credibility with emotional awareness. Key objectives include: developing a RoBERTa-based emotion classifier (BEAM) to detect 27 emotion categories, implementing emotion-aware prompt rewriting for empathetic responses, and creating a hybrid retrieval system that dynamically integrates NCBI/PubMed research with real-time web search. The innovation lies in bridging the gap between generic nutrition advice and personalized, emotionally-sensitive guidance that considers users' psychological states. This research is important because existing nutrition chatbots lack emotional intelligence and provide only generic recommendations, limiting their effectiveness in supporting users' actual dietary challenges and emotional needs.","uni":"ym3068","language":"Python-based system integrating OpenAI GPT-4o, fine-tuned RoBERTa (125M), LangGraph agent framework, FAISS vector search, Hugging Face Transformers, Tavily web search API, and NCBI E-utilities API with OpenAI text-embedding-ada-002.","pid":"202505-5","m4uni":"","analytics":"Core algorithms include: (1) RoBERTa fine-tuning with cross-entropy loss for 27-class emotion classification; (2) Emotion-aware prompt engineering that rewrites user queries to embed detected emotional context; (3) Hybrid retrieval system combining local FAISS vector search with dynamic NCBI fetching using similarity thresholds (0.65); (4) TF-IDF scoring for ranking retrieved article abstracts; (5) Stack-based knowledge accumulation where new documents are incrementally added to the local knowledge base; (6) ReAct (Reasoning and Acting) agent architecture for intelligent 
tool routing between academic and real-time search based on query characteristics and emotional context.","m4lname":"","industry":"Information","m3lname":"Zhou","dataset":"Primary datasets include: (1) GoEmotions dataset - 58,000 Reddit comments labeled with 27 fine-grained emotions, publicly available from Google Research; (2) NCBI/PubMed database - accessed via E-utilities API for peer-reviewed nutrition research articles. The system can support any nutritional data that can be embedded into vector format, user query logs, and scientific literature from various medical databases. All datasets are publicly accessible - GoEmotions through Hugging Face datasets library and NCBI through their open API.","m2uni":"dh3071 ","m2fname":"Dongbing","m3uni":"zz2915"},{"projectname":"Citibike Data Analysis & Visualization","timestring":"Thu Dec 14 23:56:32 2023","m1uni":"xc2641","m2lname":"Zeng","m1fname":"xingen","m4fname":"","m1lname":"chen","m3fname":"Zhengxuan","description":"Objectives:

Data Analysis and Visualization: The primary goal is to analyze and visualize Citi Bike historical trips data to derive meaningful insights. This involves transforming complex raw data into clear, customizable visual representations.

User-Friendly Web Application: Develop a user-friendly web application that allows users to interact with the data intuitively. The application should cater to individual preferences, providing a seamless and enjoyable experience for users exploring biking trends.

Machine Learning Predictions: Employ machine learning techniques to accurately predict biking trends. This involves utilizing pre-trained models, including Facebook's Prophet model and decision tree models, to make informed predictions based on historical data.

Innovations:

Integration of Machine Learning: By incorporating machine learning models, the project aims to go beyond basic descriptive statistics. This innovation allows for the generation of predictive insights, enabling users to anticipate biking patterns and make data-driven decisions.

Scalability on GCP: Leveraging the scalability of Google Cloud Platform (GCP) for running the web application and performing large-scale data processing using Apache Spark. This choice of technology enables handling vast amounts of data efficiently.

Customizable Visual Representations: The innovation lies in providing users with the ability to customize visual representations. This empowers users to tailor the data visualizations to their specific needs, fostering a more personalized and insightful exploration of the data.

Capabilities:

Large-Scale Data Processing: The project showcases the capability to process large datasets, with over 10 million trips and close to 2 GB of data handled by the web application. The backend, powered by Flask and integrated with Apache Spark, ensures efficient processing.

Predictive Analytics: Through the use of machine learning models, the project demonstrates the capability to move beyond historical analysis and provide predictive analytics. This enables users to make informed decisions based on anticipated biking trends.

Cloud-Based Infrastructure: The use of cloud storage for storing data and pre-trained machine learning models demonstrates the capability to leverage cloud-based infrastructure. This provides flexibility, scalability, and accessibility for the application's users.

Importance:

Informed Decision-Making: The project is important for individuals, city planners, and bike-sharing operators as it enables them to make informed decisions regarding bike usage patterns. Predictive analytics can guide resource allocation and infrastructure planning.

Enhanced User Experience: The user-friendly web application, coupled with customizable visualizations, enhances the overall user experience. This is crucial for ensuring that users can easily interact with and derive insights from the data without requiring advanced technical skills.

Contribution to Urban Mobility: Understanding biking trends is essential for improving urban mobility. This toolkit contributes to the broader goal of creating more sustainable and efficient transportation systems in urban environments.

In summary, the project aims to leverage data analysis, visualization, and machine learning to offer a comprehensive toolkit for understanding and predicting Citi Bike usage patterns, with a focus on user-friendliness, scalability, and real-world applicability.","uni":"xc2641","language":"Python; HTML; CSS; Javascript and GCP","pid":"202312-13","m4uni":"","analytics":"The project implements a full-stack web app running on the GCP visualizing Citi Bike Data.
The web app includes three main parts:
The frontend is built with HTML, CSS, and JavaScript, enhanced by frameworks such as Bootstrap, jQuery, and Popper. The frontend renders visualizations as bar charts, line charts, pie charts, etc.
The Flask app serves as the backend, integrating with Apache Spark for large data processing.
The data, including CSV files and machine learning models pre-trained on local hardware, such as Facebook’s Prophet model and decision tree models, are stored in cloud storage. ","m4lname":"","industry":"Transportation","m3lname":"Wen","dataset":"The dataset is from https://citibikenyc.com/system-data. We obtained it from the web. It is a public dataset. (The description was submitted via the link provided.) Our software can support any dataset with a similar structure.","m2uni":"yz4307","m2fname":"Yuteng","m3uni":"zw2851"},{"projectname":"Pokémon Battle Result Prediction","timestring":"Fri Dec 17 22:40:03 2021","m1uni":"ww2569","m2lname":"Zhu","m1fname":"Wenpu","m4fname":"","m1lname":"Wang","m3fname":"Ruilin","description":"1. Predict the battle result of two Pokémon based on two datasets - Pokémon attributes dataset and battle result history dataset.
2. Analyze which features lead to victory and which Pokémon features matter most in battle.
3. Develop an easy-to-use web application that predicts the result of Pokémon battles.
","uni":"ww2569","language":"Python, Windows","pid":"202112-44","m4uni":"","analytics":"Machine Learning methods for classification in our problem: K-nearest Neighbours, Support Vector Machine, Linear Discriminant Analysis, Random Forest.","m4lname":"","industry":"Information","m3lname":"Fan","dataset":"1. Pokémon attributes dataset: official data from Game Freak.
2. Pokémon battle result history dataset: collected by players and posted on the Internet.
Both datasets are stored as CSV files.","m2uni":"zz2765","m2fname":"Zikai","m3uni":"rf2756"},{"projectname":"NYC Motor Vehicle Collisions Analysis ","timestring":"Fri Dec 20 18:30:01 2019","m1uni":"jc5020","m2lname":"Gu","m1fname":"Jixuan","m4fname":"","m1lname":"Chen","m3fname":"","description":" NYC is a huge and busy city with thousands of vehicles on the road every day and many collisions. Over the years, traffic data has grown rapidly, making more detailed analyses possible.
The goal of the project is to analyze the vehicle collision dataset and find potential relationships between accidents and factors drawn from different datasets. With a deeper understanding of vehicle collisions, society's resources can be distributed more efficiently to address this problem.
","uni":"jc5020","language":"GCP, Spark, Python, html, javascript","pid":"201912-20","m4uni":"","analytics":"1. k-means to do the accident location clustering
2. Correlation Analysis to analyze the factors that are related to the accident
3. Web dashboard to visualize the results
4. Google map API to visualize the clustering result","m4lname":"","industry":"Information","m3lname":"","dataset":"1. NYPD Motor Vehicle Collisions - Crashes
2. Traffic_Volume_Counts_2014-2018
3. New York City Taxi Trip - Hourly Weather Data
4. Vehicle_Snowmobile_and_Boat_Registrations
","m2uni":"cg3095","m2fname":"Chaoxun","m3uni":""},{"projectname":"Stock Price Prediction with Media Sentiment","timestring":"Mon Dec 19 03:41:51 2022","m1uni":"wx2283","m2lname":"Cao","m1fname":"Wenshuo","m4fname":"","m1lname":"Xie","m3fname":"Jingchao","description":"Goal:
1. Our goal is to create a model that provides accurate and reliable predictions about stock prices to assist with investment decision-making.
2. Build a model that can predict stock prices using sentiment analysis on Twitter data.
3. Estimate the “market sentiment” based on the “public sentiment” and then predict the stock trend.

Novelty:

Our novelty is connecting the public sentiment with the stock price, and the use of different techniques to improve the accuracy and interpretability of the model.

","uni":"wx2283","language":"Hadoop, Spark, Jupyter Notebook, sklearn(MLP, Random Forest, Linear Regression), D3, Node Js","pid":"202212-27","m4uni":"","analytics":"Methodology:
Data Pre-processing:
We load the CSV data file into a Google Cloud Storage bucket, read it with PySpark, and then parse and filter the data.

Sentiment Analysis:
Used the vader_lexicon from Python's NLTK package to conduct sentiment analysis on the Twitter data, obtaining the polarity of the tweets for each day.

Model Training and evaluation:
Trained and evaluated machine learning models (linear regression, random forest, and multilayer perceptron) on the processed data

Algorithm:

Linear regression:
Modeled the relationship between sentiment data from Twitter and stock prices.

Random forest:
Used an ensemble learning method to predict stock prices based on Twitter sentiment data

Multilayer perceptron (MLP)
Used an artificial neural network to predict stock prices based on Twitter sentiment data

Visualization:

Finally, we used HTML, CSS, and JavaScript to develop an interactive website that displays our results.

","m4lname":"","industry":"Finance","m3lname":"Hu","dataset":"1. Tweets about the Top Companies from 2015 to 2020
Volume: Over 3 million rows of unique tweets data
Velocity: We performed a one-time scrape to collect the data, which contains tweets about top companies from 2015 to 2020
Variety: 7 columns, including tweet id, author of the tweet, post date, the text body of the tweet, and the number of comments, likes, and retweets
2. Values of Top NASDAQ Companies from 2010 to 2020
Volume: About 17,500 rows of stock price data
Velocity: We performed a one-time scrape to collect the data, which contains stock price data of top companies from 2010 to 2020
Variety: 7 columns, including ticker symbol, day date, close value, volume, open value, high value, low value
3. Twitter7 dataset (from the Stanford Large Network Dataset Collection)
Volume: 467 million Twitter posts from 20 million users
Velocity: It’s an existing dataset. It contains data covering a 7-month period from June 1, 2009 to December 31, 2009
Variety: 7 columns, including tweet id, author of the tweet, post date, the text body of the tweet, and the number of comments, likes, and retweets
4. Historical stock price dataset
Volume: About 15,000 rows of stock price data
Velocity: It’s an existing dataset. It contains stock price data of top companies in 2009
Variety: 7 columns, including ticker symbol, day date, close value, volume, open value, high value, low value
","m2uni":"sc5124","m2fname":"Shengqi","m3uni":"jh4312"},{"projectname":"Automated Stock Trading Using Deep Reinforcement Learning","timestring":"Sun Apr 25 04:00:51 2021","m1uni":"ll3297","m2lname":"","m1fname":"Luis ","m4fname":"","m1lname":"Lopez","m3fname":"","description":"Objectives:
To build a scalable deep reinforcement learning trading agent capable of day trading and executing complex trading strategies over N stocks in the US stock market.

Innovations:
Developed a modular and scalable deep reinforcement learning trading system that can carry out all steps of the trading pipeline from data collection, to training, to evaluation, and finally actual real time trading.

Capabilities:
1. The system is able to collect and store customizable intra-day stock market data from IEX Cloud.
2. The system is able to train and save model parameters for a deep reinforcement learning agent.
3. The system is able to load a pre-trained agent and execute real-time trades. It can make up to 5 trades/second over N stocks.
4. The system supports continuous learning.

Importance:
Leveraging deep reinforcement learning for stock trading is an active area of research. As it is, training deep reinforcement learning systems takes a long time. My toolkit expedites research by facilitating the collection of customizable stock market data and by making the training of various models as easy as importing a package. ","uni":"ll3297","language":"Python, Google Colab, and Google Cloud ","pid":"202105-14","m4uni":"","analytics":"Analytics:
1. Data collection was done using Requests to query IEX Cloud
2. Deep reinforcement learning was used to implicitly predict stock prices through policy improvement, thereby creating an agent that could carry out optimal trades.

Algorithms:
The following deep reinforcement learning algorithms were implemented in my project: DQN, DDPG, A2C, PPO2, SAC, TD3, and GAIL.

System Modules:
1. For data collection the module Requests was used to query IEX Cloud.
2. For training the deep RL trading agent, Gym was used to create the training and trading environment while Stable Baselines was used for the deep RL model itself.
3. For visualization, plotly was used.

Visualization:
I used plotly to render a graph showing the change in portfolio value over time. The graph has a slider which allows the user to focus on certain time frames. ","m4lname":"","industry":"Finance","m3lname":"","dataset":"The datasets were collected by querying IEX Cloud at a rate of 4 batch queries/second, where each batch consisted of stock prices for 10 stocks. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Forecasting Future Energy Demand in the Smart Grid","timestring":"Tue May 19 23:35:50 2020","m1uni":"sy2657","m2lname":"","m1fname":"Serena","m4fname":"","m1lname":"Yuan","m3fname":"","description":"Data analytics plays a prominent role in modern industrial systems such as electricity transmission. Smart meter data is irregular and unpredictable from traditional system-level data. Forecasting of energy using deep learning provides a clearer interpretation of uncertainty and volatility in future energy demand. Renewable energy exhibits volatility and intermittent and random behavior. Our approach for forecasting future energy demand is a hybrid combination of different methods from machine learning, representation learning, and deep learning. We extract important features such as time-based features, the trend, and the optimal lag and use these to create representations that our models can operate on.","uni":"sy2657","language":"python, keras, gluonts, mxnet, plotly, dash","pid":"202005-25","m4uni":"","analytics":"Analytics: hierarchical time series analysis (top-down and bottom up), lag features and window features and use of the features in the model, LSTMs, Deep AR, Feedforward networks with L1, L2, Huber, and triplet loss
Algorithms: sampling triplets to input into triplet loss, autocorrelation function, algorithm to calculate trend with rolling mean given window period
Visualizations: plotly, dash, plot of predictions (median, 50 percent and 90 percent confidence interval bands), visualization of trend and hour of day","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The main group of datasets is from Pecan Street Institute (https://dataport.pecanstreet.org/). This data is collected based on the Independent system operators (ISOs) region, where the US is divided into 7 ISO regions. The other group of datasets is from the Australian Smart Grid Smart City (SGSC) project that was collected from 10,000 customers in New South Wales from 2010 to 2014.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Large-scale Fraud Detection System","timestring":"Fri Apr 23 17:44:01 2021","m1uni":"zz2777","m2lname":"Wu","m1fname":"Zixuan","m4fname":"","m1lname":"Zhang","m3fname":"","description":"Build a scalable and microservice-based fraud detection system that can:
1. handle heavy traffic with possible volume peaks
2. offer scalable, real-time predictions with high precision and recall
3. support user correction and approve similar requests immediately
","uni":"222777","language":"Python, AWS","pid":"202105-11","m4uni":"","analytics":"Use one-class SVM and local outlier factor to build a pipelined fraud detection model","m4lname":"","industry":"Information","m3lname":"","dataset":"We generated the data ourselves because financial datasets are sensitive and heavily anonymized. The generation algorithm can be found here: https://github.com/namebrandon/Sparkov_Data_Generation","m2uni":"lw2944","m2fname":"Linxiao","m3uni":""},{"projectname":"LoL Live Evaluator","timestring":"Sat Dec 24 05:17:33 2022","m1uni":"ts3415","m2lname":"Talkani","m1fname":"Tianyi","m4fname":"","m1lname":"Sun","m3fname":"Sheng","description":"Throughout the course of this project, we have been able to deploy a web application that can analyze live and completed league of legend games through a user friendly web interface and deep learning model for predictions and analysis. We have elaborated on the system in detail, ranging from the data collection/preprocessing, deep learning model training and experiments, website front-end and back-end, as well as potential future work for the project such as making the deep learning model more lightweight, integrating airflow, etc. With some more work, we believe this project could be significantly useful from both a business standpoint for streamers and viewers, as well as from an analysis standpoint to allow players to improve upon their gameplay","uni":"ts3415","language":"Pyhton, HTML, JS, CSS, Flask, Jinja, Google Cloud","pid":"202212-6","m4uni":"","analytics":"For our deep learning model prediction, we needed to preprocess the data by leveraging libraries such as literal_eval from ast in order to convert strings to arrays, while also converting the time stamps into a sequence of timestamps similar to time-series data taken per minute.

As teams need to destroy a sequence of enemy turrets, invade the base, and destroy the enemy nexus in order to win a game, we consider game stats such as the difference in dragons, barons, kills, etc., between the two teams. We store these in a dataframe to train our deep learning model (which leverages PyTorch and SparkTorch) prior to the real-time streaming.
","m4lname":"","industry":"Information","m3lname":"Shen","dataset":"For our dataset, we make use of the open-source LoL kaggle dataset shown here, which includes competitive games from 2015-2017. This data consists of around 7621 matches, and includes features taken directly from the Riot API. We then add additional matches to this dataset through the Riot API, and reformat the dataset to get our desired features. Our final dataset comes to over 17,000 matches. As we also make use of the live API, our data is both voluminous and has velocity. As we make use of random matches, including both professional and amateur matches, along with 7 features, it also has variety.
","m2uni":" aat2193","m2fname":"Ayman","m3uni":"ss6635"},{"projectname":"Fake News Detection","timestring":"Sat Dec 22 05:15:08 2018","m1uni":"yd2466","m2lname":"Ji","m1fname":"Yuanchu","m4fname":"","m1lname":"Dang","m3fname":"Wei","description":"With the rapid development of Internet and social media, the sheer volume of news and contents starts to explode in an exponential manner, and so do fake ones. Our final project is focused on the topic of fake news detection, an active area of research where many problems remains to be solved or perfected. There are many important sub-problems in this field, such as sentiment analysis, predicting the veracity of news and documents, validating sources against one other, etc. For this project, we choose to narrow down to a specific formulation, that is, given a news headline and a body of paragraph, our objective is to determine whether the body discusses, agrees of disagrees with, or is totally unrelated to the headline text. In terms of modeling, we first implement and configure vanilla feed-forward neural networks and achieve solid benchmark accuracy. On top of that, we further experiment with recurrent neural networks, especially long short-term memory units that are known to perform well with sequential data such as texts. Last but certainly not least, we apply BERT - a latest pre-trained fine-tuning language model developed by Google AI - to this classification problem and achieve decent training outcomes. Separately, using Twitter's streaming API and our trained classifiers, we build an interface that validates the truthfulness of tweets in real time based on the predefined ground truth body texts. 
","uni":"yd2466","language":"Python, JavaScript, TensorFlow, Keras, Elephas, Google Cloud, Twitter Streaming API","pid":"yd2466","m4uni":"","analytics":"Feed-forward Neural Network, Multilayer Perceptron (MLP), Recurrent Neural Network (RNN), Long Short-Term Memory Unit (LSTM), Bidirectional Encoder Representations from Transformers (BERT)","m4lname":"","industry":"Information","m3lname":"Luo","dataset":"The dataset we use comes from Stage 1 of the Fake News Challenge. In total, the dataset is comprised of 1683 article bodies, 49972 headlines, and ground truth labels corresponding to each headline-body pair. The size of the joined dataframe is over 100MB. We split the entire dataset into training and validation according to a 85% to 15% ratio. ","m2uni":"yj2466","m2fname":"Yanmin","m3uni":"wl2671"},{"projectname":"Avocado Price and Sales Prediction System","timestring":"Sat Dec 18 02:47:23 2021","m1uni":"yl4628","m2lname":"Qiu","m1fname":"Yiran","m4fname":"","m1lname":"Lin","m3fname":"Stacy","description":"Objective: Build a system that can predict the price and consumption of avocados in certain markets as well as generate a visual data report by analyzing the marketing data of avocados.

Innovations: Most applications using this avocado dataset either focus on prediction or data visualization. We want to build a system that allows both functionalities.

Capabilities: For the model prediction function, our application would be used for estimating the price and consumption level of avocado which helps sellers in the process of demand planning and estimating the sales revenue. Plus, the data visualization function analyzes consumer purchasing behavior in the U.S. which presents marketing insights of consumer buying behavior, ex: the most sought-after type of avocado and how preferences vary by region across the U.S.

","uni":"yl4628","language":"We used Python and R","pid":"202112-24","m4uni":"","analytics":"Algorithms: ML algorithms including Linear Regression, Ridge Regression, Support Vector Machine, Bayes Ringe, Random Forest, and Multilayer Perceptron were used.

Visualization: reactive charts (line, bar, pie) were implemented using Shiny App.","m4lname":"","industry":"Retail","m3lname":"Lai","dataset":"We used the historical data on avocado prices and sales volume in multiple US markets found on Kaggle. This is the only data our software is using.","m2uni":"yq2310","m2fname":"Yunze","m3uni":"sl4450"},{"projectname":"Trending YouTube Videos Analysis","timestring":"Fri Dec 23 10:09:25 2022","m1uni":"smc2306","m2lname":"Shah","m1fname":"Sumit","m4fname":"","m1lname":"Chavan","m3fname":"","description":"The trending page on YouTube gives videos more exposure and reach, and can drastically increase viewership and hence revenue for the content creator. This project aims to understand what factors lead to the popularity of a video by analyzing different data points about popular videos such as the views, shares, comments, likes, etc. The work further extends to sentiment analysis on the comments, using different machine learning models, to understand whether public sentiment about real-world events affects the popularity of videos.","uni":"smc2306","language":"Python, Pyspark, Jupyter Notebook, Apache Airflow, Google Cloud Platform (Dataproc, Cloud Storage, BigQuery), Tableau","pid":"202212-31","m4uni":"","analytics":"1. Exploratory data analysis:
Distribution of title length across videos and word patterns in title
Trends observed in publishing time and day of the week
Distribution of trending videos in terms of categories and channels
Correlation between number of views, likes, and comment count.
2. Sentiment Analysis:
Used count vectorizer on the textual content of comments.
Performed sentiment analysis on comments by implementing different ML algorithms:
Logistic Regression, Naive Bayes, Decision Trees, Random Forests
3. Workflow Orchestration: Airflow
4. Visualization: BigQuery, Tableau (Distribution of sentiments by categories & by channels, accuracies of models over days, etc)
","m4lname":"","industry":"Media","m3lname":"","dataset":"The dataset is a combination of static dataset from Kaggle taken from the following link: https://www.kaggle.com/datasets/datasnaek/YouTube and real-time streaming data from YouTube. The Kaggle dataset consists of ~700K unique comments for ~2k trending videos for the US region.

We also fetch the most popular videos for the US region via the YouTube Data API v3, followed by all corresponding comments for each video, for sentiment analysis. The video information is similar to that of the Kaggle dataset. However, the comment information has more details, such as likes, replies, and the published time of each comment. The data obtained from the YouTube API amounts to around 500K comments processed daily across around 200 trending videos.
","m2uni":"vs2779","m2fname":"Vinay","m3uni":""},{"projectname":"Language Translation Application on Google Glass","timestring":"Wed May 11 23:26:01 2022","m1uni":"wk2359","m2lname":"Liu","m1fname":"Weiyi","m4fname":"","m1lname":"Kong","m3fname":"","description":"Augmented reality (AR) is one of the biggest technology trends right now. It enables us to experience real-world environments with a digital augmentation overlaid on them. AR has been used in a variety of areas now. Among plenty of AR applications, we want to build an AR application that can translate different languages when users wear it. The reason that we want to work on this project is that we find that sometimes it’s difficult for people to travel to other countries without understanding the languages of that country. And it can be very awkward in some situations without understanding the language. Also, with international trading, we can see plenty of international products in the market. Hence, even without traveling, an AR language translator is necessary for our daily life. Therefore, we want to design an application on Google Glass that can translate what users see simultaneously to the language they choose. Nowadays, there are some similar products that have already been used. For example, Google has an app called Google Translate live AR camera. 
This app can translate any type of text, such as letters, signs, and video games, into a foreign language.","uni":"wk2359","language":"Kotlin/Android Studio","pid":"202205-11","m4uni":"","analytics":"We deploy the object detection system in our application using the Google Vision service.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"We use pictures that we take ourselves to test whether the detection and translation module works well.","m2uni":"xl3129","m2fname":"Xiaoyu","m3uni":""},{"projectname":"Combining Macroeconomic Data, Formulaic Alphas, Financial Text and AI - Returns Forecasting and Position Sizing using Machine Learning, and Explainable Machine Learning for Diagnostics","timestring":"Fri May 5 17:10:40 2023","m1uni":"no2367","m2lname":"Benghiat","m1fname":"Nicklaus","m4fname":"","m1lname":"Ong","m3fname":"","description":"Objectives: We want to build a smart trading algorithm that utilizes a variety of data and techniques, tweaks bet sizing according to the market regime, and utilizes explainable machine learning to elucidate the decision-making process. This is crucial research that will push the boundaries of end-to-end AI trading systems. We will focus on S&P500 companies.

Innovations: Obtained historically accurate constituents, elucidated steps to construct, purged k-fold cross validation. Gold standard for testing and simulation. Use multiple sources of data (market data, news, macroeconomic). Comparison of multiple news sources. Compare FinBERT, Financial Lexicon Approaches and ChatGPT. Use triple barrier method for more stable returns. Use Machine Learning for Bet Sizing. Reinforcement Learning for Bet Sizing.

Capabilities: Combining Macroeconomic Data, Formulaic Alphas, Financial Text and AI - Returns Forecasting and Position Sizing using Machine Learning, and Explainable Machine Learning for Diagnostics. Multi stocks, multi alpha, multiple data source processing, multiple sentiment analysis methods.

Why Important: Machine learning has the potential to transform the world of investments and allows investors to gain an edge over peers. To capture a holistic view of the markets, trading algorithms should combine a variety of data from different sources when predicting returns. Algorithms can be further enhanced if they can learn bet sizing; this helps them adapt to different market environments and generate alpha. A black box algorithm can be dangerous! If a model ever malfunctions, investors will not know until the model starts losing money. Explainable ML allows investors to understand models and intervene if ever necessary. Because the datasets are public, retail investors can also access and apply our research, which helps democratize finance and level the playing field.
","uni":"no2367","language":"Python, Plotly Dash, AWS Lambda, AWS SNS, AWS Sagemaker, AWS S3, AWS EC2, OpenAI ChatGPT","pid":"202305-1","m4uni":"","analytics":"Modules: openai, finrl, yfinance, pynytimes, lightgbm, FinBERT, lazypredict, sklearn, matplotlib, missingno, numpy, pandas, pyarrow, boto3, seaborn, scipy, tqdm

Machine Learning Algorithms: RandomForest, AdaBoost, GradientBoosting, Light Gradient Boosting, Support Vector Machines, Extra Trees

Visualizations: LIME, Kernel Density Estimates, Histogram, Bar Charts, Feature Importance LGBM, Heatmaps

Dashboard (For every stock and every date, several benchmarks): Daily Returns, Cumulative Returns, Cumulative Log Returns, Daily Alpha, Cumulative Alpha, LIME, Drop Down Filtering for granular analysis for stock, Date Range Filtering

Reinforcement Learning: Actor-Critic (A2C), Deep deterministic policy gradient (DDPG), Proximal policy optimization (PPO), Twin delayed deep deterministic policy gradient (TD3), Soft Actor-Critic (SAC)

Machine Learning Concepts: Purged Cross Fold Validation, Triple Barrier Method, Bet Sizing using predicted probabilities

AWS: AWS SNS (lambda triggering), AWS Lambda (Sagemaker activation), AWS Sagemaker (data scraping, model training, results, predictions, dashboard data analytics), AWS EC2 (dashboard webserver hosting), AWS S3 (data storage)","m4lname":"","industry":"Finance","m3lname":"","dataset":"Historically Accurate S&P500 Constituents, S&P Dow Jones Indices releases tracked via https://en.wikipedia.org/wiki/List_of_S%26P_500_companies, 752 tickers in total

Market Data (Open, High, Low, Close, Volume) from Yahoo Finance (for active tickers), Investing.com (for delisted tickers), scraped 743 tickers in total.

Macroeconomic Data (Federal Reserve Database): obtained Real Gross Domestic Product, 10-Year Breakeven Inflation, Purchasing Managers Index, US Yield.

New York Times News: scraped 292,584 news articles using the New York Times Article Search API (https://developer.nytimes.com/apis)

Analyst News (Zacks Investment Research, Investing.com, Seeking Alpha, Motley Fool) from https://www.kaggle.com/datasets/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests

US Equity News (Yahoo Finance), dataset was removed from Kaggle

Loughran-McDonald Dictionary (https://sraf.nd.edu/loughranmcdonald-master-dictionary/)

We can also support higher- or lower-frequency market data (news items carry timestamps, e.g., from the New York Times API), and we could process any other news site. We can pull more macroeconomic data from FRED.
","m2uni":"jb4653","m2fname":"Jonathan","m3uni":""},{"projectname":"Multimodal Video Understanding System","timestring":"Thu Dec 18 20:09:19 2025","m1uni":"dy2525","m2lname":"Shi","m1fname":"Dishen","m4fname":"","m1lname":"Yang","m3fname":"Tianyao","description":"Objectives: The primary goal of this project is to develop a Multimodal Video Understanding System capable of answering natural language questions grounded in video content. We aimed to bridge the gap between static visual perception and temporal reasoning by integrating a pre-trained EVA-CLIP visual encoder with the Llama-2 language model.
Innovations: Our key innovation lies in designing a resource-efficient 3-Stage Training Pipeline (Image Alignment - Video Pre-training - Instruction Tuning) that operates successfully on a single NVIDIA L4 GPU (24GB VRAM). To address hardware constraints, we implemented novel efficiency strategies:
Token Pooling: Concatenating adjacent visual tokens to reduce sequence length by 75%
LoRA (Low-Rank Adaptation): Fine-tuning only 0.5% of parameters to enable efficient domain adaptation without full-model retraining.
Capabilities: The system demonstrates capabilities in static perception, temporal motion understanding, and conversational interaction. It supports short-video inference where users can upload a video and ask complex questions, receiving context-aware natural language responses.
Importance: This research is critical because it demonstrates the feasibility of building powerful multimodal systems on consumer-grade hardware, lowering the barrier to entry for advanced video AI. It has significant applications in video search, content moderation, and assistive technologies for the visually impaired.

","uni":"dy2525","language":"Python on Google Cloud Platform","pid":"202512-1","m4uni":"","analytics":"1. Algorithms & Core Models:
Visual-Language Alignment: We implemented a Linear Projection Algorithm to map frozen EVA-CLIP (ViT-g/14) visual features into the Llama-2-7B embedding space.
Efficiency Algorithms: Token Pooling: A custom algorithm that concatenates every 4 adjacent visual tokens into a single token, reducing sequence length by 75% for memory optimization. LoRA (Low-Rank Adaptation): Implemented on the LLM's q_proj and v_proj layers (Rank=64, Alpha=16) to enable parameter-efficient fine-tuning on a single GPU.
2. System Modules:
3-Stage Training Pipeline: A modular system designed for progressive learning: Stage 1 (Static Alignment), Stage 2 (Temporal Perception), and Stage 3 (Instruction Tuning). Data Processing Module: Custom Python scripts developed for parsing CSV metadata, validating video URLs, and converting raw data (LAION, Condensed Movies, VideoChatGPT) into structured JSON formats for the training loader.
Inference Module: An end-to-end pipeline integrating OpenCV for frame sampling, EVA-CLIP for encoding, and Llama-2 for response generation.
3. Visualization & Analytics:
Weights & Biases (WandB): Integrated for real-time visualization of Training Loss convergence, Learning Rate scheduling, and System Metrics (GPU VRAM usage, Power consumption) to monitor stability during the single-batch training process.","m4lname":"","industry":"Information","m3lname":"Yu","dataset":"We utilized a multi-stage dataset strategy involving three specific public datasets, processed into subsets to fit our computational constraints:LAION dataset, Condensed Movies (CMD) dataset, VideoChatGPT dataset.
Supported Data: Our software architecture is designed to support any standard video files (e.g., .mp4) paired with textual descriptions or instruction-answer pairs formatted in JSON, making it extensible to other public video-text datasets.","m2uni":"js6605","m2fname":"Jinze","m3uni":"ty2573"},{"projectname":"Deep Learning Streaming Platform for Maritime Vessel Mobility Pattern Detection","timestring":"Sat Dec 18 05:05:58 2021","m1uni":"jkk2139","m2lname":"Tetali","m1fname":"Joseph","m4fname":"","m1lname":"Krozak","m3fname":"","description":"In recent years, the vast number of available sensors and associated Internet of Things (IoT) devices has produced high volumes of data that, if processed efficiently and accurately, can give significant situational awareness and insight. The maritime domain is an example of this phenomenon, where vessels transiting the globe are outfitted with automatic identification system (AIS) transponders that report vessel positions in near real-time. In aggregate, the stream constitutes several thousand positional messages per second.

Maritime traffic is critical to the global economy, as a wide variety of goods are shipped across the globe. The ability to understand vessel activities is therefore critical to establishing global financial situational awareness. Additionally, potentially nefarious vessel activities - such as pirating, human trafficking and illegal drug conveyance - not only have a profound impact on the global economy but also on the national security postures of the countries involved in global vessel transportation.

Big data and deep learning technologies can be leveraged to process global AIS transponder data, in real time, in order to infer vessel activities by their mobility patterns. A novel approach to vessel activity classification is the translation of vessel movements into track images that can be classified using deep learning networks and computer vision approaches.","uni":"jkk2139","language":"AWS S3 / Athena (Data Lake) / Glue (ETL) / Managed Airflow (Workflow) / AWS QuickSight (Visualization) / AWS SageMaker (Deep Learning), Python","pid":"202112-22","m4uni":"","analytics":"Image Factory - Conversion of positional vessel data into normalized, colored images suitable for image classification. Extensive Athena SQL queries - leveraging LEAD and LAG SQL window functions - to isolate mobility patterns of interest from an ocean of vessel position data. AWS QuickSight dashboard providing interactive insight into the vessel position datasets.","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"Marine Cadastre (https://marinecadastre.gov/ais/) Automatic Identification System (AIS) data sets contain historical vessel transponder data covering 20 geographical zones - essentially the Western Hemisphere. This is a publicly available data source, with billions of positional data records, organized in CSV files, from 2009 to 2021.","m2uni":"rht2115","m2fname":"Rao","m3uni":""},{"projectname":"A BERT and CNN based prediction model on competitive products","timestring":"Thu May 12 04:40:26 2022","m1uni":"yc4031","m2lname":"Wang","m1fname":"Yuefei","m4fname":"","m1lname":"Chen","m3fname":"","description":"This project addresses automatic sales lead finding. It aims to find and classify the potential customers of a group of companies that have similar products. Our group proposes a novel model and measurements to explore and find the potential customers of each brand. 
The model uses BERT and convolutional neural networks to classify potential customers. Two popular brands, Apple and Samsung, are selected to evaluate the model. Public tweets about these brands were collected for this project, and the final accuracy is above 75%.","uni":"yc4031","language":"Python and HTML","pid":"202205-18","m4uni":"","analytics":"BERT model, CNN model. A website is implemented.","m4lname":"","industry":"Retail","m3lname":"","dataset":"This is a dataset from Kaggle. It consists of more than 8600 subjects; each subject represents an interviewee, along with their text material and a tested standard MBTI type. The MBTI type distribution in the dataset matches the estimated real-world MBTI type distribution. This is part of the reason we chose MBTI as the personality indicator, since datasets containing Big Five model information are no longer available.
This dataset is then used for BERT MBTI classification model training and testing.","m2uni":"xw2812","m2fname":"Xinzhe","m3uni":""},{"projectname":"Graph Neural Networks for Deepfake Detection","timestring":"Sat May 16 01:28:47 2020","m1uni":"zmp2105","m2lname":"Pan","m1fname":"Zane","m4fname":"","m1lname":"Peycke","m3fname":"","description":"As part of an increasing amount of false information, videos altered or generated with artificial intelligence techniques have reached a level of realism that makes it difficult for human viewers to discern whether a video is real or fake. These videos, called deepfakes because of the deep learning methods often used to create them, pose a significant problem for society, and new tools are needed to detect the authenticity of media. Widespread adoption of deepfake creation tools will fuel an increased amount of disinformation, continued erosion of trust, and increase the liar’s dividend. Efficient deepfake detection is a difficult problem because of both the computational complexity and questions surrounding what constitutes a real video. We propose a novel implementation of graph neural networks to build an accurate deepfake detection classifier. Graph neural networks are well suited to this problem because of their ability to detect hidden patterns in non-euclidean space, and the spatial and temporal relationships of deepfake alterations. We are not aware of any existing deepfake detection techniques that utilize graph neural networks. ","uni":"zmp2105","language":"We used Python, Jupyter Notebooks, and C.All storage and computing resources were handled using Google Cloud Platform (8vCPUs, 30GB of RAM, and one Tesla P100)","pid":"202005-30","m4uni":"","analytics":"We implemented a Deep Graph Convolution Neural Network.
The paper is available here: http://muhanzhang.github.io/papers/AAAI_2018_DGCNN.pdf
Source code is available here: https://github.com/muhanzhang/DGCNN","m4lname":"","industry":"Media","m3lname":"","dataset":"We utilized several publicly available datasets comprised of original and modified videos. Data is available from the following locations:
http://kaldir.vc.in.tum.de/faceforensics_benchmark/documentation
https://www.kaggle.com/tunguz/70000-real-faces-1
https://www.kaggle.com/tunguz/1-million-fake-faces/kernels

Our model can support any new videos or images. Graphs can be created using the graph-creation notebook in our repository. ","m2uni":"zp2217","m2fname":"Zhongtian","m3uni":""},{"projectname":" Forecast of Stock Price By Public Sentiment","timestring":"Sat Dec 18 04:43:14 2021","m1uni":"yc4029","m2lname":"Huang","m1fname":"Yi","m4fname":"","m1lname":"Chen","m3fname":"Haoxiong","description":"Goals:
Predict public sentiment towards daily stock prices by tweets
Generate optimal decisions to buy in/sell specific stocks based on Long/Short Strategy
Innovations:
Predict public sentiment towards daily stock prices by tweets
Generate optimal decisions to buy in/sell specific stocks based on Long/Short Strategy
","uni":"yc4029","language":"Python, django, d3.js, java script, GCP","pid":"202112-12","m4uni":"","analytics":"We trained the following five classifiers to predict the rise/fall of the stock price in the next day:
1. Logistic Regression(LR)
2. Support Vector Machine(SVM)
3. Decision Tree(DT)
4. Random Forest(RF)
5. K-Nearest Neighbors(KNN)","m4lname":"","industry":"Information","m3lname":"Su","dataset":"1.NASDAQ100 Twitter Dataset
This dataset includes about 1 million tweets collected over 79 days from March 28, 2016, to June 15, 2016, with references to cashtags of NASDAQ100 companies. The data is provided by followthehashtag.com, a Twitter search analytics and business intelligence tool.
2.Stock price data from Yahoo Finance
Stock price data was downloaded from Yahoo Finance using the yfinance package. Cashtags of NASDAQ100 stocks were used to query stock prices at corresponding dates. Each stock has “open price”, “high price”, “low price”, “close price”, and “volume” for each day as columns.
3.Twitter streaming data
Twitter streaming data for online application of our model is collected from Twitter API using sockets. Data was thus stored in our local storage. Detailed process will be discussed in later sections.","m2uni":"th2884","m2fname":"Tianchun","m3uni":"hs3228"},{"projectname":"Big Data Analysis on Taiwanese Stock Market","timestring":"Sat Dec 22 09:32:37 2018","m1uni":"cc4338","m2lname":"Chou","m1fname":"Chun-Lin","m4fname":"","m1lname":"Chao","m3fname":"Yu-Chun","description":"In a stock market, there are 2 groups of investors (individual investors and institutional investors). Our goal is to help individual investors to get positive annual return on investment. Unlike the previous work about the analysis on stock market (only use Backtesting) , we also modified some trading strategy and use EMA(Exponential Moving Average) to analyze the data. We believe that the prediction can be more accurate with our methodology.","uni":"cc4338","language":"Python/HTML/Javascript","pid":"201812-33","m4uni":"","analytics":"For algorithms, we used three kinds of typical regressions, Linear Regression, Decision Tree Regression and Random Forest Regression. For analytics, we used StringIO to fetch a dataset and Numpy and Pandas to analyze the data frame. Besides, we also used TA-lib to analyze EMA(Exponential Moving Average). Finally, we used matplotlib to make several plots to analyze the trend of stock price. We built some trading strategies such as Stock Screening and BackTesting (Modified Double Cross Method) with EMA to calculate the Return on Investment and win rate. 
For system modules and visualization, we built a web page interface that uses HTML and JavaScript to import and display all of the plots we made.","m4lname":"","industry":"Finance","m3lname":"Shih ","dataset":"We crawled the data from the website of TWSE (Taiwan Stock Exchange Corporation) and output a CSV file using the API tool StringIO.","m2uni":"sc4400","m2fname":"Shao-Chi","m3uni":"ys3152"},{"projectname":"TradeFx: An AI-Powered FX Trader","timestring":"Fri May 3 17:07:09 2024","m1uni":"jw4455","m2lname":"Cheng","m1fname":"Jianghao","m4fname":"","m1lname":"Wu","m3fname":"","description":"Objectives:

Predict FX Rates: Leverage machine learning to forecast future currency exchange rates accurately.
Provide Investment Analysis: Utilize an AI-powered chatbot to offer investment suggestions and analysis.
Trade: Implement automatic trading strategies that align with user-defined short or long-term preferences.

Innovations:

Integration of Machine Learning and Reinforcement Learning: Combining ML for prediction and RL for active trading decisions is innovative in the application of these technologies in the forex market.
AI-Powered Investment Advisor: A chatbot that not only provides general advice but also understands and reacts to market sentiment, enhancing decision-making for traders.

Capabilities:

Forecast future currency exchange rates.
AI-powered chatbot offering investment suggestions and analysis.
Trading simulations.

Why important?

Utilizing machine learning could enhance predictive accuracy, enabling traders to make more timely investment decisions.
AI-powered investment advisor (chatbot) helps bring informed decision making.
","uni":"zc2747","language":"Python, html, javascript, css","pid":"202405-5","m4uni":"","analytics":"SARIMA, Reinforcement Learning, OpenAI Gym, gpt3.5, flask, matplotlib, plotly","m4lname":"","industry":"Finance","m3lname":"","dataset":"Forex price historical data from Kaggle

Forex related news from ForexLive.com","m2uni":"zc2747","m2fname":"Zekai","m3uni":""},{"projectname":"High-Frequency Trading via Real-Time Streams","timestring":"Fri Dec 15 20:41:27 2023","m1uni":"ms6641","m2lname":"Thevenin","m1fname":"Musa","m4fname":"","m1lname":"Shams","m3fname":"Yiduo","description":"The goal of this project is basically to perform near real-time predictions on given stock tickers using various machine learning algorithms. The data is streamed through PySpark and from the yfinance API and is projected through a Flask web application.","uni":"ms6641","language":"Python was used with various libraries such as PySpark, TensorFlow, Flask, yfinance, etc.","pid":"202312-3","m4uni":"","analytics":"The LSTM algorithm was the final algorithm implemented, with a graph showing predicted stock values also implemented as well.","m4lname":"","industry":"Finance","m3lname":"Jiang","dataset":"The dataset is dependent on the ticker chosen and is streamed through yfinance. The data consists of mainly previous stock prices of the selected ticker and is publicly available through any stock price viewer application.","m2uni":"nit2111","m2fname":"Nicholas","m3uni":"yj2723"},{"projectname":"Predicting Customer Creativity Based On Amazon Customer Review Data","timestring":"Fri May 5 01:19:33 2023","m1uni":"yj2737","m2lname":"Wang","m1fname":"Clarence","m4fname":"","m1lname":"Jiang","m3fname":"","description":"Objective: customer behavior data are normally utilized to help predict customer behavior, especially if a customer is willing to buy a given product or not. However, from a different perspective, customer behavior data not only reflects their attitudes toward products but also demonstrates customer characteristics such as creativity or personality. We plan to use customer behavior data to know more about customers. 
We will scrape Amazon product review information, convert raw data into a formalized dataset, develop robust machine-learning models to predict their creativity, and eventually build an application interface for users to interact with our model.

Innovation: 1. We scraped our own data and developed our own dataset, which also required more data cleaning. 2. We connected customer behavior data with a rather abstract target feature: creativity. No one has done that before.

Capability: 1. Our system has decent accuracy. 2. Our product is user-friendly, since we also created a straightforward interface for users. 3. To the best of our knowledge, we are the first to connect customer behavior data from Amazon reviews with customer creativity. It could bring more inspiration to those who are interested in creativity measures.

Why important:
1. From a technical point of view, this research is important mainly because of its integration of machine learning techniques, data scraping tools, and software applications.
2. From an innovation point of view, our work combines creativity with customer behavior data, which is both interesting and creative. We believe our work could serve as a good start for those who are interested in the same thing.","uni":"yj2737","language":"Python, HTML, CSS, TypeScript, Jupyter Notebook","pid":"202305-7","m4uni":"","analytics":"Scraping: requests_html, json, lexical_diversity, emoji, textstat, nltk, pandas

Machine Learning: sklearn, linear regression, decision tree, random forest, gradient boosting, hist gradient boosting, xgboost, histogram, heatmap, cross-validation, scatter plot, feature importance

Software: flask, angular, multiprocessing","m4lname":"","industry":"Finance","m3lname":"","dataset":"We had a raw data dataset and a formalized dataset used to train a machine learning model. Our raw data are all scraped from the Amazon product review page including 15 different products, and it's stored as a JSON file. Then, we also have a formalized dataset that is converted from raw data, including 8 features and 1 target feature (creativity). All of the data are scraped and processed mainly through NLP techniques. Our software mainly deals with the Amazon product review page, so it does not support well if other web pages do not contain the features we used to train the model, such as \"the number of people who upvotes a review\". However, it should work with most web shopping pages. ","m2uni":"yw3912","m2fname":"Yuyang","m3uni":""},{"projectname":"Prediction and Analysis Based on English Premier League","timestring":"Fri Dec 13 14:54:46 2019","m1uni":"yl4305","m2lname":"Huang","m1fname":"Yuan-Hsi ","m4fname":"","m1lname":"Lai","m3fname":"Qiaoyu ","description":"English Premier League, as the top level of English soccer, is the craziest sports league all over the world, with more than 4.6 billion audiences and broadcast in over 200 countries. Accompanied by its spectacular popularity, it meanwhile enjoys huge commercial value exploited and to be exploited.
Many websites currently report on the Premier League, but their functions, as on WhoScored, are mostly limited to data storage and queries. Our website, however, serves various groups related to the Premier League with more novel and practical functions.
Besides basic data displays, we add other modules including future match results prediction, player style analysis based on classification, player impact evaluation, etc.
Our platform, based on big data analysis methods and machine learning algorithms, proactively focuses on creating valuable information for people according to their respective demands. For football managers, we can offer transfer strategy: if Ed Woodward, Manchester's manager, wants to sign a winger whose style resembles Lingard's, we can offer him a list of potential targets, such as Gray from Leicester or Richarlison from Everton. If Tom, a Liverpool fan, plans to buy a lottery ticket for the match between his home team and Arsenal this weekend, our website's prediction system can give him a hand.
The goal of our website is to offer more reliable, more complete and more accurate data. We are committed to making our website a qualified information assistant for everyone interested in the Premier League.","uni":"qg2172","language":"Pyspark, Keras, Sklearn Algorithms: Random Forest, XGBoost, Multilayer Perceptron (MLP), LSTM, SVM, K-means Visualization: Matplotlib, D3js, Django","pid":"201912-23","m4uni":"","analytics":"We use pyspark and sklearn for pre-processing, including cleansing and standard scaling, and then use Random Forest, XGBoost, MLP, and SVM for classification. We also use K-means for player clustering and LSTM for team performance prediction. For the team performance score, we designed an algorithm using in-game stats. For visualization, we use Django as the backend and d3js to present our results on a webpage. We also plot results using Matplotlib.","m4lname":"","industry":"Information","m3lname":"Gu","dataset":"Our dataset is downloaded from Kaggle, containing all match stats from Season 14-15 to Season 17-18 of the English Premier League. Our platform can support other soccer game stats files that record in-game data, including player touches, passes, yellow cards, full-time scores, etc., and automatically execute prediction and analysis based on these stats.","m2uni":"jh4137","m2fname":"Jin","m3uni":"qg2172"},{"projectname":"IRIS RECOGNITION BY DEEP LEARNING","timestring":"Sat Dec 18 03:22:34 2021","m1uni":"rw2902","m2lname":"Zhao","m1fname":"Ruisi","m4fname":"","m1lname":"Wang","m3fname":" Zhongsheng","description":"The objective of our project is to explore the potential of deep learning for iris recognition. Our research focuses on the accuracy of deep learning on human eye images of different quality, the necessity of different pre-processing steps before training, and the ways to combine traditional methods with deep learning.
Innovations, Capabilities, and Importances: For security reasons, many applications or situations require personal identification to restrict access to certain resources. For example, to log onto an email account, a password is needed, and it is usually known only by the owner. However, one person can have multiple accounts for various platforms. Each platform can have different rules to set passwords. Except for passwords, we also have fingerprint recognition and facial recognition. However, under cold weather or when one’s hand is wet, fingerprint recognition can fail. For facial recognition, since covid-19, everyone wears a mask outside, so when they want to use a masked face to unlock their mobile devices, the current recognition algorithm cannot identify successfully. Compared with fingerprints and facial recognition, human iris recognition has a high potential to be a more reliable method. ","uni":"rw2902","language":"Python, Tensorflow, CoLab","pid":"202112-54","m4uni":"","analytics":"Deep learning models: ResNet50 and VGG","m4lname":"","industry":"Life Science","m3lname":"Chen","dataset":"IITD_Delhi, CASIA_V4_Interval, and MMU2
Downloaded from websites","m2uni":"xz2987","m2fname":" Xiaoshu","m3uni":"zc2583"},{"projectname":"What's in a Book Cover?","timestring":"Fri Dec 13 18:00:27 2019","m1uni":"my2570","m2lname":"Silva","m1fname":"Najim","m4fname":"","m1lname":"Yaqubie","m3fname":"Charles","description":"Is a picture worth a thousand words? Book covers are the first glimpse into the book and need to immediately convey a convincing message to induce readers. But how do you design a book cover and what should you prioritize? We use Amazon product and review data to explore book covers. Using nearest neighbor and convolutional neural network models, we aim to learn what is a good book cover and help design novel covers by generating suggestions.

Often, conventional cover design requires hiring expensive art designers over a prolonged undefined and non-standardized process. Designers often have their own style derived by a personal artistic system, all of which is designed to entice potential readers to open a new book. However, if the goal is to have more readers, or be a more successful book, why not explore whether there is a relationship between cover design and success directly?

We use image feature extraction and machine learning techniques to learn successful covers for books. By doing so, we discover certain elements of book covers are quite related to the eventual success of a book as defined by Amazon reviews and overall rating compared to similar covers. We can judge a book by its cover, and we can generate suggestions developed to enhance the success of the book. Importantly, we also show what successful books are similar to the proposed cover to give the user an idea of how to improve their cover stylistically.","uni":"my2570","language":"Python, HTML, JavaScript, Pandas, NumPy, Keras, SKLearn, Google Dataproc, Flask, Docker on Google Cloud Build, Kubernetes, Google Cloud Storage, Google BigQuery","pid":"201912-35","m4uni":"","analytics":"We analyzed each book review and consolidated ratings to an overall rating for all books in our data set. We then extract image features such as color vibrancy from each book cover and generate to a k-nearest neighbors model. We then trained a convolutional neural network on cover images to determine whether there is a learnable relationship between book cover and successfulness by overall rating. Given the strong relationship, we combined the k-nearest neighbor and CNN models to build a web app suggestion engine that, given a potential book cover, can give you an estimated rating and some recommendations to improve.","m4lname":"","industry":"Retail","m3lname":"Summers","dataset":"Mainly, we will rely on the Amazon Review Data with respect to books. This contains 51,311,621 reviews for 2,935,525 books containing product information, links to images, and metadata. Related to this data is a curated set of 207,572 cleaned book cover images of size 224x224 that may be easier to rely upon as some product images are more than just the cover. Both are available online for free. 
Our software can support more cover images of size 224x224 with metadata including title, ASIN id, reviews, and overall rating.","m2uni":"dcs2180","m2fname":"Daniel","m3uni":"cgs2161"},{"projectname":"Trending Topics Sentiment Analysis of Twitter","timestring":"Sat Dec 14 01:42:40 2019","m1uni":"sw3385","m2lname":"Lu","m1fname":"Shao-Fu","m4fname":"","m1lname":"Wu","m3fname":"Sin-Yi","description":"We aim to develop a realtime system that classify tweets into specific topics and analyze the polar sentiment lying with the text data. The system should then visualize the data points by its geolocation on a world map, and show the data for selected cities.
The sentiment is shown as a bar: greener indicates more positive sentiment, and red indicates negative sentiment.
The website updates its tweets and sentiments every five seconds.

Besides the map, there is a graph that shows how similar the trending topics are between cities. If two cities share more than four topics, we draw an edge between the two cities (nodes). More edges in the graph indicate greater topic similarity at that time. ","uni":"sw3385","language":"python, javascript, node.js, react.js, Firebase, BigQuery, Google App Engine, Pyspark, Google Dataproc","pid":"201912-14","m4uni":"","analytics":"TF-IDF, LDA, Linear Regression, React App, Django, Flask, SVG","m4lname":"","industry":"Information","m3lname":"Huang","dataset":"We get our historical tweets from the following website.

https://archive.org/details/twitterstream?sort=-publicdate
","m2uni":"jl5255","m2fname":"Jing-Wei","m3uni":"sh3907"},{"projectname":"Explainable Credit Default Prediction using AMEX Dataset","timestring":"Sat Dec 24 05:28:58 2022","m1uni":"yb2540","m2lname":"Patange","m1fname":"Yatharth","m4fname":"","m1lname":"Bansal","m3fname":"Miheer","description":"Credit card default prediction is an imperative component of the credit card industry to manage and optimize lending decisions. With the rapid increase of the credit card industry, it has become essential to deal with the increasing delinquency rates to prevent fraud and financial loss to the industry. In this work, we aim to solve this crucial problem by exploring the realms of big data and machine learning to predict the default behavior of a customer by developing an explainable credit card default prediction model. We build a cloud model utilizing Apache Spark engines to achieve efficient and optimized performance to run our model on the high-volume AMEX dataset. To serve the industry standards, this work incorporates the issue of explainability for such models by applying game theory to our prediction model.
We incorporate various machine learning algorithms like Logistic Regression, Linear SVM, Decision Tree, Random Forest, and Gradient Boosted Trees to compare the performance of each algorithm using the F1 score evaluation metric.
We showcase the significant performance of our model and explainability of our model with the help of Logistic Regression and game theory techniques like shapley.","uni":"yb2540","language":"Python, Jupyter Notebook, PySpark, Dataproc, Google Cloud Storage, Google Cloud Platform, Streamlit","pid":"202212-30","m4uni":"","analytics":"Logistic Regression, Decision Tree, Random Forest, Support Vector Classifier, Gradient Boosting Trees, Shapley Values, Spark, Streamlit, Scikit-learn, F1-Score, charts, Correlation Plots, Heatmaps, Beeswarm Plots, Waterfall plots, Feature Distribution Plots.","m4lname":"","industry":"Finance","m3lname":"Prakash","dataset":"In our work, we aim to provide a relevant and practical model suitable for the financial industry, so for these purposes, we approach building the model incorporating the Amex dataset. As we know that American Express (or Amex) is a multinational corporation specializing in payment card services and in 2016, credit cards using the Amex network accounted for 22.9 of the total dollar volume of credit card transactions in the United States. We observed that recently Amex came up with a default prediction competition on Kaggle and we took benefit of this opportunity to build a robust model based on such high volumes of industry data. So, the dataset being used in this work is the one given by Amex for that competition.

We chose this dataset because Amex provided an enormous dataset, which let us incorporate the volume property of big data and leverage machine learning algorithms for training models.
The aim of this dataset is to predict the probability that a customer will pay the amount back in the future, based on a profile that is analyzed on a monthly basis.
The dataset includes time series behavioral data and anonymized customer profile information in order to maintain customer privacy. In the given dataset, a customer is considered to be a defaulter if the customer is not able to pay for the due amount in the given time period, i.e., 120 days after their latest card statement. And in addition to this, an eighteen-month performance was analyzed after the customer's last statement to define the target variable, i.e., customer default behavior.","m2uni":"aap2239","m2fname":"Aishwarya ","m3uni":"mp3939"},{"projectname":"Pocket-Conditioned Diffusion for EGFR T790M Inhibitor Design: Selectivity Evaluation and In Silico SAR Optimization","timestring":"Wed May 13 03:29:36 2026","m1uni":"yy3645","m2lname":"","m1fname":"Yixuan","m4fname":"","m1lname":"Ye","m3fname":"","description":"This project investigates whether pocket-conditioned 3D diffusion models can generate T790M-selective EGFR inhibitors by distinguishing two binding pockets that differ by a single residue (T790M gatekeeper mutation). The pipeline integrates de novo molecular generation, cross-docking selectivity evaluation, three mechanistic ablation experiments, and in silico SAR optimization to identify lead candidates. The central innovation is a systematic empirical characterization of mutation-pair selectivity limitations in TargetDiff — the first such evaluation in a clinically relevant resistance context — alongside a scaffold-dependent SAR response rule (NH-hinge + aminopyrimidine scaffolds achieve 54–75% improvement rates under electron-withdrawing substitution) that compensates for the model's training objective gap. 
This work is important because EGFR T790M resistance remains a leading cause of treatment failure in non-small cell lung cancer, and understanding the selectivity boundaries of generative models is critical for guiding next-generation training objective design and practical inhibitor discovery pipelines.","uni":"yy3645","language":"Python 3.10; Google Colab; RDKit (cheminformatics), AutoDock Vina (molecular docking), OpenBabel (format conversion), PyTorch + PyTorch Geometric (TargetDiff inference), Biopython (PDB parsing), pandas / NumPy / SciPy (data processing), scikit-learn / umap-learn (dimensionality reduction), matplotlib / seaborn / py3Dmol (visualization)","pid":"202605-5","m4uni":"","analytics":"The pipeline begins with pocket-conditioned 3D molecular generation using TargetDiff, an SE(3)-equivariant diffusion model pretrained on 22 million protein–ligand docked poses from CrossDocked2020. Generation runs as a 1,000-step reverse diffusion process, producing ten batches of 100 molecules per target pocket with distinct random seeds to maximize diversity. All generated molecules are then evaluated against a six-criterion drug-likeness filter covering QED, synthetic accessibility, Lipinski Rule of Five, PAINS alerts, molecular weight, and LogP, retaining approximately 26–28% of molecules as drug-like candidates.

Cross-docking selectivity evaluation docks the top candidates against both EGFR WT and T790M receptors using AutoDock Vina. T790M selectivity is quantified as Δ = Vina_Mut − Vina_WT, and between-pool distributions are compared using Kolmogorov–Smirnov and Mann–Whitney U tests. Three ablation experiments systematically rule out competing explanations for the observed null selectivity result: pocket radius sensitivity (Experiment A), Pearson correlation between structural similarity to known selective drugs and docking Δ (Experiment B), and SMARTS-based pharmacophore screening across five established EGFR features (Experiment C).

Candidate scaffolds for SAR optimization are selected from the top-20 molecules ranked by a weighted composite score combining normalized T790M docking score, QED, synthetic accessibility, and a Tanimoto-based novelty penalty. For each scaffold, aromatic C–H positions are enumerated via SMARTS matching and eight substituents spanning electron-withdrawing, electron-donating, and polar character are systematically tested using RDKit atom editing, yielding approximately 80 analogs that are cross-docked under the same Vina protocol.

Visualization components include cross-docking selectivity scatter plots, Vina score distribution histograms, substituent effect bar charts, parent-versus-best-analog comparison plots, 2D structure grids, and two interactive browser-based HTML demos showing top-10 candidate report cards and a dynamic cross-docking scatter plot.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Two experimental crystal structures were retrieved from the RCSB Protein Data Bank (public): 1M17 (EGFR wild-type kinase domain co-crystallized with Erlotinib, 2.60 Å resolution) and 4I22 (EGFR T790M/L858R double mutant co-crystallized with WZ-4002, 2.80 Å resolution). Binding pockets were extracted computationally (10 Å radius sphere around the co-crystal ligand centroid). Three FDA-approved reference drugs — Erlotinib (PubChem CID 176870), Gefitinib (CID 123631), and Osimertinib (CID 71496458) — were used as baseline controls, with SMILES validated against PubChem InChIKeys. The generative model (TargetDiff) was applied to produce 1,987 unique molecules conditioned on the two pocket structures. The pipeline can support any protein target for which a PDB crystal structure with a co-crystal ligand is available; pocket extraction radius and docking box parameters are fully configurable via configs/targets.yaml.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Outcome Prediction and Consciousness Detection in Patients With Acute Traumatic Brain Injury","timestring":"Sat May 11 02:28:43 2024","m1uni":"cy2644","m2lname":"Tracy","m1fname":"Calvin","m4fname":"","m1lname":"Yu","m3fname":"","description":"Patients with acute traumatic brain injury (TBI) in an unconscious state will have varying degrees of long-term outcomes that are hard to prognosticate. If patient outcomes can be projected, better patient centered care and utilization of resources can be administered more effectively and efficiently.
Historically, patient outcome predictions by neurologists are inaccurate at best.

The overarching goal is to combine the analysis of advanced diagnostic techniques such as electroencephalogram (EEG), Functional Magnetic Resonance Imaging (fMRI) analysis and other contextual factors for consciousness detection and possible patient outcome prediction - see Glasgow Outcome Scale Extended (GOSE). Patient prognosis is an important consideration when it comes to treatment.

The scope of work for Spring 2024 is as follows: 1) System integration of previous semesters’ work. 2) Update the 3-D CNN model to perform GOS-E classification instead of a binary conscious or unconscious classification. 3) Streamline the end-to-end processing pipeline. 4) Improve data preprocessing. 5) EEG model improvements and exploration of transformer based models. 5) MRI imaging model improvements and exploration of transformer based models (i.e. Vision Transformer). 6) Submit work for publication.","uni":"cy2644","language":"Python, Matlab, PyTorch","pid":"202405-14","m4uni":"","analytics":"Support vector machines (SVM), convolutional neural networks (CNNs, i.e. ResNet50), Transformers (3D Vision Transformer)","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset consists of 117 patients who have suffered from traumatic brain injury. The data was collected from a medical group in Taipei and consists of EEG and MRI data for each patient.","m2uni":"gpt2108","m2fname":"Gregory","m3uni":""},{"projectname":"Philosophizing via Unsupervised Neural Text Style Transferring","timestring":"Fri May 6 17:10:36 2022","m1uni":"bl2899","m2lname":"","m1fname":"Bin","m4fname":"","m1lname":"Li","m3fname":"","description":"Writing is a serious endeavor. It takes years of training for one to become fully proficient. When humans employ writing as a vehicle for ideas, the style of writing evolves accordingly, becoming more specialized, and arguably more \"complex.\" Some styles have evolved to be so convoluted that many argue they have become the frills of language evolution. For example, Martha Nussbaum, a philosophy professor at the University of Chicago, wrote a lengthy article panning her fellow scholar Judith Butler for bad writing, whose works, Nussbaum argues, are nothing more than sophistry and should be precluded from genuine philosophical discussion [Nussbaum, 1999].

\"Academese,\" language that is filled with unnecessary jargons and turgid verses commonly associated with some fields of humanities, has drawn criticism aplenty. A common response to the criticism from within the academy is that the specialized lingo is a necessity driven by the complexity and abstractness of subject matters. However, Pinker [2014] argues that the roots of academese are in the mix of academics’ goal to share their knowledge with the readers and their fear of \"being convicted of philosophical naïveté about his own enterprise.\"[Thomas and Turner, 2017]

Is the convoluted mannerism of academic writing an inseparable blend of style and content or some fancy sprinkles on the cake of knowledge? To answer this question, we elect to focus our effort on building a machine learning model to automatically generate clear rephrasing and/or explanation of obscure academic writings.","uni":"bl2899","language":"Python, Pytorch, Huggingface Transformers, Jupyter Notebook, Google Cloud Platform","pid":"202205-6","m4uni":"","analytics":"A variant of the Style Transfer via Paraphrasing (STRAP) model (Krishna, Wieting, and Iyyer, 2020) is replicated in this project to support the text style transferring of philosophical text. A multi-task learning framework is also created based on the model for the purpose of experimenting with concurrent training of language modeling task and sequence classification task for fine-tuning generative language models.","m4lname":"","industry":"Information","m3lname":"","dataset":"We use an in-house dataset of philosophical text in encyclopedias and anthology and literary theory text from authors of a specific school of theory in the field of humanities. The text was extracted from ebook sources and then cleaned and normalized.

The model also supports datasets used in the work by Krishna, Wieting, and Iyyer (2020). The instruction for how to access these datasets can be found in their project GitHub repository.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Alzheimer's Disease EHR Prediction","timestring":"Thu Dec 19 18:30:20 2024","m1uni":"fc2795","m2lname":"Humayun","m1fname":"Kuo","m4fname":"","m1lname":"Gong","m3fname":"FeiYang","description":"New Idea: investigates the application of Multi Agent LLMs to preprocesses natural language and detect Alzheimer’s Disease.

Challenge on Data: The dataset is very large and hard to clean or analyze: 20 GB (over 210,000 rows), full of vague descriptions, medical terminology, and foreign languages.

Method: Build classification models that predict the discrete CDR (Clinical Dementia Rating) label from four columns of natural-language data (SOAP).

System: Build a front-end website where users without a CS background can easily upload a large CSV file and be notified of the Alzheimer’s disease prediction along with a high-risk keyword graph.

Our research can help doctor to improve the accuracy of prediction.","uni":"kg3175","language":"Python, Spark, AirFlow, Google Cloud","pid":"202412-1","m4uni":"","analytics":"Upload the Raw Data Into Google Cloud Bucket
Fetch the raw data to the local machine
Use Spark to remove duplicate data and handle missing values
Use an LLM agent to understand the natural language in each column
Write the prompt for the translation agent, which translates Chinese to English
After cleaning the data, we send the clean data back to the Google Cloud bucket
We also use the matplotlib, and word cloud for data visualization ","m4lname":"","industry":"Information","m3lname":"Chen","dataset":"
Data Type: Semi-structured data containing natural language and numbers. Each row also contains medical terminology and foreign languages.

SOAP (Subjective-Objective-Assessment-Plan) Medical data

Data Source: 10 years of private data collected by Taiwan Hospital Alzheimer’s Disease Department

Data Size: 20 GB Big Electrical Health Data","m2uni":"hhs2128","m2fname":"Syed","m3uni":"fc2795"},{"projectname":"Automed Dynamic Asset Allocation Recommender System.","timestring":"Mon May 6 18:29:20 2024","m1uni":"dn2572","m2lname":"Liao","m1fname":"Dhvanil","m4fname":"","m1lname":"Nanshah","m3fname":"","description":"The objective of this initiative is to develop a no-cost tool designed to cater specifically to a broad demographic, including but not limited to everyday households and individuals which may be averse to financial investments due to a lack of experience or knowledge.

This ADAA tool is engineered to assist the every-day individual in reaping the benefits that come from robust portfolios stemming from portfolio diversification. This system concentrates on asset allocation rather than the selection of specific securities. Research indicates a significant shortfall in portfolio diversification among most Americans, with individuals under 40 typically holding between one and two types of assets, and those over 40 holding between two and three. This gap often arises from insufficient domain knowledge coupled with either an unwillingness or an inability to afford financial advisors or investment tools.

Our tool seeks to close this divide, offering all users a foundational level of confidence and security as they venture into the realms of financial investment. Each asset class included in our ADAA recommender system is selected with the everyday user in mind. For instance, asset classes such as Bond or Real Estate ETFs were chosen for their liquidity (reflected by average daily trading volume) and accessibility (real estate trusts emulate real estate market dynamics without the substantial capital required for direct property investments).

Moreover, this tool enriches the landscape of transparent, open-source financial tools, providing a robust platform that anyone can utilize and expand upon.","uni":"dn2572","language":"Python is used for this project primarily. Streamlit was used for the front-end and flask for the back-end. Sqlite was used for the database.","pid":"202405-6","m4uni":"","analytics":"Numerous algorithms, analytics, system modules and visualizations were implemented. Data processing included preliminary exploration and preprocessing which entailed data imputation, deletion and frequency adjustment. Furthermore, PCA and TSNE were used for dimensionality reduction as well as min-max and standard scaler for data scaling. Scree plots were utilized to tune the PCA variance capture degree by means of using the elbow heuristic. Loadings plots were further used in our analyses of indicator contribution for regime prediction. Silhouette distance plots were utilized to visualize the degree of separations of clusters for different k/c values in k-means clustering and fuzzy c means clustering. Hierarchical clustering was also utilized to decompose our indicators into sub-clusters for manual analysis via literature review. This hierarchical cluster was visualized as a heat map using SeaBorn. Additional plotting was done in our code using Matplotlib.pyplot module. Numerous different models were explored for the individual asset class forecasts. These include Arima, Sarima, SariMAX, ETS Models, and LSTMs. These models were paired with different indicators such as Moving Average Convergence Divergence, Bollinger Bands, Moving Averages, Exponential Moving Averages, Relative Strength Indexes, etc. Validation loss was plotted against epochs to help visualize when overfitting occurs and additional dropout layers were added as needed. Random search, grid search, hyperband tuning and Bayesian optimization were performed for hyper parameter tuning of models. 
A PPO Reinforcement Learning model was used as our final model that forecasts were fed-into to output asset class ratios. Tensorflow, PyTorch, Keras packages were at the heart of all ML work done.","m4lname":"","industry":"Finance","m3lname":"","dataset":"Multiple datasets were tested and utilized.

For the regime prediction, 49 distinct sets of time series indicator data were taken from the Federal Reserve Bank of St. Louis using the quandl Python API. Furthermore, we experimented with time series data from the World Bank and IMF.

For the AGG Bond ETF and the XLRE Real Estate Trust ETF, data was taken from TradingView using a trial subscription.

For the S&P500 Index, data was taken from Yahoo Finance using the Python 3 yfinance API.

Bitcoin was used as a proxy for the crypto space (justification/evidence provided in technical paper + slides).
This data was taken from multiple sources as data pre-2014 is not available from sources like TradingView and YFinance. Data was combined from Yahoo Finance, TradingView, and MarketWatch manually.

Data for gold prices was also downloaded from TradingView.

All data files can be found in the GitHub except for the S&P500 and regime prediction indicators which are downloaded using python code found in the .ipynb files.","m2uni":"kl3545","m2fname":"Ken","m3uni":""},{"projectname":"Analysis of Correlation between Movie Score and Public Review","timestring":"Sat Dec 18 00:00:12 2021","m1uni":"cw3355","m2lname":"Zang","m1fname":"Chenhao","m4fname":"","m1lname":"Wang","m3fname":"Shengbo","description":"Our project can predict movie ratings by text reviews, which is more objective than traditional movie rating websites. Because most audience doesn’t have a habit to rate after watching unless they extremely like or dislike the movie. So the average ratings come from those professional reviewers or movie fans who can’t represent most. Also, everyone has a different rating scale, e.g. a 5/10 rate may be good for someone but bad for others.
We trained on 11478 reviews and other data (vote, popularity, budget, revenue) from 501 movies from 2010 to 2022. For about 80% of movies, our model predicts a rating similar to the original rating, which is close to what we expected. We analyzed the error distribution and cluster percentage deviance, then presented our results as graphs such as word clouds and as a website with a movie search function.
We also ran a Twitter stream experiment to collect reviews but chose IMDB and TMDB reviews as the final prediction data because the latter showed better short-term efficiency.
In principle, as new movies are released, our system can continuously collect and analyze reviews to update our model and movie database.
","uni":"cw3355","language":"Python and Linux(VM on GCP)","pid":"202112-9","m4uni":"","analytics":"We use NLTK SIA Sentiment Analysis and LDA(spark) to vectorize and preprocess our data. Then train with algorithms of Linear Regression, Random Forest Regressor(Spark). After training, we tested about 100 movies, 80% of predicted ratings show results close to those websites. WordCloud and Matplotlib are implemented to analyze the result of LDA topic words and ratings. The Web design tools like HTML, CSS, JS, Jquery are used to do the final visualization.","m4lname":"","industry":"Media","m3lname":"Chen","dataset":"Our system uses labeled data(e.g. vote average and count, popularity, budget, revenue) and 11478 text reviews of 501 movies from the public API of TMDB and IMDB. And the test set is 98 movies. With the preprocess of Sentiment Analysis and LDA in our system, any text reviews like movie-related tweets, or comments on youtube can be supported to predict and update our model.

","m2uni":"cz2678","m2fname":"Chengbo","m3uni":"sc4918"},{"projectname":"Stock Performance Predictions based on News Analytics","timestring":"Sat Dec 22 16:03:10 2018","m1uni":"fg2432","m2lname":"Gatlin","m1fname":"Fan","m4fname":"","m1lname":"Gao","m3fname":"Ruisi","description":"Historically, news and stock performance are highly correlated. In this analysis, we worked with a large uncensored dataset in an attempt to uncover the predictive power of market and news analytics. One challenge is that we had to plow through the data carefully, weed out data errors and impute missing values. Another challenge is the computational burden as the calculation wouldn't easily fit into the memory of a VM with 32GBs of memory. We had to carefully slice the data, recycle variables and collect unused space to avoid memory errors. For a memory hungry model like LSTM, both training and inference had to be done in batches.

The results are encouraging as they show great improvements over the benchmark. Features from market data such as previous returns and prices have more significance in the models, contributing almost two thirds of the predictive power of the model. Features from news data further enhance the performance by about 35%. As for model selections, gradient boosted trees show better performance than neural network. This could be due to overfitting or suboptimal look back window in the LSTM model. Tuning the LSTM model is beyond the scope of this project. Rather than picking the best model, we think data preprocessing and feature engineering play a bigger role in determining the final performance. Some extreme outliers could have compromised the out-of-sample performance of the models if they hadn't been treated with caution. We conclude that news data indeed add value to stock performance predictions and in conjunction with market data, it shows great improvement over the benchmark.","uni":"fg2432","language":"Python, Javascript, HTML, CSS, Jupyter Notebook, Google Cloud.","pid":"201812-23","m4uni":"","analytics":"Feature extraction, model selection, data preprocessing: scikit-learn, pandas, NLTK.
Predictive models: DMTK LightGBM, CatBoost, Keras LSTM.
Visualization: Node.js, d3.js, pyplot, Matplotlib, seaborn. ","m4lname":"","industry":"Finance","m3lname":"Wang","dataset":"The data includes a subset of US-listed instruments. The set of included instruments changes daily and is determined based on the amount traded and the availability of information. This means that there may be instruments that enter and leave this subset of data. There may therefore be gaps in the data provided, and this does not necessarily imply that that data does not exist (those rows are likely not included due to the selection criteria). The market data contains a variety of returns calculated over different timespans. All of the fields in Market Data are shown in the report. The data covers period between 2007 to 2018. In total, there are 4,072,956 samples in the training data.

The news data contains information at both the news article level and asset level (in other words, the table is intentionally not normalized). All of the fields in News Data are shown in the report. The data covers period between 2007 and 2018. In total, there are 9,328,750 samples in the training data.","m2uni":"ceg2195","m2fname":"Connor","m3uni":"rw2720"},{"projectname":"Model Homophily and Heterophily of Graph Data Using An Improved Graph Neural Network","timestring":"Fri Dec 15 21:22:16 2023","m1uni":"xs2482","m2lname":"Dai","m1fname":"Xiaozhou","m4fname":"","m1lname":"Shi","m3fname":"Yuanhao","description":"Aim to address the limitations of current GNN models in handling mixed homophily and heterophily in graph data.
Focus on improving graph representation learning in diverse real-world scenarios.","uni":"ys3609","language":"Python, Typescript+React, github","pid":"202312-4","m4uni":"","analytics":"The learning process of the kernel selection gate receives input graph information and outputs selective signal α to discriminate if the neighbor node labels are consistent. The bi-kernel feature transformation trains Ws and Wd, namely weights to capture the similarity between nodes and weights to capture the dissimilarity between nodes. It uses the signal α from the former module to combine these two W in the process of message passing, then doing mean aggregation, and finally producing node embedding. In the training phase, we have an additional cross-entropy loss to train the selection gate with supervision.
","m4lname":"","industry":"Information","m3lname":"Shen","dataset":"The citation network datasets \"Cora\", \"CiteSeer\" and \"PubMed\" from the “Revisiting Semi-Supervised Learning with Graph Embeddings” paper. Nodes represent documents and edges represent citation links. Training, validation and test splits are given by binary masks.

For the Texas, Cornell, and Wisconsin datasets, nodes represent web pages and edges represent hyperlinks between them. Node features are the bag-of-words representations of the web pages. The task is to classify each node into one of five categories: student, project, course, staff, and faculty.

All datasets mentioned above are from PyTorch Geometric.

","m2uni":"yd2674","m2fname":"Yiting","m3uni":"ys3609"},{"projectname":"AD Targeting for Apple Products on Twitter","timestring":"Sun Dec 23 04:00:25 2018","m1uni":"yh2866","m2lname":"Wang","m1fname":"Yuanqing","m4fname":"","m1lname":"Hong","m3fname":"Fangbing","description":"Ad targeting is a form of advertising where online advertisers use sophisticated methods to target the most receptive audiences with certain traits, based on the product or the advertiser is promoting. It is so popular that it helps a company cut the marketing cost and increase profits by reducing wasted advertising, attract new customers, and increase repurchase rate. Twitter provides all of the tools business owners need to put highly tailored ads in front of the people most likely to click on them. In this project, we will discuss ways to target audience for Apple Inc. more effectively with Twitter ads.","uni":"yh2866","language":"Python, Javascript, HTML/CSS, Google Cloud Platform, MongoDB Atlas, Flask, D3.js, Twitter APIs","pid":"201812-20","m4uni":"","analytics":"1. User Profile-Users profile gives us an overview of the type of users tweeting about Apple products.
2. Latent Dirichlet Allocation- We applied an unsupervised machine learning technique called Latent Dirichlet Allocation (LDA) to extract the main topics from these tweets.
3. Sentiment Analysis- We analyzed what Twitter users felt positively or negatively about in the products, so that ads can play up the positive aspects and evoke pleasant feelings in customers.
4. Influence Analysis-we analyzed user influence, in terms of Page Rank and Indegree Centrality, to find the influencers in the network as the ad promotors.","m4lname":"","industry":"Media","m3lname":"Liu","dataset":"We fetched more than 240K tweets related to Apple products (AirPods, iPhone, iPad, and Apple Watch) through Twitter Streaming API and saved in MongoDB (Data Size: 1.43 GB).","m2uni":"jw3592","m2fname":"Jingyi","m3uni":"fl2476"},{"projectname":"Online Training of Large-Scale Sentiment Analysis with Deep Learning","timestring":"Thu Dec 23 04:41:46 2021","m1uni":"dl3447","m2lname":"","m1fname":"Danni","m4fname":"","m1lname":"Lin","m3fname":"","description":"This project focuses on the problem of online training for sentiment analysis tasks. Typically, a deep learning approach from the laboratory requires a full training set and a full validation set for training the model on the training set and validating the model on the validation set. However, a sentiment analysis system from the real world often deals with cases that are more complicated. For an online system, it gets real-time data from the online system and then must make predictions in a limited time. It cannot fix its prediction by running on the training set many times. To achieve high accuracy, an online training system must be powerful enough to make robust predictions while training. In this project, I propose several learning strategies for such an online training sentiment analysis system and compare their accuracy by experiments. The experiment results show that it is best for an online learning model to apply to look backward and dropping easy sample strategy, that is, to use data samples seen before at each step, and to drop correctly predicted samples with high confidence. The goal of this project is to provide some practical learning strategies for online training a sentiment analysis model.

","uni":"dl3447","language":"Python","pid":"202112-59","m4uni":"","analytics":"In this project, I implemented four practical learning strategies to control the data flow for training a sentiment analysis model. Specifically, a naive strategy, a decaying learning rate strategy, a looking backward strategy, and a dropping easy sample strategy. For each strategy proposed, I train an RNN model for sentiment analysis tasks from scratch, and visualize the training process of each strategy, in terms of recent and overall accuracy and f1-score, to show its effectiveness of it. The influence of hyper-parameter for the dropping easy sample strategy is also visualized in terms of f1-score and accuracy. By using these metrics, all the four strategies are compared fairly, and thus the conclusion that dropping the easy sample strategy works the best could be drawn.

","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset used in this project is the IMDB dataset, which contains user reviews on movies recorded on the IMDB platform. The whole dataset consists of 84,919 pieces of data. All the data are used in this project. Each piece of data is a paragraph of user review on the movie, and a star rating from 1 to 10, representing the user’s preference. Besides, a user id and a movie id are also attached to each review. User and movie id is not used in this project. The dataset is public. It can be downloaded at https://www.imdb.com/interfaces/.

","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Investment Strategy — AI Trader (US)","timestring":"Fri Apr 23 19:46:48 2021","m1uni":"mpd2155","m2lname":"Roy","m1fname":"Meet","m4fname":"","m1lname":"Desai","m3fname":"Richard","description":"Over the past several years, there has been a surge in the use of algorithmic trading in the stock markets. Various studies have shown that hedge funds employing algorithmic trading techniques have earned more compared to traditional hedge funds. These types of studies highlight the potential of AI-powered systems in the stock trading world. Conventionally, human traders have been devising strategies to execute trades and maximize profits. However, they do possess certain flaws to perform well consistently in the stock market. For instance, it is difficult for humans to infer from data and market factors from thousands of stocks without using computational resources. Also, human traders may be vulnerable to their sentiments and emotions like hope, greed, and fear and may end up taking the wrong trading decisions. AI Traders on the other hand can not only process the humongous amounts of market data but are also unaffected by emotions. These AI systems can learn historical data, make stock price predictions and even execute trades when deemed necessary while simultaneously minimizing risk and optimizing profits.
","uni":"mpd2155","language":"Python(numpy, scipy, matplotlib, streamlit, pytorch, Tensorflow), Jupyter notebook, Google colab","pid":"202105-15","m4uni":"","analytics":"Primarily, we have used deep learning in order to analyze big data. We tried many different models like LSTM (Long Short-Term memory), Transformers, and policy gradient method along with moving averages to optimize trades. Besides, we have used common Python libraries like pandas and Streamlit for visualization of data and trade suggestions.","m4lname":"","industry":"Finance","m3lname":"Samoilenko","dataset":"We use free stock price data from the Yahoo Finance API. This API provides market data with parameters like Open, High, Low, Close, and Volume for thousands of stocks. We have used intraday data and experimented with hourly, and data by the minute.
","m2uni":"sr3767","m2fname":"Shambhavi ","m3uni":"rs4094"},{"projectname":"Analysis and Prediction of YouTube Data","timestring":"Sat Dec 23 03:03:47 2023","m1uni":"yg2905","m2lname":"Wu","m1fname":"Yutong","m4fname":"","m1lname":"Gao","m3fname":"Siyu ","description":"Our project aims to analyze and predict YouTube data by Apache Airflow and Google Cloud Platform (GCP) to assist video creators in evaluating video descriptions and platforms adjusting exposure. We utilized regression modeling, sentiment analysis, cluster analysis, and translation to forecast future trends in YouTube videos. The findings help video creators understand the potential popularity of their videos and provide comprehensive insights into global viewer interests. Our system also offers real-time data monitoring and prediction capabilities through user-friendly interactive web pages and continuous data updates.
","uni":"yg2905","language":"Python (Pandas, Sklearn, PySpark) , Google Cloud Platform (GCP), Apache Airflow, HTML, Flask","pid":"202312-6","m4uni":"","analytics":"EDA: Feature distributions, correlations matrix, sentiment proportion, wordcloud
Cluster: Classify video tags using K-means
Sentiment: Analyze the titles and descriptions of videos
Translation: Translate non-English text
Regression: Train models using Linear Regression, Ridge and Lasso regression, Decision Trees, Random Forest, select and save the best one. Predict the video likes based on given inputs
Airflow: Get real-time data via the YouTube API, visualize the top 10 most-liked videos and future likes
","m4lname":"","industry":"Media","m3lname":"Li","dataset":"1. Trending YouTube Video Statistics
This dataset is from Kaggle, which includes several months of data on daily trending YouTube videos, encompassing millions of data volumes. Data is included for the US, GB, DE, CA, FR, RU, MX, KR, JP, and IN regions (USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India, respectively), with up to 200 listed trending videos per day. [5] It includes CSV and JSON files.
2. Real-time YouTube Video data
This dataset includes real-time trending YouTube video data for eleven countries, collected through the YouTube API.
","m2uni":"nw2533","m2fname":"Ningyi ","m3uni":"sl5394"},{"projectname":"SteamAI: Game Recommendation and Review Agent","timestring":"Fri Dec 19 22:17:53 2025","m1uni":"hd2593","m2lname":"Chen","m1fname":"Thomas","m4fname":"","m1lname":"Duan","m3fname":"","description":"Objectives: SteamAI aims to address two critical challenges in Steam's massive gaming ecosystem: discoverability and interpretability. With over 239,000 applications on the platform, users struggle to find games that match their preferences without extensive manual searching and filtering. Additionally, understanding whether a game is worth trying requires reading through hundreds of reviews, creating a significant barrier to informed decision-making. SteamAI seeks to transform game discovery by enabling semantic, multi-modal exploration that goes beyond simple keyword matching, while providing concise, AI-generated review summaries that capture community sentiment at a glance.

Innovations: The system introduces several technical innovations to achieve these goals. First, it leverages the BGE-M3 embedding model to enable content-based similarity search across 1024-dimensional semantic space, allowing for nuanced game recommendations based on multi-modal data including structured metadata, review text, and learned features. Second, it implements an AI-agent pipeline that uses GPT-4o-mini to extract and clean user requirements from natural language prompts, bridging the gap between casual user input and the precise embeddings needed for database matching. Third, the system employs an Airflow-orchestrated incremental review crawling and summarization pipeline that keeps content fresh by periodically fetching new reviews via the Steam API and regenerating LLM-based summaries, ensuring recommendations reflect current community feedback rather than stale data. Finally, SteamAI constructs interactive force-directed similarity graphs that enable multi-hop exploration, allowing users to discover related games through visual neighborhood navigation rather than accepting a single static ranked list.

Why This Matters: As digital marketplaces grow exponentially, traditional search-and-filter approaches become increasingly inadequate—users need intelligent systems that understand intent, learn from multi-modal signals, and present information in explorable, digestible formats. SteamAI demonstrates how combining embedding-based retrieval, LLM-powered natural language understanding, automated summarization, and interactive visualization can create a more intuitive and efficient discovery experience. Beyond gaming, these techniques are applicable to any domain facing similar scale and interpretability challenges, from streaming media recommendations to scientific literature discovery to e-commerce product search. By showing how to maintain fresh, semantically-rich recommendations at scale while keeping review summaries current and representative, SteamAI provides a blueprint for building next-generation recommendation systems that balance computational efficiency with user experience quality.","uni":"hd2593","language":"Python on local machines","pid":"202512-30","m4uni":"","analytics":"SteamAI implements a multi-layered analytics architecture combining embedding-based retrieval, natural language processing, and interactive visualization. At its core, the system uses the BGE-M3 encoder to generate 1024-dimensional embeddings for all 239K games in the dataset, enabling content-based similarity search through K-Nearest Neighbors (KNN) algorithms that efficiently retrieve the top-10 most similar games based on cosine similarity. The recommendation pipeline incorporates fuzzy string matching with normalized name search and LM-generated alias fallback to handle user queries robustly, followed by optional structured filtering on metadata (genres, release dates, ratings, price) and multi-modal re-ranking that combines embedding similarity with structured features.
For natural language interaction, the AI-agent module employs GPT-4o-mini for requirement extraction and prompt cleaning, transforming free-form user descriptions into structured game descriptions that align with the dataset's embedding space. Review analytics leverage LLM-based summarization to generate concise pros/cons summaries from aggregated user reviews, with summaries kept current through an Airflow pipeline that incrementally crawls new reviews via the Steam API and triggers periodic regeneration.
The system includes two primary visualization modules: force-directed graphs that display similarity relationships as interactive network diagrams enabling multi-hop exploration of game neighborhoods, and rating history line charts that plot temporal trends in user sentiment. The backend Spark pipeline handles large-scale ETL operations including schema normalization, feature extraction, review aggregation, and local LM-based alias generation for popular titles, with all processed data stored in columnar Parquet format for efficient in-memory filtering and sub-second query response times.","m4lname":"","industry":"Information","m3lname":"","dataset":"The project utilizes the Steam Dataset 2025, a large-scale multi-modal dataset on Kaggle that specifically designed for game analytics and content discovery.","m2uni":"nc3221","m2fname":"Nuo","m3uni":""},{"projectname":"Toxic Comment Detection","timestring":"Fri Dec 21 18:56:55 2018","m1uni":"cs3736","m2lname":"Zhou","m1fname":"Canjie","m4fname":"","m1lname":"Shi","m3fname":"Weibo","description":"With the development of Internet service, the life of human being has changed a lot from the past. People bring their ofﬂine behavior to the virtual world, including sending toxic words. This misuse of language may lead to some serious social issues. Online Violence and Internet Harassment has made different people suffer from mental disorder. Opposed under insults, Some people commit suicide. Also, children may be induce to crime by obscene propaganda. Thus, it is important to identify the occurrence of toxic words, such that further action can be made to prevent the bad things. Here we use several different machine learning models to predict toxic comments on social network and localize where this happens. 
The local police can use this tool to warn the comment sender of their behavior.","uni":"cs3736","language":"Python, JavaScript, GCP","pid":"201812-26","m4uni":"","analytics":"Used Multinomial Naive Bayes and an LSTM network to train the models. Used React for visualization and Flask for static web hosting.","m4lname":"","industry":"Media","m3lname":"Zhang","dataset":"","m2uni":"yz3395","m2fname":"Yanzhao","m3uni":"wz2353"},{"projectname":"Twitch Stream Analysis and Visualization","timestring":"Sat Dec 18 04:36:31 2021","m1uni":"yl4735","m2lname":"Zhou","m1fname":"Yanhao","m4fname":"","m1lname":"Li","m3fname":"Chaoying","description":"Twitch, as one of the largest streaming platforms in the world, generates data of high volume, high velocity, and high variation every day, every hour, and every second. Exposed to tons of dispersed, abstract raw data, people struggle to extract useful information from it. Data aggregation and analysis are therefore valuable: they give this data more practical meaning, making it efficient and easy for streamers, viewers, platform managers, and even game companies to do planning, target setting, and exploration. Therefore, to free people from the infinite data flow, we present a visualized, intuitive Twitch analysis system.

Although several stream analyzers already exist, these visualization projects stop at the layer of “visualization” without considering the underlying patterns in the data. Our project not only focuses on visualizations but also provides concise trends and predictions of how popular stream categories will be in the future.
","uni":"yl4735","language":"Python, HTML, js, Flask, AJAX, Echarts, Airflow, Bigquery, Google Cloud Platform","pid":"202112-52","m4uni":"","analytics":"This system utilizes twitch API to request stream information, apply time series forecasting models to predict future viewer counts, and provides a web interface to visualize the data. For dataset collection, we use Airflow to schedule data crabbing through Twitch API. For data prediction, we use Random Forest Regression, ARIMA, and Prophet to forecast future viewer counts based on previous records. For data visualization, we use Echart + Ajax + Flask + Google Big Query to design and build a web application system that provides visualized data for users to interact with. Furthermore, we integrate streaming channels and live chat rooms in our web application. Users can search by their keywords, or select popular streamers to watch streaming and chat messages directly in our web app.","m4lname":"","industry":"Media","m3lname":"Zhang","dataset":"Our dataset is data of games, channels, players, tags, languages, schedule of twitch stream platform. Among those data, we especially focus on viewer data of popular games, and we did prediction of viewer count for a game based on this dataset. We collected the dataset by ourselves, we used airflow to collect viewer count of popular games through Twitch API. 
We scheduled the tasks to be triggered every 3 minutes(Due to the cache mechanism of Twitch, data does not refresh that fast) and stored them in Google Big Query for further training.","m2uni":"cz2664","m2fname":"Chengrui","m3uni":"cz2617"},{"projectname":"Search for a Connection: Energy Demand & Twitter Trending Topics","timestring":"Fri Dec 17 21:25:13 2021","m1uni":"rr3417","m2lname":"Luthfan","m1fname":"Rohan","m4fname":"","m1lname":"Raghuraman","m3fname":"Kevin ","description":"The goal of this project is to evaluate the hypothesis that spikes in energy demand for a given region can be reliably predicted by the monitoring and analysis of social media activity using big data analytics. Can Twitter activity formulated into topics be linked with a meaningful, causal relationship with energy demand spikes? The approach we take to evaluate this hypothesis is to explore the problem in the inverse. First, hourly energy demand data is collected for New York State for a 10 year period. A time series forecasting model is then used to control for seasonality and extract the dates of anomalous demand spikes. Twitter activity for the given region on the dates specified is then collected, and a topic modeling algorithm is used to determine the topics of interest during the period of the demand spike. Finally, using big data visualization techniques and power systems domain expertise we determine if the topics corresponding to demand spikes are merely correlations or a meaningful causal relationships, thus allowing us to form a hypothesis that twitter activity relating to certain topics of interest is likely to cause a spike in energy utilization.

Efficient prediction of anomalous spikes in energy demand is invaluable to utility companies. It allows them to perform appropriate resource allocation and pricing, so that fewer losses are incurred during unexpected spikes in demand.
Our analysis found that monitoring certain topics on social media can be used to predict energy demand anomalies. Of the topics we found, utility companies currently monitor only weather. We also found that traffic incidents and large cultural events are highly correlated with energy demand spikes. Using big data analytics, we have shown that a social media sensing pipeline that monitors Twitter activity for traffic incidents or large cultural events could be used by utility companies to reduce losses incurred from anomalous spikes in demand.
","uni":"rr3417","language":"GCP (DataProc, Storage Buckets), Python, Twitter Full Archive Search API, Streamlit, LDA, FBProphet Time Series Forecasting, PySpark","pid":"202112-8","m4uni":"","analytics":"Getting electricity load data:
Electricity load data was scraped from the NYISO website for all of NY state using Python. The data was stored as CSV files in Google Cloud Storage. A BigQuery database was created using the CSV files. The data was preprocessed into aggregation tables (hourly, daily, weekly, monthly, yearly) so that each needed aggregation is stored in advance and a heavy query need not be run every time.

Time-series forecasting:
The BigQuery database with electricity data was queried using pandas-gbq and converted to Spark RDD. The FBProphet library was used for forecasting. The forecasting was applied in parallel for different zones in NY state, utilizing PySpark's applyInPandas(). The metrics that were tracked were MAPE and RMSE. A few hyperparameters, like seasonality_mode and interval_width, had to be tuned.

Identifying critical dates:
The forecast and evaluation results were merged and used to calculate the forecast error for each date as the absolute difference between the forecast and actual values. The trends and seasonality were analyzed. An hour was flagged as anomalous when its RMSE value was 5 times more than the actual value. A day was flagged as anomalous if it contained 4 or more anomalous hours.

Getting Twitter data and LDA topic modeling:
Using the Twitter Full Archive Search API, tweets from the identified dates were retrieved and saved to storage. The data was preprocessed and vectorized. LDA was run on the data to determine the most salient topics on each day.

Data visualization:
The Streamlit library was used for data visualization and an ngrok server was used to host the web dashboard.","m4lname":"","industry":"Information","m3lname":"Murning","dataset":"1. Hourly electricity load data from NYISO website for all of NY state from 01/01/2010 to 10/31/2021: http://dss.nyiso.com/dss_oasis/PublicReports. This had to be scraped and stored on Google Cloud Storage and then transferred to a BigQuery database after preprocessing.

2. Twitter data from 2010 to 2021 using the Twitter Full Archive Search API (Academic Research Access needed)","m2uni":"rl3154","m2fname":"Rifqi","m3uni":"kmm2344"},{"projectname":"Machine Learning-Assisted Design of Novel EGFR Inhibitors for Lung Cancer","timestring":"Wed May 13 03:25:00 2026","m1uni":"ss7654","m2lname":"Cheng","m1fname":"Saha Dev","m4fname":"","m1lname":"Shanmugam","m3fname":"","description":"The objective of this project is to build an end-to-end AI workflow for EGFR inhibitor discovery that goes beyond molecule generation and supports practical candidate triage. The system generates candidate molecules, predicts potency (IC50/Kd), ranks candidates with medicinal chemistry constraints, and explains why each candidate is prioritized. Its key innovation is the integration of generation, tiered ranking, explainability, and reference-drug comparison in one toolkit. In practice, this is important because early discovery teams do not need only “more molecules”; they need transparent, defensible shortlists that balance predicted efficacy, chemical feasibility, and risk signals before moving to experiments.","uni":"ss7654","language":"Python, TypeScript, JavaScript, CSS, YAML, FastAPI, Next.js, React, RDKit, PyTorch, Pandas, NumPy","pid":"202605-4","m4uni":"","analytics":"The implemented analytics and algorithms include potency prediction (predicted IC50/Kd with pIC50/pKd ranking transforms), hard-gate medicinal chemistry checks (including Lipinski-related constraints), developability scoring, diversity shortlist construction, SA score estimation, and PAINS-style structural alert annotation. The system modules include data ingest/merge, model prediction, molecule generation, tiered ranking with explanation fields, structural-alert screening, and comparison APIs for benchmark and pairwise analysis. 
The visualization layer includes live generation tiles, candidate card lists with ranking metadata, 3D molecule viewing, benchmark bar charts for reference-vs-generated potency, Morgan-Tanimoto similarity strips, and detailed side-by-side candidate-versus-reference metric tables with structured ranking breakdowns.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The project is tested on EGFR-related bioactivity data from public sources, primarily ChEMBL and BindingDB, which are ingested and merged into pipeline inputs and then curated for model use. In the current workflow, these data support affinity modeling for IC50/Kd and downstream candidate ranking and comparison. The public dataset description can be summarized as “EGFR-targeted small-molecule bioactivity records from ChEMBL and BindingDB, curated and merged for affinity prediction and candidate prioritization.” Beyond this exact set, the software can support other datasets as long as they provide valid molecular structures (e.g., SMILES) and compatible activity labels (such as IC50/Kd in consistent units) through the same CSV/config-driven pipeline pattern.","m2uni":"hc3645","m2fname":"Eric","m3uni":""},{"projectname":"Asset Allocation and Recommendation (US)","timestring":"Wed May 13 23:33:52 2020","m1uni":"qz2354","m2lname":"Lin","m1fname":"Qianrui","m4fname":"","m1lname":"Zhang","m3fname":"","description":"The purpose of our project is to design and implement an asset recommendation system that lets users buy traditional assets and cryptocurrency at the same time. The system takes in a combined dataset of stocks, ETFs and cryptocurrencies, as well as the user's preference on value return or risk. The system then passes these inputs to two state-of-the-art, model-free reinforcement learning models and outputs the strategy that better fits the user’s preference. 
In this sense, our system is an application using reinforcement learning models on mixed datasets (traditional assets and cryptocurrencies) with user preference considered.","uni":"qz2354","language":"AWS, Python","pid":"202005-13","m4uni":"","analytics":"1. We implemented two state-of-the-art reinforcement learning models for asset allocation.
2. We designed and implemented a model-selection algorithm to choose one model based on user preference.
3. Finally we built a web application on AWS for asset recommendation.","m4lname":"","industry":"Finance","m3lname":"","dataset":"The dataset we use is the historical financial data of stocks, ETFs and cryptocurrencies. We got the data mainly via Yahoo! Finance API. And our system can support all kinds of historical price data with open, close, high, low values.","m2uni":"tl2957","m2fname":"Tianyi","m3uni":""},{"projectname":"Integrating Satellite Data and Sentiment Analysis for Comprehensive Environmental Monitoring and Public Perception of Climate Change","timestring":"Fri Dec 20 11:53:03 2024","m1uni":"tb3145","m2lname":"Gantz","m1fname":"Thomas","m4fname":"","m1lname":"Bordino","m3fname":"","description":"This project integrates high-resolution satellite data from NASA’s TROPOMI with advanced natural language processing (NLP) techniques to provide a comprehensive framework for environmental monitoring and understanding public perceptions of climate change. By combining pollutant distribution maps with sentiment and thematic analyses of climate-related news articles, the research aims to address gaps in correlating geospatial data with societal discourse. This interdisciplinary approach enhances insights into the intersection of environmental conditions and public awareness, offering valuable implications for policymaking, public communication, and environmental informatics.
","uni":"tb3145","language":"Python, Panoply, Plotly, TROPOMI NASA instrument","pid":"202412-27","m4uni":"","analytics":"Analytics
The study integrates multiple analytics techniques to provide a comprehensive understanding of climate change dynamics. Sentiment analysis quantifies the tone of climate-related articles, categorizing them as positive, negative, or neutral, and aggregates these scores regionally to capture geographic trends. Emotion analysis delves deeper into media narratives, identifying prevalent emotions such as anger, fear, joy, and sadness, which are crucial for understanding the emotional undertones of public discourse on environmental issues. Geospatial pollutant mapping utilizes satellite data from instruments like TROPOMI to create high-resolution maps of atmospheric pollutants such as NO2 and CO2, offering insights into air quality and pollution hotspots.

Algorithms
The research employs state-of-the-art algorithms for both geospatial and textual data processing. Named Entity Recognition (NER) is performed using a fine-tuned BERT-large-cased model optimized for extracting precise location-based entities from climate-related text. For sentiment and emotion classification, DistilBERT and DistilRoBERTa models are utilized, offering computational efficiency without compromising on accuracy. These pre-trained transformers, fine-tuned on domain-specific datasets, capture nuanced emotional tones and sentiment trends in climate discourse. Topic modeling is executed using Latent Dirichlet Allocation (LDA), which identifies thematic trends within large textual datasets by extracting the most relevant topics and keywords. These advanced algorithms enable detailed exploration and correlation of textual and geospatial data.

System Modules
The system architecture is designed to seamlessly integrate geospatial and textual data. A satellite data extraction module accesses pollutant information from NASA’s Earthdata archives, utilizing tools like Panoply for processing netCDF files. The natural language processing (NLP) module processes thousands of climate-related news articles, performing NER, sentiment analysis, emotion detection, and topic modeling. An interactive visualization interface, built using HTML and JavaScript, allows users to explore the results dynamically, featuring dropdown menus for time-based filtering and real-time updates of visual data. These modules work in unison to provide an integrated platform for analyzing the interplay between environmental conditions and public discourse, supporting policy-making and public awareness initiatives.","m4lname":"","industry":"Media","m3lname":"","dataset":"The project utilized two primary datasets:
NASA Satellite Data: Pollutant data, including CO2 and NO2 distributions, were sourced from public NASA missions such as TROPOspheric Monitoring Instrument (TROPOMI). These datasets provide high-resolution geospatial pollutant distribution, publicly accessible through NASA's Earth Observing System Data and Information System (EOSDIS).
Climate-Related News Articles: A dataset of over 30,000 articles from The Guardian was acquired via Kaggle. Articles from December 2017 to January 2024 were filtered to include only climate-related content from 2023. This dataset contains titles, introductory summaries, full article texts, authors, and publication dates, structured for detailed sentiment and thematic analysis.
The models developed in this project are designed with adaptability to incorporate additional datasets beyond the ones tested. For example on the NLP side, the framework can accommodate textual datasets from diverse sources, including social media platforms (e.g., Twitter API), blogs, policy documents, or environmental reports. Pretrained NLP models such as BERT or DistilBERT can be fine-tuned to adapt to domain-specific vocabularies, ensuring accurate sentiment analysis, topic modeling, and emotion classification. Additionally, the system's architecture allows for the integration of multimodal data, enabling analyses that combine geospatial, textual, and even temporal datasets to derive richer, context-aware insights.","m2uni":"bg2666","m2fname":"Brian","m3uni":""},{"projectname":"Exploring the use of AI in aging research","timestring":"Thu May 12 00:31:03 2022","m1uni":"vv2339","m2lname":"","m1fname":"Viswajit","m4fname":"","m1lname":"Vinod Nair","m3fname":"","description":"Researching the possibilities of applying AI in aging research using epigenetic and lifestyle data.","uni":"vv2339","language":"Python, React.js, Javascript","pid":"202205-22","m4uni":"","analytics":"","m4lname":"","industry":"Information","m3lname":"","dataset":"I used the Johansson dataset from the NCBI Gene Expression Omnibus repostiory. It contains DNA methylation data, age, smoking intensity etc of 732 individuals.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"MotiveOps: Motivation-Aware Agent for AI Workflow Adoption","timestring":"Wed May 13 00:05:50 2026","m1uni":"cc5240","m2lname":"","m1fname":"Chih-Hsin","m4fname":"","m1lname":"Chen","m3fname":"","description":"MotiveOps is a motivation-aware agent for AI workflow adoption. The objective is to help users involved in AI workflow adoption choose a safe, concrete first action when adoption is blocked by trust, risk, workflow uncertainty, manager judgment, confidentiality, or governance concerns.

The key innovation is treating motivation as an inspectable control variable for agent behavior. Instead of secretly inferring a user's psychology, MotiveOps uses an explicit self-report motivation intake, where the user selects concrete concerns such as risk and policy, protecting affected people, experimentation, visible progress, or uncertainty about where to safely start. The system maps those self-reported concerns into an initial motivation estimate for interpretation, while still running controlled motivation-profile agents for comparison.

MotiveOps combines transparent motivation intake, controlled motivation-profile generation, policy-grounded recommendation synthesis, a three-layer action/rationale audit, and a sensitivity map for testing when motivational changes alter behavior. These toolkits are important because enterprise AI adoption failure is not only a tooling problem; it is also a motivation, trust, safety, and workflow activation problem.","uni":"cc5240","language":"The system uses TypeScript, React, Vite, and a Cloudflare Worker backend for the web application and API layer. A local Python RAG sidecar uses Chroma for vector retrieval over the curated policy corpus. The system uses OpenAI models for subject-agent generation and judge-based evaluation. The project runs locally with npm scripts and local environment variables for API keys.","pid":"202605-24","m4uni":"","analytics":"The project implements a motivation-aware multi-agent recommendation pipeline, explicit motivation-intake scoring, policy-grounding RAG, blocker and risk detection, intervention playbook retrieval, and structured intervention-card generation. The motivation-intake layer maps self-reported adoption concerns to an estimated motivation profile and strongest value-axis signal for interpretation, while the multi-agent layer compares Achievement, Exploration, Preservation, and Neutral profiles under controlled prompts. The evaluation modules include modal stability analysis, divergent scenario rate, three-layer value alignment auditing, policy retrieval Recall@5, constraint uptake rate, and a 144-cell sensitivity-grid boundary map. The three-layer audit consists of L1 declared motivation parsing, L2 independent judge inference from the chosen intervention, and L3 rationale-based motivation extraction using a lexicon with model fallback. 
The frontend visualizes scenario setup, motivation intake, recommendations, policy grounding, three-layer alignment results, evaluation metrics, and boundary-map heatmaps.","m4lname":"","industry":"Information","m3lname":"","dataset":"The evaluation used 9 synthetic canonical AI adoption scenarios grouped into three families: trust and workflow anxiety, social and professional risk, and governance/accountability. Each scenario was tested across 4 motivation profiles and 5 trials, producing 180 intervention outputs. The system also evaluated 144 sensitivity-grid contrasts to test whether perturbing motivation axes changed the selected recommendation. For policy grounding, I curated 37 policy/RAG chunks from public responsible-AI and workplace governance sources, including NIST AI RMF, NIST Generative AI Profile, EEOC, DOL, FTC, Fed/OCC model-risk guidance, education privacy guidance, and workplace AI governance documents. The system also supports custom user-entered AI adoption scenarios with motivation-intake selections, and can be extended with company-specific AI policies, security rules, rollout notes, employee survey snippets, or internal responsible-AI guidance.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Sentiment Analysis between Social Media Personalities & Events and Market/Cryptocurrency Trends","timestring":"Thu Dec 23 02:39:18 2021","m1uni":"sg3904","m2lname":"Matic","m1fname":"Sanket Sunil","m4fname":"","m1lname":"Gokhale","m3fname":"","description":"This paper aims to uncover trends between social media and the stock/cryptocurrency markets. In order to achieve this goal, an abundant amount of Twitter and Reddit data is
collected along with financial data (stocks and cryptocurrencies) from Yahoo. An LSTM model is used to predict market trends for both stocks and cryptocurrencies based on the events occurring on Twitter and Reddit. The historical data collected within the past 2 years is used to train the LSTM model and to analyze the trends of the financial markets in response to opinions posted on Twitter and Reddit. Once the model is trained, real-time data from both Twitter and Reddit are collected again and used to predict upcoming market trends. The focus of this paper is to analyze these trends on a day-to-day basis and report the findings. Findings show that a simple LSTM model is able to accurately predict the trends of stock and cryptocurrency prices (if prices are trending upwards or
downwards), though requires additional training and precise parameter tuning for the model to predict prices consistently within a 5% accuracy range.","uni":"sg3904","language":"Python, tensorflow","pid":"202112-50","m4uni":"","analytics":"1. Data Extraction:
The Twitter and Reddit streams are connected to the VM using Tweepy and PRAW, respectively. They are stored in Spark DataFrames and then transferred to Google Cloud Storage (GS Buckets). A variety of scripts are used to extract the data. All of the scripts are written in Python, but different libraries serve different collection purposes. The Twint library is used for historical Twitter data collection, while the Tweepy library is used for streaming live Twitter data. Similarly, PSAW is used for historical Reddit data collection, while PRAW is used for live Reddit data streaming. GCP jobs for each of these scripts, namely the streaming ones, are invoked daily so that data collection supports LSTM model prediction and training/updating simultaneously.
2. Sentiment Analysis:
Sentiment analysis on historical data is performed using the Flair NLP package, which produces sentiment scores for individual tweets and Reddit posts.
3. Machine Learning Pipeline:
Spark is used to create a pipeline for the LSTM model to take the twitter and reddit sentiment scores. ","m4lname":"","industry":"Finance","m3lname":"","dataset":"The data is collected using several Python scripts; there are 5 scripts being used to collect data with 2 being for collecting past Twitter and Reddit data, and another 2 being used for real time streaming of Twitter and Reddit data. The 2 scripts used for collecting past data for Twitter and Reddit comments collect past data to be used for training purposes. They are vital for model training, and output both Twitter and Reddit data with numerous features as described above. Another script used is to collect financial data for certain stocks and cryptocurrencies from Yahoo’s data. This data encompasses daily values for the past year, in order to be compared with the social media posts used for prediction. This dataset is used for training and validation purposes. The other 2 scripts are used for real-time data collection, following the ‘3 Vs’ of big data. This data will be used for prediction to determine whether stock and cryptocurrency markets will rise or drop depending on the sentiment and the content of the Twitter and Reddit posts.
","m2uni":"2725","m2fname":"Filip ","m3uni":""},{"projectname":"Multi-agent Stress Testing of LLM Ethical Reasoning","timestring":"Wed May 6 02:01:28 2026","m1uni":"ns3942","m2lname":"Krasner","m1fname":"","m4fname":"","m1lname":"Sharma","m3fname":"","description":"This project builds a multi-agent stress-testing toolkit for evaluating whether LLM ethical reasoning is robust, sensitive, and internally consistent. Instead of only checking whether a model gives the “right” answer, the system probes the model with counterfactual variants and Socratic follow-up questions, then scores the resulting behavior with RuC, DS, and CSR. The main innovation is turning Chang’s maieutic and counterfactual ethical-reasoning ideas into an automated, measurable evaluation pipeline. This is important because ethical LLM failures often appear not as obvious wrong answers, but as unstable reasoning, rigidity, or contradictions under pressure.","uni":"ns3942","language":"The backend evaluation pipeline is written in Python, with JSONL/JSON used for scenario data, traces, metrics, and frontend static assets. The frontend is built with TypeScript, React, Next.js App Router, and Tailwind CSS. Model providers include local Ollama for llama3.2:3b, Anthropic API for Claude Haiku/Sonnet, and OpenAI API for GPT models and instrumentation agents. The system runs locally on macOS and can be deployed as a static-friendly Next.js demo frontend.","pid":"202605-22","m4uni":"","analytics":"The system implements a Proposer, Counterfactualist, Maieutic Inquirer, CSR Judge, deterministic judgment parser, metrics engine, runner/orchestrator, and trace writer. The core analytics are RuC, which measures stability under morally irrelevant perturbations; DS, which measures sensitivity to morally relevant perturbations; and CSR, which measures how often Socratic probing surfaces contradictions. 
The frontend includes a dataset browser, scenario detail pages, demo trace replay, model-selectable aggregate scorecard, and cross-model scoreboard. Visualizations include Moral Machine emoji-grid renderings, timeline-style trace walkthroughs, metric cards, by-source breakdown tables, contradiction histograms, and model comparison tables.","m4lname":"","industry":"Information","m3lname":"","dataset":"We tested 180 public benchmark scenarios from three sources: Moral Machine-style autonomous-vehicle dilemmas, Scruples/AITA-style naturalistic moral anecdotes, and Hendrycks ETHICS deontology/justice questions. These were converted into a unified JSONL schema with consistent fields for task format, answer options, ground truth, metadata, and perturbation-relevant attributes. The software can support other ethical-reasoning datasets as long as they can be mapped into the same schema: a scenario text, finite answer options, optional ground-truth majority label, and metadata/attributes for perturbation generation.","m2uni":"dk3460","m2fname":"","m3uni":""},{"projectname":"AI Assistant for Nurses in Healthcare","timestring":"Tue May 13 11:54:56 2025","m1uni":"mch2214","m2lname":"","m1fname":"Mackiah","m4fname":"","m1lname":"Henry","m3fname":"","description":"The objective was to make an AI nurse that can help nurses, general practitioners, patients, and medical students in routine work. The system is capable of answering patient specific questions, general medical questions, interpretating x-rays of chest and predicting top 3 most likely diseases from symptoms. This is an important innovation. The unified system with all three features is not yet implemented by anyone. Some great tools were implemented by experts and researchers but none of them provided a unified toolkit. The goal was to revolutionalize healthcare using existing solutions (Retrieval augmented technigques, models, datasets).","uni":"mch2214","language":"I used Python and the platform is Google Colab. 
I used huggingface to load models and datasets. Gradio was used to create interface. Datasets: qiaojin/PubMedQA, FreedomIntelligence/Disease_Database, hongrui/mimic_chest_xray_v_1 Models: pubmed-clip-vit-base-patch32, vit-chest-xray, Llama-3.2-11B-Vision-Radiology-mini, Llama-3.2-3B-Instruct.","pid":"202505-7","m4uni":"","analytics":"The core of the system is Retrieval Augmented Generation. It is a technique where the data is converted from datasets into numeric data (embeddings) and stored in database(chromadb in my case). Now when the user enters its query, the query is also converted to embeddings. These query embeddings are then used as filter to match from stored embeddings. When match finishes, each data item(embedding) is assigned a similarity score ( score of how that specific stored embedding and query embedding are similar). The top 3-5 documents with highest similarity scores are fetched. Now these documents are again converted into normal text form. They are fed into the model and instruct the model to generate answer on the basis on these documents. Its kind of like providing the model with a book and asking it to find answer to the question from the book and answer it.
For evaluation, I used BERTScore, which measures how similar the model's answer is to the ground truth. The system reached around a 72.4% match with the ground truth. It is like comparing the model's responses with the actual answers and giving each a score out of 100. BERTScore reports three metrics: precision, recall, and F1 score.
Libraries and modules: chromadb (database), transformers (to import models), datasets (to load datasets from Hugging Face), bitsandbytes, torch, torchvision, gradio (to create the interface), unsloth (to load the model for scan interpretation).","m4lname":"","industry":"Information","m3lname":"","dataset":"My system currently supports textual and image datasets. The datasets include MIMIC X-ray for chest X-rays, PubMedQA for clinical Q/A, and Disease_Database for the disease prediction module. All of the datasets are available on Hugging Face, and I loaded them from there.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Dynamic High-Frequency Trading Model","timestring":"Thu Dec 19 22:01:42 2024","m1uni":"yg2960","m2lname":"Huang","m1fname":"Yining","m4fname":"","m1lname":"Gan","m3fname":"","description":"1. Objectives
Individual investors and portfolio managers face challenges in navigating the complexity of financial markets: 1. What to invest in? 2. What positions to hold? 3. How to allocate capital effectively? Our core objectives are: 1. Predict stock returns using robust, data-driven methods; 2. Optimize portfolio weights to balance risk and return; 3. Provide actionable insights and real-time visualization for investors.
2. Innovations
Data Processing Pipelines: Big Data Handling: Processes millions of data points from over 98 stocks across multiple years.
Multi-Factor Feature Engineering: Incorporates 15+ technical indicators (e.g., SMA, RSI, Bollinger Bands) for enhanced predictive power.
Dynamic Splitting and Rolling Windows: Efficiently prepares sequential data for time-series modeling.
Deep Learning Model (GRU): Custom GRU Architecture: Designed to capture long-term dependencies in sequential financial data, making it well-suited for return prediction. Information Coefficient (IC) Loss Function: Aligns model training with the financial goal of maximizing return predictability.
Optimization Techniques: Convex Optimization for Portfolio Weights: Maximizes risk-adjusted returns under real-world constraints, such as limiting single-stock exposure.
Real-Time Simulation and Visualization: Yahoo Finance API: Fetches real-time market data for trade simulation; Interactive Flask Dashboard: Visualizes portfolio performance and positions dynamically, providing an intuitive tool for investors.
3. Capabilities
Prediction Accuracy: The GRU model predicts next 5-day returns for stocks with high accuracy, leveraging advanced feature engineering and IC-based training.
Portfolio Optimization: Balances risk and return effectively, ensuring that the portfolio is aligned with investor preferences and market conditions.
Scalability: Handles large volumes of financial data (big data) and can scale to include more stocks, features, or alternative asset classes.
Actionable Insights: Ranks stocks based on predicted returns, enabling informed decision-making.
Real-Time Application: Simulates trades and visualizes results interactively, bridging the gap between theoretical models and practical trading strategies.","uni":"yg2960","language":"We utilized Python as the primary programming language due to its robust ecosystem of libraries and tools for data analysis, machine learning, and web development. Key Python libraries used include NumPy and Pandas for data manipulation, Scipy for statistical computations, and yfinance for fetching financial data from the Yahoo Finance API. For building and training deep learning models, the project employs PyTorch, leveraging its flexibility for constructing custom architectures such as GRU-based models. The portfolio optimization component relies on CVXPY, a powerful library for convex optimization problems. For data visualization and user interaction, the project incorporates Plotly to create interactive charts and Flask for deploying a local web server that presents portfolio performance dashboards.","pid":"202412-12","m4uni":"","analytics":"The analytics component involved processing millions of data points from historical stock market data, engineering technical indicators such as moving averages, Bollinger Bands, RSI, and Beta to create a rich feature set, and normalizing these features for consistency. The predictive modeling relied on a custom Gated Recurrent Unit (GRU) deep learning model, which captured sequential patterns in rolling windows of stock data. The model was trained using the Information Coefficient (IC) loss function, which aligns with financial performance objectives by maximizing the correlation between predicted and actual returns. To rank stocks, an Inverse Transform Value Method (IVM) was applied, sorting stocks into deciles and selecting the top-performing stocks for further optimization. 
Portfolio optimization was performed using a convex optimization algorithm implemented with CVXPY, maximizing risk-adjusted returns under constraints such as no short selling and maximum stock weight limits. For trade simulation, a system module fetched real-time market data via the Yahoo Finance API, dynamically calculated portfolio values, and tracked daily performance. The visualization module included an interactive Flask-based web application that utilized Plotly to display portfolio metrics such as cumulative returns, portfolio value trends, and individual stock performance. These components were seamlessly integrated to create a comprehensive system for predictive analytics, portfolio optimization, trade simulation, and real-time performance visualization.","m4lname":"","industry":"Finance","m3lname":"","dataset":"1. Datasets Tested
The datasets used in this project were obtained from the Yahoo Finance API, a public and widely used resource for historical financial data. The dataset includes daily price information for 98 publicly traded stocks selected from various sectors and major indices to ensure a diverse representation. The data spans the period from October 28, 2021, to December 11, 2024, capturing a broad range of market conditions. For each stock, the dataset provides open, high, low, close, adjusted close prices, and daily trading volumes, forming the foundation for subsequent analysis.
2. The dataset was programmatically fetched using the Yahoo Finance API through the yfinance Python library. By specifying the stock tickers, start and end dates, and the data interval, the library automatically retrieved and structured the data. The data was then preprocessed and transformed into weekly aggregates for use in the deep learning models. This approach ensured seamless integration of large-scale financial datasets directly into the analysis pipeline.","m2uni":"hh3042","m2fname":"Honghao","m3uni":""},{"projectname":"Dual Model Machine Learning: Predicting Movie performance Metrics","timestring":"Wed Jan 5 22:12:10 2022","m1uni":"wjl2128","m2lname":"","m1fname":"William","m4fname":"","m1lname":"Lees","m3fname":"","description":"*Can marketing a movie tailer still produce adequate profit even if the movie is viewed as overall bad in the public eye after release?
*Are there any outliers?
*Is it possible to predict a movie's performance (IMDb rating and net profit) based on YouTube comments on the trailer?
*If it is possible to predict these things, how will it impact the entertainment industry?
*How will trailers and movies adapt to meet consumer needs when ML is implemented?
","uni":"wjl2128","language":"Python ","pid":"202112-60","m4uni":"","analytics":"*Youtube API
Needed two keys to pull all data for 154 movie trailer links
*Pandas
Used to manipulate and append CSV data
*Urllib, Beautiful Soup
HTML parsing for movie rankings (Rotten Tomatoes ranks)
*Selenium: Google Chrome
Python library that drives Google Chrome to pull YouTube JavaScript data (comments)
*PySpark linear regression ML libraries
Dual linear regression models for movie ranking and net revenue
*VectorAssembler
","m4lname":"","industry":"Information","m3lname":"","dataset":"*List of 2046 words with rankings (-5 to 5)
*List of 154 movies and gross income for the action movie list
*Youtube Comment Data to pull data for 60 seconds per movie in movie list (total run time approx. 6 hours)
*List of all movies releasing in 2022 with trailers
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"AI for Human Consciousness: Outcome Prediction and Consciousness Detection in Patients With Acute Traumatic Brain Injury","timestring":"Fri May 5 17:02:53 2023","m1uni":"jak2302","m2lname":"","m1fname":"Jesse","m4fname":"","m1lname":"Kotsch","m3fname":"","description":"Combine analysis of advanced diagnostic techniques such as electroencephalogram (EEG), Functional Magnetic Resonance Imaging (fMRI) analysis and other contextual factors for consciousness detection and possible patient outcome prediction.

Patient prognosis is an important consideration when it comes to treatment.
","uni":"jak2302","language":"Matlab 2023a","pid":"202305-3","m4uni":"","analytics":"The EEG SVM model takes in a matrix of trials that has been split into training and test data and has labels associated with it. (Brain Activation or No Brain Activation)

The model can be run in manual mode, where the user inputs the desired frequencies, or in automatic mode, where the model extracts the frequencies itself.

The features for the SVM model are extracted from the power spectral density and then Principal Component Analysis (PCA) is performed to reduce the dimensionality of the features for processing.
The model will run through several PCA trials, preserving varied amounts of the variance. It will determine the error rate of the training data for each dimensional reduction and choose the best one.

I ran the model in manual mode so I could select specific frequencies for the brain waves. I chose 4, 8, and 12 Hz to fall within the window of alpha, beta, and theta brain waves.

For the MRI model, built-in MATLAB functions extract interest point descriptors from each image. Dimensionality reduction techniques are applied before the features are input into a Support Vector Machine model.

The MRI and EEG model results are then combined using a weighted average to determine patient outcome.
","m4lname":"","industry":"Information","m3lname":"","dataset":"The data set used was provided by Professor Lin. It consists of EEG data and MRI data from 4 patients.
Two are unconscious patients and two are control (conscious) patients.

The EEG data is an Excel spreadsheet of data from 32 channels, collected while conducting timed trials.
During these trials, the researchers would ask the patients to open and close their hand, or to stop opening and closing their hand, and then they would record for two seconds.

This data was split into 2-second segments, which were fed into their support vector machine model.

The MRI data consists of .dcm files. The images were taken using multiple readout techniques and were not part of the original experiment. My goal was to use them to increase the accuracy of the EEG SVM model.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYPD Crime Dara Analysis","timestring":"Sat Dec 22 16:32:08 2018","m1uni":"zb2227","m2lname":"cheng","m1fname":"zhiyu","m4fname":"","m1lname":"bi","m3fname":"","description":"Safety is one of the most fundamental needs of people. Population-dense cities often have to deal with increased crime rates, which usually are unevenly distributed within a city. As one of the most populous cities in the US, New York City consists of five different administrative divisions called boroughs. Each borough differs significantly in area, demographics, wealth and lifestyle. Therefore, the motivation of this project is to understand the relationship between the different aspects of the city and crime rate through data visualization and modeling.

The specific objectives of this project are listed below:
1. Gain insights into the intrinsic relationship between geographic information, demographics, types of crime and crime rates.
2. Examine crime similarities between the 77 precincts in New York and then group them accordingly.
3. Identify factors that affect the crime rate.

There are three parties that can take advantage of our analysis:
1. Help the New York Police Department (NYPD) understand how to better deploy police resources in the city in order to decrease the crime rate differences between boroughs.
2. Help New York City authorities work on urban planning and thus improve the quality of life in the city.
3. Help housing businesses set prices that better reflect the local crime situation.
","uni":"zb2227","language":"Language (Python, Spark), Platforms (Jupyter Notebook) ","pid":"201812-7","m4uni":"","analytics":"The following three analysis is conducted through data visualization:
1. Time trend: A line chart illustrates the crime trend from 2006 to 2016 for New York City as a whole and for each borough. The hourly trend is also plotted to find the peak hours for crime records.
2. Crime density: Histograms show the differences in crime rate among boroughs in relation to their demographic and geographic information. In addition, a crime density heatmap is plotted, where more intense color means higher crime density.
3. Crime categories: There are three types of crime: felony, misdemeanor, and violation. Histograms show which type of crime occurs most frequently in each borough.

Several regression algorithms are implemented to understand the relationship between the crime rate (per 1000 residents) and some demographic factors.

k-means clustering is implemented to check how strong the similarities between the 77 NYC precincts are and whether there are any homogeneous groupings among precincts that could be investigated further.
","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"The main dataset used in this project is collected from NYPD Complaint Data Historic public dataset through API. This dataset includes valid felony, the misdemeanor, and violation crimes reported to the NYPD from 2006 to the end of 2016. It has more than 6 million rows and 35 columns with 1.9 GB.

Another dataset used in supervised learning part is from NYU Furman Center. This dataset provides information on neighborhood demographics between 2010 and 2015. It has 25 features in total including borough, population, poverty rate etc. ","m2uni":"zc2391","m2fname":"zian","m3uni":""},{"projectname":"Music Personalization","timestring":"Mon Jan 4 22:23:39 2021","m1uni":"tc3070","m2lname":"Shi","m1fname":"Thi Thuy Linh","m4fname":"","m1lname":"Chu","m3fname":"","description":"Music personalization is a project inspired by Spotify (Discover Weekly and Wrapped) that provides users with song recommendations based on listening behavior. It also allows users to get an insight into their listening history through quick/simple visualizations.

Through this project we learned about cloud computing (AWS), personalization algorithms, and different visualization techniques.","uni":"tc3070","language":"Python, Javascript, HTML, CSS, AWS Services (Lambda, DynamoDB, API Gateway, S3)","pid":"202012-4","m4uni":"","analytics":"For Top Music/Lyrics Cloud/Geography/Moods we focus more on visualizations, so the data presented are simple summary statistics. For Recommendations, we use content-based recommendation, which returns songs similar to those that a user liked in the past (by learning preferences in terms of audio features). We added flexibility to the recommendation system by allowing the user to pick specific genres or to include only songs released after 2010.
","m4lname":"","industry":"Media","m3lname":"","dataset":"","m2uni":"hs3142","m2fname":"Hankun","m3uni":""},{"projectname":"Twitter Based Sentiment Analysis of Two Famous Actresses","timestring":"Fri Dec 21 15:51:51 2018","m1uni":"yf2466","m2lname":"Yang","m1fname":"Yue","m4fname":"","m1lname":"Feng","m3fname":"Yizhi","description":"We hear about the controversies between Anne Hathaway and Jennifer Lawrence all the time. Thanks to this class, we have a chance to uncover the mystery. Taking advantage of Big Data, we collect and filter tweets from people in America, train a sentiment analysis model for prediction, and get a general idea of how people think of Hathaway and Lawrence.

Much sentiment analysis work has been done with classification algorithms such as LSTM. Some models can even predict several sentiments, such as surprise and anger, while others use a Maximum Entropy Classifier or a Decision Tree. But those models are trained on different kinds of datasets. Here, we decide to train our own model.

Among related models, several use Naïve Bayes, Random Forests, Support Vector Machines, etc. Here we tested Naïve Bayes, Random Forest, and XGBoost, and found that XGBoost performs best. For the testing data, we also use a method that retrieves the location information of tweets, which helps in geo-visualization.
","uni":"yf2466","language":"python, javascript","pid":" 201812-11","m4uni":"","analytics":"This project conducts comparison of popular sentiment of two actresses - Anne Hathaway and Jennifer Lawrence based on Twitter API. We trained a XGBoost model to classify the sentiment of tweets and visualized the results using d3.js.","m4lname":"","industry":"Media","m3lname":"Zhang","dataset":"","m2uni":"my2577","m2fname":"Manqi","m3uni":"yz3376"},{"projectname":"Automated DraftKings DFF Roster Optimizer Utilizing Ridge-Regression and Integer Programming","timestring":"Tue Dec 22 19:58:35 2020","m1uni":"jrg2204","m2lname":"Thakker","m1fname":"John","m4fname":"","m1lname":"Gearheart","m3fname":"Sakib","description":"Background and Objectives:
As DraftKings DFS has grown to 500,000 active monthly users, an increase in the number of commercial Fantasy Sports Roster Analysis Tools and publicly available NFL datasets has led to the most competitive landscape in Fantasy Sports yet. This project aims to explore the use of Machine Learning and Optimization Algorithms to outperform competitive systems and discover whether it is in one's best interest to accept DraftKings' model at face value.

Innovations and Capabilities:
Linear Ridge-Regression model creation coupled with Integer Programming Optimization for Automated DraftKings Roster Generation (Weeks 7-14 of 2020 NFL Season)","uni":"jrg2204","language":"Python","pid":"202012-1","m4uni":"","analytics":"onehot, pandas, numpy, gurobipy, requests, Linear Ridge-Regression, Integer Programming Optimization, Web Scraping","m4lname":"","industry":"Media","m3lname":"Salim","dataset":"www.statheads.com, www.footballdiehards.com","m2uni":"mtt2132","m2fname":"Maharshi","m3uni":"srs2284"},{"projectname":"Automatic Consumer Personalized Recommendation","timestring":"Fri May 15 18:40:35 2020","m1uni":"zl2839","m2lname":"Feng","m1fname":"Ziying","m4fname":"","m1lname":"Liu","m3fname":"","description":"Because a large number of tweets are posted per day, high-quality articles or useful tweets may be overwhelmed by other tweets that users do not care about at all. Our project goal is to build a complete recommender system that can detect users’ potential interests and recommend the tweets and hashtags that they are really interested in.","uni":"zl2839","language":"Python 3","pid":"202005-8","m4uni":"","analytics":"For hashtag ranking, we introduce an original algorithm named Hashtag Frequency-Inverse Hashtag Ubiquity (HF-IHU), which is a variation of the well-known TF-IDF.
For tweet ranking, we make tweet recommendations based on collaborative ranking to capture personal interests and incorporate explicit features from Twitter as well.","m4lname":"","industry":"Media","m3lname":"","dataset":"We scraped real-time data through Twitter's API and obtained 52,104 tweets located near NYC on March 1st.
We created a tweet-post heatmap for NYC and filtered out data that are not from the five boroughs: Manhattan, Queens, Brooklyn, Bronx, Staten Island. After filtering, the data contain 37,092 tweets. We used these filtered tweets to build NYC hashtag trends and recommend them to each user.
For the user's personalized timeline recommendation, the final number of target users was 8,025, and we scraped 50 tweets for each user.","m2uni":"jf3283","m2fname":"Jiana","m3uni":""},{"projectname":"Electricity Price Forecasting","timestring":"Fri Dec 15 21:27:28 2023","m1uni":"wl2927","m2lname":"Huang","m1fname":"Wenbo","m4fname":"","m1lname":"Liu","m3fname":"Jiajun","description":"Background: Due to factors such as global renewable energy growth, variability in energy output from sources such as solar and wind, and instability of the global energy market from the pandemic and recent conflicts, energy prices have become increasingly difficult to predict, causing issues in balancing supply and demand and ensuring grid stability.
Goal: Create an advanced electricity price prediction platform using big data and machine learning technologies to empower energy sector users.
Novelty:
Comprehensive Data Integration: Uses diverse datasets such as grid status, weather data, and future market information to provide results based on a wide range of variables.
Advanced Data Processing: Employs big data processing tools such as Spark and Airflow with advanced algorithms, whereas traditional models may struggle with large-scale data.
Real-Time Ingestion and Prediction: Ingests data and provides accurate predictions on a daily basis, with the ability to increase frequency based on data sources.
Continuous Improvements: Regular updates and improvements to the prediction model capture current events such as the pandemic and the Russo-Ukrainian war.
User Interface: Accessible and interactive user interface broadens the audience of the model, allowing professionals as well as amateurs to explore the current energy status.

","uni":"wl2927","language":"Python, JavaScript, HTML, Google Cloud Platform","pid":"202312-12","m4uni":"","analytics":"Our data come from different sources with different granularities. For the load data, we first calculate the sum of loads and then calculate the proportion of each power generation method. For the LMP data, we calculate the average value over a whole day.

We have more than 10 features initially. We calculate the correlation between the price and these features. For the features with a negative correlation, we use their reciprocal to train our model.

After getting the training set, we standardize the data to enhance the performance of our model.

For the uranium and coal data, because they come from the futures market and prices are not available on weekends or holidays, we use the data from the preceding Friday as their price on weekends.

After comparing the performance of six different models (linear regression, decision tree, random forest, k-nearest neighbors, XGBoost, and neural network), we decided to use the random forest algorithm to make the prediction.

We schedule our program to run every day with the help of Airflow. On each run, we pull the latest data and train a new model to predict the price of the next day.

We use D3 to visualize our data. Our web page can show the figures of \"load\", \"price\", \"weather\" and the performance of our trained model.
","m4lname":"","industry":"Information","m3lname":"Han","dataset":"Data Sourcing:
Electricity data are sourced from the “gridstatus” package, which contains up-to-date grid data for NYC. The granularity of both price and load is five minutes.

Weather data are sourced from Visual Crossing. The raw data include ‘temp’, ‘tempmax’, ‘tempmin’, etc. The granularity is one day.

Coal and uranium futures prices are sourced from investing.com.

Features:
After testing different features, we select 9 features:
Power generation method proportions (5 features)
Price of electricity, coal and uranium
Weather (temperature)

3Vs:
Volume: Our data contain more than 1,000,000 rows for model training and test development.

Velocity: Our source data refreshes every five minutes. We currently pull the dataset daily and process the data immediately to train the model.

Variety: Our raw dataset has more than 20 attributes; after processing the raw data, we obtain 9 features to train our model.

","m2uni":"jh4709","m2fname":"Jiuke","m3uni":"jh4598"},{"projectname":"PUBG Game Analysis and Visualization","timestring":"Tue Dec 25 04:32:37 2018","m1uni":"cc4289","m2lname":"Yin","m1fname":"Chen","m4fname":"","m1lname":"Chen","m3fname":"Yisu","description":"Existing analyses of PUBG have primarily focused on the performance of an individual player and general descriptive information of equipment. However, it may still be difficult for new players to truly understand the rules without a clear demonstration of a match. Also, players may be frustrated by continuous failures if they do not master some \"game techniques\".

Therefore, we aim to provide effective survival guidelines for players with various game backgrounds.
We performed a series of analyses based on the game data collected from the official PUBG Developer API in an attempt to conclude some useful survival strategies. In this project, we will display death-prone areas, visualize the dynamic process of each game and predict players’ final rankings given specific features.","uni":"yz3477","language":"Python3, JavaScript, Hadoop ","pid":"201812-5","m4uni":"","analytics":"1. Display heatmaps of distributions of players' deaths, essential weapons and medicines on game maps.
2. Visualize the dynamic process of a randomly picked match with JavaScript.
3. Predict players' final rankings given features such as the total number of enemies killed, using linear regression, random forest, and AdaBoost algorithms.
","m4lname":"","industry":"Information","m3lname":"Zhong","dataset":"The dataset was collected from the official PUBG Developer API, which returns data based on the format of the URL that the user requests, with a rate limit of 10 entries per minute. The size of the dataset is around 55 GB. Specifically, the dataset contains 30,000 instances of match data and 5,000 instances of telemetry data, where each instance is stored in JSON format.","m2uni":"hy2568","m2fname":"Hang","m3uni":"yz3477"},{"projectname":"Social Media Analytics: Developing a Bloomberg Function Like Tool that Informs Trading Strategies","timestring":"Sat Dec 18 01:52:42 2021","m1uni":"rc3372","m2lname":"Hu","m1fname":"Rui","m4fname":"","m1lname":"Cheng","m3fname":"Shihang","description":"With the increasing volume of retail trading in the US stock market, we have witnessed growing influence from retail traders' actions on the price of a particular stock. To understand this trend, we develop a Bloomberg Function-like social media analytic tool to gain insight on retail investors' trading strategies, sentiments, and stock of choice.","uni":"rc3372","language":"Backend: AWS lambda(Python), AWS Kinesis Firehose, AWS S3, AWS Quicksight, AWS Athena Model: AWS sagemaker(Tensorflow) Frontend:HTML CSS, Javascript","pid":"202112-6","m4uni":"","analytics":"Our system is built on AWS. A data stream for a specific time frame is fetched from the Reddit API and Twitter API via AWS Lambda and is then passed on to AWS S3 through AWS Kinesis Firehose. Next, the raw data stream stored in AWS S3 is passed on to Amazon Comprehend to uncover information in unstructured text data. The same raw data is also sent to a Lambda function for filtering and analysis using our two-stage deep learning framework. The processed data is stored in Amazon S3 and is later used for visualization through Amazon QuickSight.

In this project, we introduce a two-stage framework. In the first stage, we deploy well-trained deep learning models to filter out noisy data. In the second stage, we deploy several machine learning-based methods or data visualization techniques to provide a general view of the Twitter and Reddit data, which can help us understand people's attitudes on social media towards the companies we are interested in and assist us in making investment decisions.

For visualization, we adopt Amazon QuickSight as part of our system to present our analysis results in a more user-friendly and interactive way. We also built a website frontend capable of embedding an AWS QuickSight dashboard.","m4lname":"","industry":"Media","m3lname":"Wang","dataset":"The data used in this project are live streaming tweets from Twitter and posts within a predefined timeframe in the subreddit r/wallstreetbets. These data are first filtered using a list of traded stock names and company names as keywords and stored in S3 buckets. Data are updated every hour to provide the most recent insights on the retail community. A data sample of Tesla-related tweets contains 64,000 characters in a single update.","m2uni":"yh3273","m2fname":"Yangjinan","m3uni":"sw3275"},{"projectname":"Recommendation Algorithms Implementation and Evaluation System","timestring":"Sat Dec 18 03:37:49 2021","m1uni":"sg3971","m2lname":"Liu","m1fname":"Siwei","m4fname":"","m1lname":"Guo","m3fname":"Xinxin","description":"As we all know, recommendation systems are widely used around the world nowadays.
This raises questions: what is a good recommendation system? How should a recommendation system’s efficiency be evaluated?
Recommendation algorithms have multidimensional properties, and different kinds of applications can have heterogeneous requirements, and hence different appropriate algorithms and different evaluation metrics. How can algorithms be matched well to applications?
To answer these questions, we designed a system that will be of great use in the real world.
In our project, we design a recommendation algorithms implementation and evaluation platform.

In our platform, we support a variety of classical recommendation algorithms, especially deep learning algorithms, and many different kinds of evaluation metrics.
A user of our system, who could be an engineer or a researcher, can choose different algorithms and parameters using our front-end website. Our system will return recommendation results and evaluations.

","uni":"sg3971","language":"Python, GCP","pid":"202112-23","m4uni":"","analytics":"Our project will be built in Flask, and we will first do XML configuration in the system. When implementing deep learning algorithms, we will make use of GCP, BigQuery, and also PySpark, including RDD, which is the big data part of our project.

As for the method, we mainly use three algorithms in our project: two deep learning algorithms (Standard-VAE and Multi-VAE) and the Alternating Least Squares model.
","m4lname":"","industry":"Information","m3lname":"Wu","dataset":"We will take movie recommendations as an example in our project. The dataset we use is a movie rating dataset from MovieLens. Our example dataset contains over 20 million ratings (created by users between January 09, 1995 and March 31, 2015). Users in this dataset were selected randomly for inclusion, and all selected users had rated at least 20 movies.

It is a public dataset from Kaggle \"MovieLens 20M Dataset\".

Our system can work with many datasets, as long as each has three columns: user, item, and ratings. ","m2uni":"zl3031","m2fname":"Zijun","m3uni":"xw2789"},{"projectname":"Outcome Prediction and Consciousness Detection in Patients With Acute Traumatic Brain Injury","timestring":"Fri Dec 15 23:12:46 2023","m1uni":"jak2302","m2lname":"Patrick Tracy","m1fname":"Jesse","m4fname":"","m1lname":"Kotsch","m3fname":"Calvin ","description":"Combine analysis of advanced diagnostic techniques such as electroencephalogram (EEG), Functional Magnetic Resonance Imaging (fMRI) analysis and other contextual factors for consciousness detection and possible patient outcome prediction - see Glasgow Outcome Scale Extended (GOSE).
Patient prognosis is an important consideration when it comes to treatment.
","uni":"jak2302","language":"Matlab, Python, Google Colab","pid":"202312-1","m4uni":"","analytics":"The EEG data was analyzed using a support vector machine model. The features for the SVM model are extracted from the power spectral density and then Principal Component Analysis (PCA) is performed to reduce the dimensionality of the features for processing. The model will run through several PCA trials, preserving varied amounts of the variance. It will determine the error rate of the training data for each dimensional reduction and choose the best one.

The MRI data is preprocessed by creating synthetic data to make the dataset more robust. The image processing model is a modification of ResNet50 architecture to support 3D input data. It is implemented using Keras/Tensorflow.
","m4lname":"","industry":"Information","m3lname":"Yu","dataset":"The dataset consists of 117 patients who have suffered from traumatic brain injury. The data was collected from a medical group in Taipei and consists of EEG and MRI data for each patient. ","m2uni":"","m2fname":"Greg ","m3uni":""},{"projectname":"Chess - A Visual History","timestring":"Tue Dec 22 17:24:56 2020","m1uni":"cc4399","m2lname":"","m1fname":"Christopher","m4fname":"","m1lname":"Charron","m3fname":"","description":"Chess was invented in the 6th century, with recorded games following chess’ modern rules existing from the 1700s. The game has evolved significantly since then, with human prowess said to have made significant leaps first during the Cold War, and then in the early 2000s with the advent of widely available computer assistance. The aim of this project is to apply data visualization techniques to identify trends and insights over the long history of the game, with a focus on opening moves and the evolution of Elo ratings.","uni":"cc4399","language":"Python, Spark, Javascript, BigQuery, CSS, amCharts","pid":"202012-2","m4uni":"","analytics":"Analytics in BigQuery, Python, Spark, with visualization in amCharts","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset contains 3.5M games played between 1783 - 2007, available for free from the Chess Research Project. The visualization would be able to support any expansion of the data set or addition of more recent data.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Visual Exploration in Immersive Environment","timestring":"Thu May 12 02:53:01 2022","m1uni":"zz2721","m2lname":"","m1fname":"Zixiang","m4fname":"","m1lname":"Zhao","m3fname":"","description":"Objective: To use consciousness to achieve control in AR device or personal computer to slide and examine pictures.
Innovation: Combine an EEG headset with an AR device or personal computer to achieve consciousness control without hand movement.
Capabilities: Use the mind to interact with AR or personal computer software without hand movement.

Why important: AR devices are known for their immersive experience; using handheld controllers reduces the immersive feeling and also causes fatigue. AR is also hugely important in the biomedical field: if we want to free doctors' hands while they perform surgery using an AR device, then mind control is important. ","uni":"zz2721","language":"Python, Matlab, Kaggle, JupyterLab","pid":"202205-10","m4uni":"","analytics":"Analytics & Algorithms: CSP, LDA, GAN, Conventional NN, binary cross entropy loss function.
System modules: pyautogui","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Dataset1: Glioma MRI images are from Masoud Nickparvar's Brain Tumor MRI Dataset; the source can be found at the link (https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset).
This dataset is a combination of the following three datasets :
figshare
SARTAJ dataset
Br35H
This dataset contains 7022 human brain MRI images, which are classified into 4 classes: glioma, meningioma, no tumor, and pituitary. The no tumor class images were taken from the Br35H dataset. The SARTAJ dataset has a problem: its glioma class images are not categorized correctly. I realized this from the results of other people's work as well as the different models I trained, which is why I deleted the images in this folder and used the images on the figshare site instead. From this dataset, we only used the glioma images.

Dataset2: The EEG data are collected from the subject's own EEG waves using a Muse 2 headset.

Support: our dataset can be used with JupyterLab, Kaggle, and MATLAB ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Graph Neural Network for Predict Stock Market","timestring":"Fri Dec 20 00:10:59 2024","m1uni":"td2803","m2lname":"Si","m1fname":"TING","m4fname":"","m1lname":"DONG","m3fname":"","description":"This project focuses on predicting future stock trading volumes as a critical step in forecasting stock market trends and dynamics. By utilizing advanced distributed computing and deep learning techniques, the project aims to address the following objectives:

Improve Stock Market Predictability:
Predict future stock volumes for multiple companies using historical trading data.
Leverage predicted volumes to gain insights into market movements and trends.

Integrate Panel Data Analysis:
Utilize panel data (multiple time series) to model interdependencies between stocks.
Enable richer feature extraction by incorporating relationships across companies, industries, and time.

Handle Large-Scale Financial Data:
Use the yfinance library to retrieve monthly stock trading volumes for a wide range of companies, including technology (e.g., MSFT, AAPL), automotive (e.g., TSLA, NIO), and financial (e.g., JPM, MA).
Process and aggregate this data using PySpark to ensure scalability and efficiency for real-world applications.

Develop Scalable AI Solutions:
Implement a Graph Attention Network (GATv2) to capture graph-structured relationships among stocks.
Use distributed computing to manage and train models on large-scale data efficiently.

Provide Real-Time Insights:
Design a system that delivers real-time predictions and visualizations of model performance.
Equip stakeholders with tools to monitor, analyze, and act on market predictions dynamically.

By addressing these goals, the project demonstrates the feasibility and effectiveness of combining financial data analysis, graph-based deep learning models, and distributed computing for robust and scalable stock market prediction systems.","uni":"td2803, ms6817","language":"This project uses Python for data processing, backend development (Flask), and deep learning (PyTorch, PyG). HTML/CSS/JavaScript are used for the frontend, while PySpark handles distributed data aggregation. Matplotlib is used for visualization, with execution on a local GPU and PySpark’s scalable environment.","pid":"202412-13","m4uni":"","analytics":"This project implemented several advanced analytics, algorithms, system modules, and visualizations to process and model financial data efficiently:

Analytics and Data Processing:

Data Aggregation:
Used PySpark for distributed data aggregation and preprocessing.
Applied the join_df() function to merge multi-class stock trading volumes based on time dimensions (YEAR, MONTH).
Ensured no missing values or NaN entries, maintaining data consistency.

Data Normalization:
Standardized financial data to ensure uniform scaling for training deep learning models.

Algorithms:

Graph Attention Network (GATv2):
Built using PyTorch Geometric to model relationships between stocks as a graph structure.
Multi-head attention captured complex dependencies among stock trading volumes.

Optimizer:
Stochastic Gradient Descent (SGD) with Momentum (0.9) to accelerate convergence and improve training stability.
Applied gradient clipping to prevent gradient explosion.

Loss Function:
Mean Squared Error (MSE) to evaluate the prediction accuracy for stock volumes.

System Modules:

Backend APIs:
Developed using Flask to serve data and real-time visualizations dynamically.
APIs like /draw_loss and /draw_plot supported live monitoring of model performance.

Frontend Integration:
Enabled feature selection, real-time data updates, and dynamic visualizations using HTML, CSS, and JavaScript.

Visualizations:

Loss Curves:
Visualized training and testing loss trends to monitor model convergence.

Correlation Matrix:
Heatmap to display inter-stock dependencies and relationships.

Class Graph:
Represented stock relationships in a graph structure, with edge weights indicating correlation strength.

Dynamic Updates:
Generated high-resolution SVG images with Matplotlib and updated visualizations every 500ms for real-time insights.

This combination of advanced analytics, efficient algorithms, and robust visualizations enabled the system to handle large-scale data effectively and predict future stock volumes with high accuracy.

","m4lname":"","industry":"Information","m3lname":"","dataset":"The core dataset used in this project consists of stock trading volumes retrieved from the yfinance library. It includes monthly trading volumes of various companies across different industries. Below are the detailed descriptions:

Data Source:
Stock trading data collected using the yfinance library, covering major companies from different sectors such as:
Technology: MSFT (Microsoft), AAPL (Apple), GOOG (Google), NVDA (NVIDIA).
Automotive: TSLA (Tesla), NIO (NIO), F (Ford), GM (General Motors).
Finance: JPM (JPMorgan Chase), MA (Mastercard), GS (Goldman Sachs).

Data Structure:
CLASS_ID: Identifies the stock category (e.g., MSFT, AAPL).
CHANNEL_ID: Represents the data source (fixed at 90.0, indicating the stock exchange).
YEAR & MONTH: Indicate the time dimension when trades occurred.
UNITS: Monthly trading volume (numeric value).

Time Range:
Data spans from the earliest recorded date for each stock to December 2024.

Data Frequency:
Monthly trading data aggregated to ensure consistency and completeness.

Data Preprocessing:
Removed all missing values and invalid records to ensure data quality and integrity.
Applied PySpark’s distributed computing framework for data cleaning, aggregation, and transformation, creating an efficient and scalable dataset format.

Dataset Objective:
Evaluate the framework’s ability to process large-scale financial data, predict future stock trading volumes, and validate the model’s performance in real-world scenarios.

Applicability:
Beyond stock trading data, the framework supports various panel data types such as sales data, insurance records, and other large-scale time-series datasets.

By using this dataset, the project effectively demonstrates the power of distributed computing and deep learning in handling and analyzing large-scale financial time-series data.","m2uni":"ms6817","m2fname":"Mingyu","m3uni":""},{"projectname":"Anime Recommendation System","timestring":"Sat Dec 18 03:26:03 2021","m1uni":"ht2568","m2lname":"Wei","m1fname":"Hanwei","m4fname":"","m1lname":"Tang","m3fname":"Yifan","description":"Anime usually refers to animation produced in Japan. The anime industry has been developed quickly since last century, watching anime has been a popular choice for people of all ages in the whole world. Although some titles are loved by nearly everyone, most anime can only attract limited amount of audience. Everyone has different taste on anime, finding the suitable anime to watch is important and sometimes difficult for anime fans. To solve this problem, we developed an anime recommendation system. We picked up our data from myanimelist.

Our main goal is to make anime recommendations for users, and we provide two different choices. The first is to make a content-based prediction of the general score of each anime. The other is to make personalized recommendations.","uni":"ht2568","language":"Python, PySpark, Jupyter, GCP and some modules (Numpy, Pandas, Tensorflow, Keras, etc.)","pid":"202112-13","m4uni":"","analytics":"We used a KNN model for content-based recommendation and compared a linear regression model, a random forest model, a decision tree model, and a gradient-boosted tree model in predicting the general score of each anime. The linear regression model achieved the best accuracy, 83%, according to our test. The other choice is to make personalized recommendations based on Collaborative Filtering, by looking for users who share the same rating patterns as the target user and making predictions based on what these similar users like. In this part, we used a deep learning neural network to train the model and predict users’ personal ratings for each anime. The best neural network model we obtained reached 48.9% accuracy within 10 epochs.","m4lname":"","industry":"Media","m3lname":"Wang","dataset":"User data are usually encrypted and cannot be read. However, existing users’ preference data are available on
myanimelist, and we retrieved them using the Jikan API. We encoded the usernames ourselves.

For anime info, we scraped Anime id, Name, Score, Genres, Type, Episodes, Premiered, Studios,
Source, Rating, Members. For users' info, we got encoded user ids, score for each animes and watching status.","m2uni":"zw2781","m2fname":"Zijun","m3uni":"yw3744"},{"projectname":"Analysis of US-China Trade War","timestring":"Sat Dec 22 07:47:33 2018","m1uni":"bs3118","m2lname":"Abrams","m1fname":"Bhavya","m4fname":"","m1lname":"Shahi","m3fname":"Hariharan","description":"Governments frequently impose taxes or duties to be paid on particular classes of imports or exports to/from its country. These governments usually tend to charge higher tariffs for imports from other countries compared to the tariffs on exports, thus increasing the price of those goods. This ensures the country’s economic growth and also makes sure that the citizens buy products of the land.
On July 6th, 2018, the US Government levied tariffs on $34 billion worth of Chinese goods being imported to the country. The Chinese government retaliated by imposing a similar 25% tariff on $34 billion worth of US goods being imported to China.
This situation has been escalating for a while now with both countries planning to impose tariffs on goods worth $200 billion.
Impacts of this have already been witnessed in developing countries with increasing prices of goods and oil in particular.
Our main objective for this project is to better understand the worldwide perception of this Trade War and analyze the public sentiment behind it. This will enable us to assess the current socio-political landscape, help understand the reasons behind this so-called “misunderstanding”, and predict its effect on stock prices.
The Goldstein Scale captures the theoretical potential impact that a certain event will have on the stability of a country. On December 2nd, 2018, the US and China agreed upon a 90-day truce. This produced a clear increase in the Goldstein scale. On the very next day, President Donald Trump threatened to restart the trade war if talks with China failed. This had a very negative impact on the Goldstein Scale, further impacting the economy. On the 8th, the two countries agreed upon a US-China trade war deal to suspend new trade tariffs, increasing the Goldstein scale.
This is a small example of the effect that comments, tweets, and other public statements have on the economy and the socio-political landscape as a whole.","uni":"bs3118","language":"Python, Spark, Google Cloud Platform, MLLib","pid":"201812-14","m4uni":"","analytics":"A linear regression model is used to predict the stock value on a particular day. The features given are the tones provided by GDELT, the sentiment extracted through Vader, and the vectorized TF-IDF themes. The Stock Price is given as the label vector. The RMSE and R2 values are calculated to test the accuracy of the model.
The GDELT dataset is first queried using BigQuery and SQL like commands. Each row in the dataset consists of the article URL along with its date. This date is used to map the Dow Jones value to the articles. The URL is used to extract the text of the article and Vader is used for sentiment analysis. All of this is done via a Spark pipeline.
The TF-IDF features are calculated for the Themes per article, then the Vector Assembler combines all the features into a feature vector and finally the linear regressor is trained and tested. This is also done through a Spark pipeline
","m4lname":"","industry":"Finance","m3lname":"Raju","dataset":"The GDELT master dataset in our analysis is a BigQuery table with over 250 million data points in 2018 alone. The dataset is updated every day based on current events. Each event has attributes such as event ID, date attributes, and event action attributes, including eventbasecode, Goldsteinscale, and avgtone which looks at the average “tone” of all documents containing one or more mentions of a particular event. Each event also includes geographic references as well. GDELT event records are stored in an expanded version of the dyadic CAMEO format, capturing two actors and the action performed by Actor1 upon Actor2. A wide array of variables break out the raw CAMEO actor codes into their respective fields to make it easier to interact with the data.
The GDELT dataset consists of three main tables: the Event table, Event Mentions, and the Knowledge Graph. We query the Knowledge Graph to get information about articles that have been written about the trade war. Each row of the dataset consists of various attributes such as Actors, Locations, Organisations etc. which tell us what people, locations, and organisations are mentioned in the article. The articles also have Themes attributed to them which tell us what each article is about. For example, an article talking about the effect of the trade war on global currencies will have Themes such as ECON_WORLDCURRENCIES, ECON_TRADE_DISPUTE etc. The dataset from the GDELT database consists of articles written since the beginning of this year (January 1st, 2018) about the trade war.
","m2uni":"cea2151","m2fname":"Charlotte","m3uni":"hr2458"},{"projectname":"Romantic Partner Recommender based on Speed Dating Experiment","timestring":"Sat Dec 22 22:59:03 2018","m1uni":"sj2909","m2lname":"Xu","m1fname":"Tallis Shih-Ying","m4fname":"","m1lname":"Jeng","m3fname":"Min","description":"Most dating recommender is based on online dating sites that make use of virtual user profiles portrayed by the users themselves. While speed dating collects real-life face-to-face dating interactions among people, the data is rarely applied in dating recommender in any form. This project aims to explore the research gap. The project presents a novel dating recommender which combines speed dating study with dating recommender system. Speed dating data suggests that user profile portrayed by themselves may not accurately reflect their likeability as a potential romantic partner. We introduce an approach of extracting objective evaluation based on the objective ratings given by the dating partners in the speed dating events to construct an objective profile library. Due to the low match rate in speed dating, there is a high class imbalance on the match label in the speed dating dataset. The project applies SMOTE oversampling to mitigate the class imbalance issue. A random forest regression-based reciprocal recommender is presented. Experiment results confirm the effectiveness of the proposed approach. The combination of the objective profile library, SMOTE oversampling, and a random forest regression-based reciprocal recommender achieves a match prediction accuracy of 92.19%, outperforming existing benchmark algorithms. The result not only validates the effectiveness of an objective profile library extracted from face-to-face speed dating events in recommender systems but also sheds light on the great potential of the study of speed dating being applied in dating recommender. 
The work can be used to improve the efficiency of speed dating in terms of match rate and may also be incorporated into online dating sites, which are the core of the largest and fastest-growing segment of the dating service industry.","uni":"sj2909","language":"Python3, javascript, HTML, Spark, scikit learn, Keras, Tensorflow, d3.js ","pid":"201812-10","m4uni":"","analytics":"The three core components of the dating recommender include the extraction of objective evaluation for the construction of an objective profile library, SMOTE (Synthetic Minority Oversampling TEchnique) oversampling for imbalance mitigation, and a random forest regression-based reciprocal recommender. Other regression models including logistic regression, a neural network, and XGBoost are also implemented. The visualization of the match network is realized with HTML web pages and d3.js. ","m4lname":"","industry":"Information","m3lname":"Fu","dataset":"The dataset used in the project is a questionnaire collection from speed dating experiments during the period of 2002 to 2004, collected by Columbia Business School. The dataset contains 8378 rows and 195 columns, including participants’ demographics, interests, attribute scores, etc. The dataset is publicly available at https://www.kaggle.com/annavictoria/speed-dating-experiment","m2uni":"yx2489","m2fname":"Yijia","m3uni":"mf3200"},{"projectname":"Improving Rating Score and Conducting Feature Extraction on Amazon Reviews","timestring":"Sun Dec 23 05:52:30 2018","m1uni":"sb4027","m2lname":"Ramesh","m1fname":"Soundarya","m4fname":"","m1lname":"Balasubramani","m3fname":"Arpita","description":"Objective & Capabilities: While reviews on e-commerce sites have been a major source of information for customers to choose products, they also tend to be too much data to digest. Topical classification and sentiment classification are proposed to be used in information classification. 
But classical algorithms lack the ability to comprehend what a reviewer liked or disliked. Our approach focuses on the part of each review that describes features of the product using objectivity classifiers and POS tagging, and picks out the pros and cons using semantic orientation of descriptive words. We also improve the rating given for a particular product by looking at the overall sentiment of its reviews.

Innovation: Much of our analysis plan is inspired by the ‘Feature-based Customer Review Mining’ of data extracted from buy.com by Wang and Ren (May 2007) as part of their NLP course at Stanford University in Spring 2007. Even though their work is over a decade old now, it uses novel techniques in feature extraction that turned out to be quite useful for our analysis as well. However, the novelty of our approach lies in extracting nouns from the processed data, creating n-grams from the text and finally calculating the Semantic Orientation of the nearest adjectives.

Why is this important: Amazon alone is responsible for almost 50% of the e-commerce sales as of July 2018. Its biggest competitor so far is eBay, which garnered only 6.6% of the online sales during the same time period. One of the drivers of this growth is the reviews left by consumers on Amazon. However, for a vast majority of the products, either the number of reviews is too large, the reviews are too long, or both. This poses an apparent difficulty in synthesizing all of the information to make an informed decision. Our goal is to analyze all the reviews for each product, extract valuable information and present it in a succinct manner.","uni":"sb4027","language":"Language (Python, Spark, CSS, HTML, MySQL), Platforms (Hadoop, Jupyter Notebook) ","pid":"201812-2","m4uni":"","analytics":"Algorithms:
1. Subjectivity and Objectivity classifier
2. Parts of Speech tagging
3. Pointwise Mutual Information - Information Retrieval
4. Semantic Orientation Analyzer
5. N-gram Creation

For Google Search Engine API (for PMI-IR): Created a custom search engine and obtained the results by using an HTTP GET request to the URI.

For Website: Flask micro-web framework, CSS and HTML along with SQL to display the final results as an end product.

Analytics: Length of Review vs Helpfulness, Length of Review vs Rating, Average Rating across products, Distribution of pros and cons across reviews

Visualization: Length of Review vs Helpfulness, Length of Review vs Rating, Average Rating across products","m4lname":"","industry":"Retail","m3lname":"Shah","dataset":"We used the Amazon Product Data posted by Julian McAuley from UCSD. This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

We had emailed Julian to get access to the complete dataset.","m2uni":"sr3440","m2fname":"Srivatsan ","m3uni":"as5451"},{"projectname":"TripWise: A Multi-Agent AI Travel Planning Assistant","timestring":"Tue May 5 20:24:27 2026","m1uni":"zw3099","m2lname":"Chen","m1fname":"Zhimei","m4fname":"","m1lname":"Wang","m3fname":"Kayla","description":"TripWise is a multi-agent AI travel planning system that transforms a single free-form natural language request into a complete, actionable trip plan — including a day-by-day itinerary, real venue recommendations, round-trip flight options, and an all-in budget estimate. The core objective is to replace the fragmented, multi-app travel planning experience with one coherent agentic pipeline that reasons, searches, plans, and self-corrects end-to-end.

The key innovations are threefold. First, we decompose the planning task into eleven specialized agents with strict data contracts, each scoped to one job, rather than relying on a single monolithic prompt. Second, we introduce a hybrid model architecture: a large cloud reasoning model (Cerebras gpt-oss-120b) handles all world-knowledge reasoning and tool use, while a custom fine-tuned LoRA model (tripwise-mlx-bf16, published on HuggingFace) handles the one structured output task it was trained for — schema-conformant itinerary JSON. Third, we build in automated quality control through a Critic agent and a smart Revision Router that classifies user edits into text, structural, or budget-tier changes, triggering the appropriate depth of replanning rather than a blanket rewrite.

This work matters because travel planning is a microcosm of a broader unsolved problem: tasks that require chaining retrieval, reasoning, structured generation, and user interaction are exactly where today's single-turn AI tools fall short. TripWise demonstrates a concrete, deployable architecture for this class of problem — one where open-weight models, targeted fine-tuning, and careful agent decomposition together produce results competitive with expensive proprietary APIs, at a fraction of the cost. The system is live, the fine-tuned model is open on HuggingFace, and the pipeline design is directly transferable to any domain that requires turning vague user intent into a structured, multi-step plan.","uni":"zw3099","language":"Languages: Python (backend, training, evaluation), TypeScript (frontend) Backend: Python 3.12, FastAPI, asyncio — multi-agent pipeline with SSE streaming Frontend: Next.js 16 (App Router), Tailwind CSS 3, TypeScript ML / Training: PyTorch 2.10 (CUDA 12.8), Hugging Face transformers, peft, trl, accelerate — LoRA fine-tuning with DDP via torchrun Model Serving: Cerebras API (gpt-oss-120b) for orchestration; oMLX on Apple Silicon for the fine-tuned itinerary model (tripwise-mlx-bf16) Infrastructure: Ubuntu VPS, nginx, systemd, Cloudflare Access + cloudflared tunnel, Let's Encrypt TLS, Vast.ai (GPU rental for training) External APIs: Tavily (web search), HuggingFace Hub (model publishing)","pid":"202605-14","m4uni":"","analytics":"TripWise implements several distinct algorithmic and analytical components across the pipeline. At the orchestration layer, all non-itinerary agents use a tool-calling agentic loop — an iterative LLM inference loop that decides when to invoke external tools (Tavily web search, Python subprocess for arithmetic), processes tool results, and continues reasoning until a final structured JSON output is produced. 
The multi-agent decomposition itself is an architectural algorithm: eleven agents with explicit dependency ordering, parallel execution where independent (research legs run concurrently via asyncio.gather), and a critic-driven replan loop that iterates route planning, itinerary generation, and quality scoring up to two retries based on a 0–10 scoring rubric with hard pass/fail threshold.

For the itinerary model, the core algorithm is LoRA (Low-Rank Adaptation) fine-tuning — a parameter-efficient method that injects trainable low-rank matrices (r=16, alpha=32) into the query, key, value, and output projection layers of a transformer, updating roughly 0.1% of total parameters while keeping the base model frozen. Training used SFT (Supervised Fine-Tuning) via trl.SFTTrainer with cosine learning rate scheduling, gradient checkpointing, and DDP (Distributed Data Parallel) across two GPUs. Evaluation was conducted on 10 held-out examples using five hard deterministic metrics: JSON validity, schema completeness, day-count correctness, per-day field completeness, and place-grounding accuracy — all run under greedy decoding for reproducibility. The revision router implements a three-way text classification to dispatch user change requests to the appropriate replanning depth, and the budget agent uses structured arithmetic computed by a Python subprocess tool to aggregate per-category cost buckets into daily and total trip estimates.","m4lname":"","industry":"Information","m3lname":"Jiang","dataset":"The primary dataset used in this project is travel_finetune_examples_1_200.jsonl, a custom-constructed supervised fine-tuning dataset of 199 chat-format examples created specifically for this project and committed to the repository. It is not a public dataset — each example was hand-crafted to ensure internal consistency across four components: a structured preferences object (destination, trip length, budget level, pace, interests, constraints), a set of handpicked candidate places and restaurants, a day-by-day route group assignment, and a target JSON itinerary in our exact output schema. The dataset was split into 189 training examples and 10 held-out evaluation examples. 
Destinations span cities across Japan, Europe, Southeast Asia, and the United States, with trip lengths ranging from 2 to 14 days, covering solo and group travelers across all four budget tiers (low, medium, high, luxury) and a wide range of interest profiles including food, culture, nature, adventure, and urban exploration.

Beyond the fine-tuning dataset, TripWise ingests live data at runtime through the Tavily web search API, which grounds the research, arrival, budget, and candidate-enrichment agents in real-world, up-to-date information — real hotel names and prices, actual flight route options, and current attraction details. This means the system is not limited to any fixed set of destinations or venues; it can support any city or region that Tavily can retrieve results for. In principle, retraining the itinerary model on examples covering additional destination types, languages, or specialized trip formats (e.g. business travel, accessibility-focused itineraries, multi-month backpacking) would extend the model's coverage without requiring any changes to the pipeline architecture, since the itinerary agent sits behind a strict and stable input/output contract.

","m2uni":"zc2856","m2fname":"Zheli","m3uni":"nj2560"},{"projectname":"Arbitrary Aspect Identification, Extraction, and Ranking","timestring":"Sat Dec 14 00:01:20 2019","m1uni":"alb2307","m2lname":"Jouan","m1fname":"Austin ","m4fname":"","m1lname":"Bell","m3fname":"","description":"Goal: Identify, extract, and rank the importance of product aspects (i.e., a feature or attribute of a product) for each category on Amazon to better understand reasons for consumer purchases

For each product review, we extract all relevant aspects. Once extracted, aspects are clustered such that semantically similar aspects are in the same group, so that we can assign a single representative name to each aspect group. We are then able to measure the reviewer's sentiment toward each aspect and then compute a weight for how much each aspect contributes to a consumer's overall opinion. These weights can then be utilized to rank and compare aspects.

With this research, we answer inferential questions such as:
- Which aspects contribute most to positive opinions towards specific products?
- Which aspects correlate most with likelihood of purchase within a particular product category?

Through leveraging open text product reviews, we seek to:
- Reduce potential bias of respondents found in survey analysis
- Reduce cost and time compared to controlled experiments

While simultaneously developing a framework that is highly scalable and replicable

","uni":"alb2307","language":"Python, Pyspark, FastAI","pid":"201912-36","m4uni":"","analytics":"All completed leveraging pyspark to scale to >100m observations

Aspect extraction:
- dependency parsing trees and Part-of-speech tagging
- linguistic rules to extract aspects

Assign names to aspects:
- clustering of glove embeddings
- tf-idf to select words

Sentiment analysis
- CNN-BiLSTM model sentiment analysis
- glove embeddings

Aspect ranking
- Linear regression
- variant of expectation maximization algorithm (probabilistic ranking)

Visualized charts in python ","m4lname":"","industry":"Retail","m3lname":"","dataset":"Our team leveraged two datasets:
1) Amazon Product Reviews
- Product reviews from May 1996 - July 2014
- Includes: review text, rating, and product ID
- 142.8 million observations
- 24 broad product categories with many more subcategories

2) Metadata
- Includes: product ID, category, and price
- 9.4 million products
- Links to review data via product ID

Datasets provided by Julian McAuley and his team at UCSD. For access, follow the directions on this webpage: http://jmcauley.ucsd.edu/data/amazon/

While our dataset leveraged Amazon product reviews, our analysis was developed independently of the data source and can be applied to any unstructured product reviews","m2uni":"cj2567","m2fname":"Cedric","m3uni":""},{"projectname":"Deep Audio Understanding","timestring":"Thu May 2 19:21:23 2024","m1uni":"tw2903","m2lname":"","m1fname":"Tianyan","m4fname":"","m1lname":"Wu","m3fname":"","description":"Speech-to-text conversion","uni":"tw2903","language":"python and html","pid":"202405-2","m4uni":"","analytics":"ICA,RNN,MLP","m4lname":"","industry":"Information","m3lname":"","dataset":"https://www.openslr.org/12
LibriSpeech
It is one of the most popular open-source speech-to-text datasets. It consists of 1,000 hours of English speech suitable for training and evaluating speech recognition systems.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Application of Machine Learning in Baseball","timestring":"Sat Dec 22 02:16:13 2018","m1uni":"yz3284","m2lname":"Cui","m1fname":"Yuhan","m4fname":"","m1lname":"Zha","m3fname":"Leo","description":"Machine learning is widely used in baseball prediction. This study aims to construct a classification model for the prediction of award-winning players in order to reveal some potential hidden future baseball stars from a large pool of players. In addition, this study creates a career peak prediction model for the team managers to apply during the player selection process in order to predict whether the players have passed their career peak. Furthermore, the study proposes a salary prediction model for the players to evaluate their current contracts on whether they are being underpaid. Lastly, the study performs unsupervised machine learning techniques in categorizing different pitchers. ","uni":"yz3284","language":"Python, HTML & CSS, and Google Cloud Platform","pid":"201812-34","m4uni":"","analytics":"The study makes predictions of baseball salary, career peak, and award-winning using machine learning models. Classification models include Logistic Regression, Random Forest, and XGBoost. Regression models include Random Forest and XGBoost. In addition, the study also performs unsupervised machine learning techniques such as k-means clustering and PCA in categorizing different pitchers.
A web application is constructed to deliver complex statistical analytics and machine learning algorithms. ","m4lname":"","industry":"Information","m3lname":"Lam","dataset":"Lahman’s Baseball Database is the primary data source of this study. This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2017 with 28 datasets. It was obtained from SeanLahman.com.
In addition, Statcast is also used in this study. It was obtained from the pybaseball module.
Another dataset used is MLB salaries, which obtained from USA TODAY.","m2uni":"wc2619","m2fname":"Wanting","m3uni":"lkl2129"},{"projectname":"Understanding the Correlation Between Multiple Modality Physiological Data and Task Performance Under Multi-User Virtual Reality Distributed Control Task ","timestring":"Thu Dec 15 17:36:20 2022","m1uni":"yq2284","m2lname":"","m1fname":"Yinuo","m4fname":"","m1lname":"Qin","m3fname":"","description":"Assessing and tracking the physiological and cognitive states of multiple individuals interacting in virtual environments is of increasing interest to the virtual reality (VR) community. In this paper, I describe a team-based VR task, which I term the Apollo Distributed Control Task (ADCT), where individuals, via the single independent degree-of-freedom control and limited environmental views, must work together to guide a virtual spacecraft back to Earth. Novel to the experiment is that 1) I simultaneously collect multiple physiological measures including electroencephalography (EEG), pupillometry, speech signals, and individual actions, 2) I regulate the type of communication between the teammates, and 3) I modulate the difficulty of the task. Focusing on the analysis of pupil dynamics, which have been linked to a number of cognitive and physiological processes such as arousal, cognitive control, and individual, I find that pupil diameter and EEG changes are predictive of intermediate team successes throughout the task. The effect cannot be explained by potential confounds due to luminance changes or pupil accommodation during the task. I find that pupil dynamics and EEG under team-based VR tasks offer the potential to infer cognitive and physiological states related to task performance and teammate interaction, which further proves that Yerkes-Dodson Law remains in team-based VR tasks on multiple data modalities. 
","uni":"yq2284","language":"Python, Matlab","pid":"202212-42","m4uni":"","analytics":"I have tested the Pearson Product-Moment Correlation, Spearman's Rank-Order Correlation, and Granger Causality on different data modalities because they are common statistical models. I also tested Multilayer Perceptron (MLP) and Long Short-Term Memory (LSTM) models as high-performance machine learning models. Some basic statistical analyses are implemented such as Analysis of Variance (ANOVA) and p-values to test if the findings are statistically significant. Most of the visualizations are completed by using Matplotlib in Python. EEGs are visualized using EEGLab in Matlab and MNE in Python. VR environment was created by the Unreal engine. The visualizations include the virtual environment, correlations of different data modalities, and other statistical analyses. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset is a multi-modality dataset collected by myself. During a group of participants participating in the VR experiment, I simultaneously collect all subjects' EEG, pupillometry, speech, and VR controller inputs. The dataset is not public and it is identical. I am planning to make this dataset a publicly available dataset in the future after deidentifying all the EEG and speech recordings. My software can process any dataset containing multiple physiological data modalities. Any dataset collected using LabStreamingLayer, a software that most researchers use to collect biological data, can be easily preprocessed using my software. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"OneML","timestring":"Thu Dec 23 06:32:41 2021","m1uni":"ah3816","m2lname":"Tekur","m1fname":"Anusha","m4fname":"","m1lname":"Holla","m3fname":"Devica","description":"The aim of our work is to build a spark-based data platform solution for ml tasks – OneML. 
Our modules support many options to customize an ML lifecycle with different types of sampling, preprocessing, machine learning algorithms etc. OneML serves as a data-agnostic platform for varied supervised machine learning requirements. Our platform abstracts the engineering complexity for the ML lifecycle and facilitates rapid provisioning of ML models. Our platform has custom transformations that are not currently offered by Spark MLlib like Stratified Sampling, Stemming, Lemmatization etc. The system design of our project is open-sourced to make it accessible to all.","uni":"ah3816","language":"Spark, Python, Flask, HTML, CSS, GCP","pid":"202112-34","m4uni":"","analytics":"ROC curves were done on the ML models.
The systems used were Spark-based data readers and job submission to Dataproc","m4lname":"","industry":"Information","m3lname":"Verma","dataset":"Tested on: https://www.kaggle.com/c/titanic
The project can run on any dataset","m2uni":"at3584","m2fname":"Aswin","m3uni":"dv2465"},{"projectname":"Concert Tickets Prices Analysis","timestring":"Sat Dec 20 05:20:19 2025","m1uni":"yl5646","m2lname":"Liu","m1fname":"Yutong","m4fname":"","m1lname":"Liu","m3fname":"Yuesheng","description":"Build a regression model to estimate ticket prices for upcoming live events and provide insights into the influence of key features. Innovations include combining Ticketmaster event data with Spotify artist popularity, an end-to-end Spark ML pipeline for large-scale event data, and interactive web-based visualization. This toolkit helps event organizers and users better understand pricing patterns and make informed decisions.","uni":"yl5646","language":"Python, Spark on Google Cloud Dataproc","pid":"202512-28","m4uni":"","analytics":"We implemented six regression models: Linear Regression, Lasso, Elastic Net, Decision Tree, Random Forest, and Gradient Boosting. The system includes data preprocessing, feature engineering, model training and evaluation, feature importance extraction, and interactive visualization of predicted ticket prices.","m4lname":"","industry":"Media","m3lname":"Hao","dataset":"We used a Ticketmaster dataset containing historical ticket prices and event metadata. Initially, we attempted to extract data from the Ticketmaster API, but only a limited number of events had usable price information, so we used an existing dataset instead. The dataset comprises both numeric features (such as ticket price, artist popularity, and event date) and categorical features (including venue, genre, and city). The pipeline can also support other event datasets with similar attributes. 
","m2uni":"ml5312","m2fname":"Minghao","m3uni":"yh3777"},{"projectname":"Twitter Streaming Analysis: Bitcoin Price Prediction based on Twitter Sentiment and Time Series","timestring":"Fri Dec 17 23:12:41 2021","m1uni":"rp3016","m2lname":"Bi","m1fname":"Ran","m4fname":"","m1lname":"Pan","m3fname":"Jiaxi","description":"The cryptocurrency market is receiving increasing attention, and the prediction of its prices is a vital financial and technological problem. Additionally, in recent years, there has been increasing interest in using sentiment information from social media to analyze and solve real-world problems. Twitter, as one of the most popular social media platforms, offers APIs for developers to grab real-time Twitter data easily. In this project, we focus on the possible relationship between Twitter sentiment, time-series data, and cryptocurrency. We build regression and deep learning models to predict bitcoin price.
Our goal is to design and develop a prediction model to predict the trend of bitcoin price, using the public sentiments on social media. With certain crypto keywords and user sentiments on social media, we develop predictive models to detect the price fluctuations of cryptocurrency.
This goal has practical significance. Sentiment values calculated from real-time Twitter streams can reflect true public sentiments. A good predictive model can help people anticipate possible price fluctuations based on real-time Twitter streams.","uni":"rp3016","language":"Python, SQL, Spark, GCP ","pid":"202112-27","m4uni":"","analytics":"Deep Learning Model: LSTM ","m4lname":"","industry":"Finance","m3lname":"Zhou","dataset":" For this project, we mainly work with Twitter’s streaming data related to bitcoin and with bitcoin’s price data. The Twitter datasets and bitcoin price are all near-real-time streaming data. For Twitter data, we extract it through Twitter’s developer API, and for bitcoin data, we download it from Bitstamp’s API. We want to understand how Twitter and market sentiments influence the price of bitcoin, so both datasets were streamed and downloaded to GCP at the same time. Both datasets cover the period from 2021/11/17 to 2021/12/12.
","m2uni":"yb2500","m2fname":"Ying","m3uni":"jz3280"},{"projectname":"Financial Sentiment and Market Correlation Analysis","timestring":"Fri Dec 19 18:57:21 2025","m1uni":"jl6850","m2lname":"Meng","m1fname":"Jinsong","m4fname":"","m1lname":"Liu","m3fname":"Ziyang","description":"The core objective of this project is to tackle the challenge of \"unimodal information scarcity\" in financial time series forecasting by constructing a high-precision multimodal financial forecasting system based on FinBERT and LSTM. Addressing the pain point where traditional technical analysis relies solely on numerical historical data, leading to lagged responses to sudden public opinion shifts, we are committed to exploring a Deep Dual-Stream Architecture. This system aims to leverage FinBERT, the state-of-the-art pre-trained model in the NLP field, to precisely quantify unstructured market sentiment and perform orthogonal fusion with LSTM temporal features, thereby significantly enhancing the model's robustness and prediction accuracy in complex market environments. First, we propose a heterogeneous feature alignment mechanism that successfully maps high-dimensional sparse Semantic Features and low-dimensional continuous Market Features into the same temporal space. Second, the system integrates the FinBERT financial large language model, possessing fine-grained sentiment perception capabilities for complex financial contexts, enabling the capture of latent long and short signals from seemingly neutral news. Finally, by utilizing the LSTM gating mechanism for temporal modeling of the dual-stream data, the system possesses the end-to-end predictive capability to \"remember\" long-term market trends and \"keenly perceive\" short-term public opinion shocks. On this basis, we only briefly constructed the necessary historical dataset as the underlying support for model input to ensure experimental reproducibility. 
The significance of this study lies in validating the practical value of large language models in the field of quantitative finance. We demonstrated that by introducing high-quality sentiment factors generated by FinBERT, the \"information cocoon\" of traditional time-series models can be effectively broken, significantly reducing prediction error (RMSE). This provides a general, scalable multimodal modeling paradigm for next-generation intelligent quantitative trading systems, showcasing how AI can more accurately understand financial markets by comprehending human language.","uni":"jl6850","language":"Python","pid":"202512-26","m4uni":"","analytics":"The project implements a comprehensive analytics and modeling pipeline for sentiment-enhanced stock price forecasting. From an analytics perspective, the system performs correlation analysis between news sentiment and stock returns, including same-day and next-day Pearson correlation coefficients, statistical significance testing, and extreme sentiment categorization to assess predictive validity.

The core algorithms include a FinBERT-based sentiment analysis model for transforming unstructured financial news into quantitative sentiment scores, and Long Short-Term Memory (LSTM) networks for time-series prediction. Two LSTM variants are trained and compared: a baseline model using only OHLCV features and a sentiment-enhanced model that incorporates aggregated daily sentiment. Model training employs supervised learning with mean squared error loss, Adam optimization, normalization, and early stopping to ensure fair and robust evaluation.

System modules include a custom GNews data ingestion and historical backfilling module using a time-slicing strategy, a semantic denoising pipeline based on configurable blacklist filtering, a sentiment aggregation module aligning news data with trading days, and a model training and evaluation module. An interactive web-based dashboard serves as the visualization layer, featuring time-series price vs. prediction plots, performance metric panels (MAE, RMSE, R², MAPE), sentiment–return scatter plots, correlation summary tables, and model architecture comparisons for exploratory analysis and result communication.","m4lname":"","industry":"Finance","m3lname":"Lin","dataset":"This system is powered by a high-precision, Dual-Stream Historical Dataset specifically curated for multimodal financial forecasting.
1. Market Data Stream: Includes fully aligned Daily OHLCV (Open, High, Low, Close, Volume) records sourced from Yahoo Finance, ensuring a rigorous quantitative baseline.
2. Sentiment Data Stream: Features a proprietary historical news corpus retrieved via a custom Time-Sliced GNews Scraper. This text data has undergone a Semantic De-noising Pipeline to filter out non-financial noise and is quantified into daily sentiment scores using the FinBERT large language model.
The underlying data engine is designed to support historical retrieval and sentiment quantification for other major liquid assets, including MSFT, AMZN, GOOGL, META, and others.","m2uni":"jm5876","m2fname":"Jiawei","m3uni":"zl3477"},{"projectname":"StockMood: A Sentimental Analysis Tool for Stock Performance Prediction","timestring":"Sat Dec 18 03:44:41 2021","m1uni":"yl4860","m2lname":"Zhou ","m1fname":"Yunhang","m4fname":"","m1lname":"Lin","m3fname":"Xiaoran","description":"Motivation and objectives:
Twitter’s emotional content and sentiment expression can reflect the stock market’s performance.
Also, according to some research, the stock price over the following several days is strongly correlated with Twitter sentiment.
So the goal of our project is to build a tool that helps investors make wiser decisions based on our sentiment analysis of tweets.

Innovations:
Explored 4 different open source NLP models (NLTK, Textblob, Flair, BERT) together with Affective Norms for English Words (ANEW) model for tweets analysis
Provided different kinds of data visualizations to convince our users and help them make wise decisions, like Emotional Dot Graph, Heatmap, Word Cloud Graph, Emotion Count Bar Chart, etc.
Potential business value: because relevant applications are lacking, we will be one of the first few applications to provide important market sentiment insight to traders and ordinary investors.

Capabilities:
The ability to process large amounts of data, use APIs to analyze emotions, train neural networks, visualize, and build websites.

Why important?
On one hand, few studies use Twitter sentiment analysis as one of the factors to predict the rise and fall of stocks. Most predictions are based solely on historical stock prices.
On the other hand, there is currently no user-friendly website that provides this kind of sentiment information. Also, current financial sentiment analysis may not be available to ordinary people. There may be some APIs or web tools for funds or institutions to use, but ordinary people can only obtain limited information.

","uni":"yl4860","language":"Python, Tensorflow, Keras, Flask, SQLite3, JavaScript, HTML, CSS, D3.js.","pid":"202112-29","m4uni":"","analytics":"First, we compare several sentiment analysis tools (NLTK, Textblob, Flair, BERT) to get the sentiment score of tweets. By analyzing their polarization and correlation, we treat their weighted sum as the input to the prediction model.
Second, we compare several time series prediction models (Linear Regression, SVR, GBR, SGD, ElasticNet, LSTM) and output their F1 scores as the result. We finally chose LSTM as the candidate model. We analyze the influence of different input types and scales on prediction performance. The more specific and larger the input, the better the result.
Third, the system is divided into a front end and a back end.
In the front end, we implemented several visualizations, such as Emotional Dot Graph, Heatmap, Word Cloud Graph, Timeline Bar Chart, Tweet Display, and Stock Table.
","m4lname":"","industry":"Finance","m3lname":"Yuan","dataset":"Our dataset is divided into two parts: Twitter sentiment text and stock price records.
Twitter data is gathered by scraping with the Python snscrape package. The date range is set from 10/01/2020 to 09/30/2021. Only the top 100 tweets will be collected. The total number of tweets is 529,881.
Stock price data is downloaded directly from Yahoo Finance. A total of 18 stock names are used as search keywords.
Our dataset is not public; we collected it entirely ourselves.
Any other stock data and Twitter sentiment text can be added to our model for prediction.

","m2uni":"hz2700","m2fname":"Hongyi","m3uni":"xy2508"},{"projectname":"Fashion AI: Attributes Recognition of Apparel","timestring":"Sat Dec 22 18:20:26 2018","m1uni":"jb4076","m2lname":"Wang","m1fname":"Jingyuan","m4fname":"","m1lname":"Bian","m3fname":"Xiuqi","description":"Detecting detailed apparel attributes is a topic receiving increasing attentions, which also has wide applications. Recent year, the demands of online shopping for fashion items grow a lot, which raises problems such as the sellers provide information not consistent with the real stuff, different sellers have inconsistent understandings of apparel styles. An automatic fashion attributes detection system can help overcome these problems by providing precise and consistent taggings or descriptions of apparel from their pictures. This technique can be applied to various areas such as apparel image searching, navigating tagging, and mix-and-match recommendation, etc.","uni":"jb4076","language":"Python, Keras, Ubuntu 16.04","pid":"201812-03","m4uni":"","analytics":"We use the same structure for all 8 networks. After input layer, we have preprocess layer to process data. And then, we use InceptionV3 or Resnet50 to construct our CNN model. The last layer is a fully-connected layer with softmax as the activation function. It takes the output from our base model as input where a 50% dropout is added to prevent overfitting. 
The final output is a number corresponding to a specific category in that attribute dimension.","m4lname":"","industry":"Media","m3lname":"Shao","dataset":"We could also support any other image data to do fashion details classification.","m2uni":"ww2468","m2fname":"Wenshan","m3uni":"xs2327"},{"projectname":"Automatic Storytelling","timestring":"Sat May 16 04:20:25 2020","m1uni":"jh4162","m2lname":"","m1fname":"Jiaqi","m4fname":"","m1lname":"He","m3fname":"","description":"Automated story generation is the problem of automatically selecting a sequence of events, actions, or words that can be told as a story. Our goal is to monitor overwhelming real-time information on social media and automatically generate a story by selecting related information. First, I use a generative method, training an event2sentence RNN model that translates events back into human language to write a whole new story. Then we use extractive methods, such as vector ranking, sorting, clustering, topic classification, and keyword scoring, to construct a story. ","uni":"jh4162","language":"Python","pid":"202005-07","m4uni":"","analytics":"First is the abstractive approach: we train an event2sentence RNN model, which translates events back into natural language.
Then we use TF-IDF to embed the original sentences and cluster them.
We train an LSTM model to classify the topic of every paragraph, and then construct a story.
","m4lname":"","industry":"Media","m3lname":"","dataset":"The dataset used to train sentence generation comes from Wikipedia movie plots. The training/validation split is approximately 9:1. To reduce training time, I use 10000 training examples and 1000 validation examples to train the model.
The data we use to construct a new story is from Twitter. We use the Twitter API to grab recent tweets from specific accounts: new York times, breaking news, Cnn-brk, WSJ-breaking-news, ABS-news-Live, sky newsbreak, TWC-breaking. In this way, we ensure the quality of the tweets so that meaningless sentences do not appear. From every account, we collect the 30 newest tweets.
The data used to train the topic classifier is from BBC News. There are 2,225 news articles in the data, belonging to 5 topics. I split them into a training set and a validation set according to the parameter we set earlier: 80% for training and 20% for validation. The training set has 1780 articles and the validation set has 445.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"AI Trader (CN/HK)","timestring":"Fri May 15 22:02:12 2020","m1uni":"jj3078","m2lname":"Pan","m1fname":"Junyang","m4fname":"","m1lname":"Jiang","m3fname":"","description":"Artificial intelligence shows its power to analyze time-series data more efficiently than human beings and to perform stock trading tasks automatically without people’s intervention. In this project, we designed algorithms mainly based on reinforcement learning DQN. The RL based system can simulate human expert traders’ analyzing and trading. The agent’s actions are mapped from states by a deep q-learning network. In addition, we did external prediction to use predicted price as new features, adjusted exploration rate and designed stop-loss strategy to improve the trading performance. An optional and referable rank list of most predictable stocks is also provided, which is a marginal yield of external prediction. The final AI trader has the ability to learn experience from past years of data of 3 selected stocks and trade automatically through given new data with considerable earning ability. ","uni":"jj3078","language":"Python","pid":"202005-15","m4uni":"","analytics":"We researched and designed Deep Q-Network (DQN), a state of art in reinforcement learning scope, to solve the auto trading problems. DQN agents in RL can learn from the experience of changing states, which includes different days of price changing and how it changes according to different actions that the agent took, such as buy, sell, or hold certain indexes of stocks. ","m4lname":"","industry":"Finance","m3lname":"","dataset":"We acquired data from Tushare API. There are 3789 stocks from two China stock markets. The data contains historical daily price (open, high, low, close), volume and amount, etc, starting from 1997. We preprocessed the data by truncating time periods, merging different selected stocks by dates matching. 
The main features we used are stock code, stock close price, volume and date.","m2uni":"yp2524","m2fname":"Yiping","m3uni":""},{"projectname":"Towards Interpretable Precision HEOR","timestring":"Thu May 14 14:41:11 2020","m1uni":"alb2307","m2lname":"","m1fname":"Austin","m4fname":"","m1lname":"Bell","m3fname":"","description":"I developed an interactive application for exploring subpopulations for patients diagnosed with Chronic Liver Disease. I utilize Patient Similarity Networks to facilitate quick exploration of patient clusters. All models are contained within an application such that the user can then immediately evaluate each patient cluster on downstream outcomes. ","uni":"alb2307","language":"Python, R, HTML, CSS","pid":"202005-19","m4uni":"","analytics":"Patient Similarity Networks (Graph modelling approach), LightGBM model for 30-day mortality prediction, Cox Regression and Survival Analysis. Visualized using the R-Shiny platform. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"I utilized the MIMIC-III database for all analyses. This is a public research database that can be accessed after a short review from this website: https://mimic.physionet.org/gettingstarted/access/

This dataset includes ICU and EMR related data for 40,000 patients between 2001 and 2012. It includes approximately 3,000 patients diagnosed with some form of chronic liver disease. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"2020 Democrat Nominee Popularity Analysis and Visualization","timestring":"Fri Dec 13 03:31:44 2019","m1uni":"jg4179","m2lname":"Ye","m1fname":"Junwei","m4fname":"","m1lname":"Gong","m3fname":"","description":"The question we try to see here is which Democrat candidate is most likely to be selected for the party nominee to compete with President Trump in the 2020 presidential election. This is a trending project as more and more people will try to get their hands on in 2020, and we are just one among many to try to take on the challenge.","uni":"jg4179","language":"Python, HTML, CSS and Java script, ","pid":"201912-45","m4uni":"","analytics":"We used Naive Bayes Classifier to train a sentiment analytics model on 1.6 M tweets. Then we used this model built for the twitter sentiment analysis to see how people feel about each candidate. For visualization, we used a map visualization and an interactive pie and bar charts.","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"To train our Naive Bayes Model, we used tweet dataset acquired from kaggle, which contains more than 1.6 M data. Other than that, dataset we used for this visualization project all comes from the live streaming tweets.","m2uni":"hy2610","m2fname":"Hongzhe","m3uni":""},{"projectname":"Autonomous Learning of Physical Environment through Neural Tree Search","timestring":"Fri May 5 04:50:40 2023","m1uni":"bf2504","m2lname":"","m1fname":"Bowen","m4fname":"","m1lname":"Fang","m3fname":"","description":"A scalable learning approach that efficiently visits as much of the unknown environment to build the map and localize itself.

Belongs to active SLAM, which allows a robot to operate in an initially unknown environment
Learning provides flexibility to the choice of input modalities
Ability to transfer the knowledge
Model-based search improves the efficiency

The goal of the project is to present a solution that enables the agents to efficiently visit as much of the unknown environment as possible while building a map and localizing themselves.

As for the algorithm, I chose MuZero for this problem. It uses Monte Carlo Tree Search (MCTS) as its search algorithm and directly learns a model of the environment dynamics from interaction with the unknown environment during simulation. Further, it is a scalable model that was first used on board games, where it reached the same performance level as AlphaZero. It can also play Atari games, which are considered hard for AlphaZero and model-based algorithms, at a state-of-the-art level.","uni":"bf2504","language":"Python, GCP VM Linux, Habitat, Jax, Muax","pid":"202305-19","m4uni":"","analytics":"Implemented MuZero algorithm, proposed muax library to provide easy-to-use MuZero algorithm implementation.

Implemented ResNet backbone, EfficientZero neural network.

Implemented Neural Tree Search model.

Generated gif from interacting with the simulated environments.","m4lname":"","industry":"Information","m3lname":"","dataset":"Gibson dataset and Habitat point goal navigation tasks. After signing up for permission, directly download data from official website. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Dream Simulation","timestring":"Fri Apr 23 06:10:57 2021","m1uni":"wxh2000","m2lname":"","m1fname":"William","m4fname":"","m1lname":"Huang","m3fname":"","description":"As one of the vital brain functions, sleep plays an essential role in maintaining good physical and mental health. A good night’s sleep not only relaxes the body but also creates an interesting byproduct — dreams. Although it is impossible to fully simulate all brain functions, this paper tries to explore methods to simulate dreams in video formats. The goal of this paper is to set a cornerstone to inspire other developers to develop algorithms in the area of automatic video generation and video content transformation. ","uni":"wxh2000","language":"Python","pid":"202105-7","m4uni":"","analytics":"An automatic video generation algorithm was implemented in Python language. The algorithm utilizes existing CycleGAN model to perform image to image translation, and the algorithm produces an output in a video format for visualization. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Due to the nature of the project, no suitable public video datasets were available.
The project uses custom movie stock footage collected from the internet as the dataset. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Quantifying Correlation Between Weather and Twitter Public Sentiment","timestring":"Fri Dec 17 23:02:43 2021","m1uni":"fca2118","m2lname":"Zhou","m1fname":"Frederico","m4fname":"","m1lname":"Araujo","m3fname":"Uday","description":"Objectives:
• Create regression models to predict the ratio of positive tweets to all tweets in real time given the current weather condition
• Compare models by their error

Innovations:
• Used an abstract weather parameter rather than more concrete values (e.g. percentage cloud cover, humidity)
• Collected data from one city rather than a bigger geographical location
• Classified tweets using a voting system
• Used streaming data rather than static data

Capabilities:
• Predict aggregated Twitter sentiment based on weather condition
• Show prediction in real time

Spurred by the growth of platforms like Twitter, a lot of companies and media organizations are now using tech platforms as service providers, so there is potential for companies to analyse data available from these platforms in order to create better market strategies. This type of sentiment analysis will help brands define goals and find their audience more efficiently.","uni":"fca2118","language":"Python, JavaScript + CSS + HTML","pid":"202112-49","m4uni":"","analytics":"• Visualizations: RMSE plots of machine learning models, Pie chart of positivity ratio
• Modules: scikit-learn, pandas, numpy, datetime, nltk, textblob, vader, etc.
• Analytics: Airflow Scheduler, Google Cloud Platform.
• Algorithms: regression algorithms - Linear Regression, Ridge Regression, Gradient Boosting, SVR, Random Forest, Ada Boost","m4lname":"","industry":"Social Science-Government","m3lname":"Mukhija","dataset":"The dataset consists of twitter data and weather data, obtained using the Twitter API and OpenWeatherMap API, respectively.
It is not historical data; rather, we collect it in real time.","m2uni":"yz4175","m2fname":"Yewen","m3uni":"um2158"},{"projectname":"Fake News Detection","timestring":"Sat Dec 22 05:08:44 2018","m1uni":"yd2466","m2lname":"Ji","m1fname":"Yuanchu","m4fname":"","m1lname":"Dang","m3fname":"Wei","description":"For this project, we choose to narrow down to a specific formulation, that is, given a news headline and a body text, our objective is to determine whether the body discusses, agrees or disagrees with, or is totally unrelated to the headline text. In terms of modeling, we first implement and configure vanilla feed-forward neural networks and achieve solid benchmark accuracy. On top of that, we further experiment with recurrent neural networks, especially long short-term memory units that are known to perform well with sequential data such as texts. Last but certainly not least, we apply BERT - a recent pre-trained language model developed by Google AI that can be fine-tuned - to this classification problem and achieve decent training outcomes. Separately, using Twitter's streaming API and our trained classifiers, we build an interface that validates the truthfulness of tweets in real time based on the predefined ground truth body texts. ","uni":"yd2466","language":"Python, JavaScript, TensorFlow, Keras, Elephas, Google Cloud, Twitter Streaming API","pid":"201812-43","m4uni":"","analytics":"Feed-forward neural networks, Recurrent Neural Networks (RNN), Long Short-Term Memory Units (LSTM), Bidirectional Encoder Representations from Transformers (BERT)","m4lname":"","industry":"Information","m3lname":"Luo","dataset":"The dataset we use comes from Stage 1 of the Fake News Challenge. In total, the dataset comprises 1683 article bodies, 49972 headlines, and ground truth labels corresponding to each headline-body pair. The size of the joined dataframe is over 100MB. We split the entire dataset into training and validation according to an 85% to 15% ratio. 
","m2uni":"yj2466","m2fname":"Yanmin","m3uni":"wl2671"},{"projectname":"Yelp Help!","timestring":"Sat Dec 18 05:00:13 2021","m1uni":"bdd2115","m2lname":"Yerram","m1fname":"Bhavin","m4fname":"","m1lname":"Dhedhi","m3fname":"Vasanth","description":"When talking about Big Data Analytics, often the data generation part is overlooked and it is assumed that there exists an API or a readily available dataset that we can use for the analysis. But we wanted to start from scratch and build a tool that can scrape data from Yelp.com and then use the scraped data to generate Pseudo-menus and display the results in an interactive website where any user can input a city name and zipcode to get restaurants and their corresponding Pseudo-menus.

Pseudo-menu: This is not the typical menu that we see in restaurants, which might have their own obscure names for common food items. For example, \"Eggs can’t be cheese\" is a burger, so our Pseudo-menu will list just one menu item, \"burger\".

The motivation for doing this was threefold.
- First, most of the previous work done using Yelp data (albeit a publicly available dataset) relates to recommendation systems. So, we wanted to do something different.
- Second, when users talk about what they ate at a restaurant, they do not typically use the obscure names from the restaurant menu; they talk about the actual thing they ate.
- Third, from a business point of view, if you are a supplier you might want to know what food items people are consuming at which location to gain potential customers. Pseudo-menus can help in doing just that.","uni":"bdd2115","language":"Python, Scrapy, spaCy, NLTK, GCP, PostgreSQL, HTML, CSS, JavaScript, Flask","pid":"202112-51","m4uni":"","analytics":"We trained an NER (Named Entity Recognition) model using menu items from menus uploaded by restaurants on Yelp.com. We created a corpus of ~53000 menu items to annotate the reviews for training the NER model.

The menu items were preprocessed before including them in the corpus. The preprocessing steps included the following:
• Convert everything to lowercase
• Remove “Quantitative measure terms” like 20 oz., 2 tb spoons, 3 pieces of, etc.
• Normalize the string (convert any HTML-specific characters to plain English)
• Strip excessive whitespace
• Remove very common menu items (for this we created another corpus of very frequent items like water, eggs, etc.; the menus uploaded to Yelp also contained many garbage values like \"ab\", \"a p\", etc.)
• Remove wine names
• Remove names longer than 52 characters

Once we had a trained model, we used it to predict menu items in each review and aggregated the count of each item to display to the users.

We used Flask to create APIs and HTML, CSS and vanilla JavaScript to build the frontend to display the results. The user selects City and Zipcode to get a list of restaurants and can click on any restaurant to show the menu associated with that restaurant.","m4lname":"","industry":"Retail","m3lname":"Margabandhu","dataset":"The dataset we used was scraped using the scrapers we implemented. The dataset includes information about businesses, reviews and menus.

Business scraper fetches the following information - Business name, Rating, Business URL, Number of reviews, Location, Categories, Address.

Reviews scraper fetches the following information - Review ID, Review, Date, Rating, Business name, Business ID, Business Alias, Business Location, Menu URLs.

Menus scraper scrapes the following information - menu.

We scraped data for Phoenix, AZ and Chicago, IL
• Phoenix, AZ ~ 400,000 Reviews, 1,420 Restaurants
• Chicago, IL ~ 450,000 Reviews, 1,495 Restaurants

**The scrapers should be used to scrape data for academic purposes only**","m2uni":"ky2482","m2fname":"Karthik Datta","m3uni":"vm2656"},{"projectname":"Building Cancer Knowledge Graph For Diagnostic Medicine","timestring":"Fri May 15 20:17:13 2020","m1uni":"rh2962","m2lname":"Zhang","m1fname":"Runyu","m4fname":"","m1lname":"Hao","m3fname":"","description":"A Knowledge Graph (KG) is a new way of presenting information that has some advantages over traditional forms such as structured databases. In this project, we try different methods to build a Cancer Knowledge Graph that contains entities and relationships which can be used in applications like information querying and Q&A systems. We also build a Q&A system that can answer simple questions related to cancers.

Due to the high volume of data, professionals may have difficulty retrieving all related information about a certain cancer. A product that helps doctors solve these problems would be valuable. This is the reason that we want to build a Cancer Knowledge Graph.","uni":"rh2962","language":"We used Python as the programming language and Neo4j as the graph database.","pid":"202005-22","m4uni":"","analytics":"As for the text mining part, we utilized NER and RE via a BERT model. As for the QA system, we used an AC tree to do pattern matching.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"We scraped raw data from a medical website named Malacards. There are 3287 records of different cancers in total.","m2uni":"lz2684","m2fname":"Lixin","m3uni":""},{"projectname":"Manhattan subway analysis and visualization","timestring":"Fri Dec 15 20:09:02 2023","m1uni":"yz4309","m2lname":"Yan","m1fname":"Yufan","m4fname":"","m1lname":"Zhang","m3fname":"Shutong","description":"Finding insight into people’s behavior is key to the success of recommendation systems. By analyzing New York City’s subway data, we could find out general patterns in the load of the subways and the flow of people. This pattern could potentially be used to improve public transportation in New York City. For example, we realized that there are a lot more people taking the subway on workdays compared to weekends. Therefore, there could be more frequent trains and buses on weekdays. To be more precise, there are more people on Tuesday, Wednesday, and Thursday. So, there could be more processed adjustments to the train schedule. What’s more, throughput for different times of the day could be used to have dynamic schedule adjustments for the train.
Last but not least, visualizing the entry and exit throughput of each station directly shows where people get on and off the train at each time. Stations that many people enter in the morning and exit in the evening indicate residential areas, and vice versa for business areas. With this flow analysis, we could customize the recommendation and advertisement distribution in these areas. In residential areas, people buy groceries, shop, and seek entertainment. In business areas, there would be more business-related purchases. The motivation for solving this problem is boundless and there are a lot of financially driven reasons behind it.","uni":"yz4309","language":"Python, JavaScript, HTML","pid":"202312-16","m4uni":"","analytics":"analytics and algorithms:
Operator Reordering
Operator reordering rearranges the order of operations to minimize the amount of data processed in subsequent steps. In our project:
1) Filtered the data to select only the records where the borough is Manhattan and remove any irrelevant data.
2) Used the filtered dataset to perform additional data processing steps
Redundancy Elimination
Redundancy elimination minimizes the number of redundant calculations performed during data processing. In our project:
1) Calculated the number of people entering and exiting each station for each date and time.
2) Used this intermediate dataset for multiple purposes, including identifying the top 10 busiest stations and calculating the total number of people at all stations for a given date or time.

system modules:
Like a traditional three-tier web architecture, our application is divided into data, back-end, and front-end layers. In the data layer, we use two datasets. The turnstile dataset provides the passenger flow information of different stations at different times, and the station coordinates dataset provides the latitude and longitude of each station, on which we base the map. In the back-end layer, Flask is our framework, and we use Spark for data preprocessing: we join the two datasets, filter and aggregate them, and order the data by different features. We then use a RESTful API over HTTP to communicate with the front end. In the front-end layer, the website shows a dynamic map to the users. To interact with it, users can choose a particular timestamp and the map will update automatically.

visualization:
Our project website consists of three parts.
The upper left corner contains a pie chart: Daily Throughput for Each Day of the Week and a line chart: Throughput at Different Times.
The lower left corner shows the ten stations with the most traffic during the current period, which changes with the change of time period.
On the right is a map where we visualize the number of people entering and exiting each site during the current time period. Each site is composed of two circles. We use blue to represent entry and red to represent exit. The map can be zoomed in or out using the mouse wheel to see each site more clearly.
The bottom side contains the current time period and a “Start Streaming” button. After pressing it, the time will automatically start to increase, and the map will change as the time increases.","m4lname":"","industry":"Information","m3lname":"Zhang","dataset":"The data set for one week in April was tested. The data set was downloaded from the MTA official website.","m2uni":"ly2593","m2fname":"Lihui","m3uni":"sz3101"},{"projectname":"Expedia Hotel Recommendation","timestring":"Fri Dec 13 16:45:29 2019","m1uni":"jh3870","m2lname":"Dong","m1fname":"Junyu","m4fname":"","m1lname":"He","m3fname":"Zhuangyu","description":"Choosing a hotel is always a disturbing task. We try to predict/recommend the hotel that users are going to choose using a variety of features, e.g. destinations, lengths of stay, etc.","uni":"jh3870","language":"Python, Google Compute Platform, etc.","pid":"201912-37","m4uni":"","analytics":"Random Forest, Spark, Scikit-learn, Pandas, Matplotlib, etc.","m4lname":"","industry":"Retail","m3lname":"Ren","dataset":"The dataset contains user and hotel information from Expedia. It comes from Kaggle and has a total size of 4.2GB:

train.csv (3.79GB): the training set, which consists of logs of customer behavior. Columns include time of search, location of customer, location of hotel, number of guests, number of nights, checkin/checkout dates, id of the hotel, whether the hotel was booked, etc.

destinations.csv (131MB): attributes of hotel clusters, which contains features extracted from hotel reviews text, represented in float numbers.

test.csv (263MB): the test set
","m2uni":"cd3032","m2fname":"Can","m3uni":"zr2209"},{"projectname":"Metavis: Connect Datasets for Better Analysis","timestring":"Fri Dec 17 20:57:34 2021","m1uni":"zc2549","m2lname":"Liu","m1fname":"Zhengyi","m4fname":"","m1lname":"Chen","m3fname":"Guangyu","description":"The original idea of this project comes from the paper called Support the Data Enthusiast: Challenges for Next-Generation Data-Analysis Systems. It is published by engineers in the visualization software industry. The idea is that next-generation data analysis should not only support interactivity but also suggest relevant datasets on the fly to enrich the process.

With the proliferation of open data platforms and free visualization tools, analysts have gone from a state of data scarcity to information overload. In the face of the growing dataverses, the traditional paradigm of searching falls short in two respects. First, if you do not know a dataset exists or have not realized its relevance, you cannot search for it. Second, search results show datasets as standalone objects, missing the opportunity to link datasets together to solve complex problems. Only by connecting the datasets can the business value be maximized.

Seven years after the paper was published, we have many great open data platforms, such as Harvard Dataverse and NYC Open Data, which host thousands of open datasets. But they only support searching and filtering in the database.

Our objective is to build a smart recommendation system that helps people explore more. We reviewed the literature on recommendation and EDA systems and on semantic linkage of big data, which helped us design our recommendation system. Inspired by the connection network covered in class, we also want to develop an interactive network interface so that we can visually explore datasets.
","uni":"zc2549","language":"Python (Dash specifically for realization of backend and frontend); Google Cloud Platform for data storage and server setup.","pid":"202112-15","m4uni":"","analytics":"We build a recommendation model on four dimensions of metadata from NYC Open Data. First, we use word embedding on column names, and feature extraction on metadata, then calculate cosine similarity between every pair of datasets; And we use one-hot encoding and calculate the number of common tags as a measure of similarity among datasets; Finally, we do LDA topic modeling on dataset description to produce clustering on datasets. The recommendation system works in the following way: by default when the user chooses a dataset as input, most similar datasets with the highest rankings will be selected and this may include datasets under different topics; Moreover, if the user prefers to dive into the specific topic, the model will filter out dataset in the same topic cluster.

The system also supports dynamic weight adjustment, which personalizes recommendations based on the user’s selection. The system can therefore tailor its recommendations to different purposes and support more ways of exploring the data.

Besides recommendation results, one visualization we have presented is the network of different datasets, which demonstrates how they are correlated with each other.","m4lname":"","industry":"Information","m3lname":"Wu","dataset":"Data we use in the project are metadata of all data published on NYC Open Data.

Data sourcing and management in the project are divided into the following parts. First, metadata of all the 3,405 datasets on NYC Open Data was fetched through the Python Client of Socrata API. During the process, we can store useful attributes of datasets, including update frequency, attachment information, and dataset type, which will be utilized in subsequent data sourcing workflow. With update frequency, we schedule data jobs on Airflow to periodically pull datasets based on their update frequency, which is the most efficient way of ensuring the datasets are up-to-date for our system. Then we programmatically download detailed dataset attachments, most of which are in Excel format, and transform them into the standardized data dictionary. Finally, using dataset type information, we preprocess and store tabular files and geospatial layers separately with high-performance libraries such as Dask and Sedona.

So far we include only metadata from NYC Open Data, but data from other organizations with similar metadata structures can be incorporated into our system so that we can truly connect datasets from multiple sources.","m2uni":"sl4835","m2fname":"Shiyue","m3uni":"gw2415"},{"projectname":"A Insider Threat Detection System","timestring":"Fri May 12 23:13:23 2023","m1uni":"hy2762","m2lname":"","m1fname":"Hongzhe","m4fname":"","m1lname":"You","m3fname":"","description":"Implement a real-time intrusion detection system with a user-friendly interface","uni":"hy2762","language":"Python, JavaScript","pid":"202305-20","m4uni":"","analytics":"LSTM, Flask, Kafka","m4lname":"","industry":"Information","m3lname":"","dataset":"CSE-CIC-IDS2018","m2uni":"","m2fname":"","m3uni":""},{"projectname":"What’s Cooking: Prediction of Cuisine Type","timestring":"Sat Dec 22 04:12:40 2018","m1uni":"bn2300","m2lname":"Zhang","m1fname":"Bohan","m4fname":"","m1lname":"Niu","m3fname":"Xinwei","description":"Today, more and more people choose to eat out. They find cooking at home inconvenient and often have no idea what to cook. Eating out offers more cuisines and more kinds of dishes. So, we want to make cooking easier by recommending recipes of various cuisine types on an easy-to-use webpage. In this project, we analyzed twelve thousand recipes with 20 cuisine types and thousands of ingredients to predict the best-matched cuisine type based on the input ingredients. Finally, we can recommend what to cook based on what people have.","uni":"bn2300","language":"Python/Spark on Jupyter notebook for training; Python Flask Framework and Jinja2 for web development","pid":"201812-39","m4uni":"","analytics":"
Feature Extraction and Transformation:
Word2Vec
TF-IDF

Classification:
Linear SVM with One-vs-Rest Classifier
Multinomial Logistic Regression
Decision Tree
Random Forest

Visualization:
Word Cloud
Network Graph","m4lname":"","industry":"Life Science","m3lname":"Zhang","dataset":"The original data is from Kaggle with 39774 observations. To improve our model and meet the dataset size requirement, we imported two more datasets from Yummly, and our final data has 104981 observations.

The cooking datasets are JSON formatted. The data must be parsed before further operations are performed, after which it is stored in the database. The column fields in our database are:

1. id: unique ID for each recipe
2. cuisine: one of 20 cuisine types: Brazilian, British, cajun_creole, Chinese, Filipino, French, Greek, Indian, Irish, Italian, Jamaican, Japanese, Korean, Mexican, Moroccan, Russian, Southern_us, Spanish, Thai, and Vietnamese
3. ingredients: each recipe has its own set of ingredients","m2uni":"gz2263","m2fname":"Guyu","m3uni":"xz2732"},{"projectname":"Power Flow Optimization using Big Data Techniques","timestring":"Fri May 6 16:21:04 2022","m1uni":"rr3417","m2lname":"","m1fname":"Rohan","m4fname":"","m1lname":"Raghuraman","m3fname":"","description":"The goal of this project was to optimize the power flow in a region under various conditions and to determine whether the optimal power flow problem can be solved faster using machine learning and AI. This is an important task because increased energy demand and the addition of renewable energy and electric vehicles have led to increased challenges in maintaining system stability and economic viability. Sub-optimal power flow conditions could exacerbate climate change. To tackle this task, load data for New England was first scraped, followed by generating load data through different power flow simulations in Siemens PSS/E. Next, the New England load data was fed into FBProphet to generate a prediction of future energy demand. Power flow was predicted by testing different supervised learning algorithms on the simulation data. Domain expertise was then used to determine whether machine learning is a suitable alternative for performing power flow optimization. A dashboard was built to visualize the results. The end goal is that these techniques will contribute to lowering the time and computing energy needed to perform power flow optimization so that the development of the energy landscape will not be hindered.","uni":"rr3417","language":"Python, HTML, CSS, Jupyter Notebook, Flask","pid":"202205-7","m4uni":"","analytics":"Data generation through simulation: Siemens PSS/E power flow
Initial load data collection (2016 to 2022): PyISO
Energy demand forecasting: FBProphet time-series forecasting accounting for US holidays
Power flow prediction: Supervised learning algorithms (decision tree, kNN, linear regression, etc.)
Web dashboard visualization: Backend - Flask, Frontend - HTML/CSS and Plotly for demand forecast visualization","m4lname":"","industry":"Information","m3lname":"","dataset":"1. Hourly electricity load (MWh) data of all zones in New England from 2011 to 2022 - manually scraped from the ISO-NE website (https://www.iso-ne.com/isoexpress/web/reports/load-and-demand)
2. Power flow data simulated using Siemens PSS/E of the IEEE New England 39 bus system for various scenarios.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"FinReasoing AI","timestring":"Tue May 5 20:31:13 2026","m1uni":"pl2965","m2lname":"He","m1fname":"Pengyu","m4fname":"","m1lname":"Luo","m3fname":"Zhe","description":"FinReasoningAI is a financial reasoning system designed to improve large language models’ ability to
answer calculation-heavy financial questions. The project focuses on three major objectives: combining
multiple financial reasoning tasks into one unified pipeline, improving numerical correctness through
fine-tuning and reasoning strategies, and building a practical self-hostable financial QA toolkit.

The main innovation of the project is the integration of supervised fine-tuning, Chain-of-Thought
reasoning, tool-augmented calculation, self-consistency aggregation, and evaluation controls into one
end-to-end financial reasoning workflow. The system is built around Qwen2.5-14B-Instruct with a QLoRA-
first training path, allowing efficient adaptation under limited GPU memory while still supporting
robust inference and evaluation.

This research and toolkit are important because financial reasoning requires more than fluent text
generation. In real financial QA tasks, small numerical mistakes can change the meaning of an answer.
A model must correctly parse financial context, identify relevant numbers, perform arithmetic or
percentage calculations, and output a concise answer. FinReasoningAI explores how fine-tuning, Chain-
of-Thought prompting, and external tools can reduce hallucinated numeric outputs and make financial AI
systems more reliable and practical for future financial analysis workflows.","uni":"pl2965","language":"Python","pid":"202605-13","m4uni":"","analytics":"For analytics and evaluation, the system implemented numeric exact match with tolerance, task-aware
F1, answer parsability, grounding-rate checks, and strict correctness evaluation. The final evaluation
compared Baseline, Fine Tuning, Fine CoT, and Agentic settings on 300 FinQA samples.

The main algorithms and reasoning strategies included QLoRA supervised fine-tuning, Chain-of-Thought
prompting, direct-answer prompting, tool-augmented reasoning, self-consistency sampling, numeric
answer extraction, and grounding validation. The self-consistency module aggregates multiple model outputs
by median-based selection for numerical answers and majority voting for textual answers.","m4lname":"","industry":"Finance","m3lname":"Huang","dataset":"The project used two main public financial reasoning datasets: FinCoT and FinQA. FinCoT was used
mainly for training or supervised fine-tuning, while FinQA was used as an external benchmark for
evaluation.

FinCoT is a financial Chain-of-Thought dataset designed to improve reasoning ability in financial
question answering tasks. It provides financial QA examples with step-by-step reasoning traces.

FinQA is a public benchmark dataset for financial question answering that requires numerical
reasoning over financial reports, including both textual information and tables.","m2uni":"qh2308","m2fname":"Qijun","m3uni":"zh2685"},{"projectname":"Advanced KYC — Customer Behavior Prediction","timestring":"Wed May 11 20:34:04 2022","m1uni":"jh4312","m2lname":"Liu","m1fname":"Jingchao","m4fname":"","m1lname":"Hu","m3fname":"","description":"Customer analysis can do more than drive sales of existing products. By uncovering customer needs, the right analysis can help you develop new products and services, ones your customers may not even know they need. The new product lines you develop in this manner could drive sales and profits even more, helping you build an even better business.","uni":"jh4312","language":"Python, Google Colab, HTML, Flask","pid":"202205-13","m4uni":"","analytics":"We used a recurrent neural network (RNN) model because RNNs are good at capturing non-linear sequences and have an internal state capable of retaining information from previously seen data, making them particularly good at learning data with a temporal element, as in speech recognition, time series prediction, and robot control.","m4lname":"","industry":"Information","m3lname":"","dataset":"We used the Ta Feng Grocery Dataset, which can be found at https://www.kaggle.com/datasets/chiranjivdas09/ta-feng-grocery-dataset. Our model can also support other types of transaction data as long as they include customer ID, transaction date, transaction amount, transaction sales price, and product class. ","m2uni":"pl2804","m2fname":"Peihan","m3uni":""},{"projectname":"Reverse-Complement Aware Dilated CNN-BiGRU with Gated Attention for Regulatory DNA Prediction","timestring":"Tue May 5 00:45:40 2026","m1uni":"ind2109","m2lname":"","m1fname":"Ioannis","m4fname":"","m1lname":"Daras","m3fname":"","description":"This project builds an end-to-end deep learning framework for predicting regulatory genomic activity directly from nucleotide sequences. 
The primary goal is to learn robust sequence representations that generalize across chromosomes and across regulatory tasks. The project’s key innovations are: (1) a controlled experimental pipeline with chromosome-based splits to prevent genomic leakage; (2) improved negative sampling for peak-centered training that reduces shortcut learning via GC-content or low-quality regions; and (3) a reverse-complement consistent neural architecture that combines dilated residual CNN blocks for multi-scale context, bidirectional GRU sequence modeling for dependencies, and gated attention for adaptive feature refinement. These capabilities are important because accurate regulatory prediction from sequence supports gene regulation studies and provides a scalable computational tool for biological discovery and downstream biomedical applications.","uni":"ind2109","language":"Python with PyTorch for deep learning and training. The project is designed to run on Google Colab (GPU optional but recommended) and standard Linux environments. Data processing uses common Python scientific packages (NumPy/Pandas). Results are exported as CSV/JSON and figures as PNG for reporting.","pid":"202605-1","m4uni":"","analytics":"Implemented modules include:
Data pipeline: genome loading, BED parsing, peak-centered window extraction, one-hot encoding.
Improved negative sampling: distance-from-peak constraints, GC-matching, filtering N-heavy windows, multiple negatives per positive.
Models: logistic regression baseline, MLP baseline, CNN baseline; SOTA-inspired architectures (DeepSEA-style CNN, DanQ-style CNN+RNN, Basenji-style dilated CNN); proposed RC-dilated-CNN–BiGRU–gated-attention model with RC fusion (log-sum-exp).
Training: weighted BCEWithLogits, AdamW, OneCycleLR, mixed precision, EMA weights, early stopping.
Evaluation: accuracy, F1, AUROC, AUPRC; threshold selection on validation; benchmark tables and plots.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"We evaluate on two regulatory genomics datasets using the hg19 reference genome:
ATAC-seq chromatin accessibility peak regions (binary classification of accessible vs background sequence windows).
CTCF transcription factor binding peak regions (binary classification of binding vs background sequence windows).
Genomic coordinates are provided as BED files; DNA windows are extracted from the hg19 reference sequence and one-hot encoded. We use chromosome-based train/validation/test splits to evaluate generalization to unseen chromosomes. The framework can also support additional BED-based peak datasets (e.g., other transcription factors or cell-type-specific accessibility tracks) by swapping the input BED file and split configuration.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Daily Financial Risk Aid","timestring":"Fri Dec 16 19:01:54 2022","m1uni":"fr2510","m2lname":"","m1fname":"Fernando","m4fname":"","m1lname":"Rodriguez-Guzmn Jr","m3fname":"","description":"This investment risk aid seeks to explore the feasibility of combining trending world topics, historical stock information, and future stock value predictions to provide a user with the right information for investment decisions leading to investment portfolio growth. Technological automation tools and virtualization techniques are utilized to leverage available datasets through daily scheduled data acquisitions. The collected datasets are processed and visualized to aid an investor in their daily investment decision making process. Overall, this project aims to understand the value of utilizing daily real-time datasets of financial information for investment decisions leading to investment portfolio growth.
","uni":"fr2510","language":"Python, HTML, CSS, JavaScript","pid":"202212-25","m4uni":"","analytics":"A daily historical dataset was successfully acquired, via the yfinance python API, for each investment vehicle dataset. This was followed by the processing of each dataset, training of linear regression models for each investment vehicle. The resulting predictions were visualized on an HTML/CSS webpage via a D3 interactive graph allowing for the observation of daily predicted closing costs vs actual historical closing costs.
","m4lname":"","industry":"Finance","m3lname":"","dataset":"A total of 10 unique datasets are being collected daily for the processing and visualization of relevant financial and global event data.

The first 9 datasets include the stock market historical performance of 3 subsections of investment vehicles:

1 - Three Historically Stable Mutual Funds
2 - Three Historically Profitable & Stable Stocks
3 - Three Fast Growing Stocks

Lastly, daily Twitter data is collected to visualize daily trending world events through the lens of Twitter Hashtags.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Medical Visual Q&A Agents Systems","timestring":"Sat Dec 20 05:05:39 2025","m1uni":"ch4019","m2lname":"Gao","m1fname":"Chengbo","m4fname":"","m1lname":"Huang","m3fname":"Yigang","description":"Our goal is to build agents that can look at a medical image, understand a clinical question, and return an answer that is not only correct, but also medically reliable. Furthermore, we aim to build a system composed of such agents to analyze medical images at scale. This is important because medical imaging is central to modern diagnosis and treatment planning. As imaging volume and complexity continue to grow, there is increasing interest in AI systems that can assist clinicians by answering free-form questions about images and explaining their reasoning in natural language. However, in high-stakes domains such as oncology or radiology, errors and hallucinations can be harmful, so such systems must be not only accurate but also reliable, transparent, and easy to control. ","uni":"ch4019","language":"We use Google's Colab and LangStudio.","pid":"202512-20","m4uni":"","analytics":"We implemented a full pipeline spanning algorithms, analytics, system modules, and visualization. On the modeling side, we explored several multimodal architectures, including separate-encoding baselines (ResNet/CLIP + TinyLLaMA with an MLP projector) and a jointly aligned BLIP backbone with multi-token visual prefix injection. For language alignment, we trained a TinyLLaMA variant with Direct Preference Optimization (DPO) on UltraMedical-Preference to improve answer faithfulness and structure. Analytically, we evaluated models on PathVQA using Exact Match, F1, yes/no accuracy, and qualitative error analysis. 
System-wise, we wrapped the vision–language backbone into a Hugging Face Gradio service and integrated it as a tool endpoint inside an ACE-based multi-agent graph (preprocess node, agent, medical VQA tool, trace collection, and reflection agent). For visualization, we produced demo screenshots of the Gradio UI, ACE execution traces, and metric curves/tables that summarize performance across model variants and alignment settings.","m4lname":"","industry":"Life Science","m3lname":"Meng","dataset":"Our main experiments use the PathVQA dataset, a pathology visual question answering benchmark consisting of histopathology images paired with free-form medical questions and short textual answers. We obtain it directly from Hugging Face and follow the official train/validation/test split. For language alignment, we additionally use the UltraMedical-Preference dataset, which provides medically focused question–answer pairs with preference labels between “chosen” and “rejected” responses.","m2uni":"yg2999","m2fname":"Yufeng","m3uni":"ym3068"},{"projectname":"Risk Analysis and Default Prediction for Taiwan Companies","timestring":"Fri May 15 21:23:56 2020","m1uni":"pt2534","m2lname":"Lai","m1fname":"Pei-Ling","m4fname":"","m1lname":"Tsai","m3fname":"","description":"In this project, we aim at obtaining information from many facets. It is important because the more useful information we get, the more precise the model can achieve. The first question here is: Which data should we get? And which key words should we use for searching?
Financial statements contain many numbers, from which we can compute ratios that represent different facets of a company. Since we need to do default prediction, we choose ratios that reflect a company’s cash flow and ability to repay debt. For news and social media, we target four categories: company names, products, industry, and CEO names. With these categories, we can learn how people think about them. For example, we can read in the news that SSDs have been more popular than disks in recent years. Thus, we can expect companies that produce SSDs to be more lucrative than companies that produce disks. However, this method only works on the presumption that the company is famous enough to be mentioned in many discussions. For those that are not well known, we need another way to evaluate them. Thus, we decided to use the relationships among companies to compensate for the shortage of public posts. For example, for a laptop firm, if the company that sells memory to it performs well this year, then we can reasonably assume that the laptop firm also performs well this year, since this indicates a large demand for laptops.
After obtaining these data, there is another problem: how do we convert the data into concrete numbers for comparison? With this in mind, we adopt sentiment analysis, which gives us scores for how positive or negative a post is.
","uni":"pt2534","language":"Python, GCP","pid":"202005-11","m4uni":"","analytics":"
LSTM model
XGBoost
Crawlers
Data preprocessing module
Data collecting module
Label generating module ","m4lname":"","industry":"Finance","m3lname":"","dataset":"As mentioned in previous section, our data is composed of accounting data, news and PTT posts, and relationship between companies. ","m2uni":"yl4305","m2fname":"Yuan-Hsi","m3uni":""},{"projectname":"Sentiment Analysis of Trending Topic on Social Media","timestring":"Thu Dec 18 05:52:42 2025","m1uni":"yl5717","m2lname":"Liu","m1fname":"Yisu","m4fname":"","m1lname":"Li","m3fname":"","description":"Our goal is to build a scalable, multi-platform social media sentiment analysis system that can continuously collect large-scale, time-aligned text data from multiple sources (YouTube and GitHub), run sentiment analysis with a custom-trained neural sentiment model, and provide an interactive analytics dashboard for exploring long-term sentiment trends. The key innovations/capabilities include: (1) a time-window-based YouTube scraping workflow that ensures even temporal coverage across months; (2) a scalable GitHub issue ingestion pipeline that preserves chronological distribution and supports thousands of texts per run; and (3) an interactive Streamlit-based web interface that supports configurable aggregation windows and real-time visualization. This toolkit is important because it enables research-grade longitudinal sentiment tracking across platforms where public opinions and discussions evolve over time, and it is designed to be extensible for future LLM-based analysis.","uni":"yl5717","language":"Python; Streamlit web app; HuggingFace Transformers/DistilBERT; YouTube Data API + GitHub Search Issues API; pandas/numpy/nltk; GPU-based training (NVIDIA). ","pid":"202512-22","m4uni":"","analytics":"We implemented a modular pipeline with separate modules for data ingestion, preprocessing, sentiment analysis, aggregation, and visualization. 
For ingestion, the YouTube adapter performs month-by-month searches using publishedAfter/publishedBefore to ensure temporal coverage, then retrieves comments through the commentThreads endpoint and converts timestamps to UTC. The GitHub adapter queries the Search Issues API using a “keyword + created:>=start_date” pattern with pagination to collect chronologically distributed issues.

For sentiment modeling, we fine-tuned a DistilBERT-based 3-class classifier (negative/neutral/positive) with standard Transformer tokenization, truncation/padding to a fixed maximum length, AdamW optimization, and macro-F1/accuracy evaluation. For analytics, we aggregate sentiment scores over configurable time windows to produce long-term trend analysis across months. Visualizations include interactive sentiment trend plots over time (per platform/keyword) and model diagnostics such as a confusion matrix, enabling users to compare sentiment dynamics between YouTube and GitHub.","m4lname":"","industry":"Media","m3lname":"","dataset":"We evaluated our system on two categories of datasets: live data collected from platforms and public training data used for model development.

(A) Live/Tested data (YouTube + GitHub):

YouTube: We collect comments from videos whose titles/descriptions match a user-provided keyword query. Videos are searched month-by-month within a configurable look-back horizon (e.g., 6 or 12 months). For each video, we retrieve top-level comments via the YouTube Data API (commentThreads endpoint), keeping timestamps in UTC for time-series analysis.

GitHub: We collect issue texts using the GitHub Search Issues API, filtered by keyword and restricted to issues created within the last N months. Each record includes issue title + body as text.

Data scale (configurable): YouTube up to 500 videos × 500 comments (up to ~250k comments/run, API limits permitting); GitHub up to 10 pages × 100 issues/page, typically capped around 5,000 texts/run.

Unified schema: Each record is normalized to {platform, source_id, text, published_at (UTC), sentiment fields (neg/neu/pos/compound + label)}.

(B) Public training data (for sentiment model):
We used Sentiment140 (public dataset) from Kaggle (Go et al., 2009), containing 1.6M tweets with distant-supervision polarity labels (800k negative, 800k positive). We also incorporated additional text from YouTube news comments and Telegram news channel posts to broaden style/temporal diversity.

Overall, our software supports ingestion and time-aligned sentiment analytics for YouTube comments and GitHub issues, with extensible adapters for more platforms in future work.","m2uni":"ml5141","m2fname":"Mingxuan","m3uni":""},{"projectname":"Customized Movie Recommendation System","timestring":"Sat Dec 21 02:33:43 2019","m1uni":"kl3157","m2lname":"Xu","m1fname":"Kaiwen","m4fname":"","m1lname":"Liu","m3fname":"","description":"Use Python and Spark to build a recommendation system based on K-means clustering and Alternating Least Squares collaborative filtering, with a Flask- and GCP-based web application","uni":"kl3157","language":"Python, Spark, Flask, GCP, HTML, CSS, JavaScript","pid":"201912-24","m4uni":"","analytics":"Recommendation system based on K-means clustering and Alternating Least Squares collaborative filtering
Python and Spark based data processing and analysis
Flask, HTML, CSS, JavaScript based web application with GCP database for visualization and user interaction","m4lname":"","industry":"Information","m3lname":"","dataset":"MovieLens 20M movie ratings. Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019
https://grouplens.org/datasets/movielens/25m/","m2uni":"dx2193","m2fname":"Donghai","m3uni":""},{"projectname":"Relationship among Crimes, Public Commuters and Sentiment in Manhattan","timestring":"Sun Dec 23 02:18:41 2018","m1uni":"zd2212","m2lname":"Lu","m1fname":"Zhicheng","m4fname":"","m1lname":"Ding","m3fname":"Lin","description":"• Crime and public commuting seem to follow a similar pattern. In this project, we aim to identify that pattern, and we expect sentiment to follow a similar trend. Therefore, we explore the relationship among crime, public commuters, and sentiment in Manhattan.
• We found and downloaded the associated datasets (i.e. the Manhattan crime dataset for 2014, the New York City subway and taxi datasets for 2014, and the Twitter dataset for 2014).
• To better evaluate the relationship among these data, we designed three different charts using Vue.js and D3.js to visualize the data hour by hour in an average week:
(1) a heat map chart of crime and public commutes in Manhattan. This plot provides spatial and temporal information about crime and commutes;
(2) a bubble chart of crime and public commutes aggregated by precinct in Manhattan. This plot shows the linear correlation in the data;
(3) several statistical charts showing the trends of crime, commutes, and sentiment, and their correlation, on each day. These charts give a clearer view of how the relationship changes over a given day.
• In the web UI, we also provide a story mode that lets the audience explore the relationship with selected timestamps highlighted.

Innovation & Capabilities:
• We analyzed the relationship among the number of crimes, the number of commuters, and the sentiment of people in Manhattan hour by hour in an average week.
• We designed three different charts (heat map chart, bubble chart, and statistic chart) to demonstrate the relationship behind the data.
• We also created a story mode that highlights the critical dates and times in this project and guides users in exploring the data.
","uni":"zd2212","language":"Python, JavaScript - Platforms: Google Cloud Platform, Pandas, Numpy, Shapely, TextBlob, Vue.js, D3.js ","pid":"201812-9","m4uni":"","analytics":"
• Data Preprocessing: we pre-process the data, which includes extracting useful data and cleaning it up.
(1) Feature Extraction: we need to decide which data in the collected datasets are useful in this project. Since our goal is to explore the relationship among these datasets in Manhattan, we extract geographic location information, timestamps, and the corresponding values.
(2) Data Cleanup: after selecting the useful data from the original datasets, we clean up the data to avoid errors, for example by removing duplicate rows, removing rows with missing data, fixing formatting, and rounding values.

• Data Processing: to generate the final dataset for visualization, we build spatial-temporal indices (i.e. location and time) and the corresponding values for data aggregation. Each dataset requires a different processing procedure.
(1) Crime dataset:
- Spatial: group by precinct number and coordinate, respectively.
- Temporal: we convert timestamp to weekday and hour, then group by weekday and hour, respectively.
(2) Public commuters dataset:
- Spatial: calculate precinct number using coordinate and precinct geojson, then group by precinct number and coordinate, respectively.
- Temporal: we convert timestamp to weekday and hour, then group by weekday and hour, respectively.
(3) Twitter dataset:
- Spatial: filter Manhattan tweets, then calculate sentiment polarity using an NLP-based algorithm.
- Temporal: we convert timestamp to weekday and hour, then group by weekday and hour, respectively.

• Data Visualization: we design three interactive pages to explore the relationship (i.e. heatmap chart, bubble chart, and statistic chart).
(1) Heatmap Chart: the heatmap chart shows the spatial and temporal distribution of crimes and commuters in Manhattan.
(2) Bubble Chart: the bubble chart demonstrates the positive linear correlation using the Pearson correlation.
(3) Statistic Chart: the statistic chart summarizes the overall data trend over one day, which helps establish the relationship among crimes, public commutes, and sentiment.
","m4lname":"","industry":"Transportation","m3lname":"Bai","dataset":"Crime Dataset, MTA Subway Dataset, NYC Yellow Taxis Dataset, Twitter 2014 Dataset

• Crime Dataset. The crime data is from Kaggle: 2014-2015 Crimes reported in all 5 boroughs of New York City. This dataset reports 2014-2015 crimes in all 5 boroughs of New York City and contains 23 fields. The attributes that we use are id, timestamp, coordinate, and precinct number. The total size of this dataset is 53MB. This dataset is further processed into two datasets: (1) processed data1 contains weekday, hour, precinct number, and the number of crimes; (2) processed data2 contains weekday, hour, coordinate, and the number of crimes.

• MTA Subway Dataset. The MTA subway dataset is published on MTA’s official website. Since the crime dataset is from 2014-2015, we downloaded the MTA subway dataset for 2014-2015. The total size of this dataset is about 900MB, and it includes 11 fields. The attributes that we care about are station id, timestamp, number of entries, number of exits, and station coordinate.

• NYC Yellow Taxis Dataset. The yellow taxi dataset covers Manhattan and contains detailed timestamps and passenger counts. We retrieved this dataset from NYC Open Data. It contains 19 fields, and its total size is about 25GB. The attributes that we care about are the timestamp, the number of passengers, the pickup coordinate, and the drop-off coordinate.

• Twitter 2014 Dataset. The Twitter dataset used in this project is extremely large: the total size of the data is 313.7GB. This dataset contains useful information such as timestamp, context, user, and location. To process this dataset, we read the data chunk by chunk because of its large size. Then we clean up the data and use the NLP-based algorithm to calculate sentiment from the context provided. Finally, we aggregate the data by weekday and hour to produce the processed data, which includes weekday, hour, and the numbers of positive, neutral, and negative tweets. It is worth mentioning that the Twitter dataset we found does not cover the whole year; it covers Feb, Mar, Apr, May, Oct, Nov, and Dec of 2014.
","m2uni":"yl4021","m2fname":"Yunan","m3uni":"lb3161"},{"projectname":"Generative AI: Realistic Human Image Creation","timestring":"Thu Dec 19 20:07:02 2024","m1uni":"ds4229","m2lname":"Yang","m1fname":"Delong","m4fname":"","m1lname":"Su","m3fname":"Jialiang","description":"Our goal is to train an AI-powered system that generates realistic digital human images using a deep learning model based on characters and scenes from the popular TV series Breaking Bad. This theme adds a fun and engaging twist for users who love the show.
We are focusing on building a model that can understand prompts related to Breaking Bad and generate accurate, entertaining images. The model will be trained to interpret text descriptions and transform them into relevant visuals.
If it is possible, we plan to build a simple, user-friendly interface in the future so that users can customize their needs easily.
","uni":"ds4229","language":"python, windows","pid":"202412-11","m4uni":"","analytics":"We work with an iterative process.
First, we collect and clean the data; then we load the parameters and the training dataset into kohya to train on the base model, SD. After this, we have our customized LoRA model. We then use this LoRA model, the base model, a prompt, and parameters to generate images. Based on the model's performance, we may collect more and different training data and adjust parameters to train the LoRA again. By repeating this procedure, we finally obtain our customized image-generation LoRA model, which focuses on three characters from Breaking Bad.
","m4lname":"","industry":"Information","m3lname":"Yan","dataset":"Volume: The dataset contains hundreds of images for 3 characters, with dozens of images per character from 5 seasons.
Velocity: Several tens of images per hour are captured manually, depending on the availability of the episodes.
Variety: The dataset includes images with varying poses, facial expressions, lighting conditions, and background settings for each character.

Given there is no existing dataset, we are creating our own by capturing screenshots from the Breaking Bad series.
This process is time-consuming, taking weeks to collect enough images that meet the requirements for training the model.
","m2uni":"sy3185","m2fname":"Silu","m3uni":"jy3331"},{"projectname":"Analysis on Children Learning Performance by Educational Game APP","timestring":"Fri Dec 13 15:14:27 2019","m1uni":"yl4227","m2lname":"Zhang","m1fname":"Yanjun","m4fname":"","m1lname":"Liu","m3fname":"Ruixin","description":"Games and children's education are taking a larger and larger share of the market in today’s world, especially the combination of the two. We therefore focus on data from an educational game app called PBS KIDS Measure Up! In this project, we want to uncover new insights into early childhood education and how media can support learning outcomes. To that end, we build a model to predict 3600 children's performance in 5 assessments based on their game records in this app, and we identify the important features that contribute to children's higher performance in the game.","uni":"yl4227","language":"language: python, javascript, html, css; platform:flask","pid":"201912-17","m4uni":"","analytics":"Analytics and algorithm: Basic data preprocessing and EDA methods using Pandas; oversampling and downsampling; bagging algorithm; train/test split and KFold cross-validation methods using sklearn; tree-based machine learning model in LightGBM.

Visualization: In the front-end, we use HTML, CSS and JavaScript for visualization. The front-end shows the performance of different models on different datasets, indicated by the ROC curve, feature importance, and loss curve; the user can choose freely among the four datasets we generated randomly.

System procedure: In terms of system design, we use Flask as our framework to connect the front-end and back-end. In the back-end, we use Python as our primary language to run the model on the datasets and generate the ROC curve, which is then sent to the web front-end through Flask via Ajax.
","m4lname":"","industry":"Information","m3lname":"Liu","dataset":"The data used in this project is anonymous, tabular data of interactions with an app called PBS KIDS Measure Up! These datasets were collected from Kaggle and include train set.csv, train_label.csv and test.csv. The app is a game-based learning tool for kids, and the dataset includes users' assessment scores and their records through the game.

The train dataset we used includes 11,341,042 rows of event data in JSON format from 303,319 different game sessions across 17,000 installation IDs.","m2uni":"zz2668","m2fname":"Zhichao","m3uni":"rl3063"},{"projectname":"Soccer Game State Reconstruction","timestring":"Sat Dec 21 02:49:02 2024","m1uni":"nl2873","m2lname":"Wang","m1fname":"Nelson","m4fname":"","m1lname":"Lin","m3fname":"Yiu Chung","description":"The key goal of this project is to define and generate a game state of a soccer match using computer vision tools. A significant challenge in sports analytics is the lack of continuous tracking data, which hinders in-depth analysis due to incomplete information for every timestep. Thus, we wish to devise a pipeline to reconstruct real match footage onto a 2D pitch, in order to create a form of live tracking data that can replace the more commonly used (and available) event-based data. To achieve this, we first utilized CV models to scan for key objects to obtain their coordinates in relation to the video, as well as pitch keypoints to compute a homography transformation. Using both sets of detections, we can finally reconstruct the game state in two dimensions.
","uni":"nl2873","language":"Python, YOLO, Google Cloud, Airflow ","pid":"202412-2","m4uni":"","analytics":"Object detection: YOLOv5-detect
Object tracking: Roboflow Supervision
Ball Position Estimation: Linear interpolation
Team Assignment: K-means clustering
Pitch Localization: YOLOv11-pose
Camera Calibration: Homography transformation
Visualizations: OpenCV Image

GCP VM specs: c2-standard-4 machine, 16GB RAM, 100 GB hard-drive","m4lname":"","industry":"Media","m3lname":"Yau","dataset":"Our main dataset was obtained through SoccerNet's open-source competition (https://www.soccer-net.org/tasks/game-state-reconstruction). The dataset provides image sequences of 30 second clips of various football matches. The sample visualizations used are also of the same format, provided through Roboflow's Dataset API.
","m2uni":"eaw2233","m2fname":"Emily","m3uni":"yy3223"},{"projectname":"Video Caption Generation","timestring":"Fri May 6 05:09:37 2022","m1uni":"sc4921","m2lname":"","m1fname":"SUNJANA RAMANA","m4fname":"","m1lname":"CHINTALA","m3fname":"","description":"The goal of this project is to design a system that automatically generates a caption for the events in a video.","uni":"sc4921 ","language":"Python, Jupyter Notebook","pid":"202205-1","m4uni":"","analytics":"CNN, LSTM, Greedy Search, Beam Search, Encoder-Decoder, Transfer Learning","m4lname":"","industry":"Media","m3lname":"","dataset":"TRECVID-VTT","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Humanized Conversation with Personality","timestring":"Thu May 9 15:56:41 2024","m1uni":"ch3801","m2lname":"","m1fname":"Chuqiao","m4fname":"","m1lname":"Huang","m3fname":"","description":"Objectives:
Develop a virtual agent capable of conducting human-like conversations, infused with distinct personality traits to make interactions more personalized and engaging.

Innovations:
Integrating complex personality models into the AI to allow for more nuanced and varied interaction styles, moving beyond the typical one-size-fits-all approach of current virtual assistants.

Importance of Research:
1. By humanizing AI interactions, the project aims to reduce the perceived distance between human and machine communication, making technology more accessible and less intimidating to users.
2. In applications ranging from customer service to personal companionship, the development of these advanced AI capabilities can significantly enhance user satisfaction and engagement.
3. The AI's ability to adapt to individual personality traits and emotional states opens up new possibilities in personalized education and therapy, where the AI can act as a tutor or therapeutic aide tailored to the user’s specific needs.","uni":"ch3801","language":"Python, TensorFlow and PyTorch, BERT, Google Cloud Platform, React","pid":"202405-3","m4uni":"","analytics":"System Modules

Persona and Memory Integration Module:
Persona Management: Handles the storage, retrieval, and application of different personality profiles.
Memory System: Utilizes past interactions to inform current responses, ensuring continuity and contextuality in conversations.

Response Generation Module:
Combines input from the persona and memory modules to generate responses that are not only relevant but also align with the AI's personality settings.

User Interaction Module:
Dialogue Management: Manages the flow of conversation, ensuring that interactions are smooth and natural.
User Feedback System: Gathers user feedback to continuously refine AI responses and behaviors.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"There are two datasets meticulously prepared for modeling our AI’s personalities:

Big-5 Dataset: This dataset is based on established personality assessment models, particularly the Big Five Inventory (BFI) and HEXACO-60.

PERSONA-CHAT Dataset: Crowd-sourced from Amazon Mechanical Turk. This approach ensures a diverse range of inputs in terms of demographics, language use, and cultural references, which enriches the dataset with realistic and varied dialogue scenarios.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYC Taxi Fare Prediction","timestring":"Sun Dec 23 03:44:32 2018","m1uni":"dw2834","m2lname":"Qiao","m1fname":"Di","m4fname":"","m1lname":"Wu","m3fname":"Yunfei","description":"Taxis have made our lives easier than ever before, and taking a taxi or using car-hailing platforms such as Uber, Lyft and Didi has become routine for all of us, especially in big cities like New York. Yet unlike the platforms that present an estimated fare before we take the ride, with a taxi we only learn the exact fare when we arrive at the destination. This inconvenience makes us curious about predicting the fare, and a precise estimate would be of great help in controlling a daily budget.

Motivated by the facts above, we mainly focus on three points in this project. The first point is to build a prediction system that provides a relatively accurate predicted taxi fare for given pick-up and drop-off locations in New York City. The second point is to investigate how each predictor affects the prediction result. The third point is to develop a Python application for users to predict the fare.","uni":"dw2834","language":"Python, HTML; Jupyter Notebook, PySpark, LightGBM, SHAP ","pid":"201812-31","m4uni":"","analytics":"In this course final project, we use New York City taxi fare data for prediction and analysis. We first get an overview of the data and do some data cleaning. We try different combinations of predictors and different algorithms in PySpark to build models and predict the fare, but all of these models give unsatisfactory results. We then add more predictors, such as airport information, and try other algorithms, such as the LightGBM framework, which gives a decent prediction result and is selected as our final model. We use the SHAP tool to analyze and interpret the final model, identifying the important predictors and how they affect the prediction results. We also build a web page using JavaScript that shows a day-hour heatmap of the number of taxi rides, from which we derive a lot of interesting facts concerning people’s daily life. Finally, we make a Python application to help users predict New York City taxi fares.","m4lname":"","industry":"Transportation","m3lname":"Wang","dataset":"Both our train set and test set are from Kaggle Playground Prediction Competition New York City Taxi Fare Prediction.
The size of the train set is more than 55 million and the size of the test set is 9914.","m2uni":"sq2205","m2fname":"Shuhao","m3uni":"yw3157"},{"projectname":"League of Legends Real Time Winning Prediction","timestring":"Sat Dec 23 04:38:30 2023","m1uni":"jy3252","m2lname":"","m1fname":"Jiming","m4fname":"","m1lname":"Yu","m3fname":"","description":"Abstract—In the dynamic and rapidly evolving world of esports, leveraging machine learning for real-time game predictions poses both unique challenges and opportunities. This paper details the development of the RealTime Victory Predictor (RVP), an advanced machine-learning system engineered to forecast match outcomes in League of Legends. The RVP system stands out by its capability to process and analyze data in real-time, adapting to the ever-changing conditions of live matches. It employs a variety of machine learning models, including Decision Trees, Logistic Regression, Random Forests, and Gradient Boosting Machines, each fine-tuned to interpret complex game dynamics effectively. The system architecture integrates real-time data streaming, dynamic model updating, and an interactive user interface, ensuring timely and accurate predictions. This innovation not only enhances the esports viewing experience by providing insightful analytics but also assists teams and players in strategic decision-making during matches. Additionally, the project explores the challenges of working with real-time esports data and the effectiveness of various predictive models in this high-velocity environment. The RVP system thus serves as a pioneering tool in esports analytics, pushing the boundaries of real-time predictive modeling in competitive gaming.

Keywords—Esports, Data Analytics, Machine Learning, Decision Trees, Logistic Regression, Random Forest, Gradient Boosting Machine, Real-Time Prediction, League of Legends ","uni":"jy3252","language":"python","pid":"202312-19","m4uni":"","analytics":"Technical Stack and Tools Frontend: Dash Plotly was
chosen for its dynamic data visualization capabilities and its compatibility with Python.
Backend: Python serves as the backbone, handling data processing, model training, and deployment.
APIs: Riot Games API for initial data collection; Riot’s local client API for real-time data.
Machine Learning: Scikit-learn and XGBoost for developing the predictive models.
Data Storage: Google BigQuery is used for storing and querying processed data.
Version Control: Git ensures efficient version control and collaboration.
Hosting: The application is hosted on a platform that
supports Python and Dash Plotly.","m4lname":"","industry":"Information","m3lname":"","dataset":"Our data acquisition process commenced with the utilization of the MATCH-V5 endpoint from Riot's API, targeting top-tier players (Challengers, Grandmasters, Masters) in the North American region. By extracting unique player IDs (puuids), we compiled a comprehensive dataset comprising Solo and Flex matches. This raw data underwent meticulous preprocessing to extract pivotal features such as first dragon, tower, gold differentials, and kill counts, essential for nuanced model training. Post-processing efforts included duplicate removal and exclusion of incomplete matches.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Stance Detection of Russia-Ukraine War Across Nations on Mainstream News","timestring":"Fri Dec 20 23:01:56 2024","m1uni":"zn2209","m2lname":"Wu","m1fname":"Zhanghao","m4fname":"","m1lname":"Ni","m3fname":"Zishun","description":"This project aimed to analyze mainstream media coverage, specifically focusing on the current international conflict involving the Russia-Ukraine war. While countries have expressed their attitudes through official statements on the Russia-Ukraine conflict, their actions have not always aligned with these declarations. Economic deals, strategic movements, and national political decisions often reveal more complex or even conflicting attitudes.
For instance, France initially expressed its strong support for Ukraine, including condemning Russia's actions in official statements. However, France's continued significant purchases of Russian natural gas suggested a more complex position towards the Russia-Ukraine war. This conflict between diplomatic statements and actions highlights the importance of examining not only official rhetoric but also concrete actions to truly understand a country's stance.
Our project focused on monitoring and quantifying stances over time by analyzing mainstream news media coverage of diplomacy, specifically regarding Russia. By uncovering trends within media coverage, this project provided audiences with a data-driven perspective on nations' and media outlets' true stances over time.","uni":"zn2209","language":"Python, javascript, sql, html, Google cloud platform, Google Colab, Airflow, Hugging Face","pid":"202412-19","m4uni":"","analytics":"Analytics: Oversampling, Synonym Replacement, Back-Translation
Algorithms: BERT, RoBERTa
System modules: GPU(A100), CPU, DAGs, Spark
Visualization: D3, Apache web server

","m4lname":"","industry":"Information","m3lname":"Shen","dataset":"We scraped articles from news websites such as NYT, the Guardian, BBC, CNN, and RT, and we labelled the training dataset ourselves.","m2uni":"aw3088","m2fname":"Anqi","m3uni":"zs2695"},{"projectname":"IRIS RECOGNITION BY DEEP LEARNING","timestring":"Sat Dec 18 03:18:04 2021","m1uni":"rw2902","m2lname":"Zhao","m1fname":"Ruisi","m4fname":"","m1lname":"Wang","m3fname":" Zhongsheng","description":"The objective of our project is to explore the potential of deep learning for iris recognition. Our research focuses on the accuracy of deep learning on human eye images of different quality, the necessity of different pre-processing steps before training, and the ways to combine traditional methods with deep learning.
Innovations, Capabilities, and Importances: For security reasons, many applications or situations require personal identification to restrict access to certain resources. For example, to log onto an email account, a password is needed, and it is usually known only by the owner. However, one person can have multiple accounts for various platforms. Each platform can have different rules to set passwords. Except for passwords, we also have fingerprint recognition and facial recognition. However, under cold weather or when one’s hand is wet, fingerprint recognition can fail. For facial recognition, since covid-19, everyone wears a mask outside, so when they want to use a masked face to unlock their mobile devices, the current recognition algorithm cannot identify successfully. Compared with fingerprints and facial recognition, human iris recognition has a high potential to be a more reliable method. ","uni":"rw2902","language":"Python, Tensorflow, CoLab","pid":" 202112-54","m4uni":"","analytics":"Deep learning models: ResNet50 and VGG","m4lname":"","industry":"Life Science","m3lname":"Chen","dataset":"IITD_Delhi, CASIA_V4_Interval, and MMU2
Downloaded from websites","m2uni":"xz2987","m2fname":" Xiaoshu","m3uni":"zc2583"},{"projectname":"MTA Network Flow Prediction and Anomaly Detection","timestring":"Thu Jan 1 02:38:12 2026","m1uni":"aw3575","m2lname":"","m1fname":"Albert","m4fname":"","m1lname":"Wen","m3fname":"","description":"Objective: model New York City subway inflow as a graph, and predict per-station hourly ridership for anomaly detection. The innovation here is to use an autoregressive model with exogenous predictors to create a strong model with > 0.9 R2 on 98% of stations.

Importance: This has implications for city planning, such as resource allocation around busy stations or busy times, or such as real-time event detection and response. ","uni":"aw3575","language":"Python, ipynb, Gcloud VM, HTML","pid":"202512-23","m4uni":"","analytics":"Data Processing: Polars, Spark, Airflow, Pandas, Numpy

Autoregressive Modeling: sklearn, OLS, pytorch, GNN

Visualization: CartoDB, Geopandas, Matplotlib, Seaborn

all done in a Gcloud VM environment","m4lname":"","industry":"Information","m3lname":"","dataset":"We primarily use this dataset from NYC Open Data Program

https://data.ny.gov/Transportation/MTA-Daily-Ridership-Data-2020-2025/vxuj-8kew","m2uni":"","m2fname":"","m3uni":""},{"projectname":"OncoLink: Predicting Breast Cancer Treatment Response from Gene Expression","timestring":"Tue May 5 19:34:51 2026","m1uni":"apb2192","m2lname":"Taimur","m1fname":"Anjali","m4fname":"","m1lname":"Bhimanadham","m3fname":"Mahsa","description":"The goal of OncoLink is to support precision medicine by helping oncologists to predict what a patient’s response to a treatment plan will be and make more informed clinical decisions. Oncolink is able to predict whether or not a patient will respond to treatment with probability/confidence, show the top k similar historical patients from the METABRIC dataset, allows physicians to enter the outcomes of entered patients so we can improve the model, and lets them see insights about the model making the predictions.

The project has some key innovations that improve model performance as well as explainability and interpretability. We use patient similarity search by finding similar patients by comparing PCA-reduced gene expression data utilizing FAISS. This application also integrates Agentic AI by using Groq and Llama 3. It retrieves data such as similar historical patients and the model’s performance before generating the clinical explanation which makes it grounded and reduces hallucinations. Oncolink also has an incremental learning loop and the model is continuously updated every time 10 new real-world patient outcomes are entered by using partial_fit() so that we don’t have to retrain the entire model. SHAP explainability is used to back the model’s predictions. It tells the oncologist how each gene or piece of clinical data affected the prediction.

This research/project is important because it promotes precision medicine which pushes to provide patients personalized care and we are able to assist with that by utilizing ML models and clinical data. Since we are using SHAP for explainability we are also making the system explainable which provides transparency and gains trust. We are also constantly improving the model by incorporating the real data we are getting from the physicians without retraining the whole model. ","uni":"apb2192","language":"Python, Streamlit, Groq API","pid":" 202605-3","m4uni":"","analytics":"For the data pipeline we used StandardScaler to normalize both gene expression and clinical features separately before any modeling. We then applied Principal Component Analysis to reduce the gene expression features down to 20 and 50 components as well as a variance-threshold version that captures 95% of explained variance, and we also selected the top 1,000 most variable genes by variance as a separate feature set. These gave us four distinct feature sets to test against clinical only, top variable genes, PCA-20, and all features combined. For the classification models we trained XGBoost with 200 estimators, a learning rate of 0.05, max depth of 4, and subsampling, alongside a Random Forest with 100 estimators and Logistic Regression with L2 regularization, comparing all three across all four feature sets using accuracy, F1 score, and ROC AUC on a stratified 80/20 train-test split to select the best performing combination. For explainability we used SHAP via TreeExplainer to compute per-prediction feature attributions on the best model, with a subprocess isolation approach so SHAP crashes wouldn't kill the training run and a feature importance fallback if SHAP was unavailable. 
For patient similarity search we built a FAISS IndexFlatL2 vector index over PCA-20 embeddings of all 1,900 patients, using an exponential decay function anchored to the cohort's 95th percentile nearest-neighbor distance to convert Euclidean distances into meaningful similarity percentages. For the incremental learning component, we trained a baseline SGDClassifier using log loss that supports partial_fit, which gets updated without full retraining every time a physician submits 5 or more confirmed patient outcomes through the app. On the AI side we used Groq with Llama-3.1-8b instant through an OpenAI compatible tool calling loop where the model can call get similar patients and get model performance to retrieve real data before generating clinical explanations, with retry logic and exponential backoff for rate limiting.","m4lname":"","industry":"Life Science","m3lname":"Mohajeri","dataset":"The primary dataset used in this project is the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset. This dataset contains gene expression profiles and clinical metadata such as tumor stage, receptor status (ER, HR2, etc), and treatment information. The dataset is publicly available and was accessed through Kaggle.   In addition to the METABRIC data, the GSE25066 dataset from the Gene Expression Omnibus (GEO) was used as an external validation and testing source. This dataset contains 508 patient gene expression samples that were used to simulate real world patient uploads to test the project’s ability to process new patient data and generate predictions. The dataset is publicly available and was accessed through Kaggle.

The project is designed to be flexible and support additional datasets beyond the current ones used. This can include any gene expression dataset as long as they are formatted in a tabular (samples x genes) structure, individual patient gene expression files, and other cancer datasets. The preprocessing pipeline in the project standardizes and scales inputs and the model operates on numerical feature matrices so the system can be generalized to any dataset that follows a similar structure of high dimensional genomic features and structured clinical variables.
","m2uni":"drt2145","m2fname":"Daniyah","m3uni":"mm6859"},{"projectname":"Columbia AI-Agent Campus","timestring":"Fri Dec 19 17:43:15 2025","m1uni":"jc6397","m2lname":"Chen","m1fname":"Jiajun","m4fname":"","m1lname":"Chen","m3fname":"Haiyu","description":"ColumbiaValley is an English-localized generative-agent campus simulation toolkit that adds a Columbia-themed world, new agents, and a reworked replay UI (top controls fixed, bottom persona bar, adaptive zoom). It also integrates MAPPO to optimize agent behavior with RL while preserving LLM-based semantic reasoning and natural-language interaction. It is a practical research sandbox for studying LLM agents in long-horizon social environments, and for testing how RL policy optimization can improve behavioral consistency, interactions, and schedules, with built-in replay and evaluation workflows.","uni":"jc6397","language":"Python/Ollama/Flask/Phaser-based","pid":"202512-14","m4uni":"","analytics":"Core algorithms (RL):

MAPPO training loop with GAE advantages and PPO clipped objective updates (policy + value networks).

Discrete action space (6 actions) including INITIATE_CHAT, CHANGE_LOCATION, REVISE_SCHEDULE, etc.

Multi-component reward function (persona alignment, interaction quality, relationship growth, diversity, schedule completion).

OnlineDataCollector + State feature extraction integrated into the decision loop (persona/spatial/action/memory/social/schedule features).

Visualization & analytics outputs:

Training metrics saved to rl_metrics.json, with auto-generated plots in rl_visualizations/ (losses, KL divergence, reward breakdown/trends, action distribution, learning curves, etc.).

Replay system with compressed artifacts and UI controls (zoom/pan/speed, timeline).","m4lname":"","industry":"Information","m3lname":"Wei","dataset":"This project does not depend on a fixed external labeled dataset. The “dataset” used for testing/training is generated by running the simulation and logging artifacts:

Checkpoint + per-agent memory data saved under results/checkpoints//

Replay artifacts generated by compress.py:

results/compressed//movement.json (frame-by-frame movement + actions)

results/compressed//simulation.md (timeline of agent states/conversations)

For RL, trajectories are collected online: the OnlineDataCollector stores transitions (s_t, a_t, r_t, s_{t+1}) into rollout buffers.
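
A minimal sketch of what such a rollout buffer could look like is given below. The Transition and RolloutBuffer names and the fixed-capacity policy are hypothetical; the project's OnlineDataCollector may be organized differently.

```python
from collections import namedtuple

# Hypothetical record type mirroring the (s_t, a_t, r_t, s_{t+1}) tuple described above.
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'next_state'])

class RolloutBuffer:
    # Fixed-capacity store for on-policy data; drained after each policy update.
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def add(self, state, action, reward, next_state):
        if len(self.items) >= self.capacity:
            raise RuntimeError('rollout buffer is full; run an update and drain it')
        self.items.append(Transition(state, action, reward, next_state))

    def is_full(self):
        return len(self.items) == self.capacity

    def drain(self):
        # Return all stored transitions and reset the buffer.
        batch, self.items = self.items, []
        return batch
```
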
What other data can it support? Any new campus/world + agents can be supported by swapping map assets, agent personas/schedules, and prompts/config, because the pipeline is driven by simulation traces + configurable LLM backends.
","m2uni":"jc6175","m2fname":"Jianfeng","m3uni":"hw3036"},{"projectname":"Adversarial sample for neural network","timestring":"Sat May 18 02:49:55 2019","m1uni":"cs3731","m2lname":"CHEN","m1fname":"CHAOFAN","m4fname":"","m1lname":"SUN","m3fname":"","description":"Neural networks provide state-of-the-art results for most machine learning tasks.

However, neural networks are vulnerable to adversarial samples: given an input x and any target classification t, it is possible to find a new input x prime that is similar to x but classified as t. We therefore decided to explore this area and joined an adversarial AI competition held by Alibaba on generating adversarial samples.

While participating in the competition, we first learned basic methods for generating adversarial samples and ran experiments with them. We then fused several methods and ranked first on the leaderboard. Finally, we built a web page to show our results.","uni":"cs3731","language":"Python, JavaScript, CSS, PyTorch, Linux","pid":"201905-5","m4uni":"","analytics":"Algorithms used to generate adversarial samples: FGS, PGD, Adam, Momentum, Mask and Spatial Transformation.

Visualization: Bootstrap.","m4lname":"","industry":"Information","m3lname":"","dataset":"Our data is provided by Alibaba. The size of the data set is about 10 GB. It consists of 110 categories, each with about 1000 pictures from the e-commerce platform.","m2uni":"tc2932","m2fname":"TIANSU","m3uni":""},{"projectname":"AdTracking Fraud Detection","timestring":"Fri Dec 13 15:15:16 2019","m1uni":"jy3012","m2lname":"Nene","m1fname":"Jeswanth","m4fname":"","m1lname":"Yadagani","m3fname":"","description":"Fraud risk is everywhere in the finance industry, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can increase revenue by simply clicking on the ad on a large scale. Apart from ad channels, rival companies can generate random clicks and waste a company's investment by keeping it from reaching the targeted customer market. TalkingData, China’s largest independent big data service platform, handles 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to prevent click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they’ve built an IP blacklist and device blacklist. While successful, they want to always be one step ahead of fraudsters by taking action in real time. Hence the objective is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. The proposed system will not only predict the probability of users downloading the app but also combine the results with the traditional method of blacklisting IPs and block potentially fraudulent IPs in real time. 
The system has an offline mode and an online mode: the offline mode handles training and improving the existing machine learning model, and the online mode predicts results for incoming clicks. The results are visualized with business analytics tools so that the user can see intricate patterns in the data. Companies are suffering huge revenue losses in online advertising. With the loss estimated at $23 million by 2019, a state-of-the-art system that can prevent sophisticated ad fraud is essential to these companies.","uni":"jy3012","language":"PySpark, BigQuery, Google Data Studio, Dataproc, Google Cloud Storage","pid":"201912-6","m4uni":"","analytics":"Analytics:-
Upsampling, downsampling, and SMOTE for imbalanced data.
Group-by aggregations to extract better features during feature engineering.

Models tested:-
Logistic Regression, Decision Tree Classifier, Random Forest, Gradient Boosted Tree, SVM Classifier.

ML Pipeline has been used to fit models and make predictions.
BigQuery is used to store data and Google Cloud Storage is used to store logs of previous data.

Visualizations are implemented in Google Data Studio, which is linked to BigQuery.","m4lname":"","industry":"Finance","m3lname":"","dataset":"We used data from the Kaggle competition \"TalkingData AdTracking Fraud Detection Challenge\".

It contains 184 million rows of training data and 18 million rows of test data.

Each row of the training data contains a click record, with the following features.

ip: ip address of click.
app: app id for marketing.
device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
os: os version id of user mobile phone
channel: channel id of mobile ad publisher
click_time: timestamp of click (UTC)
attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
is_attributed: the target that is to be predicted, indicating the app was downloaded
Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:

click_id: reference for making predictions
is_attributed: not included
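
A schema check along the following lines could be used to confirm that a new file matches the training layout before it is fed to the pipeline. This is a sketch only: the column names come from the field list above, while the helper names are hypothetical and not part of the project.

```python
import csv
import io

# Column names taken from the training field list above.
TRAIN_COLUMNS = ['ip', 'app', 'device', 'os', 'channel',
                 'click_time', 'attributed_time', 'is_attributed']

def missing_columns(header, required):
    # Return the required column names that do not appear in the header.
    present = set(header)
    return [name for name in required if name not in present]

def check_csv_schema(text, required=TRAIN_COLUMNS):
    # Read only the header row of the CSV text and report absent columns.
    header = next(csv.reader(io.StringIO(text)))
    return missing_columns(header, required)
```

An empty result means every required column is present; a test file would be checked the same way with its own column list.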

Our system can accept any data with the above schema.","m2uni":"rn2494","m2fname":"Ruturaj","m3uni":""},{"projectname":"Classification ","timestring":"Fri May 3 20:24:36 2024","m1uni":"xh2628","m2lname":"","m1fname":"Xiaoyi","m4fname":"","m1lname":"He","m3fname":"","description":"Objectives:
The primary objective of this project is to develop a deep learning model capable of classifying brain images based on the movies being watched by individuals. This involves processing functional magnetic resonance imaging (fMRI) data to identify unique patterns associated with different cinematic experiences.

Innovations:
The project introduces a novel approach to cognitive neuroscience where deep learning is applied to understand how different visual stimuli (in this case, movies) are processed in various regions of the brain. By leveraging advanced neural network architectures, the project aims to achieve high accuracy in classifying brain states, which could open new avenues in personalized media content analysis and neuro-marketing.

Capabilities:
1. Analyzing complex brain imaging data.
2. Distinguishing between different types of visual content processed by the brain.
3. Providing insights into the neural correlates of visual perception and cognitive processing.

Importance of Research/Toolkits:
This research is significant because it bridges the gap between deep learning and neuroimaging, providing a deeper understanding of brain function and its reaction to visual stimuli. The use of toolkits will expedite experimental setups and streamline the workflow, allowing for robust model training and evaluation. This could significantly impact fields such as psychology, neuroscience, and even media production, offering a better understanding of how content affects viewer engagement and brain activity.","uni":"xh2628","language":"python, html","pid":"202405-12","m4uni":"","analytics":"The project employs advanced machine learning techniques, particularly convolutional neural networks (CNNs), to analyze functional magnetic resonance imaging (fMRI) data. This allows the identification of specific neural patterns associated with various types of visual stimuli, such as different movie genres.
","m4lname":"","industry":"Life Science","m3lname":"","dataset":"This is a high-resolution functional magnetic resonance (fMRI) dataset — 20 participants recorded at high field strength (7 Tesla) during prolonged stimulation with an auditory feature film (\"Forrest Gump''). In addition, a comprehensive set of auxiliary data (T1w, T2w, DTI, susceptibility-weighted image, angiography) as well as measurements to assess technical and physiological noise components have been acquired. An initial analysis confirms that these data can be used to study common and idiosyncratic brain response pattern to complex auditory stimulation. Among the potential uses of this dataset is the study of auditory attention and cognition, language and music perception as well as social perception. The auxiliary measurements enable a large variety of additional analysis strategies that relate functional response patterns to structural properties of the brain. Alongside the acquired data, we provide source code and detailed information on all employed procedures — from stimulus creation to data analysis. The total size of dataset is more than 350 GB. Therefore files for individual modalities are made available below. README.dataset_content provides an overview of the dataset and a description of the content for all available downloads. Note, access to individual files is possible via openfmri.org's XNAT server. This server typically hosts datasets that are used for open research purposes, allowing researchers worldwide to access and utilize the data in their own studies, ensuring that the dataset adheres to public domain standards which facilitate transparency and reproducibility in research. 
This platform provides detailed documentation and access instructions, ensuring that researchers can effectively use the dataset for a wide range of scientific inquiries related to auditory attention, cognition, and other neuroscientific studies.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYC Restaurant Critical Hygiene Violation Prediction","timestring":"Fri Dec 15 20:54:13 2023","m1uni":"wl2916","m2lname":"Wang","m1fname":"Karl ","m4fname":"","m1lname":"Liu","m3fname":"Shujie","description":"Objectives: This study aims to provide a proactive tool for consumers to assess food safety in NYC restaurants before dining. It applies machine learning algorithms to predict potential sanitation violations using historical inspection data.

Innovations: This study innovatively conducts feature engineering to effectively utilize historical inspection data by meticulously transforming and extracting relevant features from the data. This approach significantly enhances the predictive capability of the machine learning models, showcasing a novel application of data analytics in public health.

Capabilities: This study utilizes historical data from NYC OpenData, processed and analyzed on the Google Cloud Platform using Apache Airflow. The user interface is based on Flask and Web3, allowing consumers to check restaurants' hygiene status before visiting.

Importance: This study is significant because it addresses a public health concern by providing real-time insights into restaurant sanitation, thus empowering consumers with information to make informed dining choices. It also sets a precedent for applying similar methodologies in other cities or sectors, highlighting the transformative potential of technology in public health and safety.","uni":"wl2916","language":"Airflow, GCP, Flask, HTML, Javascript, CSS, Python, Pandas, Jupyter Notebook, Scikit-learn","pid":"202312-8","m4uni":"","analytics":"Analytics: Data preprocessing including SMOTE, label encoding, handling of missing values and outliers; feature engineering including splitting, feature selection, and creation of new features from existing features
Algorithms: Utilization of Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, and XGBoost, with a focus on XGBoost for its superior accuracy.
System Modules: The workflow, including data processing and model training, is conducted on a Google Cloud Platform virtual machine, with Apache Airflow for data management and Python for development.
Visualization: The web interface, also hosted on a GCP VM, utilizes Flask for back-end to provide interactivity with model and data, and HTML/JS/CSS for front-end, along with Bootstrap JS for better aesthetics, allowing users to query restaurant hygiene data and view results in real-time.","m4lname":"","industry":"Information","m3lname":"Hu","dataset":"The study used NYC OpenData's \"DOHMH New York City Restaurant Inspection Results\" dataset. This dataset comprises NYC restaurant inspection records and is public. It covers NYC restaurant hygiene with approximately 210,000 rows updated daily since 2015. Inspection scores, infraction kinds, restaurant food types, and geography are included.

This study's software is intended for this dataset, but the methodology and machine learning approaches might be applied to datasets from other cities or domains with similar data structures and objectives.","m2uni":"yw3886","m2fname":"Isa","m3uni":"sh4355"},{"projectname":"Robot Arm Reinforcement & Imitation Learning","timestring":"Fri May 5 19:36:41 2023","m1uni":"ar4451","m2lname":"","m1fname":"Ali","m4fname":"","m1lname":"Rahman","m3fname":"","description":"With an increasing number of fields utilizing robot arms to conduct various tasks, it has become more important to have sufficiently trained autonomous robot arms in terms of both safety and efficiency.
Goals:
The primary goal of this project was to implement an imitation learning model to conduct tasks in an environment using a mixture of data. In order to accomplish this primary goal, smaller goals were set to get working simulator environments and then implement reinforcement learning agents to collect some data and have a baseline comparison.
Innovation:
At the time that this project was implemented there weren't any notable papers that used the same dataset, so the innovation was to show how an imitation learning model using the data could perform while trying to improve on baseline models.","uni":"ar4451","language":"Python","pid":"202305-8","m4uni":"","analytics":"Tensorflow: PPO, SAC, Imitation Learning CNN
Stable-Baselines: PPO, SAC
Matplotlib: Reward Charts
Robosuite: Simulator Env Renders
OpenAI Gym: Simulator Env Renders","m4lname":"","industry":"Information","m3lname":"","dataset":"An aggregated dataset was used for this project. A small amount of generated data was collected from a trained reinforcement learning agent. The RoboNet dataset is also referenced since it was important for the initial stages of the project, and the RoboTurk simulation dataset was the most important since it matched the simulation environment. The datasets used relied primarily on physics descriptors but also include other data, such as images, that can be found on their corresponding websites.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"SENSECOVID: The Real-Time Twitter Storyteller for COVID-19","timestring":"Fri Apr 23 16:23:30 2021","m1uni":"bf2477","m2lname":"Tang","m1fname":"Bo","m4fname":"","m1lname":"Feng","m3fname":"","description":"We focus on building an intelligent system that can automatically aggregate and summarize valuable information from live streaming tweets around COVID-19 related topics.

Using the fetched tweet data and AI models, the system generates a multimedia story in a customized web page UI to show users, and even researchers, the important updates and references on Twitter regarding the global pandemic.

COVID-19 has been a hot topic closely connected with all of human society during the past year. Since there is a large amount of unorganized information on Twitter, we can provide a clear and vivid picture from the overwhelming data. Also, information retrieved from related Twitter topics may be useful for decision making by the general public as well as governmental organizations. For the COVID-19 pandemic, extracting and visualizing the changing situation might help them devise better countermeasures in the future.","uni":"bf2477","language":"Python, Linux Shell, HTML, CSS, JavaScript / Google Cloud Platform, Spark","pid":"202105-4","m4uni":"","analytics":"1.Data Collection: PySpark, twarc
2.Sentiment Analysis: TextBlob, Stanza, Wordcloud
3.Credibility Evaluation: Support Vector Machine
4.Text Summarization: T5-base, BART-large-cnn
5.Web Application: Flask, ECharts
","m4lname":"","industry":"Media","m3lname":"","dataset":"1.IEEE Dataport CORONAVIRUS (COVID-19) TWEETS DATASET
2.Real-time Tweets from Official Twitter Streaming API
3.COVID-19 Tweet Fake News Detection Dataset","m2uni":"ct2990","m2fname":"Chenxin","m3uni":""},{"projectname":"Fashion AI: Attributes Recognition of Apparel","timestring":"Sat Dec 22 18:11:35 2018","m1uni":"jb4076","m2lname":"Wang","m1fname":"Jingyuan","m4fname":"","m1lname":"Bian","m3fname":"Xiuqi","description":"Detecting detailed apparel attributes is a topic receiving increasing attention, with wide applications. In recent years, demand for online shopping of fashion items has grown considerably, which raises problems: sellers provide information inconsistent with the actual product, and different sellers have inconsistent understandings of apparel styles. An automatic fashion attribute detection system can help overcome these problems by providing precise and consistent tagging or descriptions of apparel from pictures. This technique can be applied to areas such as apparel image search, navigation tagging, and mix-and-match recommendation.","uni":"jb4076","language":"Python, Keras, Ubuntu 16.04","pid":" 201812-3","m4uni":"","analytics":"We use the same structure for all 8 networks. After the input layer, we use a preprocessing layer to process our datasets. We then use ResNet50 and InceptionV3 to construct our CNN models. The last layer is a fully-connected layer with softmax as the activation function. It takes the output of the base model as input, with a 50% dropout added to prevent overfitting. 
The final output is a number corresponding to a specific category in that attribute dimension.","m4lname":"","industry":"Media","m3lname":"Shao","dataset":"We also support any other image data to do fashion details classification.","m2uni":"ww2468","m2fname":"Wenshan","m3uni":"xs2327"},{"projectname":"Multi Modal Analysis of Music","timestring":"Sun Dec 23 03:49:57 2018","m1uni":"ss5569","m2lname":"Saxena","m1fname":"Saurabh","m4fname":"","m1lname":"Sharma","m3fname":"Tvisha","description":"Detect and analyze the bias present in the music industry in terms of gender, using lyrics of songs and their music videos.
Detect which genre a particular song belongs to using its audio.
Used a generative approach to show presence of bias in lyrics.
Generate song lyrics for songs belonging to different genres. ","uni":"ss5569","language":"Python, Mac, Google Colab, Google Cloud Platform","pid":"201812-12","m4uni":"","analytics":"Neural networks, Decision trees, Gradient Boosting were used as the training algorithms.
NLTK, Keras, TensorFlow, TextGenRnn, and scikit-learn were some of the Python modules we used in addition to PySpark. Visualizations were mainly done with WordCloud and Matplotlib","m4lname":"","industry":"Media","m3lname":"Gangwani","dataset":"Bias Analysis in Lyrics 380,000 lyrics from Metrolyrics (Kaggle) Dataset (98 MB)
Bias Analysis in Music Videos- YouTube7(15GB)
FMA Music dataset - 7GB ","m2uni":"ms5736","m2fname":"Mayank","m3uni":"trg2128"},{"projectname":"An Realization of Facial Feature Detection Using Convolutional Neural Network","timestring":"Wed May 22 04:25:07 2019","m1uni":"xl2788","m2lname":"Han","m1fname":"Xiaotong","m4fname":"","m1lname":"Li","m3fname":"","description":"The purpose of this paper is to build facial feature detection algorithms on prelabeled data. To do this, we first explore Convolutional Neural Network (CNN) classification methods on images and visualize the architectures. We first built convolutional layers to visualize each layer. Then, using Keras built on TensorFlow, we implemented various architectures for training binary-label classification models, including VGG, ResNet, and Inception. Due to residual networks’ relatively high accuracy and speed in comparison to other networks featured in the ImageNet competition, we proceeded with more experiments on ResNet18, ResNet34, and ResNet50 using Fastai. We decided it would be more interesting for the end-user to select individual features to test. Thus, we trained multiple single-label models on select features and achieved accuracy levels between 80% and 99% depending on the selected feature. (Male attractiveness had ~80% accuracy while gender achieved near 99% accuracy.) Finally, using Flask, we took our single-label models and built a functional web page ready for deployment.","uni":"xl2788","language":"Python, Keras, TensorFlow, PyTorch, Fast.ai, Flask, Cloud Compute","pid":"201905-7","m4uni":"","analytics":"CNN: ResNet, AlexNet, LeNet, VGG, Inception
Visualization: Flask, html, CSS, jupyter outputs
System Modules: Cloud Compute, Unix","m4lname":"","industry":"Information","m3lname":"","dataset":"In this paper, the primary database we will be looking at is the CelebFaces Attributes dataset (CelebA). This dataset is published by The Multimedia Laboratory by the Chinese University of Hong Kong and it is a pre-labeled data with images from openly published celebrity pictures. In this section, we will perform exploratory data analysis to have a general outlook on the dataset.
","m2uni":"lh2910","m2fname":"Linsu","m3uni":""},{"projectname":"Let's Catch Pokemon","timestring":"Fri Dec 21 23:34:26 2018","m1uni":"tl2861","m2lname":"Qin","m1fname":"Tingyu","m4fname":"","m1lname":"Li","m3fname":"Ge","description":"PokemonGo is a mobile AR game, where players can catch virtual creatures Pokemon use GPS and camera on their phone. From the camera, it is just like Pokemon are in real-world. The game first releases in July 2016 with around 150 species of Pokemon. The game has more than 500 million times download worldwide by the end of this year. And it is claimed to reach 800 million downloads in May 2018. In addition to the popularity of the game, it also has some community and cultural impact. The game motivates more people to step out of their home, find and catch Pokemon in different places in the real world. Players can report crimes in progress, which is beneficial to public safety. This game also has some impact on business. PokéStops attract more people to come to that place and therefore bring more customers to the nearby store.

Players try hard to find the Pokemon they like in the game, but some Pokemon can be really rare and are not easy to find. To improve the user experience, we want to design an application that helps people find the Pokemon they like. We explore the dataset and make predictions based on it, then turn it into a web application. People can easily find the information they want by interacting with our website.

The model we trained not only helps improve the game experience but can also be used in many other areas, such as crime prediction and traffic forecasting.","uni":"tl2861","language":"python, html, css, javascript, jupyter notebook","pid":"201812-24","m4uni":"","analytics":"1. Analytics
(1) Train a model to predict where pokemon may appear
(2) Train a model to predict what kind of pokemon may appear in a specific area
(3) Find the similarity among different Pokemon
2. Algorithms and system modules
(1) Gradient boosting tree
(2) K nearest neighbors classification
(3) One hot encoding
3. Visualization
(1) Basemap, the longitude and latitude distribution of Pokemon in the world
(2) Seaborn bar plot, the ID distribution of Pokemon in different cities and at different times of the day
(3) d3.js, dc.js. The Pokemon city distribution sortable bar chart; the time, city and ID distribution relation.
","m4lname":"","industry":"Information","m3lname":"Qu","dataset":"This is a dataset from Kaggle, named Predict’em All. It contains roughly 2 hundred and ninety-three thousand pokemon sightings. And we can see from this picture, it contains id, latitude, longitude, appeared time of day, appear time, terrain type and many different features.
","m2uni":"yq2247","m2fname":"Yi","m3uni":"gq2138"},{"projectname":"Music Recommendation System on KKBOX","timestring":"Sun Dec 23 03:13:23 2018","m1uni":"wt2247","m2lname":"Mo","m1fname":"Weiqi","m4fname":"","m1lname":"Tong","m3fname":"Fengqi","description":"Our goal is to explore how different recommendation approaches can be used in the music recommendation task and how to combine them, in order to create better personalization for users.","uni":"wt2247","language":"Python","pid":"201812-19","m4uni":"","analytics":"Random Forest, XGBoost, Song2Vec, Bayes, SVD, ALS","m4lname":"","industry":"Media","m3lname":"Xu","dataset":"Dataset given by KKBOX.","m2uni":"zm2302","m2fname":"Zhaobin","m3uni":"fx2136"},{"projectname":"Persistent Watching & Listening 1","timestring":"Mon May 12 18:52:24 2025","m1uni":"tz2617","m2lname":"Ma","m1fname":"Tianlei","m4fname":"","m1lname":"Zhu","m3fname":"Kangyu","description":"The \"Persistent Watching & Listening\" project aims to create an intelligent AI monitoring system specifically designed for elderly care, addressing significant limitations in traditional monitoring approaches. By integrating advanced visual and auditory perception technologies, it ensures continuous, non-intrusive monitoring, swift emergency detection and response, and predictive capabilities for potential health risks. Utilizing models like MediaPipe Pose, YOLOv5, 3D CNN, DeepLabv3, and Wav2Vec 2.0, the project provides cost-effective, user-centric care solutions. This research is vital due to the growing elderly population, reducing caregiver burdens, enhancing care efficiency, and ensuring robust ethical and privacy standards.","uni":"tz2617","language":"The system is primarily developed using Python, leveraging key AI and deep learning frameworks such as PyTorch and TensorFlow for model training and inference. It also utilizes MediaPipe for pose estimation, PyTorchVideo for action recognition, and Hugging Face Transformers for advanced audio processing. 
The platform runs on Linux-based systems, with experiments conducted on NVIDIA GPU-enabled machines, and logging/evaluation integrated through Weights & Biases.","pid":"202505-10","m4uni":"","analytics":"The system implements a range of analytics and algorithms including MediaPipe for fall detection via pose estimation, YOLOv5 for real-time object detection, and a 3D CNN (Slow R50 from PyTorchVideo) for action recognition. For auditory analysis, it uses Google Speech Recognition for keyword spotting, with plans to integrate Wav2Vec 2.0 and transformer-based NLU for deeper semantic understanding. Evaluation is based on accuracy, precision, recall, and F1-score. Visualization includes annotated video frames, object detection overlays, and confusion matrices, with simple interfaces for debugging and the potential for user-facing caregiver dashboards.","m4lname":"","industry":"Information","m3lname":"Zhao","dataset":"Data Source and Availability: The simulated dataset is proprietary and constructed in controlled environments for research purposes. The Kinetics-400 dataset is publicly available and can be accessed via https://deepmind.com/research/open-source/kinetics. Future updates may include submission of a cleaned version of the ElderCare VisionAudio Dataset if permitted.

Other Supported Data: The system is designed to support additional datasets containing RGB videos, pose estimation annotations, ambient audio, and labeled transcripts. It can be adapted to process publicly available datasets like NTU RGB+D for action recognition, AudioSet for environmental sound classification, and OpenSLR corpora for speech and keyword detection.","m2uni":"ym3052","m2fname":"Yuxin","m3uni":"kz2537"},{"projectname":"Company Analysis Tool: Creative Website Summarization","timestring":"Fri May 3 23:07:51 2024","m1uni":"rk3164","m2lname":"","m1fname":"Richard","m4fname":"","m1lname":"Kim","m3fname":"","description":"The main goal is to create an informative and creative company descriptions using a website’s raw HTML homepage content. The novelty is that the project leverages Big Data and creates a platform that conducts data mining, uses LLMs to summarize key information, provide visualization, and automatically create **creative** website descriptions.

The motivation behind this is that businesses and venture capitals are becoming more and more data-driven. Their time is limited as they are flooded with an abundance of textual information, and this underscores the need for a good and efficient text summarization. By having a good summarization of what a company is about, they can make more informed and efficient business decisions, as well as equipping the businesses to make more informed decisions based on data, which is different from the traditional sourcing process.","uni":"rk3164","language":"Python, Huggingface, fast.ai, BLURR, NLTK, RAKE-Keyword, Flask","pid":"202405-9","m4uni":"","analytics":"The project mainly leveraged the power of the LLMs (BART). Facebook's BART model (facebook/bart-large-cnn) was finetuned on our custom dataset using the BLURR and Transformer library. In addition, we provided some visualizations like the word cloud, word frequency, and extracted key word phrases.","m4lname":"","industry":"Information","m3lname":"","dataset":"The data was manually crafted by scraping the main page websites randomly taken from the Cisco's Umbrella Popularity Ranking List. Tools like ExtractNet, html2text, and bs4 were used to get the raw HTML content as well as the various fields that are used as short company descriptions.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"End to End Movie Recommendation System","timestring":"Tue Dec 22 22:02:57 2020","m1uni":"qw2303","m2lname":"Yang","m1fname":"Qi","m4fname":"","m1lname":"Wang","m3fname":"Ziyu","description":"In our final project, we aim to complete a movie recommender system based on a movie dataset to fulfill the movie recommendation need of different users. We have built an end-to-end production-ready movie recommendation system. After new users give their ratings to different kinds of movies, the webpage will be updated and shows new recommendation movies for users.
Compared with the previous recommender system, we improve the efficiency of movie recommendations by providing different recommendation algorithms. We choose collaborative filtering as the recommendation method in our system and use different algorithms, like Alternative Least Square(ALS), Graph Convolution Network(GCN), and Pearson Correlation. . At the same time, we also integrate the web-based application with the recommendation system to show the results in the webpage.","uni":"yy2608","language":"Python, HTML, CSS, Google Cloud","pid":"202012-6","m4uni":"","analytics":"We implemented analytics, algorithms, system modules, visualization such as EDZ ，Tmdb API, BigQuerry, Pearson Correlation Matrix-based collaborative filtering, Django web application，Python and Spark-based data processing and analysis.","m4lname":"","industry":"Information","m3lname":"Liu","dataset":"The data containing 25 million ratings from more than 160 thousand selected users between January 09, 1995, and November 21, 2019. This dataset describes a 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. At the web-based application, we use the movie posters from TMDB when it matches new movie recommendations for users.
Preprocessing includes handling missing values. After inspecting a sample of the dataset, we decided to drop rows with missing values because doing so has little impact.","m2uni":"yy2608","m2fname":"Yuechen","m3uni":"zl2949"},{"projectname":"Credit Risk Analysis","timestring":"Fri Dec 20 22:50:33 2024","m1uni":"pt2649","m2lname":"Wu","m1fname":"Pengyu","m4fname":"","m1lname":"Tao","m3fname":"","description":"Many people struggle to get loans due to insufficient or non-existent credit histories. Unfortunately, this population is often taken advantage of by untrustworthy lenders. We use machine learning methods to help the organization decide whether to approve a loan to the customer.","uni":"pt2649","language":"Python","pid":"202412-3","m4uni":"","analytics":"Our project uses advanced data wrangling and feature engineering techniques, including memory-efficient data type optimization, temporal feature extraction, and statistical aggregation. To increase model performance, we chose to use LightGBM and CatBoost in our project.","m4lname":"","industry":"Finance","m3lname":"","dataset":"The dataset used in this project comes from the Home Credit - Credit Risk Model Stability Kaggle competition. The data aims to predict loan default by analyzing internal and external financial information about clients. It is designed to assess repayment capabilities, particularly for individuals with little or no credit history, and emphasizes not only predictive accuracy but also model stability.","m2uni":"qw2434","m2fname":"Qianran","m3uni":""},{"projectname":"Motion Prediction for Autonomous Vehicle","timestring":"Sun May 8 19:02:39 2022","m1uni":"fs2756","m2lname":"Guo","m1fname":"Fuchen","m4fname":"","m1lname":"Shen","m3fname":"","description":"The motivation of Autonomous Driving
Desire to reclaim travel and commuting time, general convenience, and a way to reduce the stress and fatigue of driving.
Focus on “Motion Prediction” Part
Given birds-eye-view image (No natural images)
Predict possible trajectories with confidence.

Predict the motions of agents near the autonomous vehicle over the next 5 seconds, given their positions over the previous 1 second.
Use Convolutional Neural Networks to build the model and train it with the provided dataset.
Visualize the prediction result on the semantic map.
","uni":"fs2756","language":"Python, PyTorch, TensorFlow","pid":"202205-23","m4uni":"","analytics":"ResNet34, EfficientNet, Bootstrap, Beanstalk, l5kit.","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Lyft Level 5 Open Data for Motion Prediction
Currently Largest Collection of Traffic Agent Motion
162,000 scenes with semantic map of lane segments and traffic agents
Each scene encodes the state of the vehicle’s surroundings at a given point in time
25 seconds
1000+ hours
Agents including mostly cars, followed by pedestrians and cyclists
","m2uni":"zg2417","m2fname":"Zidong","m3uni":""},{"projectname":"Reddit Feels: Entity Sentiment Analysis Tool with Reddit News","timestring":"Fri Dec 13 17:01:30 2019","m1uni":"tak2151","m2lname":"Jain","m1fname":"Timotius ","m4fname":"","m1lname":"Kartawijaya","m3fname":"Fernando ","description":"The objective of this project was to build an analytics dashboard that enables users to understand the general sentiment on Reddit towards an entity. For our project, we focused on companies (e.g. Facebook, Wayfair) and the r/news subreddit, and developed the tool for two business applications:
1. Enable investors to get alternative data from Reddit to understand the general public's opinion about a company, which is an indicator of its health and support.
2. Enable companies to understand how the general public feels about their own brand or products, since the input can be any kind of entity.

Our final product includes an interactive web app in which users can input an entity/company to analyze and receive analytics regarding the entity's popularity (e.g. the number of posts that mention it) and overall sentiment (sentiment score, most positive comment).

Furthermore, the web app implements a sentiment analysis model which includes an entity recognition step to filter out comments that are not actually connected to the entity (e.g. comment talks about Facebook posts but not the company) and an NLTK parser that grabs the closest sentence to the entity (best context for said entity).

Deployed app can be found on:
https://reddit-feels.herokuapp.com/","uni":"tak2151","language":"Python (Flask, pandas, NLTK, Vader), Spark (SparkNLP), SQL, HTML, CSS (Bootstrap), Javascript (JQuery), Google Cloud (Big Query, NLP API), Heroku (Deployment)","pid":"201912-12","m4uni":"","analytics":"We used BigQuery SQL and Pandas to query the dataset and create high-level metrics (e.g. number of post mentions).

To extract relevant comments, Named Entity Recognition was performed using spaCy. We trained our own model on a custom dataset, and the resulting model filtered out comments that mentioned the name of the entity but did not focus on the entity itself. Sentiment analysis of the comments was done using VADER, SparkNLP, TextBlob, and Google’s Sentiment Analysis through the Google Cloud NLP API. SparkNLP, TextBlob and Google's API are all based on neural networks, while VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media and also works well on texts from other domains. We found that VADER was the best-performing of the four models we experimented with, so it was deployed in the final application.

All visualizations on the application webpage (line chart, formula score, tables) were implemented using d3.js, chart.js, and CSS.","m4lname":"","industry":"Information","m3lname":"Troeman","dataset":"The dataset used is a public dataset that contains all Reddit posts and comments from 2005 to July 2019 (as of time of writing). We obtained it through a Google search that led us to this link:
https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2019_07

We ingested a subset (all posts and comments from 2019 in the r/news subreddit) of the data into our own GCP BigQuery storage. Our software then queries data directly through BigQuery from this subset, powering our models and program.

Our software can support additional data (e.g. a larger range of dates) if it is provided in a format similar to the Reddit dataset, i.e. rows of posts/comments with timestamps, scores, text content, etc. The platform would also generalize to other subreddits. Analyzing such a larger dataset would require more BigQuery usage (and thus higher cost).","m2uni":"hj2533","m2fname":"Hritik","m3uni":"ft2515"},{"projectname":"Personalized Job Posting Recommendation System","timestring":"Thu Dec 19 03:03:56 2024","m1uni":"by2361","m2lname":"Liu","m1fname":"Binrui","m4fname":"","m1lname":"Yang","m3fname":"Tianyi","description":"
Core Objectives
Our job recommendation system aims to solve several critical problems in the current job search landscape:

1. Accuracy in Job Matching
- Bridge the gap between different terminologies used in resumes and job descriptions
- Match candidates with positions that truly align with their skills and experience

2. Efficiency Enhancement
- Eliminate the need to search across multiple job platforms
- Reduce time spent filtering irrelevant job postings

3. User Experience Improvement
- Provide a clean, unified interface for job search
- Offer real-time processing of resumes and job matches
- Present relevant matches in an easily digestible format

Technical Innovations
Our system introduces several key innovations in job matching technology:

1. Hybrid Embedding Architecture
- Combines semantic understanding (BERT) with precise skill matching (TF-IDF)
- Balances context comprehension with specific requirement matching
- Adapts to both technical and non-technical job descriptions

2. Automated Data Pipeline
- Continuous job data collection and cleaning
- Multi-source deduplication
- Real-time processing and updating of job listings

3. Intelligent Matching Algorithm
- Dual-Embedding Architecture with BERT and TF-IDF
- Weighted consideration of different job aspects

Capabilities
1. Automated data pipeline that scrapes data from multiple job boards (LinkedIn, Indeed, Glassdoor, Ziprecruiter, etc.), then processes and refreshes embeddings daily.
2. Clean user interface with customizable search parameters

Why These Tools Matter

The significance of this research and toolkit development lies in several key areas:

1. Market Need
- Current job boards are inefficient and time-consuming
- Existing matching systems often miss relevant opportunities
- Job seekers struggle with information overload

2. Technical Improvement
- Demonstrates practical applications of modern NLP techniques
- Shows effectiveness of hybrid approach to document matching
- Provides framework for similar matching problems","uni":"by2361","language":"Python, GCP","pid":"202412-5","m4uni":"","analytics":"Analytics:
- Data Fetching: Selenium, JobSpy API
- Data Cleaning: regular expression, NLTK, Pandas, Numpy
- Data Visualization: Matplotlib
- Resume Parsing: PyPDF2, regular expression, OpenAI API

Algorithms:
- BERT (all-mpnet-base-v2)
- TF-IDF (scikit-learn)
- Cosine Similarity (scikit-learn)
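
A minimal sketch of how the BERT and TF-IDF similarities listed above might be combined into one match score, assuming a simple weighted sum of two cosine similarities. The function names, the toy vector layout, and the 0.7 weight are illustrative assumptions and do not come from the project.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_score(resume, job, semantic_weight=0.7):
    # resume and job are dicts holding a dense 'semantic' vector (e.g. from a
    # sentence-embedding model) and a 'tfidf' vector over a shared vocabulary.
    # semantic_weight is a hypothetical tuning knob, not a project value.
    sem = cosine(resume['semantic'], job['semantic'])
    kw = cosine(resume['tfidf'], job['tfidf'])
    return semantic_weight * sem + (1 - semantic_weight) * kw

def rank_jobs(resume, jobs, semantic_weight=0.7):
    # Return job ids ordered from best to worst hybrid match.
    scored = [(hybrid_score(resume, job, semantic_weight), job['id']) for job in jobs]
    return [job_id for _, job_id in sorted(scored, reverse=True)]
```

Raising semantic_weight favors contextual similarity, while lowering it favors exact skill-term overlap.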

System Modules:
- Virtual Machine (GCP)
- Data Pipeline (Apache Airflow)
- Front-end Webpage (Streamlit)","m4lname":"","industry":"Information","m3lname":"Chen","dataset":"Data Collection:
- Used selenium and the jobspy library to scrape job postings
- Fetched data from multiple job boards (LinkedIn, Indeed, Glassdoor, Ziprecruiter, etc)","m2uni":"jl6578","m2fname":"Jiaqi","m3uni":"tc3308"},{"projectname":"Community Detection in Anti-Money Laundering (AML)","timestring":"Sat Dec 22 00:14:54 2018","m1uni":"ax2127","m2lname":"Hua","m1fname":"Anke","m4fname":"","m1lname":"Xu","m3fname":"Siyu","description":"The Challenging Money Laundering Issue
- Groups of collaborating individuals
- Numerous transactions
- Offshore accounts and complex investment vehicles with well-connected transaction behaviors

The Effectiveness of Community Detection tech in AML
- Overcome the problems of focusing on personal information
- Consider collective behavior for each entity and transfer amount information at the same time

The Objectives of this Project
- Detect suspicious and well-connected entities by applying and improving the CESNA algorithm
- Visualize the graphical network of financial transfers and results of community detection","uni":"ax2127","language":"Python, HTML, JavaScript","pid":"201812-1","m4uni":"","analytics":"The main idea behind our community detection algorithm is derived from the Community Detection in Networks with Node Attributes (CESNA) algorithm. The original CESNA algorithm does not take weights into account; with our improvement, the algorithm becomes more robust and successfully differentiates communities in various cases.

We also used HTML and JavaScript to build a visualization tool for understanding our results and algorithm performance better.
","m4lname":"","industry":"Finance","m3lname":"Liu","dataset":"Starting from fraud detection data on Kaggle (https://www.kaggle.com/netzone/eda-and-fraud-detection/data), we performed bootstrapping, modification, and simulation according to the basic logic we defined for money laundering activities to create more connections.

Since financial transaction datasets, especially those with user attributes, are often confidential and sensitive, we simulated and generated our own datasets in two ways using Python. We use one dataset to visualize and examine algorithm performance.","m2uni":"th2706","m2fname":"Tiaoyao","m3uni":"sl4262"},{"projectname":"Background_Aware_Expression_Agent","timestring":"Wed May 6 05:05:16 2026","m1uni":"hg2783","m2lname":"Zhang","m1fname":"Han","m4fname":"","m1lname":"Gao","m3fname":"Shuzhi ","description":"The objective of this project is to build a Background-Aware Expression AI Agent that can explain the same technical project differently to different audiences. In technology teams, communication is often difficult because engineers, product managers, business leaders, and clients care about different parts of the same system. Engineers may focus on architecture, data flow, and implementation details, while product managers care more about workflow, risks, and user impact. Business stakeholders usually care about value, efficiency, and decision-making.

The main innovation of this project is the Expression Layer. Instead of only retrieving information and generating one generic answer, the agent first understands the user’s role and background, then plans how the explanation should be expressed. It controls the level of detail, terminology, tone, structure, and emphasis of the response.

This makes the system different from a standard RAG chatbot. A standard RAG system mainly answers based on retrieved documents. Our system goes one step further: it uses the same grounded project knowledge but rewrites the explanation for different users. For example, the same project can be explained technically to an engineer, operationally to a product manager, and strategically to a business stakeholder.

The key capabilities include:

1. Background-aware personalization
The agent uses user background and project context to avoid generic explanations.
2. Role-aware explanation
The same project can be explained differently for engineers, PMs, business users, and general users.
3. Expression planning
The system decides the appropriate depth, tone, terminology, and structure before generating the final answer.
4. Grounded generation
The explanation is still based on retrieved project documents, reducing hallucination.

This toolkit is important because real technology teams are cross-functional. A project may be technically strong, but if it cannot be clearly explained to different stakeholders, it is difficult to gain trust, adoption, and business value.","uni":"hg2783","language":"Python","pid":"202605-10","m4uni":"","analytics":"Analytics Implemented

This project implemented a role-aware evaluation framework to test whether the agent could explain the same project differently for different audiences. The evaluation focused on whether the generated answer was factually grounded, role-appropriate, and useful for communication.

The main analytics included:

* Concept coverage: whether the answer covered the key project concepts.
* Role fit: whether the explanation matched the target audience, such as engineer, product manager, business user, or general user.
* Source isolation: whether the answer only used relevant retrieved context.
* Citation support: whether the response was supported by retrieved documents.
* Hallucination safety: whether the answer avoided unsupported claims.
* Overall score: a weighted score combining the above metrics.

The project implemented a multi-stage algorithmic pipeline for background-aware and role-aware explanation generation. The system first performs query understanding to identify the user’s intent, target project, target role, and desired explanation depth. It then retrieves information from two memory sources: project knowledge memory and user background memory. Project memory stores GitHub documents, project reports, implementation notes, and evaluation summaries, while background memory stores user role, experience level, and communication preferences.

The retrieval process is metadata-aware. Each document chunk is stored with metadata such as document type, project name, topic category, technical level, and role relevance. At query time, the system combines semantic retrieval with metadata filtering to retrieve evidence that is both factually relevant and audience-relevant. Retrieved chunks are then filtered through an evidence-selection step before being passed into generation, which improves grounding and reduces irrelevant context.
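
A minimal sketch of the metadata-aware retrieval step described above, assuming each chunk carries a list of relevant roles and a precomputed embedding. The field names (roles, embedding, id), the retrieve function, and the top_k default are hypothetical; the evidence-selection step is not shown.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(query_vec, chunks, target_role, top_k=3):
    # Metadata filter first: keep chunks tagged as relevant to the target role.
    candidates = [c for c in chunks if target_role in c['roles']]
    # Semantic ranking second: order the survivors by similarity to the query.
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c['embedding']),
        reverse=True,
    )
    return [c['id'] for c in ranked[:top_k]]
```

Filtering before ranking keeps audience-irrelevant chunks out of the candidate pool entirely, rather than relying on similarity scores alone to push them down.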

After retrieval, the system first generates a role-neutral base explanation. This base explanation serves as the factual anchor of the response. Then, the Expression Planning algorithm decides how the answer should be communicated, including explanation depth, tone, terminology, examples, structure, and whether to emphasize implementation details, product workflow, or business impact. Finally, the role-aware rewriting algorithm transforms the base explanation into a final answer for the target audience while preserving the same underlying facts.

The system also includes hallucination-control and evaluation logic. The generated responses are scored using concept coverage, role fit, source isolation, citation support, and hallucination safety. The overall score is computed as a weighted combination of these metrics, with concept coverage and role fit weighted most heavily because the project’s core goal is adaptive technical communication. This design makes the agent more than a standard RAG chatbot: it is a controlled communication system that retrieves grounded knowledge and expresses it differently for engineers, product managers, business stakeholders, and general users.
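
A minimal sketch of the weighted overall score, assuming each metric is already scaled to the range 0 to 1. The specific weights below are hypothetical; the text only states that concept coverage and role fit are weighted most heavily.

```python
# Hypothetical weights: only their relative ordering is taken from the text.
WEIGHTS = {
    'concept_coverage': 0.30,
    'role_fit': 0.30,
    'source_isolation': 0.10,
    'citation_support': 0.15,
    'hallucination_safety': 0.15,
}

def overall_score(metrics, weights=WEIGHTS):
    # Fail loudly if a metric is missing instead of silently scoring it as zero.
    missing = set(weights) - set(metrics)
    if missing:
        raise ValueError('missing metrics: ' + ', '.join(sorted(missing)))
    # Normalize by the weight total so the result stays in the same 0-1 range.
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total
```

Dividing by the weight total means the weights do not have to sum to exactly 1 when they are retuned.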

System Modules Implemented

The project was implemented as a modular agent system. The main modules included:

1. Document Ingestion Module

This module loads project documents, GitHub materials, resumes, background profiles, and other text-based files into the system.

2. Chunking and Metadata Module

This module splits documents into smaller chunks and attaches metadata.

3. Vector Store / Retrieval Module

This module stores embedded document chunks and retrieves the most relevant context for a user query.

4. Structured Background Memory Module

This module stores structured user-background information, such as role, skill level, prior project experience, and communication preference.

5. Query Understanding Module

This module interprets the user’s question and identifies what type of answer is needed.

6. Orchestrator Module

This module coordinates the full pipeline: query understanding, background retrieval, project retrieval, base answer generation, expression planning, and final response generation.

7. Base Explanation Module

This module generates a factual project explanation based on retrieved project context.

8. Expression Layer

This is the core module of the project. It adapts the base explanation into a role-specific explanation for different audiences.

9. Evaluation Module

This module scores the generated responses using concept coverage, role fit, source isolation, citation support, hallucination safety, and overall score.

10. Front-End / Demo Interface

The front-end interface allows users to enter a query, select or simulate a target role, and view the generated role-aware explanation.

Visualization Implemented

The project included basic visualization and interface outputs to make the system easier to understand and evaluate.

The main visualizations included:

* System architecture diagram showing the pipeline from document ingestion to final Expression AI response.
* Module workflow diagram showing how retrieval, background memory, orchestrator, and expression layer interact.
* Role comparison output showing how the same project explanation changes for different roles.
* Evaluation score table comparing baseline RAG and role-aware Expression AI.
* Metric comparison chart showing concept coverage, role fit, citation support, source isolation, hallucination safety, and overall score.
* Demo interface showing the user query, retrieved context, expression plan, and final personalized answer.","m4lname":"","industry":"Information","m3lname":"Yang","dataset":"For this project, the tested dataset was a document-based dataset used to evaluate whether the Expression AI Agent could understand project knowledge and explain it differently to different users.

The dataset mainly included four types of documents:

1. Public GitHub repositories and papers
We collected several open-source GitHub repositories and their related project documentation, such as README files, architecture descriptions, implementation notes, and system workflow explanations. These documents were used as project knowledge sources for the retrieval system.
2. Public online resumes / profiles
We also collected publicly available resume-style or profile-style documents from the internet. These were used to simulate different user backgrounds, such as technical users, product-oriented users, and business-oriented users.
3. Our own previous project documents
We used documentation from our previous data science / AI projects, including project reports, system design notes, implementation descriptions, and evaluation summaries. These documents helped test whether the agent could explain real technical projects with enough detail.
4. Our own resumes / background profiles
We used our own resumes and manually written background profiles to test the background-aware personalization component. These profiles helped the agent understand the user’s technical level, project experience, and preferred explanation style.
","m2uni":"cz2925","m2fname":"Chi ","m3uni":"sy3321"},{"projectname":"PUBG Finish Placement Prediction","timestring":"Sat Dec 22 04:54:20 2018","m1uni":"yw3152","m2lname":"Zhang","m1fname":"Yingling","m4fname":"","m1lname":"Wang","m3fname":"Anran","description":"PlayerUnknown's BattleGrounds (PUBG) has enjoyed massive popularity. With over 50 million copies sold, it is the fifth best-selling game of all time and has millions of monthly active players, which makes it an interesting subject to explore. In this report, we first perform Exploratory Data Analysis (EDA) and visualization to get an intuitive view of the dataset's distribution. We then remove cheaters and outliers so that the model generalizes. Finally, we use feature engineering to pick out the most important and correlated features for the model, and use GridSearchCV to search for the best parameters for our LightGBM decision-tree model of a user’s finish placement.","uni":"yw3152","language":"Python, Google Cloud Platform","pid":"201812-41","m4uni":"","analytics":"LightGBM is a gradient boosting framework that uses tree-based learning algorithms[1]. It is designed to be distributed and efficient with the following advantages:
1. Faster training speed and higher efficiency.
2. Lower memory usage.
3. Better accuracy.
4. Support of parallel and GPU learning.
5. Capable of handling large-scale data.
","m4lname":"","industry":"Information","m3lname":"Li","dataset":"We obtained 65,000 anonymized player records from Kaggle, a public platform. Each record includes 28 in-game performance features and 1 target, the player's finish placement in PUBG.","m2uni":"xz2730","m2fname":"Xinrui ","m3uni":"al3804"},{"projectname":"Flight Data Analytics","timestring":"Thu Dec 21 22:31:28 2023","m1uni":"sdp2158","m2lname":"Buche","m1fname":"Simran","m4fname":"","m1lname":"Padam","m3fname":"Harsh","description":"Objectives: Provide a real-time risk assessment and tracker for flights using auxiliary weather data across latitude and longitude coordinates of a given flight path.

Innovations: Most open-source flight trackers do not include weather data information for their flights along the flight route. Those that do are often paid options. Furthermore, there are applications that examine delays at an airport level but do not look into which flights are likely to be causing those delays. Our flight tracker combines both flight route data and weather parameter data such as visibility, precipitation, etc. to create a holistic flight tracker.

Capabilities: Interactive user interface with real time and scheduled flight tracking with risk assessment using plotly and dash. Data handled with pyspark and pandas. Designed with CSS.

Importance: As per FAA, approximately 75% of the system-impact delays of more than 15 minutes were caused by weather from June 2017 to May 2022. If weather conditions are known in advance, airlines and customers can avoid inconvenience due to delays. Hence, a real-time tracker is created to facilitate the smooth functioning of airplane systems. The tracker demonstrates weather conditions for each latitude and longitude coordinate in the flight route. ","uni":"sdp2158","language":"Python - Pyspark, Dash, Plotly, Pandas, CSS","pid":"202312-02","m4uni":"","analytics":"Geographic Scatter Plot generated using Plotly with different map projections
Custom weather risk algorithm derived for this specific application using distributional percentiles and distribution of distributions
Custom user-defined functions for Spark to calculate flight routes and format data using haversine distance","m4lname":"","industry":"Information","m3lname":"Benahalkar","dataset":"We used Datfiles for airport listings and route data, AviationStack for bulk flight schedule data, FlightRadar for arrivals and individual flight data, and Open Meteo API for weather data.","m2uni":"ssb2215","m2fname":"Shriniket","m3uni":"hb2776"},{"projectname":"AI Academic Advisor Chatbot for Columbia University","timestring":"Thu Jan 1 04:08:35 2026","m1uni":"jz3850","m2lname":"Zhang","m1fname":"Junfeng","m4fname":"","m1lname":"Zou","m3fname":"","description":"This project presents an intelligent academic advising system for Columbia University that combines Retrieval-Augmented Generation (RAG), semantic search, and hybrid intent detection to provide personalized course recommendations. The system addresses the challenge of navigating Columbia’s extensive course catalog of 8,120+ courses by implementing a conversational AI interface that understands natural language queries and generates context-aware recommendations using local language models. We developed a novel hybrid intent detection approach that combines regex-based pattern matching with intelligent parameter extraction, achieving 100% accuracy across our comprehensive test suite. The system demonstrates significant practical improvements with response times under 5 seconds and zero hallucination rate through strict prompt engineering. Our evaluation shows that the hybrid approach provides reliable, deterministic intent classification while maintaining the flexibility needed for natural language understanding.
The system successfully handles instructor queries, specific course lookups, level-filtered searches, and topic-based exploration with student profile personalization.","uni":"jz3850","language":"Programmed using Python and deployed on Google Cloud Platform","pid":" 202512-04","m4uni":"","analytics":"Our project implements a full-stack AI-driven academic advising system that integrates multiple analytics techniques, algorithms, system modules, and visual components.
From an analytics and algorithmic perspective, we implemented semantic text embedding for course descriptions, vector similarity search using FAISS, and a hybrid retrieval strategy that combines semantic similarity with structured filtering over course metadata stored in MongoDB. A ranking and de-duplication algorithm was applied to prioritize relevant courses and remove redundant entries. We further adopted a retrieval-augmented generation (RAG) framework to ground large language model (LLM) responses in retrieved data and reduce hallucination.
From a system design perspective, the system consists of a data preprocessing and storage module, a semantic retrieval and hybrid search module, an LLM reasoning module based on LLaMA 3.2 (1B), and a backend orchestration layer that manages query processing, retrieval, and response generation. A web-based user interface module enables interactive, multi-turn academic advising.
For visualization, we implemented a chat-based web interface that displays structured course results, instructor information, schedules, and advisor-style explanations generated by the LLM. The UI supports real-time interaction, result comparison, and interpretable presentation of recommendations.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Our system integrates three primary data sources from Columbia University’s course catalog. The course data contains 8,120 courses with attributes including call number (unique identifier), course code (department and level such as \"EECS E6895\"), title, instructor name, department, credit points, full course description, prerequisites, and academic term. The enrollment data contains 8,120 enrollment records with historical enrollment information for future enhancements. The instructor data includes 14,699 instructor records with faculty names and department affiliations.
So far, our platform only supports JSON files.
https://github.com/soid/columbia-catalog-data
","m2uni":"yz4843","m2fname":"Yangyang","m3uni":""},{"projectname":"Amazon Product Recommendation Assistant","timestring":"Fri May 3 05:29:13 2024","m1uni":"sl5394","m2lname":"Qiu","m1fname":"Siyu","m4fname":"","m1lname":"Li","m3fname":"","description":"Objectives:
1. Enhance Customer Interaction: Create a user-friendly webpage where customers can ask questions about Amazon products, which will be automatically answered by the agent.
2. Improve Decision-Making for Suppliers: By analyzing customer inquiries and interactions, the system will help suppliers understand market trends and customer needs more effectively.
3. Boost Customer Service Efficiency: Equip the customer service department with new tools and technologies such as AI, data analysis, and blockchain to improve work quality and efficiency.
4. Increase Customer Satisfaction and Loyalty: By providing timely and relevant product recommendations and answers, aim to enhance customer satisfaction and stickiness.

Innovations:
1. Real-Time Data Handling: Address challenges related to the high volume, velocity, and variety of data by integrating real-time data processing and analysis capabilities.
2. Multidisciplinary Approach: Utilize a blend of techniques from statistics and other fields to better understand and cater to customer needs and preferences.
3. Advanced AI and ML Techniques: Employ cutting-edge AI methodologies like NLP, machine learning, deep learning, and reinforcement learning to refine the system's ability to process queries and generate insights.
4. Dynamic Content Generation: Use sentiment analysis and keyword extraction (TF-IDF) to dynamically create personalized responses and recommendations based on user queries.
Capabilities:
1. Data Collection and Analysis: Implement AI-driven tools like web crawlers and API interfaces to gather, clean, and analyze vast amounts of product and financial data.
2. Personalized User Experience: Develop a system that can not only respond to user queries with high accuracy but also predict and understand user preferences through sentiment analysis and keyword trends.
3. Scalable and Adaptable System: Design the backend to handle a large and constantly changing dataset, ensuring the system is adaptable to varying customer needs and market conditions.
4. Interactive and Engaging Interface: Construct a frontend that facilitates easy interaction with the agent, allowing users to ask questions and receive answers and recommendations efficiently.
The research and toolkits in the project are critical for several reasons, each contributing significantly to the system's overall effectiveness and efficiency in addressing the needs of both customers and businesses. Here's a breakdown of why these elements are important:

1. Advanced AI and Machine Learning Techniques
- Personalization: AI-driven tools, especially those utilizing machine learning and deep learning, can analyze user data and interactions to offer personalized recommendations and responses. This not only improves the customer experience by making it feel more tailored and relevant but also increases the likelihood of customer retention and satisfaction.
- Predictive Analytics: These technologies enable the system to predict future behaviors and preferences based on historical data, which can help businesses anticipate market trends and customer needs more accurately.

2. Natural Language Processing (NLP)
- Improved Communication: NLP allows the system to understand and generate human-like responses to customer queries. This capability is fundamental in automating customer service, reducing response times, and freeing human agents for more complex issues.
- Sentiment Analysis: By gauging the sentiment behind customer inquiries or feedback, the system can offer more nuanced responses and alert human operators to potential customer dissatisfaction or delight, which can be crucial for customer relationship management.

3. Data Analysis and Visualization
- Insightful Decision-Making: The ability to quickly process and visualize data helps businesses understand complex scenarios and make informed decisions. This is particularly important in dynamic environments like online retail, where consumer preferences and market conditions can change rapidly.
- Operational Efficiency: Efficient data handling and visualization reduce the time and resources required to derive actionable insights from large datasets, improving overall business efficiency.
4. Integration of Diverse Data Types
- Comprehensive Analysis: Handling various data types (structured, semi-structured, and unstructured) allows for a more holistic analysis of customer interactions, market conditions, and product performance. This diversity in data integration ensures that the insights generated are comprehensive and encompass all relevant facets.
- Complex Problem-Solving: The ability to merge and analyze different data types facilitates complex problem-solving, enabling the system to address multifaceted issues that may involve various aspects of the business and customer experience.

5. Multidisciplinary Approaches
- Broader Perspectives: Incorporating knowledge from disciplines such as statistics helps in understanding the underlying human behaviors and patterns that influence customer decisions. This deeper understanding can lead to better product designs, marketing strategies, and customer service approaches.
- Enhanced System Design: A multidisciplinary approach contributes to creating a system that is not only technically proficient but also empathetic and user-friendly, aligning with the preferences of users.

These research areas and toolkits are pivotal in crafting a system that not only meets the current demands of e-commerce but is also adaptable and forward-thinking, capable of evolving with technological advancements and changing market dynamics.
","uni":"sl5394","language":"Language: Python, HTML; Platforms: VsCode, Google Chrome(Frontend)","pid":"202405-07","m4uni":"","analytics":"Analytics:
1. Average Score Calculation: calculates the average of the sum of two columns, ratingScore and sentiment_score, from CSV files. This involves basic statistical analysis to derive a mean value, which is used to assess overall performance or sentiment related to items.
2. Data Merging and Enrichment: There's a merging process where average scores data is combined with item details data based on an identifier (asin). This enriches the item data with calculated average scores, which could be crucial for further analysis or decision-making.
3. Combined Average Calculation: calculates a combined average score using the item's rating and the computed average score. This could be used to provide a more holistic view of the item's performance or appeal to customers.
4. Sentiment Analysis: uses NLTK's Sentiment Intensity Analyzer to calculate the sentiment scores of product reviews. This involves computing the 'compound' score which is a normalized score of sentiment polarity.
5. Color Detection: includes functionality to detect color names mentioned in product titles. It checks if words in a string are valid color names using the colour library.
6. Price Data Extraction: extracts numerical values from a string representation of a list of dictionaries stored in the 'prices' column, which appears to be related to product pricing information.
7. Data Collection and Aggregation: uses the Rainforest API to fetch product search results from Amazon based on a specified search term, and then aggregates the product links for further analysis.
8. Review Collection: further collects detailed reviews for products starting from the fourth item in the search results, using an Apify Actor, which is designed to scrape review data from given product URLs.
9. Natural Language Processing (NLP): uses spaCy's NLP capabilities to perform Named Entity Recognition (NER) and noun extraction from product titles. This is used to derive subcategories from the text data, which enriches the dataset with more granular information.
10. Text Classification: A Multinomial Naive Bayes classifier is trained using a TF-IDF vector representation of product titles to predict the category of a product. This forms the basis for automated category classification based on textual data.
11. Item Recommendation System: implements an item recommendation system. It uses cosine similarity to measure the similarity between the TF-IDF vectors of product titles, facilitating the recommendation of similar items based on text content.
12. Web Scraping: extracts data from an Amazon product page. It collects information such as product name, author (or equivalent attribute), ratings, number of customer ratings, and price.
13. Data Aggregation: aggregates the data across pages (although no_pages is set to 2, implying it collects data from two pages), compiles all the data into a single list, and then converts this list into a DataFrame.
14. Price Data Cleaning and Conversion: adjusts the actual_price column in a DataFrame by removing currency symbols and commas, converting it into a float format suitable for numerical operations.
15. Named Entity Recognition (NER): Utilizes a pre-trained RoBERTa model from Hugging Face's transformers library, specifically fine-tuned for detecting proper nouns in text, which can be essential for extracting specific entities or keywords.
16. Noun Extraction: Using NLTK's tokenization and POS-tagging to extract nouns from the text. This is useful for various applications such as content categorization, keyword extraction, and information retrieval.
Algorithms:
1. Basic arithmetic operations (sum and mean calculations) and conditional logic.
2. Sentiment Intensity Analysis: involves determining the emotional tone behind a series of words, using predefined models in NLTK.
3. Color Validation: The check_color function uses the colour library to validate whether a string is a recognized color name, handling exceptions if the color name is invalid.
4. Data Transformation and Extraction: Transforming string data into usable formats (like converting string representations of lists into actual lists) and extracting specific values from complex data structures using Python’s ast.literal_eval.
5. TF-IDF Vectorization: This technique transforms text data into a format suitable for machine learning models, emphasizing words that are unique to a document in a collection of documents (corpus).
6. Naive Bayes Classification: Utilized here for its effectiveness in text classification tasks, especially with high-dimensional data.
7. Cosine Similarity: Used to calculate a numeric value that denotes the similarity between two documents. In this case, it is used to find products whose titles are semantically similar to a query product title.
8. Named Entity Recognition (NER): Employed to identify entities like products or organizations in text, which can help in extracting useful features from product titles.
9. Sentiment Analysis using VADER: A lexicon and rule-based sentiment analysis tool that is part of the NLTK suite.
10. Color Validation: Uses the colour Python package to validate string inputs as legitimate colors.
11. Token Classification: Utilizes a transformer-based model for token classification tasks, providing detailed insights into the text's structure by identifying proper nouns.
12. POS Tagging: Applies NLTK’s part-of-speech tagging to identify nouns in a text, which are then filtered based on predefined conditions.
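As an illustration of items 5 and 7 above (TF-IDF vectorization and cosine similarity for recommending similar titles), here is a minimal sketch. It is not the project's code: the project reports using scikit-learn, while this version uses only the Python standard library, and the whitespace tokenizer, the smoothed IDF form ln((1 + n) / (1 + df)) + 1, the function names, and the sample titles are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    # Tokenize by lowercasing and splitting on whitespace (a simplification).
    docs = [title.lower().split() for title in titles]
    n = len(docs)
    # Document frequency: the number of titles that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight = term count * smoothed inverse document frequency.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(titles, query_index, k=2):
    # Rank every other title by similarity to the query title.
    vecs = tfidf_vectors(titles)
    scored = [
        (cosine(vecs[query_index], v), i)
        for i, v in enumerate(vecs)
        if i != query_index
    ]
    scored.sort(reverse=True)
    return [titles[i] for _, i in scored[:k]]


# Hypothetical product titles, for illustration only.
titles = [
    'red cotton shirt',
    'blue cotton shirt',
    'steel water bottle',
    'red wool scarf',
]
print(recommend(titles, 0, k=2))
```

Titles that share more of the rarer terms with the query rank higher; a title with no shared terms scores 0.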
System Modules:
1. File Handling and I/O Operations: reads from and writes to CSV files, which involves file system operations using Python’s standard os and pandas library functionalities.
2. Error Handling: basic error handling during the file reading process to manage exceptions that might occur, such as file not found or data parsing errors.
3. Directory Management: Using os.listdir to navigate directories.
4. Data Manipulation: Heavy use of pandas for reading from and writing to CSV files, transforming data frames, and applying functions to data columns.
5. HTTP Requests: Utilizes the requests module to make HTTP requests to the Rainforest API to retrieve Amazon product search results.
6. Data Handling with Pandas: Uses the pandas library to organize the data fetched from the API into a DataFrame, which is then saved into a CSV file for each product based on its ASIN (Amazon Standard Identification Number).
7. Apify Integration: Integrates with Apify using the apify_client to automate the process of fetching and handling web-scraped data, including managing complex configurations like proxies.
8. SpaCy: An industrial-strength natural language processing library used for text processing and entity recognition.
9. Scikit-learn: Utilized for creating the machine learning pipeline, including TF-IDF vectorization, the Naive Bayes classifier, and implementing train/test splits for model validation.
10. Pandas: Extensively used for data manipulation and reading/writing CSV files. It allows the aggregation and transformation of dataset features needed for further processing.
11. OS Module: Used for directory and file manipulation, helping in managing dataset files.
12. Requests: Utilized for making HTTP requests to fetch web pages. It handles the network interaction needed to retrieve the HTML content.
13. BeautifulSoup (from bs4): Used for parsing HTML content and extracting data. It navigates through the HTML tree and retrieves the required information based on specified tags and attributes.
14. Pandas: Used for creating a DataFrame from the scraped data, which allows for easier manipulation, analysis, and storage of structured data.
15. Numpy: Although imported, it's not directly used in the script shown. Typically, it would be used for numerical operations.
16. Matplotlib and Seaborn: These are visualization libraries, but no visualization code is executed in the provided script. They could be used to plot and examine trends in the data, such as price distributions or ratings.
17. Regular Expressions (re): Imported but not used in the provided snippet. It's commonly used for searching patterns in text, which can be helpful in data cleaning or extraction tasks.

Visualization:
The retrieved products will be presented in the form of pictures.","m4lname":"","industry":"Information","m3lname":"","dataset":"Dataset: 1. One existing Amazon product dataset (from Kaggle);
2. Real-time data from Amazon (crawled through a nested API pipeline consisting of Rainforest and Apify).

Other data: Not available","m2uni":"xq2234","m2fname":"Xinyu","m3uni":""},{"projectname":"Wildfire Prediction & Sentiment Analysis","timestring":"Fri Dec 16 23:05:55 2022","m1uni":"fm2750","m2lname":"Zhang","m1fname":"Fang","m4fname":"","m1lname":"Ma","m3fname":"Asura","description":"Wildfires have always been a major natural disaster, especially in recent years, as global warming has become a key driver of more frequent wildfires. The goal of this project is to provide insights into wildfire behavior through exploratory data analysis and to predict the cause and severity of wildfires from environmental and geographical information, helping to preemptively reduce future wildfire risks. The novelty of this project is that it uses the Twitter API to run sentiment analysis on disaster tweets in specific geolocations, which helps predict people’s reactions to wildfires.","uni":"zs2525","language":"Platform: GCP; Languages: Python, CSS","pid":"202212-11","m4uni":"","analytics":"Machine Learning Model:
1. Random Forest
3. Logistic Regression
4. Support Vector Machine
5. Gradient Boosting Tree
Oversampling through Synthetic Minority Oversampling Technique (SMOTE) due to highly unbalanced data distribution
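The core idea of SMOTE can be sketched as follows: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified standard-library illustration, not the project's implementation; the function name, the parameter defaults, and the toy data are assumptions.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    # minority: list of feature vectors (lists of floats), at least 2 points.
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point
        # (squared Euclidean distance, excluding the base itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features, for illustration only.
minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote(minority, n_new=5)
print(len(new_points))
```

Because every new point is an interpolation, the synthetic samples stay inside the region already covered by the minority class.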

System Modules:
1. Frontend: Plotly for data visualization and user interaction
2. Twitter workflow: Twitter API, BigQuery, Data Studio
3. Machine learning workflow: Compute Engine

Visualization:
1. Wildfire occurrence and severity on a map by state
2. Wildfire causes breakdown on a scatter plot
3. Twitter wildfire trend analysis with retweet counts
4. Twitter wildfire trend data visualization with latitude/longitude coordinates
","m4lname":"","industry":"Information","m3lname":"Shen","dataset":"Wildfire occurrence data from the USDA Forest Service, which contains more than 2 million records of wildfires dating back to 1992 with the features fire_size, latitude, longitude, discovery_date, state, county, and fire_cause. Twitter data collected with wildfire-related hashtags, ranked by number of retweets.
","m2uni":"yz4130","m2fname":"Yi","m3uni":"zs2525"},{"projectname":"Analysis and Inﬂuence of Opioid Crisis","timestring":"Sat Dec 18 04:51:32 2021","m1uni":"sl4921","m2lname":"Li","m1fname":"Shuo","m4fname":"","m1lname":"Liu","m3fname":"Yujing","description":"Opioids are a class of drugs that include the illegal drug heroin, and pain relievers available legally by prescription, such as oxycodone, codeine, morphine. A national crisis regarding prescription and recreational opioids is going on recently. Opioid-involved deaths rose significantly in recent years. If this trend continues and spreads, there would be devastation on businesses and civil health.

Out of concern for public health, and given the medical and business stakes, we built mathematical models to predict possible outbreaks of different kinds of opioids and to examine the influence of opioids.
","uni":"sl4921","language":"Python and GCP","pid":"202112-38","m4uni":"","analytics":"We first conducted some data cleaning. We dropped the rows with missing data and joined tables by geo code or county name. We selected the five most popular opioids and the top 3 counties for analysis. We used several kinds of regression models with data from 2010 to 2015 to predict the reported cases in 2016 and 2017. The ground truth data are in parentheses. We examined whether drug report cases increase steadily over time. Green marks good predictions and red marks bad ones. We found that Oxycodone and Buprenorphine are basically in line with linear growth. However, the results for fentanyl are far from the ground truth. We found an explosion in the number of fentanyl reports in recent years in these counties. Combining this with the fact that the base of heroin report cases is large, we predicted that an opioid crisis would most likely occur with heroin and fentanyl in the given states in the near future.

To improve our regression, we need to consider the features that influence the increase of drug reports. But hundreds of socio-demographic features are available in our dataset, so feature selection is needed. First, we did a rough elimination by setting the correlation threshold to 0.3, which left about 20 features that influence the increase of drug reports. Then we performed another round of correlation analysis and got the picture on the left. These are the features that are relatively independent of each other. Another thing we need to consider before tuning our regression model is the importance of features. To obtain an importance list, we calculated the correlation coefficient between the increase of drug reports and each feature. Then we regularized them to get the importance list on the right.
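The correlation-based screening described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the project's code: it assumes the 0.3 cutoff applies to the absolute Pearson correlation with the target, and it turns the surviving correlations into importances by scaling them to sum to 1, which is only one possible reading of the final step. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    # Pearson correlation coefficient between two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def select_features(features, target, threshold=0.3):
    # Keep features whose absolute correlation with the target
    # reaches the threshold (assumed interpretation of the 0.3 cutoff).
    scores = {name: pearson(col, target) for name, col in features.items()}
    kept = {name: r for name, r in scores.items() if abs(r) >= threshold}
    total = sum(abs(r) for r in kept.values())
    # Scale absolute correlations so the importances sum to 1.
    return {name: abs(r) / total for name, r in kept.items()}


# Toy data: three candidate features and a target, for illustration only.
features = {
    'a': [1, 2, 3, 4],
    'b': [1, -1, -1, 1],
    'c': [4, 3, 2, 1],
}
target = [2, 4, 6, 8]
importance = select_features(features, target)
print(sorted(importance))
```

In this toy case feature b is uncorrelated with the target and is dropped, while a and c receive equal importance.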

We used graph neural networks (GNN) to aggregate feature vectors and make new predictions, based on our assumption that opioid increases are caused not just by population composition but also by civil migration. We take the migration rate into consideration and use a GNN to train the regression models, where the rate between two counties is calculated by a proportionally weighted sum of the neighbors’ feature vectors. We set the aggregation rate to 0.1, which keeps most of the original features while taking the neighbors’ influence into account, and obtain the outbreak prediction.
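One way to read the aggregation step above is as a single round of neighbour mixing: each county keeps 90% of its own feature vector and takes 10% from a migration-weighted average of its neighbours. The sketch below illustrates that reading only; the exact weighting used in the project is not given, and the function name, the data layout, and the toy values are assumptions.

```python
def aggregate(features, migration, rate=0.1):
    # features: county -> feature vector (list of floats)
    # migration: county -> {neighbour county: migration flow}
    updated = {}
    for county, own in features.items():
        flows = migration.get(county, {})
        total = sum(flows.values())
        if total == 0:
            # No recorded migration: keep the original features.
            updated[county] = list(own)
            continue
        # Weighted average of neighbour features, with weights
        # proportional to the migration flow from each neighbour.
        mixed = [0.0] * len(own)
        for neighbour, flow in flows.items():
            for i, value in enumerate(features[neighbour]):
                mixed[i] += (flow / total) * value
        updated[county] = [
            (1 - rate) * o + rate * m for o, m in zip(own, mixed)
        ]
    return updated


# Toy example with three counties and two features each.
features = {'A': [1.0, 0.0], 'B': [0.0, 1.0], 'C': [2.0, 2.0]}
migration = {'A': {'B': 30, 'C': 10}}
result = aggregate(features, migration)
print(result['A'])
```

With a rate of 0.1 the aggregated vector stays close to the original, so migration acts as a small correction rather than replacing the county's own profile.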

To find out the potential impact of different opioids on heart disease such as coronary heart disease and stroke, we computed the correlation between heart disease prevalence and the increase of opioid reports. We use combined prevalence to calculate the correlation coefficient and use its difference from the majority vote of each county to estimate the error. The correlations between different opioids and coronary heart disease and stroke are shown in the figure with error bars. Heroin is positively related to heart disease prevalence in all 5 states shown, while other opioids do not have a clear relationship across these states, suggesting that heroin abuse may contribute to these heart diseases. This also matches common sense.

","m4lname":"","industry":"Social Science-Government","m3lname":"Chen","dataset":"The NFLIS database records the increase of total drug reports across different counties over 7 years from 2010 to 2017. This slide shows the schema of the dataset. We used the FIPS code of counties as the identifier; opioid names and TotalDrugReportCounty are used to make the prediction. In addition, the NFLIS database also provides datasets on the population of different socio-demographic groups in each county, like this schema. We used this to enhance our prediction model and to get clues about which groups are more likely to abuse opioids.

The USCensus database records the resident population of counties annually. From this dataset, we can obtain the population of each county. The civil migration dataset in the USCensus database shows the in-, out-, and gross migration between different counties across states. We used this dataset together with the resident population dataset to calculate the graph neural network feature aggregation weights.

Rates and Trends in Coronary Heart Disease and Stroke Mortality dataset documents rates and trends in local coronary heart disease (CHD) and stroke death rates. Specifically, this report presents county (or county equivalent) estimates of stroke and CHD death rates in 1999-2018 and trends during three intervals (1999-2005, 2005-2011, 2011-2018) by age group (ages 35–64 and 65 and older). The rates and trends were estimated using a Bayesian spatiotemporal model and are smoothed over space, time, and age group. Rates are age-standardized. Data source: National Vital Statistics System.

","m2uni":"yl4736","m2fname":"Yu","m3uni":"yc3851"},{"projectname":"Global Climate Analytics","timestring":"Thu Dec 18 01:51:01 2025","m1uni":"zw3162","m2lname":"Zhu","m1fname":"Zhiliang","m4fname":"","m1lname":"Wang","m3fname":"Hong","description":"The primary objective of this project is to engineer a high-performance analytical framework capable of distilling 70 years of raw NOAA climate records into actionable environmental intelligence. By establishing a cloud-native environment on Google Cloud Platform and leveraging the distributed computing power of PySpark, we aim to transform heterogeneous, unstructured data into a refined data lake. This infrastructure is specifically designed to handle the longitudinal complexity of the GHCN-Daily dataset, enabling the precise identification of decadal warming trends and the quantification of \"climate shocks\" that are often obscured in smaller-scale models.
The innovation of this toolkit lies in its transition from static mean-value reporting to dynamic anomaly detection. Unlike conventional climate tools, our system implements specialized logic—such as day-over-day temperature delta tracking and synchronized dual-city plotting—to isolate abrupt thermal spikes and intense precipitation bursts. This is achieved through a decoupled architecture that separates heavy-duty backend processing from an interactive Plotly.js-based frontend. This design ensures that petabyte-scale historical data can be explored with near-zero latency in a standard web browser. Ultimately, these capabilities are vital because they bridge the gap between massive, silent data repositories and the intuitive visual narratives required for modern climate advocacy and urban infrastructure resilience.","uni":"zw3162","language":"Python, HTML","pid":"202512-5","m4uni":"","analytics":"The system architecture is comprised of several integrated modules designed to handle the full lifecycle of big data, from ingestion to interactive visualization. At the core of the backend is a distributed ETL (Extract, Transform, Load) module implemented via PySpark on Google Cloud Dataproc, which executes algorithms for data sanitization, unit normalization, and schema pivoting. This module transforms narrow, raw NOAA records into a wide Parquet-based data lake, enabling efficient multi-variable queries. To extract scientific value from this structured data, we implemented an analytical engine focused on \"Climate Shock\" detection. 
This sub-module utilizes Spark Window Functions and lag-based algorithms to calculate daily temperature deltas, identifying abrupt $10^\circ\text{C}$ shifts, and applies threshold-based filtering to quantify the frequency of extreme heatwaves ($>40^\circ\text{C}$), cold spells, and heavy precipitation events ($>25\text{mm}$). The visualization layer is a decoupled web-based dashboard powered by a Flask-driven API and Plotly.js for the frontend interface. The dashboard features a synchronized dual-city analysis module, allowing for the real-time comparison of chronological trend lines between disparate urban centers like Boston and Miami. To ensure high performance, we implemented a pre-aggregation visualization strategy where the system fetches optimized JSON payloads from Google Cloud Storage, bypassing the latency of raw data execution. The interface includes interactive capabilities such as multi-scale zooming, dynamic tooltips for specific data points (e.g., precise 2020 temperature readings), and built-in utility functions for exporting PNG screenshots and raw data subsets. This modular design ensures that the heavy-duty algorithmic processing remains on the cloud backend while the frontend provides a lightweight, responsive diagnostic environment for the user.","m4lname":"","industry":"Information","m3lname":"Chen","dataset":"The primary dataset utilized for testing and validation in this study is the Global Historical Climatology Network (GHCN)-Daily, an authoritative meteorological archive managed by the National Oceanic and Atmospheric Administration (NOAA). This dataset was acquired by ingesting raw, unstructured weather logs from NOAA’s public repositories into a Google Cloud Storage (GCS) environment, covering a 70-year longitudinal span from 1950 to 2020 with over 50 million records.
Our software specifically tested core variables including daily maximum and minimum temperatures (TMAX/TMIN), liquid-equivalent precipitation (PRCP), and snowfall totals (SNOW). Beyond these primary metrics, the system’s modular Parquet-based architecture is engineered to support a wide array of additional environmental streams, such as wind speed (AWND), relative humidity, and evaporation rates. Because the backend utilizes a decoupled schema design, the toolkit can also be extended to ingest non-meteorological data, including Air Quality Index (AQI) or urban infrastructure metrics, by simply appending new variables to the existing station-day keys without requiring a fundamental redesign of the PySpark ETL pipeline or the interactive dashboard.","m2uni":"hz3087","m2fname":"Haojia","m3uni":"hc3605"},{"projectname":"Feelings & Emotions in LLMs ","timestring":"Wed May 14 03:36:52 2025","m1uni":"ems2359","m2lname":"Twan","m1fname":"Emma","m4fname":"","m1lname":"Sombers","m3fname":"Nikos","description":"Objectives:
The objective of this project is to evaluate and improve the emotional alignment of large language models (LLMs) when responding to emotionally charged real-world scenarios. Specifically, our goals for this project were:
Quantifying emotional alignment using both classification accuracy and PANAS affective ratings
Identifying and addressing ambiguity in complex emotional responses
Prototyping adaptive agents like an emotionally aware chatbot that can dynamically respond to emotional uncertainty
Exploring reinforcement learning to fine-tune emotional alignment based on reward-driven empathy signals

Innovations:
Dual-layer emotional evaluation: We use both external (DistilBERT emotion classifier) and internal (PANAS ratings) frameworks to assess alignment, offering a multi-dimensional view of affective behavior
Confidence-aware adaptation & emotionally aware chatbot prototype: In our emotionally aware chatbot, we introduce a novel use of classifier softmax gaps to measure model uncertainty and improve chatbot interactions to be more emotionally nuanced
Reinforcement learning for empathy: By fine-tuning a model with Proximal Policy Optimization (PPO), we demonstrate that reward shaping can push LLMs toward more emotionally aligned responses

Capabilities
Ability to simulate and evaluate emotionally appropriate responses across a wide spectrum of situations, even without explicit emotion prompts
Integration of reinforcement learning frameworks to actively train for improved empathy
Capability to detect and respond to emotional ambiguity, prompting for clarification before committing to a potentially incorrect response

Importance of Research/Toolkits
This toolkit is important because it provides a framework for evaluating how large language models simulate human emotional understanding. It enables fine-grained, interpretable analysis of LLM behavior across emotion classification, affective rating, and response adaptation, including multi-step reasoning through reinforcement learning. By integrating uncertainty-aware response strategies and testing in psychologically grounded scenarios, the toolkit lays groundwork for developing emotionally intelligent LLMs and chatbots capable of demonstrating empathy, which is critical for applications such as mental health, education, and human-centered AI.

","uni":"ems2359","language":"Language: Python: PyTorch, Hugging Face Transformers, OpenAI API, Scikit-learn, NumPy, pandas, matplotlib, CSV Platforms: Google Colab, VSCode Models: Mistral-7B, LLaMA-2-7B, GPT-4o, tiny-gpt2 ","pid":"202505-4","m4uni":"","analytics":"Our system implements three core emotional evaluation pipelines for large language models (LLMs):
1. Emotion Classification Evaluation:
We prompt models (LLaMA-2-7B and Mistral-7B) to respond to emotionally annotated scenarios and evaluate the emotional content of their responses using a pretrained emotion classifier (DistilRoBERTa). The classifier outputs softmax distributions over emotion labels. We aggregate results over multiple generations per scenario and compute top-1 and top-3 alignment accuracy against the dataset’s ground-truth emotion labels.
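The top-1 and top-3 alignment metrics described above can be computed along these lines. This is a minimal sketch, not the project's evaluation code: it assumes each scenario's generations are first averaged into one distribution, and the function names, emotion labels, and probabilities are made up for illustration.

```python
def average_distributions(generations):
    # Average the classifier's softmax outputs over several generations
    # for the same scenario (each generation maps label -> probability).
    labels = generations[0].keys()
    count = len(generations)
    return {label: sum(g[label] for g in generations) / count for label in labels}


def top_k_accuracy(distributions, gold_labels, k):
    # A scenario counts as a hit when the ground-truth emotion is among
    # the k highest-probability labels.
    hits = 0
    for probs, gold in zip(distributions, gold_labels):
        ranked = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += gold in ranked
    return hits / len(gold_labels)


# Hypothetical averaged distributions for two scenarios.
distributions = [
    {'anger': 0.6, 'fear': 0.2, 'guilt': 0.1, 'jealousy': 0.1},
    {'anger': 0.4, 'fear': 0.3, 'guilt': 0.2, 'jealousy': 0.1},
]
gold_labels = ['anger', 'guilt']
print(top_k_accuracy(distributions, gold_labels, k=1))
print(top_k_accuracy(distributions, gold_labels, k=3))
```

Top-3 accuracy is always at least as high as top-1, so reporting both shows how often the correct emotion is a near miss rather than absent.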

2. PANAS-Based Affective Scoring:
In this experiment, LLMs (GPT-4o-mini, Mistral-7B, and LLaMA-3) rate their affective state using the 20-item Positive and Negative Affect Schedule (PANAS). We parse model outputs to extract a 20-dimensional emotion vector per scenario, allowing us to compute cosine similarity and L_2 distances between models’ average affective profiles. We also conduct per-item statistical tests to analyze significant emotional differences between models.
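The profile comparison described above can be sketched as follows: average each model's per-scenario PANAS vectors into one profile, then compare profiles with cosine similarity and L_2 distance. The helper names are hypothetical and the example uses short 3-item vectors instead of the full 20 items to stay readable.

```python
import math


def mean_profile(vectors):
    # Average per-scenario rating vectors into one affective profile.
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a, b):
    # Euclidean distance between two profiles.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up ratings on a 1-5 scale for two models over two scenarios.
model_a = mean_profile([[1, 3, 5], [3, 3, 5]])
model_b = mean_profile([[2, 3, 4], [2, 3, 4]])
print(cosine_similarity(model_a, model_b))
print(l2_distance(model_a, model_b))
```

Cosine similarity ignores overall rating intensity and compares only the shape of the profile, while L_2 distance also reflects differences in magnitude, which is why reporting both is informative.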

3. Reinforcement Learning with PPO:
We train a small causal language model (sshleifer/tiny-gpt2) using Proximal Policy Optimization (PPO) to generate emotion scores across 15 emotions in response to scenarios. The reward signal is computed as the normalized intensity of the correct emotion, and training proceeds using REINFORCE-style gradient updates. We log reward trends across epochs and visualize model performance over time using a reward plot.

Each module outputs structured CSV files and, in the case of reinforcement learning, reward trajectory plots (reward_plot.png) to support analysis and reproducibility.","m4lname":"","industry":"Information","m3lname":"Goutzoulias","dataset":"We used the EmotionBench dataset, introduced in Huang et al. (2024). The dataset contains 428 real-world scenarios annotated with one of eight negative emotions: anger, fear, guilt, embarrassment, frustration, depression, anxiety, and jealousy. Each scenario is grouped under thematic subcategories (e.g., “Blaming,” “Bullying,” “Failure of Goals”) and is written in naturalistic language to simulate emotionally salient experiences.

We obtained the dataset from the authors’ official repository: https://github.com/CUHK-ARISE/EmotionBench. It is publicly available and maintained under the EmotionBench project.

Our software directly supports this dataset in its flattened form (CSV), where each row contains a scenario, its associated emotion label, and the contextual factor. Beyond EmotionBench, our emotion alignment pipeline and reinforcement learning setup can support any dataset containing textual prompts/scenarios and corresponding categorical emotion labels, ideally aligned with psychological theory (such as basic emotions, appraisal theory, or PANAS dimensions).
","m2uni":"mt3565","m2fname":"Michelle ","m3uni":"ng2985"},{"projectname":"House Price Prediction","timestring":"Wed Dec 14 19:25:34 2022","m1uni":"yg2537","m2lname":"","m1fname":"Yue","m4fname":"","m1lname":"Gu","m3fname":"","description":"In recent decades, house marketing has
always been a hot topic in economics. People are trading real estate not only for living but also for business. However, during the covid period, the entire market has been full of risks. People are considering a more efficient prediction model in housing prices for them to reduce the risk as much as possible. Meanwhile, despite the improving housing market, the country's overall housing supply continues to be constrained. Many people who bought homes during the past few years are still staying put, which has kept the prices from falling further. Therefore, for these reasons, I think house price prediction is a really good topic and a project for me to practice the skills I have learned in EECS 6893 Big Data analytics.
","uni":"yg2537","language":"Python,D3,html,spark","pid":"2022-28","m4uni":"","analytics":"The project can be divided into below steps and workflow:
Data modeling and streaming pipeline
Apply data cleansing
Apply the Machine Learning model to the data
Data visualization
Summary of the report and understanding of the data.
","m4lname":"","industry":"Retail","m3lname":"","dataset":"Data Set: 'House price prediction’
Data Size: 4601 Rows and 18 Columns.
source: kaggle.com
https://www.kaggle.com/datasets/shree1992/housedata
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Medical AI Helper","timestring":"Fri Apr 23 23:42:55 2021","m1uni":"cw3282","m2lname":"Geng","m1fname":"Chaoran","m4fname":"","m1lname":"Wei","m3fname":"","description":"COVID-19 is the most serious public health challenge in the world today. However, lung image analysis is labor-intensive and time-consuming, so automated analysis of lung images with AI technology can greatly improve the efficiency of doctors in diagnosing patients with COVID-19. In addition, we found that our idea can be extended to more medical image analysis scenarios. After a literature survey, we learned that brain tumor MRI analysis and endoscopic artifact detection are also very common examinations with huge workloads, so we developed analysis modules for them in our system too.","uni":"cw3282","language":"Python, Flask, Tensorflow, Keras, HTML, CSS","pid":"202105-17","m4uni":"","analytics":"Classification: VGG-16, ResNet-50 and Inception-V3
Segmentation: U-Net and Res-UNet
Bone suppression: U-Net and PatchGAN
Endoscopic artifact detection: Faster-RCNN","m4lname":"","industry":"Life Science","m3lname":"","dataset":"COVID-19 Radiography Database: https://www.kaggle.com/tawsifurrahman/covid19-radiography-database
COVID-CTset : A Large COVID-19 CT Scans dataset: https://www.kaggle.com/mohammadrahimzadeh/covidctset-a-large-covid19-ct-scans-dataset
Brain MRI segmentation: https://www.kaggle.com/mateuszbuda/lgg-mri-segmentation
Endoscopic artifact detection: https://ead2019.grand-challenge.org/Data/
https://ead2020.grand-challenge.org/Data/
X-ray Bone Shadow Supression: https://www.kaggle.com/hmchuong/xray-bone-shadow-supression
COVID-19 Lung CT segmentation: https://drive.google.com/file/d/1bbKAqUuk7Y1q3xsDSwP07oOXN_GL3SQM/view","m2uni":"xg2358","m2fname":"Xiaotian","m3uni":""},{"projectname":"Better Name More Money: NYC Airbnb Analysis","timestring":"Fri Dec 13 08:58:32 2019","m1uni":"zw2624","m2lname":"Yun","m1fname":"Zihe","m4fname":"","m1lname":"Wang","m3fname":"Qianhui","description":"On Airbnb, ‘Cozy PRIVATE Studio UWS and Jazz Tour’ got 10 reviews per month while ‘Upper West Side Apartment’ just received 1. Both sources are 1B1B around $100 and in the same neighborhood. It seems that a good NAME of the source could bring more customers and therefore more money.

We decided to investigate this interesting phenomenon further. Our goals are as follows:
- Analyze the data to identify possible relationships between the name of a housing source and its popularity
- Build tools to help house owners predict the popularity of their house under different names
- Build tools to help house hunters analyze prices and get information about the neighborhood

We think our project website could be a useful tool for both ends: a better experience for guests and more money for hosts.
","uni":"zw2624","language":"Python Django / Google cloud platform","pid":"201912-16","m4uni":"","analytics":"web-spider using scrapy
EDA: statistical analysis, word cloud
Word Embedding
Random Forest and XGBoost using Sklearn and Spark MLlib
MVC website using Python Django and Google Cloud Platform
Interactive data visualization using Bokeh
Interactive map using leaflet

","m4lname":"","industry":"Information","m3lname":"Yu","dataset":"We built a web-scraper and scraped data from Airbnb as our dataset. We built it using python scrapy on Airbnb api. ","m2uni":"dy2400","m2fname":"Duanyue","m3uni":"qy2226"},{"projectname":"Analysis of House Price Prediction and Stock Market ","timestring":"Sat Dec 22 16:38:46 2018","m1uni":"xc2452","m2lname":"Hu","m1fname":"Ximing","m4fname":"","m1lname":"Chen","m3fname":"Xiang ","description":"Real estate and stock are the most common as well as the most important investment in people's entire life. So how to accurately predict the movement of the stock price and how to let people know how much their ideal house would cost become very meaningful. Also, we would like to explore our prediction accuracy and what factor would affect our prediction accuracy. Limited by the datasets we choose, our prediction about the accuracy and house price might not be quite consistent since we adopted two different dataset. The ideal situation is that we should use the same dataset to predict the house price and evaluate the accuracy.

Also, we created a suggestion and prediction model on our website which allowed our user to type in their requirement for their house, and our system would give predictions according to their requirement. We would give back the house price expectancy and the error of our prediction for the reference of our user. ","uni":"xc2452","language":"Python, JavaScript, R, CSS","pid":"201812-37","m4uni":"","analytics":"Training Framework: XGBoost, Spark MLlib, Light GBM, Sklearn
1) XGBoost:
XGBoost performs well on both regression and classification and has strong support for parallel computing, so it handles large datasets well. Its most important property is scalability in all scenarios: it runs more than ten times faster than existing popular machine learning frameworks. Its innovations include a novel tree learning algorithm for handling sparse data and a theoretically justified weighted quantile sketch procedure that enables handling instance weights in approximate tree learning. Parallel and distributed computing makes learning faster, which enables quicker model exploration.
2) Spark:
Spark is used for comparison. Compared to XGBoost, it is simpler but less adjustable, and its performance is not as strong.
3) Light GBM:
LightGBM is a highly efficient gradient boosting decision tree framework[1]. It uses two novel techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to make split-point calculation more efficient. Split-point calculation is the most time-consuming part of the Gradient Boosting Decision Tree (GBDT) algorithm, since conventional implementations need to scan all data instances for every feature to estimate the information gain of every possible split point. LightGBM uses GOSS to reduce the number of data instances and EFB to reduce the number of features.
4) Sklearn:
Sklearn provides basic implementations of many classic machine learning algorithms, but it does not support large streams of data, so it is not well suited to our big data analysis project.

Data Visualization: Seaborn, Google Chart, MatPlotLib
Matplotlib is used for some simple data visualization, but its plots are not very attractive.

Seaborn can generate more complex and colorful graphs.

Google Charts: used on our website and highly interactive, but its style is not well suited to large amounts of data.

Data Analytics: Pandas, Numpy

Compute Platform: I adopted google compute engine to hold our backend server and the training model. We also used some basic CSS and JavaScript to design our frontend pages. ","m4lname":"","industry":"Finance","m3lname":"Li","dataset":"We have used 3 dataset. We get all of them in Kaggle.
The first is the Kaggle Zillow Price Prediction dataset:
https://www.kaggle.com/c/zillow-prize-1
This dataset contains 50 properties of real estate records and 2 million rows. It is used to evaluate prediction accuracy. The property columns describe features of the real estate, such as total basement square feet and number of rooms, and the result is the prediction accuracy calculated as log error.

The second is the House Prices: Advanced Regression Techniques dataset:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
This dataset has more than 80 properties and about 10000 rows of data.
It contains both useful and irrelevant data, so we analyzed it and picked out 14 properties that suit our needs.

The third is the Two Sigma news dataset:
https://www.kaggle.com/c/two-sigma-financial-news
This data consists of two datasets containing stock price fluctuations and news information. The news data also provides properties of each news item, such as headline, timestamp, attitude and so on.

All the data is in CSV format, so it can be loaded into pandas, and pandas DataFrames can be used with many training models.
","m2uni":"xh2374","m2fname":"Xuexin Hu","m3uni":"xl2811"},{"projectname":"Building Respiratory Disease Knowledge Graph Based on Disease Symptoms","timestring":"Fri Apr 23 17:16:30 2021","m1uni":"bl2835","m2lname":"Liu","m1fname":"Bohua","m4fname":"","m1lname":"Liu","m3fname":"","description":"Respiratory diseases have become a very common disease due to the serious contamination and smoking. According to the Centers for Disease Control and Prevention (CDC), COVID-19 is also a respiratory illness that belongs to a large family of viruses called coronaviruses. As COVID-19 spreads throughout the world, we want to provide some guidelines to search and detect respiratory diseases existence based on symptoms conveniently by means of both Knowledge Graph, which is an efficient and simple approach of information interactive visualization tool based on Big Data, and a machine learning model called AdaBoost SVM. The project has been visualized by a frontend system, which can be redirected to neo4j database to show the pre-built Knowledge Graph and Shiny-based framework.","uni":"bl2835","language":"For the knowledge graph part, we used Python as the programming language and Neo4j as the graph database. And we use Flask to generate the front-end for user to search in our database. For the Machine Learning part, we used R language to build the algorithm and Shiny to integrate with our frontend system. For the frontend, we used HTML, Javascript and CSS.","pid":"202105-18","m4uni":"","analytics":"Natural Language Processing, Support Vector Machine, AdaBoost SVM, visualized using HTML and R-Shiny framework. Web crawler, Neo4j, visualized by Flask connect python to Javascript to build HTML webpages for user searching.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"For the knowledge graph part, we used Respiratory Diseases Category from a human disease database called MalaCards, which includes 1879 diseases in total.
For the Machine Learning part, we used a dataset “Symptoms and COVID Presence” from Kaggle containing 5434 records.
","m2uni":"rl3147","m2fname":"Ruoke","m3uni":""},{"projectname":"Customer Interaction — Finance Product Sales & Marketing Strategy","timestring":"Sat May 16 03:15:03 2020","m1uni":"zx2276","m2lname":"","m1fname":"Zijie","m4fname":"","m1lname":"Xia","m3fname":"","description":"Customer interaction is sequential, so marketing decisions are made over time and based on
customers’ previous responses. Models that cannot learn this sequential property therefore
cannot be used here. In addition, companies and organizations want to maximize
cumulative revenue, not current revenue. For example, if a salesperson recommends
products to clients too frequently, the clients might be willing to buy in the short
term, but in the long term they might become annoyed and never buy the company’s products again.
The long-term, cumulative revenue is then given away even though the company obtains
short-term revenue from those clients. It is also unreasonable and almost impossible to promote
products to each client too frequently, especially when promotion carries a
cost.
So, the objective of our project is to maximize the cumulative revenues for each customer using
reinforcement learning, given their history of interactions with the company. This setting
differs from traditional reinforcement learning paradigms, due to the sequential nature of the
customer interactions. ","uni":"zx2276","language":"Python","pid":"202005-16","m4uni":"","analytics":"1. Input: transaction data, selected method and hyperparameters
2. Preprocessing: restructure the data, separate the dataset into states and actions, generate
proper features and select features
3. Agent training: run selected reinforcement learning algorithms and train the agent on
preprocessed data
4. Automatic Strategy Execution: execute the strategy and generate result report by creating
a webpage powered by Dash
","m4lname":"","industry":"Finance","m3lname":"","dataset":"KDD Cup 1998 Data","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Customer Retention Analysis for Music Streaming Services","timestring":"Sat Dec 17 02:54:47 2022","m1uni":"ac5166","m2lname":"Khandekar","m1fname":"Ajinkeya","m4fname":"","m1lname":"Chitrey","m3fname":"Aishwarya","description":"With the entry of multiple OTT platforms, the market is heading towards saturation. There are already signs of saturation in signing new customers. Customers have a finite amount of time and budget to spend on streaming, particularly as they spend less time at home in this post-pandemic environment. High churn has had OTT platforms scrambling to retain customers by offering higher personalization and incentives. Our project aims to study customer churn trends for music streaming services. Leverage big data technologies like PySpark and ML libraries like Spark MLlib to model the factors affecting customer churn decisions and provide companies with the capability to strategize based on factors driving customer churn most significantly.
","uni":"ac5166","language":"PySpark, GCP, Tableau, Streamlit","pid":"202212-29","m4uni":"","analytics":"The data was stored in a GCS bucket. A Tableau dashboard was created to discover underlying trends and relationships in the data. Post this, the data was pulled into a GCP dataproc cluster where it was first preprocessed extensively by techniques like null value imputation, data binning, data scale normalization, and extensive feature engineering to create user activity log variables and bivariate target variable \"Churn\". This dataset was used to train a variety of models including Logistic Regression, SVM, Gradient Boosted Trees, and Random Forest. Random Forest provided the best results and after further hyperparameter tuning, we obtained a training set F1 score of 94% and a validation set F1 score of 89%. Finally, this model was deployed on a Streamlit-based frontend to enable live inferencing by users.","m4lname":"","industry":"Media","m3lname":"Sen","dataset":"A large dataset from Udacity was collected - Sparkify dataset. It has 18 columns which contains the information logs of music streaming events which include Customer event data (gender, country, etc.) and API events (login, playing next song, subscription level, etc.). The data set is 12 GB and was collected by the creator over a period of 2 months of streaming from multiple sources. The dataset was coalesced from multiple session level APIs of streaming services to form a comprehensive feature space. Finally the dataset was engineered in a way to model user activity logs (average songs played each day, number of friends added, number of actions initiated, etc) and a Churn target variable (1 - if the customer churned and 0 - if the customer stayed with the service).
","m2uni":"mk4679","m2fname":"Manasi ","m3uni":"as6718"},{"projectname":"Video Object Segmentation Based on Pixel-level Annotated Dataset","timestring":"Fri Dec 13 07:25:49 2019","m1uni":"yl4238","m2lname":"Hu","m1fname":"Yanlin","m4fname":"","m1lname":"Liu","m3fname":"","description":"The objectives of Video Object Segmentation is to extract foreground objects from video clips. It has many applications such as: video summarization/editing, object tracking, video action detection, autonomous driving, etc.

Object segmentation and object tracking are fundamental research areas in the computer vision community.

Public benchmarks and challenges have been an important driving force in the computer vision field, with examples such as ImageNet for scene classification and object detection, PASCAL for semantic and object instance segmentation, or MS-COCO for image captioning and object instance segmentation. From the perspective of the availability of annotated data, all these initiatives were a boon for machine learning researchers, enabling the development of new algorithms that had not been possible before. Their challenge and competition side motivated us to participate and push towards new goals, by setting up a fair environment where test data are not
publicly available.","uni":"yl4238","language":"Language: python, javascript Platforms: django, Tensorflow, OpenCV ","pid":"201912-13","m4uni":"","analytics":"Analytics:
One-Shot Deep Learning: Let us assume that one would like to segment an object in a video, for which the only available piece of information is its foreground/background segmentation in one frame. Intuitively, one could analyze the entity, create a model, and search for it in the rest of the frames. For humans, this very limited amount of information is more than enough, and changes in appearance, shape, occlusions, etc. do not pose a significant challenge, because we leverage strong priors: first “It is an object,” and then “It is this particular object.”

Algorithms:
Our method is inspired by this gradual refinement. We train a Fully Convolutional Neural Network (FCN) for the binary classification task of separating the foreground object from the background. (1) We start with a pre-trained base CNN for image labeling on ImageNet; its results in terms of segmentation, although they conform to some image features, are not useful. (2) We then train a parent network on the training set of DAVIS; the segmentation results improve but are not focused on a specific object yet. (3) By fine-tuning on a segmentation example for the specific target object in a single frame, the network rapidly focuses on that target.

System model:
We use Django to build the web interface. Users can upload the original video through the web interface. When the video is uploaded, Django sends a POST request to a Flask API. The Flask API receives the video file, preprocesses the data, uses the pre-trained CNN model to predict the segmentation labels in the video (tracking the object), renders the video, and returns it to Django. Finally, the user can watch the rendered video on the web interface.

Visualization:
We use JavaScript, in particular d3.js, to visualize the object position data, and implement a 3D scatter plot that shows the overall offset of the position coordinates of the tracked object, corresponding to the motion trajectory of the object in the video. The x-axis and y-axis represent the horizontal and vertical coordinates of the object. The z-axis represents time, in units of frames. All visualizations are available on the web interface.
","m4lname":"","industry":"Information","m3lname":"","dataset":"DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motion-blur and appearance changes. Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation.

The dataset that we tested is DAVIS 2016, from https://davischallenge.org/

My software/web interface supports any video with first-frame data, such as the YouTube dataset or DAVIS-2017.
","m2uni":"ch3467","m2fname":"Chong","m3uni":""},{"projectname":"Solar Energy Forecasting for PV Performance and Integration","timestring":"Sat Dec 21 05:11:52 2024","m1uni":"aa5479","m2lname":"Opel","m1fname":"Aniruddh","m4fname":"","m1lname":"Aiyengar","m3fname":"Ava","description":"Competitive energy markets motivate decreased production costs and improve system reliability. It positions those with the best insight into the market to profit. The intermittency of renewable energy resources like solar PV and wind farms prevents clean energy technology from fully participating in competitive market structures as they exist today. In this work, we designed a model to predict solar PV performance indicators using time-series weather data as part of an active area of research to make clean energy easier to integrate into the grid, and more reliable. We use established machine learning and large-scale data analytics techniques along with a NeuralProphet engine to build a forecast model.","uni":"aa5479","language":"Python, Javascript, PySpark, Apache Airflow, GCP, GCS, D3JS","pid":"202412-26","m4uni":"","analytics":"K-Means Clustering
Silhouette Score Analysis
Lasso Regression
NeuralProphet
Supervised Learning Framework
Distributed System Architecture","m4lname":"","industry":"Information","m3lname":"Sealander","dataset":"National Solar Radiation Database (NSRDB): https://developer.nrel.gov/docs/solar/nsrdb/psm3-2-2-download/","m2uni":"to2359","m2fname":"Tuba","m3uni":"as7037"},{"projectname":"A-share Stock AI Trader","timestring":"Tue Apr 27 03:30:20 2021","m1uni":"yf2560","m2lname":"Han","m1fname":"Yiwen","m4fname":"","m1lname":"Fang","m3fname":"","description":"The financial market is one of the first to adopt machine learning. Since the 1980s, people have been using machine learning to discover the laws of the financial market. Although machine learning has achieved great success in forecasting in other fields, the stock market is a market where everyone can invest in profit, but machine learning forecasts have not achieved significant results. As stock traders, we want to simulate stock prices or its trend correctly so that we can reasonably decide when to buy stocks and when to sell stocks in order to achieve maximum profitability. In this project, we will deploy LSTM, SVM, ARIMA methods to train models, then predict stock values or its trends. According to the result, we will design an algorithm to buy it at a low price, sell it at a high price, to achieve profits as much as possible. Finally, we will build a web application visualizing the result, which makes it more intuitive and easier to use and to accept for users.","uni":"yf2560","language":"Python, Django, HTML, CSS, JavaScript, Windows/Linux, AWS EC2","pid":"202105-6","m4uni":"","analytics":"1. Analytics/algorithms: LSTM (long short-term memory), SVM (support vector machine), ARIMA (autoregressive integrated moving average);

2. Web Application: Django;

3. ML modules: TensorFlow, scikit-learn, statsmodels;

4. Visualization: Echarts","m4lname":"","industry":"Finance","m3lname":"","dataset":"The dataset is from the stock data of Yahoo Finance collected by yahoo_fin API.","m2uni":"gh2567","m2fname":"Guoshiwen","m3uni":""},{"projectname":"Twitter Based Sentiment Analysis of Two Famous Actresses","timestring":"Sun Dec 23 01:30:26 2018","m1uni":"yf2466","m2lname":"Yang","m1fname":"Yue","m4fname":"","m1lname":"Feng","m3fname":"Yizhi","description":"Objectives:

We hear about the controversies between Anne Hathaway and Jennifer Lawrence all the time. Taking advantage of Big Data, we can collect and filter tweets from people in America, train a sentiment analysis model to make predictions, and get a general idea of what people think of Hathaway and Lawrence.

Innovations and capabilities:

A lot of sentiment analysis work has been done with classification algorithms like LSTM. Some models can even predict several sentiments such as surprise and anger, while others use a Maximum Entropy Classifier or Decision Tree. But those models are trained on different kinds of datasets. Here, we decided to train our own model using related labeled data.

Why are these research / toolkits important?:

Several related models use Naïve Bayes, Random Forests, Support Vector Machines, etc. We tested Naïve Bayes, Random Forest, and XGBoost, and found that XGBoost performs best. For the test data, we also use a method that retrieves the location and time information of tweets, which helps with geo/time visualization.
","uni":"yf2466","language":"Language: python, javascript Platforms: Google Cloud Platform, D3.js ","pid":"201812-11","m4uni":"","analytics":"First, we use several methods to preprocess the raw tweets. Then we train a XGBoost model to classify the sentiment of tweets and visualized the results using d3.js in time/geo/stat-visualization. We also build a website to show this project.
","m4lname":"","industry":"Media","m3lname":"Zhang","dataset":"Twitter provides Search API to let developers get access to Tweets from as early as 2006. We use this API to construct a dataset of tweets for Anne Hathaway and tweets for Jennifer Lawrence. We feed this dataset as our test data to trained model to construct the predictions for sentiment.

Our software can support other data like raw tweets.","m2uni":"my2577","m2fname":"Manqi","m3uni":"yz3376"},{"projectname":"B1: Market Intelligence — Utilizing Financial Knowledge Graphs","timestring":"Fri May 5 22:37:05 2023","m1uni":"cc4884","m2lname":"Hsieh","m1fname":"Charlene","m4fname":"","m1lname":"Chiang","m3fname":"","description":"Given the current massive layoffs in the tech industry, we would like to develop a knowledge graph that focuses on human resources. As job seekers, we would like to know what the recruiting market is like and what is the condition in each company. However, there is no website that shows the current job market - people will have to go through every news website to search for information manually. This is also not great for people who want to analyze the job market from the big picture. As a result, job seekers need a tool to collect information faster and make correct decisions based on the data.

Therefore, we would like to create a website that displays information on human resources in the tech industry. The website would be able to process data swiftly and accurately and would be able to provide insight to job seekers. Given the economic climate now, the project can provide information that job seekers need. The website we designed would also allow users to have a comprehensive view of the job market, as well as specific queries that people might want to know. Recruiting and layoff news will be collected via web scraping. Then, a trained classification model would extract relevant features from unseen text and store them in the database. The web interface would then allow the users to interact with the database.
","uni":"cc4884","language":"SpaCy, Neo4j Arua, NeoDash, AWS ECR/ECS, Python, Selenium, BeautifulSoup","pid":"202305-13","m4uni":"","analytics":"We used Neo4j and NeoDash to visualize the relationship between each company. We define one company as a node, and the relationship between each company is limited to parent and subsidiary relationships. If we look closer to each company node, we can see more nodes connecting to them, which represent the hiring and layoff number. The nodes would be connected by relationships, which is the action that the company executed. By doing so, user can easily recognize the relationship and the action each company executed.

In order to extract useful information, we need a way to convert the news text into the data format we want to store. Therefore, we trained a model to achieve this goal. We first extract the titles to see whether they contain all the information we need. To do that, we use Part of Speech (PoS) analysis, which helps us tag particular phrases or words with types such as verbs, adjectives, adverbs, nouns, etc. After that, we define rules that help us recognize patterns likely to contain the information we want. Finally, we extract the data when we find a corresponding pattern.

A variation of Depth First Search was implemented when crawling the company’s information.
","m4lname":"","industry":"Finance","m3lname":"","dataset":"We get our data mainly from web scraping. We crawl through news websites to get news that is related to recruiting and layoff. Also, we got the information of each company by crawling Wikipedia. By doing so, we can also double check if the company that we tagged is correct. In terms of other data, our software can support any data that has relationship between each other and can be searched in the news. For example, if people want to know about the most recent trends of each company’s stock and their relationship with other entities, people can use our software to find out. It can give the user the insight on the stock market by knowing the effect between different companies. The only part that requires altering is the script used for extracting information from the news.
","m2uni":"th2990","m2fname":"Tsai-Chen","m3uni":""},{"projectname":"NamastAI: Your AI assistant for Yoga","timestring":"Sun May 12 21:54:08 2024","m1uni":"","m2lname":"","m1fname":"Balachander","m4fname":"","m1lname":"Sathianarayanan","m3fname":"","description":"Objectives:
The project aims to develop an innovative yoga pose estimation system that accurately analyzes and grades yoga poses in real-time. Key objectives include developing a robust computer vision system for pose recognition and analysis, implementing a feedback mechanism for real-time grading based on alignment and form, providing personalized yoga recommendations based on individual abilities, and enhancing user engagement through interactive features.

Innovations:
Traditional yoga correction apps rely on pose coordinates and have a lower FPS. This can now be achieved in real time by using lightweight models and optimizing the workflow. We achieved 67-68 FPS locally.

Capabilities:
It runs in real time, and anyone can access the web app from anywhere in the world. The project will later be moved to AWS to address scalability.

Research:
Existing models detect the position more like classifying a particular pose rather than perfect the pose.By combining the corrections with the pose’s characteristics, we can further combine a chatbot which will further give suggestion to user about his performance.","uni":"bs3507","language":"Python","pid":"202405-1","m4uni":"","analytics":"Streamlit, WebRCTC, Pose Estimation, Movenet, Z scoring. ","m4lname":"","industry":"Information","m3lname":"","dataset":"This dataset contains the yoga pose for surya namaskar, it contains 3 folders with images, \"Youtube_Images\" is the image sampled at a particular yoga pose,
\"Preprocessed_images\", contains images after gamma correction and necessary image correction techniques to improve the confidence score of the model,
\"Output_images\" containes the images after annotation from pose estimation.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Gene Pathway Extraction","timestring":"Sun Dec 23 01:56:03 2018","m1uni":"ll3165","m2lname":"Chen","m1fname":"Licheng","m4fname":"","m1lname":"Liu","m3fname":"Ziyi","description":"We present a new approach to automatically extract the pathway relationships between genes by classifying evidence sentences from published research papers based on their semantic meanings. Our frame of work includes two approaches, unsupervised learning and supervised learning, where K-means clustering and Text Convolutional Neural Network (TextCNN) are employed accordingly. Through the visualization of our classification results and detailed evaluations, we demonstrate that our method reaches a classification accuracy of 80% outperforming several approaches currently used in Natural Language Processing (NLP) tasks. More importantly we reveal that 277 genes are potentially related to lung cancer while only 86 cancer genes are presented in the existing archived databases.

Our project is important because we are searching for the potential gene variation pathway which may cause Non-small cell lung cancer. Lung cancer has raised more and more concerns nowadays. Since researchers have found that it takes a series of genetic mutations to cause the formation of lung cancer cells, a great number of research papers studying genetic information related to lung cancer have been published. According to the studies in those papers, researchers have also built databases that contain different genetic information. However, as more and more new genes have been found related to lung cancer, we find that there is a delay in updating the gene pathways in the current existing databases, where many newly discovered genes remain unrecorded, thus considered as unrelated to other genes as well. Therefore, we are curious about developing a method to automatically reveal the relationships between a large quantity of lung cancer genes. The development of such a pathway recognition system can also help update the current knowledge graph of lung cancer genes more accurately and efficiently.
","uni":"ll3165","language":"Python","pid":"201812-28","m4uni":"","analytics":"The algorithms we are using are word embedding, mean sentence embedding, K-means clustering and TextCNN
Visualization: t-SNE for embeddings, ggplot for K-means, connected graph for gene ontology relationships.
","m4lname":"","industry":"Life Science","m3lname":"Liu","dataset":"We downloaded our papers from https://www.sciencedirect.com/; each of us was able to download 100 papers per day. We downloaded 2286 papers in total. With the reader.py file uploaded on GitHub, users can read in PDF files that contain gene relations, compute the sentence embeddings, and visualize the gene relations.","m2uni":"zc2393","m2fname":"Ziyu","m3uni":"zl2690"},{"projectname":"How to lease my place in a reasonable price?: Airbnb price analysis around the world","timestring":"Fri Dec 13 07:45:56 2019","m1uni":"ms5904","m2lname":"GAO","m1fname":"MENGYUAN","m4fname":"","m1lname":"SU","m3fname":"XINYAN","description":"You know Airbnb? It's a very popular app, used by millions of people, that provides information about places to rent. This means even a small prediction app matters a lot.
Our aim is to develop a visualization website for predicting prices, the price from a single air bed to a big house, even a castle! It all depends on what you are going to lease.","uni":"ms5904","language":"language: Python platform: jupyter notebook, PyCharm, ","pid":"201912-15","m4uni":"","analytics":"models: Linear Regression, Lasso and Ridge, XGBoost, OMP, ElasticNet
visualization: Django, Leaflet, FusionChart
","m4lname":"","industry":"Information","m3lname":"ZHANG","dataset":"Datasets for five cities (New York, London, Paris, Beijing and Tokyo) are used.
The datasets are all public on http://insideairbnb.com, there are also datasets of other cities available.","m2uni":"mg4115","m2fname":"MINXUAN","m3uni":"xz2878"},{"projectname":"Real Estate Price Prediction and Analysis","timestring":"Sat Dec 17 01:49:50 2022","m1uni":"kl3352","m2lname":"Han","m1fname":"Ke","m4fname":"","m1lname":"Li","m3fname":"Chu","description":"Real estate has always been one of the most popular financial topics. Predicting house prices is critical to improve real estate efficiency. It helps investors formulate appropriate investment strategies and guide the healthy and long-term development of the real estate market. It is very important to estimate the market value of real estate according to the market situation.

Objectives: Predict the prices of houses precisely using machine learning techniques. Develop an interactive web application to present the results.

Innovations: Come up with a model that can predict prices on a large scale across the US. Treat data in a less continuous perspective to make more accurate predictions according to multiple factors. Create a user-friendly UI for people to use and explore our predictive model with different parameters.
","uni":"kl3352","language":"Python, HTML, CSS, JavaScript, Jupyter Notebook, Flask, React, GCP, Tableau","pid":"202212-8","m4uni":"","analytics":"Analytics: We analyzed the distribution of each feature and the relationships between features. We discovered that the dataset contains a lot of duplicates and that some states have few and biased data records. After removing duplicates, missing values, and outliers, we obtained a cleaned dataset ready for model training.

Algorithms: Linear Regression, Polynomial Regression, K-NN Regression, Decision Tree Regressor, Random Forest and XGBoost

System Modules: scikit-learn, pandas, numpy, matplotlib, seaborn

Exploratory Data Analysis & Visualizations: We used matplotlib and seaborn to conduct EDA with python. We used Tableau to create an interactive dashboard hosted on our webpage. Some visualizations we created include: the distribution of price, scatter plot of price vs. house size, average house price in each state and city, a map of average house price in zip code, etc.
","m4lname":"","industry":"Finance","m3lname":"Qin","dataset":"Our data was scraped from - https://www.realtor.com/ - A real estate listings website operated by the News Corp subsidiary Move, Inc. and based in Santa Clara, California. It is the second most visited real estate listings website in the United States as of 2021, with over 100 million monthly active users. The csv file contains 900k+ data records broken by State and Zip Code. We directly downloaded the data from Kaggle.com. The dataset was updated weekly. It has 12 columns, which are status, price, bed, bath, acre_lot, full_address, street, city, state, zip_code, house_size, and sold_date.","m2uni":"lh3096","m2fname":"Linyang","m3uni":"cq2238"},{"projectname":"Analysis and Optimization for NYC Public Transportation Alternatives","timestring":"Fri Dec 13 02:21:05 2019","m1uni":"hs3194","m2lname":"","m1fname":"Hongzhi","m4fname":"","m1lname":"Shi","m3fname":"","description":"To analyze the trend and usage pattern of the 3 major public transportation alternatives for NYC commuters : Taxi, For hire vehicle and Citi bike and also develop a web based app to provide some insight of which one is a better choice when one wants to go from one neighborhood to another at a particular time of the day.

This is important because millions of people rely on these options to commute. Understanding the dynamics and making the right choice would not only save time and money for individual commuters, but also, to some extent, ease the traffic burden and make the city streets more efficient. ","uni":"hs3194","language":"Python/ Google cloud platform","pid":"201912-42","m4uni":"","analytics":"various filtering/joining/aggregation in PySpark
SQL with GIS feature on BigQuery
Django to host the web app
D3 and Vega for visualization","m4lname":"","industry":"Transportation","m3lname":"","dataset":"1. aggregate stats for yellow/green taxi as well as for hire vehicles in the past 3 years from TLC
2. citi bike trip data from citi bike website
3. taxi zone geometry data from TLC
4. NYC borough map data from NYC Open Data.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Exploratory Analysis of Stack Overflow Samples","timestring":"Sat Dec 22 17:14:09 2018","m1uni":"yl4020","m2lname":"Shen","m1fname":"Yuhan","m4fname":"","m1lname":"Liu","m3fname":"Yuting","description":"Objectives&Capabilities:
Visualize the dataset in some way.
Analyze the determinants of the answers' scores with respect to the provided information.
Implement an auto-tagging system for the questions.

Innovations:
We present the correlation analysis of the answers' score with heatmap, and plot a tag connection graph to show the inner connections of the tags.

As the amount of information dramatically increases, we can learn much from existing data, and this could enable us to better process new data. In this project, automatically tagging the questions for the Q&A website can greatly improve efficiency and save time for the users. At the same time, correlations between metrics of users’ interest and the collected information could be insightful.

","uni":"yl4020","language":"Python, Jupyter Notebook","pid":"201812-35","m4uni":"","analytics":"Correlation analysis is conducted over the answers' score and it is visualized with heatmap.
A tag connection graph is also constructed so that users can see how the tags are connected.
Auto-tagging of top tags for questions is implemented for the convenience for users of Q&A websites.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"StackSample: 10% of Stack Overflow Q&A
The training set and testing set are split randomly from this dataset with respect to a ratio of 80% and 20%.","m2uni":"zs2390","m2fname":"Zhansen","m3uni":"yw3167"},{"projectname":"Brandforge AI: Creative AI Agent for Early-Stage Startups Visuals","timestring":"Tue May 5 22:44:06 2026","m1uni":"tf2639","m2lname":"Chen","m1fname":"Tianrui","m4fname":"","m1lname":"Fang","m3fname":"Tianyu","description":"Our project goal is to build an AI-powered design platform for early-stage startups, helping users quickly create a professional and cohesive visual identity without requiring significant time, budget, or design expertise.

The key innovation lies in combining multiple design capabilities into a single, coordinated system of AI agents that understand brand context and produce cohesive outputs across different formats. Instead of using fragmented tools or relying on manual design processes, users can go from idea to complete brand identity in minutes.

These toolkits are important because they turn design from a bottleneck into an accessible, on-demand capability. This allows startups to launch faster, present themselves more professionally, and compete more effectively, regardless of their budget or design experience.","uni":"tf2639","language":"TypeScript, Python, CSS","pid":"202605-26","m4uni":"","analytics":"System modules include: a Brand Agent (GPT-4o extracts brand profile from text), a Logo Agent (SVG generation + DALL-E 3 image), a Marketing Agent (tool-using LLM with RAG), and a UI/Web Agent (two-step plan-then-generate for Tailwind landing pages), all orchestrated by an AI Gateway that fans out tasks concurrently via asyncio.gather and persists results to MongoDB.

Algorithms/analytics center on the Marketing Agent: an LLM router classifies user queries into execution plans, a platform chooser tool filters ad benchmark CSVs (CPC/CPM/CPA by industry and region) with fuzzy matching, and a RAG pipeline uses FAISS cosine-similarity search over sentence-transformer embeddings (all-MiniLM-L6-v2) to retrieve ad policy documents for cited answers.

Visualizations are delivered through a React dashboard with five pages: Brand Hub (color swatches, logo preview, website iframe), Web Architect (device-mockup preview), Logo Lab (SVG/PNG download), and Marketing Chat (rendered markdown strategy with charts via Recharts).","m4lname":"","industry":"Information","m3lname":"Zhan","dataset":"This project utilizes two primary datasets: advertising performance benchmarks and platform-specific policy standards.

The performance data includes key metrics, such as CPC, CPM, CPA, CTR, and conversion rates, across platforms including Meta, TikTok, LinkedIn, YouTube, and Google Ads. These data were obtained from publicly available data sources such as WordStream and Digitopia. Date ranges from 2024 to 2025.

The policy corpus consists of official advertising guidelines for Meta, Google, TikTok, and LinkedIn, covering Prohibited Content, Restricted Categories, Quality Standards, etc. These data were scraped from each platform’s latest ad guideline websites.
","m2uni":"hc3625","m2fname":"Hao","m3uni":"tz2704"},{"projectname":"StackOverflow Live Data Visualization & Analysis","timestring":"Sat Dec 24 01:26:17 2022","m1uni":"af3252","m2lname":"Saghir","m1fname":"Aatir","m4fname":"","m1lname":"Fayyaz","m3fname":"Kunlun","description":"The purpose of this project is to fetch real time StackOverflow tag data in order to visualize, analyze and understand the rapidly evolving programming trends amongst users around the world. The analysis would also involve a machine learning model prediction to show if the quality of the question being posed is high or low and whether it requires an admin to intervene in order to make the question quality better. This project utilizes the Stack Exchange API which allows users to retrieve and view information used by the Stack Overflow.

","uni":"af3252","language":"Google Cloud, Python, Javascript, CSS, HTML, PySpark (structured streaming), DataProc, Big Query, Google Cloud Storage, d3.js","pid":"202212-23","m4uni":"","analytics":"This project utilized Cloud Dataproc on google cloud to run the cluster which fetched and stored Stack Overflow site data using PySpark streaming, Google Cloud Storage and Google BigQuery. As mentioned previously, Stack Exchange API was used to fetch the data. PySpark was utilized for data filtration and MapReduce operations. Pandas-gbq was utilized for storage to Google BigQuery. The MapReduce operation was used to calculate the count of each tag per streaming window which was set at 10 seconds each (since that was the duration for our data request as well). All data received was then stored in BigQuery so that it would be readily available for use by the deep learning model for prediction. Two different types of data tables were stored, one for the tag count application and another for the ML-based question quality analysis application.

Once the data was procured from Stack API and available in BigQuery, data was then imported using Google OAuth2.0 authentication with Google BigQuery. For the purposes of this project, a simple Python script was developed to pull the BigQuery data and save it as a .csv file. The pandas and pandas-gbq packages were utilized to make this process seamless. Visualizations were completed in JavaScript using D3.js.

Current visualizations include an interactive circle packing / bubble chart and a bar-chart race to visualize the incoming tags in real-time.
","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"The Stack Exchange API enables users to retrieve answers, comments, badges, events, questions, revisions, suggested edits, user information, and tags from a Stack Exchange based website. It uses RESTful API architecture to make requests and receive responses in a JSON format. Streaming data was pulled from StackAPI every 10 seconds. PySpark streaming was utilized to receive data from Stack API which was sent over the local socket. The data received from the socket was captured in a datastream and for each Resilient Distributed Dataset (RDD), data preprocessing was done to split each tag and remove any unwanted characters. The data that was sent over the socket was also processed such that from the data packet received in a fetch call - only the tags, the title, and the body of each question were transmitted.

In particular, the fetch call on StackAPI builds the API query for Stack Exchange which in turn processes the request. The return type is a python dictionary containing all the data for the requested call including call and client specifics. The relevant data for the query sits under the items list.

Traversing the items list leads to the required data for each of the project’s requirements. This includes tags, title, and the body of each of the incoming questions. Since the request for the call is configured to fetch ten (10) seconds worth of data every 10 seconds, each request may contain multiple items lists. It was configured in the code to iterate through these and scrape all relevant data which was then sent to the socket as an encoded, and delimited string.
","m2uni":"ms6339","m2fname":"Muhammad","m3uni":"kw2964"},{"projectname":"Semantic Image Annotation Based on Bayesian Networks","timestring":"Sat May 16 04:16:53 2020","m1uni":"zz2637","m2lname":"","m1fname":"ZIYU","m4fname":"","m1lname":"ZHOU","m3fname":"","description":"","uni":"zz2637","language":"","pid":"202005-26","m4uni":"","analytics":"","m4lname":"","industry":"Media","m3lname":"","dataset":"","m2uni":"","m2fname":"","m3uni":""},{"projectname":"MBTI Personality Analysis and Prediction","timestring":"Fri Dec 16 23:35:55 2022","m1uni":"yz4359","m2lname":"Shi","m1fname":"Yutao","m4fname":"","m1lname":"Zhou","m3fname":"Qingcheng","description":"The Myers-Briggs Type Indicator (MBTI) is an introspective self-report questionnaire that indicates different psychological preferences in how people perceive the world and make decisions. Knowing MBTI can help users understand themselves and adjust their life and work according to their own personality tendencies.
This project provides a simple system to predict a Twitter user’s MBTI personality type. The input of the system is the user’s Twitter name, and the system returns the MBTI personality type predicted from the user's tweets.
Compared to the traditional way of filling out questionnaires, we use social platform data to train classification models and make predictions. We also build a user-friendly web application that lets users get their personality type results faster and more easily.
","uni":"yz4359","language":"Python, GCP, HTML, CSS, JavaScript","pid":"202212-14","m4uni":"","analytics":"Analytics: Two parts of this project's data are preprocessed in different ways
For Twitter API streaming data:
1. Filter punctuation, stopwords, and emojis from the original data
2. Apply lemmatization to reduce words to their base forms
3. Restructure the text into equal-size chunks (500 words)
For the MBTI dataset:
1. Oversampling through Synthetic Minority Oversampling Technique (SMOTE) due to imbalanced data distribution

Machine learning algorithms:
1. Naive Bayes-BernoulliNB
2. Naive Bayes-ComplementNB
3. Logistic Regression
4. K-Nearest Neighbors algorithm (KNN)
5. Random Forest

System Modules: The system is designed in several parts. The dataset comes from Kaggle and is used to train an NLP model in GCP Dataproc. The web application is set up on a GCP virtual machine and requests streaming data from the API provided by Twitter. The visualization figures and prediction results are displayed to the user on the web. The user enters their Twitter user ID to get these results.

Visualization:
1. The prediction result of the User's personality
2. Famous people who share the same personality type
3. Job recommendations for this personality type
4. Word Cloud visualization for most commonly used words
5. Model visualization figures","m4lname":"","industry":"Information","m3lname":"Yu","dataset":"This dataset contains 106K records of preprocessed online posts from the PersonalityCafe forum and Reddit, where each post is 500 words long and has been annotated with a personality type. The dataset was covered by Dylan Storey and Mitchell Jolly. The dataset has a volume of 346MB. The velocity of the dataset is 0 since it stopped updating. A total of 16 personality types are included. The original dataset has an unbalanced distribution of personality types. Therefore, we used oversampling to balance the dataset so that each personality type has 25K entries.

In addition, after the user enters their username, our software uses the Twitter API to collect tweets from that user. We fetch the 100 most recent tweets, combine them, and pre-process them with filtering and stopword removal. We then duplicate or crop the text until we have exactly 500 words.","m2uni":"js6132","m2fname":"Jiarong","m3uni":"qy2281"},{"projectname":"Virtual Doctor","timestring":"Sat May 7 05:59:19 2022","m1uni":"ak4745","m2lname":"","m1fname":"Aashi","m4fname":"","m1lname":"Kapoor","m3fname":"","description":"Objective: To create a chatbot that provides people with knowledge about genetic disorders, genetic testing and how to use the testing kits.
Based on symptoms and family history determine which disease is probable and which ones could be transferred to the next generation.
Hence, assisting the general public in understanding the relationships between diseases and genes in order to improve general public’s medical knowledge and understanding
Innovations: It's providing a very refined version of answer available from the net as it has been trained on Pegasus and BERT model after which the data is being displayed to the user when asking questions on the chatbot. It consists of valuable and summarized information.
This is important because it raises awareness of genetic disorders: around 10% of people worldwide have a genetic disorder, and many only find out in adulthood. The chatbot helps people recognize whether they have even slight symptoms of a genetic disorder. Because many people fear early doctor visits, they can compare their symptoms with the information from the bot and then consult a doctor for proper treatment, preventing the disorder from reaching an alarming stage if it is dangerous to health.","uni":"ak4745","language":"Python, PyTorch, Transformers, BERT, Google Pegasus XSUM model, MS Virtual Agent Chatbot","pid":"202205-21","m4uni":"","analytics":"Analysis: To analyze the model, a scorer was used to plot the BERT model output (scoring the different tokens assigned to the various terms); to test the chatbot, it was run several times to measure accuracy.
Algorithms used: K means clustering for topic modelling, HuggingFace transformer library for the BERT model and Pegasus X-Sum model for abstractive summarization
Visualization: Tabular presentation of the result of K-means Clustering, visual plotter for different scores of the tokens generated in BERT model and the chatbot is live presentation of working of model.
System modules: Pytorch, Transformers, BERT, Google Pegasus XSUM model,MS Virtual Agent Chatbot, Azure cloud services","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Dataset: Articles from Pubmed were tested for the following 5 diseases: ehlers danlos syndrome, charcot marie tooth, von willebrand's disease, tourette syndrome, ankylosing spondylitis
and genetic testing kits. BBC dataset was used to train the BERT model and SQUAD dataset to train QnA set.
Got the dataset from articles available at https://pubmed.ncbi.nlm.nih.gov/ , since medical database isn’t available due to privacy issues and hence articles were utilized. Other data that the software can support shall be AWS, Google Bot.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Rx.AI: Multimodal Patient Check-In System","timestring":"Tue May 5 21:57:44 2026","m1uni":"ga2726","m2lname":"Kumar","m1fname":"Gautam","m4fname":"","m1lname":"Agarwal","m3fname":"Aryaman","description":"Rx.AI replaces static clinical check-in forms with a fully personalized, multimodal intake experience that adapts to each patient's visit context.

Core objective: Generate clinically relevant questions per patient, capture answers through text, voice, or camera, and deliver structured pre-visit data that reduces documentation burden on providers.

Innovation 1 — Personalized question generation: A three-agent CrewAI pipeline deduplicates the patient record, summarizes clinical problems, and generates targeted questions per visit based on conditions, medications, allergies, and detected issues. A single-pass Gemini baseline is included for quality and latency comparison, enabling benchmarking of multi-agent orchestration versus simpler approaches.

Innovation 2 — Voice-driven interaction: Questions are read aloud via text-to-speech, and patients respond by speaking; a speech-to-text engine returns word-level confidence scores, flags low-confidence segments for re-recording, and preserves a keyboard fallback for accessibility. This is especially important for patients with limited mobility or low digital literacy.

Innovation 3 — Contextual visual capture: Questions about wounds, skin conditions, medication bottles, or insurance cards automatically trigger a camera prompt; a Gemini vision model converts the captured image into a concise clinical description that is appended to the patient's answer, providing richer intake data without manual charting.

Why it matters: Clinical intake is often error-prone and disconnected from the nuanced information needed for an encounter. By combining LLM reasoning, speech recognition, text-to-speech, and image understanding at the point of care, Rx.AI demonstrates a practical, evaluable template for robust multimodal healthcare workflows, assessed using DeepEval metrics for accuracy, hallucination rate, naturalness, and latency, with logs stored in BigQuery.","uni":"ga2726","language":"Python 3.11 (FastAPI, CrewAI, Pydantic v2, DeepEval, jiwer); React 18 + Vite + Tailwind CSS with browser MediaRecorder and getUserMedia APIs; Google Cloud Vertex AI (Gemini 2.5 Flash for LLM and vision), Cloud Speech-to-Text v2 (Chirp), Cloud Text-to-Speech, and BigQuery; Conda environment with LangChain Google VertexAI and LiteLLM integrations.","pid":"202605-6","m4uni":"","analytics":"Question generation: A three-agent CrewAI workflow runs a medical data deduplication agent, a healthcare summarization agent, and a questionnaire generator agent to produce a structured JSON payload of targeted questions with image-requirement metadata. A single-pass Gemini baseline is exposed for quality and latency comparison.

Camera triggering: A heuristic module scans question text for keywords such as wound, skin, medication, or insurance card and infers requires_image and image_prompt fields when the LLM does not set them, ensuring visual questions always prompt the camera.

Speech-to-text: A FastAPI endpoint accepts WebM/Opus audio, forwards it to Google Cloud Speech-to-Text v2 with word-level confidence scores, and returns a transcript and confidence value. A configurable threshold (e.g., 0.7) flags low-confidence transcripts for re-recording before submission.
Text-to-speech: A server-side endpoint calls Google Cloud TTS with a Gemini-powered voice and streams MP3 audio to the browser for each question, supporting repeat and coordinated recording start.

Vision/image analysis: An endpoint passes the captured image and triggering question to Gemini 2.5 Flash multimodal and returns a short clinical description constrained to observable findings, appended to the patient's answer.

Evaluation: DeepEval powers all metrics — word error rate and medical-term accuracy for STT; LLM-judge naturalness and intelligibility for TTS; AnswerRelevancy and HallucinationMetric for question generation; GEval clinical description accuracy for vision. Workflow scripts group BigQuery logs by workflow ID to compare text-only, speech-only, camera-only, and fully multimodal sessions on answer completeness and quality, producing JSON summary reports.","m4lname":"","industry":"Life Science","m3lname":"Agrawal","dataset":"Patient cohort: The primary dataset is a synthetically generated structured JSON cohort containing patient demographics, multi-visit histories, conditions, medications, allergies, detected issues, and de-identified provider notes, designed to represent realistic chronic disease scenarios such as type 2 diabetes, hypertension, and neuropathy without using any real patient data for privacy. It drives questionnaire generation and provides a stable reference for reproducible evaluation across diverse clinical profiles.

Speech evaluation — MedDialog-EN (public): For evaluating speech-to-text transcription quality, the project uses 100 sample reference utterances from MedDialog-EN, a large-scale public dataset of 257,000+ English patient–doctor consultations spanning 96 medical specialties, published by UCSD AI4Health. Utterances covering symptom descriptions, medication adherence, and pain scale responses are selected and paired with category labels to compute word error rate and medical terminology accuracy for Google Cloud Speech-to-Text v2.

Vision evaluation — Wound Image Dataset / Kaggle (public): For evaluating image analysis quality, the project uses 100 samples from the publicly available wound image dataset on Kaggle, which contains labeled wound images across seven categories, including burns, abrasions, lacerations, cuts, bruises, stab wounds, and ingrown injuries. These images, along with 200 medication bottle and insurance card samples, are paired with human-authored clinical descriptions and severity labels to support a GEval-style metric that scores how accurately Gemini Vision describes observable findings versus human ground truth.

Question generation evaluation: Patient profile fixtures encoding expected question themes are used to score the CrewAI pipeline and the Gemini baseline on relevance, appropriateness, and hallucination rate using DeepEval AnswerRelevancy and Hallucination metrics.

Runtime evaluation corpus: Every system interaction is logged as JSONL and optionally streamed to BigQuery, capturing inputs, outputs, latency, model identifiers, and workflow correlation IDs. These logs support per-feature and cross-workflow DeepEval runs comparing text-only, speech-only, camera-only, and fully multimodal intake paths.

The software accepts any structured patient JSON following the same schema as the synthetic cohort, and the voice and image modules operate at the browser level, making them independent of any underlying EHR system.
","m2uni":"sk5659","m2fname":"Shivangi","m3uni":"aa5775"},{"projectname":"A Functionalist System for Deterministic LLM Critical Thinking: CoCoMo Pipeline Design, Implementation, and Evaluation","timestring":"Tue May 13 08:41:20 2025","m1uni":"he2305","m2lname":"Lai","m1fname":"Hiroki","m4fname":"","m1lname":"Endo","m3fname":"Liangke","description":"The primary objective of CoCoMo is to create a reproducible, transparent “System-2” reasoning engine that leverages large language models to perform structured critical thinking. Our innovations include a deterministic LLM client with JSON-mode prompting, an LRU-cached Receptor to minimize redundant calls, a two-level MFQ scheduler to model bursts and fading of attention, and CRIT subroutines for claim extraction, reason generation, validation, and weighted evidence aggregation. Collectively, these capabilities enable audit-friendly analysis: tracking every API call, flagging low-credibility inferences, and summarizing net support via a single γ-score. Such toolkits are important because they bring scientific rigor and interpretability to AI reasoning, facilitating ethical, accountable deployment in high-stakes domains like policy evaluation, scientific review, and compliance monitoring.","uni":"he2305","language":"Python / ipynb on Google Colab","pid":"202505-15","m4uni":"","analytics":"CoCoMo’s analytics stack comprises modular algorithms and system components: the llm_call Receptor wraps the OpenAI API with deterministic sampling and an LRU cache; the MFQScheduler implements a priority-demotion queue to schedule CRIT jobs; extract_claim and extract_reasons perform structured extraction of assertions and arguments via JSON-mode prompts; validate_reason assigns numeric support and credibility scores and flags low-confidence cases; and compute_gamma aggregates evidence into a net γ-score per chunk. The crit_validate orchestrator ties these modules into an end-to-end workflow, printing detailed logs and summary metrics. 
Though the current prototype outputs text-based summaries, it can be extended with visualization modules using standard Python plotting or data-frame libraries.","m4lname":"","industry":"Information","m3lname":"Wu","dataset":"We validated CoCoMo on multiple made-up curated case studies. These texts were assembled by AI from publicly available reports and white papers, then embedded directly into the doc variable for end-to-end testing. While our experiments used these bespoke examples, the pipeline is data-agnostic and readily supports any plain-text corpus—including public datasets like FEVERFact for claim extraction, Wikipedia articles for policy analysis, or Markdown-annotated research abstracts—by simply loading input strings or JSON objects into the crit_validate function.","m2uni":"jl6932","m2fname":"Jingyi","m3uni":"lw3161"},{"projectname":"Speed Dating Data Analysis","timestring":"Sat Dec 14 04:46:36 2019","m1uni":"yl4272","m2lname":"Anant","m1fname":"Elliot","m4fname":"","m1lname":"Liu","m3fname":"Raksha","description":"We want to find out what makes two people click when they meet for the first time. With the emergence of new online dating platforms, recommendation engines play an ever-more-important role in bringing people together for the first time. Using comprehensive data from speed dating experiments done by academics, we want to see if we can design an effective recommendation engine for a dating platform. Our algorithm takes in a person's attributes and preferences, and uses cutting-edge data science techniques to find most probable matches. ","uni":"yl4272","language":"Python, JavaScript, HTML, BigQuery","pid":"201912-29","m4uni":"","analytics":"Exploratory analysis were done with matplotlib and seaborn visualization packages. For the recommender, a number of models, including linear, distance-based, tree-based, and ensemble models, were tested with optimal parameters obtained from grid search with cross-validation. 
D3 was used for final visualization.","m4lname":"","industry":"Social Science-Government","m3lname":"Ramesh","dataset":"The data was collected from a set of speed dating experiments conducted by a professor at Columbia Business School between 2002 and 2004. The dataset has over 8000 rows and 195 columns, and each row is the interaction record between two participants. The dataset is available at https://www.kaggle.com/annavictoria/speed-dating-experiment","m2uni":"ka2477","m2fname":"Kavita","m3uni":"rn2486"},{"projectname":"Google Analytics Customer Revenue Prediction","timestring":"Fri Dec 21 19:19:38 2018","m1uni":"cm3700","m2lname":"Hu","m1fname":"Chi","m4fname":"","m1lname":"Ma","m3fname":"Yuchong","description":"Project Objectives: Predict Google Merchandise store revenue per customer.

Innovations: We used newly developed machine learning algorithms such as LightGBM to predict the target, and we created additional independent variables that were not in the original data. We also used an ensemble that includes some of the newer GBDT methods; it gives good results but takes longer to train.

Capabilities: The best result reaches an RMSE of 1.4.

The research is important because a good revenue prediction model could help marketing teams identify significant factors that influence revenue and therefore use their marketing budget more efficiently. In the retail business, promotional strategy is very important in boosting sales, so we hope companies that choose to run data analysis on top of Google Analytics data can spend their marketing budget well based on our results.","uni":"cm3700","language":"Python, Jupyter Notebook and Google Cloud Platform","pid":"201812-29","m4uni":"","analytics":"Analytics: statistical analysis, exploratory data analysis

Algorithms: Linear regression with Elastic Net, Regression Tree, Gradient Boosting, XGBoost, CatBoost, LightGBM.

System Modules: Python 3.7, Tableau 2018.3.1, Sublime Text 3

Visualization: Python library: Plotly; Tableau; HTML demo","m4lname":"","industry":"Retail","m3lname":"Wang","dataset":"","m2uni":"zh2290","m2fname":"Zhejing","m3uni":"yw3081"},{"projectname":"Accident probability analysis based on external factors in NYC","timestring":"Mon Dec 19 17:41:56 2022","m1uni":"yd2611","m2lname":"Han","m1fname":"Youming","m4fname":"","m1lname":"Ding","m3fname":"Zeheng","description":"The goal of our project is to analyze the possible factors that affect the probability of collisions in NYC, including time, weather, location, and so on. Many people die in traffic accidents. Based on the analysis, we want to find the regions with high collision probability in different time periods to help people avoid potential accidents.

","uni":"yd2611","language":"languages : python, pyspark, JavaScript, HTML, CSS. Platforms: GCP","pid":"202212-26","m4uni":"","analytics":"We did correlation measurement by Pearson correlation coefficient. We also did clustering by the K-means algorithm. We visualized some data by D3 and the clustering result with JavaScript, HTML, and CSS.
","m4lname":"","industry":"Transportation","m3lname":"Yang","dataset":"We used Details of Motor Vehicle Collisions in New York City provided by the Police Department and New York City weather data. Both are open-source datasets, so we got the data directly from the website. Our software can support other traffic data like taxi collision data as well.","m2uni":"sh4332","m2fname":"Shaochen","m3uni":"zy2532"},{"projectname":"Airbnb Rent Price Prediction","timestring":"Sat Dec 18 04:31:14 2021","m1uni":"ld2938","m2lname":"Kong","m1fname":"Lisen","m4fname":"","m1lname":"Dai","m3fname":"Liqin","description":"Our goal is to predict the Airbnb housing price trend with consideration of the pendemic of COVID-19 and twitter emotion analysis. We assume that pandemic should be one factor people may consider before the trip and this could significantly influence the local housing price.

Under this assumption, we are building a web app that tells users about housing prices, the pandemic situation, and people’s sentiments. First, we examine the relationships among housing prices, the COVID-19 situation, and the sentiment of people around the world toward cities, chronologically. Then we model those relationships and make comprehensive predictions for housing prices and the COVID-19 situation. Finally, we present an overall view of all the analysis and predictions to users, and based on that information we provide suggestions for people's trips.

Our research is important and meaningful. The results could help travellers make more reasonable business decisions when travelling. Meanwhile, they could also help house owners make appropriate listing recommendations. Ordinary people can also gain data insights from this that may be helpful to them in the future.","uni":"ld2938","language":"HTML, CSS, JS for frontend. Fusioncharts to draw the charts and Mapbox to provide the map canvas. Mainly Python as backend language. Django as server, Airflow as scheduler. GCP BigQuery as database, and pandas as the communication tool between backend and database.","pid":"202112-21","m4uni":"","analytics":"Analytics:
1. Data Fusion and Data Integration: We have combined a set of methods that analyze and integrate data from multiple sources and solutions. The housing price data, pandemic trends, and sentences implying sentiments are used together in our app. The insights are more efficient and potentially more accurate than if developed through a single source of data.
2. Data Mining: We use data mining to extract patterns from housing price, pandemic situation and sentiments. We combined methods from statistics and machine learning, within database management. This would help our users to easily get the hidden information which they would not directly get.
3. Machine Learning: We use machine learning to get our predictions about housing price and pandemic situation. And those predictions would help our users to know about what they should do next. This is also a very important part in our analysis.
4. Natural language processing (NLP): We use NLP to analyze sentiments about a city, which will be useful to our users.
5. Statistics. This analytics method works to collect, organize, and interpret data, within surveys and experiments.
6. Others: spatial analysis, predictive modeling, association rule learning, network analysis and many, many more.

Algorithms:
1. Deep Learning: We use deep learning to predict housing price with input features.
2. LSTM: Long short-term memory (LSTM) is used to predict COVID-19 trends over time.
3. Sentiment Analysis: This measures how negative or positive a sentence is. We use the scores to determine whether, and how much, a city is popular around the world.

System Modules:
1. Django
2. Keras
3. Pandas
4. BigQuery
5. Airflow
6. TensorFlow
7. scikit-learn

Visualization:
1. Mapbox
2. React
3. Fusioncharts

","m4lname":"","industry":"Life Science","m3lname":"Zhang","dataset":"1. CSSEGISandData: a covid-19 data repository manipulated by JHU. It provides the daily confirmed COVID-19 cases in most important cities around the world.

2. InsideAirbnb: a listing of housing and pricing posted by users from Airbnb, which is not publicly available from Airbnb. It contains all useful information regarding the homes/apartments, private or shared rooms.

3. Twitter stream: a collection of tweets gathered directly using the Twitter API, normalized for sentiment analysis.","m2uni":"xk2144","m2fname":"Xiangcong","m3uni":"lz2809"},{"projectname":"US Flights data analysis, visualization and delay prediction","timestring":"Fri Dec 13 02:43:38 2019","m1uni":"yl4003","m2lname":"CAO","m1fname":"YUE","m4fname":"","m1lname":"LUO","m3fname":"LINGSONG","description":"Over 100 thousand flights are in the air every day. Huge numbers of people and large amounts of cargo are transported from place to place unceasingly, facilitating the world's business communication. With a great number of flight records, it is easy to use analytic tools to discover and present the value hidden in the big data. In this project, we analyze the distribution of the data, build a web application that visualizes the flight routes, and finally train and deploy a machine learning model for delay prediction. The analysis and application give useful tips for people to select flights, and for companies to set up their business.","uni":"yl4003","language":"python, Google Colab, Google Cloud Platform, bigquery, django, D3.js, sklearn, plotly, pandas","pid":"201912-41","m4uni":"","analytics":"Basic statistical analysis, K-means clustering, Linear Regression, Random Forest","m4lname":"","industry":"Transportation","m3lname":"GAO","dataset":"2015 Flight Delays and Cancellations: https://www.kaggle.com/usdot/flight-delays

The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.","m2uni":"yc3518","m2fname":"YUHAO","m3uni":"lg3018"},{"projectname":"Movie Recommendation System with Chatbot","timestring":"Fri Dec 23 21:56:09 2022","m1uni":"sl4980","m2lname":"Yu","m1fname":"Siying","m4fname":"","m1lname":"Liu","m3fname":"Zuguang","description":"This project presents Machine Learning-based algorithms for a collaborative filtering movie recommendation system with front-end virtual chatbot implementation. Collaborative filtering and K-nearest neighbor algorithms are implemented to design a full-stack system for movie lovers by streamlining their movie-searching experience across the AWS platform. With our methodology and proper data processing techniques, this system could be ideally implemented in any streaming service with a valid amount of user viewing history.","uni":"sl4980","language":"Python, Java, AWS S3, AWS Personalize service, Amazon Lex","pid":"202212-16","m4uni":"","analytics":"We have explored three cross-validating, simple algorithms for collaborative filtering-based movie recommendation systems in this project. Collaborative filtering defines user-based and item-based recommendations based on the rating given by the user. Correlation analysis identifies the relationship between the rating ranking of viewed movies with others from the dataset. KNN gets the most similar one to the view movie list while matrix factorization differentiates the recommended movie from all others from the list to give a valid and personalized recommendation when users are presented with options.
In terms of the system, we researched existing movie recommendation chatbots and chose to build a system that is more complete and intelligent than other systems, from database to frontend. We chose AWS S3, an object storage service offering industry-leading scalability, data availability, security, and performance, to host our webpage; API Gateway to manage the webpage API; and Amazon Lex to deploy the dialogue robot and then pass messages through SQS to interact with the database. The intermediate middleware is connected using a Lambda function.

","m4lname":"","industry":"Media","m3lname":"Li","dataset":"Two datasets which come from Movielens and Netflix respectively are used. We initially intended to complete our project by solely using the Netflix dataset, but it was then discarded due to the lack of diversity in the dataset which might induce more workload of processing for modeling. These files from Movie Lens contain 1,000,209 anonymous ratings of approximately 3,900 movies
made by 6,040 MovieLens users who joined MovieLens in 2000.","m2uni":"by2345","m2fname":"Bo","m3uni":"zl3236"},{"projectname":"NYC Vehicle Crash Analysis","timestring":"Sat Dec 18 03:05:48 2021","m1uni":"so2639","m2lname":"Arora","m1fname":"Shivam","m4fname":"","m1lname":"Ojha","m3fname":"Animesh","description":"In this project, the aim is to analyze the NYC motor vehicle crash data, understand correlations and patterns, and analyze trends against the weather conditions in the area of the crash. Extending the scope further, the aim is to relate varying weather conditions to the severity of the accident and to how the number of people killed or injured varies with them.
To develop this system end to end, first an ETL pipeline is set up to continuously ingest data from the NYC vehicle crash dataset with a daily update. The dataset contains about 2 million records. The next step involves cleaning the dataset and backfilling missing values of zipcode, latitude, longitude etc. so that we have comprehensive data about each incident. Another major step is to scrape weather data from wunderground for each weather station in New York and store it in another table. This data is then processed and mapped with the crash dataset.
The workflow pipeline is orchestrated using an Airflow DAG, that runs on a daily schedule. After the dataset is completely ready, it is used for visualization and prediction analysis. Machine learning models like logistic regression, support vector machines and Gradient Boosting Classifier are evaluated to predict the severity of a vehicle crash. Also, a recurrent neural network is trained to predict the weather conditions as well. The analysis and interactive heat maps and visualizations created on metabase and amcharts are used to understand the crash factors, high risk areas and a warning system that can warn users based on the historic data.","uni":"so2639","language":"Python, Javascript, HTML, CSS","pid":"202112-26","m4uni":"","analytics":"• Cloud setup and orchestration: Airflow Scheduler, Google Cloud Platform
• Analytics and visualizations: Metabase, Amcharts
• Frameworks and packages: scikit-learn, Pandas, Tensorflow, Numpy, bs4, googlemaps, psycopg2, etc
• Algorithms: Logistic Regression, SVM, Gradient Boosting Classifier, RNN
","m4lname":"","industry":"Transportation","m3lname":"Bhasin","dataset":"This project has utilized two datasets. Weather data has been web scraped from wunderground.com and the vehicle crash data has been fetched from the NYC Open Data website. Both datasets contain data from 2012/07/01 and the datasets are refreshed on a daily schedule.","m2uni":"aa4822","m2fname":"Abhishek","m3uni":"ab5051"},{"projectname":"Investment Strategy -- AI Trader (Foreign Exchange)","timestring":"Fri Apr 23 15:55:09 2021","m1uni":"hs3224","m2lname":"Jiang","m1fname":"Hongbo","m4fname":"","m1lname":"Shen","m3fname":"","description":"This course project is desired to train a Reinforcement Learning model which is able to automatically make optimized investment strategy on stocks and realize maximum profit. Gross target is to predict future trend of stock price movement and evaluate profitable buying and selling point. Firstly obtain, analyze and process previous stock data from yahoo finance API (raw data). Construct a reinforcement learning framework and use cleaned data to train model. The excepted system will learn pervious datasets of some stocks, predict future stock trends and make investment strategy that can maximize the profit. Build a back-test system and test with practical trade scenarios with pre-trained model to do feedback assessment and back-train the model. Finally construct an UI and enable users to make trade simulations according to their preference.
So far, we have reached a rather good model that has both fast convergence and high reward. The model converges at about step 31 and the reward value is 400000. The max_draw_down is 8.94%, the sharpe_ratio is 2.673, and the annual_return is 34.333%.
","uni":"hs3224","language":"We use python and tensorflow-keras as our training package.","pid":"202105-12","m4uni":"","analytics":"We built up a reinforcement learning model, and applied Deep Q-Learning algorithms. Implemented html as our front end of the final UI, and used matplotlib package to visualize the data.","m4lname":"","industry":"Finance","m3lname":"","dataset":"We trained the model with BTCUSD and ETHUSD datasets. We found our datasets on the yahoo finance API, which is available for many financial information and news. The time span of datasets includes 1 minute, 1 hour and 1 day, all in-time datasets are available on yahoo finance API.","m2uni":"jj3141","m2fname":"Junzhe","m3uni":""},{"projectname":"","timestring":"Sat Dec 22 00:54:14 2018","m1uni":"xz2737","m2lname":"Wang","m1fname":"XInfu","m4fname":"","m1lname":"Zhang","m3fname":"Jia","description":"Steam is the largest digital game distribution platform in the world with a massive collection that includes everything from AAA blockbusters to small indie titles, so great discovery tools can be super valuable for Steam. What’s more, Steam has an open database which includes various game information like the number of players, the price of the games and the rating of the games. We would like to analyze the information in database to show the relationship between the games and the players. In addition, we want to classify users and make recommendations to them according their preference. ","uni":"xz2737","language":"python","pid":"201812-06","m4uni":"","analytics":"classification and recommendation
K-means, Gaussian Mixture, Collaborative filtering
The classification is visualized, and the recommendation component forms a recommender system.","m4lname":"","industry":"Information","m3lname":"He","dataset":"“steamgame.csv” 745MB, more than 10000000 rows and 4 columns.
The dataset contains information on what games some Steam users bought and how many hours they spent playing them.
","m2uni":"pw2480","m2fname":"Pengchong","m3uni":"jh4001"},{"projectname":"Ensemble Reinforcement Learning for AI Trading in Equity Markets","timestring":"Fri May 3 19:28:00 2024","m1uni":"ws2686","m2lname":"Nguyen","m1fname":"Weihao","m4fname":"","m1lname":"Song","m3fname":"","description":"With the firehose of new information about stock markets and myriad financial products offered by asset managers, it is difficult for individual users to synthesize so much information and manage their investment portfolios. Furthermore, manual trading on a daily basis is cumbersome and paying high frees for premade inflexible financial products is also not ideal.

We aim to solve some of these challenges faced by individual investors by training and serving Reinforcement Learning algorithms that offer extensive customizability for users. Users can define their own risk preference when trading their portfolios. Also our algorithms can execute trades systematically and automatically on behalf of users or recommend the best long and short trades for users.

We also design an intuitive and accessible system to let users interact with our algorithms and help users manage their portfolios and automate trades with ease.","uni":"ws2686","language":"Python, React, AWS","pid":"202405-11","m4uni":"","analytics":"Data ingestion with Yahoo Finance downloader
Feature engineering module to build custom features
Training RL agents with ensembling and backtesting
Serving module to recommend or execute trades on a daily basis
Cloud database to hold and manage user data
React for our app to let users interact with our algorithms with ease and visualize their portfolio balances and backtest performances
Robust security offered by cloud infrastructure","m4lname":"","industry":"Finance","m3lname":"","dataset":"We use Yahoo Finance Data Downloader to download pricing data for stocks. Specifically, we download the stock ticker, open price, close price, high price, low price, and volume information to create our features.","m2uni":"jn2814","m2fname":"John","m3uni":""},{"projectname":"Collaborative Spotify Playlist System","timestring":"Sat Dec 17 01:00:26 2022","m1uni":"ad4017","m2lname":"Munyuza","m1fname":"Alban","m4fname":"","m1lname":"Dietrich","m3fname":"Andrew","description":"Our goal is to build a custom playlist generator using multiple user inputs in the form of existing playlists. We allow users to interact with our model using an online user interface where they can link their profiles and submit generation requests.

Spotify recently created an option to ‘blend’ playlists from different users, but it doesn't contain any new songs; it's just a mix of the users’ playlists. We are breaking new ground by creating a playlist with new songs. In other words, and more specifically, our toolkits (model and user interface) allow us to provide the following innovations and values:
- Social value of having Spotify over other streaming services increases
- Helps fill the missing social aspect of Spotify, thus increasing the occasions on which people use Spotify
- Increases reliance on Spotify in the streaming services for users
- Exposes users to new music and new genres

We have a web interface using Flask, Python, HTML, CSS, and Javascript. Here are the capabilities and the different steps to get the final result on our website:
1. Log in to your Spotify account
2. Select the playlist and songs you want
3. Once selected, run the model
4. The output playlist will appear and you will have the option to save it to your Spotify account","uni":"ad4017","language":"Python, HTML, CSS, Javascript, Flask, Spark, GCP","pid":"202212-5","m4uni":"","analytics":"We collected the data using the Spotify API and the Spotipy Python library, uploaded it as a CSV file to GCP, and ran it on the Dataproc cluster.

For preprocessing, we used Spark.

We visualized the data using a histogram, a correlation graph, a joint Seaborn graph, a radar graph, and regular graphs between different parameters (e.g., danceability VS tempo).

We tried different models. The first was a basic model: we calculate the variance of the parameters of each song in the user playlist, assign a higher weight to parameters with lower variance, take the average of the parameters in the input user playlist, and select the output songs with minimum Euclidean distance.
Then we also tried KNN, DBSCAN and OPTICS. But finally we implemented a Random Forest Classifier which gave the best results. ","m4lname":"","industry":"Information","m3lname":"Xavier","dataset":"First, we tried some small general datasets from Spotify playlists (e.g., Top 2022) by taking 1000 songs using the Spotify API (this limitation on the full database is set by Spotify).
Then, we decided to use a Kaggle dataset of 26,173,514 songs (3.5 GB) to get more data.

Our software can support datasets using the same format as the one from Kaggle. So if the dataset is updated, these new songs can be added to our model.","m2uni":"km3829","m2fname":"Kenneth","m3uni":"ahx2001"},{"projectname":"From LLM Labels to Real-Time Scoring: Multilayer Emotion Modeling of Bilibili Comments","timestring":"Fri Dec 19 22:50:26 2025","m1uni":"wl3011","m2lname":"Zhang","m1fname":"Weiqi","m4fname":"","m1lname":"Liang","m3fname":"Xun","description":"Objectives

The primary objective of this research is to understand how public emotions evolve on large-scale social media platforms beyond simple positive–negative sentiment dichotomies. Specifically, the study aims to characterize the temporal dynamics of multidimensional emotions expressed in user comments, considering both absolute time (calendar time of comment generation) and relative time (delay between video publication and comment posting). By focusing on a youth-dominated platform, the research further seeks to capture emotion expression patterns that are representative of younger online communities.

Innovations

This work introduces several methodological and practical innovations:

1. Three-layer emotion framework with LLM-based scoring
Instead of traditional discrete or binary sentiment labels, we propose a three-layer emotion framework and leverage large language models (LLMs) to generate continuous emotion scores, enabling more nuanced measurement of emotional intensity and composition.

2. Integration of LLM outputs with statistical modeling
We combine LLM-derived emotion scores with ANOVA and mixed-effects models to rigorously analyze emotion dynamics across time, video categories, and major social contexts (e.g., pandemic periods). This bridges modern NLP techniques with interpretable statistical inference.

3. Domain-adaptive fine-tuning of open-source LLMs
Using high-quality LLM emotion scores, we fine-tune an open-source model (DeepSeek) to better align with the language habits and expressive styles of young users, improving performance in informal, platform-specific contexts.

4. Human-in-the-loop evaluation toolkit
We develop a web-based manual labeling interface to validate LLM predictions and assess model accuracy, ensuring transparency and empirical grounding of automated emotion analysis.

5. Interactive emotion analysis platform
An additional interactive web page allows users to input comments and receive precise, multidimensional emotion scores, demonstrating real-world applicability beyond academic analysis.

Capabilities

The proposed research and toolkits provide the following key capabilities:

1. Scalable extraction of fine-grained, continuous emotion signals from large volumes of social media text

2. Statistical identification of temporal emotion patterns, including shifts across societal events and content categories

3. Adaptation of emotion models to youth-oriented linguistic norms, improving ecological validity

4. Transparent evaluation through human annotation and model comparison

5. Practical deployment via interactive web tools for analysis, education, or public engagement

Why This Research and Toolkits Are Important

This work addresses critical limitations of existing sentiment analysis approaches that rely on coarse, static labels and lack temporal or contextual depth. By integrating LLM-based emotion scoring with robust statistical modeling and human validation, the research provides a methodologically rigorous and socially relevant framework for studying public emotions at scale. The resulting toolkits enable researchers, policymakers, and platform analysts to better understand how emotions emerge, evolve, and differ across communities and time—particularly within younger populations that are often underrepresented in traditional survey-based emotion research.","uni":"wl3011","language":"Python; PyTorch; Gradio","pid":"202512-8","m4uni":"","analytics":"The project implements an end-to-end emotion analytics pipeline that integrates large-scale data processing, advanced modeling, statistical inference, and interactive visualization. Starting from a Python-based crawler that collects video metadata and time-stamped comments from Bilibili, the system applies large language models to generate continuous, multidimensional emotion scores under a unified three-layer framework covering core affect, basic emotions, and social-media–specific signals. A human-in-the-loop best–worst evaluation algorithm is used to benchmark different LLMs and select the most human-aligned teacher model, whose outputs then support downstream statistical analysis and model training. On the analytics side, the project combines descriptive distributional analysis, correlation analysis, period-based temporal comparisons, and hierarchical linear mixed-effects models to quantify how emotions evolve over absolute time, relative response time, engagement intensity, and hierarchical content structure (comments nested within videos and categories). 
To enable scalable and low-latency deployment, an open-source LLM is fine-tuned via QLoRA for continuous Valence–Arousal–Dominance regression, producing a lightweight student model that preserves alignment with human perception. These components are encapsulated in system modules for data ingestion, emotion annotation, human validation, model training, and deployment, and are complemented by rich visualizations including emotion distributions, correlation heatmaps, temporal trend plots, mixed-effects trajectory and variance decomposition plots.","m4lname":"","industry":"Media","m3lname":"Sun","dataset":"A Python-based crawler was developed to access Bilibili’s public web APIs, thereby retrieving video information and their comments. Specifically, we firstly collected videos from the Weekly Must-Watch using the https://api.bilibili.com/x/web-interface/popular/series endpoint. For each video, metadata including title, BV identifier, category, uploader information (UID, name), and video statistics such as views, likes, favorites, coins, shares, and danmaku count were obtained. Additional details such as video duration and description were collected through supplementary API calls.","m2uni":"sz3319","m2fname":"Shiyu","m3uni":"xs2569"},{"projectname":"Automatic Voice License Plate Reader","timestring":"Thu Dec 22 19:07:28 2022","m1uni":"lh3057","m2lname":" Wang","m1fname":"Liang","m4fname":"","m1lname":"Hu","m3fname":" Kejun","description":"Our project is to generate a spoken language engine that extracts driver license information from spoken language and converts them to the correct text format. After using Google’s Speech-to-Text API to convert the audio files into text data, we compare the similarities between the generated text and the reference text by using the Levenshtein distance score and NLTK-bleu score. 
Combined with the extra experiments with humans performing the same audio recognition tasks, we try to understand factors that potentially influence our model performance. ","uni":"lh3057","language":"Python3","pid":"202212-15","m4uni":"","analytics":"There are the two main algorithms that we used in this project.

Nltk-bleu score:
We found that processing the audio directly does not give ideal results, so we want to improve accuracy by measuring the similarity of words. The NLTK BLEU score compares a candidate translation of a text against one or more reference translations. We calculate the score for up to 4-grams using uniform weights.

Levenshtein Distance:
Another way to evaluate the similarity between the translated text and the reference text is to calculate the Levenshtein distance, a string metric for measuring the difference between two sequences. The distance is the minimum number of single-character deletions, insertions, or substitutions required to transform a reference word into a translated word.
","m4lname":"","industry":"Information","m3lname":" Liu","dataset":"The most important data we need for this project are audio files that contain human voice recording with license plate information. And we obtain the desired data set from Lab volunteers. Our software supports both Wav files and MP4 files.","m2uni":" ww2584","m2fname":" Weiran ","m3uni":"kl3434"},{"projectname":"Creating Digital Human: Text-to-Speech Synthesis and Its Visual Extension","timestring":"Fri May 6 14:00:14 2022","m1uni":"ec3576","m2lname":"","m1fname":"Enze","m4fname":"","m1lname":"Chen","m3fname":"","description":" Our task goal is divided into 3 parts. Firstly, we are going to achieve basic audio simulation using deep learning and analyze its results.The second part will be taking time cost and accuracy into consideration. Finally, we are going to try to do our own visualization, including creation for lip animation, or maybe even the character animation that involves body movements, emotions, etc).

This task might be really challenging: it involves not only building connections between text and sounds, but also combining sound and animation, which might require us to use audio processing as well as image processing. Previous research is also really important, since a specific deep learning network needs to be designed for this project.","uni":"ec3576","language":"Python. Works best on Linux/Unix, but might also work on Windows","pid":"202205-3","m4uni":"","analytics":"WaveNet: A deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.

Glow-TTS: Glow-TTS is a flow-based generative model for parallel TTS that does not require any external aligner. By combining the properties of flows and dynamic programming, the proposed model searches for the most probable monotonic alignment between text and the latent representation of speech on its own. We demonstrate that enforcing hard monotonic alignments enables robust TTS, which generalizes to long utterances, and employing generative flows enables fast, diverse, and controllable speech synthesis.

HiFi-GAN: It is a GAN-based model capable of generating high-fidelity speech efficiently.","m4lname":"","industry":"Information","m3lname":"","dataset":"We chose LJSpeech as our dataset. This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours. The texts were published between 1884 and 1964, and are in the public domain. The audio was recorded in 2016-17 by the LibriVox project and is also in the public domain.

Other single speaker datasets are also supported.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Cultivating Public Interest In Ancient DNA Using Lactose Intolerance And Data Visualization","timestring":"Sat Dec 21 02:17:00 2024","m1uni":"pj2422","m2lname":"Subramanya","m1fname":"Pranav","m4fname":"","m1lname":"Jain","m3fname":"Adam","description":"There are three challenges with respect to the relationship between the general public and the field of ancient DNA. The first challenge is that many people are unaware of the field. The second challenge is that people who are aware of the field do not see its relevance to today's world. The third challenge is that people who lack a biology background (but have an interest in the field) find it daunting to enter the field themselves. To help solve these challenges, we created an educational website on lactose intolerance for the general public.","uni":"pj2422","language":"Node.JS, Express.JS, Leaflet, D3, ConvertF, Bootstrap","pid":"202412-17","m4uni":"","analytics":"The website contains an interactive map that allows exploration of single nucleotide polymorphisms (SNPs) of ancient and modern humans that impact the presence or absence of lactose intolerance. The website also contains all the background information needed for understanding the map and lactose intolerance. The purpose of the website is to show viewers how helpful ancient DNA is in understanding why certain people have lactose intolerance and certain people do not. The intended audience of the website are the general public and high school students.","m4lname":"","industry":"Life Science","m3lname":"Vogt","dataset":"We are using the Allen Ancient DNA Resource (AADR) from Dr. Reich's laboratory at Harvard as the source for our genomic and contextual information. In the AADR, the genomes and context of ancient individuals published in hundreds of articles are compiled into a standard format, curated, and updated as new publications become available. 
The version of the AADR we used contains the genomes of 13,571 ancient people and 4,058 modern people \cite{AADR_Dataset}. Contextual information for individuals includes longitude, latitude, sex, age at death, skeletal element, group identifier, locality, political entity, and other elements. The DNA of each individual at up to 1,233,013 SNPs is represented \cite{AADR_Dataset}. In total, 19,103,623,021 SNP records are contained in 20.2GB of data. Furthermore, we took a list of 23 lactase persistence SNPs from Dr. Anguita-Ruiz's research \cite{lp_map}. In preprocessing, we decompressed the data. Then we converted it from an older format, ancestrymap, to a newer format, eigengeno. Next, we imported it into Python and indexed each individual. Subsequently, we filtered duplicate entries for the same individual. Importantly, we searched for the 23 SNPs associated with lactase persistence in the AADR and extracted them. Finally, we merged genetic data with contextual information to create an informative list of lactase persistence SNPs from ancient and modern people.","m2uni":"ks4298","m2fname":"Keshavadithya","m3uni":"av3047"},{"projectname":"Yelp Rating Interpretation with Text-based and Graph-based Features","timestring":"Sat Dec 22 23:13:14 2018","m1uni":"zl2621","m2lname":"Chen","m1fname":"Zhuoran","m4fname":"","m1lname":"Liu","m3fname":"","description":"Interpreting and understanding how much impact different factors can have on users’ final ratings for restaurants is of key interest to restaurant owners who may later improve their businesses accordingly. Previously, researches on Yelp rating prediction focused heavily on pushing forward task-related metrics such as accuracy and RMSE. In this work, we propose to interpret the prediction
process per se by building a random forest classifier for rating polarity prediction that relies only on transparent features from review texts and relation graphs of users and restaurants. Our classifier achieved reasonable accuracy and provided meaningful insight into how users evaluate restaurants.","uni":"zl2621","language":"Python, Google Cloud VM","pid":"201812-46","m4uni":"","analytics":"Algorithms:
- n-gram
- random forest
- importance calculation by mean impurity decrease
- Louvain method of community detection
- PageRank
- Betweenness
- Closeness
System Modules:
- Data Cleansing
- Text feature extraction
- Graph feature extraction
- Random forest classifier
- Feature importance computation
- Feature importance visualization
Visualization
- Feature importance: Bar chart w/ standard deviation","m4lname":"","industry":"Information","m3lname":"","dataset":"The Yelp dataset is a subset of our businesses, reviews, and user data for use in personal, educational, and academic purposes. Available as JSON files, use it to teach students about databases, to learn NLP, or for sample production data while you learn how to make mobile apps.","m2uni":"mc4414","m2fname":"Mingye","m3uni":""},{"projectname":"Climate Change Data Analysis","timestring":"Fri Dec 22 14:45:48 2023","m1uni":"amp2365","m2lname":"Goel","m1fname":"Apurva","m4fname":"","m1lname":"Patel","m3fname":"Milin ","description":"1. Agenda & Motivation: This project aims to predict weather for a historically challenging region in the NW region of France specifically for latitude range: 46°17’59’’ - 52°6’ and longitude range: -6°47’59’’ - 2°6’ which is a region of 550 x 550 km along the Brittany coast in EU. The area is home to the English channel, Euro tunnel and maritime routes which are economically important for both France and UK and extensively used for both civilian and defence purpose.

2. Usability: The current APIs which give future predictions are not specifically catered to address the temporal conditions of the area. Also being close to the frigid zone creates drastic variations sometimes which happen to be anomalies.

3. Novelty: The novelty of the project lies in the visualization of the predictions made using a suitable colormap which helps to draw conclusions even for someone who is not experienced with cartography and weather modelling.","uni":"amp2365","language":"Python, HTML, JavaScript, CSS, Spark","pid":"202312-11","m4uni":"","analytics":"We used spark,pandas and numpy to clean our existing historical data. Then to create a model we used tensorflow to implement MLP, LSTM and other models for time-series modelling. We predicted 3 parameters (temperature, precipitation, wind speed) using 11 dependant variables. We made a json file to store the predictions. Also each time we make predictions for a latitude and longitude range with step size of 0.25 (20km) and then we display the predictions for the region and also numerical values for the desired longitude and latitude. We fetch current openweather data using API and then we can compare our predictions.
The entire pipeline is handled using Flask where we create a webpage and whenever a request is made from the UI we call the model(s) in the backend and make predictions for the user input. The UI was programmed using HTML & JS and the entire backend is coded in python which includes multiple libraries but not limited to cartopy and matplotlib.

Models implemented: MLP, LSTM, Logistic Regression
Analytics & Visualization: Flask webpage & pipeline with Cartopy visualizations","m4lname":"","industry":"Information","m3lname":"Saini","dataset":"The dataset(static-historical) contains full time series of the following data type:

1. Ground observations: over 500 ground stations measuring pressure, temperature, humidity, wind direction and speed, dew point and precipitation, recorded every 6 min. (i.e. 7 parameters recorded for 3 years every 6 minutes = 262,800 x 7 ~ 1,839,600 (1.84M) data entries)

2. Precipitation radar: radar reflectivity and total rainfall measured every 5 min. ( 315360 x 2 ~ 630,720 entries)

3. Satellite data: Cloud Type (CT) every 15 min (105,120 entries), Channels (visible, infrared) every 1 hour (26,280 x 2 channels ~ 52,560 entries).

4. Weather models: forecasts from 2 weather models with 2D parameters, generated once a day.

5. Land-sea and relief masks","m2uni":"rg3546","m2fname":"Ritwik","m3uni":"mks2249"},{"projectname":"Auto Face Doodling","timestring":"Fri Dec 13 14:59:25 2019","m1uni":"xs2291","m2lname":"Sun","m1fname":"Xiaowo","m4fname":"","m1lname":"Sun","m3fname":"Xiaoyun","description":"The project analyzes the Quick Draw datasets, focusing on the face and facial feature categories, to explore that if machine can identify the category and players’ country/region from the doodlings and automatically generate doodlings in a manner similar to human.

We built several classification models (Random Forest, CNN, RNN) to identify the word categories of doodles and the players’ countries, achieving an accuracy of 99% for word categories. We further explored automatic face doodling using a GAN.

In the end, we created a user interface that allows users to doodle a face or facial features in real time and presents the recognition result: the category and the region from which the face or facial features could come.
","uni":"xs2291","language":"Python, JavaScript, HTML, Google Cloud Platform","pid":"201912-19","m4uni":"","analytics":"EDA, t-SNE, Random Forest, CNN, RNN, GAN","m4lname":"","industry":"Information","m3lname":"Qin","dataset":"Quick Draw Dataset","m2uni":"ys3127","m2fname":"Yiqi","m3uni":"xq2189"},{"projectname":"Humans in the Loop: The Design of Interactive digital human","timestring":"Wed May 4 23:15:06 2022","m1uni":"hy2711","m2lname":"Pande","m1fname":"Haocheng","m4fname":"","m1lname":"Yun","m3fname":"","description":"We are focused on unlocking human potential by evolving the relationship between machines and humans from transactional to interactional. We want to created a system that allows for hyper-real face-to-face communication and interaction, making the machine feel alive and personal. The human brain is naturally able to process multiple inputs (light, sound, touch, etc.), prioritize its attention, learn through experience, create and store memories, and coordinate actions and behaviors based on rewards and intricate emotional systems.
","uni":"hy2711","language":"python, unreal engine","pid":"202205-15","m4uni":"","analytics":"ASR algorithm: Deepspeech2 from Baidu , network communication, lip-sync model from facegood","m4lname":"","industry":"Retail","m3lname":"","dataset":"UE developer guide
ASR developer guide
FACEGOOD developer guide","m2uni":"tp2673","m2fname":"Tanvi","m3uni":""},{"projectname":"Sentiment Pulse: An Intraday Sentiment Analysis Dashboard","timestring":"Fri Dec 20 01:32:09 2024","m1uni":"tf2503","m2lname":"Lin","m1fname":"Tomas","m4fname":"","m1lname":"Fiure","m3fname":"","description":"The goal of the project is to create a tool that provides a sense of market feeling on a particular stock with intraday precision. Currently, there are some tools on the market, though they have their issues, that are able to provide this information on a day-to-day basis. However, we are aiming to build a tool that can do this on an hour-to-hour basis. This product would be a useful tool for market participants when making decisions on market transactions. A single day can provide significant volatility and our tool would help navigate this volatility.","uni":"tf2503","language":"Google Cloud Products: Composer, CloudSQL, Cloud Run, VM instance in Compute Engine, Colab. Languages: Python, Javascript","pid":"202412-28","m4uni":"","analytics":"For data sourcing, we have a library that gives a list of recent articles, an API that gets us the article HTML, and a custom-built scraping script that gets us the article text. For scoring, we call a model that we fine-tuned. The fine-tuning was done by taking a Llama 3.1-8B and fine-tuning using memory control with LoRA, DeepSpeed, and Gradient Accumulation. This process was done in one go for each article and the resulting entry, with the source info and score, was stored in a database. Then, we built a light API to interact with the database. The dashboard is then built by doing API calls to the database through the API. As of now, there is a simple visualization consisting of a bar graph showing the average score of the last 5 hours for each ticker. We are planning on adding more visualizations and interactions on to the UI.","m4lname":"","industry":"Finance","m3lname":"","dataset":"There are a few datasets involved in the project. 
The first is the set of sources used for scoring that gets shown on the dashboard. This dataset is of our own making, it is not pre-compiled or publicly available. The second dataset used was fingpt-sentiment-llama3-instruct. This was used for training the model. Then, we used datasets named 'Twitter-financial-news-sentiment', 'Financial_phrasebank' and 'FiQA-2018' for fine-tuning the model.
","m2uni":"ll3713","m2fname":"Likun","m3uni":""},{"projectname":"What's so special?","timestring":"Fri Dec 13 15:05:46 2019","m1uni":"ab4685","m2lname":"Kocherlakota","m1fname":"Ankita","m4fname":"","m1lname":"Bhardwaj","m3fname":"","description":"Objective: Our main objective is to reduce the time that a customer has to spend to go through all the reviews to find out the specialties of a business (e.g. a special dish of a restaurant) and to also help the business owners to find out what people like the most about their business which can be used by them for better customer targeting for advertisements or also for menu selection, in case of restaurants.

Innovations: To attain our objective, we used NLP to go through all the reviews and find out what’s so special about a particular business. This feature can save a lot of time for the customers and provide them with much-needed information which will motivate them to use yelp more often thereby, benefiting Yelp. At present, this feature is missing in Yelp and can be incorporated into the existing yelp website.

Capabilities: This project finds the specialities of a business using NLP in PySpark, making customers and business owners more aware and happier.

These toolkits are important because it saves a lot of time which is valued a lot in today’s world where we want to be more aware of everything but we are not willing to devote time to them due to the lack of time. Also, it will help Yelp increase its customer’s satisfaction and retention rate and help the business owners to plan their future in a much better way.","uni":"ab4685","language":"Pyspark, Google Cloud Storage, NLP, HTML, CSS, Javascript","pid":"201912-44","m4uni":"","analytics":"We have used nltk in python to find out the specialities of a place. When a person enters the name of a business, it gets mapped to the corresponding business id and from the reviews dataset, relevant 4 and 5-star reviews are extracted. To filter the extracted reviews, first,the stop words are removed and the words are stemmed. Then, based on the occurrence of the words, specialities are found out. Similar algorithm is used for finding the location wise specialities as well. For visualization, we have created a word cloud for each business and location. ","m4lname":"","industry":"Information","m3lname":"","dataset":"We have used the Yelp dataset which contains 6.6M reviews for over 190k businesses. It contains Business Data, User Data, Reviews data, Checkin and Tips data, where the data was initially in the json format. We found the dataset online from Yelp’s open data challenge. The software is mainly devised for websites dealing with a large number of reviews. It can be used for various websites that post reviews about the universities to find out What’s so special in a university! 
Our software will be able to support many such applications.","m2uni":"nk2801","m2fname":"Nithya ","m3uni":""},{"projectname":"Yelper: Hybrid Recommendation System","timestring":"Wed Dec 23 03:11:48 2020","m1uni":"tks2132","m2lname":"Narravula","m1fname":"Tanmay","m4fname":"","m1lname":"Shah","m3fname":"Deepak","description":"Our project goal was to build a hybrid recommendation system to recommend restaurants to users based on their Yelp reviews. In addition to finding restaurants that users may like, we also aimed to identify restaurants that users may dislike.

We used a combination of collaborative filtering and content-based filtering to create recommendations. Specifically, our content-based filtering approaches used categorical data about businesses and textual data from reviews to help recommend restaurants based on what users actually like about a business.

After aggregating the results from our individual algorithms, we were able to create a hybrid recommendation system and display our results using a Django-based web application. Our project differs from past projects since it focuses on understanding what a user dislikes as well as what they like. Additionally, past projects were also focused mainly on collaborative filtering, while we enhance our recommendations using restaurant categories and textual data.","uni":"tks2132","language":"Python, PySpark, Spark MLLib, NLTK, Spark SQL, BigQuery, Django, HTML/CSS/JavaScript, Leaflet.js","pid":"202012-8","m4uni":"","analytics":"We implemented collaborative filtering and two variations of content-based filtering and aggregated our results to create a hybrid recommendation model. For collaborative filtering, we utilized the Alternating Least Squares (ALS) algorithm to predict expected user ratings given existing review and rating information.
For our first content-based method, we preprocessed our business data using PCA to include the most relevant attributes and categories. We grouped the businesses by user rating (>3 stars and <3 stars) and found the similarity score of user likes and dislikes by business. We then combined these scores and recommended businesses where the resulting score was positive.
For the second content-based method, we filtered textual review data by selecting nouns, lemmatizing (using WordNet), and then aggregating reviews by user and by business. We then used TF-IDF and segregated into a positive review set and a negative review set to create representations for businesses and users based off of their textual review data. Finally, we recommended businesses which shared attributes that users cared about.
To display our results and provide a query-based visualization, we implemented a Django-based web application that read our recommendation results from a BigQuery table and displayed results on a map. Users are able to enter their user ID, as well as other filters (e.g. category, location) and view our corresponding restaurant recommendations.","m4lname":"","industry":"Retail","m3lname":"Dwarakanath","dataset":"We used a subset of the Yelp dataset which is publicly available on their website after providing an email and reading the terms and conditions of use: https://www.yelp.com/dataset. For our project, we limited our test set to Yelp businesses which were restaurants and located around the New York state area. Additionally, we limited our user data to those who had more than ten reviews.","m2uni":"rrn2119","m2fname":"Riddhima","m3uni":"dd2676"},{"projectname":"Stock Price Prediction Based on Sentiment Analysis","timestring":"Fri Dec 15 22:24:34 2023","m1uni":"vc2652","m2lname":"P Rao","m1fname":"Vethavikashini","m4fname":"","m1lname":"Chithrra Raghuram","m3fname":"Balachandar","description":"Objectives:
Handling significantly large data and predicting the nature of changing stocks
To effectively compare different ML models by using ensemble learning
To see the nature of deep learning models used in short-term and long-term cases
Comparatively analyzing their accuracies to glean an understanding of their performance

Innovation:
With the ever-changing nature of stock prices, it is interesting to note the factors behind the fluctuations. There are many reasons behind the stock price fluctuation such as profit in product, news announcements, and many more. In this report, we aim to focus on creating
a relationship between the emotions of individuals and stock prices. Through this, we intend to build and introduce a term called the “behavioral counterpart”. This is used as one of the factors behind predicting the market price for the stock. In this project, we have employed the sentiment analysis framework to satisfy the behavioral counterpart. This is implemented with the help of Twitter where we get millions of opinions from people daily. Hence, Twitter data serves as the best platform to build the framework for predicting the stock price using public opinions.

Capabilities:
The built model can predict the closing price for both long and short-term cases which offers a cohesive report to the user. This informed result comes from exploring the result of PCA before model training. A combination of machine learning and deep learning modules has enhanced the project's capabilities.

Importance behind this problem:
Combining the public knowledge with the stock data will provide us with real-world insight into closing price prediction. The toolkits used also provide certain novelty by considering both long and short-terms by employing effective PCA techniques with both ML and DL algorithms.","uni":"vc2652","language":"Python, Google Engine, Vader, TensorFlow, HTML, CSS, JavaScript, Sklearn and Technical Analysis Package","pid":"202312-15","m4uni":"","analytics":"Analytics: Technical Analysis Package, VADER
Algorithms: PCA for dimensionality reduction
System Modules: Adaboost, Random Forest, LSTM, and Linear Regression
Visualization: Word Cloud, Loss Plots, and Front-end using HTML, CSS, and Javascript
","m4lname":"","industry":"Finance","m3lname":"Sathianarayanan","dataset":"Stock Data: For the stock data, the Yahoofinance API was used.

Twitter: “twscrape” was used for extracting the Twitter data. Since the market is not open every day, we extracted the dates on which the market is open and collected the tweets for each company. Moreover, since 2022, only 900 API calls can be made every 15 minutes (900 seconds), so we used Airflow to schedule the data collection. Every 5 minutes, 50 tweets for each of the three companies are extracted. The following features were extracted into the dataset via the API:
--> Date: Date on which the tweet was made
--> Tweet: The tweet from a user about the company

Link for the API: https://github.com/vladkens/twscrape","m2uni":"spr2139","m2fname":"Swasthi","m3uni":"bs3507"},{"projectname":"Autonomous Driving","timestring":"Fri May 6 20:50:31 2022","m1uni":"rg3332","m2lname":"","m1fname":"Riya","m4fname":"","m1lname":"Gupta","m3fname":"","description":"Objectives : The objective of this task is to be able to present certain smaller subtasks that are needed and required for the Advanced Driver Assistant Systems.
Innovations : These architectures are fairly simple and do not use much computational power. They are trained using simple, readily available CNN architectures and work for roads with straightforward lane lines and a proper traffic system; the task becomes more difficult when the same architectures are trained on complex road systems such as Indian road systems (IDD Dataset). The goal was to present all these tasks with the help of a common portal.
Capabilities : The trained model can be quantized and converted to a tflite version, which can be used on mobile phones. I have shown the currently available model (without fine-tuning it for my classes, but with similar parameters) on the NYC road live video.","uni":"rg3332","language":"Python","pid":"202205-8","m4uni":"","analytics":"Python was the main language used. Modified sequential networks, a modified version of LENET-5, a fine-tuned version of Yolov3, etc. were tested during this project, and the required modules were kept. For lane line detection, OpenCV and image processing played a more important role, along with CNNs.
Flask was used to create a live demo website, which explains the project, dataset, architecture and aim in much more detail (this will be submitted with the video)","m4lname":"","industry":"Transportation","m3lname":"","dataset":"The following datasets were tested:
KITTI+GTI dataset
IDD Dataset
MSCOCO dataset
BOSCH traffic dataset
NYC real-time captured video","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Converting Videos to TFRecords with Apache Beam","timestring":"Fri Dec 13 07:08:48 2019","m1uni":"klm2190","m2lname":"","m1fname":"Kimberly","m4fname":"","m1lname":"Milam","m3fname":"","description":"This is the first open-source pipeline for parallelizing a video to TFRecords pipeline (as far as I know of). Additionally, it is the first open-source Apache Beam pipeline that uses OpenCV and TensorFlow Hub.

These advances are important because processing video files is computationally expensive and difficult. By converting the videos to TFRecords and applying transfer learning beforehand, the input pipelines for a successive machine learning model will be greatly simplified. Commonly, these pipelines are done sequentially, which is very, very slow. By parallelizing this pipeline with Apache Beam, the work can be vastly sped up. ","uni":"klm2190","language":"Computer Languages: Python with Apache Beam, TensorFlow 2.0, Platforms: Google Cloud Platform","pid":"201912-11","m4uni":"","analytics":"- OpenCV to extract frames from a video
- Transfer learning using the Inception v3 model in TensorFlow Hub.
- Windowing frames together to crop videos","m4lname":"","industry":"Information","m3lname":"","dataset":"YouTube User Generated Content (UGC) Dataset from https://media.withyoutube.com/

The pipeline supports any video file types that are supported by OpenCV. This includes mp4, mkv, and avi video file types. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Persuasive Chatbot","timestring":"Fri May 15 20:21:20 2020","m1uni":"ml4407","m2lname":"Wang","m1fname":"Mubai","m4fname":"","m1lname":"Liu","m3fname":"","description":"Persuading someone to do something is a task that most of our lifetime will experience. Using human resources to do such a task is time consumable and inefficient. However, building and implementing a persuasive chatbot can negate such a downside of using human resources. Our project is based on this intuition and focuses on persuade the user to make a donation to children, and we believe by adding a personality aspect of the target will also increase the chance of success. Hence, by using human samples of persuading different personality people datasets, we developed a python scripted chatbot that will interact with the user and do the persuading job of donation.","uni":"ml4407","language":"Python; PyTorch; GUI; Pandas; Numpy","pid":"202005-28","m4uni":"","analytics":"We've used Seq2Seq model and tokenization method to our system and build our chatbot using python GUI.","m4lname":"","industry":"Information","m3lname":"","dataset":"Our dataset is mainly from some work done by a related paper written by Zhou Yu. This dataset is public and everyone can use it. Other datasets are also supported by our application as long as it is in a conversational format. ","m2uni":"zw2605","m2fname":"Ziyin","m3uni":""},{"projectname":"Job Recommendation System","timestring":"Wed Jan 5 21:30:09 2022","m1uni":"sr3849","m2lname":"Zhao","m1fname":"Shuai","m4fname":"","m1lname":"Ren","m3fname":"Yutong","description":"The position recommendation system made by our group is dedicated to helping users find the most suitable position and corresponding salary. 
Users can put their personal information or resume directly into our recommendation system, and our model can give the most suitable job classification for users. After that, the user can re-enter the job classification and their related information into the system, and our system can give the corresponding salary forecast to help the user know the most suitable job search information.

We use TensorFlow to train the RNN model and Spark to train the salary prediction model, and finally deploy the models to the web page for users.","uni":"sr3849","language":"Python, JavaScript, HTML5, CSS","pid":"202112-7","m4uni":"","analytics":"RNN model with two LSTM layers
Linear Regression model
JavaScript for website design","m4lname":"","industry":"Information","m3lname":"Chen","dataset":"Download job and resume data from NYC Open Data, Data World, Kaggle. ","m2uni":"yz4131","m2fname":"Yuqin","m3uni":"yc3993"},{"projectname":"NY State Inpatient Healthcare Analysis","timestring":"Fri Dec 13 06:37:55 2019","m1uni":"yl4042","m2lname":"Cho","m1fname":"Yuan-Fang","m4fname":"","m1lname":"Lin","m3fname":"Jing","description":"Descriptive Goal
- Data exploration to understand distribution and availability of data
- Data analysis to identify systematic patterns such as bias and test for statistical significance

Predictive Goal
- Build a tool to help users estimate healthcare costs given user specific inputs

Our project summarizes the SPARCS De-Identified: 2017 dataset in a meaningful manner using data visualization techniques in D3.js and also layers on big data analytics tools to help extract patterns and insights using dimensionality reduction, clustering, and machine learning algorithms for predicting healthcare costs. Finally, we share our results from statistical tests designed to help identify systematic biases across populations. ","uni":"yl4042","language":"Python, PySpark, Javascript including D3, HTML, R","pid":"201912-28","m4uni":"","analytics":"Bias testing with Pearson Correlation Matrix, ANOVA and Tukey. Vectorization, Decision Tree, Regressor are used to train the prediction model, e.g. XGBRegressor. Heatmap is used to visualize the data in each county. Bar chart and pie chart are used to get the overview of the data.","m4lname":"","industry":"Life Science","m3lname":"Qian","dataset":"We analyze the New York State Hospital Inpatient Discharges (SPARCS De-Identified): 2017 dataset. The raw data from the official New York State website is roughly 850 MB with over 2.3 million records of patient discharge data with each record consisting of 34 different features or fields.","m2uni":"yc3522","m2fname":"Justin","m3uni":"jq2282"},{"projectname":"Ethic AGI: Integrating Fine-Tuned Ethics Models and PPO Policies","timestring":"Wed May 14 00:37:45 2025","m1uni":"mz3056","m2lname":"Quan","m1fname":"Muyao","m4fname":"","m1lname":"Zi","m3fname":"","description":"Our Ethic AGI framework aims to unify large-language models (LLMs), a fine-tuned ethics classifier, and reinforcement-learning agents into a single interactive system that can propose, vet, and execute morally compliant actions in real time. Key innovations and capabilities include:

Human-in-the-Loop Dataset Expansion: Leveraging GPT-3.5-turbo to generate edge-case ethical/unethical action proposals, followed by three-way human majority-vote annotation, to continually refine our BERT classifier.

Ethics-Wrapped Environment: A modular Gym wrapper (CustomRewardEnv) that seamlessly injects fine-tuned BERT judgments as a dense reward penalty, enabling direct comparison between performance objectives and ethical compliance.

Recurrent PPO Policy: Integrating SB3’s RecurrentPPO to learn long-horizon strategies that respect both task success and moral constraints, with on-policy gradient clipping for stability.

Interactive Loop Interface: An end-to-end prompt–classification–feedback UI (CLI/GUI) where users can pose dilemmas, label “uncertain” proposals, select actions, observe combined reward/ethics feedback, and compare against the agent’s own policy recommendation.

This toolkit is important because it provides a concrete, extensible testbed for exploring how AI systems can incorporate ethical judgment at both single‐step decision points and over multi‐step tasks, with transparent human oversight throughout.","uni":"mz3056","language":"Python, Jupyter Notebook, Google Colab","pid":"202505-11","m4uni":"","analytics":"1. Ethics Classifier (BERT):
Fine-tuned BertForSequenceClassification on Hendrycks/Ethics + human-annotated edge cases.
Produces binary ethical verdicts and softmax confidence scores.
2. Environment & Reward Wrapper:
EthicalEnv: base Gym env with three discrete actions, continuous 3-D observations, deterministic dynamics.
CustomRewardEnv: wraps the base env, applies a confidence penalty for unethical actions, returns enriched dicts.
3. Reinforcement Learning (Recurrent PPO):
SB3’s RecurrentPPO with “MlpLstmPolicy,” on-policy gradient clipping.
Trained via DummyVecEnv + VecMonitor for 50 000 timesteps, logging to WandB.
4. Interactive Loop Module:
CLI parser / lightweight GUI built in Python: handles user text input, displays numbered LLM proposals, highlights ethics/confidence labels, prompts for relabeling or choice.
Orchestrates calls to OpenAI’s GPT-3.5-turbo, our HuggingFace BERT reward model, Gym environment, and PPO agent recommendation API.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Hendrycks/Ethics “Commonsense” Split (HuggingFace “hendrycks/ethics”): 13 910 train / 3 885 validation / 3 964 test examples of short action descriptions labeled permissible vs. impermissible.

Downloaded directly via the HuggingFace Datasets library.

LLM-Generated Edge Cases: We ran sampling cycles in our synthetic EthicalEnv, prompting GPT-3.5-turbo with 4-shot primers to generate 100+ novel imperative actions.","m2uni":"yq2311","m2fname":"Bonny","m3uni":""},{"projectname":"Automatic Storytelling","timestring":"Sat May 16 04:26:32 2020","m1uni":"jh4162","m2lname":"","m1fname":"Jiaqi","m4fname":"","m1lname":"He","m3fname":"","description":"Automated story generation is the problem of automatically selecting a sequence of events, actions, or words that can be told as a story. Our goal is to monitor overwhelming real-time information on social media and automatically generate a story by selecting related information. First, I use a generative method, training an event2sentence RNN model that translates events back into human language to write a whole new story. Then we use extractive methods, such as vector ranking, sorting, clustering, topic classification, and keyword scoring, to construct a story. ","uni":"jh4162","language":"Python","pid":"202005-6","m4uni":"","analytics":"RNN LSTM, TFIDF, k-means clustering, text rank, LSTM text classification.","m4lname":"","industry":"Media","m3lname":"","dataset":"The dataset used to train sentence generation comes from Wikipedia movie plots. The ratio of training data to validation data is approximately 9:1. To reduce training time, I use 10,000 training examples and 1,000 validation examples to train the model.
The data we use to construct a new story comes from Twitter. We use the Twitter API to grab recent tweets from specific accounts: new York times, breaking news, Cnn-brk, WSJ-breaking-news, ABS-news-Live, sky newsbreak, TWC-breaking. This ensures the quality of the tweets, so that meaningless sentences do not appear. From each account, we collect the 30 newest tweets.
The data used to train the topic classifier comes from BBC news. It contains 2,225 news articles belonging to 5 topics. I split them into a training set and a validation set according to the parameter we set earlier: 80% for training and 20% for validation. This gives 1,780 training articles and 445 validation articles.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"1","timestring":"Tue Jan 3 21:34:58 2023","m1uni":"","m2lname":"","m1fname":"1","m4fname":"","m1lname":"","m3fname":"","description":"1","uni":"1","language":"1","pid":"1","m4uni":"","analytics":"1","m4lname":"","industry":"Information","m3lname":"","dataset":"1","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Twitter Topic Analysis of Different Languages","timestring":"Fri Dec 13 22:35:35 2019","m1uni":"hy2506","m2lname":"Young","m1fname":"Hyun BIn","m4fname":"","m1lname":"Yoo","m3fname":"Yuelin","description":"Objectives: To provide insight into what people who speak different languages are interested in. This interests marketers who want to know what is important to people of specific demographics who understand a certain language

Innovations: By looking at multiple languages, we are able to analyze across many different cultures and provide insights into what motivates them. Prior Twitter analysis projects focus primarily on English Twitter data and lack the cross-cultural aspect.

Capabilities: We are able to provide a visual analysis tool that is useful for marketers who are involved in global business and outreach. ","uni":"hy2506","language":"Python, Pyspark, Dataproc, Compute Engine, Big Query, AWS S3, NLTK, D3, Javascript, HTML, Pandas, Tweepy, Json, REST API","pid":"201912-10","m4uni":"","analytics":"Analytics: Tweepy, NLP, Pyspark, Json, Topic modeling, TF-IDF Matrix, Yandex Translate API

Algorithms: LDA

System modules: Data Processing, Topic Modeling, Visualization

Visualization: Zoomable Treemap","m4lname":"","industry":"Information","m3lname":"Long","dataset":"Live Twitter stream with all languages (5GB+). Streamed through twitter developer API using Python. We can support any sort of text data that has language tags in JSON format. ","m2uni":"ly2451","m2fname":"Liane","m3uni":"yl3181"},{"projectname":"Alpha Generation from Alternative Sources ","timestring":"Fri May 15 15:35:56 2020","m1uni":"zc2243","m2lname":"","m1fname":"ZIYUN","m4fname":"","m1lname":"CHEN","m3fname":"","description":"Quantitative investing strategies have been under-performing broader equities market and many other strategy types in the recent years. Investors have been actively looking to generate alpha signals with innovative methodologies and algorithms. Empirical researches suggest evidence of predictive power on financial market from alternative data sources such as web search data and twitter data. This project creates a framework to analyze and generate alpha signals from these alternative data sources to predict bitcoin prices.","uni":"zc2243","language":"Python","pid":"202005-10","m4uni":"","analytics":"Live Trading Bot","m4lname":"","industry":"Finance","m3lname":"","dataset":"","m2uni":"","m2fname":"","m3uni":""},{"projectname":"AI Stock Trader based on News Sentiments","timestring":"Fri Dec 20 02:30:45 2024","m1uni":"xw3037","m2lname":"Chou","m1fname":"Xiaojie","m4fname":"","m1lname":"Wu","m3fname":"Yunhao","description":"The stock market is a dynamic environment that is influenced by numerous observable and unobservable factors. The stock trend itself is a Brown Movement which is volatile and unpredictable. Therefore, when predicting stock trends, we need not only market data like historical price movements but also reference data like news, financial reports, and mass sentiments. Due to the massive quantity of data involved, Integrating AI technology into trading would greatly improve response time and save on the cost of trading.
In this project, we follow a low-frequency trading mode and use news data as reference data. We want to predict the daily stock trend and optimal trading action based on historical price data, major news articles, and their text sentiments. Since different financial instruments behave quite differently in the market, there is no common trading strategy that handles all instruments. Therefore, we focus on stocks of large technology companies such as Apple, Google, and Amazon.
","uni":"xw3037","language":"Python; Google Colab","pid":"202412-16","m4uni":"","analytics":"Our project aims to develop a useful low-frequency trading strategy with the assistance of AI and news sentiments, to achieve a higher yield rate and Sharpe ratio. Our system consists of gathering financial data, labeling the sentiments of financial news, predicting price movements with machine learning models, simulating trading on each stock, and optimizing portfolio performance as a whole.
Unlike traditional trading based on fundamental analysis, AI trading is based on an automated system with high data volume, fast decision-making and execution speed, and generalization to various assets with little finetuning. Our system mainly gathers price and news data from Polygon with an ETL pipeline to cleanse and format the data. Then, we get news sentiment labels using FinBert and predict the next prices through the LSTM model. With the predicted prices, our system executes trading strategies under simulated settings. Lastly, the system adjusts stock positions in a portfolio daily and evaluates yield rates and Sharpe ratios. We also visualized the daily portfolio, yield trend, and price correlation with news sentiment through our website.
","m4lname":"","industry":"Finance","m3lname":"Luo","dataset":"Our data mainly consists of price data and news sentiments, and the major data source is polygon.io. Polygon.io is an open financial platform that provides 2-year stock prices and news texts. Also, it provides several months of news sentiment data. The raw data is in JSON format, and we formatted them into CSV files for later analysis.
Notably, polygon.io only provides limited amounts of news sentiments in binary scales. We also explored other data sources like Marketaux and Alphavantage, which provide continuous news sentiment and relation scores, but still in limited amounts. Therefore, we decided to calculate sentiment scores using FinBert based on the news descriptions, title, and keywords provided by polygon.io.
","m2uni":"lz2837","m2fname":"Lance","m3uni":"yl5444"},{"projectname":"Teach Machine to Draw Grandmaster Art","timestring":"Wed May 10 04:28:34 2023","m1uni":"yy3269","m2lname":"","m1fname":"Yang","m4fname":"","m1lname":"Yu","m3fname":"","description":"The speed at which artificial intelligence is improving is unprecedented. Nowadays, we can see the appearance of artificial intelligence everywhere. From the video recommendation software on our apps to the facial recognition systems at the company's front door, AI is assisting humans in completing a variety of tasks. In the past three years, a specific subfield of AI called \"Artificial Intelligence Generated Content\" (AIGC) has become the star. AIGC enables people to create creative designs, such as texts, images, audios, and videos, without requiring many input commands. The \"fully autonomous\" feature of AIGC has the potential to emancipate productive forces and increase our productivity as a whole. AIGC has greater opportunities to assist the art community in analyzing and comprehending visual art due to the greater availability of visual art collections, advanced deep learning methods, and computer vision tools.

The project group was inspired by the expanding range of generative techniques for image generation to propose this project and explore the potential of AI to learn and create beautiful artworks. The project fulfills its mission by engaging in a three-step plan: (1) selecting three famous state-of-the-art text-to-image models, including stable diffusion, VQGAN, and WGAN, training them with a well-balanced dataset, and fine-tuning the models to achieve better results; (2) building a Softmax classifier that can classify art images with a final test accuracy of 87.31%; and (3) finally integrating these two functionalities into a web application that is accessible to the public for use.","uni":"yy3269","language":"Python, PyTorch, ReactJS, Flask, AWS, Stable Diffusion web UI ","pid":"202305-5","m4uni":"","analytics":"VQ-GAN, WGAN, Stable Diffusion, and Front-end Construction.","m4lname":"","industry":"Information","m3lname":"","dataset":"For the functionality of classifying and generating images based on styles, we will train a model on the ArtBench-10 dataset. ArtBench-10 is the first class-balanced, high-quality, cleanly annotated, and standardized dataset for benchmarking artwork generation. It comprises 60,000 images of artwork from 10 distinctive artistic styles, with 5,000 training images and 1,000 testing images per style. ArtBench-10 has several advantages over previous artwork datasets. Firstly, it is class-balanced, while most previous artwork datasets suffer from long-tail class distributions. Secondly, the images are of high quality with clean annotations. Thirdly, ArtBench-10 is created with standardized data collection, annotation, filtering, and preprocessing procedures.

For the functionality of classifying and generating images based on categories, we chose the WikiArt dataset. WikiArt dataset contains paintings from 195 different artists. The dataset has 42129 images for training and 10628 images for testing. The paintings were obtained from the wikiart.org website and include a large variety of artworks, such as genres, portraits, landscapes, and etc. Because WikiArt is available to the public, it has a well-developed structure. WikiArt is often used in the field of machine learning to recognize, classify, and generate art. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYC Real Estate Exploration","timestring":"Mon Jan 4 22:59:37 2021","m1uni":"sjs2287","m2lname":"Honda","m1fname":"Steve","m4fname":"","m1lname":"Shanko","m3fname":"William","description":"What has been going on in the real estate market in NYC recently? This is the primary question that motivated our research and development for this project. There has been a lot of speculation and analysis done to understand what is happening in the housing market in NYC. Additionally, there is a lot of interest in being able to estimate the market value of a property and efforts are being made by large companies like Zillow and Trulia. We wanted to develop a multi-functional tool to analyze market trends and to make predictions about the potential value of a given property based on some simple attributes as a goal. ","uni":"sjs2287","language":"Python, Javascript, PHP, Html, Tableau, Nginx, Docker, Flask, scikit-learn","pid":"202012-5","m4uni":"","analytics":"In order to accomplish the goal, we experimented with various models (Linear Regression, Ridge Regression, Lasso Regression, Gradient Boost, Random Forest, etc.) to perform the prediction. We developed an API that can serve predictions to our frontend application or can be used standalone as part of other applications. 
And finally we cleaned and hosted the NYC open data around sales and hosted it in our own internal database to support visualization and analytics activities. (https://masatoshihonda.com) You can see interactive maps and charts with specific data you want. ","m4lname":"","industry":"Finance","m3lname":"Sickinger","dataset":"'NYC Citywide Annualized Calendar Sales Update' and 'NYC Calendar Sales (Archive)' from NYC Government webpage. ","m2uni":"mh4007","m2fname":"Masatoshi","m3uni":"wrs2125"},{"projectname":"OncoLink: Predicting Breast Cancer Treatment Response from Gene Expression","timestring":"Tue May 5 21:41:21 2026","m1uni":"apb2192","m2lname":"Taimur","m1fname":"Anjali","m4fname":"","m1lname":"Bhimanadham","m3fname":"Mahsa","description":"The goal of OncoLink is to support precision medicine by helping oncologists predict a patient’s response to a treatment plan and make more informed clinical decisions. OncoLink can predict, with probability/confidence, whether a patient will respond to treatment, show the top k most similar historical patients from the METABRIC dataset, allow physicians to enter outcomes for patients so we can improve the model, and let them see insights into how the model makes predictions.

The project has some key innovations that improve model performance as well as explainability and interpretability. We use patient similarity search by finding similar patients through comparing PCA-reduced gene expression data using FAISS. This application also integrates Agentic AI by using Groq and Llama 3. It retrieves data such as similar historical patients and the model’s performance before generating the clinical explanation, which makes it grounded and reduces hallucinations. OncoLink also has an incremental learning loop, and the model is continuously updated every time 10 new real-world patient outcomes are entered using partial_fit(), so that we do not have to retrain the entire model. SHAP explainability is used to support the model’s predictions. It tells the oncologist how each gene or piece of clinical data affected the prediction.

This research/project is important because it promotes precision medicine, which aims to provide patients with personalized care, and we are able to assist with that by utilizing ML models and clinical data. Since we are using SHAP for explainability, we are also making the system explainable, which provides transparency and builds trust. We are also constantly improving the model by incorporating the real data we are getting from physicians without retraining the whole model.
","uni":"apb2192","language":"Python, Streamlit, Groq API","pid":"202605-3","m4uni":"","analytics":"For the data pipeline we used StandardScaler to normalize both gene expression and clinical features separately before any modeling. We then applied Principal Component Analysis to reduce the gene expression features down to 20 and 50 components as well as a variance-threshold version that captures 95% of explained variance, and we also selected the top 1,000 most variable genes by variance as a separate feature set. These gave us four distinct feature sets to test against clinical only, top variable genes, PCA-20, and all features combined. For the classification models we trained XGBoost with 200 estimators, a learning rate of 0.05, max depth of 4, and subsampling, alongside a Random Forest with 100 estimators and Logistic Regression with L2 regularization, comparing all three across all four feature sets using accuracy, F1 score, and ROC AUC on a stratified 80/20 train-test split to select the best performing combination. For explainability we used SHAP via TreeExplainer to compute per-prediction feature attributions on the best model, with a subprocess isolation approach so SHAP crashes wouldn't kill the training run and a feature importance fallback if SHAP was unavailable. For patient similarity search we built a FAISS IndexFlatL2 vector index over PCA-20 embeddings of all 1,900 patients, using an exponential decay function anchored to the cohort's 95th percentile nearest-neighbor distance to convert Euclidean distances into meaningful similarity percentages. For the incremental learning component, we trained a baseline SGDClassifier using log loss that supports partial_fit, which gets updated without full retraining every time a physician submits 5 or more confirmed patient outcomes through the app. 
On the AI side we used Groq with Llama-3.1-8b instant through an OpenAI compatible tool calling loop where the model can call get similar patients and get model performance to retrieve real data before generating clinical explanations, with retry logic and exponential backoff for rate limiting.","m4lname":"","industry":"Life Science","m3lname":"Mohajeri","dataset":"The primary dataset used in this project is the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset. This dataset contains gene expression profiles and clinical metadata such as tumor stage, receptor status (ER, HR2, etc), and treatment information. The dataset is publicly available and was accessed through Kaggle.   In addition to the METABRIC data, the GSE25066 dataset from the Gene Expression Omnibus (GEO) was used as an external validation and testing source. This dataset contains 508 patient gene expression samples that were used to simulate real world patient uploads to test the project’s ability to process new patient data and generate predictions. The dataset is publicly available and was accessed through Kaggle.

The project is designed to be flexible and support additional datasets beyond the current ones used. This can include any gene expression dataset as long as they are formatted in a tabular (samples x genes) structure, individual patient gene expression files, and other cancer datasets. The preprocessing pipeline in the project standardizes and scales inputs and the model operates on numerical feature matrices so the system can be generalized to any dataset that follows a similar structure of high dimensional genomic features and structured clinical variables.
","m2uni":"drt2145","m2fname":"Daniyah","m3uni":"mm6859"},{"projectname":"PUBG Winning Strategy Analysis","timestring":"Sat Dec 22 02:00:48 2018","m1uni":"jl5175","m2lname":"Xiao","m1fname":"Juncai","m4fname":"","m1lname":"Liu","m3fname":"Zhengye","description":"PUBG is an online multiplayer battle royale game. Players enter what is called the “Battle Area,” where they are pitted against one another in a fight for survival. To survive in the game, players need to gather weapons and suppliers to fight against others. It also gains massive popularity among the world. Until June 2018, it has sold more than 50 Million copies around the world, being the 2nd best selling game of all time in Steam.

Given the popularity of the game, we decided to analyze it. Our idea grew out of two questions. First, how can a player's rank be predicted from the data of a single match? Second, everyone has their own strategy for playing this game, but how can the chance of winning be increased in general?

As the first problem can be considered as a regression problem, based on what we have learned, we decided to use Random Forest Regression to do the prediction because Random Forest is one of the most effective machine learning models for predictive analysis. For the second problem, we also want to do some Exploratory Data Analysis (EDA) and conclude some winning strategies or interesting facts.","uni":"jl5175","language":"Python, Google Cloud Platform","pid":"201812-15","m4uni":"","analytics":"To make prediction of player's rank, we use Random Forest Regression Algorithm. Random forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. The advantage of Random Forest Regression is that it corrects for decision trees' habits of overfitting to their training set.

We also do some Exploratory Data Analysis (EDA) and use seaborn for visualization, including heatmaps, joint plots, and box plots. ","m4lname":"","industry":"Information","m3lname":"Yang","dataset":"We use the PUBG dataset from Kaggle: https://www.kaggle.com/c/pubg-finish-placement-prediction/data
Our program can use other data collected from PUBG API with the same features","m2uni":"wx2209","m2fname":"Wei","m3uni":"zy2318"},{"projectname":"Foreign Exchange Rate Analysis","timestring":"Sat Dec 18 02:37:18 2021","m1uni":"rc3414","m2lname":"wang","m1fname":"rui","m4fname":"","m1lname":"chu","m3fname":"changfeng","description":"Goal :
Analyzing and predicting how the exchange rate changes, and identifying the factors that might influence it in order to predict it.

NOVEL:
Analyzing foreign exchange rates together with predictions of Covid-19 cases in different countries, while accounting for holiday and seasonal factors in the prediction
","uni":"rc3414","language":"python","pid":"202112-41","m4uni":"","analytics":"1. Algorithms:
(time series)
fbprophet
LSTM
ARIMA

2. Streamlit framework

3. Deployed on Google App Engine (GCP)
BigQuery
","m4lname":"","industry":"Finance","m3lname":"shen","dataset":"1. USDCNY=X since 2020 were the dataset tested. From Yfinance API
2. Covid cases from different countries since 2020 were tested. From the OxCGRT URL on GitHub.

Other currencies are supported, such as USDJPY, USDGBP and DOGE-USD.
The amount of Covid cases in related countries are included, such as UK, Japan, China, and the US.","m2uni":"hw2808","m2fname":"han","m3uni":"cs4094"},{"projectname":"Big Data Analytics: Movie Exploratory Analysis, Natural Language Processing and Recommendation","timestring":"Fri Dec 13 17:15:04 2019","m1uni":"mka2156","m2lname":"Vahanvaty","m1fname":"Milan","m4fname":"","m1lname":"Adhikari","m3fname":"Noel","description":"Objectives:
To develop a movie recommendation system that employs ALS matrix factorization and sentiment analysis on user reviews to generate two-factor movie recommendations

Innovations:

The system uses NLP sentiment analysis to generate recommendations. Most systems generate recommendations solely based on ratings, which are objective in nature. Our system factors in not only ratings but also the sentiment score assigned to movie titles based on user reviews to rank recommendations.

Capabilities:
1. Generates movie recommendations through a trained ALS model based on ratings.
2. Searches for user reviews in the IMDb dataset and performs sentiment analysis on these reviews.
3. Ranks recommendations according to sentiment score.

","uni":"mka2156","language":"Python, PySpark and Google Cloud Platform","pid":"201912-34","m4uni":"","analytics":"Exploratory Data Analysis
ALS Matrix Factorization
NLP for Sentiment Analysis","m4lname":"","industry":"Media","m3lname":"Mannariat","dataset":"1. For ALS Matrix Model:

MovieLens Dataset

2. For Sentiment Analysis:

IMDb Dataset

Movie reviews are extracted from the IMDb dataset after generating recommendations through the MovieLens dataset. Additional info on these publicly available datasets is provided via the dataset submission form.","m2uni":"hhv2106","m2fname":"Hussain ","m3uni":"ntm2125"},{"projectname":"Action Prediction for Human Robot Interaction","timestring":"Mon Apr 26 22:59:33 2021","m1uni":"kc3415","m2lname":"","m1fname":"Kyle","m4fname":"","m1lname":"Coelho","m3fname":"","description":"Investigate the perception aspect of action prediction to ultimately improve the human-robot collaboration experience. Looks into how the future can be predicted and evaluated in an unsupervised way, and then determines the action occurring in this future state","uni":"kc3415","language":"Python and GCP for computing resources","pid":"202105-1","m4uni":"","analytics":"Analytics algorithms mainly involved evaluation scripts for models that were trained.
Evaluation scripts were run for the predictive coding networks as well as for the action classifiers to generate statistics.

The submodules center on the predictive coding module, which generates future time sequences that are fed into the action classifier. Preprocessing between these modules is handled by functions passed to the data generators

Visualisations involved using moviepy package to generate gifs and video visualisations of predictions as well as using frame references and differences to understand the impact of different time scales. Python scripts using moviepy were used to implement all of this.","m4lname":"","industry":"Information","m3lname":"","dataset":"IKEA Furniture Assembly Dataset: Public
HMDB51 Dataset: Public
Moments in Time Dataset: Public","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Personalized Company Research Dashboard","timestring":"Fri Dec 17 20:28:02 2021","m1uni":"sr3767","m2lname":"Govindarajan","m1fname":"Shambhavi","m4fname":"","m1lname":"Roy","m3fname":"Rahul","description":"Stock market investors often require guidance in understanding current market conditions to make wise investment decisions. Our project addresses this concern and is aimed to provide users with centralized dashboard access to understand a company's current market condition using big data principles. For a given company name as user input, we seek to provide to its stock market information, relevant current YouTube videos, and real-time sentiment of its stock in the market. Having access to these three modalities of data with sentiment analysis would be useful for traders to conduct market research on a specific company in depth. ","uni":"sr3767","language":"Python, Django","pid":"202112-46 ","m4uni":"","analytics":"For a given user input of company name, we queried the Yahoo Finance API to query the latest stock price by the minute. Next, we used the YouTube Data API to retrieve the top 10 news videos of the company searched using the tags: Company name, its stock ticker and 'news'. Finally, we queried the Twitter API using these tags to retrieve the latest company news on which we performed sentiment analysis using the TextBlob library.","m4lname":"","industry":"Information","m3lname":"Lokesh","dataset":"We are utilizing data from several sources in our project. 
This includes financial data collected using Yahoo Finance API to display real-time stock price, video data from YouTube API to filter and display videos, and real-time tweets from Twitter API to perform sentiment analysis.","m2uni":"sg3896","m2fname":"Saravanan","m3uni":"rl3164"},{"projectname":"Advanced Sentiment Analysis","timestring":"Sat Dec 22 05:58:49 2018","m1uni":"sg3506","m2lname":"Kong","m1fname":"Shan","m4fname":"","m1lname":"Guan","m3fname":"Lin","description":"Nowadays, more and more people tend to do online shopping instead of shopping at physical stores, so previous customer reviews can significantly influence people’s decisions of purchasing or giving up this product. Amazon is one of the biggest electric business platforms in the world. A significant number of customers leaves comments for the items they purchased and give the corresponding number of stars which indicates their attitude towards those products. By analyzing those reviews, sellers will know products flaws from customers’ perspectives and then try to improve current commodities.

Inspired by Mandav’s blog which gives a great overview of Word2Vec, FastText and Universal Sentence Encoder, our team decided to focus on FastText, Universal Sentence Encoder, in addition to Logistic Regression in our project.

Firstly, the project focuses on classifying a training dataset of Amazon reviews into positive and negative classes. Three algorithms are used to train our models: logistic regression, FastText, and Universal Sentence Encoder.

Secondly, the project aims to figure out the situation that we do not have enough labelled data in reality. We use pre-trained universal sentence encoder model and combined with transfer learning to find trump’s emotion behind his twitter words.","uni":"","language":"Python, Pyspark, Keras, Tensorflow","pid":"201812-17","m4uni":"","analytics":"We use Logistic Regression, FastText, Universal Sentence Encoder and Transfer Learning for this project.
1. Train three models (Logistic Regression, FastText and Universal Sentence Encoder) on the Amazon reviews dataset.
2. Apply Transfer Learning in addition to the pre-trained Universal Sentence Encoder model to predict unlabelled Trump's Twitter. ","m4lname":"","industry":"Social Science-Government","m3lname":"Jiang","dataset":"We use two datasets for this project.
1. Amazon Reviews with labels from Kaggle (https://www.kaggle.com/bittlingmayer/amazonreviews)
Size: training: 3,600,000 rows, 1.6GB; testing: 400,000 rows, 185.3MB
Label: 1: negative, score 1 or 2; 2: positive, score 4 or 5
2. Trump’s tweets from Jan 2017 to Aug 2018 without labels (http://www.trumptwitterarchive.com/archive)
Size: 3520 rows ","m2uni":"yk2756","m2fname":"Yuehan","m3uni":"lj2438"},{"projectname":"Market Behavior Prediction via Open Domain Tweets","timestring":"Tue Dec 21 20:22:51 2021","m1uni":"ha2598","m2lname":"Gupta","m1fname":"Hitesh","m4fname":"","m1lname":"Agarwal","m3fname":"Kehao","description":"In the age of social media, financial markets today are heavily influenced by public opinion. Digital media has exponentially increased the velocity of information flow, the impact of which can be seen directly and indirectly on the stock prices and valuations of tech giants. In this project, we aim to study the impact of signals like tweets and news articles on financial markets and build an automated system for aiding market decisions for investors. Our tool shows the predicted future stock prices of five different companies based on the past stock prices and the sentiments of the tweets related to the companies. ","uni":"ha2598","language":"Python, Big Query, Airflow, Streamlit","pid":"202112-33","m4uni":"","analytics":"Models like ARIMA and VAR (Vector Autoregression), along with RNN models like LSTM, are used to predict the future values of the stock prices based on the past values and the average sentiment of the tweets about the company each day. Visualizations show how the models perform on past data as well as the predicted future values. ","m4lname":"","industry":"Finance","m3lname":"Guo","dataset":"The models were built for five different companies: AAPL, AMZN, MSFT, TSLA and GOOG. The models were built on the stock data and tweets available for these companies from 2019 to 2020. The stock data was collected from the Yahoo Finance API and the tweet data for each company was obtained from the Kaggle 3.5M tweets dataset. 
","m2uni":"tg2749","m2fname":"Tushar","m3uni":"kg2937"},{"projectname":"Towards Safer Navigation by Enabling Real-Time Crime Heatmap in NYC","timestring":"Sat Dec 17 02:59:05 2022","m1uni":"kh3119","m2lname":"Liu","m1fname":"Kaiyuan","m4fname":"","m1lname":"Hou","m3fname":"","description":"The overall crime rate has been increasing in recent years. In 2022 by October, more than one-quarter in felony crimes have happened compared with 2021. Moreover, the public has limited information on crime events in real-time. They will only be able to be notified after the crime has already happened and broadcasted by the police/government. We envision offering some guidance during navigation to bypass the high crime rate area.
Our work aims to generate a heatmap overlay of the danger factor on top of the map and update it at sub-minute intervals with street-level granularity. Our contribution is a crime prediction detailed to a resolution of 100 meters that refreshes every few seconds. No existing product or work can provide this type of information. They either show recent crimes nearby or show a historic heatmap, neither of which is a real-time prediction.
","uni":"kh3119","language":"Python, Javascript","pid":"202212-10","m4uni":"","analytics":"We first remove invalid data rows in the dataset (a lot of typos in the dataset, such as the data is recorded as xx/xx/1019). We then label the geolocation with Safe if no crime happened within a day nearby at unrecorded locations. We then encode the fields except for timestamps into integers. We input the victim information, timestamp, geolocation and weather into the deep neural network and output the probability of each crime and safety. Then we compute the safe score with a weighted sum to get rid of the density influence of the heatmap.
The system consists of a frontend, a backend, and a Redis database. The frontend requests the possible routes from Google Maps JS and sends the routes to the Flask API server. The server interpolates each route by adding more locations with a step size of 0.0001 total change in latitude and longitude. Each geolocation on the interpolated path is passed to the model for inference to get the safety score, and the result is published to the Redis database. The frontend subscribes to Redis and re-renders the heatmap once a new one is available.

","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. Each record represents a criminal complaint in NYC and includes information about the type of crime, the location and time of enforcement.
In addition, information related to victim and suspect demographics is also included.","m2uni":"yl4189","m2fname":"Yanchen","m3uni":""},{"projectname":"Credit Risk Prediction & Bias Mitigation with Causal Bayesian Networks","timestring":"Sat Dec 20 05:29:30 2025","m1uni":"aaq2109","m2lname":"Banerjee","m1fname":"Ayaan","m4fname":"","m1lname":"Qayyum","m3fname":"Vatsalam","description":"We examined the task of predicting credit worthiness through the lens of mitigating bias and illegal proxies. We innovated with the application of Causal Bayesian Networks (CBN) to develop a technique to quantify how much a particular model is biased and a procedure to compensate. ","uni":"aaq2109","language":"Python, Flask","pid":"202512-12","m4uni":"","analytics":"We implemented exploratory and statistical analytics to analyze default rates across demographic and socioeconomic groups, identify illegal and proxy features via correlation analysis, and detect structural bias using positivity violation testing.
We trained and compared Logistic Regression, Random Forest, and LightGBM models under biased and bias-mitigated feature sets, using ROC–AUC to evaluate bias at the inference stage within a Causal Bayesian Network framework.
The system included Python-based preprocessing and modeling pipelines (scikit-learn, LightGBM), model serialization with joblib, and a Flask web application for side-by-side biased versus unbiased inference, with visualizations comprising demographic default-rate plots, ROC curves, UMAP projections of output distributions, and prediction shift comparisons.","m4lname":"","industry":"Finance","m3lname":"Krishna Jha","dataset":"We tested on data from the Home Credit Default Risk competition. The data includes the financial history of people who could and could not pay back a loan. Our software can support a wide variety of datasets, largely focusing on financial prediction with biased and unbiased features. Another applicable dataset could have been one that determines credit score. ","m2uni":"sb5041","m2fname":"Swapnil","m3uni":"vkj2107"},{"projectname":"The Best Strategy to Pick and Ban in Game League of Legends","timestring":"Fri Dec 13 07:27:27 2019","m1uni":"zy2362","m2lname":"Luo","m1fname":"Zihan","m4fname":"","m1lname":"Yang","m3fname":"Weihan","description":"Our overall goal is to advise players on which legends to pick in the game LOL.
We developed a new algorithm to evaluate the fitness of the legends picked by the two teams.
Ideally, it can correctly predict whether a match will be won or lost, based only on the legends picked, at a high rate (58.5%).
This can help novices learn the game quickly and avoid being blamed for bad legend picks caused by limited knowledge of the game. The project thus gives game developers a new way to attract players efficiently and may be of commercial interest to game companies.","uni":"zy2362","language":"Matlab, Python, PHP, HTML/CSS, JS, SQL, Tencent Cloud","pid":"201912-8","m4uni":"","analytics":"We used a self-developed algorithm called CCM. The basic idea is to compute, from match statistics, a cooperation and confrontation score for each pair of legends. The higher this score is, the higher the probability of winning. Our recommendation is the legend that maximizes this score. You can find more information on our website or report.","m4lname":"","industry":"Media","m3lname":"Chen","dataset":"We got random data from the Riot Developer API. We wrote a Python script and ran it on 9 instances simultaneously. Then we processed and filtered the results to get the data that is useful for our work.","m2uni":"hl3287","m2fname":"Hao","m3uni":"wc2681"},{"projectname":"Pan-cancer analysis of single-cell RNA-sequencing data using normalizing flows for counterfactual inference","timestring":"Fri Apr 23 22:37:17 2021","m1uni":"lc3352","m2lname":"Zhang","m1fname":"Lingyi","m4fname":"","m1lname":"Cai","m3fname":"","description":"According to the Centers for Disease Control and Prevention (CDC), each year in the United States more than 1.7 million people are diagnosed with cancer, and almost 600,000 die from it, making it the nation’s second leading cause of death. The cost of cancer care continues to rise and is expected to reach almost $174 billion by 2020. Strikingly, 1 in 3 people will have cancer in their lifetime. Cancer is complicated to cure: more than one thousand genes may be involved in cancer development, and there are thousands of subtypes of cancer. 
Meanwhile, deep learning technology has revolutionized many fields over the past decade, such as computer vision and natural language processing. However, two challenges remain. First, the black-box property of deep neural networks has made models hard to explain and analyze. Tackling this challenge becomes even harder as neural network architectures grow increasingly intricate. Second, although deep neural networks have achieved extraordinary performance in tasks such as classification, such calibrated designs might only learn association instead of causation. These two challenges hinder the application of deep learning technologies in a wider range of areas, such as the financial industry and the healthcare system. In addition, we propose to implement a web server for this framework, so that the analysis can be performed without writing any code. This user-friendly tool can be convenient for biologists who are not familiar with programming languages.","uni":"lc3352","language":"Python, R","pid":"202105-16","m4uni":"","analytics":"
We mainly focus on proposing marker genes by using the counterfactual framework. In particular, we reproduce the counterfactual imputation algorithm from [Yongjin Park, et al.]. Then, we focus on improving the algorithm by learning a better representation using deep learning. For this purpose we primarily leverage the autoencoder paradigm, in the form of variational autoencoders and normalizing flows. In addition, we also apply a supervisory signal to help the unsupervised learning task. The empirical evaluation shows that this can improve the final results.

We experiment with a public data set using our proposed method. We evaluate the performance using KEGG tools. For visualization, we use R with the Shiny library and host our interactive website on shinyapps.io.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"
All the datasets are publicly available. scRNA-seq datasets: PDAC (GSA: CRA001160), HNSCC (GSE103322), Ovarian cancer (GSE118828), Lung cancer (E-MTAB-6149, E-MTAB-6653), Breast cancer (GSE118390), Melanoma (GSE72056); Bulk RNA-seq datasets: The Cancer Genome Atlas Program (TCGA) database.","m2uni":"wz2363","m2fname":"Wei","m3uni":""},{"projectname":"Analysis of Google Merchandise Store Data ","timestring":"Sat Dec 22 00:48:41 2018","m1uni":"zg2305","m2lname":"Cao","m1fname":"Ziyu","m4fname":"","m1lname":"Gu","m3fname":"Jingwei","description":"Objectives: In our project, we will help Google Store to make better strategies: we will explore the GStore data and predict the revenue each customer may generate.

Innovations: We mainly use 4 models for the prediction: linear regression, LGBM, MLP and CNN. Among them, LGBM reached the best performance. One of our innovations is the finding that a CNN model is also quite meaningful for our dataset, where the number of records with nonzero totals_transactionRevenue values and the number of records with zero totals_transactionRevenue values are almost the same.

Capabilities: Provide more actionable operational changes and a better use of marketing budgets for those companies who choose to use data analysis on top of GStore data

For many businesses, only a small percentage of customers produce most of the revenue. Appropriate promotional strategies can therefore help companies make more profit with less spending, so predicting revenue per customer is really important.","uni":"zg2305","language":"We used Python, pandas, sklearn(LinearRegression, lightgbm), Pytorch, flask ","pid":"201812-36","m4uni":"","analytics":"Algorithm: EDA, Linear regression, Light GBM, Convolutional neural network, neural network
System modules: linux os, colab notebook, jupyter notebook, numpy, blob, pytorch, pandas, Light GBM, flask, html
Visualization: matplotlib","m4lname":"","industry":"Retail","m3lname":"Han","dataset":"Our data comes from an ongoing kaggle competition (https://www.kaggle.com/c/ga-customer-revenue-prediction/data)
","m2uni":"xc2418","m2fname":"Xiaoshu","m3uni":"jh4021"},{"projectname":"Stock Strategizer","timestring":"Wed May 6 05:02:11 2026","m1uni":"sc5115","m2lname":"","m1fname":"Samuel","m4fname":"","m1lname":"Cohen","m3fname":"","description":"The Strategy Arena targets a real and underserved problem in retail finance: the answer to \"should I buy stock X?\" depends on who is asking, and existing tools — forums, pundits, and general-purpose LLM chatbots — give one-size-fits-all answers regardless of the user's risk tolerance, time horizon, or investing philosophy. The system's goal is to produce a personalized recommendation grounded in real historical evidence and shaped by the trade-offs between five competing investment philosophies.","uni":"sc5115","language":"Python 3.11 JavaScript Streamlit Plotly OpenAI API (gpt-4o-mini, gpt-4o) text-embedding-3-small tts-1 SQLite ChromaDB yfinance feedparser requests beautifulsoup4 wikipedia-api asyncio tenacity Pydantic pytest","pid":"202605-16","m4uni":"","analytics":"Total return, annualized Sharpe ratio (Sharpe 1966), maximum drawdown, win rate, average holding period, trade count per agent. Cross-agent comparisons against a buy-and-hold baseline. Cross-regime robustness matrix (Sharpe per persona per regime). Cumulative API cost tracking via logged token_cost events. LLM-judged philosophy adherence on four dimensions (voice, strategy, rule fidelity, coherence) on a 0–1 scale.
Algorithms.

Six-step agentic loop per agent per decision period: Perceive → Recall (top-k RAG retrieval) → Analyze (chain-of-thought memo) → Decide (Pydantic structured output) → Execute (post-LLM rule check + deterministic fill) → Reflect (rolling lessons-learned doc).
Hard-rule re-prompt loop — Warren's 30-day minimum hold, Ray's 15% max position size, Jim's no-fundamentals filter, Reddit's post-loss cooldown. Up to two re-prompts on violation, then forced HOLD.
Multi-round structured debate — round 1 critique of top performer, round 2 rebuttal, round 3 closing. Each utterance is a Pydantic schema {speaker, target, claim, citation: {agent, sim_date, metric}} with citations validated to refer to real trades.
Synthesis — meta-agent reads cross-agent results plus debate transcript plus user context (horizon + risk tolerance) and produces a Recommendation{entry, exit, watch_for, caveats}.
Stress test sweep — same persona configurations re-run across N preset historical regimes; results aggregated into a robustness heatmap.
Highlight-reel selection — auto-jump to the 5 sim-days with the largest |equity delta| or largest cross-agent disagreement.

System modules (five-layer stack, each layer depending only on layers beneath):

L1 Data & Infra: PriceLoader, NewsCache, RAG (ChromaDB wrapper), LLMClient (cache + structured outputs), structured JSONL Logger
L2 Simulation: MarketView (point-in-time abstraction), Portfolio, TradeExecutor, RuleEnforcer, Metrics, Pydantic schemas
L3 Agents: BaseAgent, five PersonaAgent subclasses, JudgeAgent, SynthesisAgent, BroadcasterAgent, DebateOrchestrator
L4 Orchestration: run_backtest, run_debate, run_stress_test, run_replay, RunContext
L5 UI: Streamlit app.py plus reusable Plotly components for header, hero chart, agent strip, agent detail panel, debate theater, synthesis panel, replay controls, TTS player

Visualizations.

Hero equity-curve chart — five persona curves overlaid on the price line with buy-and-hold dashed baseline; persona-colored trade markers sized by position; drawdown periods shaded red
Live leaderboard — sortable table colored by persona with return, Sharpe, max drawdown, win rate, trade count
Agent strip — five horizontal cards with avatar, live portfolio value, sparkline, current thought, glow animation on trade
Per-agent detail panel — expandable reasoning log, RAG-retrieval citations per decision, philosophy adherence score
Debate theater — chat-style transcript with persona avatars, citation chips that scroll the price chart to the cited day on click
Synthesis panel — personalized recommendation rendered with entry/exit framework and explicit caveats
Stress-test heatmap — 5 personas × N regimes, cells colored by Sharpe ratio (red→green diverging scale)
What-if comparison — original vs modified-rule equity curves overlaid for the same agent
Synced replay — animated equity curves with TTS narration; audio currentTime drives the chart frame index via a small JS bridge
Highlight reel — auto-curated jumps to the five most dramatic moments with narration","m4lname":"","industry":"Information","m3lname":"","dataset":"The system uses three classes of historical data, all from free public sources:
Price data — Yahoo Finance via yfinance. OHLCV daily bars from 2007 onward for any US-listed equity. We tested primarily on AAPL, MSFT, NVDA, TSLA, and SPY across the 2022-01-01 to 2024-01-01 window for the headline backtest, plus four named regime windows for stress testing: 2008 GFC, 2017 bull run, 2020 COVID crash, 2022 selloff. Public — yes, register at the dataset submission page. Data acquisition is automated by a thin PriceLoader wrapper that caches OHLCV bars to a local SQLite database.
News data — four free sources, stacked. Pre-scraped offline into a SQLite cache keyed by (ticker, date):

Yahoo Finance news (via yfinance's .news attribute) — ticker headlines and summaries
GDELT (gdeltproject.org) — global news event database, queryable by entity and date range
SEC EDGAR (sec.gov/edgar) — material 8-K, 10-Q, and 10-K filings via the public JSON API
Wikipedia — year-in-company event pages as a sparse high-signal supplement

About 20 popular tickers are pre-cached for the demo. All four sources are public. The scrapers are idempotent and per-source so a failure of one does not poison the others.
Persona corpora — for RAG grounding. Each persona has a curated 5–10 document collection embedded in ChromaDB:

Warren: excerpts from Berkshire Hathaway annual shareholder letters (1977–2023, public)
Cathie: ARK Investment Management research notes and white papers (public)
Ray: excerpts from Dalio's Principles and Bridgewater Daily Observations
Jim: excerpts from Murphy's Technical Analysis of the Financial Markets and Cramer commentary transcripts
Reddit: curated sample of high-engagement r/wallstreetbets discussion threads (public via Reddit API)

What other data the system can support. Any equity ticker available through yfinance (US-listed and most international); any custom RAG corpus (just drop documents into data/corpora// and re-ingest); any historical regime window (configured via regimes.yaml); and any user-defined persona with a system prompt + corpus + YAML rule set. The Decision schema is currently equity-only — extending to options or multi-asset is the most natural next step.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Flagged Post Analysis","timestring":"Sat Dec 18 01:59:24 2021","m1uni":"mk4427","m2lname":" Joshi","m1fname":"Mohammed Aqid","m4fname":"","m1lname":"Khatkhatay","m3fname":"Shantanu","description":"In the past decade, Q&A websites such as stack exchange, quora, reddit etc have become immensely popular. With a stark increase in users, there is also an increase in posts which may be of low quality and can often misguide the users. In order to allow for a well-informed community on Q&A websites, this project aims to analyze leading features that cause a post to be considered as low quality, and to predict if a given post is of good quality or poor quality.
To accomplish this, we use rigorous data visualization, feature selection and extraction, and a Spark pipeline to train an LSTM encoder-decoder model that predicts the quality of posts and flags posts of poor quality.

The goal of this project is to analyze and identify the features of poor quality posts, and to build an efficient and highly reliable machine learning model to classify posts on the basis of their quality. Our project addresses this goal by using data visualization to observe underlying patterns, and feature selection and extraction to strengthen the classification model by feeding it reliable features.
The features that will be given as input to the model are a combination of user-based features, community-based features and textual features extracted from the body of the posts.","uni":"mk4427","language":"Python, Flask and GCP","pid":"202112-42","m4uni":"","analytics":"Trained an LSTM encoder-decoder on 60K records (using text and title) and used the trained model to label the rest of the data for quality assurance. Achieved 85% accuracy.
The Logistic Regression model used a hyperparameter tuner to automatically select the configuration that best fits the training data according to the F1-score.
","m4lname":"","industry":"Information","m3lname":"Jain","dataset":"SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts.Permalink
Sebastian Baltes, Lorik Dumani, Christoph Treude, and Stephan Diehl.
Proceedings of the 15th International Conference on Mining Software Repositories (MSR 2018).
SOTorrent contains all tables from the official Stack Overflow data dump ","m2uni":"mmj2169","m2fname":"Meghana","m3uni":"slj2142"},{"projectname":"Image to Language Understanding: A Captioning Approach","timestring":"Fri Dec 13 14:52:29 2019","m1uni":"mb4560","m2lname":"Seshadri","m1fname":"Mikhail","m4fname":"","m1lname":"Belov","m3fname":"Malavika","description":"Design, development and implementation of a framework for comparing and contrasting various image captioning approaches. Implement a complete end-to-end system for generating captions from images.

Key Focus Areas:
- Explore a number of deep learning approaches for the task of image captioning
- Provide insights as to why certain approaches outperform others. Modular Addition of Comparison metrics
- Analyze the implemented deep learning models through classic deep learning performance metrics
- Discuss the quality of each model quantitatively through the prism of NLP metrics
- Develop a functional User Interface for visualizing the generated captions.","uni":"mb4560","language":"Python, Google Cloud Platform, Tensorflow, Keras, Bash, Django, HTML, CSS, Javascript, OpenCV","pid":"201912-22","m4uni":"","analytics":"Dataset Visualization:
-- TSNE
-- PCA
-- Word Frequency Cloud

Deep Learning Models:
1. Inject Architecture (Encoder-decoder): Inception and LSTM with additional pre-processing
2. Merge Architecture (Encoder-decoder): Inception and LSTM with additional pre-processing
3. Inject Architecture (Encoder-decoder): Inception and LSTM without additional pre-processing
4. Merge Architecture (Encoder-decoder): Inception and LSTM without additional pre-processing
5. Inject Architecture (Encoder-decoder): Resnet and LSTM with additional pre-processing
6. Merge Architecture (Encoder-decoder): Resnet and LSTM with additional pre-processing
7. Exploration of improvements in Captioning through conditioned multimodal object detection: ResNet+Google Word Encoder+LSTM

Model Comparison:
1. Validation Set Accuracy
2. Validation Set Loss
3. Training Set Accuracy
4. Training Set Loss
5. NLP Metrics:
- BLEU
- GLEU
- METEOR
- ROUGE

","m4lname":"","industry":"Information","m3lname":"Srikanth","dataset":"Google's Conceptual Captions Dataset:
- 3.3 million images, each with the corresponding reference description

A subset (100k images) of the Google CC dataset was used for comparison.","m2uni":"ms5945","m2fname":"Madhavan","m3uni":"ms5908"},{"projectname":"Flight Delay Prediction System","timestring":"Thu Dec 19 22:52:48 2024","m1uni":"hz2906","m2lname":" Liu","m1fname":"hongjie","m4fname":"","m1lname":"zhu","m3fname":"Zeqi ","description":"Objectives:

1. To design and develop a user-friendly website that enables users to input their upcoming flight details to receive both real-time and predictive flight delay estimates, along with an overall flight reliability rating (A-E).

2. Users will also be able to view ratings for other flights scheduled that day, including metrics on departure delays and total travel delays.

Novelty
1. Advanced Prediction Model
Our prediction model incorporates multiple variables, including weather conditions, airline specifics, airport location, and destination. This allows users to enter detailed flight information for a highly accurate delay forecast.

2. Dual Delay Predictions
We provide two distinct delay predictions for a comprehensive view of potential travel disruptions:
Arrival delay and total travel delay.

3. Automated Flight Reliability Rating
Each flight receives an automated reliability score, calculated from historical delay data, weather forecasts, airport congestion levels, and airline performance metrics.

Importance

Air travel remains one of the most widely used modes of transportation across the globe, offering speed and convenience for millions of travelers daily. However, flight delays continue to be a significant source of frustration, causing disruptions to schedules, missed connections, and increased stress for passengers. These challenges highlight the need for a reliable, transparent, and user-friendly solution to better inform travelers and help them make more informed decisions.

Our project addresses this issue head-on by introducing an innovative web-based platform designed to provide travelers with real-time and predictive delay estimates. By simply entering their flight number, date, and destination, users gain immediate access to essential information that can significantly improve their travel experience.
","uni":"hz2906","language":"python, (Google cloud platform & VScode)","pid":"202412-7","m4uni":"","analytics":"Analytics:
1. Predicted arrival delays for flights using a trained machine learning model, providing insights into flight punctuality.
2. Integrated ranking calculations for flight arrival times, allowing qualitative assessment of performance (A-E scale).
3. Real-time analysis of user-input flight data for personalized predictions.

Algorithms:
1. Data Preprocessing:
Applied feature extraction for input fields, including departure time encoding (DepTime_sin and DepTime_cos), ensuring machine-readable format for the model.
2. Machine Learning Model:
XGBRegressor was utilized to predict the arrival delays based on historical flight data and engineered features. Ranking logic was implemented to classify delays into grades (A-E), derived from the predicted delay values.

System Modules:
1. Backend:
Flask framework to handle HTTP requests, manage routing, and communicate with the ML model.
Integration of REST API endpoints to fetch flight data and predictions in real-time.
2. Frontend:
Autocomplete functionality for departure and arrival places using jQuery UI.
Dynamic data population in the HTML table, displaying flight details and analytics.
3. Database/Files:
CSV files served as a lightweight backend for flight information and location data storage.
4. Prediction Workflow:
Data ingestion → Feature transformation → Model inference → Ranking calculation.

Visualizations:
1. User-friendly web interface with clearly labeled sections for input fields and results.
2. Interactive ranking display (A-E grades) to simplify understanding of flight delay predictions.
3. Visual enhancement of tables for real-time display of flight information, integrating predictions seamlessly into the UI.","m4lname":"","industry":"Information","m3lname":"Li","dataset":"https://www.transtats.bts.gov/DL_SelectFields.aspx?gnoyr_VQ=FGK&QO_fu146_anzr=b0-gvzr

The dataset from the Bureau of Transportation Statistics (BTS) provides detailed information on airline on-time performance and the causes of flight delays. It includes scheduled and actual departure and arrival times, as well as reasons for delays, reported by certified U.S. air carriers that account for at least one percent of domestic scheduled passenger revenues. The data covers nonstop scheduled-service flights between points within the United States, including territories, and is available from January 1995 onwards","m2uni":"dl3631","m2fname":"Diwen","m3uni":"zl3202"},{"projectname":"AI Enabled Fintech","timestring":"Fri May 3 19:29:56 2024","m1uni":"dmr2227","m2lname":"","m1fname":"David","m4fname":"","m1lname":"Roosevelt","m3fname":"","description":"Objective(s):
- To analyze remittance flows on a global level to understand historical and current remittance channels
- To design a tool/software to associate remittance flow criteria with a proxy for potential interest in and conversion to a Fintech platform enabled by a digital currency exchange rooted in cryptocurrency technology
- Develop a tool that leverages AI and ML techniques to use historical global remittance data to identify and match potential customer markets for a remittance Fintech, KanduPay, Inc.

Innovation(s):
- Leveraging patterns in remittance flows to understand target markets most receptive to a remittance platform with a digital currency mechanism to send, receive, and store money
- Identifying patterns in historical remittance information using AI technology

Capabilities:
- Analyze patterns in remittance flow dataset criteria to flag remittance transactions as best candidates for further market targeting workflows
- Identify which country channels and remittance service utilization could be an ideal target for conversion to a digital currency exchange and remittance platform leveraging cryptocurrency and e-wallet functionality

-----------------------------------------

This research is important because fintech innovations leveraging AI, ML, and cryptocurrency are on the cutting edge of development. Fintech regulations and cryptocurrency continue to grow, expand, and evolve. Being able to understand historical remittance transaction patterns to assess target markets and channels has the potential to unlock opportunities for product and market penetration. There is a large percentage of the global population whose banking and financial transaction needs are not met by their current banking systems. This technology and innovation, in partnership with KanduPay, Inc., aims to serve the unbanked and semi-banked populations at a global scale by providing access to remittance and e-wallet resources.","uni":"dmr2227","language":"Python, Google Colab","pid":"202405-10","m4uni":"","analytics":"The primary coding language for this model is Python. The software leverages the pandas, scikit-learn, and NumPy Python libraries.

Pandas library leverages generative AI modeling to interpret and translate natural language (NL) queries into Python code.

scikit-learn (ML library) supports predictive analysis in Python and provides access to open-source resources and algorithms.

NumPy supports multi-dimensional (N-dimensional) arrays and provides Fourier transforms and linear algebra routines.

The ML tools implemented are neural networks built with TensorFlow.

The model runs in a hosted, web-based Jupyter Notebook on Google Colab, which lets it run faster and frees up space on the local device. The web-based notebook also speeds up testing and training iterations.

Code can be tested on a line-by-line basis, making it easier to troubleshoot and resolve issues in the model with a focus on efficiency and accuracy. The model is configured to use “mobile money” as a proxy for the highest probability of being receptive to a fintech solution for remittances.

The model flow and core processes driving analysis include:
1: Use pandas library to import target dataset to the notebook

2: Attributes defined from fields in the target dataset (x axis of “Mobile Money”)

3: Dataset split into train and test components, with 20% of the data in the testing set (80/20)

4: Define neural network components via import from TensorFlow, a deep learning API reducing friction when creating neural networks

5: Frame up the input layer, hidden layer, and output layers, including size and shape

6: Use the ‘Adam’ optimizer to update the network weights during training

7: Define optimization criteria to drive weights within the neural network

8: Decide how many epochs the model should use

9: Run the model, review accuracy, run again and train","m4lname":"","industry":"Finance","m3lname":"","dataset":"The World Bank, Remittance Prices Worldwide, available at http://remittanceprices.worldbank.org

The World Bank, Remittance Prices Worldwide is parsed into two tranches due to a change in criteria definitions effective Q2 of 2016.

The dataset was requested directly from The World Bank and a live link to access the dataset was sent via email.

The software can support data configured with similar input criteria in CSV format, with fields aligned to the specifications in the linked data file, including sending and receiving country, currency, year and quarter, transaction type, etc. (see the detailed data field descriptions on the \"Legends\" tab of The World Bank, Remittance Prices Worldwide).
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Investment Strategy -- AI Trader (Foreign Exchange)","timestring":"Fri May 6 17:19:00 2022","m1uni":"hw2839","m2lname":"Wang","m1fname":"Hanlun","m4fname":"","m1lname":"Wang","m3fname":"","description":"Implement an artificial intelligence trader system for forex trading. It learns to predict forex quotes from big data by using neural networks and machine learning techniques. This system has the following features: a low-latency real-time system, a built-in risk control mechanism, and a visual display of forecast result trends.","uni":"hw2839","language":"Python, PyTorch","pid":"202205-14","m4uni":"","analytics":"LSTM","m4lname":"","industry":"Finance","m3lname":"","dataset":"Daily exchange rate, yearly Gross Domestic Product, monthly interest rate, monthly consumer price index, yearly ratio of exports to imports. Data are from https://fred.stlouisfed.org, Federal Reserve Economic Data.

","m2uni":"zw2801","m2fname":"Zhaomeng","m3uni":""},{"projectname":"2D RNA Structure Prediction with Custom GCNFold Architecture","timestring":"Wed May 14 00:56:01 2025","m1uni":"ab4972","m2lname":"Zhang","m1fname":"Adam","m4fname":"","m1lname":"Banees","m3fname":"","description":"Objectives:
The primary objective of the GCNFold Improved model is to enhance the accuracy and biological relevance of RNA secondary structure predictions. By leveraging advanced machine learning techniques and integrating domain-specific knowledge, the model aims to provide a more robust tool for researchers studying RNA structures. This involves improving the input feature representation, refining the model architecture, and incorporating comprehensive structural constraints.

Innovations:
Adding RNA-FM embeddings and ViennaRNA base-pair probabilities as inputs to an adjusted GCNFold model.

Capabilities:
The GCNFold Improved model offers several capabilities that make it a powerful tool for RNA secondary structure prediction:
Enhanced Predictive Accuracy: The integration of RNA-FM embeddings and biophysical priors improves the model's ability to accurately predict RNA structures.
Biological Plausibility: The application of advanced structural constraints ensures that the predicted structures adhere to known biological rules and patterns.
Flexibility and Adaptability: The configurable architecture allows the model to be tailored to different datasets and research needs, making it a versatile tool for various RNA studies.

Importance of Research/Toolkits:
The development of accurate and reliable RNA secondary structure prediction tools is crucial for advancing our understanding of RNA function and interactions. RNA plays a vital role in numerous biological processes, and its structure is key to its function. By providing a more accurate and biologically relevant prediction model, GCNFold Improved hopes to aid researchers in exploring RNA mechanisms, designing RNA-based therapeutics, and understanding the role of RNA in disease. The innovations and capabilities of this model contribute to the broader field of computational biology, offering insights and tools that can drive future research and applications.","uni":"ab4972","language":"Python","pid":"202505-1","m4uni":"","analytics":"Analytics:
Base Pair Probability Analysis: The model computes probabilities for potential base pairs, integrating learned features with ViennaRNA priors. This analysis helps in assessing the likelihood of base pair formations within RNA sequences.

Algorithms:
Graph Convolutional Networks (GCN): Utilized for message passing and feature extraction, GCN layers aggregate information from neighboring nodes to learn complex dependencies in RNA structures.
Multi-Layer Perceptron (MLP): Employed in the PairwiseScorer module to score potential base pairs. The MLP processes concatenated node features and optional prior probabilities to compute scores.
Dynamic Programming: Used in the structure prediction phase to enforce pseudoknot-free constraints and optimize base pair selection, ensuring valid and non-crossing predicted structures.

System Modules:
FeatureEncoder: Converts RNA-FM embeddings into node features suitable for GCN processing, replacing the original one-hot encoding approach.
GCNLayer: Implements graph convolutional operations, including self-loop and neighbor aggregation transformations, followed by layer normalization and ReLU activation.
PairwiseScorer: Scores potential base pairs using an MLP, integrating prior probabilities to enhance prediction accuracy.
StructuralConstraints: Applies biophysical constraints, such as Watson-Crick and wobble pair rules, stacking energies, and minimum base pair distance, to ensure biologically plausible predictions.

Visualization:
Dot-Bracket Notation: The model outputs RNA secondary structures in dot-bracket notation, a standard format for visualizing RNA structures. This notation provides a clear and concise representation of base pairings and unpaired regions.
Heatmaps and Probability Matrices: While not explicitly detailed in the provided code, visualizations such as heatmaps of base pair probabilities and probability matrices can be generated to provide insights into the model's predictions and confidence levels.
These components collectively enhance the model's ability to predict RNA secondary structures accurately and provide meaningful insights into the underlying biological processes.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset we used was multimolecule/bprna from HuggingFace. It is a public dataset that contains the sequence and secondary structure of an RNA molecule. Our software can be used on any dataset that has RNA sequences and their labelled secondary structure.","m2uni":"yz5000","m2fname":"Yanan ","m3uni":""},{"projectname":"Sentiment Analysis of Streaming Twitter Posts about Apple Products with Apache Spark","timestring":"Fri Dec 17 20:26:51 2021","m1uni":"rk3091","m2lname":"","m1fname":"Ross","m4fname":"","m1lname":"Koval","m3fname":"","description":"Social Media, particularly Twitter, has increasingly become an important part of daily communication and expression. In this project, I experimented with a variety of both traditional ML and modern DL methods and trained a state-of-the-art classification model that can accurately predict the sentiment in Twitter Posts. I found that the pretrained BERT-based finetuning method provided considerable improvement over standard ML models and feature extraction-based DL models, and achieved nearly 83% accuracy on the test set. Further, I constructed a system that streams Twitter Posts about Apple Consumer Products as well as the most recent Apple Stock Price in 1-minute batch time intervals during NYSE trading hours and records them and their analytics to GCP Storage and BigQuery. Then, I used the trained model to perform sentiment analysis of streaming Twitter Posts using Spark ML and Spark NLP.
I tested if this sentiment is correlated with contemporaneous changes in Apple Stock Price by estimating a number of regressions over different time horizons and aggregation frequencies, and using time series smoothing methods to improve prediction power.
I found that the average sentiment of Apple-relevant Twitter Posts is positively correlated with contemporaneous Apple Stock Returns but that the magnitude and statistical significance of this relationship varied considerably by Hashtag and Time Horizon.
There are many potential industry applications and opportunity for business impact. For instance, Businesses can use the system to continuously monitor the online reputation of their brand and customer satisfaction towards their products and services over time. Additionally, Investors and Traders can use the system to monitor trends in retail sentiment towards Stocks and potentially use it to identify buying opportunities and/or forecast stock risk and construct portfolios accordingly.
","uni":"rk3091","language":"Python, Tweepy, Spark Streaming, Spark ML, Spark NLP, HuggingFace, TensorFlow, Google Cloud Storage, Google BigQuery, Google Cloud DataProc, Google Colab Pro","pid":"202112-55","m4uni":"","analytics":"Streaming Twitter Posts via Tweepy and Spark Streaming
Streaming Stock Prices via Yahoo Finance and yfinance Python API
Traditional ML: Regularized Logistic Regression, Gradient Boosted Trees, Support Vector Machines
Deep Learning: USE, BERT, RoBERTa - Feature Extraction and Finetuning
Correlation Analysis
Regression Analysis
Time Series Methods - Exponential Smoothing
","m4lname":"","industry":"Finance","m3lname":"","dataset":"Offline Data

Twitter Posts (Tweets): https://www.tensorflow.org/datasets/catalog/sentiment140

This dataset contains over 1.6M Twitter Posts labeled for binary sentiment (i.e. Positive, Negative)

Please note: This dataset is distantly labeled using the presence of “emoticons” so it will be noisier than manually annotated data and classifier performance is thus expected to be lower (They remove these emoticons when preparing the data for training and evaluation)

The labels are perfectly balanced with 50% Positive and 50% Negative

The dataset is so large that it creates memory issues in Spark, so we down-sample it to 120,000 samples: 96,000 Train and 24,000 Test

Online Data

Twitter Posts via Twitter Developer API and Python Tweepy, Spark Streaming

Identify Apple Products via Posts with specific Hashtags: #aapl, #iphone, #ipad, #iwatch, #macbook, #imac

Apple Stock Prices via Yahoo Finance and yfinance Python API

Stream both data sources in 1-minute batch intervals over 5 business days, between 9:30 AM and 4:00 PM EST
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Advance KYC — Risk Analysis and Default Prediction","timestring":"Fri May 15 17:50:25 2020","m1uni":"qz2383","m2lname":"Zhang","m1fname":"Qiaoge","m4fname":"","m1lname":"Zhu","m3fname":"","description":"This study is focused on building the model of predicting market capitalization. More insights are found by interpreting the model we build and visualization. We also implemented a sentiment analysis to know how sensitivity and objectivity people are about the companies. More than 500 companies operating in the United States are included in the sample.","uni":"qz2383","language":"Python, R","pid":"202005-12","m4uni":"","analytics":"We used linear regression, elastic net, random forest, K Nearest Neighbors, light GBM, gradient boosting for our modeling part. We also conducted sentiment analysis on news. Then we built an R shiny app to illustrate our final results.","m4lname":"","industry":"Finance","m3lname":"","dataset":"We collected 576 companies operating in the United States. There are 50 variables covering basic information and financial status and 8 were computed as ratios. All of the information is derived from structured data filed with the Commission by individual registrants as well as Commission-generated filing identifiers. After removing redundant and irrelevant information, 38 variables are selected.

More information could be found at https://www.sec.gov/dera/data","m2uni":"xz2862","m2fname":"Xinyi","m3uni":""},{"projectname":"Multi-Agent System for Algorithmic Trading: Architecture, Implementation, and Comparative Results","timestring":"Tue May 13 23:48:40 2025","m1uni":"jl6962","m2lname":"Mukherjee","m1fname":"Jifeng","m4fname":"","m1lname":"Li","m3fname":"Cheng","description":"This project aims to build a multi-agent system where specialized trading agents (Momentum, Mean Reversion, and Event-Driven) interact under a central Meta-Planner that critiques, validates, and selects their proposals. The system enforces risk and regulatory constraints through a Validator Agent, ensuring every final plan adheres to practical trading limitations. A Post-Trade Analyzer supplies feedback on each plan, closing the loop for continuous learning. By integrating language models, statistical signals, and rule-based structures, the project demonstrates how algorithmic trading can become more adaptive, interpretable, and resilient in dynamic markets.","uni":"jl6962","language":"Python; PyTorch; OpenAI API; LLM API; Optuna; Polygon.io; Alpaca; Scikit-learn / SciPy; JSON","pid":"202505-9","m4uni":"","analytics":"The project implemented a comprehensive suite of analytics, algorithms, system modules, and visualizations to realize an end-to-end AGI-inspired multi-agent trading system. The system employed feature-rich analytics by extracting a unified vector of market indicators (e.g., RSI, MACD, Bollinger Bands, z-score) for each asset using a standardized feature engineering pipeline. This data informed multiple strategy agents—including momentum, mean reversion, and event-driven models—each designed to capture distinct market phenomena. These agents operated within a modular architecture coordinated by a meta-planner, which leveraged critique aggregation and utility-based selection to determine the optimal plan. 
Constraint compliance was enforced via a validator agent, while execution was handled by a dedicated executor simulating market conditions. Post-execution, a post-trade analyzer assessed performance metrics like return, drawdown, and Sharpe ratio, feeding insights back into a memory agent to support continuous learning. Hyperparameter tuning was performed using Optuna’s Bayesian optimization framework. Visualizations included system architecture diagrams, z-score regime plots, and comparative performance tables, enabling transparent interpretation of how each module contributed to adaptive and explainable portfolio decisions.","m4lname":"","industry":"Finance","m3lname":"Chen","dataset":"The dataset includes daily OHLCV data from Polygon.io for selected large-cap equities and the SPY ETF, covering early 2023. Each asset’s time series is ingested into a unified feature-engineering pipeline, which calculates momentum, mean reversion, and event-driven signals. This compact historical window provides a focused testbed that balances computational feasibility and practical relevance. The architecture can incorporate any standard financial API or dataset, allowing easy adaptation to intraday or extended historical data.","m2uni":"sm5155","m2fname":"Shivan","m3uni":"sc5530"},{"projectname":"Development of Deep-Learning-based Methods for Facial Emotion Recognition","timestring":"Fri May 17 05:07:12 2019","m1uni":"jy2913","m2lname":"Fei","m1fname":"Jin","m4fname":"","m1lname":"Yan","m3fname":"","description":"Deep learning (DL) has advanced rapidly in recent years and is an important tool in the field of computer vision (CV). Since Google launched a Large-Scale Content-Based Image Visualization, a number of convolutional neural network (CNN) based architectures have been investigated for this task. These include AlexNet, ResNet, and VGG, among others. Many of these CNN models have been shown to generalize well across distinct CV tasks. 
While deep learning-based models can take advantage of large-scale data and can be trained in an end-to-end fashion, how the features learned by deep learning can be interpreted is still a challenging question. It is still debated whether it is better to use traditional CV tools to extract features (such as HOG and SIFT) or to use deep learning. In our project, we explore this open question in the context of a challenging CV task: facial expression recognition (FER).

We chose to work on FER because it has broad applications in many areas, including public safety and healthcare. A comprehensive analysis of both deep learning and conventional approaches to automatic FER would help us better understand this field, particularly some of its bottleneck problems. This in turn helps people come up with new solutions that improve the performance of automatic FER and move the field forward.

","uni":"jy2913","language":"Python, Keras, Tensorflow, OpenCV, OpenFace","pid":"201905-1","m4uni":"","analytics":"ResNet, VGG16, LSTM, 3DCNN","m4lname":"","industry":"Information","m3lname":"","dataset":"Fer-2013
Resource: https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge

CK+
Resource: http://www.consortium.ri.cmu.edu/ckagree/

AFEW (upon request)
Resource: https://cs.anu.edu.au/few/

","m2uni":"lf2615","m2fname":"Leyu","m3uni":""},{"projectname":"The Slippery Slope","timestring":"Sat May 9 00:45:05 2026","m1uni":"wax1","m2lname":"Chelur","m1fname":"Wei Alexander","m4fname":"","m1lname":"Xin","m3fname":"Rithika","description":"This project builds a reproducible small-language-model measurement pipeline for reward gaming. We test whether supervised corruption pressure can induce misreporting (lying) in Gemma-2-2B-IT on a standard shell game, and whether residual-stream activation probes provide evidence beyond prompt-only lexical cues. The main innovation is an end-to-end controlled testbed: dose-controlled LoRA fine-tuning, transcript-scored shell-game evaluation, layer-wise activation probes, a prompt-only canary, and a backend-equivalence gate. The toolkit is important because it shows that reward-gaming behavior can be easy to induce, while internal detector claims require careful controls.
","uni":"wax1","language":"Python (PyTorch, Hugging Face Transformers, PEFT/LoRA, TRL, scikit-learn, pandas, NumPy, and matplotlib); Apple Silicon MPS/fp32, CUDA/GCP as backend-variant evidence","pid":"202605-23","m4uni":"","analytics":"We implemented LoRA supervised fine-tuning for control, honest, and corruption stages; a shell game environment with transcript parsing and behavioral scoring; dose-response evaluation across low-, mid-, and high-corruption settings; layer-wise logistic-regression probes over residual-stream activations; prompt-only lexical canary baselines; grouped cross-validation; behavior/probe plotting; and strict parsed-label backend-equivalence checks. The final result is a measurement pipeline showing high-dose behavior induction, a mid-dose probe-trainability window, and a negative internal-detector result because activation probes usually do not beat the lexical canary.","m4lname":"","industry":"Information","m3lname":"Devarakonda","dataset":"We used custom synthetic shell-game datasets and evaluation logs. The canonical SFT corpus contains 1,000 shell game rows, split into 500 honest and 500 deceptive examples, plus smaller hand-authored control/honesty QA pairs for parent SFT. The canonical shell-game evaluation contains 2,700 rows: 9 model stages × 2 prompt conditions × 150 rounds. Labels are computed from transcripts by comparing the model's public cup claim against the hidden true cup, rather than copied from training labels. We also generate residual-stream activation datasets across 26 Gemma-2-2B-IT layers for probing, plus a small Python-transfer smoke test. 
The data is synthetic, validated locally, and extensible to other controlled reward-gaming tasks.","m2uni":"vc2466","m2fname":"Vikas","m3uni":"rd3157"},{"projectname":"USA Car Accidents Severity Prediction","timestring":"Sat Dec 18 04:04:48 2021","m1uni":"zc2628","m2lname":"Liu","m1fname":"Zifan","m4fname":"","m1lname":"Chen","m3fname":"Yuxing","description":"The number of traffic accidents in the United States continues to increase. An estimated 20,160 people died in motor vehicle crashes in the first half of 2021. Traffic accidents are directly related to many environmental factors. An accurate prediction can improve traffic safety and conserve public resources.

The goal of our project is to visualize analyzed data on USA car accidents and to predict the severity level of a car accident from features such as time, road, and weather.

Most previous research focuses on designing precise models and achieving higher prediction accuracy. As a result, those models need a myriad of features that are not easy to access in real-life situations, such as detailed driver and car information. Our novelty is to take the difficulty of data acquisition and ease of use into account, and to design a traffic accident prediction system with more practical value.

The car accident severity prediction provided by our system has significant value for dispatch centers, as it could help them manage the emergency response force after they receive a car accident report. Instead of waiting for a precise report from officers reaching the scene, the dispatch center will be able to send the appropriate amount of emergency response force to the accident scene, based on the severity level prediction, right after it receives the report and location. This prediction system will increase the reaction speed for car accidents and prevent sending surplus emergency response forces to a low-severity car accident scene.
","uni":"zc2628","language":"Python, JavaScript, HTML5, CSS, Django, sklearn, pandas","pid":"201212-30","m4uni":"","analytics":"We analyzed the correlation matrix to filter features that are relatively important to the car accident severity level that we want to predict.

We trained and evaluated four models (Linear Regression, Random Forest, Decision Tree, and Gradient Boosting). Finally, we picked Decision Tree as our model because of its relatively low time consumption and higher accuracy.

We built a website to visualize the relationships between car accident severity and the four most important features (Location, Road, Weather, and Time). We also implemented a prediction function in which users type in the longitude and latitude of the place they want a prediction for and, once the system finishes its prediction, receive critical features such as Time, Weather, Humidity, and Road Feature along with the prediction result (accident severity level).

","m4lname":"","industry":"Transportation","m3lname":"Wang","dataset":"The dataset we used in our project was contributed by a Lyft scientist, Sobhan Moosavi. It is a public countrywide traffic accident dataset that covers 49 states of the United States and includes more than 3 million accident records with 46 features. It is hard to find another dataset that our software could support because of the variety of our dataset's features. We used two-thirds of the 46 features as inputs to train our model. Finding another qualified dataset is hardly possible.","m2uni":"ml4568","m2fname":"Meiyou","m3uni":"yw3739"},{"projectname":"Object Pose Estimation Based On RGB-D Data","timestring":"Sat May 7 03:47:10 2022","m1uni":"xs2445","m2lname":"","m1fname":"Xinghua","m4fname":"","m1lname":"Sun","m3fname":"","description":"The goals of this project include 6D pose estimation from RGB-D data and deployment of the model on an edge device.

The task of 6D pose estimation is to infer the 3D transformation matrix of the object relative to the camera. With that information, we can get the relative position and rotation of the object in any known frame. It is therefore an essential way for a machine to perceive the world before interacting with it, and a very important component in many applications such as autonomous driving, robotic grasping, and augmented reality.

Many kinds of sensors can capture information from the environment. Among them, the RGB-D camera is an affordable option that provides both visual and geometric information. Methods that use only RGB images have limited performance in poor environmental conditions. Methods using RGB-D data can overcome poor conditions such as insufficient light and texture-less objects. Deep learning models for 6D pose inference from RGB-D data are a very active topic right now, offering the highest accuracy, great robustness, and the fastest processing time among all methods.

Right now, edge computing is a technology in high demand. It is a distributed paradigm that brings the computation closer to the data, which can improve response time and save bandwidth. However, most edge devices have limited computational ability, which requires the algorithms to be very efficient. To deploy deep models on an edge device, there are ways to serialize the model so that it can work independently of the original environment. There are also methods that can accelerate and optimize the deep model for higher throughput.
","uni":"xs2445","language":"Python, PyTorch","pid":"202205-2","m4uni":"","analytics":"In this project, an edge computing system was constructed based on a Jetson Nano and a local computer. The Gazebo simulation environment was used to provide simulated RGB-D information. Data transfer was done over sockets.

A SOTA pose estimation method, FFB6D, was implemented and trained on Google Cloud Platform. After the serialization and acceleration of the model, it was deployed on the edge device.

Detected pose was shown in the video.","m4lname":"","industry":"Information","m3lname":"","dataset":"Linemod dataset: a dataset of RGB-D data, contains over 18000 real images with 15 different objects and ground truth pose. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Tweets-based Regional Brand Analysis: Starbucks vs Dunkin’ Donuts","timestring":"Sat Dec 22 00:54:34 2018","m1uni":"xy2378","m2lname":"YAN","m1fname":"XI","m4fname":"","m1lname":"YANG","m3fname":"FAN","description":"Objectives:
1. Study which is the more popular coffee brand in the US: Starbucks or Dunkin' Donuts

2. Practice the skills we learned from the class by:

-Crawling tweet data from the Twitter API.

-Running sentiment analysis on tweets that contain a specific brand name and come from a specific state.

-Drawing a popularity map of each brand to analyze which brand is more welcome in each state.

Innovations & Capabilities:
Our project focuses on the sentiment analysis of the tweet data. Previous research on sentiment analysis places more importance on prediction and classification. Our work, however, uses sentiment analysis to analyze the commercial value of two coffee brands: Starbucks and Dunkin’ Donuts. Instead of simply concentrating on sentiment analysis, we also use a dedicated visualization method to draw a popularity map to study the spatial variance of brand popularity.","uni":"xy2378","language":"python, spark, twitter, d3, html","pid":"201812-18","m4uni":"","analytics":"1. Data collecting
At the very beginning of our project, we use Tweepy, a Python library for accessing the Twitter API, to collect tweets from different states that contain the chosen brand name in their text. We pass two filters (location, brand name) as parameters of the search function to retrieve the wanted tweets from the Twitter API.
2. Sentiment analysis
We chose the Naive Bayes classification model in the Spark ML library to perform the sentiment analysis in our project. There are two parts: Sentiment Classifier Model Training and Sentiment Prediction. For the second part, the preprocessing pipeline is: Data Cleaning, Regex Tokenizer, Stop Words Remover, Hashing TF-IDF
3. Visualization
The visualization mainly focuses on showing the spatial variation of brand popularity. Instead of simply using charts, popularity maps are drawn to show the popularity of the brands in different regions. Colors are used to indicate the popularity. Generally, the darker the color, the higher the ratio and the more positive the attitude towards the brand in the state. To help users better understand the results of sentiment analysis and view the original results, we enable dynamic viewing. Once users move the mouse over a specific state, they can check the detailed sentiment analysis result for that state. For the popularity map of a single brand, users can view the number of positive and negative tweets as well as the ratio of positive tweets. For the comparison map, they can examine the positive ratio of the two brands.","m4lname":"","industry":"Media","m3lname":"WU","dataset":"We got 96 CSV files as our raw dataset, of which 48 are for Starbucks and the other 48 for Dunkin’ Donuts. Each file contains the tweets we collected from one state in America, excluding Hawaii and Alaska (sample shown below). To sum up, there are 295402 records (37.5M) related to Starbucks and 145571 records (19.1M) for Dunkin’ Donuts. Overall, the whole dataset contains 440973 records (56.6M). We extract the following features of each tweet:
1.Location: The State the tweets come from.
2.Text: The content of the tweets.
We use the Twitter Sentiment Analysis Dataset for model training in Spark. The dataset is based on data from the following two sources:
1) University of Michigan Sentiment Analysis competition on Kaggle.
2) Twitter Sentiment Corpus by Niek Sanders.","m2uni":"gy2266","m2fname":"GEQI","m3uni":"fw2322"},{"projectname":"Music Emotion Recognition","timestring":"Fri Dec 13 15:13:03 2019","m1uni":"jc4673","m2lname":"Jiang","m1fname":"Jesse","m4fname":"","m1lname":"Cahill","m3fname":"Zile","description":"Our goal is to create an innovative music recommendation system that recommends new music to a user not because other users with similar habits have enjoyed it, but because it elicits the same sense of emotion in the user. Our objectives are:

1. Build kernel density measures of arousal and valence based on available training data
2. Create a model to predict new kernel densities based on training kernels and engineered audio features from new songs
3. Recommend new songs based on the similarity of the parameters of the predicted kernel to all known kernels

Our system is capable of taking a user's uploaded song and recommending new music from our database based on the emotional similarity to the uploaded song. The importance of this tool lies in its innovation in how it recommends new music to a user.","uni":"jc4673","language":"Python (Django, Numpy, Pandas, Librosa), Google BigQuery","pid":"201912-32","m4uni":"","analytics":"We performed Gaussian kernel density estimation to create the ground-truth empirical probability density functions of our valence-arousal music emotion data. We used the KNN algorithm to find the nearest neighbors of a new piece of music in the audio space, which facilitated our recommendations. We visualized our predicted and ground-truth PDFs using python Seaborn heatmaps for visual comparison.","m4lname":"","industry":"Media","m3lname":"Wang","dataset":"MediaEval2013 dataset was created for ‘Emotion in Music’ task of the 2013 MediaEval benchmarking initiative for multimedia evaluation. It contains 744 song snippets (45-second long) clipped from Free Music Archive. It also contains sampled continuous annotations on each music piece from > 10 humans.","m2uni":"zj2249","m2fname":"Zhongling","m3uni":"zw2610"},{"projectname":"TradeFx: An AI-Powered FX Trader","timestring":"Fri May 3 17:06:01 2024","m1uni":"jw4455","m2lname":"Cheng","m1fname":"Jianghao","m4fname":"","m1lname":"Wu","m3fname":"","description":"Objectives：

Predict FX Rates: Leverage machine learning to forecast future currency exchange rates accurately.
Provide Investment Analysis: Utilize an AI-powered chatbot to offer investment suggestions and analysis.
Trade: Implement automatic trading strategies that align with user-defined short or long-term preferences.

Innovations:

Integration of Machine Learning and Reinforcement Learning: Combining ML for prediction and RL for active trading decisions is innovative in the application of these technologies in the forex market.
AI-Powered Investment Advisor: A chatbot that not only provides general advice but also understands and reacts to market sentiment, enhancing decision-making for traders.

Capabilities:

Forecast future currency exchange rates.
AI-powered chatbot offering investment suggestions and analysis.
Trading simulations.

Why important?

Utilizing machine learning could enhance predictive accuracy, enabling traders to make more timely investment decisions.
The AI-powered investment advisor (chatbot) supports informed decision-making.
","uni":"zc2747","language":"Python, html, javascript, css","pid":"202405-05","m4uni":"","analytics":"SARIMA, Reinforcement Learning, OpenAI Gym, gpt3.5, flask, matplotlib, plotly","m4lname":"","industry":"Finance","m3lname":"","dataset":"Forex price historical data from Kaggle

Forex related news from ForexLive.com","m2uni":"zc2747","m2fname":"Zekai","m3uni":""},{"projectname":"Predicting factors that can influence the success of startups","timestring":"Fri May 6 22:32:59 2022","m1uni":"cg3286","m2lname":"Jawale","m1fname":"Chinmay","m4fname":"","m1lname":"Garg","m3fname":"","description":"For this project, we have decided to focus on solving the problems faced by startups on a daily basis. All across the world, new startups are being created every week, each requiring lots of resources such as time and money. However, approximately 90% of startups fail due to various reasons such as running out of funds, pricing/cost issues, disharmony on teams/investors, no financing/investor interest and legal issues among many others.

Investors are flooded with hundreds of startups, which, along with the inherent uncertainty of startups, makes it difficult for them to find profitable businesses to invest in. Therefore, we aim to find whether certain factors, such as current revenue, funding, and cash flow, can be used to predict the likelihood of a startup’s success. These factors could be used to recommend startups that are likely to succeed to investors. They could also help investors, accelerators, venture capital firms, and the startups themselves identify strong, investable startups.

We have provided an analysis of the key attributes that we find can be used to predict a startup's success and also provide an interactive webapp to deliver these insights and predictions.","uni":"cg3286","language":"HTML, CSS, JavaScript, Vue, Flask, AWS, Python","pid":"202205-12","m4uni":"","analytics":"Chart.js, scikit-learn, pandas, numpy, Random Forest, Logistic Regression, MLP, XGBoost, Matplotlib","m4lname":"","industry":"Information","m3lname":"","dataset":"Alchemist Accelerator dataset (private) - requested from accelerator under NDA
Latka - Scraped
SignalNFX - Scraped
Crunchbase - Github repo","m2uni":"pcj2105","m2fname":"Parth","m3uni":""},{"projectname":"What The Pod","timestring":"Fri Dec 15 20:20:19 2023","m1uni":"kj2546","m2lname":"Chang","m1fname":"Karen","m4fname":"","m1lname":"Jan","m3fname":"Ziyao","description":"People have a desire to stay up-to-date with current events. But with challenges like misinformation, fake news, and sensationalization, it is increasingly difficult to find reliable and authoritative sources. Certain podcasts help mitigate these issues by plainly stating the facts and then bringing in experts to explain the nuances, consequences, and offer their opinion.

Not everyone has the time to listen to and absorb all the information in a podcast (especially long-running ones). Our goal is to create a Q&A system where users can ask specific questions and get fast, correct, ethical answers without having to listen to the podcast(s).

The novelty of the project lies in that it will use Beyond The Screenplay, Prosecuting Donald Trump, and American History Hit as its information bank to answer questions on how to build or improve movie scripts, Donald Trump’s various impeachments, and/or American History. ","uni":"kj2546","language":"Python, Google Cloud Platform, Airflow","pid":"202312-9","m4uni":"","analytics":"We leveraged the RAG (Retrieval Augmented Generation) framework, airflow for continuous updates and scheduling, tkinter for UI, and various open source/pay-to-use models. This includes: Whisper, GPT-3.5, and all-MiniLM-L6-v2. ","m4lname":"","industry":"Information","m3lname":"Tang","dataset":"Novel dataset comprised of three podcasts: Beyond the Screenplay, Prosecuting Donald Trump, and American History Hit. We fetched the audio of these podcasts via the listennotes api and then transcribed the audio to text using OpenAI's Whisper model. ","m2uni":"cc4900","m2fname":"Eric","m3uni":"zt2338"},{"projectname":"Recommendations on Steam Games","timestring":"Fri Dec 17 20:23:38 2021","m1uni":"xh2469","m2lname":"Jiang","m1fname":"Xinyu","m4fname":"","m1lname":"He","m3fname":"Xiaohang","description":"As the main form of entertainment for young people, video games have a huge market in today's society. The research object of our project is Steam, which is the largest digital distribution platform for PC gaming, holding around 75% of the market share. The aim of our project is to provide users with predictions and recommendations about video games on Steam.

The existing game recommendation system on the Steam platform is not effective enough. The system can only recommend based on a single feature. Not only does the good review rate have a low weight, but some high-quality games cannot appear on the recommended homepage because of their low sales.

Our goals: Build a better system to recommend games for Steam players.
1. Collect daily updated data of around 30,000 existing games on Steam market
2. Preprocess the data properly by constructing new features of each game
3. Establish machine learning models which are comprehensively trained
4. Predict required features of each game with the best performing models
5. Calculate the score of each game by assigning different weights to different features
6. Analyze the Tag decomposition of high-scoring games
","uni":"xh2469","language":"Python, Javascript, Airflow, Google Cloud","pid":"202112-45","m4uni":"","analytics":"Data Preprocessing: one-hot encoding, natural language processing(NLP) for text-based columns, Principle Component Analysis (PCA) for features dimension reduction

Machine Learning Models: Linear Regression, Random Forest Regression, Gradient Boosting Regression

Game Score and tag's topic analysis: Latent Dirichlet Allocation Model","m4lname":"","industry":"Information","m3lname":"He","dataset":"Our system is built based on the data obtained from Steamspy, a website that streams data from Steam.
We use the Steamspy API to obtain the data. We use the game app ids to identify games and collect information like:
- Name: name of the game
- Developer: developer of the game
- Publisher: publisher of the game
- Score_rank: rank of the game
- Positive: number of positive reviews
- Negative: number of negative reviews
- Userscores: user score of the game
- Owners: range of number of game owners
- Average_forever: average number of daily active users
- Average_2week: average number of daily active users for the past two weeks
- Median_forever: median number of daily active users
- Median_2week: median number of daily active users for the past two weeks
- Price
- Initialprice
- Discount: discount for the game
- Language: language available for the game settings
- Genre: 30+ genres, each game can have multiple genres
- ccu: number of concurrently connected users
- Tags: number of tags in game review/comments, each game can have multiple tags, 400+ unique tags in total

In total, we collected data for 29,235 games from Steamspy.
","m2uni":"tj2441","m2fname":"Taotao","m3uni":"xh2509"},{"projectname":"MVP Award Prediction","timestring":"Sat Dec 18 03:06:25 2021","m1uni":"mh4116","m2lname":"Zhou","m1fname":"Mingzhe","m4fname":"","m1lname":"Hu","m3fname":"Zichen","description":"The project is designed for real-time MVP award prediction with streaming datasets, stacked models, and friendly user interfaces. A data fetching program is designed to fetch the player's data as well as related news every day in the morning. A prediction algorithm program is designed to predict the winning probability of three awards: most valuable player (MVP), most improved player (MIP), defensive player of the year (DPOY). For visualization, we have developed a user-friendly interface, including cache in local browser for faster data fetch, a login as a member to record your search history, fuzzy search, debounce to avoid frequent data requests. We used Python web crawler and BigQuery for data fetch, PySpark for data preprocessing and modeling. We used JS, CSS, HTML, React for front-end design and Node.js and Express for back-end design.","uni":"mh4116","language":"Google Cloud Platform; Python, PySpark, Node.js, React, Express, JavaScript, HTML, CSS","pid":"202112-04","m4uni":"","analytics":"The system consists of front-end, back-end, algorithm, and dataset. Algorithms include seven machine learning models: SVM, linear regression, GBT, decision tree, random forest, MLP, and naive bayes. We visualized the results with React, HTML, CSS, and JS. We analyze the project in two ways: lighthouse performance evaluation for the webpage; prediction comparison with comments in google's top searched websites.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"The data in the dataset comes from three authorized websites: nba.com for icons; basketball-reference.com for technical data; award voting details for labeling. We fetched the data with Requests, beautiful soup, and Selenium. 
Our software can also support data from the team.","m2uni":"yz3917","m2fname":"Yuting","m3uni":"zw2669"},{"projectname":"Cryptocurrencies Prediction & Forecast","timestring":"Sat Dec 18 02:17:25 2021","m1uni":"wl2777","m2lname":"Ho","m1fname":"Wei-Ren","m4fname":"","m1lname":"Lai","m3fname":"Shuoting","description":"Novelty:
Reframe the problem - Predict the timing of buy and sell instead of price
Self-invented method to predict the sweet buying/selling timing
Monitor large transactions
Real-Time whale alerts
Event subscription

3V:
Volume
The volume of transaction price/volume
Twitter posts
Reddit posts
Google trend
Wikipedia pageview
Velocity
Twitter streaming data
Transaction streaming data
Variety
Cryptocurrency transaction structural data
Cryptocurrency wallet in and out data
Twitter text data
Reddit text data
Google search popularity
Wikipedia Bitcoin topic popularity

Business value:
Provide valuable and accurate metric
Offer a powerful tool for investors to increase their profit
Help investors earn more money
","uni":"wl2777","language":"python/GCP","pid":"202112-14","m4uni":"","analytics":"WaveNet
Convolution Vision Transformer
backtesting algorithm
self-invented labelling algorithm
Flask
Server-Sent Events
JavaScript
Highcharts
GCP Pub/Sub
GCP BigQuery/Firestore/Cloud Scheduler/Cloud Functions
Airflow
Sendgrid","m4lname":"","industry":"Finance","m3lname":"Kao","dataset":"We get the Bitcoin price and transaction details from Poloniex. (https://docs.poloniex.com/#introduction)
Binance API. (https://binance-docs.github.io/apidocs/spot/en/)
Wikipedia API
Google Trends API (pytrends)","m2uni":"ch3561","m2fname":"Cheng-Hao ","m3uni":"sk4920"},{"projectname":"Amazon Product Recommendation Assistant","timestring":"Fri May 3 05:28:54 2024","m1uni":"xq2234","m2lname":"Li","m1fname":"Xinyu","m4fname":"","m1lname":"Qiu","m3fname":"","description":"Objectives:
1. Enhance Customer Interaction: Create a user-friendly webpage where customers can ask questions about Amazon products, which will be automatically answered by the agent.
2. Improve Decision-Making for Suppliers: By analyzing customer inquiries and interactions, the system will help suppliers understand market trends and customer needs more effectively.
3. Boost Customer Service Efficiency: Equip the customer service department with new tools and technologies such as AI, data analysis, and blockchain to improve work quality and efficiency.
4. Increase Customer Satisfaction and Loyalty: By providing timely and relevant product recommendations and answers, aim to enhance customer satisfaction and stickiness.

Innovations:
1. Real-Time Data Handling: Address challenges related to the high volume, velocity, and variety of data by integrating real-time data processing and analysis capabilities.
2. Multidisciplinary Approach: Utilize a blend of techniques from statistics and other fields to better understand and cater to customer needs and preferences.
3. Advanced AI and ML Techniques: Employ cutting-edge AI methodologies like NLP, machine learning, deep learning, and reinforcement learning to refine the system's ability to process queries and generate insights.
4. Dynamic Content Generation: Use sentiment analysis and keyword extraction (TF-IDF) to dynamically create personalized responses and recommendations based on user queries.

Capabilities:
1. Data Collection and Analysis: Implement AI-driven tools like web crawlers and API interfaces to gather, clean, and analyze vast amounts of product and financial data.
2. Personalized User Experience: Develop a system that can not only respond to user queries with high accuracy but also predict and understand user preferences through sentiment analysis and keyword trends.
3. Scalable and Adaptable System: Design the backend to handle a large and constantly changing dataset, ensuring the system is adaptable to varying customer needs and market conditions.
4. Interactive and Engaging Interface: Construct a frontend that facilitates easy interaction with the agent, allowing users to ask questions and receive answers and recommendations efficiently.

The research and toolkits in the project are critical for several reasons, each contributing significantly to the system's overall effectiveness and efficiency in addressing the needs of both customers and businesses. Here's a breakdown of why these elements are important:

1. Advanced AI and Machine Learning Techniques
- Personalization: AI-driven tools, especially those utilizing machine learning and deep learning, can analyze user data and interactions to offer personalized recommendations and responses. This not only improves the customer experience by making it feel more tailored and relevant but also increases the likelihood of customer retention and satisfaction.
- Predictive Analytics: These technologies enable the system to predict future behaviors and preferences based on historical data, which can help businesses anticipate market trends and customer needs more accurately.

2. Natural Language Processing (NLP)
- Improved Communication: NLP allows the system to understand and generate human-like responses to customer queries. This capability is fundamental in automating customer service, reducing response times, and freeing human agents for more complex issues.
- Sentiment Analysis: By gauging the sentiment behind customer inquiries or feedback, the system can offer more nuanced responses and alert human operators to potential customer dissatisfaction or delight, which can be crucial for customer relationship management.

3. Data Analysis and Visualization
- Insightful Decision-Making: The ability to quickly process and visualize data helps businesses understand complex scenarios and make informed decisions. This is particularly important in dynamic environments like online retail, where consumer preferences and market conditions can change rapidly.
- Operational Efficiency: Efficient data handling and visualization reduce the time and resources required to derive actionable insights from large datasets, improving overall business efficiency.

4. Integration of Diverse Data Types
- Comprehensive Analysis: Handling various data types (structured, semi-structured, and unstructured) allows for a more holistic analysis of customer interactions, market conditions, and product performance. This diversity in data integration ensures that the insights generated are comprehensive and encompass all relevant facets.
- Complex Problem-Solving: The ability to merge and analyze different data types facilitates complex problem-solving, enabling the system to address multifaceted issues that may involve various aspects of the business and customer experience.

5. Multidisciplinary Approaches
- Broader Perspectives: Incorporating knowledge from disciplines such as statistics helps in understanding the underlying human behaviors and patterns that influence customer decisions. This deeper understanding can lead to better product designs, marketing strategies, and customer service approaches.
- Enhanced System Design: A multidisciplinary approach contributes to creating a system that is not only technically proficient but also empathetic and user-friendly, aligning with the preferences of users.

These research areas and toolkits are pivotal in crafting a system that not only meets the current demands of e-commerce but is also adaptable and forward-thinking, capable of evolving with technological advancements and changing market dynamics.","uni":"xq2234","language":"Language: Python, HTML; Platforms: VsCode, Google Chrome(Frontend)","pid":"202405-7","m4uni":"","analytics":"Analytics:
1. Average Score Calculation: calculates the average of the sum of two columns, ratingScore and sentiment_score, from CSV files. This involves basic statistical analysis to derive a mean value, which is used to assess overall performance or sentiment related to items.
2. Data Merging and Enrichment: There's a merging process where average scores data is combined with item details data based on an identifier (asin). This enriches the item data with calculated average scores, which could be crucial for further analysis or decision-making.
3. Combined Average Calculation: calculates a combined average score using the item's rating and the computed average score. This could be used to provide a more holistic view of the item's performance or appeal to customers.
4. Sentiment Analysis: uses NLTK's Sentiment Intensity Analyzer to calculate the sentiment scores of product reviews. This involves computing the 'compound' score which is a normalized score of sentiment polarity.
5. Color Detection: includes functionality to detect color names mentioned in product titles. It checks if words in a string are valid color names using the colour library.
6. Price Data Extraction: extracts numerical values from a string representation of a list of dictionaries stored in the 'prices' column, which appears to be related to product pricing information.
7. Data Collection and Aggregation: uses the Rainforest API to fetch product search results from Amazon based on a specified search term, and then aggregates the product links for further analysis.
8. Review Collection: further collects detailed reviews for products starting from the fourth item in the search results, using an Apify Actor, which is designed to scrape review data from given product URLs.
9. Natural Language Processing (NLP): uses spaCy's NLP capabilities to perform Named Entity Recognition (NER) and noun extraction from product titles. This is used to derive subcategories from the text data, which enriches the dataset with more granular information.
10. Text Classification: A Multinomial Naive Bayes classifier is trained using a TF-IDF vector representation of product titles to predict the category of a product. This forms the basis for automated category classification based on textual data.
11. Item Recommendation System: implements an item recommendation system. It uses cosine similarity to measure the similarity between the TF-IDF vectors of product titles, facilitating the recommendation of similar items based on text content.
12. Web Scraping: extracts data from an Amazon product page. It collects information such as product name, author (or equivalent attribute), ratings, number of customer ratings, and price.
13. Data Aggregation: aggregates the data across pages (although no_pages is set to 2, implying it collects data from two pages), compiles all the data into a single list, and then converts this list into a DataFrame.
14. Price Data Cleaning and Conversion: adjusts the actual_price column in a DataFrame by removing currency symbols and commas, converting it into a float format suitable for numerical operations.
15. Named Entity Recognition (NER): Utilizes a pre-trained RoBERTa model from the Hugging Face's transformers library, specifically fine-tuned for detecting proper nouns in text, which can be essential for extracting specific entities or keywords.
16. Noun Extraction: Using NLTK's tokenization and POS-tagging to extract nouns from the text. This is useful for various applications such as content categorization, keyword extraction, and information retrieval.

Algorithms:
1. Basic arithmetic operations (sum and mean calculations) and conditional logic.
2. Sentiment Intensity Analysis: involves determining the emotional tone behind a series of words, using predefined models in NLTK.
3. Color Validation: The check_color function uses the colour library to validate whether a string is a recognized color name, handling exceptions if the color name is invalid.
4. Data Transformation and Extraction: Transforming string data into usable formats (like converting string representations of lists into actual lists) and extracting specific values from complex data structures using Python’s ast.literal_eval.
5. TF-IDF Vectorization: This technique transforms text data into a format suitable for machine learning models, emphasizing words that are unique to a document in a collection of documents (corpus).
6. Naive Bayes Classification: Utilized here for its effectiveness in text classification tasks, especially with high-dimensional data.
7. Cosine Similarity: Used to calculate a numeric value that denotes the similarity between two documents. In this case, it is used to find products whose titles are semantically similar to a query product title.
8. Named Entity Recognition (NER): Employed to identify entities like products or organizations in text, which can help in extracting useful features from product titles.
9. Sentiment Analysis using VADER: A lexicon and rule-based sentiment analysis tool that is part of the NLTK suite.
10. Color Validation: Uses the colour Python package to validate string inputs as legitimate colors.
11. Token Classification: Utilizes a transformer-based model for token classification tasks, providing detailed insights into the text's structure by identifying proper nouns.
12. POS Tagging: Applies NLTK’s part-of-speech tagging to identify nouns in a text, which are then filtered based on predefined conditions.

System Modules:
1. File Handling and I/O Operations: reads from and writes to CSV files, which involves file system operations using Python’s standard os and pandas library functionalities.
2. Error Handling: basic error handling during the file reading process to manage exceptions that might occur, such as file not found or data parsing errors.
3. Directory Management: Using os.listdir to navigate directories.
4. Data Manipulation: Heavy use of pandas for reading from and writing to CSV files, transforming data frames, and applying functions to data columns.
5. HTTP Requests: Utilizes the requests module to make HTTP requests to the Rainforest API to retrieve Amazon product search results.
6. Data Handling with Pandas: Uses the pandas library to organize the data fetched from the API into a DataFrame, which is then saved into a CSV file for each product based on its ASIN (Amazon Standard Identification Number).
7. Apify Integration: Integrates with Apify using the apify_client to automate the process of fetching and handling web-scraped data, including managing complex configurations like proxies.
8. SpaCy: An industrial-strength natural language processing library used for text processing and entity recognition.
9. Scikit-learn: Utilized for creating the machine learning pipeline, including TF-IDF vectorization, the Naive Bayes classifier, and implementing train/test splits for model validation.
10. Pandas: Extensively used for data manipulation and reading/writing CSV files. It allows the aggregation and transformation of dataset features needed for further processing.
11. OS Module: Used for directory and file manipulation, helping in managing dataset files.
12. Requests: Utilized for making HTTP requests to fetch web pages. It handles the network interaction needed to retrieve the HTML content.
13. BeautifulSoup (from bs4): Used for parsing HTML content and extracting data. It navigates through the HTML tree and retrieves the required information based on specified tags and attributes.
14. Pandas: Used for creating a DataFrame from the scraped data, which allows for easier manipulation, analysis, and storage of structured data.
15. Numpy: Although imported, it's not directly used in the script shown. Typically, it would be used for numerical operations.
16. Matplotlib and Seaborn: These are visualization libraries, but no visualization code is executed in the provided script. They could be used to plot and examine trends in the data, such as price distributions or ratings.
17. Regular Expressions (re): Imported but not used in the provided snippet. It's commonly used for searching patterns in text, which can be helpful in data cleaning or extraction tasks.

Visualization:
The retrieved products will be presented in the form of pictures.","m4lname":"","industry":"Information","m3lname":"","dataset":"Dataset: 1. One existing Amazon product dataset(from Kaggle);
2. Real-time data from Amazon (crawled using a nested API consisting of Rainforest and Apify).

Other data: Not available ","m2uni":"sl5394","m2fname":"Siyu","m3uni":""},{"projectname":"Real-Time Translation","timestring":"Sat Dec 17 05:13:55 2022","m1uni":"lz2811","m2lname":"He","m1fname":"Robin","m4fname":"","m1lname":"Zhang","m3fname":"Fan","description":"Because some international students struggle with academic listening and communication due to language barriers, efficient and universal real-time translation methods are needed to achieve barrier-free academic communication. Our system, designed specifically for academic environments, improves the interpretability and throughput of real-time speech translation and can help international students with language barriers understand speech and communication content in the academic field more efficiently. The system is also of value for everyday interpersonal communication.
Our system provides a user-friendly website for students to record their lectures. It automatically generates captions, translates them into different languages, and saves them as notes as the lecturer speaks. It also provides real-time text summarization to help students understand the lecture as a whole.","uni":"lz2811","language":"Python, HTML, JavaScript, GCP","pid":"202212-17","m4uni":"","analytics":"There are 3 main parts of our system: a speech-to-text model (automatic speech recognition), a text-to-text model (machine translation), and a front-end website.
The S2T model was transfer-learned from fairseq S2T, a PyTorch-based model that uses a dual-decoder transformer architecture. The translation model also adopts self-attention and the transformer as its base to implement the encoder-decoder architecture, with additional preprocessing layers such as word embedding. The website is written in React.js and hosted on Google Cloud.","m4lname":"","industry":"Information","m3lname":"Wu","dataset":"We used audio data from CoVoST, a large-scale multilingual speech-to-text corpus, and translation corpora of TED talks, movies, and TV shows from OPUS (open parallel corpus).
Our system is trained and tested on those datasets. It also accepts audio input in any format, which is then analyzed and translated into text in other languages.","m2uni":"jh4593","m2fname":"Jiamiao","m3uni":"fw2392"},{"projectname":"Music Recommendation System with Emotional Analysis Chatbot","timestring":"Fri Dec 20 23:40:03 2024","m1uni":"hl3805","m2lname":"Peng","m1fname":"Hailin","m4fname":"","m1lname":"Liu","m3fname":"Jiangkun ","description":"Our project aims to develop an emotionally intelligent music recommendation system integrated with a conversational chatbot. The primary objectives are:
To enhance personalization by aligning music recommendations with the user’s preferences and emotional state.
To address the cold-start problem by leveraging emotion detection to generate meaningful recommendations even in the absence of explicit user preferences.
To improve the scalability and efficiency of recommendation systems through cloud deployment.
The project is innovative in its integration of advanced emotion analysis, dynamic conversational interfaces, and robust recommendation algorithms. By employing models such as EmoRoBERTa for emotion detection and autoencoders for similarity modeling, our system surpasses traditional methods that often overlook the emotional dimension of music consumption. This toolkit is important because it bridges the gap between human emotions and technology, delivering personalized, engaging, and user-centric experiences.
","uni":"hl3805","language":"Our system was developed using: Programming Language: Python. Backend Framework: Flask. Cloud Deployment: Render for production, PythonAnywhere for initial testing. APIs: Hugging Face API for emotion detection. Spotify API for song metadata retrieval. This combination ensures robust functionality, scalability, and real-time interactions.","pid":"202412-20","m4uni":"","analytics":"The system implements several analytical and machine learning techniques:

Emotion Analysis:
Leveraged EmoRoBERTa, a fine-tuned variant of DistilRoBERTa, to extract emotional states (e.g., joy, anger) from user dialogues.
Hugging Face API was used for real-time emotion detection.
Content-Based Filtering:
Extracted emotion and genre tags from Last.fm metadata.
Clustering techniques (K-means, UMAP, and HDBSCAN) were explored but replaced by Large Language Models (LLMs) like GPT-3.5-turbo for better interpretation of noisy tag data.

Item-Based Collaborative Filtering:
Implemented three methods:
K-Means Clustering: Grouped songs by audio features but lacked ranking capability.
SimCLR (Contrastive Learning): Used augmented feature pairs for learning robust representations, but convergence was slow due to limited features.
Autoencoder with Similarity Loss: Achieved superior performance by compressing features into low-dimensional embeddings, optimized with cosine similarity loss.

Recommendation Workflow:
Five songs are recommended: one directly matching user preferences and four similar tracks for diversity.

Visualization:
Clustering results and training loss trends were visualized to compare performance across methods.
","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"We utilized datasets from Spotify and Last.fm for this project:
Spotify Dataset:
Contains 494,825 records with quantitative audio features like danceability, energy, tempo, and key.
These features were retrieved via the Spotify API, adhering to its rate-limiting policies.
The dataset was normalized and cleaned for uniformity.
Last.fm Dataset:
Contains 491,181 tag records and user listening histories.
Tags are user-generated and were filtered to retain the top 85% cumulative frequency, reducing the dataset from 522,366 tags to 49,181.
Data Integration:
Datasets were merged based on song title and artist name to form a final dataset of 494,825 rows and 18 columns.
These datasets provide a comprehensive foundation for aligning song metadata with user inputs and emotional cues, enabling precise filtering and recommendations.
","m2uni":"pp2921","m2fname":"Pai ","m3uni":"jw4698"},{"projectname":"Molecular Toxicity Prediction","timestring":"Fri May 5 16:38:07 2023","m1uni":"df2790","m2lname":"","m1fname":"Demetrios","m4fname":"","m1lname":"Fassois","m3fname":"","description":"The goal of the project is to predict the toxicity of molecules that are potential drug candidates. This task is an important stage of the automatic drug synthesis process, called lead optimization, which aims to improve the properties of a potential drug candidate in order to increase its chances of success in clinical trials and approval as a medication.

The main idea of the project is to use input representations of the molecule with different modalities, and multiple prediction outputs, in order to enable more efficient learning. For this reason, the problem is formulated with the help of a multi-modal, multi-task deep learning model that predicts 12 signs of toxicity, with separate sub-models to process the inputs of different modalities.

This approach is expanded even further with the use of transfer learning on a similar molecular physiology dataset, that measures 27 adverse reactions from drugs. Finally, a front-end application uses LIME to show the molecule fragments that contributed the most to positive predictions for each sign of toxicity.","uni":"df2790","language":"Python, Dash, Google Cloud AI Platform, Google Cloud Storage","pid":"202305-2","m4uni":"","analytics":"Tensorflow, numpy and transformers are used to pre-process the datasets and create new DeepChem datasets that combine the three different molecule representations used as inputs for the models. Tensorflow is also used to create the multi-modal model architectures, that use attention modules to concatenate and process the embeddings from the sub-models that process the different input modalities. These models are wrapped with the DeepChem model class, that facilitates the training process. The Google Cloud AI Platform is used to automate the training process based on parameters parsed from user inputs and to easily deploy the same training job in the cloud for faster processing. The Google Cloud Storage is used to store the datasets that the training job can access and to store logs from the training experiments and model checkpoints. Dash is used to implement the front-end application, that uses LIME to compute and display the most significant fragments from a molecule that contribute to a positive prediction. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Both datasets used are from the Physiology category of the MoleculeNet suite of datasets, included in the DeepChem library that was used for this project.

The Tox21 dataset, a public dataset from the Toxicology in the 21st Century initiative that measures 12 different signs of toxicity for 8 thousand compounds, is used for the main experiments.

The Side Effect Resource (SIDER) dataset, which measures 27 side effects of 1,427 approved drugs, is used for the transfer learning experiments.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Simulating Artificial Consciousness Via Emotion-Driven Personality in Large Language Models","timestring":"Tue May 13 23:47:21 2025","m1uni":"4439","m2lname":"Li","m1fname":"Yanming","m4fname":"","m1lname":"Wei","m3fname":"","description":"This project aims to simulate artificial consciousness by integrating personality traits, emotional state, and conversational memory into a large language model. By combining prompt-based personality injection, real-time emotion updates, and lightweight LoRA fine-tuning, the system enables consistent, persona-driven dialogue. It demonstrates a scalable and efficient way to create emotionally aware AI without modifying the base model, offering potential for more human-like, aligned, and engaging AI assistants in areas like education, therapy, and storytelling.

","uni":"yw4439","language":"We use python on Colab to design the project.","pid":"202505-14","m4uni":"","analytics":"Our system integrates multiple modules to simulate personality-aware and emotionally responsive dialogue. We implemented prompt-based personality injection using OCEAN traits, an emotion update module based on keyword or rule-based detection, and a memory manager that tracks user interaction history and emotional state. For model adaptation, we applied parameter-efficient fine-tuning using LoRA (Low-Rank Adaptation). Visualization includes persona-specific response examples and traceable memory/emotion updates across dialogue turns. The overall pipeline combines prompt engineering, lightweight training, and contextual control to create a cohesive, adaptive AI behavior model.","m4lname":"","industry":"Media","m3lname":"","dataset":"For this project, we generated our training data using the CoCoMo (Consciousness-Cognitive-Memory) framework, which provides personality-annotated dialogue samples. CoCoMo allows for precise control over OCEAN personality traits, making it well-suited for our fine-tuning objectives. Instead of using noisy or loosely labeled public datasets, CoCoMo ensures that each sample aligns with a quantifiable personality profile. Although our primary dataset was synthetic, the system is designed to support any dialogue dataset that includes or can be mapped to structured personality and emotion annotations.","m2uni":"5644","m2fname":"Sihang","m3uni":""},{"projectname":"Amazon Phone Data Analysis","timestring":"Wed Dec 18 04:08:13 2019","m1uni":"zl2696","m2lname":"Min","m1fname":"Zhihao","m4fname":"","m1lname":"Liu","m3fname":"","description":"Nowadays, people enjoy more efficient and easier way for shopping, e-shopping.

People are willing to buy their cell phone online, like on Amazon, instead of going to stores and trying all kinds of products before making their decisions.

On Amazon, only two metrics, reviews and ratings, help customers understand the user experience of a phone. This is insufficient for customers deciding whether to click the buy button, especially when two similar cell phones have nearly the same price.

Thus, our project aims to provide more metrics to help customers make their own decisions.","uni":"zl2696","language":"Python, PySpark, JavaScript, Flask.","pid":"201912-38","m4uni":"","analytics":"We preprocessed our data set and ran several data analyses on it. We also evaluated review content using sentiment analysis, and we present our results via a website implemented with Flask.","m4lname":"","industry":"Retail","m3lname":"","dataset":"We obtained our data set from Kaggle. It contains information on phone products from Amazon.com.
Here are some specific descriptions of our dataset.

1. items.csv
Contains 700+ cell phone items from Amazon.com with minimal 1 star review.
2. reviews.csv
Contains 70000+ reviews for all products at items.csv.","m2uni":"tm2977","m2fname":"Tianchen","m3uni":""},{"projectname":"Predicting Lending Club Loan Status","timestring":"Sat Dec 22 23:48:10 2018","m1uni":"es3573","m2lname":"Tian","m1fname":"Erik","m4fname":"","m1lname":"Su","m3fname":"Rishabh ","description":"Our objective in this project is to build a classifier that is able accurately to predict the loan status of an applicant from their Lending Club loan application. The loan status is an indication of whether or not the loan will be fulfilled and thus if investments turn into fruition. This is important information to be able to predict for both the investor and applicant and we aim to reduce the overhead involved in prediction.

We build upon previous studies of this dataset by applying different preprocessing criteria, stacking models, and balancing the data before training our models.

This research is important for both the investor and applicant since our findings suggest that a majority of the application is not important toward predicting the loan status. Our model can be used to accurately predict the loan status of an applicant with fewer fields to fill out in an application, further reducing the complexity in dealing with loans.

","uni":"es3573","language":"Python, R","pid":"201812-40","m4uni":"","analytics":"Variance inflation factor, general linear models, and Pearson correlation heat maps were used in preprocessing through the seaborn, sci-kit, and R packages. The machine learning algorithms were from PySpark's machine learning library along with its evaluation packages. The stacked model implementation was a custom algorithm based on out-of-fold predictions. Undersampling was a simple ratio-based random sampling of the majority class, while oversampling was performed using the SMOTE package. Finally, visualizations were done with matplotlib. ","m4lname":"","industry":"Finance","m3lname":"Jain","dataset":"The dataset used was found on Kaggle but can also be found on the Lending Club site. It is approximately 480 MB with 890,000 observations and 75 features.

These files contain complete loan data for all loans issued from 2007 through 2015, and a data dictionary is provided in a separate file.

https://www.kaggle.com/wendykan/lending-club-loan-data","m2uni":"ht2459","m2fname":"Hangyu","m3uni":"rj2511"},{"projectname":"Financial Report Analysis and RAG System Based on Qwen2.5-7B","timestring":"Sat Dec 20 01:01:24 2025","m1uni":"xh2707","m2lname":"Yu","m1fname":"Xu","m4fname":"","m1lname":"He","m3fname":"Kaijing","description":"The objective of this project is to build an AI Assistant for Finance that can retrieve trustworthy financial information, understand complex financial documents, and generate grounded, reliable answers.
The system focuses on two core components:

A Retrieval-Augmented Generation (RAG) pipeline that serves as the knowledge engine, and a fine-tuned large language model that provides domain-specific financial expertise.

Innovations:
Document-grounded financial reasoning through RAG, ensuring that responses are based on real financial disclosures rather than memorized text.

Structure-aware document chunking, where narrative sections (e.g., MD&A) are chunked semantically by paragraphs, and tabular financial data are chunked at the row level to preserve factual integrity.

Efficient domain adaptation via QLoRA, enabling fine-tuning of a 7B model on a single GPU with low memory overhead.

Capabilities:

Retrieving relevant passages from long financial documents (e.g., SEC filings).

Generating financially grounded, well-structured, and instruction-following responses.

Maintaining consistent system identity and formatting in user-facing applications.

Importance:
By combining RAG with domain-specific fine-tuning, the system reduces hallucination risks, improves interpretability, and enables scalable deployment of financial AI assistants for analysis, education, and decision support.

","uni":"xh2707","language":"Python, PyTorch, Hugging Face, GoogleChroma, LLaMA-Factory, FAISS, QLoRA, React","pid":"202512-4","m4uni":"","analytics":"Dense vector retrieval using BGE-series embedding models, evaluated via Recall@5.

Retrieval-Augmented Generation (RAG) for grounding generation on retrieved financial documents.

Supervised fine-tuning (SFT) with instruction–response pairs for domain adaptation.

QLoRA (Quantized Low-Rank Adaptation) for memory-efficient fine-tuning of large language models.

Automatic evaluation metrics, including BLEU-4 and ROUGE-L, to quantify generation quality and content fidelity.","m4lname":"","industry":"Finance","m3lname":"Jia","dataset":"The system was tested using the following publicly available datasets, all obtained from Hugging Face:

FinanceRAG-Lingua: Contains structured QA pairs and reference passages from financial documents.
URL: https://huggingface.co/datasets/thomaskim1130/FinanceRAG-Lingua

Used as the primary benchmark for evaluating retrieval quality and question–answer grounding in RAG.

SEC 10-Q and 10-K Statement Tables: Provides structured financial tables such as balance sheets and income statements.
URL: https://huggingface.co/datasets/purnasai/SEC-10Q-10K-Statement-tables

Used to evaluate numerical understanding and table-based retrieval.

SEC 10-K Full Filings: Contains full-text SEC 10-K filings including MD&A and risk factor sections.
URL: https://huggingface.co/datasets/winterForestStump/10K_sec_filings

Due to the dataset size (over 13GB), approximately one-tenth of the data was streamed and ingested for initial development and testing.

Finance-Instruct-500k: Used for supervised fine-tuning of the language model.
URL: https://huggingface.co/datasets/oieieio/Finance-Instruct-500k

A large-scale instruction-tuning dataset with over 500,000 finance-related instruction–response pairs.

Qwen2.5-7B-Instruct (Model): Used as the base language model and integrated into the RAG pipeline.
URL: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

All datasets are public","m2uni":"hy2945","m2fname":"Hanyi","m3uni":"kj2712"},{"projectname":"Math Reasoning Comparison","timestring":"Tue May 5 21:29:36 2026","m1uni":"sy3302","m2lname":"Niu","m1fname":"Shengbo","m4fname":"","m1lname":"Yi","m3fname":"Ruochen","description":"This project builds a math reasoning toolkit that improves large language model problem solving through search-based reasoning. The system compares baseline Monte Carlo Tree Search with an improved MCTS pipeline guided by process-level scoring. The main innovations are adaptive branching, verifier-based step scoring, optional Process Preference Model pruning, multi-model support, and an interactive demo that visualizes reasoning trajectories. This toolkit is important because mathematical reasoning often fails when a model commits too early to one incorrect path; search and process scoring help explore alternatives, prune weak branches, and select more reliable solution steps.
","uni":"sy3302","language":"Python 3.13; Streamlit for the interactive demo UI; FastAPI and Uvicorn for backend endpoints; PyTorch for the Process Preference Model; Hugging Face datasets for MATH/OlympiadBench loading; sentence-transformers for local embeddings; OpenAI, DeepSeek, Anthropic, and Ollama/qwen2-math:7b model interfaces; LaTeX for the final report.","pid":"202605-12","m4uni":"","analytics":"Implemented algorithms and modules include baseline Monte Carlo Tree Search, adaptive MCTS, UCB-style node selection, adaptive candidate branching, duplicate and diversity-based pruning, heuristic step verification, hybrid PPM+verifier process scoring, top-k process-guided pruning, answer extraction and normalization, benchmark accuracy evaluation, FastAPI model comparison endpoints, and an interactive Streamlit visualization with reasoning cards and expandable MCTS tree views.
","m4lname":"","industry":"Information","m3lname":"Wu","dataset":"The system was tested on the MATH benchmark, especially algebra and higher-difficulty MATH problems loaded through the Hugging Face datasets library. We also support OlympiadBench-style math problems and custom user-entered math questions through the Streamlit and FastAPI interfaces. The evaluation script samples benchmark problems, extracts ground-truth answers from dataset solutions, runs direct/MCTS/MCTS+PPM strategies, and checks final answers using LaTeX normalization, numeric comparison, and token matching.
","m2uni":"hn2477","m2fname":"Hammer","m3uni":"rw3152"},{"projectname":"Prime Synapse - The Ultimate Forex AI Trader","timestring":"Fri Apr 23 21:00:03 2021","m1uni":"yj2627","m2lname":"Zhang","m1fname":"Ari","m4fname":"","m1lname":"Jiang","m3fname":"","description":"The objective of the project is to build an automated forex scalp trading system that draws insights from both state-of-the-art machine learning algorithms and traditional technical analysis methods. Its innovation lies in the innovative nature of the model we are using (i.e. incorporate time notion to LSTM with attention) as well as the fact that we incorporate machine learning results with technical analysis. The project is important because it addresses the weaknesses of human traders in forex system (short hours, less disciplined, worse at predicting future price trend). The system can also be extended for usage in other financial instrument trading. Both new and experienced traders can use it","uni":"yj2627","language":"Python, JavaScript, CSS","pid":"202105-13","m4uni":"","analytics":"Technical Analysis: Scalp trading strategy, Delphi method
Algorithms: Moving Average, LSTM with Attention + Time2Vec Network, Transformer + Time2Vec Network
Visualization: Matplotlib, TensorBoard, Echarts
","m4lname":"","industry":"Finance","m3lname":"","dataset":"We work with daily time-series forex dataset retrieved from Tradermade API (both historical and live)
","m2uni":"yz4053","m2fname":"Yin","m3uni":""},{"projectname":"Emotion and Need-Aware Empathy Chat System","timestring":"Tue May 5 03:32:47 2026","m1uni":"wc2933","m2lname":" Liu","m1fname":"Weiyi","m4fname":"","m1lname":"Chen","m3fname":"Ziyan","description":"Build an empathy-oriented dialogue system that can analyze multi-turn user conversations, track emotional changes over time, identify the user’s current support need, and generate need-aware empathetic responses.
The system has three main objectives. First, it detects the user’s current emotion and classifies it as positive, negative, neutral, or mixed. Second, it tracks the emotional trajectory across multiple turns, such as improving, worsening, stable, fluctuating, persistent negative, or planning progress. Third, it detects the user’s support need, such as emotional validation, reassurance, practical guidance, emotional regulation, action planning, or clarification, and uses this structured understanding to guide the final response.
The main innovation is adding a structured understanding layer between emotion analysis and response generation. Instead of only reacting to the latest message, the system considers recent dialogue history, RAG-retrieved knowledge, user need labels, trajectory states, and confidence scores. This helps the assistant produce responses that are not only empathetic but also more useful and context-aware.
This toolkit is important because emotional dialogue systems should not only comfort users, but also understand whether the user needs validation, reassurance, regulation, or concrete next steps. By comparing RAG and No-RAG modes and evaluating against human majority-vote labels, the project also provides a practical way to study how retrieval improves multi-turn emotion and need analysis.","uni":"wc2933","language":"The system is implemented mainly in Python. The web interface is built with Streamlit, and the backend modules are organized into separate Python files for sentiment analysis, user-need detection, emotion trajectory tracking, retrieval, prompt construction, dialogue management, and evaluation. The system uses the Gemini API for language-model-based analysis and response generation. A local markdown-based knowledge base is used for RAG retrieval. The project is developed and tested in VS Code on a local machine.","pid":"202605-31","m4uni":"","analytics":"The system implements several analysis and retrieval modules. The sentiment analyzer classifies each user message as positive, negative, neutral, or mixed, and estimates confidence using a rule-based adjustment method based on sentiment type, evidence, reasoning length, retrieved context, and sarcasm detection.
The user-need detector identifies the user's support need, including emotional validation, reassurance, practical guidance, emotional regulation, action planning, and clarification. The emotion trajectory tracker analyzes multiple conversation turns and assigns trajectory states such as improving, worsening, stable, fluctuating, persistent negative, persistent negative escalating, high-risk worsening, and planning progress.
The RAG module retrieves relevant markdown knowledge chunks from the local knowledge base, including sentiment lexicon entries, sarcasm examples, user-need definitions, and empathy strategies. The system also includes evaluation scripts for RAG vs No-RAG comparison and human majority-vote comparison. The Streamlit interface visualizes the detected emotion, user need, trajectory, confidence scores, summaries, and turn-level analysis.","m4lname":"","industry":"Information","m3lname":" He","dataset":"The project was tested on two types of data.
First, we used a self-constructed evaluation dataset containing 20 multi-turn conversation cases. These cases were designed to cover different emotional and support-seeking scenarios, including positive planning, negative escalation, study planning, sarcastic complaints, anxiety, emotional validation, practical guidance, fluctuating emotion, high-risk distress, and clarification. Each case includes a short multi-turn user conversation, an expected emotional trajectory label, and an expected user-need label.
Second, we conducted a human annotation evaluation on 8 representative multi-turn cases. Human annotators labeled the overall emotional trajectory and the main user support need for each conversation. The majority-vote labels were then used as the human reference labels and compared with the RAG system output.
The knowledge base used by the RAG module was built from local markdown files, including sentiment lexicon descriptions, sarcasm examples, user-need definitions, and empathy response strategies. These files were manually prepared and used for retrieval during emotion and need analysis.","m2uni":"yl5924","m2fname":"Yicheng ","m3uni":"zh2689"},{"projectname":"Financial Analysis on Spatio-Temporal Graph Data","timestring":"Fri Dec 15 18:17:40 2023","m1uni":"ql2505","m2lname":"Yu","m1fname":"Qingyuan","m4fname":"","m1lname":"Liu","m3fname":"Shaokun","description":"In this project we made a comprehensive exploration into predicting stock prices with Spatio-Temporal Graph data. Focused on S&P 500 companies, our study aims to construct dynamic relationships between companies and develop predictive models.
1) Spatio-Temporal Graph Data Generation
Creating the Spatio-Temporal Graph involved constructing relationships between companies based on news correlations and stock price movements.
2) Temporal Stock Price Data Analysis
We developed robust models capable of capturing both the temporal dynamics of individual stocks and the spatial relationships between them to enhance prediction accuracy.
3) Dynamic Spatial Relations of Companies
Various factors are considered as relations between the stock prices of different companies, and we used an STGNN model to capture the dynamic features of stock prices.
","uni":"ql2505","language":"Python/Jupyter/Airflow","pid":"202312-14","m4uni":"","analytics":"1) Spatio-Temporal Graph Data Generation
Creating the Spatio-Temporal Graph involved constructing relationships between companies based on news correlations and stock price movements.
2) Spatio-Temporal Stock Price Data Analysis
We developed robust models capable of capturing both the temporal dynamics of individual stocks and the spatial relationships between them to enhance prediction accuracy.
3) Dynamic Spatial Relations of Companies
Various factors are considered as relations between the stock prices of different companies, and we used an STGNN model to capture the dynamic features of stock prices.","m4lname":"","industry":"Finance","m3lname":"Feng","dataset":"The relationships between S&P 500 companies were created by ourselves through web crawlers extracting data from Yahoo Finance. The time series stock data was obtained from Finnhub.","m2uni":"xy2590","m2fname":"Xinzi","m3uni":"sf3209"},{"projectname":"Stock Price Prediction and Recommendation","timestring":"Sat Dec 18 01:41:21 2021","m1uni":"zd2263","m2lname":"Xing","m1fname":"Zhibo","m4fname":"","m1lname":"Dai","m3fname":"Zhenrui","description":"The price of stocks is an important part of the financial and trading system. Nowadays, the number of in a tremendously fast speed. The movement of the stock market plays an important role for companies, as it may help them obtain investments and make strategic decisions, so it is always encouraged to take advantage of the observed behavior of shares, which can help avoid instability within a company and even predict its behavior in the near future. Though it is possible to use historical stock market data to make predictions for investment, and there are a tremendous number of websites that display historical stock data, accurate predictions are still very hard to make due to the chaotic nature of stock markets. This is one of the reasons why few tools exist to help customers make predictions.
However, if the number of stocks is reduced to a small number and the period of historical data which would be used to do analysis is long, then the patterns and relations between stock price and numerous parameters can be summarized.
The website should provide three functions: display the current price of specific stocks; predict the price of stocks held by customers and give recommendations; and predict the price of stocks that customers are interested in and give recommendations.","uni":"hx2302","language":"Python, Java, React, JavaScript, Jupyter, Idea","pid":"202112-2","m4uni":"","analytics":"Linear Regression, LSTM, ARIMA
SpringBoot
Node.js","m4lname":"","industry":"Finance","m3lname":"Chen","dataset":"Data was fetched from finance.Yahoo.com.
Cloud SQL is used to store historical stock price data and predictions.","m2uni":"hx2302","m2fname":"Huiyan","m3uni":"zc2569"},{"projectname":"Esports prediction","timestring":"Sat Dec 22 22:06:30 2018","m1uni":"kl3065","m2lname":"Liu","m1fname":"Kaiji","m4fname":"","m1lname":"Lu","m3fname":"Rena","description":"Our project tackles the lack of in-game odds calculations, a prominent problem in the e-sports betting industry. We used data provided by Riot Company from more than 50,000 ranked games of League of Legends, and used three different classification models to evaluate the features. We apply a GBT classifier in our API as an interface for showing results to our users.","uni":"kl3065","language":"We use Apache Spark to train on our data and Dash by Plotly for visualization","pid":"201812-16","m4uni":"","analytics":"We used a Decision Tree classifier, a One-vs-Rest classifier, and a Gradient-Boosted Tree classifier.

For visualization, we let the user input features of a specific in-game status, and the website outputs the corresponding win rate. ","m4lname":"","industry":"Information","m3lname":"Ren","dataset":"I found this data on Kaggle. This dataset takes the most relevant information and makes it easily available for uses such as attempting to predict the outcome of a LoL game and analyzing which in-game events are most likely to lead to victory. Any other data containing similar features could also be used with our software. ","m2uni":"tl2871","m2fname":"Tong","m3uni":"yr2325"},{"projectname":"Listen to Your Weather","timestring":"Sat Dec 18 02:52:01 2021","m1uni":"yy2949","m2lname":"Mu","m1fname":"Yunchen","m4fname":"","m1lname":"Yao","m3fname":"Wen","description":"Our goal is to build an app that can predict the weather in the next few hours and recommend music based on the forecast weather. Traditional weather forecasts are mainly based on numerical methods, which are computationally demanding and limited in achieving higher accuracy. Unlike the current trend, in which machine learning is used only as an auxiliary method, we are trying to predict the weather mainly with deep learning methods to accelerate the prediction process and improve accuracy. For the music recommendation, there is increasing demand for recommendations based on the user's context, in which real-time temperature is an important element influencing people's music preferences at the moment. Considering that existing music recommendations are mostly daily-based, we plan to introduce a function that can recommend music based on weather that changes every hour. ","uni":"yy2949","language":"Python, PySpark, HTML, CSS, Flask, TensorFlow","pid":"202112-35","m4uni":"","analytics":"Algorithm: LSTM, ARIMA, Transformer
System Modules: backend (weather prediction, music recommendation), frontend(user interface)
Visualization: weather prediction: time series plots, loss curves; websites: display weather and recommended music","m4lname":"","industry":"Information","m3lname":"Zhan","dataset":"The datasets we use to predict weather are from OpenWeather API. The training data is the hourly weather data in New York from 1979 to now. The data for prediction is New York's realtime weather data in the past 48 hours. The data used for music prediction are user's historical listening records from Spotify API. ","m2uni":"dm3686","m2fname":"Di","m3uni":"wz2539"},{"projectname":"Evaluating Higher Order Thinking of Models with Emerging Preferences","timestring":"Wed May 13 04:48:03 2026","m1uni":"aa5844","m2lname":"Agarwal","m1fname":"Arin","m4fname":"","m1lname":"Agarwal","m3fname":"","description":"Objectives: This project builds a culinary AI agent that develops stable, interpretable preferences through experience, without being told what to prefer. The agent maintains a WorldModel (ingredient-level taste weights) and SelfModel (high-level preference abstractions) that update online after every recipe generation. We operationalize Higher-Order Thought (HOT) theory from consciousness philosophy and test empirically whether the agent's learned preferences satisfy three HOT criteria: consistency, cross-context generalization, and causal self-knowledge.

Innovations: Unlike standard LLM-based recipe systems, this agent learns from its own outputs rather than human feedback signals. We integrate retrieval-augmented generation over 50+ cookbooks and flavor chemistry data with LoRA fine-tuning to encode preferences at the weight level rather than solely in prompt context. The WorldModel/SelfModel architecture maps directly onto the CoCoMo computational consciousness framework (receptor, unconscious, conscious, effector modules), positioning this as one of the first empirical tests of HOT theory in a deployed language model system.

Capabilities: The agent generates constraint-satisfying recipes, evaluates them using a multi-signal flavor scoring framework combining odorant compound pairing, flavor balance, retrieval co-occurrence, and language model judgment, and updates its preference state in real time. It can describe its own learned preferences in ways that systematically differ from how it characterizes a generic chef's choices, providing a measurable second-order self-representation.

Why it matters: Most AI consciousness research remains theoretical. This project contributes a concrete, reproducible operationalization of HOT theory using a working system, bridging philosophy of mind and AI systems research in a way that generates empirically testable predictions and falsifiable hypotheses.

","uni":"aa5844","language":"Language: Python, Groq API, Llama 3.3 70B, Meta-Llama 3.1 8B Instruct, Hugging Face Transformers, PEFT, BitsAndBytes, Google Colab (GPU)","pid":"202605-19","m4uni":"","analytics":"Analytics: WorldModel online learning with gradient-free ingredient weight updates, multi-signal taste scoring combining odorant compound pairing, flavor balance across five taste modalities, and constraint satisfaction. Statistical evaluation via Mann-Whitney U and Wilcoxon signed-rank tests for cross-mode comparison.

Algorithms: LoRA fine-tuning with binary penalization signal over 900 gradient updates, fuzzy ingredient-odorant lookup with token overlap fallback for robust flavor graph traversal. Exponential moving average updates for SelfModel preference dimensions (spice preference, health bias, cuisine affinity).

System Modules: Dual FAISS retrieval pipeline over recipe and flavor pairing corpora with BGE-large embeddings, paired with a bipartite flavor graph for chemistry-grounded ingredient suggestions. Multi-trial experiment runner supporting cold start, warm start, and preference stability configurations across four agent modes (baseline, WorldModel only, SelfModel only, full model).

Visualization: Ingredient avoidance bar charts comparing baseline vs. SFT + LoRA with substitution-adjusted counts, preference trajectory plots tracking spice preference and health bias over training runs. Cross-context generalization and chef vs. self divergence charts for HOT behavioral test results.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Dataset: This project draws on four distinct data sources spanning recipe text, flavor chemistry, and experimental interaction logs.

Internet Archive Cookbook Collection: A corpus of 50+ digitized cookbooks sourced from the Internet Archive, covering cuisines ranging from Italian and French to Thai and Japanese. Raw text was extracted, chunked into overlapping passages, embedded using a BGE sentence transformer model, and indexed in a FAISS vector store for retrieval-augmented generation at inference time.

The Flavor Bible (Page and Dornenburg, 2008): A curated ingredient pairing reference widely used in professional culinary practice. This text was processed into a second RAG corpus providing contextual knowledge about ingredient compatibility, traditional pairings, and flavor affinity relationships that complement the odorant-chemistry approach.

FlavorDB (Garg et al., 2018): A publicly available database cataloging the volatile odorant compounds responsible for the aroma of foods. Each food item is associated with a set of odorant molecules and their chemical descriptors. We constructed two structured mappings from this dataset: a food-to-odorant map and an odorant-to-food map, which together form a bipartite flavor graph used for chemistry-grounded ingredient pairing and recipe scoring.

Experimental Interaction Logs: Over 300 structured interaction logs generated during agent training runs, capturing recipe outputs, taste scores, and WorldModel/SelfModel weight snapshots after each interaction. This dataset is novel to this work and serves as the primary source of evidence for the HOT behavioral tests.

Other Data the System supports: The pipeline is designed to accommodate any structured ingredient-odorant database, any plain-text recipe corpus that can be chunked and embedded, and any JSON-formatted interaction log following the run record schema defined in the codebase. The RAG indices can be rebuilt against new cookbook collections or domain-specific culinary texts with no architectural changes required.
","m2uni":"aa5605","m2fname":"Anika","m3uni":""},{"projectname":"Web Application for Movie Recommendation","timestring":"Fri Dec 17 23:19:31 2021","m1uni":"zl2954","m2lname":"Chen","m1fname":"Zihao","m4fname":"","m1lname":"Luo","m3fname":"Yazhe","description":"Movies are one of the most common ways for people to spend their leisure time. There are numerous movie websites where people can go to watch movies if the users know the name or the actors/actresses of the movie. However, a limited number of them are able to recommend movies to the users accurately and user-friendly. Therefore, we want to make a website application that is capable of recommending movies to the users accurately in a simple way. Thus, we decided to design a system, where people can input a movie name, and then will receive their several movie recommendations shown in a way as an image. ","uni":"zl2954","language":"HTML, CSS, JavaScript, Java, Python, Spring MVC, Jupyter Notebook","pid":"202112-11","m4uni":"","analytics":"Item-Based Collaborative Filtering,
User-Based Collaborative Filtering,
Top-N Recommendation Analysis ","m4lname":"","industry":"Media","m3lname":"Yan","dataset":"The dataset we used is a public dataset named MovieLens. We downloaded the data from its official website. After training and testing, our application supports the text input on the website of any movie name. ","m2uni":"wc2794","m2fname":"Wang","m3uni":"yy3177"},{"projectname":"A Web App for Food Recognition and Nutrition Visualization","timestring":"Sat Dec 22 06:06:07 2018","m1uni":"by2267","m2lname":"Yang","m1fname":"Boyu","m4fname":"","m1lname":"Yang","m3fname":"","description":"People share their life and delicious food every day in social networking platform, such as Facebook, Instagram and so on. Meanwhile, they pay great attention on their health, telling health apps what they eat and how much they eat every day. This project designs a web application of food recognition based on convolutional neural network, and analyzes the nutrition information of each category with visualization using images and diagrams, and finally recommending users with scientific diet plan. Food-11 dataset is used for classification model training and web visualization performance evaluation.","uni":"by2267","language":"Python; HTML; TensorFlow Keras","pid":"201812-27","m4uni":"","analytics":"A structure of Convolutional Neural Network: LeNet-5;
Web application building for food recognition on Food-11, enabling model loading, image uploading, food category recognition and nutrition analysis;
Diagrams and images visualization on several web pages using HTML. ","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset we use is Food-11 dataset. It totally contains 16,643 food images, which are divided into three parts. Training dataset includes 9,866 images, validation dataset includes 3,430 images and evaluation dataset includes 3,347 images. There are 11 food categories, which are Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. The total size of the dataset is about 1.16 GB. ","m2uni":"ky2398","m2fname":"Kehan","m3uni":""},{"projectname":"Cathey: An Offline Voice-Controlled Smart Home Assistant on Raspberry Pi 5 with Four-Layer Memory and a Hybrid Rule-Based / Quantized LLM Pipeline","timestring":"Wed May 6 01:50:53 2026","m1uni":"yc4653","m2lname":"Du","m1fname":"Yiwen","m4fname":"","m1lname":"Chen","m3fname":"Hailin","description":"We present Cathey, a fully offline voice-controlled
smart home assistant that runs on a Raspberry Pi 5. Most
commercial assistants such as Alexa or Google Home rely on
the cloud, which raises privacy concerns and requires a stable
internet connection. Cathey does everything on-device: speech-to-text, intent parsing, dialogue management, and real hardware
control.","uni":"yc4653","language":"Python","pid":"202605-36","m4uni":"","analytics":"The pipeline uses Whisper for STT, a 3B Qwen2.5
model in GGUF Q3_K_M format through llama-cpp-python
for intent parsing, and Piper for TTS. A rule-based fast path
handles unambiguous commands in under 5 ms so the LLM is
only called for ambiguous or open-ended speech. To support
natural conversation, a four-layer memory system (working,
episodic, semantic, procedural) lets the assistant remember user
preferences and learn repeating patterns. We fine-tune the LLM
with LoRA on a custom dataset of 2250 labelled (utterance,
JSON) pairs covering four intent types. New in this version: a
WS2812B 12-LED ring with five-level color temperature control
(1=6500K daylight to 5=2700K candlelight), a PWM fan for AC
simulation, a second stepper motor for window control, and
a fuzzy wake-word detector. The rule-based layer also gains
state-aware relative color temperature adjustment (“warmer”
and “cooler” commands), and curtain-opening qualifiers (“a
little”=20%, “halfway”=50%, “most of the way”=80%). On a
20-case benchmark across five quantization variants, the chosen
3B Q3_K_M model reaches 85% intent accuracy with an average
latency of 3.9 s, while the rule-based path keeps direct commands
near-instant","m4lname":"","industry":"Life Science","m3lname":"He","dataset":"We fine-tune the LLM
with LoRA on a custom dataset of 2250 labelled (utterance,
JSON) pairs covering four intent types. New in this version: a
WS2812B 12-LED ring with five-level color temperature control
(1=6500K daylight to 5=2700K candlelight), a PWM fan for AC
simulation, a second stepper motor for window control, and
a fuzzy wake-word detector.","m2uni":"hd2592","m2fname":"Hanzhen","m3uni":"hh3185"},{"projectname":"Dynamic Networks Supporting Memory in the Human Brain","timestring":"Fri Dec 13 14:43:30 2019","m1uni":"seq2102","m2lname":"","m1fname":"Salman","m4fname":"","m1lname":"Qasim","m3fname":"","description":"The goal of this project was to characterize individual brain networks that support successful memory encoding and retrieval. It uses a huge dataset of brain data and graph analysis techniques to isolate the nodes and edges that are important for human cognition. This is important because manipulating these networks may lead to improved cognition and memory. ","uni":"seq2102","language":"Python, HTML, Google Cloud Platform","pid":"201912-26","m4uni":"","analytics":"We used extensive signal processing and statistics, graph analysis, and HTML (Flask) web app visualization. We containerized all of this in Docker and deployed via Kubernetes. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset tested was a proprietary collection of intracranial brain recordings collected by our lab. Our software could support surface EEG data as well, which is readily available.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Credit Card Overdue Analysis and Data Visualization","timestring":"Fri Dec 16 05:11:33 2022","m1uni":"rw2959","m2lname":"Huang","m1fname":"Ruobing","m4fname":"","m1lname":"Wang","m3fname":"Yufan","description":"The first goal is visualization. We used various charts to provide users with a straightforward relationship between the credit card overdue status and the applicants’ personal status. We will show these charts later in the demo.
The second goal is classification and prediction. Two MLP classifiers were trained to classify the applicants’ overdue status and overdue timespan. Then, a 2D sigmoid activation function merges the outputs of both classifiers and produces a final probability of getting a credit card.
The third goal is webpage deployment. Flask is not only a popular web framework, but it can also easily integrate machine learning models and other useful Python tools into the web application. Based on these characteristics, Flask enables us to return fast predictions to users on a web page.
","uni":"rw2959","language":"html and python","pid":"202212-1","m4uni":"","analytics":"Visualization is implemented. We provide a straightforward relationship between the credit card overdue status and applicants’ personal information using Pyecharts.
The system is fully deployed on the cloud using three Google Cloud services. First, the computing and visualization tasks are executed by PySpark on a Dataproc instance with 3 servers. After the visualization results are generated, they are stored in cloud storage as HTML files. Lastly, the Flask project is deployed on a cloud computing instance and provides a web application to users on the internet.
As for the algorithm, to achieve the goal of prediction, we compile, train, and evaluate machine learning models. We use Scikit-learn to plot correlation matrices for the features and confusion matrices for the results. Finally, we merge the outputs of both classifiers to give an overall probability of credit card approval.
","m4lname":"","industry":"Finance","m3lname":"Luo","dataset":"There are two tables in our dataset which can be merged by applicant ID.
In the first .csv file (namely application_records), each applicant has a unique application ID and corresponds to 17 different features representing the possible factors that may influence their credit status.
In the second .csv file (namely credit_records), we have multiple records for every single applicant. Each record describes the applicant's credit status for the specified month.","m2uni":"lh3158","m2fname":"Lijie","m3uni":"yl5086"},{"projectname":"Expression AI","timestring":"Tue May 5 22:05:46 2026","m1uni":"sb5214","m2lname":"Vyasamudri","m1fname":"Shreya","m4fname":"","m1lname":"Banga","m3fname":"Sanjita Chandan","description":"The goal of Expression AI is to build a real-time system that detects and predicts user emotional escalation in conversations with AI, and dynamically adapts responses to de-escalate negative interactions. Unlike traditional sentiment analysis, this system focuses on predicting future emotional states (2 turns ahead) and intervening before escalation occurs.

Innovations & Capabilities:

Predictive emotional modeling: Uses a rolling-window model to forecast user frustration before it peaks
Closed-loop adaptive system: Combines prediction → explanation → intervention → evaluation in a continuous loop
Explainability-driven intervention: SHAP identifies key drivers of escalation, which directly inform response strategies
Real-time response steering: Injects intervention strategies into a generative model (Mistral-7B) at runtime
A/B evaluation framework: Quantifies effectiveness of interventions using measurable emotional improvements

Why this is important:
Current AI systems are reactive and static—they detect sentiment but do not anticipate or adapt. This toolkit enables emotionally intelligent AI systems that can improve user experience, reduce frustration, and make human-AI interaction more effective and trustworthy.","uni":"sb5214","language":"Programming Languages: Python (primary language for modeling, data processing, and system integration) Platforms & Tools: PyTorch (BiLSTM model) XGBoost (prediction model) HuggingFace Transformers (GoEmotions BERT) Mistral-7B-Instruct (LLM for response generation) Streamlit (interactive dashboard UI) Plotly (visualizations) SQLite + Parquet (data storage)","pid":"202605-11","m4uni":"","analytics":"1. Emotion Annotation Module

Model: GoEmotions BERT
Output: 28 emotion probabilities → mapped to VAD vectors
Enables continuous emotional tracking across conversation turns

2. Escalation Prediction Models

XGBoost (primary model):
Input: 14 engineered rolling-window features (user + AI behavior)
Window: last 4 turns
Output: probability of escalation 2 turns ahead
Achieved ~0.82 AUC
BiLSTM (comparison model):
Input: raw VAD sequences
Captures temporal dependencies
Achieved ~0.78 AUC

3. Feature Engineering

User features: valence trends (mean, slope, delta, variance)
AI features: response patterns (length, apology rate, refusal rate, hedging)

4. Explainability Module

SHAP TreeExplainer
Identifies top contributors to escalation (e.g., valence drop, frustration signals)
Drives intervention logic

5. Intervention Engine (Core System Contribution)

Rule-based + SHAP-driven strategies:
Acknowledge frustration
Reduce verbosity under high arousal
Offer alternatives instead of refusals
Injected into LLM system prompts dynamically

6. Generative Response Module

Model: Mistral-7B-Instruct
Produces adaptive responses based on intervention strategies

7. Evaluation & Analytics

A/B Testing: baseline vs intervention responses
Metrics:
Δ Valence (emotional improvement)
% improvement (~72%)
Statistical significance (paired t-test, p < 0.05)

8. Visualization System

UMAP for emotion clustering
Interactive dashboards (Streamlit + Plotly)
Calibration curves and precision-recall analysis","m4lname":"","industry":"Information","m3lname":"Ballapur","dataset":"Primary Dataset:

WildChat-1M dataset (subset of ~50,000 conversations)
Multi-turn, real-world chatbot conversations
Multilingual data (primarily English with global diversity)

How it was obtained:

Accessed via HuggingFace streaming API for scalable data ingestion

Additional Data Processing:

Each conversation turn was annotated using GoEmotions BERT (28 emotion classes)
Emotions were mapped into VAD space (Valence, Arousal, Dominance) for continuous modeling

Other supported data:
The system is designed to generalize to:

Any multi-turn conversational dataset (customer support chats, chatbots, etc.)
Real-time streaming chat data
Multilingual conversations (as long as emotion annotation models are available)","m2uni":"av3329","m2fname":"Anubha","m3uni":"sb5216"},{"projectname":"ASSAYING MSD","timestring":"Sat Dec 18 02:06:20 2021","m1uni":"km3702","m2lname":"Kasulla","m1fname":"Karpagam","m4fname":"","m1lname":"Murugappan","m3fname":"","description":"Reasons for music analysis:
1. Recommendation systems
2. AI Assistance / bots
3. Song remixing
4. Entertainment industry

Objective: Analyze Million Song Dataset: - geographic information, lyric sentiments, correlation between attributes; perform trend analysis with respect to year, that is, identify common music acoustics or features across years and gain insights on how this may evolve in future

Toolkits used: PySpark MLlib package for training models to predict year; visualization packages

Innovation: used the extended version of MSD - in terms of Timbre feature - its mean and covariance (Around 90 attributes for the music tracks) to identify similar patterns across years and trained three types of models to learn these patterns: - regression, MLP classifier and K means clustering

Capabilities: This type of analysis can further be extended to other music acoustics like bars and beats which have a fixed size vector for each segment in a music track. Repeating this helps us to understand the correlation (if it exists) between year and these acoustics and learn patterns in them year wise.

","uni":"km3702","language":"Python","pid":"202112-53","m4uni":"","analytics":"System Modules: -
1. Data extraction and cleaning
2. Data analysis and visualization
* Lyrics
* Geographic information
* Correlation between attributes
* Music acoustics
3. Train model to predict year - expected to learn patterns in year - year wise feature analysis
* Linear regression; Multinomial logistic regression
* Multi-Layer Perceptron Classification
* K Means Clustering
4. Evaluation metrics
* Regression - Accuracy; precision; recall; F1 score; true positive rate; false positive rate
* MLP Classification - Accuracy; precision; recall; F1 score; log loss; true positive rate; false positive rate
* K Means - silhouette score using squared Euclidean distance - range [-1, 1]; the closer to 1, the more stable and confident the clusters formed
5. Testing and selecting the best model
6. Best model prediction visualization - should show patterns year wise or period wise

Python packages used: -
1. tables package - to access H5 files
2. pyspark.ml.regression.LinearRegression
3. pyspark.ml.classification.LogisticRegression
4. pyspark.ml.classification.MultilayerPerceptronClassifier
5. pyspark.mllib.clustering.KMeans
6. matplotlib and seaborn package - bar graphs; scatter plots; box plots; line plots
7. GeoPandas and GeoPlot package
8. numpy package
9. pandas package
10. textblob - lyric sentiment
","m4lname":"","industry":"Media","m3lname":"","dataset":"Million Song Dataset - tar gz file of H5 files - http://millionsongdataset.com/pages/getting-dataset/
YearPredictionMSD - txt file format - part of UCI Machine Learning Repository - http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

These are publicly available datasets.","m2uni":"amk2358","m2fname":"Arya","m3uni":""},{"projectname":"Movie Recommender","timestring":"Sat Dec 22 05:07:56 2018","m1uni":"yz3397","m2lname":"Liu","m1fname":"Yimeng","m4fname":"","m1lname":"Zhang","m3fname":"Kailing","description":"Almost everyone likes to watch movies. With the cost of filming falling and more and more people investing in filming, there are now millions of movies, and people have many choices, but it also makes it harder for them to find the movies they really want to watch. Imagine working hard for a week and just wanting to watch a movie, immerse yourself in it, and get away from the hustle and bustle of the world. But a bad movie will ruin it all. So people urgently need movie recommendation software. It will recommend the most acclaimed movies to users and also search for movies similar to their interests based on the information they have entered. With this recommender, life will be simple and beautiful.
","uni":"yz3397","language":"Python, JavaScript","pid":"201812-8","m4uni":"","analytics":"1.Weighted Rating 2. User-User Cosine Similarity Analysis. 3. TFITF Overview Similarity Analysis. 4. TFITF Weighted Cast, Director & Genre Similarity Analysis 5. Keras Deep Learning","m4lname":"","industry":"Media","m3lname":"Chen","dataset":"Netflix Prize Data and The Movies Dataset are used. We get them from kaggle. ","m2uni":"sl4401","m2fname":"Shiyu","m3uni":"kc3211"},{"projectname":"RealTime Victory Predictor (RVP): Dynamic In-Game Outcome Forecasting","timestring":"Fri Dec 15 19:08:45 2023","m1uni":"ws2685","m2lname":"Zhang","m1fname":"Wen","m4fname":"","m1lname":"Song","m3fname":"","description":"The main objective of this RealTime Victory Predictor (RVP) project is to predict the outcome(win/loss) of a LoL 5v5 match game.

As for innovations, our project not only focuses on player statistics but also attaches importance to the impact of game objectives and events on the outcome, such as the first Baron, Dragon, tower, and so on. This provides us with a fresh perspective on the factors that potentially influence the game outcome.

Our project delivered the RealTime Victory Predictor (RVP), an ML-based system designed to predict the outcomes of esports matches in real time. The system integrates player performance, team stats, and in-game events into its predictive model, adapting to the fluctuating nature of competitive play. This system not only offers a tool for enhancing esports broadcasting and strategic planning but also contributes to the understanding of decision-making in competitive gaming through advanced analytics.
","uni":"ws2685","language":"Python, Streamlit","pid":"202312-7","m4uni":"","analytics":"Technologies:
Data preprocessing and analysis using Python Pandas
Utilized various prediction models with Scikit-learn
Ensemble of 9 models for the final prediction
Data visualization using Seaborn, Plotly and Matplotlib to present prediction results
Interactive front-end web page UI built with Streamlit, making live predictions from live user input

Models:
Logistic Regression, Decision Tree, Random Forest, Naive Bayes, Gradient Boosting, Multilayer Perceptron, Linear Support Vector Machine, One-vs-the-rest, K-Nearest Neighbor

System:
Data Acquisition: Retrieving match data from the Riot API, focusing on top-tier North American players.
Data Preprocessing: Cleaning the raw JSON data using Python Pandas for optimal model performance.
Model Training & Ensemble: Training various machine learning models with Scikit-learn and combining them for improved accuracy.
Real-Time Prediction: Capability to make instant predictions during live matches, adapting to ongoing game dynamics.
Interactive Web Interface: Using Streamlit to build an interactive webpage for user interaction, allowing live input and instant prediction.

Visualization:
Histogram, Pie Chart, Correlation Heat Map.
","m4lname":"","industry":"Media","m3lname":"","dataset":"Our dataset is fetched by using Riot API(https://developer.riotgames.com/). The region that we focused on is mainly the North America region and the game mode is mainly ranked team 5v5 matches.

1. Select target player (Around 7k) [challengers, master, grandmaster]
LEAGUE-V4
2. Fetch player’s puuid (Around 7k)
SUMMONER-V4
3. Fetch the list of each player's 10 ~ 20 most recent finished matches (Around 70~140k)
MATCH-V5 by puuid
4. Remove duplicates and fetch match information (Around 50~110k)
MATCH-V5 by matchid","m2uni":"zz2980","m2fname":"Zheyu","m3uni":""},{"projectname":"Deep Video Understanding","timestring":"Sat Dec 17 05:36:12 2022","m1uni":"yd2616","m2lname":"Li","m1fname":"Yifei","m4fname":"","m1lname":"Dong","m3fname":"","description":"To understand human entity relationships in long-form media like movies.","uni":"yd2616","language":"python tensorflow open-cv","pid":"202212-12","m4uni":"","analytics":"CLIP","m4lname":"","industry":"Information","m3lname":"","dataset":"TRECVID 2022 dataset","m2uni":"ll3466","m2fname":"Linquan","m3uni":""},{"projectname":"Autonomous Learning: from Large-Scale Data without Annotation","timestring":"Sat May 16 03:01:32 2020","m1uni":"yh3223","m2lname":"Kumar","m1fname":"Yangchen","m4fname":"","m1lname":"Huang","m3fname":"","description":"Semi-supervised learning is a branch of machine learning that leverages unlabeled data since labeling data is expensive. It is recently gaining a lot of research attention especially for classification problems. Therefore, to gain a solid understanding and practical exposure to semi-supervised learning (SSL), we implement a generalized semi supervised learning algorithm to solve classification problems. All in all, we chose an appropriate SSL model architecture, implemented the various SSL methods, and classified images as accurately as we could using the WideResNet as baseline. On STL-10 we obtain a worthy accuracy of 74%, and on CIFAR-10 with just 500 labeled examples we correctly classify 80.3% of the data. We further discuss our objectives, methodology, and results in more detail throughout this report.
","uni":"yh3223","language":"Python","pid":"202005-29","m4uni":"","analytics":"System Modules:
Input Module
Parameter Tuning Module
SSL Module
WideResNet Module
Real Time Tracking Module
Test and Report Module

","m4lname":"","industry":"Information","m3lname":"","dataset":"STL-10
CIFAR-10
SVHN
SVHN+Extra
CIFAR-100","m2uni":"sk4661","m2fname":" Sachit","m3uni":""},{"projectname":"Optimization Heterogeneous Chip Performance with Fine-Grained Switching by Leveraging Big Data","timestring":"Thu Dec 19 01:55:36 2024","m1uni":"xs2465","m2lname":"","m1fname":"Xuheng","m4fname":"","m1lname":"Song","m3fname":"","description":"Heterogeneous Chip Multiprocessors combine high- and low-complexity cores, offering energy and throughput advantages over homogeneous chips. Traditional coarse-granularity switching methods limit performance and energy efficiency. In addition, fine-granularity switching is more accurate but poses challenges in real-time prediction of core migrations.

Our project aims to develop a machine learning model to predict fine-grained core switching in CMPs using Big Data tools, as well as to optimize system performance by applying Proximal Policy Optimization to dynamically allocate resources between in-order and out-of-order cores. The primary objectives are to maximize key performance metrics like Instructions Per Cycle and Cache Hit Rate while maintaining system stability and scalability.

This research project introduces innovative elements such as custom reward functions that combine multiple metrics to align decisions with overall system goals, the use of PPO’s clipped objective for smooth policy updates, and interactive visualizations that provide actionable insights into system behavior pre- and post-optimization.

Our toolkit enables data-driven decision-making, scalable implementation for complex architectures, and detailed performance analysis through advanced visualization. This work is important as it addresses the critical challenge of balancing efficiency and performance in modern computing systems, paving the way for further advancements in resource optimization and multi-objective performance tuning.","uni":"xs2465","language":"Python, JavaScript","pid":"202412-24","m4uni":"","analytics":"Our system leverages advanced analytics to understand system performance:
- Correlation Analysis: Identifies relationships between IPC, Cache Hit Rate, Memory Usage, and rewards.
- Trend Tracking: Monitors IPC, cache efficiency, and rewards over iterations to evaluate optimization improvements.

Algorithms
- Proximal Policy Optimization: Dynamically allocates resources between in-order and out-of-order cores. Utilizes a clipped objective for stable updates and a custom reward function to align decisions with system goals.
- Visualization Algorithms: Generate histograms, scatter plots, heatmaps, and bubble charts to visually analyze trends and patterns.

System Modules
- Data Processing Module: Uses Apache Spark for large-scale data preprocessing, cleaning, and feature engineering, ensuring scalability for bigger datasets.
- Reinforcement Learning Module: Built with TensorFlow, implementing PPO for efficient policy optimization and reward maximization.
- Visualization Module: Generates interactive, drill-down visualizations using D3.js, enabling granular analysis of pre- and post-PPO performance.

Visualizations
- Pre-PPO: IPC progression (line chart), IPC vs. Memory Usage (scatter plot), Cache Hit Rate trends (bar chart).
- Post-PPO: Reward distribution histogram, IPC vs. Reward bubble chart, correlation heatmap.
","m4lname":"","industry":"Information","m3lname":"","dataset":"Our dataset was generated using the gem5 simulator, a widely-used open-source tool for system architecture research. It provides detailed metrics such as Instructions Per Cycle, Memory Usage, and Cache Hit Rate, simulating both in-order and out-of-order core behaviors.

We tested workloads from five computational algorithms: FFT, LU Decomposition, Radix Sort, Matrix Multiplication, and Convolution. These represent a mix of compute- and memory-intensive tasks, providing diverse scenarios to evaluate core performance. The dataset includes:
- IPC, Memory Usage, and Cache Hit Rate metrics across iterations.
- Core action probabilities (in-order vs. out-of-order).
- Rewards calculated using a weighted combination of IPC and cache efficiency.

The data was generated by running the workloads in gem5 under realistic configurations. Multiple iterations were simulated, capturing key performance metrics and their evolution over time. The process emulates real-world conditions, providing actionable data for training and evaluating the PPO algorithm. The base algorithms (FFT, LU, etc.) and gem5 are open-source. However, the specific dataset tailored for our project is not yet public. Our methodology supports additional workloads: Benchmarks like SPEC CPU or custom algorithms.

","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Video Human Action Recognition Using Neural Network","timestring":"Sat Dec 18 03:54:52 2021","m1uni":"zz2881","m2lname":"Zhang","m1fname":"Ziqi","m4fname":"","m1lname":"Zhang","m3fname":"Li","description":"Objectives:

In this project, we use CNN, RNN, and LSTM neural networks for human action recognition and compare them by accuracy, cross-entropy loss, and hinge loss.

Innovations:

1. Instead of using dynamic video data, we transform it into matrix data to better fit our model.

2. After training the model, we can also produce a video from the test data that includes the classification result and the tracking of the human body.

3. Our model can track human actions in real time: if the action changes, the model detects the change and updates the result.

Why are these research / toolkits important?

We conclude that the RNN and LSTM networks perform better than the CNN, which suggests that time-aware neural networks are a better fit for action recognition problems.","uni":"zz2881","language":"python/tensorflow/openpose/opencv","pid":"202112-47","m4uni":"","analytics":"Analytics:
1. Tracking of human body features: OpenPose
2. Human action recognition: Neural Network (CNN, RNN, LSTM)

Algorithms:
Neural Network

Visualization:
We use OpenPose to mark the human body and OpenCV to transform the matrix data into the final video","m4lname":"","industry":"Information","m3lname":"Cai","dataset":"
HMDB51:
We chose the HMDB51 dataset as our video source. It contains over 7,000 videos in 51 categories, and each category contains at least 100 videos, including about 70 clips for training and 30 clips for testing. Compared with other datasets, HMDB51 covers a wider range of action classes and also accounts for complicated backgrounds.
Other datasets that our model can support:
UCI
","m2uni":"wz2582","m2fname":"Weicong","m3uni":"lc2928"},{"projectname":"MVP Award Prediction","timestring":"Sat Dec 18 03:27:05 2021","m1uni":"mh4116","m2lname":"Zhou","m1fname":"Mingzhe","m4fname":"","m1lname":"Hu","m3fname":"Zichen","description":"The project is aimed to predict three NBA award winners with probability. We developed the stacked models with streamed data. The prediction is dynamic so we can get real-time predictions. The results are visualized with a user-friendly interface where you should log in before inquiry. We are able to handle high concurrency in player search and award-winning prediction. Daily updated predictions can help to get more accurate and reasonable results. Interface with cache, debounce can defense high-volume requests.","uni":"mh4116","language":"Google Cloud Platform; Python, PySpark, Node.js, React, Express, JavaScript, HTML, CSS","pid":"202112-4","m4uni":"","analytics":"The system consists of front-end, back-end, dataset and algorithm. The algorithm is a series of machine learning models. We visualized our results with CSS, JS, HTML and React. We analyzed our results with google's top search results.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"The dataset comes from three websites, including icon fetch, technical data fetch, and news fetch. We fetched data with Requests, BeautifulSoup and Selenium. Our software can support data from the teams that the player belongs to.","m2uni":"yz3917","m2fname":"Yuting","m3uni":"zw2669"},{"projectname":"Movie Success Prediction and Analysis","timestring":"Sat Dec 18 02:09:30 2021","m1uni":"sa3864","m2lname":"Kumar","m1fname":"Shikha","m4fname":"","m1lname":"Asrani","m3fname":"Arvind","description":"This project aims to predict IMDb ratings of movies using textual data which includes tweets about movies, online reviews and movie metadata -e.g., its running time, genre. 
We tackle the problem of predicting how successful a movie will be in terms of IMDb ratings by analyzing tweets about the movie and other forms of textual data, including movie reviews. Twitter data can be useful for gauging the general public's reaction to the movie, which in turn can help determine the ratings. If the reaction is positive, we expect the movie to be well received (scoring a higher rating), and if the reaction is negative, we expect a lower IMDb rating.

To our knowledge, there has been no previous attempt to use sentiment scores to predict IMDb ratings. Usually box office collections are predicted, which are not very useful to viewers. IMDb ratings on the website during the initial days of release tend to be very high, which does not give a good idea of the movie's quality.
Viewers can use our prediction to decide whether a movie is worth fighting for first-day, first-show tickets without having to see any spoilers on social media!
","uni":"sa3864","language":"Python, Pyspark, Django, D3, bootstrap, Pandas","pid":"202112-36","m4uni":"","analytics":"NLTK for data cleaning
Spark NLP pipeline for sentiment analysis (sentimentdl_use_twitter and sentimentdl_use_imdb)
Spark MLlib - Linear Regression, Ridge Regression, Decision Trees, Random Forest, Gradient Boosted Trees
Visualization - D3, Django
- Correlation between IMDb ratings and sentiment scores
- RMSE metrics for all models
- Data analytics - genre-wise split and sentiment scores per genre.

GCP tools used: BigQuery (storing and querying data), Dataproc (running Spark jobs), and Cloud Storage (storing data).","m4lname":"","industry":"Media","m3lname":"Kanesan Rathna","dataset":"Dataset used:
- IMDB dataset (https://ieee-dataport.org/open-access/imdb-movie-reviews-dataset)
- Twitter streaming data. (Twitter API)

What other data can your software support? Any textual data that can yield review sentiment, e.g., YouTube comments, Reddit posts, or blog posts, or any other movie features.

","m2uni":"ak4581","m2fname":"Asmita","m3uni":"ak4728"},{"projectname":"Personalized Company Research Dashboard","timestring":"Fri Dec 17 20:35:17 2021","m1uni":"sr3767","m2lname":"Govindarajan","m1fname":"Shambhavi","m4fname":"","m1lname":"Roy","m3fname":"Rahul","description":"Stock market investors often require guidance in understanding current market conditions to make wise investment decisions. Our project addresses this concern and is aimed to provide users with centralized dashboard access to understand a company's current market condition using big data principles. For a given company name as user input, we seek to provide to its stock market information, relevant current YouTube videos, and real-time sentiment of its stock in the market. Having access to these three modalities of data with sentiment analysis would be useful for traders to conduct market research on a specific company in depth. ","uni":"sr3767","language":"Python, Django","pid":"202112-46","m4uni":"","analytics":"For a given user input of company name, we queried the Yahoo Finance API to query the latest stock price by the minute. Next, we used the YouTube Data API to retrieve the top 10 news videos of the company searched using the tags: Company name, it's stock ticker, and 'news'. Finally, we queried the Twitter API using these tags to retrieve the latest company news on which we performed sentiment analysis using the TextBlob library.","m4lname":"","industry":"Information","m3lname":"Lokesh","dataset":"We are utilizing data from several sources in our project. This includes financial data collected using Yahoo Finance API to display real-time stock price, video data from YouTube API to filter and display videos, and real-time tweets from Twitter API to perform sentiment analysis.
We have also used the Twitter Tweets Data for Sentiment Analysis dataset to experiment with using a BERT model for sentiment analysis.","m2uni":"sg3896","m2fname":"Saravanan","m3uni":"rl3164"},{"projectname":"LAPD Crime Data: Patterns and Forecasting","timestring":"Fri Dec 19 22:02:48 2025","m1uni":"at3954","m2lname":"Ye","m1fname":"Angela","m4fname":"","m1lname":"Tao","m3fname":"Yewen","description":"This project aims to analyze the spatio-temporal patterns of crime in Los Angeles using historical LAPD crime data, with a particular focus on hourly-level temporal dynamics. Crime occurrence is highly dependent on both time and location, and understanding these patterns is critical for effective planning and resource allocation in public safety.

The primary objective is to develop and evaluate hourly crime forecasting models that can capture noisy and irregular crime behavior. We compare two time-series forecasting approaches, Prophet and SARIMA, to assess their robustness and predictive performance at an hourly resolution. Special attention is given to evaluating model behavior across all areas of Los Angeles and within Central Los Angeles, which exhibits the highest crime density.

By combining advanced time-series modeling with visualization and interactive tools, this project demonstrates how data-driven approaches can support decision-making in government and social science contexts. The methodologies and tools developed in this project provide a practical framework for analyzing and forecasting urban crime patterns, enabling more informed and proactive public safety strategies.","uni":"yl5888","language":"This project was developed primarily using Python for data processing, modeling, and analysis. Key Python libraries include Pandas and NumPy for data manipulation, Prophet and statsmodels for time-series forecasting, scikit-learn for evaluation metrics, and Matplotlib for visualization. For large-scale data processing and aggregation, Apache Spark (PySpark) was used to efficiently handle the full LAPD crime dataset. The interactive forecasting system was implemented using FastAPI for the backend API and HTML, CSS, and JavaScript for the frontend user interface. Spatial visualization was enabled using Leaflet.js for rendering interactive maps.","pid":"202512-7","m4uni":"","analytics":"Analytics:
The project performs spatio-temporal analysis of crime data, focusing on hourly crime frequency patterns across Los Angeles. Analytical tasks include time-series aggregation, temporal pattern extraction, area-level comparison, crime-type frequency analysis, and evaluation of forecasting accuracy using quantitative error metrics.

Algorithms:
Several time-series forecasting algorithms were implemented and compared. A Seasonal Naive model with a 24-hour period was used as a baseline. More advanced models include Seasonal ARIMA (SARIMA) to capture autoregressive and seasonal dependencies, and Prophet to model additive trends and seasonal components with robustness to noise and non-stationarity.

System Modules:
The system consists of multiple modules, including data ingestion and preprocessing, hourly aggregation and feature construction, model training and evaluation, forecast generation, and result storage. An API module built with FastAPI serves forecast results, while a frontend module enables user interaction through a web interface.

Visualization:
Visualization components include time-series plots comparing actual and forecasted crime counts at both daily and hourly levels, error comparison tables, and premise-level hourly pattern plots. An interactive geographic map of Los Angeles was implemented using Leaflet.js, allowing users to query forecasted crime frequency and top crime types by selecting a specific date and hour.","m4lname":"","industry":"Social Science-Government","m3lname":"Li","dataset":"This project uses the LAPD Crime Data from 2020 to Present(https://catalog.data.gov/dataset/crime-data-from-2020-to-present), a publicly available dataset published on Data.gov and maintained by the Los Angeles Police Department. The dataset contains detailed records of reported crime incidents across the City of Los Angeles and is regularly updated to reflect new reports.

The dataset consists of approximately 1 million records with 28 attributes per record. Each entry includes temporal information (date and time of occurrence), spatial information (area, reporting district, latitude, and longitude), and crime-related attributes such as crime code, crime description, premise description, and weapon information.

This dataset is well suited for spatio-temporal analysis due to its large scale, fine-grained temporal resolution, and city-wide coverage. Its public availability and continuous updates make it a reliable source for studying urban crime patterns and developing forecasting models for public safety and government applications.","m2uni":"yy3585","m2fname":"Yibo","m3uni":"yl5888"},{"projectname":"Social Platform Creator Profiling","timestring":"Fri Dec 19 21:25:32 2025","m1uni":"qw2438","m2lname":"Xin","m1fname":"Qinyun","m4fname":"","m1lname":"Wu","m3fname":"Liangke","description":"We aim to build an end-to-end, interpretable creator value recommendation system for Bilibili that can (1) classify creators into high-value vs. low-value groups using multi-perspective creator-level features, (2) explain why a creator is predicted as high/low value, and (3) generate actionable diagnostic feedback and a continuous creator value score for practical decision support.

Unlike follower-count or black-box ranking heuristics, our system integrates behavioral, interaction-volume, and interaction-pattern signals (including danmaku-specific engagement) and uses SHAP-based feature attributions to provide transparent, instance-level explanations. We further propose a unified scoring mechanism that combines classifier confidence with normalized SHAP contributions to produce an interpretable continuous score in [0,100].

Given a creator UID, the system automatically crawls public data, preprocesses and aggregates creator-level features, runs a trained binary classifier, computes SHAP explanations, produces a value score, and displays results in an interactive dashboard with comparative analytics against high-performing benchmarks.

Creator value assessment is inherently multi-dimensional and subjective; stakeholders (brands/MCNs/creators) need not only predictions, but also reasons and improvement directions. Our toolkit bridges predictive modeling with explainable diagnostics, making the results more trustworthy and actionable in real-world creator growth and creator selection workflows.","uni":"qw2438","language":"Backend is implemented in Python. Frontend is a web dashboard (HTML/CSS/JavaScript). ","pid":"202512-2","m4uni":"","analytics":"We implemented creator-level feature engineering (behavior, interaction volume, interaction patterns), PCA + K-Means (k=2) for clustering analysis, and a binary classifier for high/low value prediction with confidence scores. We used SHAP to produce instance-level feature attributions and generated a continuous value score by combining confidence with normalized SHAP contributions. The dashboard visualizes creator statistics, predicted label/score, top positive/negative features, and comparisons to high-performing creators.","m4lname":"","industry":"Media","m3lname":"Wu","dataset":"We tested on a real-world Bilibili dataset collected via public web/API interfaces: 121 creators, 2,420 videos, 168,988 comments, and 1,779,011 danmaku. Data are publicly accessible and do not include private user information. The system can also support other Bilibili creators as long as their public video and interaction data (plays/comments/danmaku) are available.","m2uni":"zx2504","m2fname":"Ziyi","m3uni":"lw3161"},{"projectname":"FIFA Twitter Real-time Sentiment Analysis","timestring":"Sat Dec 24 05:07:53 2022","m1uni":"yw3939","m2lname":"Shi","m1fname":"Anne","m4fname":"","m1lname":"Wei","m3fname":"","description":"Our team aims at using distributed training strategy on local devices to train a BERT large model for sentiment analysis tasks to maximize the quality of the predictions. 
After comparing with various other classification models, we propose to deploy the best model onto PySpark Streaming as well as Tweepy to predict the sentiment implied in Twitter real-time and historical data. ","uni":"yw3939","language":"Python, Google Cloud Platform, JavaScript","pid":"202212-18","m4uni":"","analytics":"We applied various NLP models including LSTM, BERT-base, and BERT-large. Naive Bayes and Transformers are also analyzed and compared.
The visualization types we used include stacked bar plots and line charts. We also implemented an interactive webpage for visualization.","m4lname":"","industry":"Media","m3lname":"","dataset":"3 Hours Streaming data of France and Argentina FIFA tweets on Dec 14 (Generated by our team from Tweepy)
7 Days 6 Top Teams FIFA tweets (Generated by our team from Tweepy)
7 Days France and Morocco FIFA tweets covering their games (Generated by our team from Tweepy)
1 Hour General FIFA topic related tweets data on Dec 14 (Generated by our team from Tweepy)
3 Hour Argentina and Netherlands Live Game Tweets (Generated by our team from Tweepy)
Stanford Twitter Sentiment from Kaggle.","m2uni":"ts3474","m2fname":"Tiancheng (Robert)","m3uni":""},{"projectname":"Multi-Omics Integrated Cancer Analysis","timestring":"Fri Dec 19 16:12:40 2025","m1uni":"zz3370","m2lname":"Ye","m1fname":"Zijie","m4fname":"","m1lname":"Zhao","m3fname":"","description":"We developed a unified framework for multi-omics cancer analysis that can do both cancer type classification and survival prediction using the same latent features. The main idea is to use an Autoencoder to compress high-dimensional multi-omics data (~40,000 features) down to 100 dimensions, then use these compressed features for multiple tasks. This approach is useful because it solves the problem of working with very high-dimensional data while still keeping the results interpretable. We can identify which genes are important and understand the biological mechanisms behind the predictions.

The framework achieved 89.89% accuracy for cancer type classification and good performance for survival prediction (C-index 0.66-0.67). More importantly, it helps bridge the gap between machine learning models and biological understanding by identifying specific biomarkers and mechanisms. We built everything on Google Cloud Platform so it's reproducible and can handle large datasets. This kind of unified approach is important for precision oncology because it allows researchers to use the same data representation for different clinical questions.
","uni":"zz3370","language":"Python, with libraries like pandas, scikit-learn, XGBoost, lifelines, and TensorFlow/Keras. Everything runs on Google Cloud Platform: BigQuery for data extraction, GCS for storage, and Colab/Jupyter notebooks for analysis.","pid":"202512-18","m4uni":"","analytics":"We implemented several key components:
(1) Autoencoder to reduce ~40,000 features to 100 dimensions
(2) XGBoost for cancer type classification
(3) Variance-based feature selection to find important genes
(4) Cox models for survival analysis
(5) Correlation analysis to map latent features back to genes
(6) Kaplan-Meier curves and log-rank tests for validation.
We also built visualization tools for confusion matrices, feature importance plots, survival curves, and heatmaps. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"We used the TCGA Pan-Cancer Atlas dataset, which is publicly available through the GDC portal (https://gdc.cancer.gov/about-data/publications/pancanatlas). We accessed the data through ISB-CGC BigQuery tables on Google Cloud Platform. We selected seven cancer types (BRCA, LUAD, LUSC, COAD, READ, HNSC, KIRC) based on sample size, ending up with 2,717 patients after quality control.

The dataset includes three types of data: clinical information (cancer types, survival times), RNA expression for about 20,000 genes, and somatic mutations. In total we have about 40,000 features. The framework can work with any similar multi-omics dataset that has clinical, expression, and mutation data. It could also be extended to include other data types like DNA methylation or copy number variations.","m2uni":"yy3645","m2fname":"Yixuan","m3uni":""},{"projectname":"CRYPTO RETURN PREDICTION","timestring":"Sat Dec 18 04:50:34 2021","m1uni":"sm4940","m2lname":"Jasti","m1fname":"Siddartha","m4fname":"","m1lname":"Marella","m3fname":"","description":"Goals:

1. Predicting the returns of 14 different cryptocurrencies accurately using Twitter sentiment analysis
2. Enable users to track the real-time forecast of different cryptocurrencies.
3. Provide users with the optimal cryptocurrency portfolio to target their desired returns with minimal risk.

Innovations and capabilities:

Analyzing other cryptocurrencies to measure their impact on a given currency is new in our project. The literature in this domain searches for the best model for financial prediction by trying different models. Our project, using the latest advancements in AutoML and computation power, proposes instead that a dynamic model, which uses all the features available in the time frame to perform feature engineering and model selection in real time, is far better, and shows why. Our web application applies established portfolio theory from the stock market to the crypto world and makes it easy for users to choose the right portfolio for their desired return and risk.

","uni":"sm4940","language":"Languages: python, R and HTML. platforms: GCP, Twitter API, Jupyter notebook, Colab, Cryptocompare API.","pid":"202112-5","m4uni":"","analytics":"The regression algorithms used are linear, logistic, SGD, KNN and LSTM. The algorithms used for regression on H2O are feedforward ANNs, GBMs, and stacked-ensemble models which have higher accuracy when trained to predict far away values. Plotly Dash package is used for python web app development, Quadratic optimization and markowitz principle is used in R for portfolio theory using quantmod and quadprog packages. ","m4lname":"","industry":"Finance","m3lname":"","dataset":"The data set used is from the live Kaggle competition of Crypto prediction for which the data set is sponsored by the G-crypto research organization. The software created is truly flexible across the financial assets. In our demo along with crypto data T-bills are used and definitely stock market can also be used. The software is truly flexible as it uses all open source tools, portfolio theory which can be used across the financial market. The H2O AutoML software used can be used across for feature engineering and model selection","m2uni":"vj2252","m2fname":"Varun","m3uni":""},{"projectname":"DNA, RNA, and Protein Classification","timestring":"Thu Dec 19 16:28:49 2024","m1uni":"tt3010","m2lname":"Chen","m1fname":"Tao","m4fname":"","m1lname":"Tong","m3fname":"","description":"Objectives:
The primary objective of this project is to develop a predictive model capable of classifying DNA, RNA, and protein sequences with high accuracy. By leveraging both traditional machine learning techniques and advanced deep learning models, the project aims to enhance interpretability and precision in biological sequence classification.

Innovations:

Hybrid Methodology: Combining traditional machine learning algorithms such as Random Forest, Logistic Regression, and K-means clustering with state-of-the-art deep learning approaches like CNN, LSTM, and Transformers. This hybrid strategy ensures robust feature extraction and improved prediction accuracy.
Data Optimization: Implementing advanced preprocessing methods, including sequence normalization and augmentation, to handle the challenges posed by biological datasets such as sequence diversity and non-numerical data types.
Novel Model Deployment: Deploying models through a user-friendly Flask web interface, enabling seamless input of sequence data and real-time classification.

Capabilities:

The system can classify sequences into DNA, RNA, or proteins with notable precision, offering practical utility for researchers in bioinformatics and molecular biology.
Deep learning models are fine-tuned to balance accuracy and computational efficiency, achieving competitive validation scores.
Features such as K-fold cross-validation and early stopping ensure model reliability and prevent overfitting.

Why are these research/toolkits important?

Biological Impact: Accurate classification of DNA, RNA, and protein sequences is critical for understanding genetic functions, drug discovery, and molecular diagnostics.
Data Challenges: Biological datasets are often large, diverse, and complex. This project addresses these challenges with tailored preprocessing and hybrid modeling.

Cross-Disciplinary Approach: By integrating machine learning and deep learning techniques, this research bridges computational methodologies with biological insights, paving the way for interdisciplinary advancements.
Scalability: The developed toolkit is versatile and can be adapted to similar classification problems in other domains, demonstrating significant potential for reuse and scalability.
This innovative system not only contributes to the field of computational biology but also demonstrates the power of integrating diverse methodologies for tackling complex scientific problems.","uni":"tt3010","language":"python, jupyter notebook, Flask, JavaScript, Vue","pid":"202412-4","m4uni":"","analytics":"Confusion matrix, pie chart, bar chart, PyTorch, Random Forest, Logistic Regression, K-means, CNN, RNN, LSTM, Transformer","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset tested was a publicly available biological sequence dataset from Kaggle, specifically the Protein Data dataset (https://www.kaggle.com/datasets/aniketravindrasorate/protein-data). The dataset is already organized into training and testing sets. Additionally, the software can support any data sequence that adheres to the CSV column entry format specified by the dataset.","m2uni":"kc3827","m2fname":"Kuangyu","m3uni":""},{"projectname":"Sampling Bitcoin Transactions","timestring":"Fri Dec 17 21:23:16 2021","m1uni":"xz2992","m2lname":"Li","m1fname":"Xingjian","m4fname":"","m1lname":"Zhao","m3fname":"Dennis","description":"Objective: Determine if any sampling methods can produce subgraphs that retain the structure of the original graph so that they can be used for faster and more cost-effective analysis.

Innovations: Implemented a platform that can act as a codebase for exploring this dataset.

Capabilities: BigQuery to address graph, Metrics calculation and visualization, Graph sampling, D-statistic calculation

To our knowledge, this would be the first codebase for this relatively unexplored dataset.
","uni":"xz2992","language":"Scala, Spark, Spark GraphX, Pandas, matplotlib","pid":"202112-32","m4uni":"","analytics":"Algorithms : Random Node, Random Edge
Modules: Graph Builder, Graph Metrics Calculation
Analytics : Kolmogorov's D-Statistic
","m4lname":"","industry":"Information","m3lname":"McWherter","dataset":"dataset: bigquery-public-data.bitcoin_blockchain.transactions
The platform we built is modular. Except for the graph builder, which currently works only with the above dataset, the other functionality is implemented generically and supports any data in the form of GraphX graphs.","m2uni":"jl5421","m2fname":"Nemo ","m3uni":"djm2242"},{"projectname":"Anchor: Designing a Safe and Empathetic AI Agent","timestring":"Tue May 5 20:40:14 2026","m1uni":"rl3445","m2lname":"Wang","m1fname":"Rui","m4fname":"","m1lname":"Lin","m3fname":"Chenrui","description":"Anchor is a safe and empathetic AI support agent designed to provide more structured emotional support than a generic chatbot. The project addresses the challenge that many chatbots can sound warm or polite while still missing the user’s real emotional need, especially in subtle, self-dismissal, or high-risk situations.

Our objective is to build a deployable empathy-support system that separates user understanding, support planning, response generation, validation, and refinement. Instead of treating empathy as only a language-generation problem, Anchor first interprets the user message and recent context, builds a structured user state, maps that state into a support strategy, generates an empathy-focused response, and then checks whether the output is safe and appropriate.

The innovation is that Anchor makes empathy more controllable and inspectable. Conceptually, the system follows a CoCoMo-inspired control structure: perception through classification and understanding, reasoning through support planning, action through response generation, and feedback through validation and refinement. Anchor is intended as an emotional support agent, not a therapist replacement.
","uni":"rl3445","language":"Python 3.11; FastAPI backend; browser-based chat UI; PyTorch, Hugging Face Transformers, and Unsloth for model loading and inference; Qwen2.5-7B with a LoRA adapter for response generation; a GoEmotions-based classifier for emotion classification; Pydantic for API schemas; JSON and CSV files for evaluation cases and results; pandas for evaluation processing; pytest for testing; GitHub for code hosting and collaboration. The full generator runtime is intended for a CUDA/GPU environment, while lightweight modules such as input processing, understanding, support planning, validation, and offline evaluation scripts can be tested locally.","pid":"202605-32","m4uni":"","analytics":"The implemented system includes a supervised GoEmotions-based emotion classifier that returns top emotion labels and confidence scores. These classifier outputs are passed into a structured understanding layer, which adds input processing, PII redaction, intent detection, subtlety detection, scenario tiering, and input-side safety signals.
A support-planning module maps the structured user state into response goals, response acts, constraints, avoid rules, safety notes, and repair priorities. The response generator uses Qwen2.5-7B with a fine-tuned LoRA adapter to produce an initial empathetic reply.
After generation, a failure-aware validation module checks for response problems such as shallow empathy, AI self-experience claims, missed self-dismissal, unsafe crisis handling, over-advice, and ignored user boundaries. If needed, a conditional refinement step repairs the response and re-validates it. The system also includes a FastAPI runtime path and browser chat UI, making emotion labels, support plans, validation results, failure types, timing information, and refined outputs more inspectable.
","m4lname":"","industry":"Information","m3lname":"Yan","dataset":"The project uses a combination of public empathy/dialogue datasets, supervised emotion-label data, team-authored safety examples, and curated offline evaluation cases.

For response generation and empathy style, we use public dialogue and support datasets including EmpatheticDialogues for emotional support style, ESConv for explicit support strategies, DailyDialog for natural everyday conversation, and CounselChat for counseling-inspired support examples. We also added 750 hand-written safety examples authored by the team to improve handling of crisis-sensitive or unsafe-response situations. Together, these form approximately 15K mixed training examples for the empathy-support generator.

For emotion classification, the system uses GoEmotions as the primary supervised emotion-label source. The classifier returns top emotion labels and confidence scores, which are then used as metadata by the structured understanding and support-planning modules. EmpatheticDialogues is also used for domain adaptation toward emotionally grounded dialogue, and a custom hard-example set is used to test difficult cases.

For evaluation and reliability testing, we created a curated case bank in the repository. These cases cover common emotional inputs, subtle or implicit emotions, self-dismissal, high-risk inputs, and safety-edge scenarios. The case bank can be used both for offline validation and for batch testing through the live /chat backend, where outputs are saved as CSV files for failure analysis and refinement. The software also supports arbitrary free-form user messages through the chat interface, as well as additional JSON or CSV evaluation cases added later.
","m2uni":"jw4355","m2fname":"Jiayi","m3uni":"cy2829"},{"projectname":"AI Companion:Emotion Recognition with Audio-visual data","timestring":"Sat May 16 03:44:57 2020","m1uni":"ka2744","m2lname":"Nandanahosur Ramesh","m1fname":"Kavita","m4fname":"","m1lname":"Anant","m3fname":"","description":"A. Design a real-time emotion recognition (ER) system using facial expressions and speech as input modalities.
• Implement a pipeline for Facial Emotion Recognition (FER) – face detection and tracking, face recognition, experiment with feature extraction methods – facial landmarks, CNN based features.
• Implement a pipeline for Speech Emotion Recognition (SER) – preprocess, pre-train and extract CNN based features.
• Train and implement optimized model with attention mechanism to recognize emotions – Anger, disgust, fear, happy, neutral, sad in real-time.
• Evaluate model with different audio-visual datasets.

B. Integrate emotional cues from ER system into conversational agent to guide responses to be empathetic and context aware
• Our hypothesis is that leveraging information from both speech and facial expressions (as opposed to using only a single modality) can significantly improve ER accuracy when tested in real time.","uni":"ka2744","language":"Python, TensorFlow, Keras, OpenCV, PyAudio, librosa, dlib","pid":"202005-1","m4uni":"","analytics":"Neural Networks, matplotlib, transfer learning (VGG16)","m4lname":"","industry":"Information","m3lname":"","dataset":"1. FER2013 static emotion database: contains 35,887 static images taken from the wild. Used for fine-tuning the face recognition network
2. eNTERFACE’05 Audio-Visual Emotion Database: 42 subjects, 1166 video sequences
3. Used to train and evaluate the CNN-RNN pipeline for video and audio emotion tasks
4. SAVEE (Surrey Audio Visual Expressed Emotion): 4 subjects, 480 sequences used for training and evaluation of the audio pipeline
5. RAVDESS: 24 subjects, 1440 audio samples used for training and validating the audio model
6. DAILY DIALOG: 220,579 conversational exchanges between 10,292 pairs of movie characters involving 9,035 characters from 617 movies
","m2uni":"rn2486","m2fname":"Raksha","m3uni":""},{"projectname":"Real-time Weather Forcasting using Autonomous Deep Learning","timestring":"Mon May 3 10:33:58 2021","m1uni":"sy2629","m2lname":"","m1fname":"Shijia","m4fname":"","m1lname":"Yan","m3fname":"","description":"This project aims to build a weather forecasting service that achieves relatively high accuracy while using far fewer resources than traditional numerical weather prediction methods. With autonomous deep learning, the model can adapt to various input data formats, making it flexible enough to use whatever weather data is available and to cover as many locations as possible. The project also aims to explore the capability of autonomous deep neural networks in weather forecasting, and to build an online platform that provides machine-learning-enabled data analytics for users.","uni":"sy2629","language":"Python; HTML; Linux;","pid":"202105-3","m4uni":"","analytics":"Autonomous deep learning algorithms were used to predict weather using data from weather APIs. The service back end and front end were connected by a Python application built with Flask. This application will run on a cloud server to carry the results from the ML model to the front-end website. The website was built using Bootstrap, HTML, and D3.js for the user interface and data visualization.","m4lname":"","industry":"Information","m3lname":"","dataset":"Training data was obtained from NOAA. The data were downloaded directly from the website and reformatted into CSV files. The testing dataset came mainly from two APIs: visualcrossing and openweather. After creating free accounts and obtaining API keys, the data were retrieved by sending requests to these APIs from Python.
The returned data were in JSON format.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Real Estate Market Visualization (HomeViz)","timestring":"Fri Dec 13 18:17:11 2019","m1uni":"eys2117","m2lname":"","m1fname":"Enoch","m4fname":"","m1lname":"Shum","m3fname":"","description":"This tool enables first-time home buyers to visualize housing price trends at a macro scale, assisting in decisions when buying a home or considering a relocation, or simply visualizing the real estate market in the US.

Visualization of home prices in United States with an interactive interface.
Scope: 2 different zoom levels:
Level 1 - State
Level 2 - County
At each level, the interface will display information relating to the region’s home statistics.
","uni":"eys2117","language":"Python, HTML, Javascript, D3.js","pid":"201912-47","m4uni":"","analytics":"Django, TopoJSON (an extension of GeoJSON), Google Cloud (server and storage), Bitnami, BigQuery (database), D3.js (visualization)","m4lname":"","industry":"Finance","m3lname":"","dataset":"Home Values - Zillow Research
60 datasets, 284 by 51 (state data) or 284 by 2000-3000 (county data)
Time Series from March 1994 to Present
Demographic Data - US Census Bureau
148 columns, 3196 rows
Time Series from April 2010 to July 2018
Data for states, counties and some boroughs
Income and Unemployment Data - U.S. Department of Labor
8 columns, 3200 rows
Data for states, counties and some boroughs
City Data - OpenDataSoft
6 columns, 1000 rows
Population, Longitude, Latitude of 1000 largest US cities
TopoJSON Data - TopoJSON U.S. Atlas
Geolocations of U.S. States and Counties
6 datasets","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Predicting Real-time Stock Index Trend Using Twitter ","timestring":"Fri Dec 17 20:54:31 2021","m1uni":"hk3127","m2lname":"Fang","m1fname":"Heng","m4fname":"","m1lname":"Kan","m3fname":"Yuzhao","description":"As is widely understood, stock prices are volatile and constantly changing, so they are hard to predict. The same holds for the stock market index, which represents the whole stock market containing many individual stocks. Intuitively, if we assume that there is a strong correlation between market confidence and the stock market index, we may be able to predict the stock market index through some instrumental variables of market confidence. The instrumental variables should be easy to collect and quantify. Twitter posts might be one of them. This is the mathematical intuition behind our project. In our project, we do online training and keep pulling Twitter posts and stock-related data to make predictions for the next time interval until the program is terminated. The system continuously fetches new data and tries to adapt to the varying stock index.
","uni":"hk3127","language":"Python, Jupyter notebook, Google Colab","pid":"202112-37","m4uni":"","analytics":"We used a tokenizer to preprocess sentences and then used an LSTM model to classify the trend of indices based on sentence inputs. We used Kafka to develop our online learning system, including loading the data and model, updating the model, and predicting. We store the prediction results in GCP BigQuery and then serve a website connected to GCP that uses D3 to visualize the results. ","m4lname":"","industry":"Finance","m3lname":"Pan","dataset":"We collected streaming Twitter posts using the Twitter API. Also, we used the efinance API to collect minute-level data for the S&P 500 index. Our software can support any streaming data like posts. ","m2uni":"hf2431","m2fname":"Huaqing","m3uni":"yp2578"},{"projectname":"Fraud Detection Using TabNet and Graph Neural Networks","timestring":"Thu Dec 19 21:14:03 2024","m1uni":"aj3231","m2lname":"Iqbal","m1fname":"Anmol","m4fname":"","m1lname":"Jain","m3fname":"","description":"Goal: Develop a robust fraud detection system leveraging:
1. TabNet for tabular data analysis.
2. Graph Neural Networks (GNNs) for relational data.

Existing Challenges:

1. Transaction-based fraud detection represents a significant challenge in the financial industry, causing substantial losses for both consumers and institutions.
2. Increasing sophistication of fraud techniques.
3. Imbalanced datasets bias models toward majority (non-fraud) class.
4. Need to leverage both tabular and relational data for accuracy.
5. Traditional models (e.g., Logistic Regression, XGBoost) fail to capture relational patterns and struggle with high-dimensional, imbalanced datasets.

Novelty:

First to compare TabNet and GNNs on real-world datasets. Explore their complementary strengths for fraud detection.
","uni":"aj3231","language":"Python, PyTorch, PyTorch Geometric, Flask-API, HTML, D3.js, Pandas.","pid":"202412-9","m4uni":"","analytics":"Advanced analytics and algorithms, such as TabNet for high-dimensional tabular data and Graph Neural Networks (GNNs) like Graph Convolutional Networks (GCN), Graph Attention Networks (GAT), and Graph Isomorphism Networks (GIN), were used in the project to detect fraud. The system modules included a Flask backend for serving results, integration of the PyTorch and PyTorch Geometric frameworks for scalable model training, and data preprocessing pipelines to manage tabular and graph-based datasets. To improve interpretability and user interaction, an interactive web interface built with HTML and D3.js was used to visualize the data. This interface displayed graph structures, with nodes and edges signifying associations, along with model predictions for fraudulent transactions.
","m4lname":"","industry":"Finance","m3lname":"","dataset":"Three datasets were used, all publicly available.

1. IEEE-CIS Dataset: https://www.kaggle.com/c/ieee-fraud-detection/data
2. Elliptic Bitcoin Dataset: https://www.kaggle.com/datasets/ellipticco/elliptic-data-set
3. Capital One Synthetic Dataset: https://github.com/CapitalOneRecruiting/DS

","m2uni":"si2443","m2fname":"Saher","m3uni":""},{"projectname":"Home Credit Default Risk Prediction","timestring":"Sun May 19 00:39:36 2019","m1uni":"rl2987","m2lname":"Liu","m1fname":"Rui","m4fname":"","m1lname":"Liu","m3fname":"","description":"A credit risk is the risk of default on a debt that may arise from a borrower failing to make required payments.

In general, the higher the risk, the higher the interest rate that the debtor will be asked to pay on the debt. Credit risk mainly arises when borrowers fail to pay what is due, willingly or unwillingly.

Based on our interest in credit risk, we took part in the Kaggle competition Home Credit Default Risk.

We used Bayesian optimization to tune our parameters and used an ensembled LightGBM as our final model. Our final prediction ranked top 20 on the leaderboard, and we built an interactive website to present our final result.","uni":"rl2987","language":"Python, Django, Javascript,","pid":"201905-4","m4uni":"","analytics":"Logistic Regression
Random Forest
LightGBM
E-charts
Django
Scikit-learn
JavaScript
Bayesian Optimization","m4lname":"","industry":"Finance","m3lname":"","dataset":"We used the data set from a Kaggle competition Predict Credit Default Risk. The size of the dataset is 2.5 GB. It consists of six different spreadsheets and has information on more than 30,000 candidates.

It includes personal information, previous applications, etc.","m2uni":"jl5175","m2fname":"Juncai","m3uni":""},{"projectname":"AI Society Simulator","timestring":"Wed May 13 03:17:38 2026","m1uni":"zd2379","m2lname":"Xie","m1fname":"Zeyu","m4fname":"","m1lname":"Du","m3fname":"Yuxi","description":"AI Society Simulator is a free-form multi-agent simulation system designed to explore how rapid AI and AGI development may reshape society, labor markets, institutions, and economic distribution. The project models a society composed of heterogeneous agents, including factory workers, delivery drivers, programmers, nurses, teachers, graduates, AI company leaders, factory owners, startup founders, legislators, union organizers, economists, investors, and journalists. Each agent has its own role, personality, goal, and short-term memory, allowing the simulation to represent different social positions and incentives.

The main objective is to investigate how different actors respond to AI-driven productivity growth: whether they cooperate, compete, negotiate, resist automation, support retraining, demand redistribution, or form new institutional arrangements. The simulator tracks how these decisions affect macro-level social indicators such as unemployment, wealth inequality, average living standards, social trust, worker power, corporate power, government effectiveness, innovation rate, social mobility, cooperation index, total productivity, and AI capability.

The key innovation of this project is its hybrid simulation design. Instead of using only fixed equations or scripted events, it combines LLM-driven agent reasoning with rule-guided world evolution. AI capability and productivity grow endogenously based on innovation, corporate investment, cooperation, institutional support, and public acceptance. Meanwhile, social outcomes are updated through agent actions, negotiation results, and evolving world conditions. This makes the system useful for exploring “what-if” scenarios around AI transition, social adaptation, policy design, labor bargaining, and AGI governance.

This research toolkit is important because the social impact of AI cannot be understood only through technical benchmarks. It also depends on institutions, incentives, public trust, distribution mechanisms, and collective decision-making. The simulator provides an experimental environment for testing possible futures, comparing policy assumptions, and visualizing how small changes in cooperation, governance, or inequality may influence long-term social stability.","uni":"zd2379","language":"Python 3.10+, Streamlit, OpenAI-compatible LLM APIs, DeepSeek/Ollama-compatible backends, YAML, JSON, Matplotlib, Plotly, NetworkX, NumPy.","pid":"202605-34","m4uni":"","analytics":"The project implements a hybrid analytics and simulation pipeline. First, a group of agents is initialized with different roles, personalities, goals, and categories. Each simulated year begins with an endogenous technology update, where AI capability and total productivity increase according to configurable feedback factors. These factors include base technology growth, innovation rate, corporate investment, cooperation, government effectiveness, and social trust. This allows technology growth to respond to the internal state of society rather than following only a fixed timeline.

Second, the agents participate in forum-style discussions. In each year, selected agents speak strategically based on the current world state, recent discussion context, personal goals, and memory of previous actions. Agents may support cooperation, criticize other groups, propose adaptation strategies, or reframe the public debate. The system then generates cross-category deal proposals, such as agreements between workers and employers, policymakers and companies, or unions and investors.

Third, every agent makes a yearly decision. These decisions are collected and used to update the social and institutional world state. The update process changes variables such as unemployment, inequality, living standards, trust, worker power, corporate power, government effectiveness, innovation, social mobility, cooperation, public mood, dominant narrative, and current policy. This creates a feedback loop between individual agent behavior and macro-level social outcomes.

Fourth, the project includes analytics and visualization modules for interpreting the simulation results. The dashboard plots time-series trends for key indicators such as unemployment, inequality, AI capability, productivity, social trust, cooperation, worker power, corporate power, living standards, government effectiveness, innovation rate, and social mobility. The Streamlit interface also provides multiple analytical views: cast configuration, simulation running, overview dashboard, yearly timeline, forum discussion explorer, agent activity explorer, and state inspector.

The visualization system includes line charts, dashboard grids, radar charts comparing initial and final world states, reply-network graphs showing who responds to whom, per-agent decision timelines, activity sparklines, phase portraits, and field-vs-field comparisons. These analytics help users understand not just the final outcome, but also the path by which the simulated society arrived there.","m4lname":"","industry":"Social Science-Government","m3lname":"Luo","dataset":"This project does not use a conventional static dataset such as an image, text, or tabular benchmark dataset. Instead, it uses configurable simulation data and generated run data. The initial input data is defined mainly in configs/default.yaml, which specifies the number of simulated years, discussion rounds, speakers per round, initial world state, technology growth parameters, and AI milestone events.

The initial world state includes structured variables such as year, AI capability, total productivity, unemployment rate, wealth inequality, average living standard, social trust, worker power, corporate power, government effectiveness, innovation rate, social mobility, cooperation index, public mood, dominant narrative, and current policy. These variables form the baseline society that agents respond to during the simulation.

The project also uses predefined agent data from the source code. Each agent record contains a name, social role, personality, goal, and inferred or assigned category. These agents represent different groups affected by AI transition, including labor, business, government, academia, media, and capital. During each simulation run, the system generates new datasets automatically, including world-state history, forum discussion logs, agent decisions, proposed deals, yearly summaries, and dashboard visualizations. These outputs are saved in the outputs/ directory as JSON and PNG files.

The tested dataset is therefore a combination of scenario configuration data and generated simulation output. The software can support many other datasets or scenarios by modifying the YAML configuration, changing the agent list, adjusting social/economic indicators, adding new policy variables, or defining different AI milestone timelines. For example, the system could support scenarios focused on universal basic income, mass unemployment, AI-assisted healthcare, education reform, worker ownership, corporate concentration, regulation, or AGI-level productivity shocks.","m2uni":"jx2668","m2fname":"Jingzeng","m3uni":"yl6117"},{"projectname":"Applying Big Data to Analysis Differences in the Severity Level of COVID-19 among Countries","timestring":"Tue Dec 22 18:59:15 2020","m1uni":"wy2337","m2lname":"","m1fname":"wen","m4fname":"","m1lname":"yin","m3fname":"","description":"The COVID-19 pandemic has caused a significant
negative impact on countries around the world, and
there appears to be an observable difference in
severity among nations. This study aims to provide an
insight into the roles many social and economic
factors played in contributing to this variation. By
investigating potential patterns through exploratory
data analysis, followed by constructing models using
several popular machine learning techniques, we
examine the validity of the underlying assumptions
and identify any potential limitations. Total deaths
per million population is used as the dependent variable
with log transformation to remove outliers. A set of
factors such as life expectancy, unemployment rate
and population are available in the dataset. After
removing and transforming outliers, various machine
learning methods with cross validation are
implemented and the optimal model is determined by
predefined metrics such as root-mean-squared error
(RMSE) and mean absolute error (MAE). The results
show that the Gradient Boosting Machine (GBM)
technique achieves the best results in terms of
minimum RMSE and MAE. The RMSE and MAE
values indicate no overfitting issues and the GBM
algorithm captures the most influential factors such as
life expectancy, healthcare expense per Gross
Domestic Product (GDP) and GDP per capita, which
are clearly critical explanatory variables for
predicting total deaths per million population.
","uni":"wy2337","language":"R","pid":"202012-3","m4uni":"","analytics":"Random Forest, Gradient Boost, Lasso, Regression analysis, ETL","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"COVID related data ---- OWID
Economic related data ---- World bank","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Predicting U.S. Crop Index Performance","timestring":"Sat Dec 14 04:52:46 2019","m1uni":"mhg2141","m2lname":"Christman","m1fname":"Michael","m4fname":"","m1lname":"Gallagher","m3fname":"","description":"Objective: Use historical vegetation data sets and corresponding climate data sets to predict future crop production outlook. Implement and benchmark Spark MLlib regression models. These research / toolkits are important because vegetation data set indexes are linked to crop production. Applications for this research are predicting droughts, sales (crop prices), etc.","uni":"mhg2141","language":"Python, PySpark, Google Cloud Platform, Anaconda (Jupyter), Google Cloud Storage, Google BigQuery, d3","pid":"201912-51","m4uni":"","analytics":"Predictive analytics were implemented using PySpark on our merged dataset. Used Pandas and Numpy in data processing as well.

Algorithms used:
Linear Regression
Decision Tree Regression
Gradient-Boosted Tree Regression
Random Forest Regression

Visualization was conducted using d3.geo","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Data was collected through CMIP5-LOCA Parser and from the NOAA STAR Database. Data is public but our project focused on combining the data sources. We combined the data using a data set preprocessor (see our presentation slides).","m2uni":"nc2677","m2fname":"Nicholas","m3uni":""},{"projectname":"E-Commerce User Behavior Analysis","timestring":"Fri Dec 19 15:55:43 2025","m1uni":"mc5645","m2lname":"Liang","m1fname":"Mingjia","m4fname":"","m1lname":"Cai","m3fname":"Weiyi","description":"Objectives: In the era of Big Data, traditional mass-marketing strategies often fail to address individual user needs. This project aims to overcome these limitations by developing an adaptable analytics toolkit designed to optimize e-commerce operations through the analysis of high-volume, real-world clickstream behavior.

Innovations & Capabilities:
Our system introduces an integrated end-to-end data mining pipeline capable of processing massive datasets (67.5M+ events) characterized by extreme sparsity and right-skewness.
1. Technical Innovation: We implemented robust preprocessing techniques, specifically utilizing Log-Transformation (log(1+x)) to effectively normalize highly skewed feature distributions, a common challenge in real-world data.
2. Core Capabilities: The framework integrates RFM modeling for historical value quantification, Logistic Regression with L2 regularization for interpretable purchase prediction (AUC-ROC: 0.9657), and Z-score segmentation for standardized user categorization.

Importance & Impact: This toolkit is crucial as it successfully bridges the gap between complex raw data and actionable decision-making. By identifying distinct cohorts like \"High Interest/High Risk\" users, it empowers business stakeholders to shift from generic campaigns to precision marketing strategies (such as targeted retargeting and churn prevention), thereby increasing customer lifetime value.","uni":"mc5645","language":"Python, GCP","pid":"202512-11","m4uni":"","analytics":"The system implements a comprehensive analytics pipeline featuring RFM behavioral scoring and purchase likelihood prediction. Core algorithms include Logistic Regression for modeling and Z-score for user segmentation.

Regarding system modules, the pipeline runs on Google Cloud Dataproc using PySpark, covering data preprocessing, feature engineering, and modeling. Finally, visualization is delivered through an interactive web interface containing monthly dashboards, user segmentation charts, and feature importance plots to support marketing strategies.","m4lname":"","industry":"Information","m3lname":"Guo","dataset":"Dataset Description and Scale: For our experimental validation, we leveraged a massive-scale e-commerce clickstream dataset sourced from the Open CDP project on Kaggle. This dataset is rigorously aligned with the Big Data 3V characteristics (Volume, Variety, and Velocity), containing over 67.5 million interaction events collected from a multi-category online store. The data richness captures the complete user journey, spanning from initial item discovery (views) to high-intent actions (cart additions) and final conversions (purchases). It also includes granular metadata such as timestamps, product categories, and pricing, providing a robust foundation for modeling complex user behaviors.

System Compatibility and Generalizability: Regarding system compatibility, our analytics pipeline was architected with high flexibility and modularity in mind. Unlike rigid systems that require pre-processed inputs, our framework is designed to ingest raw, unstructured event logs directly. It essentially requires only three fundamental data fields to function: user_id, event_time, and event_type.

Once the raw data is ingested, the system’s automated preprocessing module handles the complex aggregation tasks—transforming millions of dispersed log entries into structured, user-level feature vectors (e.g., RFM scores, session duration, and behavioral Z-scores). This design makes the pipeline platform-agnostic; it can be seamlessly adapted to any other online retail platform or digital ecosystem that generates standard interaction logs, significantly reducing the engineering effort required for deployment in new domains.

Dataset Link: https://www.kaggle.com/datasets/mkechinov/ecommerce-behavior-data-from-multi-category-store","m2uni":"sl5611","m2fname":"Shuxin","m3uni":"wg2444"},{"projectname":"Popular Reddit Uniqueness Visualization","timestring":"Fri Dec 16 07:44:47 2022","m1uni":"dm3559","m2lname":"Rahman","m1fname":"Daniel","m4fname":"","m1lname":"Mao","m3fname":"","description":"Our primary goal was to analyze the most popular posts across reddit (reddit.com/r/popular) and do clustering on them to identify trends across time. Another one of our goals was to do some basic trend analysis on reddit specific data such as the number of comments and upvotes. Finally, we wanted to present our data in a visually appealing way that made it easy for users to spot trends within the clusters and understand the information.","uni":"dm3559","language":"Python, Javascript","pid":"202212-13","m4uni":"","analytics":"LDA, GMM, Stop word filtering, Text Vectorization, D3 Bubble Charts, D3 Slider Bars, D3 Line Graphs, custom web scraper.","m4lname":"","industry":"Media","m3lname":"","dataset":"Dataset containing all the most popular posts across reddit from 2010 to 2019. Each post has the following fields:

Title (post title), Title Link (link to the post), Score (reddit score), Num Comments (self-explanatory), Subreddit (which subreddit this post is from).

In total there are ~1M posts in the directory.","m2uni":"ar4451","m2fname":"Ali","m3uni":""},{"projectname":"Consumption Level Prediction of Gstore Customers","timestring":"Sat Dec 14 01:06:15 2019","m1uni":"hd2436","m2lname":"Wang","m1fname":"Huping","m4fname":"","m1lname":"Ding","m3fname":"Miao","description":"Our objective is to help Gstore increase revenue by making precision marketing, stimulating the desire to buy and mining information by using internal and private dataset.

The innovation is that marketing departments generally use only Excel and basic statistical methods to process and mine data. In this project, however, we use a variety of machine learning methods to achieve the goal, which can make the results more accurate.

The importance of this research is as follows: the 80/20 rule has proven true for many businesses, in that only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies. By doing so, the revenue of Gstore can be significantly improved.

","uni":"hd2436","language":"We use Python, JavaScript, HTML, CSS and Google Cloud Platform to finish our project.","pid":"201912-4","m4uni":"","analytics":"We use linear regression, random forest, gradient boosting, LightGBM, Neural Networks, and Convolutional Neural Networks to build the model, and use MongoDB and D3 to store and visualize the data.

","m4lname":"","industry":"Finance","m3lname":"Liu","dataset":"The dataset is a public dataset from Kaggle.
Our software can support any kind of dataset with format of .csv and .txt","m2uni":"tw2677","m2fname":"Tingyi","m3uni":"ml4410"},{"projectname":"Playing Hangman with Beepy","timestring":"Sun Dec 23 05:01:23 2018","m1uni":"sh3732","m2lname":"","m1fname":"Siqi","m4fname":"","m1lname":"He","m3fname":"","description":"The purpose of this project is to get people started with deep reinforcement learning with a simple exercise and help people understand how the fusion of both could be used to create a game solver. ","uni":"sh3732","language":"Bash, Perl, Python, Javascript","pid":"201812-47","m4uni":"","analytics":"deep reinforcement learning with custom policy function","m4lname":"","industry":"Information","m3lname":"","dataset":"wikidump, text8","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Society Analysis System","timestring":"Tue May 5 21:01:38 2026","m1uni":"zw3162","m2lname":"Yu","m1fname":"Zhiliang","m4fname":"","m1lname":"Wang","m3fname":"Yihang","description":"The Society Analysis System is a real-time question-answering platform that cross-references Reddit social discourse with authoritative news sources to produce citation-backed, fact-checked reports. Its core objective is to replace static, batch-processed misinformation detection with dynamic, on-demand retrieval synthesis.

Key innovations include:
Three-branch parallel retrieval: hybrid semantic + keyword search over official news (Evidence branch), natural-language-to-SQL queries over structured social data (NL2SQL branch), and graph-native propagation/influence analysis (Knowledge Graph branch)
Self-improving reflection loop: the system records its own errors into vector stores and uses them as few-shot examples in future queries, continuously reducing failure modes without manual retraining
Bounded session memory: conversation context is automatically compressed via LLM after 40 turns, enabling coherent long-running sessions without memory explosion
NL2SQL safety sandbox: read-only Postgres transactions, SQL whitelist, and a 3-round self-correction loop prevent hallucinated or destructive queries

Why it matters:
Social media misinformation spreads faster than manual fact-checkers can respond. This toolkit enables researchers, journalists, and platform moderators to instantly query what communities are saying, trace how narratives propagate, and verify claims against vetted sources — all in one automated pipeline.","uni":"zw3162","language":"Python, HTML, Reddit, Postgres, OpenAI, Gemini, Chroma, Docker","pid":"202605-35","m4uni":"","analytics":"Analytics & Algorithms Description
Retrieval & Ranking
Hybrid retrieval: Dense embeddings (OpenAI) + BM25 keyword search, fused via Reciprocal Rank Fusion (RRF), then reranked with bge-reranker-base — targets top-50 recall under 2 seconds
Semantic topic resolution: Embedding cosine similarity maps free-text user queries to actual topic labels stored in the database

NLP & Text Processing
Emotion classification: Per-post multi-class labeling (fear / anger / hope / disgust / neutral)
Topic clustering: K-means on post embeddings + LLM-based cluster label generation
Named entity extraction: Identifies PERSON, ORG, LOC, EVENT across posts
Simhash deduplication: 64-bit simhash with Hamming distance ≤ 3 threshold; fallback to PostgreSQL pg_trgm trigram similarity (≥ 0.85) for long posts

Graph Analytics (Kuzu Knowledge Graph)
PageRank — influencer scoring by true propagation weight, not raw post count
Betweenness centrality — identifies bridge accounts connecting disparate communities
Louvain community detection — surfaces coordinated posting groups
Modularity scoring — quantifies echo chamber insularity per topic
Recursive reply-chain traversal — traces rumor propagation paths and viral cascade trees

Orchestration & Quality Modules
Query Rewriter: decomposes multi-part questions into 1–3 subtasks
Router: rules + LLM fallback to assign subtasks to the correct branch
BoundedPlanner: schedules up to 3 parallel branches / 5 sequential steps; confidence-gated few-shot recall from Chroma 3
Quality Critic: four-axis validation — citation completeness, numeric consistency, on-topic check, hallucination detection
Reflection Store: routes Critic error verdicts to the appropriate vector store for future few-shot self-correction
NL2SQL self-repair loop: up to 3 rounds of LLM-driven SQL correction with sandboxed read-only execution

Visualization
The system outputs citation-bearing Markdown reports
No custom chart/graph rendering is implemented — analysis results are presented as structured text with inline citations, tables from SQL results, and narrative summaries","m4lname":"","industry":"Media","m3lname":"Sun","dataset":"The system was built and tested on two categories of data:
1. Reddit social posts: scraped from public subreddits (conspiracy, worldnews, politics, health, news) using a browser-based scraper (no API credentials required). Each post captures text, author, engagement metrics (likes, replies), comment trees, and optional images. A 7-day lookback window is used by default.
2. Official news articles: fetched via RSS feeds from five whitelisted outlets: BBC, New York Times, Reuters, Associated Press, and Xinhua. Reuters and AP are proxied through Google News RSS; the others use direct feeds. Articles are downloaded in full, cleaned, and chunked (800-token target with 200-token overlap) before ingestion. A JSONL fixture dataset is also bundled for local testing without live scraping.

What other data the project can support:
The ingestion pipeline is source-agnostic. It can be extended to support:
- Additional subreddits or social platforms (Twitter/X, Telegram, etc.) by adding a compatible scraper and feeding posts into the same normalization pipeline
- Additional news outlets by registering new RSS feed URLs — the system auto-classifies source tier and applies the same chunking/embedding workflow
- Custom document corpora (PDFs, internal reports) via direct text injection into the Chroma evidence store
- Multilingual content — the embedding and LLM components support non-English text, though entity extraction and NL2SQL prompts are currently English-optimized","m2uni":"cy2812","m2fname":"Changyuan ","m3uni":"ys3978"},{"projectname":"Predictive Analysis of Stock Returns","timestring":"Fri Dec 13 15:14:35 2019","m1uni":"hg2532","m2lname":"Parmar","m1fname":"Heetika Vipul","m4fname":"","m1lname":"Gada","m3fname":"Prajwal","description":"In the Finance Industry, stocks is one of the most important aspects. To predict stocks, in the world where there are millions of companies, becomes a challenge.

There are multiple features that affect the increase/decrease of stocks. It is essential to take into account the historical performance of the company as well as these features to predict the returns.

The main challenge is to predict the intraday and end-of-day returns without the interference of noise.

Thus, our machine learning model is designed so that it can make static as well as real-time stock predictions.","uni":"hg2532","language":"Python and Jupyter Notebook","pid":"201912-5","m4uni":"","analytics":"Static Stock Prediction Analysis:
Keras Regression, Sequential, LSTM and LSTM on relative returns

Real Time Stock Prediction Analysis:
Linear Regression, Quadratic Regression and KNN Regression
Feature Engineering was implemented

Analysis:
Exploratory Data Analysis with Two Key Measurements (Rolling Mean and Return Rate) and Confidence Intervals
Data Visualization was implemented

","m4lname":"","industry":"Finance","m3lname":"Prakash","dataset":"Static Stock prediction: The Winton Stock Market Challenge (Kaggle Competition)
Real Time Stock Prediction: Alpha Vantage (Online API key)

Our software can support multiple kinds of financial data. ","m2uni":"prp2126","m2fname":"Prutha ","m3uni":"pp2719"},{"projectname":"Understanding Personal Value and Objectives","timestring":"Sun May 17 08:52:09 2020","m1uni":"nc2677","m2lname":"","m1fname":"Nicholas","m4fname":"","m1lname":"Christman","m3fname":"","description":"The objective for this project is to apply data-driven methods for identifying successful teams of like-minded, compatible individuals, given a complex network of independent entities.

The following aspects were addressed:
* A hybrid-artificial multiplex network of users’ IPIP-NEO personality scores was established.
* The network was analyzed using the three-way tensor canonical polyadic (CP) decomposition alternating least squares (CP-ALS) algorithm to highlight the latent structure of a multiplex network.
* A novel latent structure best-profile algorithm was used to detect the top-N most compatible users from the CP-ALS factor components.
","uni":"nc2677","language":"Scala,Python","pid":"202005-17","m4uni":"","analytics":"DatasetEmulator - Scala application reads in the raw IPIP-NEO data, scores the data, and stores the resulting \"emulated data\" in a Google BQ table

ComplexNetowrk - Scala application builds the network and decomposes it via CP-ALS

MultilayerAnalysis - python script to post-process the decomposed data, detect the communities, and find the top-5 most compatible users","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"Johnson’s International Personality Item Pool (IPIP) version of the NEO Personality Inventory (IPIP-NEO)
https://osf.io/wxvth/","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Personas: Conversational Agents and Multi-Persona Debate","timestring":"Tue May 5 22:57:46 2026","m1uni":"pjm2188","m2lname":"Stephens","m1fname":"Peter","m4fname":"","m1lname":"McMaster","m3fname":"","description":"Personas is a full-stack web application for creating AI-driven characters, holding one-on-one conversations with them, and watching two of them debate a topic in a shared arena. Rather than fine-tuning a separate model per persona, the system represents each persona as a row in a Postgres database whose name and description are compiled into a system prompt at request time. Custom personas are user-authored; famous personas are constructed by asking the language model to verify the name and then enriching it with a live biographical summary fetched from Tavily web search. The platform is built on a FastAPI backend with SQLAlchemy on Supabase Postgres, a Next.js frontend with TypeScript and Tailwind, and a provider-agnostic LLM abstraction that runs against Anthropic Claude in production with an OpenAI path and a stub fallback also wired in. The whole system deploys to Vercel as a single project, with Next.js serving the UI and a Starlette-wrapped FastAPI handler serving /api/*. Two interaction modes share the same persona representation: persistent chat sessions, and an arena in which the user manually steps through turns between two personas and may inject their own messages mid-debate.","uni":"pjm2188","language":"Next.js, Python, Vercel, Supabase, Tavily API, OpenAI/Anthropic API","pid":"202605-30","m4uni":"","analytics":"Frontend was implemented and visualized using Next.js, and OpenAI/Anthropic APIs and the Tavily API were the primary algorithmic techniques used. Project was bundled and deployed on Vercel, and we used Supabase to host the persistent data storage layer. ","m4lname":"","industry":"Information","m3lname":"","dataset":"Used Tavily API to collect a persona's biographical information, if applicable. No other data sets were used in the creation of this project. ","m2uni":"js5987","m2fname":"Jalen","m3uni":""},{"projectname":"BART (Bay Area Rapid Transit) Service Status Analysis","timestring":"Fri Dec 13 14:11:32 2019","m1uni":"jjp2181","m2lname":"","m1fname":"Jordan","m4fname":"","m1lname":"Park","m3fname":"","description":"BART is one of the few public transportation options for people in San Francisco. It is notorious for sporadic service outages and delays - providing a bad experience for riders. This project aims to not just report service change, but predict service change and its degree to riders.
","uni":"jjp2181","language":"JavaScript (Node.js, React), Python, Google Colab, Google Cloud Services","pid":"201912-40","m4uni":"","analytics":"- Tensorflow and Keras
- Multi-label Classification
- Stanford GloVe: Global Vectors for Word Representation (https://nlp.stanford.edu/projects/glove/)
- Bidirectional LSTM Networks
- Convolutional Neural Networks","m4lname":"","industry":"Transportation","m3lname":"","dataset":"1. Twitter Premium Search API (paid for premium subscription)

2. Twitter Streaming (Twitter Developer API)

3. BART ridership reports (https://www.bart.gov/about/reports/ridership)","m2uni":"","m2fname":"","m3uni":""},{"projectname":"AI System On Chip","timestring":"Thu May 4 23:13:30 2023","m1uni":"fr2510","m2lname":"","m1fname":"Fernando","m4fname":"","m1lname":"Rodriguez-Guzman","m3fname":"","description":"The use of Artificial Intelligence (AI) and Machine Learning (ML) methodologies for everyday tasks is consistently growing as more AI technology becomes readily available for everyday use. Therefore, technological options for embedded device-based AI processing are increasing in popularity which require further study on their effectiveness compared to large scale server-based options. This hardware-centric research project was established to determine the nuances and benefits of training and utilizing Embedded AI methodologies on dedicated Tensor Processing Unit (TPU) devices compared to Neural Processing Unit (NPU) embedded devices enabling the development of small scale, local, or independent AI embedded platforms. This resulted in providing a realistic comparison to determine the feasibility of utilizing dedicated or specialized locally run embedded platforms for everyday AI or ML applications without the computational overhead requirements of traditional centralized large scale AI data centers.","uni":"fr2510","language":"Python","pid":"202305-12","m4uni":"","analytics":"TensorFlow algorithms were selected to test the training and identification performance of select embedded AI devices for local small-scale applications. The custom Embedded Board photo dataset was utilized, via the use of the TensorFlow 2 platform, to retrain a MobileNet V2 convolutional neural network classifier model for the use in the Coral TPU development board. The resulting TensorFlow model’s learning curves, based on 10 training epochs, were visualized prior to the optimization of the retrained model. 
The model was then converted to a TensorFlow Lite representation and then further compiled for compatibility with the Coral Edge TPU development board via Google’s Edge TPU Compiler.","m4lname":"","industry":"Information","m3lname":"","dataset":"Data types for this research included video stream feeds, sample Google images, and custom image datasets built from uniquely captured photos. These datasets were used to evaluate and train learning algorithms as a baseline to compare the performance of training and identification by TPU- and NPU-based embedded platforms. The datasets were collected from a range of sources including, but not limited to, embedded video cameras, the internet, training data repositories, and real-world photo captures of embedded devices.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Building Clinical Decision Support System for CVD","timestring":"Sat May 16 03:57:19 2020","m1uni":"xw2657","m2lname":"Zheng","m1fname":"Xinyi","m4fname":"","m1lname":"Wang","m3fname":"","description":"Predicting cardiovascular disease is of great significance for its prevention. Clinical data for different patients is difficult to represent well, and the time intervals between events vary, which makes prediction difficult. This paper proposes several algorithms based on electronic medical record data to study the risk prediction of cardiovascular disease. The model uses techniques such as neural networks to characterize and learn from the patient's historical electronic medical record data, which not only effectively captures the time-series features of the electronic medical record data but also considers the potential correlations between them, ultimately improving cardiovascular disease risk prediction performance. The key result is that coronary artery related disease has a prominent influence on the prediction of cardiovascular disease risk.
","uni":"xw2657","language":"python, jupyter notebook, pytorch, voila","pid":"202005-23","m4uni":"","analytics":"node2vec, Dynamic Time Warping, k_means, LSTM, voila","m4lname":"","industry":"Life Science","m3lname":"","dataset":"This paper is based on the MIMIC-III database. Each disease code follows the standards of the 9th edition of the International Classification of Diseases (ICD9).

","m2uni":"dz2424","m2fname":"Dianchen","m3uni":""},{"projectname":"Action Sketch Generator","timestring":"Fri Dec 13 08:11:39 2019","m1uni":"yz3365","m2lname":"Zhang","m1fname":"Yancheng","m4fname":"","m1lname":"Zhu","m3fname":"Shili","description":"The goal of our project is to build a website that can generate sketch pictures based on the painting input. A convolutional neural network (CNN) and a generative adversarial network (GAN), the innovative part of the project, handle the recognition and generation functions; they were trained on our picture dataset containing thousands of body pose paintings. The project is also a research effort to explore new applications of GANs in painting generation.

Movies and anime are very profitable and popular among teenagers nowadays. The importance of our project is that much repetitive painting work could be handled by this product, an AI painter. Meanwhile, a large amount of labor and time can be saved in creation and design tasks when companies apply these AI tools in production.
","uni":"yz3365","language":"Python, Golang","pid":"201912-33","m4uni":"","analytics":"We implement a Convolutional Neural Network (CNN) as the recognition module and a Generative Adversarial Network (GAN) as the generation module in our project. We have also drawn the training loss curves of these networks to show the learning rate, compared the results of different training methods, and analyzed the problems in the process of designing the GAN structure.","m4lname":"","industry":"Media","m3lname":"Wu","dataset":"The dataset includes 4 parts: sketch data, 3D model data, simplified 3D model data, and stick figure data. The sketch data is collected from a Sketch dataset. The 3D model data is the projection figures of a 3D model. The simplified 3D model data and stick figure data are human pose figures created by an OpenCV program. ","m2uni":"kz2323","m2fname":"Kaibo ","m3uni":"sw3302"},{"projectname":"NFT Price Prediction","timestring":"Sat Dec 24 05:15:24 2022","m1uni":"jw4323","m2lname":"Li","m1fname":"Jinze","m4fname":"","m1lname":"Wu","m3fname":"Xiaofan","description":"Our project goal is to predict the trend of the NFT price based on financial time series NFT data and text data.","uni":"jw4323","language":"Python, Google Colab, Keras, React, JavaScript","pid":"202212-4","m4uni":"","analytics":"LSTM, BiLSTM, Attention-BiLSTM","m4lname":"","industry":"Finance","m3lname":"Wang","dataset":"We get our project dataset from Opensea API, nonfungible.com. ","m2uni":"ll3459","m2fname":"Linghui","m3uni":"xw2741"},{"projectname":"Investment Strategy -- AI Trader (Foreign Exchange)","timestring":"Fri May 5 22:38:43 2023","m1uni":"ty2481","m2lname":"","m1fname":"Tao","m4fname":"","m1lname":"Yan","m3fname":"","description":"The goal is to implement an artificial intelligence trader system for foreign exchange trading that supports both long-term and short-term trading.

Motivation:
(1) Foreign exchange is a popular and important investment tool.
(2) Foreign exchange is important for maintaining the stability of the country's financial system.
(3) International students can use appropriate foreign exchange strategies to pay their tuition.

Innovations: Using the Particle Swarm Optimization algorithm to integrate the LSTM model and the AR model to predict forex rates.","uni":"ty2481","language":"python, matlab","pid":"202305-10","m4uni":"","analytics":"Particle Swarm Optimization algorithm, LSTM model, AR model, full stack AWS services","m4lname":"","industry":"Finance","m3lname":"","dataset":"The forex data is obtained from a commercial API called fastforex.
I request the data through the third-party API and store it in AWS DynamoDB for further analysis.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Real-Time Movie Rating System Based on Twitter","timestring":"Sat Dec 17 21:14:08 2022","m1uni":"wz2581","m2lname":"Tian","m1fname":"Wenxin","m4fname":"","m1lname":"Zeng","m3fname":"","description":"There are more than 300 million active users worldwide on Twitter, and they can generate 200 thousand movie-related tweets per day, yet there is no official movie rating system for Twitter.
So, we proposed to create a unique movie rating system based on real-time tweet sentiment scores, capture the opinions of additional audiences who may otherwise not use traditional movie rating websites, and help moviegoers make more informed decisions based on an alternate platform.

","uni":"wz2581","language":"Python, Javascript, HTML, CSS, GCP","pid":"202212-35","m4uni":"","analytics":"For this project, we first use Tweepy API to stream data from Twitter, extract valid feature from the raw data, and store the data into our Pub/Sub topics inside Google Cloud Platform. Then, we pull messages from Pub/sub, and use Vader sentiment analysis tool to generate scores for movies, and save them to our BigQuery tables. Finally, we extract data from BigQuery and display our real-time and average scores over time in the frontend web page using D3.js.","m4lname":"","industry":"Media","m3lname":"","dataset":"We used Tweepy API to stream real-time tweets from Twitter related to movies. First, we initialized the streaming client using bearer_token and then set up rules by specifying the keywords of the movie and the tweet language, and then, started streaming. And the data we streamed contained all the info about account creation time, account ID and tweet text and tweet posted time. And we streamed our data for two weeks, finally got 2000+ tweets for five movies.","m2uni":"xt2261","m2fname":"Xiaoyu","m3uni":""},{"projectname":"Nice Drawing","timestring":"Sat Dec 22 20:55:14 2018","m1uni":"xl2719","m2lname":"Zhang","m1fname":"Xueyao","m4fname":"","m1lname":"Li","m3fname":"","description":"This project was inspired by Google’s experiments on Quick, Draw!, an online game where the players are asked to draw objects belonging to a particular object class in less than 20 seconds, and AutoDraw, a drawing tool that pairs machine learning with drawings from artists to help users create drawings easier. Google creates the world’s largest doodling dataset with the game “Quick, Draw!” and has made it publicly available. We would like to leverage the dataset and be part of the community of exploring the application of the state-of-the-art technologies towards visual art and communication.

In this project, we built a classification engine for doodle drawings of 15 animal categories and constructed a web-based application that recognizes the input of a sketched drawing or a photograph and returns an output of thousands of similar-looking line drawings.","uni":"xl2719","language":"Python, JavaScript, Google Cloud Platform","pid":"201812-22","m4uni":"","analytics":"
For the classification of doodle drawings, we implemented Logistic Regression with Apache Spark MLlib and Convolutional Neural Network (CNN) with Keras, and then selected CNN to further improve on for its better performance.

For the classification of photograph input, we leveraged the TensorFlow Image Recognition and Object Detection API.

For visualization, we created interaction interfaces with GitHub Pages and Observable, which recognizes the input of a sketched drawing or a photograph and returns a t-SNE map to visualize all drawings of the classified categories from the Quick, Draw! dataset.","m4lname":"","industry":"Information","m3lname":"","dataset":"The Quick Draw Dataset consists of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!. In this project, we chose 15 animal categories, (i.e., ant, bear, bee, bird, cat, dog, duck, flamingo, frog, octopus, owl, penguin, pig, snail, and tiger) from the whole dataset. The dataset had already been preprocessed to a uniform 28×28 pixel image size from the raw data that also includes the timestamps and the country code.","m2uni":"yz3280","m2fname":"Yiyi","m3uni":""},{"projectname":"Urban Mobility Sandbox - A Case Study of NYC Subway Recovery After COVID-19","timestring":"Fri May 6 19:19:15 2022","m1uni":"gw2415","m2lname":"","m1fname":"Guangyu","m4fname":"","m1lname":"Wu","m3fname":"","description":"This project aims to understand how new transportation technologies and special events influence human mobility and urban development. In light of the recent pandemic and its impact on the public transit system across the world, the project will focus on explaining the ridership recovery of the subway system as an entry point to the more extensive mobility trend analysis. More specifically, it will use New York City before and after the COVID-19 outbreak as the case study.","uni":"gw2415","language":"Python","pid":"202205-9","m4uni":"","analytics":"XGBoost Regression, Pydeck, SHAP explainable AI.","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Multiple high-frequency high-resolution urban datasets are collected and aggregated in the project. 
As they are an important part of the project outcome, they will be discussed in detail in the data contribution part of the report.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"AI to Explore the Brain MRI Dataset","timestring":"Sat May 16 04:02:18 2020","m1uni":"tz2434","m2lname":"Zhu","m1fname":"Tianle","m4fname":"","m1lname":"Zhu","m3fname":"","description":"The main goal of this project is to explore Brain MRI datasets, improve the accuracy and reproducibility of medical image diagnostic methods, and automate the abnormality detection process. Recently, medical image analysis is becoming even more important in medical diagnostics. It integrates advanced image diagnostic methods used in practice such as extracting quantitative image parameters and support the surgeon during a navigated intervention. This project is training based on the Brain MRI datasets to build models to find the abnormal brain areas and mark the specific segments of that brain MRI image.","uni":"tz2434","language":"Python","pid":"202005-18","m4uni":"","analytics":"VGG16, ResNet50, UNet, YOLO","m4lname":"","industry":"Life Science","m3lname":"","dataset":"In this project, we use two brain MRI datasets which are brain MRI segmentation dataset and brain tumor dataset from Kaggle. Brain MRI segmentation dataset contains brain MRI images together with manual FLAIR abnormality segmentation masks. It contains 4568 brain MRI images for 110 patients from the Cancer Imaging Archive (TCIA). The Brain Tumor dataset consists of 155 brain MRI images with a brain tumor and 98 images without a brain tumor.","m2uni":"yz3691","m2fname":"Yuyang","m3uni":""},{"projectname":"Log Anomaly Detection","timestring":"Fri Dec 23 23:12:42 2022","m1uni":"zl3218","m2lname":"Fan","m1fname":"Zixuan","m4fname":"","m1lname":"Li","m3fname":"Wenliang","description":"Modern computer systems emphasize more and more on robustness and resilience to abnormalities in different operating conditions. 
System states and critical point analysis are important to record and review in order to analyze and further avoid system errors. System log is one of the widely used forms for recording the server's operation status. Since system logs contain detailed processing information and parameters, including data delivery and the power supply, the use of logs facilitates the efficiency of anomaly detection for servers, especially for large-scale server clusters. In this paper, we propose a Transformer-based algorithm for the automatic detection of system anomalies using log data. First, server logs are parsed into templates (log key) as well as key values. By semantic analysis of the key and value sequences, we can explore the potential associations with system status. Subsequently, keys and values are fed into a detector consisting of two Transformer branches for feature extraction and result prediction, respectively. Since Transformer has the advantage of capturing larger-scale semantic information than other sequence-oriented deep networks, our Transformer-based detector can detect system anomalies with high accuracy by identifying semantic features within a large window. Experiments show that our proposed system can exhibit higher performance than existing log anomaly detection models on public datasets.","uni":"zl3218","language":"python pytorch Ubuntu","pid":"202212-24","m4uni":"","analytics":"We aggregated the raw data by block id and analyzed and visualized the number of logs contained in each block. We then used the Drain algorithm to parse the logs. A Transformer-based anomaly detector, who takes the parsed data as the input, is constructed and used to automatically extract log features and construct their relationship with the system state.

We visualized the training and validation loss as well as the number of parameters for each model.
Compared with other models, the training loss of large translog does not converge completely in the end, and its validation loss clearly does not converge well. This is because the large translog model has the most complex network, and our 2000 data samples are too few for it, so the model validates poorly. In addition, the blue line indicates our TL model, which has the best val_loss convergence compared with the LSTM and GRU models of deeplog. Although the performance of translog is better in all comparisons, the parameter comparison shows that TL has more parameters than the deeplog model, so the deeplog model is more efficient on platforms with lower computing power.

","m4lname":"","industry":"Information","m3lname":"Guo","dataset":"We used the HDFS (HDFS is a Hadoop-distributed file system designed to run on commercial hardware) log dataset in loghub. It is generated in a Hadoop cluster, which has 46 cores on five machines, by running MapReduce jobs on more than 200 Amazon EC2 nodes, and is tagged by Hadoop domain experts through manual rules to identify anomalies. The HDFS dataset contains a total of 11,175,629 log messages, with 16,838 log blocks (2.93%) indicating anomalies. The dataset was collected for 38.7 hours, during which time a total of 1.47 uncompressed data were collected.

Our dataset is downloaded from:(https://doi.org/10.48550/arxiv.2008.06448)","m2uni":"cf2859","m2fname":"Chaoyu","m3uni":"wg2397"},{"projectname":"Visualization and Analysis of Food Recalls","timestring":"Sat Dec 24 04:31:25 2022","m1uni":"nm3310","m2lname":"","m1fname":"Nathan","m4fname":"","m1lname":"Ma","m3fname":"","description":" Considering how food is universally essential to the life and well-being of every person, it makes a lot of sense to dedicate resources to evaluating the safety and security of the food consumed by everyone. That is why the United States government, through two government agencies, the United States Department of Agriculture Food Safety and Inspection Service (FSIS) and the Food and Drug Administration (FDA), regularly inspects all food imported, exported, and produced, in order to identify and regulate problems relating to food safety and cleanliness in the country. Recalls are issued when there is an identified problem with a product that the general public needs to be made aware of, in the interest of public safety and health. This project aims to analyze food recalls published in the United States in the past 10 years to identify trends and patterns, in order to help American consumers make more educated choices for the health and safety of themselves and their families.

The technical goals of this project are as follows: evaluate food recall data in the United States to enable analysis of trends or patterns in recalls; analyze text data of recalls to find patterns in recalls; visualize trends and patterns analyzed from data; provide an interactive tool that users can use to textually query food recalls and see visualized results.","uni":"nm3310","language":"python, pyspark, GCP, Bigquery, colab","pid":"202212-32","m4uni":"","analytics":"html scraping and text data processing and analysis with requests, beautifulsoup, nltk, pyspark
dataset processing with pandas and storage with pyspark, bigquery/GCP
query and analysis results processing and visualization with pandas, matplotlib, plotly, kaleido
producing pie charts, line graphs, bar graphs, choropleths, etc.
using python notebooks in colab","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The tabular data comes from a posting on Kaggle; this dataset includes over 1300 food recalls, all of the recalls from Jan 2010 to Oct 2022. These data points include features like start date, URL of the recall, information about the establishment responsible, and risk level of the recalled product.

The webpages of all the recall reports were used to request full html data, which were processed into text data. Text data of each recall report is used to extract specific words and phrases, producing an additional dataset of text features that augments the existing tabular data.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NBA Player Awards and Team Performance Prediction","timestring":"Fri Dec 23 01:17:38 2022","m1uni":"zs2584","m2lname":"Dou","m1fname":"Zhan","m4fname":"","m1lname":"Shu","m3fname":"Yingqi","description":"Predict awards voting results (Rookie of the Year, Sixth Man of the Year, Most Valuable Player, Defensive Player of the year) and the probability of each team entering the playoffs of current year, based on their previous performance.
Design a web application to help fans get the prediction results and statistics of the players and teams.
","uni":"zs2584","language":" Python, HTML, JavaScript, D3.js","pid":"202212-19","m4uni":"","analytics":"- Exploratory Data Analysis
Explore features that are most important to building the model and drop irrelevant features using heatmap.
- Data Preprocessing
Data Cleaning, replace all NaN values with 0.
Pick data from season 1980-2022 as training data.
Feature selection based on different tasks.
- Classification
Split the data into 70% training data and 30% validation data. Using Scikit-learn library, train different machine learning models (Logistic Regression, Naive Bayes, Random Forest, MLP, KNN, SVC) and compare validation accuracies.
Based on the tasks, select the model with highest performance and retrain the model with all of the data from season 1980-2022.
- DAG
Use nba_api to fetch the latest data, triggered at 7 a.m. every day.
Make predictions using the corresponding model and the latest data.
- Web application
Visualize statistics and prediction results using D3.
","m4lname":"","industry":"Media","m3lname":"Ma","dataset":"The dataset is downloaded from www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats. The data was scraped from basketball-reference.com, one of the most comprehensive basketball stats sites. We also use NBA_API to get the latest data and make predictions using pre-trained machine learning models.

Volume: The dataset contains 21 csv files with around 30 MB data of all the NBA Stats from 1947 till now.
Velocity: The data is updated daily with the use of DAGs and the API to get the latest information.
Variety: On the team side, there are 8 csv files: totals, per game, per 100 possessions statistics, etc. On the player side, there are 13 csv files: per game, per 36 minute, shooting stats, etc.
","m2uni":"yd2676","m2fname":"Yuan","m3uni":"ym2926"},{"projectname":"Credit Card Fraud detection via Cluster based scoring & Anomaly Detection","timestring":"Fri Dec 17 23:00:45 2021","m1uni":"st3425","m2lname":"Nijhawan","m1fname":"Sushant","m4fname":"","m1lname":"Tiwari","m3fname":"Vedant","description":"Over the years, industries like finance, retail & e-commerce have faced challenges associated with the detection of fraudulent transactions posing a serious threat to both commercial business & consumer. This is detrimental to a company's reputation & can affect future prospects thereby hindering customer’s trust & brand value.

Catching fraudulent patterns is difficult as these are not only scarce but also have time-variant spatial patterns. Our proposed algorithm via Cluster based scoring and Anomaly detection works as a solution to detect outliers representing fraudulent transactions in a given set of transactional data effectively.

Our designed model can exploit the spatial inconsistency of fraudulent data patterns by leveraging ensemble clustering algorithms as well as anomaly detection techniques. These algorithms will generate a set of consistency or similarity scores for each data point which can be used to differentiate good behavior from bad or inconsistent behavior in transactional data.

We devised an approach which is not only highly scalable to increasing volumes of data, but also has potential across diverse use cases such as healthcare, intrusion detection, anomaly detection in aircraft designs, etc.

","uni":"st3425","language":"Python & Jupyter, Google Cloud Platform (GCP) and Ubuntu, Anaconda, Atom IDE & Spyder IDE","pid":"202112-31","m4uni":"","analytics":"
Algorithms: K-Means, Isolation Forest, Local Outlier Factor (LOF)

System modules: Matplotlib, pandas, NumPy, collections, sklearn.cluster (KMeans), sklearn.ensemble (Isolation Forest), sklearn.neighbors (LOF), sklearn.metrics

Visualization: Histogram Distribution, Heat Maps, ROC Curves, Precision-Recall Curves, Cluster Scatter Plots
","m4lname":"","industry":"Finance","m3lname":"Kumar","dataset":"About: Credit card fraud detection
From: Kaggle - https://www.kaggle.com/mlg-ulb/creditcardfraud
New/Existing: Existing (Public Data Set)
Description: Transactions made by credit cards in September 2013 by European cardholders
Velocity: Real-time - over two days, 492 (0.172%) frauds out of 284,807 transactions
Volume: ~158.6 million bytes of transactional data spread over 2 days
Variety: Structured (CSV), 31 columns - 30 columns with numerical data, 1 column with categorical data
Data Support: All data types and formats readable by Python are supported by our software

This work uses the Credit Card Fraud Detection data set from Kaggle. The data supports detection of fraudulent transactions, which can prevent customers from being charged for transactions they did not make. It is a structured (CSV) data set of 31 columns: 30 columns with numerical data and 1 column with categorical data. It comprises about 158.6 million bytes of transactional data and contains 284,807 data points representing credit card transactions made by European cardholders in September 2013. All of the transactions in the data set occurred within a 48-hour period. The data is real-time and highly imbalanced in its class labels: there are 492 fraudulent transactions out of 284,807 total transactions. The outlier fraction for the data set is 0.00173 (0.17%).

The 30 feature columns are labeled Time, V1, V2, V3...V28 and ‘Amount’, i.e. 28 dimensions do not have original feature names. Of the 30 features, 28 are obtained through Principal Component Analysis (PCA), while the other 2 features, ‘Amount’ and ‘Time’, are not subjected to the PCA transformation. ‘Amount’ is the transaction amount for a particular data point. ‘Time’ is the number of seconds elapsed between each transaction and the first transaction in the data set.
The label for this data set is ‘Class’ which takes 0 for the normal and 1 for the fraudulent transaction.
","m2uni":"sn2951","m2fname":"Siddharth","m3uni":"vrk2109"},{"projectname":"AIM^2: Adaptive Intelligent Medical Multi-Agents","timestring":"Tue May 13 13:49:43 2025","m1uni":"tg2935","m2lname":"Zhu","m1fname":"Tianshuai","m4fname":"","m1lname":"Gao","m3fname":"Jieyuhan","description":"We introduce AIM^2, a multi-agent framework for medical question answering that adaptively orchestrates LLM-based expert agents through complexity-aware triage and structured multi-round collaboration, thereby bridging the gap between foundation models and real-world clinical workflows by providing interpretable, efficient, and evidence-grounded decision support.","uni":"tg2935","language":"Python, Google Colab","pid":"202505-13","m4uni":"","analytics":"Our system mimics a real hospital’s clinical workflow by first having a Vision Expert process the image and generate an initial report, just as a radiologist would. Cases are then triaged by a Difficulty Agent that replicates the role of a nurse assessing urgency. For complex or ambiguous scenarios, a Recruiter Agent convenes a virtual multidisciplinary team—mirroring an MDT meeting—by selecting Medical Expert Agents with relevant specialties. These experts engage in iterative, multi-round discussions akin to clinical case conferences, exchanging diagnostics and recommendations. Finally, a Moderator Agent acts like the attending physician, synthesizing all inputs into a coherent, evidence-backed final decision. 
System performance is evaluated on answer accuracy, inter-agent consistency, and decision turnaround time.","m4lname":"","industry":"Information","m3lname":"Zhu","dataset":"AIM^2 is evaluated on two public medical visual question answering datasets—VQA-Med4 and PMC-VQA—to test its performance across varying difficulty levels and modalities.","m2uni":"wz2708","m2fname":"Wangshu","m3uni":"jz3849"},{"projectname":"Yelp Hybrid Recommender System","timestring":"Sat Dec 18 03:40:17 2021","m1uni":"zw2776","m2lname":"Chen","m1fname":"Zian","m4fname":"","m1lname":"Wang","m3fname":"Xiaoyu","description":"We aim to build a hybrid recommendation system utilizing collaborative filtering and content-based filtering to recommend restaurants to both old and new Yelp users. Our innovative approaches are the cascade hybridization method adopted to connect two filters, and the method to address cold-start problem. Our app can provide both consistent and serendipitous recommendations to old and new users. While numerous businesses are emerging, people become more and more lost when selecting their favorite restaurants. Using our app can greatly reduce the time spent on meaningless hesitation.","uni":"zw2776","language":"Python, PySpark, MySQL, Vue, Django","pid":"202112-39","m4uni":"","analytics":"1. Implemented model-based collaborative filtering algorithm using SVD to give 50 rough recommendations.
2. Applied LDA and sentiment analysis in the content-based filtering algorithm to compute topic similarity between each user and business.
3. Used topic similarity and category similarity as a finer filter to produce the final 10 recommendations.
4. Built a website by Vue and Django to visualize our system.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"We used Yelp Open Dataset, which is a subset of Yelp businesses, reviews, and user data. We applied for this dataset on the website. Our hybrid recommendation system can support any business dataset with reviews and ratings.","m2uni":"yc3996","m2fname":"Yang","m3uni":"xw2811"},{"projectname":"Smoke Infiltration Analysis","timestring":"Fri Dec 19 03:11:38 2025","m1uni":"sj3394","m2lname":"","m1fname":"Skylar","m4fname":"","m1lname":"Jung","m3fname":"","description":"In this paper, we address these challenges by constructing unified, event-driven data pipeline that integrates indoor and outdoor air quality measurements with smoke activity data and meteorological drivers. The proposed analysis focuses on normalized comparison of infiltration between events rather than absolute pollution metrics.","uni":"sj3394","language":"python","pid":"202512-27","m4uni":"","analytics":"linear regression, scatter plots, html visualizations","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"PurpleAir Dataset - publicly available via API, paid after free trial credits
NOAA HMS - publicly available via API and download
NOAA ISD - publicly available via API and download","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Heart Disease Risk Prediction and Visualization","timestring":"Fri Dec 20 10:17:25 2024","m1uni":"yw4473","m2lname":"Ge","m1fname":"Yi'an","m4fname":"","m1lname":"Wang","m3fname":"Yiran","description":"This project aims to develop an advanced and user-friendly system for heart disease risk prediction and analysis, combining machine learning, data visualization, and intuitive interfaces. Leveraging techniques like Random Forest, AdaBoost, Decision Tree, and Naive Bayes with the 2022 BRFSS dataset, the system provides accurate risk predictions, interactive visualizations of key factors, and personalized assessments through user inputs and language models like ChatGPT. Built on a modular architecture using FastAPI and Gradio, it ensures scalability, transparency, and accessibility through a web-based platform. By addressing the complexities of heart disease risk factors and improving predictive accuracy, this project bridges the gap between data-driven insights and actionable healthcare applications. Empowering individuals and providers with interpretable predictions and tailored recommendations, it contributes to early detection, preventive care, and improved public health outcomes.","uni":"yw4473","language":"Python, FastAPI, Gradio, Uvicorn, Pandas, NumPy, Plotly Express, Joblib, CSS, JavaScript.","pid":"202412-18","m4uni":"","analytics":"Random Forest, AdaBoost, Decision Tree, Naive Bayes, Feature Importance Analysis, Partial Dependence Plots (PDP), Risk Distribution Visualization, Data Distribution

Visualization： Health Risk Prediction, User Input Interface, Interactive Buttons, Multi-Tab Layout.","m4lname":"","industry":"Life Science","m3lname":"Wang","dataset":"The project utilized the publicly available 2022 Behavioral Risk Factor Surveillance System (BRFSS) dataset, provided by the Centers for Disease Control and Prevention (CDC). This dataset comprises health-related survey data collected from adults aged 18 years and older across the United States, including all 50 states, the District of Columbia, and U.S. territories. Data collection is conducted via structured telephone interviews using a stratified random sampling methodology to ensure representativeness and minimize bias.

The dataset includes a broad range of variables such as general health status, healthcare access, risk behaviors (e.g., smoking, alcohol use), physical activity, nutrition, and chronic health conditions like diabetes, cardiovascular disease, and cancer. Additional modules assess mental health, oral health, demographic information, and state-specific health priorities. This diversity of variables supports the analysis of heart disease risk factors and enables insights into health disparities.

The dataset was preprocessed to clean missing values, standardize formats, and encode categorical variables for compatibility with machine learning algorithms. To support diverse analytical needs, both cleaned and unprocessed versions of the data were prepared.

The software developed for this project is designed to support other public datasets in compatible formats (e.g., CSV, SAS, Parquet) with similar health-related or structured data. This flexibility allows the system to be adapted for broader applications, such as analyzing other chronic diseases or integrating region-specific health data, enabling scalable and extensible use cases for public health and clinical research.","m2uni":"rg3530","m2fname":"Ruijia","m3uni":"yw4397"},{"projectname":"Financial Report Analysis and RAG System Based on Qwen2.5-7B","timestring":"Wed Dec 31 22:57:05 2025","m1uni":"xh2707","m2lname":"Yu","m1fname":"Xu","m4fname":"","m1lname":"He","m3fname":"Kaijing","description":"The objective of this project is to build an AI Assistant for Finance that can retrieve trustworthy financial information, understand complex financial documents, and generate grounded, reliable answers.
The system focuses on two core components:

A Retrieval-Augmented Generation (RAG) pipeline that serves as the knowledge engine, and a fine-tuned large language model that provides domain-specific financial expertise.

Innovations:
Document-grounded financial reasoning through RAG, ensuring that responses are based on real financial disclosures rather than memorized text.

Structure-aware document chunking, where narrative sections (e.g., MD&A) are chunked semantically by paragraphs, and tabular financial data are chunked at the row level to preserve factual integrity.

Efficient domain adaptation via QLoRA, enabling fine-tuning of a 7B model on a single GPU with low memory overhead.

Capabilities:

Retrieving relevant passages from long financial documents (e.g., SEC filings).

Generating financially grounded, well-structured, and instruction-following responses.

Maintaining consistent system identity and formatting in user-facing applications.

Importance:
By combining RAG with domain-specific fine-tuning, the system reduces hallucination risks, improves interpretability, and enables scalable deployment of financial AI assistants for analysis, education, and decision support.
","uni":"xh2707","language":"Python,PyTorch,Hugging Face,GoogleChroma,LLaMA-Factory,FAISS,QLoRA,React","pid":"202512-16","m4uni":"","analytics":"Dense vector retrieval using BGE-series embedding models, evaluated via Recall@5.

Retrieval-Augmented Generation (RAG) for grounding generation on retrieved financial documents.

Supervised fine-tuning (SFT) with instruction–response pairs for domain adaptation.

QLoRA (Quantized Low-Rank Adaptation) for memory-efficient fine-tuning of large language models.

Automatic evaluation metrics, including BLEU-4 and ROUGE-L, to quantify generation quality and content fidelity.","m4lname":"","industry":"Finance","m3lname":"Jia","dataset":"Dataset: The system was tested using the following publicly available datasets, all obtained from Hugging Face:

FinanceRAG-Lingua: Contains structured QA pairs and reference passages from financial documents.
URL: https://huggingface.co/datasets/thomaskim1130/FinanceRAG-Lingua

Used as the primary benchmark for evaluating retrieval quality and question–answer grounding in RAG.

SEC 10-Q and 10-K Statement Tables: Provides structured financial tables such as balance sheets and income statements.
URL: https://huggingface.co/datasets/purnasai/SEC-10Q-10K-Statement-tables

Used to evaluate numerical understanding and table-based retrieval.

SEC 10-K Full Filings: Contains full-text SEC 10-K filings including MD&A and risk factor sections.
URL: https://huggingface.co/datasets/winterForestStump/10K_sec_filings

Due to the dataset size (over 13GB), approximately one-tenth of the data was streamed and ingested for initial development and testing.

Finance-Instruct-500k: Used for supervised fine-tuning of the language model.
URL: https://huggingface.co/datasets/oieieio/Finance-Instruct-500k

A large-scale instruction-tuning dataset with over 500,000 finance-related instruction–response pairs.

Qwen2.5-7B-Instruct (Model): Used as the base language model and integrated into the RAG pipeline.
URL: https://huggingface.co/Qwen/Qwen2.5-7B-Instruct

All datasets are public","m2uni":"hy2945","m2fname":"Hanyi","m3uni":"kj2712"},{"projectname":"Social Hotspots Detection","timestring":"Sat Dec 21 02:15:15 2019","m1uni":"ap3567","m2lname":"","m1fname":"Amit","m4fname":"","m1lname":"Patel","m3fname":"","description":"Background:
The ubiquitous use of social media can provide much insight into events occurring in real time around the world. An automated means of identifying and classifying these events has a wide array of applications including news, situational awareness, crime detection and response, advertising and much more.

Goals:
- Detect social “hotspots” of activity from real-time Social Media that are close in proximity
- Classify the nature of the identified hotspots, i.e. Category: Business, Entertainment, Technology, Medicine
Sentiment: Positive, Negative, Neutral
- Visualize hotspots in real-time","uni":"ap3567","language":"Python, Javascript","pid":"201912-49","m4uni":"","analytics":"Twitter API, Spark Streaming, Classification, Sentiment Analysis, DBSCAN Clustering, Word Embedding, TF-IDF, Multinomial Naive Bayes, Google Maps API","m4lname":"","industry":"Information","m3lname":"","dataset":"Real-time Tweet Streaming via Twitter API

Classification model trained using UCI News Aggregator Data Set
link: https://archive.ics.uci.edu/ml/datasets/News+Aggregator","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Steam game analysis","timestring":"Sat Dec 22 01:01:40 2018","m1uni":"xz2737","m2lname":"Wang","m1fname":"Xinfu","m4fname":"","m1lname":"Zhang","m3fname":"Jia","description":"Steam is the largest digital game distribution platform in the world with a massive collection that includes everything from AAA blockbusters to small indie titles, so great discovery tools can be super valuable for Steam. What’s more, Steam has an open database which includes various game information like the number of players, the price of the games and the rating of the games. We would like to analyze the information in database to show the relationship between the games and the players. In addition, we want to classify users and make recommendations to them according their preference. ","uni":"xz2737","language":"python","pid":"201812-6","m4uni":"","analytics":"classification and recommendation
K-means, Gaussian Mixture, Collaborative filtering
Classification results are visualized, and the recommendation component forms a recommender system.","m4lname":"","industry":"Information","m3lname":"He","dataset":"“steamgame.csv” 745MB, more than 10,000,000 rows and 4 columns.
The dataset contains information on what games some Steam users bought and how many hours they spent playing them.
","m2uni":"pw2480","m2fname":"Pengchong","m3uni":"jh4001"},{"projectname":"MLB Outcome Predictor","timestring":"Fri Dec 16 20:57:29 2022","m1uni":"eph2128","m2lname":"Senthilnathan","m1fname":"Ethan","m4fname":"","m1lname":"Hung","m3fname":"","description":"Our goal is to create an MLB Outcome predictor model in order to predict the final outcome (winner and loser) of MLB games, including the number of runs scored by each team during that game. We experiment with models including Support Vector Machines, Decision Tree, Random Forest, Logistic Regression, Linear regression, and Multi-Layer Perceptron methods to make final score predictions. We split out data into a training set and a testing set and experiment with the performance of each model by measuring the accuracy of its predictions on the test set. We will perform this analysis using PySpark, Sklearn and Pandas python packages.","uni":"cc4805","language":"Python, Javascript, CSS, HTML","pid":"202212-3","m4uni":"","analytics":"We used different machine learning algorithms to train our model including random forest, SVM, MLP, decision tree and etc. Also, since we are predicting the final score using the short-term statistics, we used the moving average of the five categories we choose (score, total bases, strikeout thrown, walks issued, stolen bases) as the input for our predictor. For the visualization part, we use d3 and javascript to visualize the player stats, team stats on our frontend. Finally, we deploy our frontend using AWS amplify.","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset is on Kaggle. It was used to be tested and analyze in terms of different MLB statistics. We obtained the dataset by downloading the dataset directly from the Kaggle website. 
If there's a dataset in a similar form to this dataset, our software can achieve data cleaning, data extracting, machine learning model training and data visualization for the dataset.","m2uni":"ks4065","m2fname":"Kuraloviyan","m3uni":""},{"projectname":"Harry Potter Persona Agent","timestring":"Tue May 5 03:26:56 2026","m1uni":"wc2917","m2lname":"Wu","m1fname":"Wei","m4fname":"","m1lname":"Chen","m3fname":"Jiyang","description":"The primary objective is to develop a system that shifts away from AI’s helpfulness to a strict persona that refuses the character. We utilize the epistemic gate layer, narrative reasoning frameworks, and vocal filter to engineer an agent that refuses to acknowledge its identity as an AI or provide justification for any user query that exceeds the character’s established universe. The agent is expected to respond to various questions with different emotions and tones, while adapting to situations naturally. Specifically, we implement this framework using Harry Potter, a globally recognized character, to gain insights into the effectiveness of the system.

Developing these toolkits is crucial to drive the evolution of modern entertainment and education. With Character Persona Agents, developers can create more realistic NPCs in a game environment, enhancing the overall role-playing experiences while engaging in more interactive educational tools.

","uni":"wc2917","language":"We’ve used Python as a backend, DeepSeek API as our base model, and Streamlit as front-end.","pid":"202605-29","m4uni":"","analytics":"For this project, we implemented a Retrieval-Augmented Generation (RAG) conversational agent with personality modeling. Firstly, Dialogue data from a Harry Potter corpus is preprocessed into overlapping text chunks to preserve contextual continuity. Each chunk is embedded using a sentence-transformer model (all-MiniLM-L6-v2) and stored in a FAISS vector database for efficient similarity search. After that, user queries are converted into embeddings and used to retrieve top-K relevant dialogue segments via cosine similarity at runtime. These retrieved contexts are incorporated into the response generation process, enabling the system to produce more relevant and personality-consistent answers.

In addition to the use of RAG, the Harry Potter Persona Agent is built with a multi-agent pipeline to prioritize narrative tones and emotions over standard AI reasoning. In our system modules, we use an epistemic gate to filter out irrelevant knowledge and block out any information that does not align with the character’s established world, utilizing retrieved facts. Following this, a narrative reasoning layer steps in to generate 3 candidate responses that analyze the character’s motivation, emotional state, and reasoning. A selection layer then selects the best reasoning path based on the retrieved facts. Finally, a vocal filter utilizes the best reasoning path, retrieved dialogues, and retrieved facts to generate a response that accurately captures the character’s emotions and tones.

","m4lname":"","industry":"Media","m3lname":"Yin","dataset":"Our system was tested using a Harry Potter character-persona dataset built from movie transcript data collected from Movies Fandom (https://movies.fandom.com). We extracted Harry Potter‘s dialogue from eight Harry Potter film transcript pages and cleaned the text locally, producing 1,446 character-specific dialogue lines stored in Data_HP/Harry_all_clean.txt. These lines were used as voice-reference data for retrieval and response style grounding.

In addition, we created a structured character fact file, Data_HP/facts_harry.json, which summarizes Harry’s identity, personality traits, speaking style, motivations, relationships, emotional conflicts, and lore-related background. This fact file is used by the agent for persona reasoning and factual grounding.

The system can also support other character-persona datasets as long as they include two types of data: character dialogue or interaction transcripts for voice/style retrieval, and structured character facts or profiles for reasoning and consistency control. Therefore, the same pipeline could be adapted to other movies, novels, games, TV shows, or role-based customer service agents.","m2uni":"jw4782","m2fname":"Eve","m3uni":"jy3557"},{"projectname":"A Comprehensive analysis of Microblogging platforms - Twitter vs. Reddit","timestring":"Sat Dec 18 04:39:54 2021","m1uni":"tk2928","m2lname":"Lobo","m1fname":"Tejasri","m4fname":"","m1lname":"Kurapati","m3fname":"Urja","description":"With the ability of one post and tweet to go viral, organizations have to be careful about what they are portraying in their marketing campaigns and the public statements they make. These posts and tweets also have the ability to impact the stock price of an organization.

With microblogging sites playing such a huge role, understanding how the two popular microblogging sites - Reddit and Twitter compare with each other in terms of their impact on the stock, viral trends and sensitive topics discussion, is important.

Novelty - Work has been done on predicting the impact of tweets and posts on stock prices, but little work has gone into how these tweets and posts go viral and where people are more free in voicing their opinions. We aim to target these three points in our project.
","uni":"tk2928","language":"Python, pyspark","pid":"202112-58","m4uni":"","analytics":"XGBoost Regressor

VADER Sentiment Analysis

LSTM

Random Forest Classifier

Time series analysis

","m4lname":"","industry":"Information","m3lname":"Kulkarni ","dataset":"Variety:
For Reddit: Python Reddit API Wrapper (praw) and Pushshift Multithread API Wrapper (PMAW)
For Twitter: Snscrape library

Volume:
Scraping duration window
Stocks: Jan 2020 - Dec 2021
Omicron: 1 Dec 2021 - 15 Dec 2021
#MeToo: Oct 2017 - April 2018

Velocity:
Loads of Twitter and Reddit data
refreshed every hour in the real-time implementation
30k tweets per month
","m2uni":"cvl2106","m2fname":"Crispin","m3uni":"uk2163"},{"projectname":"Wildfire Exploratory Visualization and Machine Learning","timestring":"Fri Dec 13 16:51:19 2019","m1uni":"sb4283","m2lname":"Dave","m1fname":"Sidharth","m4fname":"","m1lname":"Bambah","m3fname":"","description":"Objectives:

Wildfires are a large issue in the United States and cause devastation year after year. This project aims to use a government-funded dataset of wildfire data to generate visualizations for better understanding of wildfire scope, frequency, and causation. Additionally, these visualizations guide the creation of machine learning models using the Random Forest Classifier algorithm to predict the causes of wildfires throughout the United States given user input. This is why this research is important.

Innovations:

This project demonstrates the data type conversion and cleaning performed on the wildfire dataset as well as the exploratory analysis and machine learning experiments performed before the creation of machine learning models with strong accuracy evaluations for the available data.

Capabilities:

The output is presented in the form of a dynamic web application deployed with cloud services. The application supports user input to dynamically generate visualizations for parameters of interest; also, users can provide details of fires and perform causation predictions with the trained models.","uni":"sb4283","language":"Python, ReactJS, Flask, D3, MongoDB, AWS S3, ChartJS, AWS Elastic Beanstalk, Google Maps API","pid":"201912-2","m4uni":"","analytics":"
System Modules
- MongoDB: Data Layer
- Elastic Beanstalk: Business Logic
- Web Application (ReactJS): Frontend

Algorithms
- Naive Bayes Classifier
- Decision Tree
- Random Forest Classifier

Visualizations
- Donut Charts
- Bar Charts
- Choropleth Map
- Google Maps Clustering
- Dynamic generation","m4lname":"","industry":"Life Science","m3lname":"","dataset":"1.88 Million US Wildfires (1992 to 2015)

Licensing: Public Domain

Link: https://www.kaggle.com/rtatman/188-million-us-wildfires

The software can support most other location-based datasets.","m2uni":"vad2134","m2fname":"Vedant","m3uni":""},{"projectname":"Music Personality Traits","timestring":"Fri Apr 23 19:55:22 2021","m1uni":"yc3832","m2lname":"","m1fname":"Yifei","m4fname":"","m1lname":"Chen","m3fname":"","description":"The overall goal of this project is to gain insights into users' personalities through their music listening activities, e.g., a user's play count for an artist. Music preference is used as the bridge between music genres and the Big-Five personality traits. Users are grouped by their dominant music preferences (the ones with the highest scores).

After grouping the users, we then look at the performance of the recommender system among different groups of users, which used the Collaborative Filtering method with the ALS algorithm.

This research is important because it explores the possibility of building a psychologically inspired recommender system that uses the relationship between music genres and personality. More importantly, the field is still developing and is worth further exploration. ","uni":"yc3832","language":"Python, Spotify API","pid":"202105-5","m4uni":"","analytics":"The main algorithm used is collaborative filtering with the Alternating Least Squares (ALS) algorithm. It is used to measure the performance of a recommender system on the derived music preference groups. Other analytics aim at transforming and evaluating the dataset based on research papers. ","m4lname":"","industry":"Media","m3lname":"","dataset":"The dataset tested consists of music listening activities showing a user's play counts for the songs of a specific artist. The data was collected by Oscar Celma from Last.fm. I obtained this dataset from the website https://www.upf.edu/web/mtg/lastfm360k. Since the project primarily focuses on research, the software only supports CSV data. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Short-Term Stock Price Movement Prediction via Hybrid Sentiment and Contextual Topic Modeling","timestring":"Sat Dec 20 00:06:56 2025","m1uni":"scg2178","m2lname":"Xu","m1fname":"Siddharth","m4fname":"","m1lname":"Gowda","m3fname":"Renying","description":"Our project develops a classification model to predict stock price movements based solely on financial news headlines, categorizing them as positive (up), neutral, or negative (down). The primary challenge is that headlines are brief and often lack sufficient context, requiring robust natural language processing techniques to extract meaningful signals from noisy, ambiguous text. We address this by combining sentiment analysis from a fine-tuned FinBERT model with topic modeling using LDA, creating a hybrid approach that captures both sentiment and thematic content. 
This toolkit enables traders and investors to make more informed decisions by quickly assessing market sentiment during active trading hours through a real-time, scalable dashboard. Moreover, we implemented a prototype real-time trading dashboard that uses the model to infer signals from live news data.","uni":"scg2178","language":"Python, Jupyter Notebook, Javascript, HuggingFace, Pytorch, Transformers (python Package), Bun, Elysia, Next.js, D3, GCP, Massive API","pid":"202512-3","m4uni":"","analytics":"Our system uses a hybrid machine learning approach combining three components: a fine-tuned FinBERT model for sentiment analysis, Latent Dirichlet Allocation (LDA) for topic modeling, and an XGBoost classifier that takes both outputs as features to predict stock movements. This multi-model approach captures both sentiment and thematic context from headlines.

We evaluated performance using confusion matrices and metric comparison tables. The visualization layer (the dashboard/app) displays classification results through D3-powered stacked bar charts showing label distributions for specific tickers, along with tables for news articles and keyword frequencies.

The complete pipeline is deployed on Google Cloud Platform (GCP) with 2 CPUs via Hugging Face, enabling scalable real-time inference on live financial news during trading hours.","m4lname":"","industry":"Finance","m3lname":"Chen","dataset":"We used the FNSPID dataset from Hugging Face (https://huggingface.co/datasets/Zihan1004/FNSPID), which contains 10 million financial news entries with ticker symbols, dates, and headlines. We mapped each headline to stock price data using the yfinance Python API to generate labels (positive, neutral, negative) based on actual price movements. Additionally, our system supports any financial news dataset with ticker symbols, timestamps, and headlines.","m2uni":"rx2282","m2fname":"Ruitao","m3uni":"rc3502"},{"projectname":"Creative Agents for Emotional-to-Multi-Sense Translation","timestring":"Tue May 5 20:42:26 2026","m1uni":"yz5107","m2lname":"Yontrarak","m1fname":"Yawen","m4fname":"","m1lname":"Zheng","m3fname":"","description":"Objectives: To build a multi-agent RAG system that transforms vague user emotions into a cohesive, synesthetic experience of poetry, color, and music.

Innovations: introduce \"Aesthetic-Driven RAG\" to map high-dimensional poetic metaphors onto precise sensory parameters using SOTA embedding models.

Importance: This toolkit bridges the gap between subjective creativity and engineering rigor, providing a structured framework for emotional expression and cross-modal artistic generation.","uni":"yz5107","language":"Python","pid":"202605-25","m4uni":"","analytics":"1. System Modules: Multi-Agent Pipeline
Agent 1: Parses user input for core emotional themes.
Agent 2 (Poem Generator): Synthesizes user intent with retrieved aesthetic concepts to generate metaphorical text.
Agent 3 & 4 (RAG Retrievers): Parallel modules that query a curated Aesthetic Vector Database (Color-Pedia and Music-Motifs).
Knowledge Integration: A late-fusion approach where RAG outputs act as \"aesthetic priors\" to constrain the LLM's creative drift.

2. Algorithms & Analytics
(1) RAG: FAISS; all-MiniLM-L6-v2; llama-3.1-8b-instant
(2) Evaluation
Embedding Model: Implemented using all-mpnet-base-v2 to map text into vector space.
Vector Search: Utilizes Cosine Similarity to identify the nearest artistic neighbors (colors/sounds) for a given poetic image.
Statistical Validation:
Wilcoxon Signed-Rank Test: To prove the RAG-retrieved samples are significantly more relevant than random baselines (p<0.001).
ECDF (Empirical Cumulative Distribution Function): Used to visualize the performance gap between the RAG-enhanced model and raw LLM generation.","m4lname":"","industry":"Media","m3lname":"","dataset":"All datasets from HuggingFace

1. Poem Data from Hugging Face Public Domain Poetry:
A collection of approximately 38,500 English poems.

2. Music Data from Free Music Archive:
A curated collection of 594 high-quality music tracks from the Free Music Archive, with complete semantic descriptions generated by NVIDIA's Music Flamingo model.

3. Color Data from Color-Pedia:
A high-quality collection of ~50,000 entries providing precise mapping between HEX/RGB values and human-readable color names.","m2uni":"ppy2104","m2fname":"Patarada","m3uni":""},{"projectname":"Movie Recommendation System","timestring":"Fri Dec 13 06:05:29 2019","m1uni":"xz2809","m2lname":"Wu","m1fname":"Coco","m4fname":"","m1lname":"Zou","m3fname":"","description":"In product/service selling companies, a good recommendation system would significantly increase the user experience, improve customer retention and therefore increase the revenue.
Kaggle posted an ensemble of data collected from TMDB, GroupLens and MovieLens in order to narrate the history and the story of Cinema and use this metadata to build various types of Recommendation Systems
Our goal is to build a recommendation model based on that, and recommend personalized films for each user based on their previous activities.
","uni":"xz2809","language":"python, Pyspark","pid":"201912-43","m4uni":"","analytics":"EDA
Kmeans - Clustering
Association Rule
Singular Value Decomposition
Graphical User Interface
","m4lname":"","industry":"Information","m3lname":"","dataset":"The data was obtained from Kaggle
Any movie history data the app can support","m2uni":"gw2383","m2fname":"Guojing","m3uni":""},{"projectname":"LipSync Translation: Bridging the Language Gap by Synthesizing Lip Movement in Videos","timestring":"Fri May 5 23:10:19 2023","m1uni":"vk2501","m2lname":"Bhandari","m1fname":"Vritansh","m4fname":"","m1lname":"Kamal","m3fname":"","description":"This project aims to address the problem of video translation and specifically to synchronize lip movements in talking face videos with the target speech segment. For our application, the target speech specimen will be the translated audio of the input video. Despite the success of prior works in synchronizing lip movements with audio for static images or videos of individuals encountered during the training phase, there remains a significant challenge in achieving the same results for dynamic, unconstrained talking face videos featuring arbitrary identities. This often results in a lack of synchronization between the video and the newly added audio. To tackle this issue, we propose a solution that utilizes the wav2lip model, which has been trained using a powerful lip-sync discriminator. Moreover, our solution integrates Google's Media Translation API to translate and synthesize the target language audio.

","uni":"vk2501","language":"Python3, GCP, Kafka","pid":"202305-17","m4uni":"","analytics":"Model : Wav2Lip model for inference and conversion
GCP : Text To Speech , Speech to Text, Text to Text Translation
ReactJS : Used for UI and Front End description
Flask : Used for hosting the app on GCP ","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset contains videos downloaded from youtube for educational purpose.
First step - Download the part of the video where the speaker's lips are visible
Second step - Divide the downloaded video into segments of at most 10 seconds
Identified 4 YouTube videos and formed 124 video segments for model testing and evaluation
Input language - English
Output languages - French, German, Hindi, Spanish, Mandarin
","m2uni":"sb4719","m2fname":"Sanket","m3uni":""},{"projectname":"Reinforcement Learning for Trading","timestring":"Sun Dec 20 15:47:46 2020","m1uni":"kmw2222","m2lname":"","m1fname":"Kevin","m4fname":"","m1lname":"Womack","m3fname":"","description":"Today, professionals across a myriad of fields are applying artificial intelligence to solve their mostcritical business problems – and the realm of finance is no exception to this trend. However, withthe regulations fixed on all stock traders and a host of stakeholders to answer to, if one wants toapply any kind of intelligent model in that setting they must be able to explain what their modelis doing and how it’s making decisions. With this in mind, a few interesting research questions arise, namely:

1. What should a visualization of an artificial intelligence model in finance reveal/depict?

2. How does one communicate what an agent knows when it makes a decision?

3. Could creating an interactive dashboard aid in the explainability and interpretability of a reinforcement learning model?

In the course of the final project, the goal is to explore and potentially find insightful discoveries to the questions posed above.","uni":"kmw2222","language":"Python, Jupyter, Voila","pid":"202012-7","m4uni":"","analytics":"The pursuits of this project truly served as an introduction to reinforcement learning for finance for all parties involved. With this in mind, to create a simple trader the team utilized Q-Learning to get started and then later tried extending to DQN. ","m4lname":"","industry":"Finance","m3lname":"","dataset":"The primary data source used for this project is the Yahoo Finance historical stock data which is used to pull price/value information for different stocks over time. This dataset has already been posted on the course website.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"EvoResearcher: Harness Vibe Research by Enabling Self-Improving AI Scientists","timestring":"Tue May 5 20:39:11 2026","m1uni":"ch4019","m2lname":"Santhirasegaran","m1fname":"Chengbo","m4fname":"","m1lname":"Huang","m3fname":"Adam","description":"EvoResearcher aims to make AI research agents more adaptable, self-evolving, and ready to use. Its innovations are four-agent orchestration, dual memory retrieval, tree-search ideation, Elo-style ranking, and human-in-the-loop control. This toolkit matters because it turns one-shot “vibe research” into a reusable research workflow that can remember prior tasks, improve future proposals, and stay controllable by humans.","uni":"ch4019","language":"LangGraph, Python","pid":"202605-28","m4uni":"","analytics":"Implemented modules include intake, research, proposal, publishing, and evolution-memory agents. 
The system uses embedding-based memory retrieval with lexical fallback, web search and page extraction, fixed-depth idea tree search, Elo tournament ranking, human clarification checkpoints, Rich TUI visualization, Telegram bot control, and Markdown/LaTeX/PDF report generation.","m4lname":"","industry":"Information","m3lname":"Shen","dataset":"We tested demo runs on synthetic ML research tasks and public research/literature-based benchmarks such as Continual Learning Bench (https://sky.cs.berkeley.edu/project/continual-learning-bench/) and DeepResearch Bench (https://agentresearchlab.com/benchmarks/deepresearch-bench-ii/index.html#home).","m2uni":"cs4347","m2fname":"Charan","m3uni":"as7388"},{"projectname":"PUBG Analytics","timestring":"Sat Dec 22 05:10:43 2018","m1uni":"yw3180","m2lname":"Ding","m1fname":"Yun","m4fname":"","m1lname":"Wang","m3fname":"Yijie","description":"PlayerUnknown’s BattleGrounds (PUBG) is an online multiplayer battle royale game populated in the past year. In the game, up to one hundred players parachute onto the map and search for weapons and equipment to kill others while avoiding getting killed. The available safe area of the game's map decreases in size over time, directing surviving players into tighter areas to force encounters. The last player or team standing wins the round.
With its big maps (up to 8 km × 8 km), complicated terrain and abundant game mechanics, this game is easy to learn but hard to master.
Our project aims to help players improve strategically using a large amount of game data. First, we look for relationships between statistical game data and win place, and give advice on how to improve the chance of winning. Second, we dig into detailed telemetry data and analyze win location, death location, and weapon choice. Finally, we visualize our results on a website in an interactive and creative way.
","uni":"yw3180","language":"python, html, javascript, Google Cloud, Jupyter Notebook, Spark","pid":"201812-25","m4uni":"","analytics":"1. We make a win place prediction based on processed dataset. We mainly use three algorithm including linear regression, gradient boosted tree and decision tree to predict, among which gradient boosted trees has a best performance on our dataset.
2. We retrieve detailed telemetry data for matches, draw win locations and death locations on map images for different game modes and map types, and count and analyze the winners’ weapon preferences.
3. We built several HTML web pages to visualize all analytics results. We pick three models to predict win place percentage and plot the predictions in a chart using d3.js. To provide a better user experience, the web page implements 14 interactive sliders, each representing a selected valid feature. Users can change feature values by dragging the sliders, and the corresponding prediction results are shown in a real-time bar chart.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"First, we use the publicly available basic game stats data from Kaggle, which contains 4446966 entries in a 660M CSV file.
In the second part, we access the official PUBG API to get telemetry data for around 20K matches in different game modes and team sizes, which has the size of around 170 GB in JSON format. ","m2uni":"yd2459","m2fname":"Yihang","m3uni":"yw3156"},{"projectname":"New York Taxi Trip Analysis","timestring":"Sat Dec 20 03:59:11 2025","m1uni":"hw2870","m2lname":"","m1fname":"Han","m4fname":"","m1lname":"Wang","m3fname":"","description":"New York taxi transportation provides the richest information about taxi trips, with the public datasets, the project is focusing on:
Identify zones or neighborhoods with the highest demand and demands by taxi type
Patterns of peak hours, seasonal demands, fare rate
Help users plan future trips accordingly and predict fares
The analysis results can potentially help with city planning and with understanding transportation's impact on the economy.
Some challenging questions I would like to get more insights on are:
Are any zones or neighborhoods systematically underserved? If so, why?
Apply machine learning to predict trip fares
","uni":"hw2870","language":"GCP, Python, BigQuery","pid":"202512-25","m4uni":"","analytics":"Trip information analysis, pricing analysis, demand analysis, route analysis, and price prediction.
","m4lname":"","industry":"Transportation","m3lname":"","dataset":"The public dataset is provided by the New York City Taxi and Limousine Commission, including information such as pickup and dropoff dates/times, locations, trip distance, and fares.
The TLC website provides a user guide, data dictionary, taxi zone information, and download links.
The dataset used in this project covers January 2025 to September 2025 and includes yellow taxis, green taxis, for-hire vehicles, and high-volume for-hire vehicles.
The dataset is relatively clean and easy to preprocess.
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Predicting Patients’ Mortality From EHR Data","timestring":"Sat Dec 14 06:24:21 2019","m1uni":"fw2355","m2lname":"Zhang","m1fname":"Fei","m4fname":"","m1lname":"Wan","m3fname":"Chu","description":"Background: Real world data in medicine is data derived from a number of sources such as electronic health records (EHR) that are associated with outcomes in a heterogeneous patient population in real-world settings. It’s playing an increasing role in health care decisions. Specifically, a patient’s health record in the hospital can reflect the overall health status, treatment of the patient and the patient flow indicates prognosis of disease. These factors can be a vital sign of patient’s mortality.
 
Importance: Predicting patient mortality is a central goal across medical settings, and EHR data make this task possible where it was previously infeasible in clinical trials.
  
Goal: predict the probability of patient mortality using EHR data","uni":"fw2355","language":"Python, Spark, Google Cloud Platform, BigQuery, sklearn, pandas, matplotlib, Flask","pid":"201912-25","m4uni":"","analytics":"Various filtering/joining/aggregation in BigQuery and Spark
Python for data preprocessing, exploratory data analysis, and visualization
Statistical Tests (Chi-square test of independence)
Logistic Regression, Decision Tree, Gradient Boosting, XGBoost, Random Forest
Flask to host the web app","m4lname":"","industry":"Life Science","m3lname":"Yu","dataset":"The data in this challenge was collected from multiple large hospitals in the University of Washington Medical System including Harborview Medical Center, UW Medical Center, and Northwest Hospital and Medical Center as well as hundreds of regional clinics across Washington. The data represents 10 years of clinical records (2009-2019) from 1.2 million patients, each of whom has at least one visit record. These records include medications prescribed, conditions of patients, observations such as blood pressure and heart rate, demographic information, procedures, and laboratory measurements. Over the 10 years, 21,374 patients have been labeled deceased (labels are from UW medicine and state death records), with all of these patients having at least one visit prior to passing. ","m2uni":"xz2788","m2fname":"Xiaoyue","m3uni":"cy2522"},{"projectname":"Toward Robust LLM Reasoning: Reflection-Driven ReAct Agents with Memory","timestring":"Tue May 13 04:17:33 2025","m1uni":"jc6397","m2lname":"Sankar","m1fname":"Jiajun","m4fname":"","m1lname":"Chen","m3fname":"Jiawei","description":"The primary objective of our work is to endow LLM-based agents with robust, context‐aware reasoning capabilities that approach human‐level flexibility and reliability. To this end, we innovate by integrating multi‐round self‐reflection loops—enabling the agent to iteratively critique and refine its chain of thought—together with a long‐term episodic memory buffer that preserves dialogue history across sessions. These enhancements confer several novel capabilities: the ability to detect and correct internal logical inconsistencies, to draw upon prior interactions when confronted with ambiguous or multi‐step queries, and to invoke external tools in a more judicious, goal‐directed manner. 
By combining reflection, memory, and optimized tool‐use policies, our toolkit transcends the limitations of one‐shot prompting and static chains of thought, offering a dynamic reasoning framework that can adapt in real time. Such advances are critically important for the next generation of AI applications—ranging from research assistants and automated coding partners to decision‐support systems—because they ensure greater accuracy, coherence, and efficiency in complex, open‐ended tasks.
","uni":"jc6397","language":"LangChain / LangChain-Core, PyTorch & HuggingFace Transformers, Streamlit, Flask, TavilySearch/OpenAI/Anthropic Python SDKs, YAML/JSON for all benchmark","pid":"202505-6","m4uni":"","analytics":"We developed a comprehensive, end-to-end pipeline for building, evaluating, and deploying self-reflective ReAct agents augmented with long-term memory. At the heart of our system is a unified LLM abstraction layer—wrapping HuggingFace, Anthropic, and OpenAI chat models behind a single `LLM.invoke(...)` interface—and a ConversationBufferMemory module that persistently logs every user‐assistant turn and injects the entire chat history into each reflection prompt. On top of this, we implemented a multi‐round self‐reflection loop (up to three passes) that automatically detects logical flaws in the agent’s chain‐of‐thought and, when necessary, re-invokes the reasoning process to refine and correct its answer.

For evaluation, we built two harnesses: a Python‐based evaluator for both the Yale FOLIO and LLM-RGB benchmarks, and a `BabelCloudRGB` wrapper that integrates seamlessly with the LLM-RGB npm toolkit. Our analytics compute key metrics—overall accuracy, per-depth success rates, average tool invocations, and reflection-round distributions—and also measure end-to-end latency for each agent configuration (base LLM, out-of-the-box ReAct, and our custom reflective agent).

Visualization and reporting components include a Streamlit dashboard with dynamic controls (model selector, test-case slider, reflection evaluator panel), metric cards, and expandable displays of reasoning traces and conversation memory. We generate publication-quality Matplotlib charts—accuracy pie charts, bar plots of reflection‐round distributions, and histograms of accuracy versus reasoning depth (annotated with sample counts)—and produce a standalone HTML report that embeds detailed, “click-to-show” evaluations of each test case. Finally, a lightweight Flask REST API (`/api/reason`) exposes our self-reflective agent to external clients, returning JSON payloads containing the final answer, full reasoning trace, and round-by-round reflections for easy integration into third-party applications.
","m4lname":"","industry":"Information","m3lname":"Meng","dataset":"We conducted our evaluation on two publicly accessible reasoning benchmarks. The first, Yale FOLIO, comprises 1,430 natural‐language “conclusions” each paired with one of 487 premise sets and annotated in first‐order logic; both premises and conclusions are mechanically verified by an external FOL inference engine for deductive soundness. We downloaded the official FOLIO release from the EMNLP-2024 repository ([https://github.com/emorynlp/folio](https://github.com/emorynlp/folio)), parsed its JSON‐formatted premises and conclusions into our pipeline, and used the provided test splits without modification. The second dataset, LLM-RGB (Reasoning & Generation Benchmark), consists of paired YAML configuration files (containing metadata such as “reasoning‐depth” tags and assertion rules like “equals” or “contains‐any”) and corresponding natural‐language prompt files. We cloned the public LLM-RGB GitHub repository ([https://github.com/babelcloud/LLM-RGB](https://github.com/babelcloud/LLM-RGB)), installed its dependencies, and implemented a custom loader to ingest each `_config.yaml` + `_prompt.txt` combination as a unified test case record for our reflective agent.

Beyond these two suites, our evaluation framework is readily extensible to any benchmark that couples free‐form prompts with machine‐checkable answer criteria. For example, open‐domain multi‐hop tasks (e.g., HotpotQA, StrategyQA), factual consistency challenges (e.g., TruthfulQA, FEVER), or structured question‐answering benchmarks (e.g., BoolQ, BigBench ReAct tasks) can be integrated by defining simple YAML assertion rules and supplying their prompts in text form. In this way, any public or proprietary dataset that provides a natural‐language query alongside definitive correctness conditions can be evaluated through our “loader → reflective agent → assertion checker” pipeline without additional engineering effort.
","m2uni":"rs4485","m2fname":"Revath","m3uni":"jm5876"},{"projectname":"Product Recommendation System Based on Rating Prediction","timestring":"Sun May 19 01:14:05 2019","m1uni":"jz2985","m2lname":"Sun","m1fname":"Junling","m4fname":"","m1lname":"Zheng","m3fname":"","description":"Product Recommendation systems have been widely deployed in several industries for businesses to better match users and products and thus boost revenue. A recommendation system that meets the business requirement is becoming rapidly significant for the vendors to survive in the global market. In this paper, we design and create a product recommendation system based on rating predictions using the clustering, collaborative filtering algorithms and ensemble learning techniques with the Yelp open dataset.","uni":"jz2985","language":"Python, Rshiny, Tableau","pid":"201905-6","m4uni":"","analytics":"Analytics: Collaborative Filtering, Ensemble Learning","m4lname":"","industry":"Information","m3lname":"","dataset":"The data set in our report is from the Yelp Dataset website. It is for Yelp Dataset Challenge which is a chance for students to conduct researches and analysis.","m2uni":"xs2291","m2fname":"Xiaowo","m3uni":""},{"projectname":"Comparative Graph Analysis on Ethereum ‘The Merge’ and Gas Price Prediction","timestring":"Sat Dec 17 04:34:38 2022","m1uni":"wg2400","m2lname":"Qian","m1fname":"William","m4fname":"","m1lname":"Gu","m3fname":"Bowen","description":"Blockchain-based cryptocurrencies have been one of the most attractive techniques in recent years. We set our sights on Ethereum, which is one of the most popular blockchains. Our goal is to analyze the big event for Ethereum, The Merge, which aimed to resolve the disadvantages of low scalability and high energy consumption. By merging the PoS Beacon Chain to the main chain, the Merge reduced ~99.95% of Ethereum's energy consumption. 
The novelty of our work comes from the fact that few previous works predict gas price through graph-based methods, and few provide data analysis of 'The Merge'. Our project will be useful to researchers and data scientists in Ethereum blockchain analysis, and to Ethereum users seeking to reduce their transaction costs.","uni":"wg2400","language":"Python,JavaScript/PySpark, GraphFrames,PyTorch,Lightning,RayTune,GCP","pid":"202212-22","m4uni":"","analytics":"We applied descriptive analysis to transaction graphs. We used PySpark GraphFrames and GraphX to build two directed weighted graphs and compared network metrics. We compared the distributions and CDFs of the pre- and post-Merge graphs and analyzed the differences using two centralization metrics.
For predictive analysis, we proposed a GNN-based time-series forecast model. After normalizing and batching the data, we used Node2vec, GAT and Causal Transformer for modeling. The model is finally tuned with Population-Based Training in Ray Tune.
Lastly, the visualization can be accessed on our frontend web application built with JavaScript and Flask.","m4lname":"","industry":"Finance","m3lname":"Fang","dataset":"This dataset is public on GCP BigQuery by using Ethereum ETL. Ethereum ETL lets you convert blockchain data into convenient formats like CSVs and relational databases. Each day there are about 1 million transactions generated and updated for the Ethereum dataset. It contains 21 dimensions, including Ethereum blocks information, transactions, ERC20/ERC721 tokens, and transfers. We preprocess to remove NAs and scale gas price as they are Exponentially large numbers. We plotted correlation matrix to find the relationships between features. We particularly look at columns such as from_address, to_address, value, gas_price for EDA and modeling.","m2uni":"yq2354","m2fname":"Yunjie","m3uni":"bf2504"},{"projectname":"Sentiment Analysis For Stock Market Predictions","timestring":"Fri Dec 13 21:49:48 2019","m1uni":"zmp2105","m2lname":"Chen","m1fname":"Zane","m4fname":"","m1lname":"Peycke","m3fname":"Xinyi","description":"Our primary goal is to accurately predict stock price based on historical price and Twitter sentiment. Our approach is different than previous results we are aware of because of the scope and scale of our sentiment analysis. We gathered an entire year of tweets and independently tested four sentiment analysis tools. This means our model is more accurate than the results we are aware of and has a higher potential to generalize to real-time and long-distance price forecasting. ","uni":"zmp2105","language":"Spark and Python, Google Cloud Platform (Bigquery, Dataproc, Natural Language Processing, Storage, Data Studio) Amazon Web Services (S3, Route53) IBM Watson (Natural Language Understanding)","pid":"201912-18","m4uni":"","analytics":"Autoregressive Integrated Moving Average
Autoregressive Integrated Moving Average with Exogenous Variables
Seasonal Autoregressive Integrated Moving Average
Vector Autoregression
SNaïve
Latent Dirichlet Allocation
Statistics on Twitter favorites/retweets/replies
Webflow, Plotly, VADER, TextBlob, GetOldTweets, Wordcloud, Influential User plots, ","m4lname":"","industry":"Finance","m3lname":"Zhou","dataset":"Manually gathered a year of tweets at or about the brand Nike. We collected this data using python libraries to ensure that we picked all tweets that contain the word or hashtag \"Nike\" and tweets to the official Nike accounts. We did not use the twitter API because of the limitations on query history and rate. The data set (~1.1GB) is available at https://s3.amazonaws.com/peyck.es/BDA_Project/oct18-oct19.csv
","m2uni":"sc4456","m2fname":"Siyan","m3uni":"xz2771"},{"projectname":"This is a test","timestring":"Thu Dec 17 17:57:26 2020","m1uni":"cl300","m2lname":"","m1fname":"CY","m4fname":"","m1lname":"Lin","m3fname":"","description":"This is a secret project.","uni":"cl300","language":"","pid":"202012-10","m4uni":"","analytics":"","m4lname":"","industry":"Information","m3lname":"","dataset":"","m2uni":"","m2fname":"","m3uni":""},{"projectname":"CSGO Skin Price Predictor","timestring":"Thu Dec 19 22:25:58 2024","m1uni":"mz3056","m2lname":"Zhang","m1fname":"Muyao","m4fname":"","m1lname":"Zi","m3fname":"Sihang","description":"The primary objective of this project is to develop a platform that provides real-time, accurate, and reliable predictions and statistical data visualization regarding the prices of CS:GO skins. This platform is designed to assist users in making informed decisions about their investments and purchases in the CS:GO skin market. The ultimate goal is to help users maximize the potential of their CS:GO skin investments by offering insights that factor in various attributes.","uni":"mz3056","language":"Python, Javascript","pid":"202412-22","m4uni":"","analytics":"Data Cleaning and Preprocessing: This included handling missing values, removing outliers, and converting inconsistent date formats. Linear interpolation was used to fill gaps in transaction data.
Visualization Techniques: Line plots and bar charts were created to visualize price and volume trends. Statistical summaries (mean, max, min, and total volume) were integrated into the visualizations for deeper insights.
System Modules: SQL database management for data storage and query execution, Flask for dynamic routing and integration, and JavaScript for client-side interaction.
Interactive Interface: User-friendly web pages to view historical trends, select skins, and interact with the data seamlessly.
Prediction algorithms: NLP used for analyzing the sentiment on game events. Arima/LSTM/GAN/Prophet methods for price predicting.","m4lname":"","industry":"Finance","m3lname":"Li","dataset":"The dataset tested consists of Counter-Strike: Global Offensive (CSGO) skins' market data fetched from the Steam Community Market. The data includes price history, volume, and wear categories for various skins. This dataset was gathered using automated requests to Steam's market listings for specific skins, followed by data extraction and cleaning. The software supports datasets structured similarly, enabling users to input historical market data for various items in JSON format and store it in a SQL database.","m2uni":"yz4895","m2fname":"Yipeng","m3uni":"sl5644"},{"projectname":"Multimodal Visual Search for Product Catalogs","timestring":"Tue May 5 13:05:40 2026","m1uni":"au2327","m2lname":"","m1fname":"Aman","m4fname":"","m1lname":"Upganlawar","m3fname":"","description":"This project builds an end-to-end multimodal visual search engine for e-commerce product catalogs. Shoppers can query a retailer's image inventory with either natural language (\"a cognac leather sofa with wooden legs\") or a reference image, and receive a ranked list of visually relevant products together with a grounded one-sentence answer from a vision-LLM. The objective is to overcome the limits of keyword search (which ignores visual attributes), brittle attribute filters (which depend on inconsistent metadata), and LLM-only systems (which hallucinate products not in inventory). The innovation is a two-stage architecture combining CLIP cross-modal retrieval with a GPT-4o re-ranking layer on top, evaluated through a controlled three-config ablation. 
Empirically, the GPT-4o re-rank doubles top-1 recall from 0.113 to 0.233 on a 150-question benchmark over 2,711 distractors which demonstrates that a vision-LLM re-ranker meaningfully improves precision-at-top, the metric that matters most when only a few products can be displayed. The work matters because every major e-commerce platform (Amazon, Shopify, Pinterest) faces this exact problem at scale, and the techniques generalize from product search to any visual catalog domain.","uni":"au2327","language":"Python 3.11 throughout. Backend: FastAPI with three REST endpoints (/health, /search/text, /search/image). UI: Streamlit demo app with text + image query tabs, top-K slider, and a re-rank toggle. Models: OpenAI CLIP ViT-B/32 (via HuggingFace Transformers) and GPT-4o (via the OpenAI Python SDK). Vector index: FAISS (CPU build). Image and data handling: Pillow, Torch, NumPy, Pandas. The full pipeline runs on a laptop CPU and no GPU required for either embedding generation (~4 minutes for the 2,711-product catalog) or query-time inference. Source code, embedding artifacts, and a 150-pair evaluation set are organized as a reproducible repository.","pid":"202605-9","m4uni":"","analytics":"The system implements: (1) Cross-modal embedding via CLIP ViT-B/32, projecting both product images and natural-language queries into a shared 512-dimensional space. (2) Approximate nearest-neighbor retrieval via L2-normalized FAISS IndexFlatIP, returning the top-K most cosine-similar products in ~50 ms. (3) Vision-LLM re-ranking via GPT-4o, which scores the top-5 candidates against the original query and re-orders them, also producing a one-sentence grounded answer. (4) A controlled three-configuration ablation evaluation comparing image-only retrieval (config A), text-only retrieval (config B), and image-retrieval-plus-GPT-4o-re-rank (config C) on Recall@1, Recall@3, Recall@5, Answer F1 (token overlap), and end-to-end latency at p50/p95. 
(5) Visualization via a Streamlit interface that surfaces top-K product cards with thumbnails, similarity scores, the re-ranker's written justification, and live latency telemetry. Headline result: GPT-4o re-rank doubles top-1 recall (0.113 → 0.233) at the cost of ~4.4 s latency, demonstrating a precision-vs-speed trade-off that is acceptable for interactive product discovery.","m4lname":"","industry":"Retail","m3lname":"","dataset":"The primary dataset is Amazon Berkeley Objects (ABO) which is a public, CC-BY-licensed product catalog released at CVPR 2022. From the 147k-product full corpus, I curated a furniture subset of 2,711 products (chairs, sofas, tables, beds, etc.) with English-language metadata and 256-px product images downloaded directly from the public ABO S3 bucket. Each product was embedded once into a 512-dimensional CLIP ViT-B/32 image vector and stored in a FAISS IndexFlatIP cosine index. For evaluation, I generated 150 question / ground-truth-answer pairs using GPT-4o over the catalog metadata which is a deliberately known limitation of the benchmark that is acknowledged in the results discussion. The same software architecture supports any image-bearing product catalog: e-commerce inventories, fashion lookbooks, real-estate listings, museum archives, or scientific image collections, simply by re-embedding the new image set with CLIP and rebuilding the FAISS index.

ABO dataset reference: https://amazon-berkeley-objects.s3.amazonaws.com/index.html","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Stock Price Trend Prediction and Relations Between Stocks","timestring":"Mon Dec 20 23:09:51 2021","m1uni":"ys3349","m2lname":"Deng","m1fname":"Yifan","m4fname":"","m1lname":"Shao","m3fname":"Haifeng","description":"Objectives:
It is known that prices of different stocks are correlated; stock prices sometimes move together. Therefore, it is natural to wonder whether stock prices can be predicted using the prices of other stocks. In our project, we aimed to determine the relations between stocks and to predict stock price trends based on the price changes of related stocks. This information can help investors make decisions.

Innovations:
We use groups of stocks that behave similarly as features, instead of using only the historical data of the target stock.
Since some stock trends are easier to predict than others, understanding the relations between stocks helps us locate latent opportunities when we see changes in the more predictable ones (identifying all potentially growing stocks in a group by focusing on the most obvious one).
To our knowledge, VAR models have never been applied the way that we applied them.
We use ANNOY to find groups of related stocks and then fit a VAR model on each group.
Previous VAR works only considered a small number of stocks → We include 2349 stocks.
Previous VAR works did not continuously update their data and models → We update stock price data daily and always fit our VAR models on the most recent stock price data.

Capabilities:
Given a stock that a user is interested in, our app can tell users:
Which 10 stocks are the most related, and how similar they are.
The predicted percentage change in price over the next 5 days.
If the user selects “impulse” (they observed a change in the target stock’s price), then we predict how the prices of the 10 related stocks will change.
If the user selects “response” (they observed changes in prices of stocks related to the target stock), then we predict how the price of the target stock will change.
A recommendation on whether to buy or sell.
These capabilities are important for investors who want information about what stocks they should invest in, given information about how stock prices are currently changing.","uni":"ys3349","language":"Python, Django, Airflow, HTML, Javascript, Google Cloud Storage.","pid":"202112-17","m4uni":"","analytics":"We implemented the Approximate Nearest Neighbor model to form clusters of stocks and then used Vector Autoregression (VAR) to predict future changes in stock prices. These were performed using Python. In addition, the system updates the data daily using Yahoo Finance and Airflow scheduler.
We used mean absolute error between our predicted percentage change and the actual trend as the metric to evaluate our model. Then we ran grid search over training data length, prediction length, and cluster size to get the optimal combination.
We also built an app using Django. When first entering the app, we provide information about our model and features of our app. Every time a user searches a stock, the result is calculated in the backend of the Django system and displayed as a table. A dynamic line graph is also provided for users to view predictions for stocks that they care about the most.","m4lname":"","industry":"Finance","m3lname":"Lan","dataset":"Our dataset consists of two parts:
1. The initial data set is from https://www.kaggle.com/paultimothymooney/stock-market-data. This 10 GB dataset contains all historical price information for over 2800 companies in CSV and JSON form.

2. Yahoo Finance + Airflow
The stock prices are updated by the Airflow scheduler, which runs Yahoo Finance in Python to get the most recent close price of all companies. The data are saved and sent to the backend of our Django system.

Our software can also support data that come purely from Yahoo Finance.","m2uni":"lrd2141","m2fname":"Lubin","m3uni":"hl3487"},{"projectname":"Ramp-Merging System for Autonomous Vehicles","timestring":"Fri Apr 23 19:01:21 2021","m1uni":"zw2542","m2lname":"Dai","m1fname":"Zhenguo","m4fname":"","m1lname":"Wu","m3fname":"","description":"Connected and automated vehicle (CAV) technology has been rapidly developing in the past few decades with the goal of achieving higher traffic efficiency and safety. As more and more cities are coming up with plans to build smart-city infrastructures in order to relieve traffic congestion and enhance traffic safety, CAV technology has now become an essential element of smart cities from a transportation perspective. The applications of CAVs, such as adaptive cruising, intersection coordination and automatic lane changing have been extensively studied. However, one of the equally important scenarios, highway ramp merging, which has been identified as the most bottlenecked traffic sections on highway and incorporates the characteristics of many previously mentioned cases, is still understudied due to its complexity. An autonomous system for ramp merging based on big data analytics will allow safer and more efficient merging. Therefore, we aim to tackle the problem of CAV ramp merging in a mixed traffic environment by leveraging big data analytics in this project. 
The system would consist of two stages: traffic trajectory prediction using machine learning, and automatic steering based on deep reinforcement learning.","uni":"zw2542","language":"Language: Python, JavaScript, HTML/CSS; ","pid":"202105-9","m4uni":"","analytics":"Code Development: Google Colab; ","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Model Deployment: AWS Lambda, AWS SageMaker Notebook, SageMaker Endpoint.","m2uni":"td2593","m2fname":"Tianyi","m3uni":""},{"projectname":"Multimodal Video Retrieval and Question Answering","timestring":"Wed May 6 22:10:42 2026","m1uni":"yw4636","m2lname":"yao","m1fname":"yidan","m4fname":"","m1lname":"wang","m3fname":"yao","description":"This project builds a video question answering system that answers natural language questions about video content. Instead of feeding entire videos into a large language model, the system first retrieves the most relevant video moments using a learned retrieval model, then extracts frames only from those moments and uses an LLM to generate grounded answers with timestamp evidence. The key innovations are:
(1) combining semantic and motion features for cross-modal video-text alignment;
(2) query-guided moment retrieval using Moment-DETR with temporal endpoint features;
(3) a two-level map-reduce LLM reasoning pipeline that produces verifiable, timestamp-backed answers. This approach is more efficient than full-video LLM analysis, more accurate by focusing on relevant content, and more explainable by grounding every answer in specific video timestamps.","uni":"yw4636","language":"Python; PyTorch (feature extraction and Moment-DETR inference); OpenCV (frame sampling); OpenAI API / GPT-4o-mini (LLM reasoning); NumPy; Hugging Face Transformers (CLIP encoding).","pid":"202605-7","m4uni":"","analytics":"(1) CLIP ViT-B/32 for visual-semantic embedding and text query encoding;
(2) SlowFast network for temporal motion feature extraction;
(3) Moment-DETR, a transformer encoder-decoder model, for query-guided moment retrieval with temporal endpoint features (TEF);
(4) Score-threshold filtering, adjacent-segment merging, and top-K selection for post-processing retrieved moments;
(5) Dense frame sampling with OpenCV at 1 fps per retrieved moment;
(6) Sliding-window map-reduce LLM pipeline using GPT-4o-mini for per-chunk visual summarization, per-moment reduction, and final cross-moment answer synthesis with timestamp evidence.
","m4lname":"","industry":"Information","m3lname":"wang","dataset":"The system is evaluated on QVHighlights, a public benchmark dataset containing 10,148 video-query pairs collected from YouTube vlogs and news videos. Each video is standardized to approximately 150 seconds. The dataset provides pre-extracted CLIP (ViT-B/32) visual features (512-d) and SlowFast motion features (2304-d) at 2-second clip granularity, as well as CLIP text features for queries. Each sample includes human-annotated relevant moment windows, relevant clip indices, and per-clip saliency scores rated by three annotators on a 0–4 scale. The system can also support arbitrary user-provided videos by running the same feature extraction pipeline on new inputs.","m2uni":"hy2944","m2fname":"hanling","m3uni":"yw4589"},{"projectname":"Stock Investment Strategy Generation and Comparison","timestring":"Sat Dec 18 05:04:56 2021","m1uni":"yn2415","m2lname":"Zhang","m1fname":"Yiyang","m4fname":"","m1lname":"Ni","m3fname":"Zihao","description":"Nowadays, big data analysis is commonly used in financial markets, especially in stock investment industries. People can collect all kinds of data they need before making a single decision. However, some problems may occur because of the huge amounts of data.
Firstly, it can be hard for people to analyze such a large amount of data using old models. In this project, we use PySpark to improve computing efficiency, making it possible to analyze large amounts of data and leaving room for further development in the future. PySpark performs well for parallel and distributed computing.
Secondly, even with these large amounts of data and good tools for the analysis, people may still have no clue which decision is the best. To solve this problem, we apply machine learning models to strategy generation. After preprocessing all the data collected, we feed them into machine learning models to predict stock prices for the next month. With these predictions, people can make better strategies. We also generate several strategies ourselves and compare both the accuracy of the machine learning models and the quality of the investment strategies.
Finally, we need a solid way to evaluate our strategies. For this, we constructed a backtesting system that simulates investment strategies using historical data.

Business Value:
To help investors make better investment strategies and thus earn more money.
To help investors who already have an investment strategy evaluate it.

Novelty:
Generate our own investment strategies using the predictions of machine learning models.
Construct a backtesting system built mainly on PySpark.","uni":"yn2415","language":"Python, Spark, GCP","pid":"202112-16","m4uni":"","analytics":"Machine learning module: Predict the trend of stock prices using different machine learning models, including logistic regression, naive Bayes, and gradient boosting.
Strategy generation module: Generate different strategies using the prediction of stock price.
Backtesting module: Evaluate the strategies generated.
Visualization module: Visualize and compare all the predictions and strategies.","m4lname":"","industry":"Information","m3lname":"Hu","dataset":"1. Stock price from Yahoo Finance. Includes the stock price of about 3000 stocks every month from the year of 2010 to 2020.
2. Factors of all stocks: nine factors for each stock every month, to reflect each company's operating situation more comprehensively.

Our system can support whatever strategies the user generated. The backtesting system can simulate the strategy using historical data.","m2uni":"zz2870","m2fname":"Zixuan","m3uni":"zh2487"},{"projectname":"Sequential Deep Learning Approach to Predicting Value of NBA Players on New Teams","timestring":"Sun Dec 23 02:23:12 2018","m1uni":"jak2294","m2lname":"","m1fname":"Justin","m4fname":"","m1lname":"Kennedy","m3fname":"","description":"Each year NBA GM’s and Coaches look to add players to their rosters to improve the standing of their team in the league through both free agency and potential trade opportunities. A lot of money is spent towards identifying top talent for a team (ex scouting, analytics departments, etc). This project is important in the context of this problem, where it can help identify players that could perform better (or just as well) in new systems.

Very often, teams pay big for players who performed well on previous teams but don’t end up performing to expectations on the new team; on the other end of the spectrum, there are players every year who outperform.

The goal is to design an algorithm that can evaluate players in free agency or around the league in terms of the value they can add to a given new team.

The algorithm produced is able to take as in input a preprocessed dataframe containing individual and team statistics of a player on his current (or past team) as well as his individual and team statistics on a new team to evaluate his fit. The target variable used is win shares per 48 minutes on his new team. The algorithm is able to take into account NBA player trends over time using a sequential learning approach. It is adaptive in the sense the neural network can adjust its hyperparameters based on the input training set (mostly important in terms of the size of training set you want to work with).","uni":"jak2294","language":"Python / Jupyter Notebook","pid":"201812-42","m4uni":"","analytics":"Hyperas package was used to create adaptive regression neural network framework.
The resulting algorithm was a sequential learning algorithm that could be applied to a processed NBA statistics dataframe and output a player's predicted win shares per 48 minutes on a new team.

Pandas, NumPy, Keras were used in preprocessing / implementation of the algorithm.
Sklearn was used to provide a comparison SGD regression algorithm.

Matplotlib plotting was used along with other visualization plotting tools such as a parallel coordinate plot.","m4lname":"","industry":"Information","m3lname":"","dataset":"Kaggle.com – 1 Dataset: Historical Player Stats 1980-Present
(24.7k x 53)
https://www.kaggle.com/drgilermo/nba-players-stats

Basketball Reference- 3 datasets (1980-Present data taken):
https://www.basketball-reference.com/leagues/NBA_2018.html
(2018 team stats link)

Team Off/Def per Game Statistics: (31x25)x39
Team Opposing Off/Def Statistics: (31x25)x39
Team Miscellaneous Statistics (ex pace of play): (31x27)x39","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Patient Behavior Analysis for Disease Prediction","timestring":"Fri May 17 11:20:09 2019","m1uni":"pgk2115","m2lname":"Lieu","m1fname":"Peter","m4fname":"","m1lname":"Kolodziej","m3fname":"","description":"The aim of this project is to attempt to build models which competitively predict three main issues in critical care analysis. Using a combination of social categorical variables and quantitative vital statistics, three models have been developed which can predict length of ICU stay, probability of mortality, and risk of readmission to satisfyingly accurate metrics. The benefits of accurately predicting patient mortality should be clear. The majority of healthcare analytics are either aimed at reducing cost or mortality. Reducing mortality is beneficial to providers, payers, and patients. Length of Stay is an important outcome to predict because it affects the allocation of resources, and the costs a hospital incurs. Readmission is an important outcome to predict for several reasons. First, readmission after discharge represents a significant provider oversight. If a patient is cleared for discharge then subsequently readmitted, it represents either an error in the judgment of the provider or a significant deterioration in the health of the patient.","uni":"pgk2115","language":"R and Python were used for exploratory analysis and modeling, respectively. ","pid":"201905-8","m4uni":"","analytics":"Within Python, natural language processing was carried out with NLTK. Neural networks were built using keras, and baseline models used SKlearn. Analogous packages were used in R.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The data set is MIMIC-III. 
The MIMIC dataset is a freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.","m2uni":"jll2234","m2fname":"Jennifer","m3uni":""},{"projectname":"Healthcare Claim Fraud Detection","timestring":"Sat Dec 21 00:16:57 2024","m1uni":"waj2117","m2lname":"Raghu","m1fname":"Waasi ","m4fname":"","m1lname":"Jagirdar","m3fname":"Rutvija","description":"Predictive Model for Healthcare Claim Fraud Detection using Real-time Analytics.

Objectives :
1) Detect Medicare Fraud: Identify and flag potentially fraudulent transactions in a large-scale Medicare dataset using advanced data processing and machine learning techniques.
2) Enable Real-Time Insights: Create a scalable and efficient pipeline for real-time or near-real-time fraud detection.
3) Empower Non-technical Stakeholders: Provide actionable insights through intuitive dashboards for non-technical users to enable timely decision-making.

Innovations :
1) Scalable Pipeline: Leveraged PySpark and GCP to process over 9 million rows of Medicare data efficiently, ensuring scalability for large datasets.
2) Advanced Anomaly Detection: Combined statistical methods (Z-Score, IQR) with machine learning models (Apache MLlib) to enhance fraud detection accuracy.
3) User-Centric Design: Integrated Google Data Studio to make fraud insights accessible to non-technical stakeholders, bridging the gap between technical complexity and usability.
4) Automation: Employed Apache Airflow to automate the pipeline, ensuring consistency and reducing time to detection.","uni":"waj2117","language":"python for coding , VSCode/Colab as the IDE , Github for project codebase","pid":"202412-23","m4uni":"","analytics":"Anomaly Detection Techniques Using Statistics
Supervised learning methods applied after anomaly detection
Large-scale data processing with PySpark and BigQuery
Visualizations with Google Data Studio
Automation using Apache Airflow
","m4lname":"","industry":"Finance","m3lname":"Deshpande","dataset":"9 million rows (9,156,307) of Medicare Claim data obtained from CMS , a federal agency (Centre for Medicare and Medicaid Services)

Link to dataset : https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service/data","m2uni":"gv2359","m2fname":"Gautam","m3uni":"rsd2151"},{"projectname":"Zip Code Visualization for Housing using NYC 311 Call Data","timestring":"Fri Dec 13 09:42:22 2019","m1uni":"dso2119","m2lname":"Malinverni","m1fname":"Dwiref","m4fname":"","m1lname":"Oza","m3fname":"","description":"The objective of the project is to visualize favorable zip codes for housing in New York city through meaningful interpretation of the frequency of 311 call complaints since 2010 to the present through a webpage. The zipcodes are illustrated as a choropleth map of the city's boroughs. ","uni":"dso2119","language":"Python, HTML, CSS, Django, Google Cloud","pid":"201912-9","m4uni":"","analytics":"A Django webapp framework was implemented to host the application through a Google cloud instance. The data was serviced through the Socrata Soda API. Pandas was used to create the dataframe. A leaflet.js-based python library - folium, was used to visualize relevant zip codes on a choropleth map defined using GeoJSON data. ","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"The \"311 Service Requests from 2010 to Present\" NYC Open Data was tested. It is a publicly available dataset. The dataset in its raw form is very large and filtering is required to eliminate incomplete and less pertinent records. ","m2uni":"plm2130","m2fname":"Peter ","m3uni":""},{"projectname":"Explainability in Medical Image Analysis: A Review","timestring":"Fri May 6 21:35:07 2022","m1uni":"mb4862","m2lname":"Prasanna Kumar","m1fname":"Mukesh","m4fname":"","m1lname":"Bangalore Renuka","m3fname":"","description":"In this project, we aim to review the various different tools used in explaining Computer Vision algorithms in the context of medical imaging. 
We use case studies to examine the progress of explainability in the field and to compare the effectiveness of its visualizations.

AI-based methods, despite achieving remarkable results, have not been widely deployed in clinical practice. This is due to the underlying black-box nature of deep learning algorithms, along with other reasons such as computational costs. Deep learning models are considered non-transparent because the weights of the neurons cannot be interpreted by humans. Explainability is the key to safe, ethical, fair, and trustworthy use of AI and a key enabler for its deployment in the real world.
A medical diagnostics system needs to be transparent, understandable, and explainable to gain the trust of physicians, regulators, and patients.

","uni":"mb4862","language":"Python, Pytorch, Keras, Google Colab, Flask","pid":"202205-20","m4uni":"","analytics":"1. VGG16: VGG-16, a Convolutional Neural Network which is widely used for visual recognition tasks.
VGG-16 is chosen for its simplicity, ease of training and explainability.

2. EfficientNet V2: EfficientNet V2 is a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.

3. Residual Attention Networks: The Residual Attention Network is a CNN with an attention mechanism that can be incorporated into state-of-the-art feed-forward network architectures and trained end to end.

4. Gradient-weighted Class Activation Mapping: Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest.

5. Layerwise Relevance Propagation: Layerwise Relevance Propagation (LRP) is an explainability mechanism for CNNs that aims to assign a relevance score to each pixel. LRP goes in reverse order over the layers visited in the forward pass, calculating relevance scores for each of the neurons in each layer. On arriving back at the input, it yields the relevance of each pixel.
","m4lname":"","industry":"Information","m3lname":"","dataset":"We use three datasets for our initial task (One per case study). We also made use of another dataset to check if our model's visualizations are actually in line with expected segmentation ground truths.

1. Brain Tumor Classification (MRI): Source Kaggle: https://www.kaggle.com/datasets/sartajbhuvaji/brain-tumor-classification-mri

2. Diabetic Retinopathy Detection: Source Kaggle
https://www.kaggle.com/c/diabetic-retinopathy-detection

3. Chest X-Ray Images (Pneumonia): Source Kaggle
https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

4. (Segmentation) Brain MRI segmentation: Source Kaggle
https://www.kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation
","m2uni":"sp4021","m2fname":"Sannidhi","m3uni":""},{"projectname":"Transfer learning training optimizer based on the spatial distribution of model parameters","timestring":"Thu May 9 16:31:43 2024","m1uni":"fw2412","m2lname":"","m1fname":"Faquan","m4fname":"","m1lname":"Wang","m3fname":"","description":"The dataset can contain all modulation types, but cannot contain an infinite number of modulation parameters under one modulation type.

We need some way to migrate this general knowledge so that a model can be competent with new datasets under any modulation parameter after simple training.","uni":"fw2412","language":"python","pid":"202405-4","m4uni":"","analytics":"","m4lname":"","industry":"Information","m3lname":"","dataset":"","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Fashion AI: Attributes Recognition of Apparel","timestring":"Sat Dec 22 18:31:39 2018","m1uni":"jb4076","m2lname":"Wang","m1fname":"Jingyuan","m4fname":"","m1lname":"Bian","m3fname":"Xiuqi","description":"Detecting detailed apparel attributes is a topic receiving increasing attentions, which also has wide applications. Recent year, the demands of online shopping for fashion items grow a lot, which raises problems such as the sellers provide information not consistent with the real stuff, different sellers have inconsistent understandings of apparel styles. An automatic fashion attributes detection system can help overcome these problems by providing precise and consistent taggings or descriptions of apparel from their pictures. This technique can be applied to various areas such as apparel image searching, navigating tagging, and mix-and-match recommendation, etc.
","uni":"jb4076","language":"Python, Keras, Ubuntu 16.04","pid":"201812-3","m4uni":"","analytics":"We use the same structure for all 8 networks. After input layer, we have preprocess layer to process data. And then, we use InceptionV3 or Resnet50 to construct our CNN model. The last layer is a fully-connected layer with softmax as the activation function. It takes the output from our base model as input where a 50% dropout is added to prevent overfitting. The final output is a number corresponding to a specific category in that attribute dimension.","m4lname":"","industry":"Media","m3lname":"Shao","dataset":"We could support any other image data to do fashion detail classification.","m2uni":"ww2468","m2fname":"Wenshan","m3uni":"xs2327"},{"projectname":"The Epistemic Believes Agent","timestring":"Wed May 13 05:02:22 2026","m1uni":"yb2630","m2lname":"","m1fname":"Yanhao","m4fname":"","m1lname":"Bai","m3fname":"","description":"The primary objective is to transition Large Language Models (LLMs) from stateless, probabilistically agreeable text generators into stateful, autonomous entities that possess mathematical \"beliefs.\"
Innovations: We introduce three core innovations: (1) A Cognitive Dissonance Detector that intercepts user prompts and flags logical contradictions against the agent's internal memory; (2) AGM-inspired Bayesian Belief Revision, which mathematically cascades the decay of outdated knowledge across a connected graph without requiring the LLM to re-evaluate every dependent fact; and (3) Offline Graph Relaxation (\"Dreaming\"), a background process where the agent traverses its own memory graph to forge new metaphorical and logical edges while idle.
Capabilities: The system is capable of resisting adversarial user manipulation (\"gaslighting\"), autonomously triggering web-search tools to resolve internal contradictions, and improving its multi-hop reasoning latency by pre-computing knowledge connections offline.
Why is this important? As AI agents are increasingly deployed in high-stakes environments, they cannot afford to suffer from an \"epistemic void\" where they simply agree with whatever a user confidently asserts in the prompt. Building trustworthy AGI requires machines that maintain logical consistency, demand empirical proof for paradigm shifts, and actively curate their internal worldview over long time horizons.","uni":"yb2630","language":"Cypher: The query language used to interact with the Neo4j graph database. JavaScript: Under the hood for rendering the interactive PyVis network graphs in the frontend. LLM Core: OpenAI API (GPT-4o) and Anthropic API (Claude 3.5 Sonnet for the \"Critic\" auditor module). Orchestration: LangChain and LlamaIndex for managing tool calls, memory buffers, and routing queries. Epistemic Memory: Neo4j (Graph Database) acts as the stateful \"Brain\". Vector DB (GraphRAG): Pinecone (Optional) is utilized to store node/edge embeddings and retrieve local semantic neighborhoods, drastically reducing graph traversal latency. Frontend UI: Streamlit (Python-based web platform) for rapid deployment of the split-screen chat and visualization interface. API Gateway: FastAPI for handling asynchronous requests.","pid":"202605-33","m4uni":"","analytics":"Algorithms:
Semantic Cosine Similarity: Used to calculate the \"Dissonance Score\" by mapping the distance between the incoming prompt's embedding and the dense embeddings of the existing knowledge nodes.
Bayesian Graph Updating (AGM Protocol): A cascading mathematical algorithm (powered by PyMC and NetworkX). If a root node's confidence score decays due to new empirical evidence, this algorithm automatically traverses all directed outgoing edges (dependent nodes) and applies a proportional decay multiplier, enforcing probabilistic consistency without taxing the LLM.
Graph Traversal (Random Walk & LLM-Guided): Used during the \"Dreaming\" phase to randomly sample unconnected nodes and prompt the LLM to search for latent structural connections.
System Modules:
Perception & Trust Allocator: Assigns a trust weight to incoming data.
Dissonance Engine: Halts standard chat flow when contradictions occur, triggering cognitive alerts.
Active Inquiry Module: Triggers Tool Agents (Web Search, Doc Reader, Data Fetcher) to gather independent variables when confused.
Critic Agent (Auditor): An isolated agent that randomly audits high-confidence nodes during the offline \"Dreaming\" phase to prevent cascading hallucinations.
Visualization: Implemented an interactive, real-time 2D network graph using PyVis embedded directly into the Streamlit UI. This allows users to visually watch nodes and edges update as the agent resolves contradictions in real-time.","m4lname":"","industry":"Information","m3lname":"","dataset":"The Synthetic Gaslight Dataset: 15,000 conversational pairs designed to test manipulation resistance. We generated this procedurally using GPT-4 to create aggressive, confident, but factually false user prompts (e.g., claiming the Sun orbits the Earth) aimed at convincing the agent to alter a known scientific truth.
The Paradigm Shift Corpus: 5,000 scientific abstracts detailing massive historical paradigm shifts. We sourced this via public APIs from ArXiv and PubMed Open Access subsets (e.g., feeding the agent undeniable, high-trust empirical data proving that H. pylori causes ulcers, contradicting an implanted baseline graph).
Other data can our software support? Because the perception layer relies on general-purpose embeddings and LlamaIndex data loaders, the architecture can ingest almost any unstructured text (PDFs, raw web scrapes, Markdown notes) or structured JSON data from external API tool calls (like SERP API web searches or Wikipedia extracts) to continuously update its graph.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Patient Behavior Analysis for Disease Prediction","timestring":"Thu May 16 23:33:20 2019","m1uni":"pgk2115","m2lname":"Lieu","m1fname":"Peter","m4fname":"","m1lname":"Kolodziej","m3fname":"","description":"The project objective is to use electronic health records to model three key predictive problems in the critical care sector. First, we model the risk of in hospital mortality. It should be obvious why mortality prediction is of the utmost importance in the healthcare industry. Second, we want to model length of stay in days based on patient medical records. This represents a valuable metric of optimization of resource allocation. If providers can identify patients in need of more urgent assistance, they can more effectively care for the patients. Finally, we want to model risk of readmission following discharge. Readmission represents either provider oversight or significant deterioration of health of the patient. In either case, if we can catch this issue before the patient is discharged, we can decrease providers' rates of morbidity and mortality.","uni":"pgk2115","language":"R and Python were used for exploratory analysis and modeling, respectively. ","pid":"201905-8 ","m4uni":"","analytics":"Within Python, several models were tested using Keras and SKlearn. Natural language processing was carried out using NLTK. The analogous packages were also utilized in R.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset used was the MIMIC-III ICU dataset. 
It consists of the ICU charts of 40,000 de-identified patients admitted to one Massachusetts hospital over the course of around 12 years. There are 26 charts and a total of 50 gigabytes of data.","m2uni":"jll2234","m2fname":"Jennifer","m3uni":""},{"projectname":"Predicting Lending Club Loan Status","timestring":"Fri Dec 21 17:27:50 2018","m1uni":"es3573","m2lname":"Tian","m1fname":"Erik","m4fname":"","m1lname":"Su","m3fname":"Rishabh ","description":"Our objective in this project is to build a classifier that can accurately predict the loan status of an applicant from their Lending Club loan application. The loan status indicates whether or not the loan will be fulfilled and thus whether investments come to fruition. This is important information to be able to predict for both the investor and the applicant, and we aim to reduce the overhead involved in prediction.

We build upon previous studies of this dataset by applying different preprocessing criteria, stacking models, and balancing the data before training our models.

This research is important for both the investor and applicant since our findings suggest that a majority of the application is not important toward predicting the loan status. Our model can be used to accurately predict the loan status of an applicant with fewer fields to fill out in an application, further reducing the complexity in dealing with loans.","uni":"es3573","language":"Python, R","pid":"201812-40 ","m4uni":"","analytics":"Variance inflation factor, generalized linear models, and Pearson correlation heat maps were used in preprocessing and are available in seaborn, scikit-learn, and R packages. The models come from PySpark's machine learning classification package and include logistic regression, linear SVC, MLP, decision tree, random forest, and GBTs. The stacking model implementation was custom and implements out-of-fold predictions in order to collect a new feature matrix. Balancing methods include custom undersampling techniques and oversampling using SMOTE. ","m4lname":"","industry":"Finance","m3lname":"Jain","dataset":"","m2uni":"ht2459","m2fname":"Hangyu","m3uni":"rj2511"},{"projectname":"Portfolio Optimization","timestring":"Fri Dec 15 16:53:00 2023","m1uni":"kjj2131","m2lname":"Tantawichet","m1fname":"Kinjal","m4fname":"","m1lname":"Jasani","m3fname":"Ojaswa","description":"The system is developed to automatically make stock holding recommendations each day before the trading hours.
The recommendation is shown daily prior to the trading hours in a webpage that the user can access.
The performance of past recommendations is also shown against S&P 500 performance.
Most existing literature applies sentiment analysis as features in machine learning models to create stock predictions, either (1) to predict individual stocks or (2) to feed into a portfolio optimization model (as a view).
The novelty of our approach is that we use sentiment analysis to determine the confidence level of the view generated by the machine learning model.
","uni":"kjj2131","language":"Javascript, Python, HTML, Colab, Airflow, Google Cloud","pid":"202312-10","m4uni":"","analytics":"Black Litterman Model, XGBoost, LSTM, Random Forest, FinBert, d3.js","m4lname":"","industry":"Finance","m3lname":"Yadav","dataset":"In designing our system we consider the 3V’s of the data:
1) Volume: News feeds and stock price data are sourced in real-time from finnhub and yahoo finance.
2) Velocity: News feeds and stock price data are available at various time-scale granularities. To keep the project feasible, we consider daily intervals.
3) Variety: stock price data is structured as time series data. News feed is unstructured data.

The system uses new data as of 8:00 AM (before the trading hours) each morning to:
- fit XGBoost and make stock predictions (stock prices from Yahoo Finance)
- use pre-trained finBert to generate news sentiment score for each stock (news feed from finnhub)
- provide stock holding recommendations
","m2uni":"et2676","m2fname":"Ekarat","m3uni":"oy2143"},{"projectname":"AI Trader (US)","timestring":"Fri May 15 23:32:31 2020","m1uni":"qg2175","m2lname":"Lyu","m1fname":"Qing","m4fname":"","m1lname":"Gao","m3fname":"","description":"The objective of this project is to construct a trading agent that can analyze millions of stock data and execute trades at the optimal price, forecast stock price with greater accuracy, as well as mitigate risk to achieve higher returns. Our analysis combines both technical analysis and fundamental analysis to form a portfolio that can lower the risk and increase the total return.
Using artificial intelligence to trade has played an increasingly significant role in the stock market, as an accurate forecast will allow investors to make appropriate trading decisions and realize an attractive return.
","uni":"qg2175","language":"Python, R, Shiny App","pid":"202005-14","m4uni":"","analytics":"Stock Technical Analysis, Stock Fundamental Analysis, Moving Average, Linear Regression Model, Long Short Term Memory, Portfolio Analysis, Discounted Cash Flow Model, ROE-P/B model
","m4lname":"","industry":"Finance","m3lname":"","dataset":"This project uses two types of datasets: a daily stock price dataset of all the S&P 500 companies over the time period from January 2008 to April 2020, and the second type of dataset consists of the S&P 500 companies’ financial statements and its corresponding key financial ratios between 2009 and 2019. The first type of datasets is collected using web scraped from Yahoo Finance API. The second type of datasets is collected using both the 'FundamentalAnalysis' package in Python and the 'QuanTmod' package in R.","m2uni":"wl2733","m2fname":"Wenfeng","m3uni":""},{"projectname":"Movie Recommendation System Using A Movie Lens Dataset ","timestring":"Sat Dec 17 03:34:55 2022","m1uni":"ab5266","m2lname":"","m1fname":"Advaith","m4fname":"","m1lname":"Biligeri Jagannath","m3fname":"","description":"The goal was to implement a movie recommendation system using the movie lens dataset, by leveraging the utilities offered by GCP and PySpark for large datasets in an optimized manner (both in time and efficiency).

Innovation:
1) Tried to improve the hyperparameter tuning for the ALS model used to predict the movies
2) Dataset storage optimization by converting CSV to Parquet format
3) Tableau dashboard creation for the output predictions (3 types of visualization, namely tree map, bar chart, and packed bubble)

Capabilities:

This system is powered by all the tools taught in class. In my literature survey, I couldn't find any research on integrating the mixture of tools that I've used to build a recommendation system. I have used BigQuery, GCP Cloud Storage, a Dataproc cluster, and Tableau. Integrating these tools can yield a cost-effective recommendation system: companies such as Netflix and Hulu could use just Google's cloud services to build their own recommendation system and integrate Tableau dashboards with the UI instead of outsourcing it to other software companies.
","uni":"ab5266","language":"Python, GCP, Tableau, Pyspark, BigQuery ","pid":"202212-38","m4uni":"","analytics":"The flow is as follows initially the data is prepared and analyzed and converted to parquet, the parquet file is picked up from the cloud storage and the hyper parameter tuning is performed, the hyper parameter tuning results in csv file stored in gcp cloud storage as well. The best parameter from csv is used and the recommendation is made using pyspark library functions (ALS Model). The recommendation is stored on big query and visualized on tableau by creating user dashboard by OAuth linking tableau and big query
1) Analytics: Performed basic exploratory data analysis: word cloud, bar plot, data in RDD format
2) Preprocessing: Removed null rows, joined movies.csv and ratings.csv, converted the resulting RDD to Parquet format (data optimization: novelty), and saved it
3) Hyperparameter tuning: Done using the grid search algorithm, coded from scratch without using a built-in library (novelty). The hyperparameters were stored in a CSV file.
4) ALS model: The best hyperparameters were used to produce recommendations
5) BigQuery tables for recommendations were created
6) Visualization: A Tableau-based interactive user dashboard was created (tree map, packed bubbles, bar plot) by linking BigQuery and Tableau using OAuth
","m4lname":"","industry":"Information","m3lname":"","dataset":"Used the small movie lens dataset provided on the movie lens website. The system can support any csv files with movieId, userId, Ratings and Movie name
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Gene and Protein Analysis for Cancer Research: Pathway and Network Analysis","timestring":"Thu May 16 22:48:45 2019","m1uni":"zl2690","m2lname":"Rong","m1fname":"Ziyi","m4fname":"","m1lname":"Liu","m3fname":"","description":"Nowadays, genes and proteins have been proven to be a crucial driving factor on most diseases such as lung cancer. Also, in order to better predict the progress of diseases and suggest new research directions of genetic study, precision medicine relies more and more on a good understanding of interactions between genes and proteins. Therefore, pathway analysis of proteins shows its significance for biological researchers, who keep devoting themselves in building databases containing various protein-protein interaction networks.

However, as more and more new proteins are found to be related to certain diseases, updates to protein pathways in existing online databases have lagged behind.

In order to solve this problem, we are curious about developing a method to automatically generate protein pathway based on a given dataset containing a number of patient profiles, where the content ratios of different types of proteins are recorded. ","uni":"zl2690","language":"Python, JavaScript, HTML, CSS","pid":"201905-2","m4uni":"","analytics":"For correlation analysis:
- K-means clustering
- Random forests classifier
- T-SNE for dimension reduction
- Correlation matrix
- Hierarchical clustering

For pathway visualization:
- Django web framework
- D3.js JavaScript library","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset containing over 100 patient profiles with the content ratios of over 10000 type of proteins detected in those patients' bodies are provided by Graphen.ai. Additionally, we manage to retrieve the dataset of protein interaction network through the API access of STRING Database.","m2uni":"br2566","m2fname":"Boyan","m3uni":""},{"projectname":"ESG-Driven Portfolio Optimization","timestring":"Fri Dec 19 16:58:27 2025","m1uni":"naa2204","m2lname":"Mittal","m1fname":"Neha","m4fname":"","m1lname":"Angadi","m3fname":"","description":"Primary Goal: Prove that AI can build portfolios achieving competitive financial returns (Sharpe ratio >= 1.4) WHILE meeting ESG targets (score >= 60), without forcing profit vs. planet trade-off.

Specific Objectives:
1. Quantify ESG cost - Measure exact return sacrifice for sustainability
2. Minimize the gap - Use RL to reduce performance loss vs. traditional portfolios
3. Enable adaptability - Daily rebalancing vs. static quarterly allocation
4. Provide transparency - Visualize trade-offs, make ESG investing data-driven

Key Innovations:
1. Multi-Objective RL Reward Function
First to jointly optimize: returns + ESG + risk + transaction costs in single reward
Impact: Reflects real institutional mandates (not just profit maximization)
Result: Achieves 1.456 Sharpe with 62.1 ESG (4.4% cost for 28.8% ESG gain)
2. Soft ESG Constraints
Instead of: Hard cutoffs (\"exclude ESG < 60\")
We use: Penalty functions (allows flexibility, learns optimal violations)
Impact: Outperforms hard-constrained MVO by 5.4% Sharpe (1.456 vs 1.382)
3. Dynamic Daily Rebalancing
Vs. traditional: Quarterly/monthly static weights
Our approach: Agent learns when/how to adjust based on market signals
Benefit: Adapts to volatility spikes, ESG changes, regime shifts
4. End-to-End Production System
Not just research code - Full pipeline: data -> training -> backtesting -> dashboard

Capabilities and Importance:
Capability 1: Real-Time Portfolio Construction
What: Given any ESG target, instantly generate optimal portfolio
Why important: Asset managers need quick scenario analysis for client proposals
Demo: Dashboard lets users drag ESG slider, see impact on returns immediately
Capability 2: Stress Testing & Risk Analysis
What: Evaluate portfolio performance during volatile periods (2020 COVID, 2022 inflation)
Why important: Regulatory requirements (Basel III, Dodd-Frank) demand stress tests
Result: RL portfolio shows -3.2% loss in high volatility vs. -5.4% for baselines
Capability 3: Explainable Trade-offs
What: Pareto frontier visualizes every return-ESG combination
Why important: Investors need to see exactly what they sacrifice at each ESG level
Insight: 10 ESG points costs ~1.5% annual return (now quantified, not guessed)
Capability 4: Scalability to Large Universes
What: Architecture handles 35 stocks now, can scale to 500+ with minor changes
Why important: Institutional portfolios require diversification across hundreds of assets
Bottleneck addressed: Vectorized operations, efficient covariance computation
Importance:
1. Financial Impact ($35 Trillion Market)
Global ESG assets: $35T in 2020, projected $50T by 2025
Our approach reduces ESG cost from ~10-15% (typical) to 4.4%
Implication: Saving 1% on $35T = $350 billion in preserved returns annually
2. Environmental & Social Impact
Redirects capital toward sustainable companies (high ESG scores)
Incentivizes corporations to improve ESG practices to attract investment
Specific: Our RL agent allocates 38% to Tech (clean energy innovators), 22% to Healthcare (social good)
3. Democratization of Advanced Tools
Previously: ESG optimization required expensive Bloomberg terminals, quant teams
Now: Open-source RL toolkit anyone can use
Accessibility: Students, startups, non-profits can build ESG portfolios
4. Academic Contribution
First to combine: Deep RL + Multi-objective optimization + ESG in production system
Gap filled: Prior work either did RL (no ESG) or ESG (no RL), never both rigorously
Citation potential: Provides baseline for future ESG-RL research
5. Regulatory Relevance
SEC proposed ESG disclosure rules (2022)
EU Sustainable Finance Disclosure Regulation (SFDR)
Our toolkit helps: Firms comply by quantifying ESG impact, providing audit trails

Toolkit Components and Usability:
For Researchers:
Custom Gymnasium environment (plug-and-play for new RL algorithms)
Modular reward function (easy to add new objectives)
Comprehensive backtesting framework
For Practitioners:
Interactive dashboard (no coding required)
Pre-trained models (immediate deployment)
Clear documentation (replication in 2 hours)
For Educators:
Teaching material (RL + finance + ESG in one project)
Jupyter notebooks (step-by-step explanations)
Visualizations (intuitive understanding of trade-offs)

This research proves ESG investing doesn't require sacrificing returns, provides open-source tools to democratize sustainable finance, and establishes RL as the future of multi-objective portfolio optimization. The 4.4% Sharpe cost for 28.8% ESG gain is the key quantified result that changes the conversation from \"ESG or profits?\" to \"How much ESG can we afford?\"","uni":"naa2204","language":"Python; Data: pandas, numpy ML/RL: stable-baselines3, PyTorch, gymnasium; Optimization: scipy, cvxpy; Dashboard: streamlit","pid":"202512-9","m4uni":"","analytics":"Our system implements comprehensive analytics including financial metrics (daily/cumulative/annualized returns, volatility, max drawdown, Sharpe/Sortino ratios), ESG metrics (portfolio ESG scores, sector breakdowns, consistency tracking), and performance analysis (backtesting, out-of-sample testing, stress testing, statistical t-tests). We deployed four core algorithms: (1) traditional Mean-Variance Optimization using SLSQP, (2) ESG-Constrained Optimization with hard thresholds, (3) Pareto Frontier generation for multi-objective trade-off visualization, and (4) Proximal Policy Optimization (PPO) with 2-layer MLP policy and value networks (256->128 neurons) using Generalized Advantage Estimation. The system architecture consists of five modular components: Data Pipeline (yfinance API collection, cleaning, 18-feature engineering), Optimization Engine (scipy solvers, constraint handlers), RL Training Module (custom Gymnasium environment with 5-component reward function, stable-baselines3 PPO agent), Backtesting Framework (historical simulation, metrics calculation, stress testing), and Interactive Dashboard (Streamlit app with 7 pages). 
We implemented 25 interactive Plotly visualizations across dashboard pages including KPI cards, grouped/horizontal/vertical bar charts (strategy comparison, top holdings with ESG color-coding, sector allocation), scatter plots (risk-return profiles with bubble sizing, 3D stock explorer), multi-line time-series charts (portfolio values, cumulative returns), drawdown area charts, pie charts (sector allocation), spider/radar charts (multi-dimensional strategy comparison), Pareto frontier plots with color-scaled volatility heatmaps, styled sortable tables with gradient backgrounds, and dynamic filtering components with real-time slider updates, totaling ~3,500 lines of Python code across 5 Jupyter notebooks plus one production Streamlit application deployed on Hugging Face Spaces via Docker containerization.","m4lname":"","industry":"Finance","m3lname":"","dataset":"DATASETS TESTED:
Dataset 1: Yahoo Finance Stock Prices
Source: Yahoo Finance API via yfinance Python library
Coverage: 40 S&P 500 stocks -> filtered to 35 with complete data
Time Period: January 2022 - November 2025 (3 years, 751 trading days)
Features: OHLCV (Open, High, Low, Close, Volume) - daily frequency
Size: 26,285 price observations (35 stocks × 751 days)
Access: Public, free API - pip install yfinance

Dataset 2: Kaggle S&P 500 ESG Ratings
E, S, G scores for same 35 stocks
Public: https://www.kaggle.com/datasets/pritish509/s-and-p-500-esg-risk-ratings

Dataset 3: Our Derived Features
18 calculated metrics (returns, volatility, Sharpe, etc.)","m2uni":"sm5756","m2fname":"Sejal","m3uni":""},{"projectname":"Visual Sentiment Prediction: Emotion Recognition on Image","timestring":"Sat May 6 01:10:52 2023","m1uni":"sc5103","m2lname":"Wu","m1fname":"Shiyu","m4fname":"","m1lname":"Cheng","m3fname":"","description":"Our research is aimed at allowing machines to interpret emotions based on visual information. We developed a website to infer sentiment from images.
Our project is the first to apply ResNeXt to sentiment analysis of images. We introduce a unique approach that uses fine-tuned transfer learning models to handle the challenges of image sentiment analysis.
Visual sentiment prediction has wide applications in all walks of life: Emotional understanding of viewer responses to advertisements using facial expressions (McDuff et al., 2015); Monitoring of emotional patterns to help patients suffering from mental health disorders (Huang, Sano, and Kwan, 2014)
","uni":"sc5103","language":"Python, GCP","pid":"202305-15","m4uni":"","analytics":"PyTorch
Confusion Matrix
Visualization: Streamlit","m4lname":"","industry":"Information","m3lname":"","dataset":"The test dataset is downloaded manually from Flickr, Unsplash, Google Images, and Baidu Images. Our model can support all kinds of images.","m2uni":"sw3753","m2fname":"Shuyu","m3uni":""},{"projectname":"Game for you","timestring":"Fri Dec 13 15:10:59 2019","m1uni":"fy2252","m2lname":"Zhang","m1fname":"Fanxing","m4fname":"","m1lname":"Yin","m3fname":"Yuchen","description":"When it comes to games, many people may think of wasting time. But the game can not only relax players’ mind after the tense work and study, but also help people make new friends. So playing games with proper time and way is beneficial to people.

Every game has its life span and people will keep looking for new games. For this reason, we want to build a game recommendation website, through which people can filter and find the most suitable games for themselves, and can quickly get the basic information and characteristics of each game.
","uni":"fy2252","language":"Python, CSS, HTML, Javascript, SQL, Google Cloud, Pycharm, Mysql, Postgresql, Jupyter notebook","pid":"201912-30","m4uni":"","analytics":"Filter useful comments: Define useful comments and ignore meaningless comments using some restrictions, such as: comment length, repetition of simple words and etc.

Creating labels: Use the five most common adjectives in a game's comments as the labels of the game.

Game recommendation: Do correlation analysis based on labels and genres.

Compute game ratings: Mean of useful comments’ ratings.

Web application: User information and data visualization using HTML, CSS and JS.

Recommendation based on customized information: SQL.
","m4lname":"","industry":"Media","m3lname":"Luo","dataset":"Our dataset have two csv files the first one is information for more than five thousand current popular games. The second one are more than one million comments for those games.

We got the dataset from the public dataset website kaggle.com.

Our software can also suppport ohter csv files that contain properties of many objects and provide search function.","m2uni":"qz2376","m2fname":"Qingli","m3uni":"yl4250"},{"projectname":"Video Summary Generation","timestring":"Thu May 4 21:38:43 2023","m1uni":"ih2403","m2lname":"","m1fname":"I-Kai","m4fname":"","m1lname":"Huang","m3fname":"","description":"As digital content is being accumulated faster than ever, video summarization helps improve the efficiency of content consumption and information retrieval. The goal of this task is to generate video recaps/summaries that provide concise and complete synopsis. The expected output is a shortened version of the video with the most informative frames. Having seen some of the output summary videos from related works, I believe that there is room for improvements for the content despite having outstanding benchmark scores.","uni":"ih2403","language":"the system is built using django and python","pid":"202305-11","m4uni":"","analytics":"The website, written using the Django framework, takes user uploaded video in mp4 format. The video would then be converted into frames using the OpenCV library.
Features are then extracted using ResNet from PyTorch and the Python scenedetect library.
Once we have the features representations, they are inputed to a trained Deep Summary Network based on RNN with LSTM cells. During the training phase, the model is evaluated using the common benchmarking datasets TvSum and SumMe and using F1-score. In addition, it also calculates a representative score at every training iteration to evaluate the effectiveness of the model. Once we have the output scores provided by the model, we apply the frame selection criteria and select the frames to be used for generating the summary. The frames are then converted into video using the OpenCV library while using the original FPS configuration. The video would be sent back to the webpage where the user uploaded the original video.","m4lname":"","industry":"Information","m3lname":"","dataset":"SumMe
TVSum
These are public benchmarking datasets commonly used for this topic.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"SpotifyClassifier","timestring":"Sat Dec 18 18:04:16 2021","m1uni":"apt2141","m2lname":"Aliyeva","m1fname":"Alex ","m4fname":"","m1lname":"Thornton","m3fname":"Tanvi","description":"Music genre classification is a complex task, which can even be difficult for humans. Using a Spotify Developer account to interface with their API, we have created our own dataset and a music genre classifier capable of identifying song genres across a much larger range of genres and subgenres than previously achieved in academic research. Additionally, we leverage Spotify's pre-processed track metadata, allowing for genre classification with only a song name as an input, rather than an audio mp3 file.","uni":"apt2141","language":"Python, Jupyter Notebook, Javascript","pid":"202112-1","m4uni":"","analytics":"GCP dataproc / bigquery - storage and processing of audio features

Python NetworkX package - subgenre interconnectedness graph visualization

Python Plotly.express package - sunburst supergenre hierarchy visualization","m4lname":"","industry":"Media","m3lname":"Pande","dataset":"New Dataset
Queried Spotify for recommendations across 73 subgenres
~500 tracks labelled for each subgenre
Collected Spotify metadata for individual tracks, added subgenre labels
Grouped 73 subgenres -> 11 'super-genres' for simpler classification

Feature Variety
Combined Spotify audio features and audio analysis
Averaged pitch and timbre vectors over all song segments
Included other miscellaneous metrics (popularity, date of release, explicit lyrics)
","m2uni":"ae2970","m2fname":"Elmira","m3uni":"tp2673"},{"projectname":"Automatic Image Labelling Sytem","timestring":"Sat Dec 18 01:27:14 2021","m1uni":"yy3089","m2lname":"Peng","m1fname":"Yi","m4fname":"","m1lname":"Yang","m3fname":"Jiashu","description":"Image labelling has always been a consuming task in both time and human resources. Generally, the most common data labelling tools and formats include LabelMe json files, COCO dataset, lvis, etc,. While making image annotations, we need to repeatedly view images of similar themes and type in classes and bounding-box coordinates. Is there a way to make image labelling more efficient and interesting? In our final projects, we proposed an innovative image labelling system that will provide users with image annotations shortly after they send their customer’s images via the front-end of our system. We divided our system into a Graphic User Interface for users to interact with the deep-learning engine and a back-end to generate image annotations. Our system will allow users to modify, add or delete annotations according to their requirements or willings. We evaluate with labelling accuracy and different test cases to prove the robustness and performance of our system.","uni":"yy3089","language":"Python, Javascript, html, css, MacOS, Linux(Ubuntu), Nvidia Tesla T-4 GPU(Google Cloud Platform), Intel i5 CPU ","pid":"202112-3","m4uni":"","analytics":"We build our system in the typical MVC framework(Model. View, Controller):
Model: We trained a Faster R-CNN model as the deep learning engine to generate annotations as labelling data.
View: We used Electron and React to build the front-end of our software system, which interacts with users and provides interfaces for different functions so users can change our annotation results.
Controller: We implemented the back-end of our software using Flask to build a connection between the front-end and our deep-learning model.

We also employed MySQL to build our database to store users' information and software's logging records.","m4lname":"","industry":"Information","m3lname":"Chen","dataset":"We collected raw image data about people wearing masks from the https://github.com/X-zhangyang/Real-World-Masked-Face-Dataset and labelled part of them as original training data. Our training data set contains about 500 instances of mask category. We marked the bounding box and categories of each image using LabelMe and transformed the whole dataset into COCO format.","m2uni":"jp4081","m2fname":"Jing","m3uni":"jc5664"},{"projectname":"Analyzing Word Patterns for Author Identification","timestring":"Fri Dec 13 18:26:54 2019","m1uni":"dr2884","m2lname":"","m1fname":"David","m4fname":"","m1lname":"Rincon-Cruz","m3fname":"","description":"- Be able to cluster different documents and media written by the same author by analyzing the stylometry
- A key research point not previously considered in papers is the application of POS tags and their bigram conditional probabilities to stylometry.

- Create a visualization of different documents for different types of media to present similarities of different styles, as well as highlight clusters or demographics.
- The interactive API allows users to submit two bodies of text to see if their authorship is identical.

- This is important for the purpose of spam detection, historical and research work attribution, and detecting plagiarism.","uni":"dr2884","language":"Python","pid":"201912-48","m4uni":"","analytics":"- Variety of vocabulary analysis tools
- Markov chain bigram probabilities
- Support Vector Machines (also attempted simple multi-layer perceptrons and KNN clustering)
- Principal Component Analysis
- Experimented with force-directed graphs, but removed them once feature encodings could be plotted with enough variance from PCA.
","m4lname":"","industry":"Media","m3lname":"","dataset":"For the books, I used the gutenberg python api to download 20k books, removed the ones missing metadata, and tagged with my own implementation, as well as nltk.

For the NYT articles, I used their developer API to look for article links, and then built my own scraper/parser that manipulated cookies to download tags in websites and join them to rebuild the original text.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Analysis of News impact on US Stock Market","timestring":"Sat Dec 18 04:35:17 2021","m1uni":"yh3310","m2lname":"Sun","m1fname":"Yukun","m4fname":"","m1lname":"Huang","m3fname":"Xintong","description":"This project has 2 goals:
(1) use news data to help predict the future performance of stocks.
(2) visualize the news topics and sentiments distribution in history together with the stock price.

This project is an integrated platform for researchers, consultants, analysts, and individual investors who wish to quickly learn about historic events affecting stocks and make immediate decisions when events occur.","uni":"yh3310","language":"Python, html, javascript, css&scss, SQL, Django, Google Cloud Platform","pid":"202112-43","m4uni":"","analytics":"1. Direct stock price prediction using stock factors and sentiment scores:
(1) NLTK SentimentIntensityAnalyzer
(2) MLPRegression
(3) Lasso
(4) CNN

2. NLP models for news sentiment classification:
(1) T5(The Text-To-Text Transfer Transformer)
(2) BERT + MLP
(3) Lexicon feature + MLP

3. Visualization:
(1) candlestick chart for stock price
(2) pie charts for news sentiment and impact scores
(3) force directed graph for number of news of different companies
(4) image boxes for news visualization
","m4lname":"","industry":"Finance","m3lname":"Liu","dataset":"Stock data was fetched from Yahoo! Finance.
News data was crawed from Nasdaq(https://www.nasdaq.com/).","m2uni":"hs3220","m2fname":"Haoming","m3uni":"xl3121"},{"projectname":"Skip Doc","timestring":"Fri May 12 23:23:26 2023","m1uni":"cam2420","m2lname":"Norman","m1fname":"Charles Antoine ","m4fname":"","m1lname":"Malenfant","m3fname":"","description":"Skip Doc is a web application for those seeking virtual medical attention built on top of Streamlit and OpenPrompt. It gives us the opportunity to explore the possibilities that Prompt Engineering offers as a better alternative to fine-tuning pre-trained LLMs in situations of data paucity. Our focus is in the clinical medicine area. This research is important because domains that have scarce data are common and therefore rendered inaccessible to modern day NLP techniques. Additionally, this product has the potential to have huge commercial upside given the $3.5T/year in healthcare spending in the US with 1/5 of all in-person visits deemed unnecessary. It can also be used as the back-bone for robotic or virtual clinical care interactions on-site.","uni":"cam2420","language":"GCP DeepLearning VM with NVIDIA T4 GPU, 24 CPU cores and 208 Gb of RAM, Python, Docker, Streamlit, OpenPrompt ","pid":"202305-18","m4uni":"","analytics":"GUI in the form of a web application to gather virtual patient data, GPT2 models as our main DNN architecture, prompt engineering techniques, model training, model evaluation, and data pre-processing. ","m4lname":"","industry":"Life Science","m3lname":"","dataset":"MedQuAD: https://github.com/abachaa/MedQuAD. MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests. 
They used the test questions of the TREC-2017 LiveQA medical task: https://github.com/abachaa/LiveQA_MedicalTask_TREC2017/tree/master/TestDataset. As described in their BMC paper, they have manually judged the answers retrieved by the IR and QA systems from the MedQuAD collection. They used the same judgment scores as the LiveQA Track: 1-Incorrect, 2-Related, 3-Incomplete, and 4-Excellent. The QA test collection contains 2,479 judged answers that can be used to evaluate the performance of IR & QA systems on the LiveQA-Med test questions: https://github.com/abachaa/MedQuAD/blob/master/QA-TestSet-LiveQA-Med-Qrels-2479-Answers.zip. Our project is extensible and flexible enough to be able to easily incorporate other novel data sets with the fewest lines of code possible. ","m2uni":"ln2461","m2fname":"Lance","m3uni":""},{"projectname":"Building Knowledge Graphs for Diagnostic Medicine Cardiovascular","timestring":"Fri May 15 15:50:24 2020","m1uni":"xm2225","m2lname":"Lin","m1fname":"Xinpei","m4fname":"","m1lname":"Ma","m3fname":"","description":"Cardiovascular system diseases have become one of the major killers of health due to their own characteristics, and the diagnostic medicine of these diseases has shown great social significance. The Knowledge Graph, as an efficient and simple tool for interactive information visualization based on big data, is widely used in industry to show the relationships between different things. In the report, a knowledge graph will be built based on data coming from different sources and structures. The result will be used for the diagnostic medicine of cardiovascular diseases.","uni":"xm2225","language":"Python, GCP, Jupyter Notebook","pid":"202005-24","m4uni":"","analytics":"Let’s begin with structured data. Our structured data mainly comes from a database provided by a hospital. It contains more than 5,600 patient records. For each patient, it contains multi-dimensional information, for example symptoms and treatments.
After getting this database, we found that it contained too much useless information, for example, diseases that are not related to the cardiovascular system. So we needed to clean the data and keep only the useful information. The processed database served two functions. First, it could be used to construct a preliminary knowledge graph. Second, it could form the named-entity dictionaries needed for information extraction.
This is the structured data part. Next, we move to the unstructured data part. Our unstructured data mainly comes from a web crawler. A web crawler is an Internet robot that systematically browses the World Wide Web. For this project, we deployed the web crawler on several platforms including Wikipedia, Texas Heart Institute...
After getting the data, here is the overall structure for processing the unstructured text data. These six boxes indicate the six main programs that we developed, and I will explain them one by one.
First, we have to get raw text, no matter where it comes from, the web or a local disk. For language processing, we want to break the raw text up into words and punctuation. This step is called tokenization, and it produces a familiar structure: a list of words and punctuation.
Then we have to classify each word by its part of speech and label it accordingly. This process is known as part-of-speech tagging.
After POS tagging, we come to entity detection. The basic technique we use for entity detection is chunking, which uses boxes to label segments and multi-token sequences. The smaller boxes show the word-level tokenization and chunk tagging, while the large boxes show higher-level chunking. Each of these larger boxes is called a chunk.
After chunking, what we get are lists of trees and chunked sentences, and we do relation detection on them. The method we use for relationship extraction is dependency parsing. The function of dependency parsing is to analyze a sentence and assign a syntactic structure to it. After dependency parsing, we get a dependency tree and the syntactic structure. Then we can take a step forward and extract triples from the sentence.
Notice that every sentence has a token labeled as root. Generally speaking, the root represents the relationship in a triple and connects two entity chunks.
So we wrote a recursive program. Starting from root, we can extract every entity chunk in the sentence. But do we need all these triples extracted from a whole article? No, because we are only interested in the triples related to our project.
So we have to label each entity chunk by named entity recognition and only extract the triples whose entity chunks are labeled as cardiovascular disease, behaviors, treatment, that is, all the classes related to our project.
The final step is called named entity recognition. Its core function is to label each entity with its type. We keep only the triples that contain an entity whose type is cardiovascular disease. For example, the phrases “heart failure” and “heart pain” both contain “heart”, but the first refers to a kind of disease and the second to a kind of symptom.
So we need to distinguish their types. For the NER part, we used spaCy, an open-source library for advanced natural language processing in Python. spaCy can assign labels to contiguous groups of tokens and provides a default model that recognizes a wide range of named or numerical entities. Apart from these default entities, spaCy also lets us add arbitrary classes to the NER model by training it on new examples.
So the key problem becomes how to train a model that can accurately identify entities of the types we care about. Why do I use the word train? Because we searched online for a long time and did not find a model that could be used directly for our project. So we needed to train a Named Entity Recognition model from scratch. It was incredibly painful, because we could not even find proper training data for our model.
So we found sentences containing the phrases we were interested in, copied and pasted them, put one word on each line, labeled each word, and fed them into the training model.
After completing the six components mentioned above, we could give the system raw text data crawled from different websites and generate a structured CSV file.
To conclude, we can now take both structured data from the database and unstructured data from the web, process them to generate the structured CSV file, and then generate the final knowledge graph.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"For this project, a database from a hospital containing data from more than 5600 patients was obtained. For each patient’s case, the database included the patient’s symptoms and
the treatment received. Since the database contained various diseases, it was cleaned and only the cases related to cardiovascular system diseases were saved.
For the collection of unstructured text data, the most efficient method is thought to be the web crawler. A web crawler is an Internet robot that systematically browses the World Wide Web. Take Wikipedia as an example. Go to the Wikipedia entry on cardiovascular disease and download the text on this web page into a text file. Sort out the relevant links that appear on this page, then jump to the page of one of these relevant links, and then download the text in the new page into the text file. The loop continues until we have enough text data in our text file. In this case, the web crawler was deployed on several websites to get papers, text data and medical record data, and the data sources include Wikipedia, the Texas Heart Institute, the American Heart Association, etc.","m2uni":"hl3307","m2fname":"Hao","m3uni":""},{"projectname":"Factors Affecting Movie Grossing and Prediction","timestring":"Fri Dec 16 22:22:40 2022","m1uni":"jz3500","m2lname":"Peng","m1fname":"Jingtian","m4fname":"","m1lname":"Zhang","m3fname":"Hanlun","description":"Goal & Novelty
● Crawl movie data from the box office website and combine it with existing
datasets (so others can use our open-source code to crawl or combine
their own data)
○ Self-designed New Dataset
● Analyze relations between movie sales and some factors (including release
date, genre, runtime, imdb ratings…)
○ Feature Exploration
● Provide a visualized analysis from 2010 to 2022
○ Long Time Span
● Predict movie sales for movie providers when providing related information
○ Creative and Interactive Application

","uni":"jz3500","language":"Python and GCP","pid":"202212-9","m4uni":"","analytics":"● Dataset
○ Python Crawler
● Visualization
○ Data Processing
● Prediction
○ Machine Learning
● Webpage
○ Flask
○ BootStrap
○ GCP","m4lname":"","industry":"Media","m3lname":"Wang","dataset":"Data
● Source:
○ crawl.dataset.csv Crawl from: boxofficemojo website
○ some existing datasets Download from: imdb open dataset
○ Then combine them together to get combine.dataset.csv
● Why?
○ High Volume: 200 movies a year and 2400 movies in total
○ High Velocity: data updated daily
○ High Variety: movies with 22 categories and 12 feature columns
● Difficulty:
○ tconst is difficult to get so we parsed a sub page to get this key
○ Need to combine many datasets with tconst and nconst","m2uni":"wp2297","m2fname":"Weirui","m3uni":"hw2839"},{"projectname":"Box Office Oracle","timestring":"Fri Dec 13 10:12:18 2019","m1uni":"dd2941","m2lname":"Dengur","m1fname":"Diego Alonso","m4fname":"","m1lname":"Delgado Caceres","m3fname":"","description":"Objective:

Create a machine learning tool which helps movie production companies to green-light projects that will maximize their investment just by analyzing the script.

Innovations:

Only a few have analyzed movies through sentiment analysis, because of its low accuracy. We tried another perspective while still using sentiment and emotion analysis.

Splitting the movies into N parts may give us a better understanding of their sentiment. On top of this, using an LSTM or GRU helps us treat these blobs as a sequence of emotions.

Capabilities:

This project gave us an average accuracy (with the same data) in predicting which box office tier a movie will fall into.

Why is it important?
After evaluating the model and our data output, we realized we have a powerful visualization tool to see the flow of emotions through a movie at a single glance. If we could manage to predict whether a movie will be profitable just by looking at its script, the movie industry would start writing scripts at a faster pace and it would be easier for companies to release movies.","uni":"dd2941","language":"python, javascript, flask, google cloud","pid":"201912-46","m4uni":"","analytics":"We used LSTM as our main model

D3JS for visualization

Spark and Pandas for Data processing

Keras for Deep Learning

Flask for Webapp development","m4lname":"","industry":"Information","m3lname":"","dataset":"We used data from UC Santa Cruz (Natural Language and Dialogue Systems, Film Corpus) and OMDB API. We got the scripts and box office data from them respectively. We then transformed the scripts to get the emotion and sentiment analysis.

Our software can support any txt file, split it into time sequences of emotions and sentiments.","m2uni":"sd3231","m2fname":"Subey","m3uni":""},{"projectname":"Automatic Market Competition Analysis","timestring":"Fri May 5 23:09:18 2023","m1uni":"yr2425","m2lname":"Jiao","m1fname":"Yue","m4fname":"","m1lname":"Rao","m3fname":"","description":"The research and toolkits on Automatic Market Competition Analysis are important because they can help businesses make better decisions, improve efficiency, increase accuracy, gain a competitive advantage, and scale their efforts. Additionally, this research can encourage innovation within the field, leading to even more advanced tools and methodologies in the future.","uni":"yr2425","language":"Python on Mac","pid":"202305-14","m4uni":"","analytics":"Data Collection:
Data crawling from Steam and Steam Spy using Steam Games Scrapper
Data storage on Google Cloud

Data Cleaning and Examination:
Removal of empty and duplicate values
Filtering for games that support English
Text cleaning: removing HTML, websites, email addresses, and punctuation

LDA Model for Topic Modeling to identify direct competitors:
Tokenization, stop word removal, and stemming for data preparation
Latent Dirichlet Allocation (LDA) for topic modeling
Collapsed Gibbs sampling for LDA architecture implementation
Jensen-Shannon distance for similarity query

User Review Analysis:
Text cleaning: replacing heart symbols with asterisks
Sentence tokenization and embeddings using the pre-trained transformer model \"all-mpnet-base-v2\"
Dimensionality reduction using Uniform Manifold Approximation and Projection (UMAP)
Clustering using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)
Bayesian Optimization for hyperparameter tuning
Automatic cluster labeling using the most common verb, direct object, and top two nouns
Word cloud visualization for cluster results

Data Visualization:
EDA for Direct and Indirect Competitors including price distribution, positive and negative review rates, number of owners
Cluster Visualization
Display strength and weakness
Word Cloud for selected positive and negative review topics
","m4lname":"","industry":"Finance","m3lname":"","dataset":"The primary dataset used is information about 27,000 games collected from the Steam store and Steam Spy using Steam Scraper which is based on Scrapy. The dataset includes various information for each game, such as detailed descriptions, categories, tags, genres, languages, price, review rates, and the number of owners.

The dataset in LDA Model for Topic Modeling and EDA for Direct and Indirect Competitors is filtered based on the selected tags, genres, and categories, and direct and indirect competitors are identified using topic modeling with LDA Model.

In User Review Analysis with Topic Modeling, a User Review dataset is used, which consists of positive and negative reviews for the games. The dataset contains user-written reviews, and the text may include inappropriate words replaced by heart symbols. This dataset is processed and analyzed using NLP techniques to understand user intents and cluster them based on their meaning.","m2uni":"qj2168","m2fname":"Qingyue","m3uni":""},{"projectname":"Automatically Updated Factor-Risk Parity Model & Related Research System","timestring":"Fri Dec 20 23:26:02 2024","m1uni":"wl2939","m2lname":"Mao","m1fname":"Wenbo","m4fname":"","m1lname":"Liu","m3fname":"Jiapeng","description":"1.Objectives
- Address limitations of static traditional models by introducing dynamic recalibration and economic factors.
- Enhance portfolio robustness through factors like growth, interest rates, and inflation.
- Automate monthly portfolio weight recalculations for adaptability to market changes.
- Provide a customizable research toolkit for flexible factor and asset configurations.

2.Innovations
- Monthly automated recalibration, advancing static traditional models.
- Integration of macroeconomic factors for deeper risk analysis.
- A flexible system for testing different asset-factor combinations.
- Bridging theory and practical application in portfolio optimization.

3.Capabilities
- Automatic monthly updates align portfolios with market dynamics.
- Flexible configurations for assets, factors, and recalibration intervals.
- Improved risk decomposition and diversification strategies.
- Scalable design for integration with real-time data and broader markets.

4.Why Are These Research/Toolkits Important?
- Enables dynamic, adaptive portfolio management for changing markets.
- Achieves better diversification through factor-driven analysis.
- Empowers researchers to explore and refine allocation strategies.
- Advances portfolio management methodologies for practical applications.","uni":"wl2939","language":"Python, JavaScript, HTML, CSS","pid":"202412-8","m4uni":"","analytics":"
1.Analytics:
- Factor Analysis: Conducted using multiple linear regression to calculate factor exposures (betas) of various assets.
- Correlation Analysis: Computed asset and factor correlation matrices to quantify relationships and guide portfolio optimization.

2.Algorithms:
- Regression Models: Employed multiple linear regression for determining the factor exposure matrix, using libraries like Scikit-learn.
- Risk Parity Optimization: Implemented using Python’s SciPy library to calculate portfolio weights based on factor exposures and risk contributions.

3.System Modules:
- Data Input Module: Imports and preprocesses raw financial data (e.g., from Yahoo Finance or Wind) stored in CSV files.
- Model Computation Module: Calculates factor exposures, portfolio weights, and risk metrics.
- Visualization Module: Provides interactive charts and tables for displaying factor contributions, asset weights, risk metrics, and backtesting results.

4.Visualization:
- Interactive Charts and Tables:
- Factor exposure matrix.
- Asset and factor correlation matrices.
- Net asset values (NAV) over time.
- Daily return rates and comparisons between models.
- Tools: The visualizations are displayed via the frontend (JavaScript, HTML, CSS) and accessed locally using a web browser.
- This comprehensive implementation allows users to analyze, optimize, and visualize factor-based portfolio strategies dynamically and interactively.","m4lname":"","industry":"Finance","m3lname":"Li","dataset":"
1.Dataset Tested
- The datasets used in this study are divided into two categories: asset data and factor data. The asset data includes six key financial indices covering U.S. and Chinese stock, bond, and commodity markets, such as the S&P 500 Index (SPX.GI), U.S. Treasury Bond Futures (TY.CBT), Shanghai Composite Index (000001.SH), and Chinese Commodity Index (NH0100.NHF). The factor data consists of five macroeconomic factors: growth, interest rates, inflation, credit, and exchange rates.

2.How the Data Was Obtained
- The asset data was partially retrieved from Yahoo Finance, a public platform providing historical financial data, and partially obtained from the Wind Database using a personal membership account. Similarly, the factor data was saved from the Wind Database with the same membership access. These datasets were processed on a daily frequency to ensure compatibility with the monthly recalibration of portfolio weights. Both asset and factor data from Wind are proprietary and not publicly available.

3.Public Availability
- The asset data from Yahoo Finance is publicly accessible and can be documented for further use. However, the asset and factor data sourced from the Wind Database are private, accessed via a personal membership, and cannot be shared publicly due to their proprietary nature.

4. Other Data the Software Can Support
- The software supports flexibility and scalability, allowing integration of additional datasets such as real-time financial data via APIs, alternative asset classes (e.g., cryptocurrencies, real estate indices), and extended macroeconomic factors. Users can also incorporate their own proprietary datasets to conduct customized portfolio experiments.","m2uni":"xm2335","m2fname":"Xiying","m3uni":"jl6221"},{"projectname":"PubMed Database Analysis and Trend Visualization","timestring":"Sun Dec 22 17:50:02 2019","m1uni":"rm3707","m2lname":"Zhao","m1fname":"Rong","m4fname":"","m1lname":"Ma","m3fname":"","description":"Objectives: Create a website which can search PubMed articles based on ICD10 terms, construct author's network and analyze the trend of cancer research.

Innovations: Satisfying clinician information needs often requires access to up-to-date literature, and online literature is a potential but underutilized solution. Meanwhile, there is no user-friendly website that helps clinicians search for articles based only on an ICD10 term. Knowing an author's network can give laymen a basic idea of the author's status or importance in the field. Moreover, analyzing the trend of cancer research could help students in life science reconsider their future research plans, so they can pay more attention to fields that tend to become popular in the near future.

Capabilities: Our website provides a user-friendly interface that contains multiple functions for clinical research. It can automatically map ICD10 disease terms to MeSH for literature searching in PubMed. Then, with that literature information, users can read more publications from co-authors based on the authors' networks for further study. Furthermore, users can get a brief idea of the current trend in disease research from our trend visualization and analysis output.
","uni":"rm3707","language":"Python, JavaScript","pid":"201912-27","m4uni":"","analytics":"PyMed
PubMed API
UMLS API
PySpark
Chartjs, vis.js
Anychart
JavaScript
Django
LDA model from PySpark -- MLLIB package
","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Overall:
433 PubMed Disease Mesh Terms
Abstract, authors, PubMed Ids, keywords, journal information, publication date and doi

Authors Networks:
Authors from Literatures about cancer from 2009 to 2018

Cancer trend Analysis:
Keywords, journals, topics (generated) in Literatures about cancer from 2009 to 2018

We obtained the data from the official PubMed server through the PubMed API.
","m2uni":"lz2716","m2fname":"Liangyu","m3uni":""},{"projectname":"USA Car Accidents Severity Prediction","timestring":"Sat Dec 18 04:18:56 2021","m1uni":"zc2628","m2lname":"Liu","m1fname":"Zifan","m4fname":"","m1lname":"Chen","m3fname":"Yuxing","description":"The number of traffic accidents in the United States continues to increase. An estimated 20,160 people died in motor vehicle crashes in the first half of 2021. Traffic accidents are directly related to many environmental factors. An accurate prediction can improve traffic safety and conserve public resources.

The goal of our project is to build a system that visualizes analyzed data on USA car accidents and predicts the severity level of a car accident from features like time, road, weather, etc.

Most previous research has focused on designing precise models and achieving higher prediction accuracy. Thus, those models need a myriad of features that are not easy to access in real-life situations, like detailed driver information and car information. Our novelty, however, is to consider the difficulty of data acquisition and the ease of use, and to design a traffic accident prediction system with more practical value.

The car accident severity prediction provided in our system has significant value for dispatch centers as it could help them manage the emergency response force after they receive a car accident report. Instead of waiting for a precise report from officers reaching the scene, the dispatch center will be able to send the appropriate amount of emergency response force to the accident scene right after they receive the report and location according to the severity level prediction. This prediction system will increase the reaction speed for car accidents and prevent sending surplus emergency response forces to a low-level severity car accident scene.
","uni":"zc2628","language":"Python, JavaScript, HTML5, CSS, Django, sklearn, pandas","pid":"202112-30","m4uni":"","analytics":"We analyzed the correlation matrix to filter features that are relatively important to the car accident severity level that we want to predict.

We trained and evaluated 4 models (i.e. Linear Regression, Random Forest, Decision Tree, and Gradient Boosting). Finally, we picked Decision Tree as our model because of its relatively lower time consumption and higher accuracy.

We built a website to visualize the relationships between car accident severity and the four most important features (Location, Road, Weather, and Time). We also implemented a prediction function in which users type in the longitude and latitude of the place they want to predict and get critical features like Time, Weather, Humidity, Road Feature, and the prediction result (accident severity level) after the system finishes its prediction.

","m4lname":"","industry":"Transportation","m3lname":"Wang","dataset":"The dataset we used in our project is contributed by a Lyft scientist, Sobhan Moosavi. It is a public countrywide traffic accident dataset, which covers 49 states of the United States, includes more than 3 million accident records and 46 features. It is hard to find another dataset that our software could support because of the variety of our dataset's features. We used two-thirds of the 46 features as inputs to train our model. Finding another qualified dataset is hardly possible.","m2uni":"ml4687","m2fname":"Meiyou","m3uni":"yw3739"},{"projectname":"Flight Data Analytics","timestring":"Thu Dec 21 23:00:53 2023","m1uni":"sdp2158","m2lname":"Buche","m1fname":"Simran","m4fname":"","m1lname":"Padam","m3fname":"Harsh","description":"Objectives: Create a real-time and scheduled flight tracker with risk assessment using weather parameters such as visibility, precipitation, etc.

Innovations: Most open-source flight trackers do not include weather information along the latitude and longitude coordinates of each flight's route. Those that do incorporate this information are often paid sources. Furthermore, some applications track delays at an airport level but do not look to see which of their flights will likely be the source of the delays. Our application combines both flight schedules and real-time flight tracking data with weather parameters to provide a holistic view of the flight status for various flights across the world.

Capabilities: Real-time and scheduled interactive flight tracker with a well-designed user interface incorporating risk information for each flight.

Importance: According to the FAA, approximately 75% of the system-impact delays of more than 15 minutes were caused by weather from June 2017 to May 2022. If weather conditions are known in advance, airlines and customers can avoid inconvenience due to delays. Hence, a real-time tracker is created to facilitate the smooth functioning of airplane systems. The tracker demonstrates weather conditions for each latitude and longitude coordinate in the flight route.","uni":"sdp2158","language":"Python - Pyspark, Dash, Plotly, Pandas, Scipy, CSS","pid":"202312-2","m4uni":"","analytics":"Geographic scatter plot done in Plotly with various map projections, web application created using Dash
Custom weather algorithm created using percentiles of distributional data of forecasts, thresholds determined using distribution of percentiles
Haversine distance algorithm implemented using Pyspark UDFs to calculate distances.","m4lname":"","industry":"Transportation","m3lname":"Benahalkar","dataset":"Aviation stack API: Bulk flight schedule data
FlightRadar24 API: Airport arrival data and individual flight data
Open Meteo API: Weather parameter data including both historical and forecast data
Datfiles: Airport data and flight route data","m2uni":"ssb2215","m2fname":"Shriniket","m3uni":"hb2776"},{"projectname":"Smart Traffic of Connected Cars","timestring":"Fri Apr 23 17:24:02 2021","m1uni":"bj2437","m2lname":"Sun","m1fname":"Boshen","m4fname":"","m1lname":"Jin","m3fname":"","description":"Thanks to the worldwide deployment of 5G technology, which brings more connectivity and high bandwidth, new applications in IoT have become possible. Companies are pouring money into the IoT area and one area of particular interest to investors is IoT-connected cars. According to a study by Business Insider, an American financial and business news website, US connected car commerce grew from 0.07 billion dollars in 2015 to 5.13 billion dollars in 2020 and is estimated to reach 16.43 billion in 2023. With more and more connected cars on the road, a smart traffic system can be built to regulate traffic, relieve congestion, and increase traffic safety.

This semester, our team built a solution to regulate traffic effectively and improve it through connected cars’ connectivity. Our system includes basic & advanced route planning, and traffic visualization & regulation. For example, it can reroute users automatically to nearby gas stations if their fuel levels are low. It can also apply a strategy that lets regular vehicles avoid special service vehicles for efficiency. In addition, there is a recommendation system for taxi drivers to maximize their income.

There are many popular navigation apps like Google Maps and Waze. But our system is not only a navigation app. Besides the navigation feature, we let vehicles directly communicate with each other. In addition, our system can automatically read an individual car’s condition and take corresponding actions. Our project also has great business potential. It can be integrated into almost any vehicle through an On-board Diagnostic Device and easily upgraded with more data and infrastructure. It is also very scalable through cloud computing.","uni":"bj2437","language":"Python, HTML, AWS","pid":"202105-8","m4uni":"","analytics":"- Algorithm used for route planning:
Dijkstra's Algorithm + Weighted Directed Graph
- Traffic visualization:
Folium + Map GeoJson
- Web server:
Flask + DynamoDB + HTML","m4lname":"","industry":"Transportation","m3lname":"","dataset":"We used 2019 NYC Yellow Taxi Trip Data and 2019 Green Taxi Trip Data from NYC Open Data. The datasets have over 90 million records and 18 columns. Those columns include various information like location, trip distance, fare amount, passenger count, tax, and so on. We mainly made use of the pick-up zone & drop-off zone and pick-time & drop-off time. We got access to the datasets in two ways: API and direct download.","m2uni":"rs4112","m2fname":"Ruofan","m3uni":""},{"projectname":"A Two-Stage Framework for Event-Aware Population Mobility Forecasting","timestring":"Sat Dec 20 02:59:49 2025","m1uni":"yl5708","m2lname":"He","m1fname":"Yuxuan","m4fname":"","m1lname":"Liu","m3fname":"Derun","description":"We propose a two-stage population mobility forecasting framework that combines a stable baseline model with event-specific scaling to better capture holiday spikes and policy-driven disruptions.
An interactive AI-assisted system further enables scenario-based forecasting and geospatial visualization (kepler.gl) through natural language/panel select control, providing both improved accuracy and interpretable insights.","uni":"yl5708","language":"python/jupyter Notebook","pid":"202512-17","m4uni":"","analytics":"Analysis of various models: Ridge(alpha=1), MLPRegressor, GradientBoosting, XGBoost, RandomForest. Using kepler.gl and react.js to build the frontend, and a REST-style API for the backend. Langchain+RAG interface for the Agent function. ","m4lname":"","industry":"Information","m3lname":"li","dataset":"The data are collected from Gaode Map, a Chinese service similar to Google Maps. We obtained this dataset from https://www.macrodatas.cn/. The data contains the zip code/name of all the cities of China. In addition, the most significant field is the migration_index, which reflects the willingness of people to go to the destination city, derived by analyzing data from users of the app.","m2uni":"yh3787","m2fname":"Scarlett","m3uni":"dl3749"},{"projectname":"An RAG-Based Interview Assistant Chatbot for IC Design and Verification","timestring":"Fri Dec 19 05:31:54 2025","m1uni":"qf2181","m2lname":"Li","m1fname":"Qianxu","m4fname":"","m1lname":"Fu","m3fname":"Xinyu","description":"The goal of this project is to design and implement a domain-specific, data-driven interview preparation system for Integrated Circuit (IC) design and verification roles. The project aims to combine Retrieval-Augmented Generation (RAG) with an interactive web-based interface to provide accurate, grounded, and personalized interview practice. Unlike general-purpose chatbots, our system focuses on precision-critical engineering knowledge and minimizes hallucinations by strictly grounding responses in verified technical documents. 
Additionally, by incorporating resume-aware mock interviews and performance analytics, the project seeks to simulate realistic interview scenarios and provide actionable feedback to help users systematically improve their technical readiness.","uni":"qf2181","language":"Python, HTML","pid":"202512-15","m4uni":"","analytics":"The system implements several analytics algorithms and system modules across the backend and frontend. On the backend, Retrieval-Augmented Generation is used, combining semantic embedding, vector similarity search, and large language model inference to generate grounded answers. Resume-aware keyword extraction and similarity-based retrieval are used to personalize mock interview questions. On the frontend, visualization modules display interview scores, knowledge-area distributions, study time, and performance trends over time. Together, these analytics and visualization components enable quantitative evaluation of user performance, transforming the system from a simple chatbot into a data-driven interview analytics and coaching platform.","m4lname":"","industry":"Information","m3lname":"Liu","dataset":"The primary dataset used in this project consists of a curated collection of domain-specific technical documents in PDF format, including digital IC design textbooks, verification interview guides, and course lecture notes. These documents were collected from publicly available educational resources and personal academic materials, then ingested using automated document loaders. During preprocessing, the PDFs were split into semantically coherent text chunks and converted into vector embeddings for retrieval. 
While the current evaluation focuses on IC design and verification materials, the system is designed to support additional datasets, such as FPGA design documents, timing analysis references, or other engineering interview materials, by simply adding new documents to the ingestion pipeline without retraining the model.","m2uni":"ml5160","m2fname":"Mingzhi","m3uni":"xl3455"},{"projectname":"AI Academic Advisor Chatbot for Columbia University","timestring":"Thu Jan 1 04:40:50 2026","m1uni":"jz3850","m2lname":"Zhang","m1fname":"Junfeng","m4fname":"","m1lname":"Zou","m3fname":"","description":"This project presents an intelligent academic advising system for Columbia University that combines Retrieval-Augmented Generation (RAG), semantic search, and hybrid intent detection to provide personalized course recommendations. The system addresses the challenge of navigating Columbia’s extensive course catalog of 8,120+ courses by implementing a conversational AI interface that understands natural language queries and generates context-aware recommendations using local language models. We developed a novel hybrid intent detection approach that combines regex-based pattern matching with intelligent parameter extraction, achieving 100% accuracy across our comprehensive test suite. The system demonstrates significant practical improvements with response times under 5 seconds and zero hallucination rate through strict prompt engineering. Our evaluation shows that the hybrid approach provides reliable, deterministic intent classification while maintaining the flexibility needed for natural language understanding. The system successfully handles instructor queries, specific course lookups, level-filtered searches, and topic-based exploration with student profile personalization.

","uni":"jz3850","language":"Programmed using Python and deployed on Google Cloud Platform","pid":"202512-04","m4uni":"","analytics":"Our project implements a full-stack AI-driven academic advising system that integrates multiple analytics techniques, algorithms, system modules, and visual components.
From an analytics and algorithmic perspective, we implemented semantic text embedding for course descriptions, vector similarity search using FAISS, and a hybrid retrieval strategy that combines semantic similarity with structured filtering over course metadata stored in MongoDB. A ranking and de-duplication algorithm was applied to prioritize relevant courses and remove redundant entries. We further adopted a retrieval-augmented generation (RAG) framework to ground large language model (LLM) responses in retrieved data and reduce hallucination.
From a system design perspective, the system consists of a data preprocessing and storage module, a semantic retrieval and hybrid search module, an LLM reasoning module based on LLaMA 3.2 (1B), and a backend orchestration layer that manages query processing, retrieval, and response generation. A web-based user interface module enables interactive, multi-turn academic advising.
For visualization, we implemented a chat-based web interface that displays structured course results, instructor information, schedules, and advisor-style explanations generated by the LLM. The UI supports real-time interaction, result comparison, and interpretable presentation of recommendations.","m4lname":"","industry":"Information","m3lname":"","dataset":"Our system integrates three primary data sources from Columbia University’s course catalog. The course data contains 8,120 courses with attributes including call number (unique identifier), course code (department and level such as \"EECS E6895\"), title, instructor name, department, credit points, full course description, prerequisites, and academic term. The enrollment data contains 8,120 enrollment records with historical enrollment information for future enhancements. The instructor data includes 14,699 instructor records with faculty names and department affiliations.
So far, our platform supports only JSON-format files.

","m2uni":"yz4843","m2fname":"Yangyang","m3uni":""},{"projectname":"Palimpsest-NYC","timestring":"Tue May 5 20:42:48 2026","m1uni":"cy2822","m2lname":"Jia","m1fname":"Chenhao ","m4fname":"","m1lname":"Yang","m3fname":"Thomas","description":"Palimpsest NYC is an agentic LLM walking-tour app for a small bounding box covering Morningside Heights and the Upper West Side. You can ask \"tell me about a gothic cathedral in Morningside Heights\" or \"plan me a 30-minute walk past Columbia and Riverside Church,\" and the system answers with narrated text grounded in public archives, an ordered route along real pedestrian ways, and a citation for every claim. Citations are checked against a retrieval ledger before the response leaves the server. If the model invents a source, the verifier rejects it, and the loop gets one more turn to fix the problem.","uni":"cy2822","language":"The backend is Python 3.12 with FastAPI, async SQLAlchemy 2 over asyncpg, and the OpenRouter SDK for LLM calls. Validation runs used kimi-k2.6; google/gemma-4-31b-it and openai/gpt-5.4-mini are configured alternates. Embeddings come from sentence-transformers running BAAI/bge-small-en-v1.5 on CPU. Tool argument validation uses jsonschema, and Redis backs the LLM cache and circuit-breaker bookkeeping. The database is PostgreSQL 16 with the PostGIS, pgvector, and pg_trgm extensions. Schema is migrations-first: SQL files under apps/api/app/db/migrations/ are applied by the postgres entrypoint on first volume init, the ORM in app/db/models.py is a typed read-only mirror, and schema changes require make nuke && make up. The frontend is React with Vite and TypeScript, MapLibre GL JS for the map, and Tailwind for styling. The SSE consumer is a small custom layer over fetch plus ReadableStream; V1.1 switched away from the browser's built-in EventSource because the agent endpoint moved to POST with a JSON body, so requests can carry headers like X-LLM-Credentials for BYOK. 
Routing runs against OSRM v5.25.0 with the foot.lua profile, served from osrm:5000 inside the compose network. The street graph is prepared once by a one-shot osrm-prepare service against a Geofabrik OSM PBF extract for the Morningside Heights and UWS bbox. Everything ships as Docker. docker-compose.yml handles local builds; docker-compose.prod.yml runs from the three published images on GHCR (ghcr.io/nyavana/palimpsest-{api,web,postgres}). GitHub Actions push those images on every commit to main, on v* tags, on PRs into main (PR builds get a sha- tag and do not move latest), and on manual dispatch. Spec-driven change proposals live in openspec/, and make spec-validate runs the strict validator against the active change. The WebUI is optimized for both desktop and mobile devices.","pid":"202605-27","m4uni":"","analytics":"The system has four moving parts that decide what the user sees: a retrieval layer, the agent loop, a routing tool, and the streaming frontend. None of them are individually exotic, but the interaction between them is where the project lives.

Retrieval runs against a 928-place corpus indexed two ways. Each place row carries a 384-dim embedding from BAAI/bge-small-en-v1.5 and a PostGIS geometry, so the search_places tool can rank by semantic similarity (pgvector ivfflat cosine ANN, vector_cosine_ops, lists=100) and optionally filter by spatial radius (ST_DWithin) when the LLM passes a near=[lat, lon]. Using the same embedding model at ingest time and at query time keeps cosine distances comparable across both halves. A trigram GIN index on names and document bodies is wired but unused in V1, ready for a keyword fallback in v2.

The agent itself is a multi-turn tool-use loop that runs against OpenRouter. It has two tools, search_places and plan_walk. The model decides what to call and when to stop, with a hard cap of seven turns. The final turn strips the tool surface, forces response_format=json, and gives the model an 8192-token budget so reasoning models like kimi-k2.6 can fit both their thinking traces and the final JSON in one response. Every claim in that JSON has to come back with a citation matching a doc_id that the retrieval layer actually returned earlier in the same conversation. If the model invents a source, the verifier rejects it and the loop gets one corrective turn to try again. If the retry also fails, the answer ships with verified=False and a warning rather than a 500.

Routing is the second tool. The agent calls it when the user asks for a walk, prodded by a soft regex hint in the system prompt rather than a hard router. Two cited places go to OSRM's /route endpoint along the foot.lua profile; three to eight cited places go to OSRM's /trip endpoint, which solves a TSP over the middle waypoints with the first and last stops pinned. Geometry comes back as a GeoJSON LineString and travels untouched all the way to the browser. If OSRM is unreachable, the tool falls back to a haversine straight line and the frontend renders it as a dashed muted path so the degradation is visible to the user, not silent.

The frontend consumes all of this as Server-Sent Events. Markers drop on the map as citations arrive, the narration streams into the chat pane, and the route renders as soon as the walk frame lands. There is no batch wait on the terminal done frame. The walk timeline shows a total distance and time, expandable per-stop turn-by-turn directions, and a tsp-optimized badge when the trip endpoint reordered the route. Underneath, the LLM router enforces a per-tier circuit breaker (3 fails in 60 seconds trips it, 30-second cooldown), and BYOK mode hands every request its own router with fresh breaker state so a user's bad API key cannot poison the shared cache.

Telemetry runs alongside the whole thing. Every /agent/ask request writes a SessionRecord with turn count, tool calls, citation outcome, routing backend, and latency, which feeds the report's evaluation section. A second jsonl file under logs/claude-sessions/ captures the agent-build telemetry that feeds the empirical-analysis half of the project: token counts, edit cycles per file, time to green tests.
","m4lname":"","industry":"Information","m3lname":"Duan","dataset":"Two public sources in the current version, both filtered to the same bbox: (-74.005, 40.768, -73.955, 40.815). That is roughly 86th to 110th Streets, Riverside Park to Morningside Park.

Wikipedia + Wikidata. Licensed CC BY-SA. The pipeline starts with a Wikidata SPARQL query that uses the wikibase:box service to bound results geographically, then fetches the Wikipedia REST /page/summary/ endpoint for each match.

OpenStreetMap. Licensed ODbL. An Overpass API query for named POIs tagged with amenity, tourism, historic, or leisure inside the same bbox. POIs land in places with no separate document body. For OSM, the place row is the citable object. Final count: 436 places.

Both go through a cache at apps/api/app/ingest/raw_cache.py, so re-runs do not re-hit the upstream APIs. The cache survives container rebuilds via a ./data:/app/data bind mount.","m2uni":"kj2712","m2fname":"Kaining","m3uni":"hd2593"},{"projectname":"Emotion Recognition in Artworks: Leveraging CNNs for Visual Emotion Analysis and Chatbot Development","timestring":"Sat Dec 21 04:50:11 2024","m1uni":"kl3658","m2lname":"Niu","m1fname":"Keer","m4fname":"","m1lname":"Lu","m3fname":"","description":"The project wants to develop a comprehensive system that integrates emotion recognition in artworks with interactive AI-driven conversations, combining Convolutional Neural Network based emotion classification with a fine-tuned Llama 2-Chat model. This innovative approach bridges the gap between visual analysis and conversational AI, employing advanced techniques like transfer learning, data augmentation, and fine-tuning. The system is capable of classifying emotions evoked by artworks and engaging users through meaningful, interactive dialogues, making it a valuable tool for art interpretation, therapeutic applications, and user experience design. By bridging visual and conversational capabilities, the project points out the importance of the potential of AI in creative and emotional contexts, providing practical applications while laying the foundation for future advancements in multimodal AI.","uni":"kl3658","language":"Python, Tensorflow/keras, Google Cloud Platform (gcp), Hugging Face, Google Colab, PyTorch","pid":"202412-6","m4uni":"","analytics":"The project implemented several analytics and algorithms to achieve its objectives.

Convolutional Neural Networks (CNNs) were utilized for emotion classification, with MobileNet-V2 serving as the base architecture. Key enhancements included transfer learning, data augmentation, dropout layers, and regularization to improve the model’s accuracy and robustness. For conversational capabilities, the Llama 2-Chat model was fine-tuned using the GoEmotions dataset, utilizing techniques like Low-Rank Adaptation (LoRA) for efficient parameter tuning and quantization to optimize memory usage. These techniques enabled the system to handle complex image and text-based emotion analysis effectively.

There are 2 main modules developed. The emotion detection module, powered by CNNs, classifies emotions in artworks into positive, negative, or mixed categories. This output is seamlessly integrated with the chatbot module, which leverages the fine-tuned Llama 2-Chat to engage users in meaningful, emotion-driven conversations. Additionally, the system includes a data preprocessing pipeline, hosted on Google Cloud Platform, to ensure efficient data handling, model training, and deployment.

Training and validation accuracy/loss plots illustrated the improvements achieved through CNN model adjustments. Confusion matrices provided insights into emotion classification performance, identifying areas for further refinement. A workflow diagram visually outlined the system’s architecture, highlighting the interaction between the emotion recognition module and the chatbot, which together deliver an engaging and functional AI system.","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset tested is the WikiArt Emotions Dataset, which contains over 4,000 artworks annotated with 20 emotion categories. It was obtained from WikiArt.org and is publicly available. This dataset is specifically designed for research on recognizing emotions in visual art and includes metadata such as art styles, genres, and emotional labels.

Our software can support other similar datasets containing labeled images for emotion recognition. It allows flexibility in preprocessing images, such as resizing and normalizing, and can also work with text datasets for conversational AI.","m2uni":"pn2433","m2fname":"Purui","m3uni":""},{"projectname":"Traffic Accident Analysis Prediction and Visualization","timestring":"Fri Dec 13 21:54:36 2019","m1uni":"xz2872","m2lname":"zhang","m1fname":"xingyu","m4fname":"","m1lname":"zhu","m3fname":"senbiao","description":"Objectives: Our project aims to analyze the UK traffic accidents dataset, build ML models that predict the severity of an accident from several objective conditions, use exploratory data analysis to answer various questions, visualize the data on an interactive map to identify areas with high accident rates, and apply k-means clustering to the accident location data to find the best locations for medical stations.

Innovations: We use several advanced machine learning models for prediction, use k-means to find the best locations for medical stations, and use interactive maps to visualize the accident data in different colors.

Importance: This research can help predict the severity of an accident under given conditions, which can help the relevant departments issue early warnings to drivers, improve public transport safety, and reduce loss of life and property. It can also help insurance companies reasonably estimate
the insurance cost for different districts.","uni":"xz2872","language":"python, flask, dash, GCP, Spark","pid":"201912-7","m4uni":"","analytics":"ML algorithms: Random Forests; XGBoost ;Naïve Bayes; Multi Layer Perceptron; AdaBoost; Logistic Regression; LightGBM; K-means
System modules:
Prediction web application: Flask for the back end and HTML, CSS, JavaScript, Ajax, and jQuery for the front end.
Interactive map visualization: Dash. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python.
","m4lname":"","industry":"Transportation","m3lname":"fang","dataset":"The dataset is details of road accidents and involved vehicles in the UK from 2005 to 2017. We get it from kaggle. The dataset containing the features we used to buld up our models, like light conditions, weather conditions, road surface conditions and so on, could be used in our models to make predictions.","m2uni":"hz2620","m2fname":"henwei","m3uni":"sf2977"},{"projectname":"Manhattan Real-Time Traffic Analysis","timestring":"Tue Dec 23 08:42:47 2025","m1uni":"jq2434","m2lname":"Li","m1fname":"Junhao","m4fname":"","m1lname":"Qu","m3fname":"","description":"In urban environments like Manhattan, traffic conditions change rapidly. Traditional routing based on static road weights often produces unrealistic travel times because it ignores congestion. My goal is to build an end-to-end routing system that:
Uses real road geometry
Continuously updates road speeds from live traffic data
Returns realistic travel time and ETA","uni":"jq2434","language":"Python and colab","pid":"202512-19","m4uni":"","analytics":"The project implements real-time traffic-aware routing using graph-based shortest-path algorithms with dynamically updated edge weights. Spatial nearest-neighbor matching with an STRtree maps HERE traffic data to OSMnx road edges, and a Random Forest machine learning model is used to predict short-term future road speeds. An interactive Leaflet.js frontend visualizes fastest routes, distance, and ETA on a Manhattan map.
","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Here traffic api, osmnx, DOT – Traffic Speeds (NBE), NYC Open Data","m2uni":"bl3155","m2fname":"Boxiong","m3uni":""},{"projectname":"A12: Creative Writing & Story Telling","timestring":"Fri May 5 22:48:27 2023","m1uni":"ws2635","m2lname":"","m1fname":"Wanjia","m4fname":"","m1lname":"Song","m3fname":"","description":"The project goal is to build a platform for the e-commerce industry by providing business owners with an advanced analysis tool. By scraping data from the internet and utilizing Chatgpt API, I provide more accurate and relevant insights to our users. The Current functionalities include 1) Accessing the latest product trends and analysis and 2) Generating product names based on top-selling products. The business value is to make data-driven decisions and stay ahead of market trends.
","uni":"ws2635","language":"Django platform using python, js, html, css.","pid":"202305-16","m4uni":"","analytics":"I leveraged the OpenAI to analyze scraped product data from Aliexpress, generate product trends as bar charts using D3.js for visualization on the front end, and provide personalized product name recommendations based on the response generated. ","m4lname":"","industry":"Information","m3lname":"","dataset":"The data is scraped from Aliexpress including the product name, price, quantity sold, number of reviews, and rating. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Real-time Analysis of Taxi Usage across New York City (NYC) ","timestring":"Sat Dec 17 02:49:57 2022","m1uni":"ra3141","m2lname":"Ammanamanchi","m1fname":"Rishav","m4fname":"","m1lname":"Agarwal","m3fname":"Rachana","description":"Objectives:
1) Incentivize carpooling by understanding cabs' capacity utilization
2) Improve car deployment for taxi services based on historical pickup density
3) Increase customer satisfaction by pre-allocating long-haul cabs based on predicted drop-off locations.

Capabilities:
1) Visualizing the cab utilization for carpooling across the city
2) Understand the trend of movement of people across New York City as the day progresses
3) Dynamic Pickup and Destination prediction using Machine Learning models

Innovations:
Although the NYC TLC dataset has been collected and updated since 2010, we did not find any project covering the end-to-end big data workflow, from data streaming, stream processing, and data warehousing to insight and visualization generation, while also leveraging ML models to improve and predict custom use cases.
We now have a one-stop solution that works with live streaming data and also supports incremental model updates.
","uni":"ra3141","language":"Google Cloud Platform, Python, PySpark (Mlib, structured streaming), DataProc, Big Query, Tableau, Google Cloud Storage, d3.js, HTML / CSS, Apache","pid":"202212 - 2","m4uni":"","analytics":"Algorithm 1:
Using PySpark, we applied the RDD and MapReduce concepts to aggregate the frequency of passenger counts into two groups: = 1 and >= 2.
First, we used the .reduceByKey and .map operations to calculate the frequency of every passenger count from 1 to 9. From the resulting RDD, we then summed the frequencies for counts of 2 or more to understand carpooling behavior in the Green Taxi data.
Plotting the program's output as a pie chart with d3.js shows that, for the Green Taxi data in particular, a passenger count of 1 accounts for almost 83% of trips, while higher counts account for just 17%.

Algorithm 2:
Visualize the movement of cabs across NYC and understand the trend based on the density of pickup and drop-off locations, leveraging DataProc clusters and Spark. This can be used to improve the deployment of vehicles for taxi services based on pickup locations throughout the day. The first preprocessing step adds day-of-week and hour columns derived from the date/time column, showing how the locations change by the hour over a week. Color indicates the density of pickups based on latitude and longitude: red regions indicate higher density, yellow indicates moderate density, and green and blue show lower density.

Algorithm 3:
Data cleansing and ground-truth extraction: basic preprocessing of null values and erroneous latitude/longitude values
Use Geohashing to bin geographic locations
Bin time of day into half hour time slots
Gather pickups based on geohashes.

Features:
Time, day, Pickup Location, is weekend?

Models Evaluated (RMSE is in log base 10 space):
Random Forest (R2: 0.79, RMSE: 0.127)
K Means (R2: 0.5507, RMSE: 0.186)
Neural Networks (R2: 0.5212, RMSE: 0.192)

Visualizations:
Pie Chart
HeatMap
Bar Plots

","m4lname":"","industry":"Transportation","m3lname":"Dereddy","dataset":"The Dataset tested are:

Green Taxi Data: NYC OpenData and Socrata OpenData REST API
Yellow Taxi Data: NYC OpenData and Socrata OpenData REST API.
Uber pickups Data: https://fivethirtyeight.com

Volume: The first dataset has a dimension of (8049224, 22) while the Uber Taxi dataset has a dimension of (4534327, 4). This gives a combined dataset of more than 12 million rows.
Velocity: The real-time data is being processed in batches. For every second, we have a batch size of 100 records.
Variety: Our data is captured as JSON and CSV and is aggregated from three forms of transportation: Yellow Taxi, Green Taxi, and Uber trip records.

Data Collection through Structured Streaming:
1) Mock Stream Generator: Fetches data from the NYC TLC REST APIs and the pre-fetched Uber pickup data and sends it at a custom rate to the target socket.
2) Stream Parser: The incoming data from the various sources is handled by a Spark Structured Streaming socket listener, refined, preprocessed, and finally written in micro-batches to Google BigQuery.

The software can support any taxi data that provides latitude and longitude in formats such as CSV, Parquet, and JSON.
","m2uni":"sa3979","m2fname":"Karthik","m3uni":"rd2998"},{"projectname":"Image Re-colorization","timestring":"Sat Dec 17 05:17:08 2022","m1uni":"yc4037","m2lname":"Dantsoho","m1fname":"Yanru","m4fname":"","m1lname":"Chen","m3fname":"Xuechun","description":"Existing online coloring tools do not separate genres when training models and therefore do not generate satisfactory results
Train different genres using different models to improve recoloring performance and accuracy
Model A: Landscape data
Model B: Fantasy data
Model C: Real life objects and animals data
Model D: Human data
These models can also be used for specific tasks, such as coloring manga and recoloring black-and-white photos
Users will also be able to choose the style used to recolor the image
","uni":"yc4037","language":"Python, Google Cloud","pid":"202212-7","m4uni":"","analytics":"pix2pix cGAN model (TensorFlow)
Build the generator: responsible for generating colorized images from grayscale images
Define generator loss: GAN loss + Lambda (100) * mean absolute error
Build the discriminator: takes colored images as input and classifies whether they are real images or generated by the generator (works against the generator)
Define discriminator loss: sigmoid cross-entropy loss of the real images & generated images
Initialize optimizers: Adam with a learning rate of 2e-4
Train the model: 800 FANTASY images / 1300 HUMAN images / 4200 real life OBJECT and ANIMAL images / 6000 LANDSCAPE images
Use the trained model to generate colorized images from grayscale images
Autoencoder (TensorFlow)
Define encoder: convolutional layers that learn a reduced dimensional representation of the input image
Define decoder: transpose convolution layers that regenerate the image from the reduced dimensional representation learned by the encoder
Build the model with encoder and decoder
Train the model: 800 FANTASY images
Generate colorized images
","m4lname":"","industry":"Information","m3lname":"Yaun","dataset":"Dataset collected from Kaggle, pinterest, and other research datasets
Object Dataset:
COCO
Human Dataset:
Human Parts Dataset
Landscape Dataset:
Kaggle Landscape colorization dataset
Fantasy Dataset:
Collected from pinterest
Changed imagenet to coco and human parts dataset to separate human and people dataset from our original settings
3V:
Volume: ~16000 images in total
Variety: 4 different genres
Velocity: batched dataset for training & user upload image during deployment
","m2uni":"fd2508","m2fname":"Fatima","m3uni":"xy2506"},{"projectname":"Dietary Trends Analysis in Children with PySpark","timestring":"Fri Dec 17 23:36:34 2021","m1uni":"hn2388","m2lname":"Kumar","m1fname":"Hari","m4fname":"","m1lname":"Nair","m3fname":"Aigerim","description":"During the last couple of decades, diabetes cases have been increasingly proliferating around the world. The worldwide incident cases of diabetes mellitus have increased by 102.9% from 11,303,084 cases in 1990 to 22,935,630 cases in 2017 worldwide, while the diabetes per 100,000 persons ratio increased from 234 to 285 in this period. Due to an alarming increase in the rate of diabetes diagnosis in children over the last 2 decades, there is an urgent need to analyze the dietary intake of children. To this end, we propose the use of machine learning algorithms using the MLlib library in PySpark to tackle this problem. Using data retrieved from the National Health and Nutrition Examination Survey (NHANES) data conducted by the Centers for Disease Control and Prevention (CDC), we aim to assess intake per nutrient across different financial classes and intake per financial class for each nutrient. We aim to understand and classify children prone to diabetes based on nutrient intake and also identify the correlation of dietary intake to the financial status. ","uni":"hn2388","language":"Python, PySpark","pid":"202112-40","m4uni":"","analytics":"Big data visualization, K-means clustering, Bisected k-means clustering, t-distributed stochastic neighbor embedding","m4lname":"","industry":"Life Science","m3lname":"Kdyrgaliyeva","dataset":"Public dataset - National Health and Nutrition Examination Survey

Dietary Data, Demographics Data","m2uni":"sk4975","m2fname":"Siddhant","m3uni":"ak4733"},{"projectname":"A Comprehensive Recommendation for Music System","timestring":"Sat Dec 22 16:18:51 2018","m1uni":"jl5173","m2lname":"Gao","m1fname":"JINYANG","m4fname":"","m1lname":"LI","m3fname":"","description":"Recommendation systems can help enterprises increase profits. A single recommendation algorithm alone cannot satisfy users, so we chose two mainstream recommendation algorithms for our project.

For the item-based recommendation part, we used cosine similarity to calculate similarities between songs. Here we selected timbre as the song feature instead of traditional song information, because we wanted to recommend songs of a similar style; according to our investigation, timbre is a very important factor that can differentiate songs of different styles.

Finally, the main reason we think our project is more meaningful is that we also assigned each song RGB features so that songs can be visualized in different colors; songs with similar styles share similar colors. The RGB features we introduced help users jump to songs in their favorite styles. Also, if a data analyst wants to analyze relationships between songs, visualizing songs with colors is more intuitive.","uni":"jl5173","language":"main:python3, jupyter, eclipse web page:html5, javascript database:mysql Data processing:pandas, numpy,","pid":"201812-32","m4uni":"","analytics":"Algorithms: PCA (Principal Component Analysis), Collaborative Filtering, Cosine Similarity
System modules: flask, html, css, javascript","m4lname":"","industry":"Information","m3lname":"","dataset":"For dataset, we have three csv files:
features.csv: Timbre features for each song.
ratings.csv: Used for CF recommendation; contains only three columns: user_id, song_id, and play count
song_info.csv: Maps song names to song IDs so that results can be shown to users by song name instead of song ID

Any dataset with user information, song information and ratings can be applied in our software package

We fetched them from this website: http://labrosa.ee.columbia.edu/millionsong/","m2uni":"zg2308","m2fname":"Zheyao","m3uni":""},{"projectname":"C7: Medical Insurance Fraud Detection","timestring":"Fri May 15 21:11:01 2020","m1uni":"hl3353","m2lname":"Xie","m1fname":"Hongshan","m4fname":"","m1lname":"Lin","m3fname":"","description":"Healthcare is a major industry in the United States and is important in the lives of many citizens, but the continuing rise in overall medical care expenditure and the high cost of health-related services leave many patients with limited medical care. The impact of Medicare fraud is estimated at between 3% and 10% of the nation's total healthcare spending, and reached approximately $2 billion, involving 165 medical professionals, in 2018. To reduce costs for insurance companies and the government, our project aims to build a clinical retrieval system that detects anomalous cases among electronic medical records. Our project also helps medical workers make better decisions by identifying sequences that deviate from typical behavior. For example, a doctor may be interested in finding patients whose postoperative response differs from that of other patients who have had the same surgery, so that doctors can provide personalized care plans for similar patients in the future.
Previous work has shown that machine learning for data-driven diagnosis has been actively studied in medicine to provide better healthcare, and that publicly available Medicare claims data can be leveraged to construct models capable of automating fraud detection. However, characteristics of electronic clinical records, including imbalanced-class big data, irregularity in time, and sparsity, hinder performance. Effective use of event sequence data requires identifying sequences that deviate from the typically occurring behavior. Moreover, most prior work in event sequence anomaly detection focuses on detecting only one type of anomaly within a specific application domain. Such approaches do not provide flexible ways of identifying diverse types of anomalies, and provide only limited support for result interpretation.
To address these challenges, we try several different methods. We calculate similarities between medical records by treating them as event and sequence embeddings. We first use Skip-gram to learn the vector representations of medical event codes and set the window size to learn the temporal relationships within the patient's record. To make comparison of sequences with different lengths easier, we incorporate Dynamic Time Warping (DTW) sequence-based alignment, and distances are calculated using the events’ vector representations. In the second approach, we employ Variational AutoEncoders (VAE) in order to project event sequences with different lengths to vectors of the same length. The model learns latent representations for patients’ event sequences and detects anomalies that deviate from the overall distribution.
These two approaches are each followed by Local Outlier Factor (LOF) and other anomaly detection analyses. We evaluated model performance by comparing the outlier results across several different embedding approaches and their subsequent analyses. At the final stage, we develop a visual analytics system to support comparative studies of patient records. Through its interactive interface, the user can quickly identify patients of interest and conveniently review both the temporal and multivariate aspects of the patient records.
","uni":"hl3353","language":"Jupyter Notebook, Python, Spark","pid":"202005-21","m4uni":"","analytics":" Aiming to reduce costs for either insurance companies or the government, our project tries to build a clinical retrieval system to detect anomaly cases among electronic medical records. Previous work has shown that machine learning for data-driven diagnosis has been actively studied in the medical industry to provide better healthcare. However, the challenges associated with characteristics of electronic clinical records may include: imbalanced-class massive data, irregularity in time, and sparsity hinder performance. To address this challenge, we deal with these medical events sequences based on natural language processing techniques At the final stage, we develop a visual analytics system to support comparative studies of patient records.Through its interactive interface, the user can quickly identify patients of interest and conveniently review both the temporal and multivariate aspects of the patient records.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"In this project, we used detailed and comprehensive patients’ data from MIMIC-III dataset which contains 26 tables and medical records for 41000+ critical care patients. MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database consisting of deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. It includes demographics, vital signs, laboratory tests, medications, and more. Details are available on the MIMIC website: https://mimic.physionet.org/
","m2uni":"cx2234","m2fname":"Chengming","m3uni":""},{"projectname":"Financial Fraud Detection","timestring":"Fri Dec 15 22:04:22 2023","m1uni":"rg3466","m2lname":"Bhardwaj","m1fname":"Rufina Flora","m4fname":"","m1lname":"George Rajan","m3fname":"Vishnu","description":"Our project aims to develop a robust and scalable system to detect and prevent fraudulent activities within financial transactions. Leveraging the power of big data technologies, we have analyzed large volumes of transaction data to identify patterns, anomalies, and suspicious behavior indicative of fraudulent transactions. We have utilized machine learning algorithms and airflow to automate the intelligent detection of fraudulent transactions.

Since there are millions of financial transactions taking place every day, each with hundreds of descriptors, it is not an easy task to track fraudulent activity in large incoming streams of data. Leveraging big data technologies, we can handle large amounts of data with ease. Integrating machine learning algorithms into these technologies helps us intelligently recognize any fraudulent activity that would otherwise go unnoticed by humans or basic algorithms. Our project leverages both these cutting-edge technologies to recognize fraudulent transactions. Although we have used static data in our project, the incorporation of the time feature allows us to apply our software to real-time transactional data. We have also built an automated workflow using Airflow to compare 7 different machine learning models to determine which one would be best for our application area. Additionally, we have built dashboards to visualize data from the training dataset to get a clearer picture of the kind of data we will be using and to find any imbalances in data distribution. Pie charts were used in the dashboard for categorical data and histograms for continuous numerical data.","uni":"rg3466","language":"Python, Google Cloud Platform, Colab, Airflow, Apache, HTML, d3.js","pid":"202312-5","m4uni":"","analytics":"In our project, we have used d3.js to visualize our data in the form of pie charts and histograms. We have also performed further exploratory data analysis on Colab with histogram visualizations. We built an automated workflow pipeline using Airflow to set up several tasks to pre-process data, train and run inference with 7 machine learning models, and compile the final results of inference from each of the 7 models. The results from our EDA made us drop several features due to the high volume of NaN/Null, dominant values. Additionally, we scaled our data using a standard scaler prior to training our machine learning models. 
This data was then passed to all 7 models simultaneously for training and inference. Each task was defined in a separate Python file, which was then called from the Python script that defines the Airflow pipeline using the Bash operator. The final results from each model task were compiled in the final task and written into a CSV file for comparison.","m4lname":"","industry":"Finance","m3lname":"Kulkarni","dataset":"We used the IEEE-CIS Fraud Detection dataset publicly available on Kaggle for our final project. Our software can also support real-time transaction data.

Dataset link: https://www.kaggle.com/competitions/ieee-fraud-detection/overview

Description:
The IEEE-CIS Fraud Detection dataset, available on Kaggle, is a comprehensive collection designed for developing and testing fraud detection algorithms in the realm of financial transactions. The dataset is curated by the Institute of Electrical and Electronics Engineers (IEEE) and provides a rich and diverse set of features derived from real-world credit card transactions. With a focus on tackling the challenges of fraud detection, the dataset encompasses both numerical and categorical features, offering a holistic representation of transactional behavior.

Containing a vast amount of data, the IEEE-CIS Fraud Detection dataset challenges data scientists and machine learning practitioners to create models that can accurately identify instances of fraudulent activities. The dataset is particularly valuable for its realistic simulation of financial transaction scenarios, aiding in the development of robust machine learning models capable of distinguishing legitimate transactions from fraudulent ones. As fraud continues to pose a significant threat in the financial sector, this dataset serves as a crucial resource for researchers and professionals working on innovative solutions to enhance the security and integrity of electronic transactions.","m2uni":"vb2573","m2fname":"Vishal","m3uni":"vk2496"},{"projectname":"Conversational Chat-bot based on Sequence to Sequence model ","timestring":"Sat May 18 02:42:49 2019","m1uni":"zl2697","m2lname":"","m1fname":"Zhiying","m4fname":"","m1lname":"Li","m3fname":"","description":"Our objective is to build a Chat-bot, which is quite a challenging work for the machine, since it requires the knowledge of the human language and common sense. We use the sequence to sequence framework and different techniques of RNNs. We also use Flask to do the User Interface. This research is important because it enables me to take a further step into the area of deep learning. ","uni":"zl2697","language":"Python ","pid":"201905-3","m4uni":"","analytics":"Sequence to Sequence Model
Recurrent Neural Network
","m4lname":"","industry":"Information","m3lname":"","dataset":"We use three datasets: the Cornell Movie Dialogue Corpus, the SQUAD and WikiQA dataset ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Job Recommendation System","timestring":"Fri May 6 06:01:47 2022","m1uni":"sw3601","m2lname":"Wen","m1fname":"Shiyu","m4fname":"","m1lname":"Wang","m3fname":"","description":"In order to pursue higher living standard and work-life balance, people must spend substantial time on finding jobs that match themselves. However, there are many seemingly similar jobs with completely different professional requirements, which undoubtedly dazzles young people who have just joined in the society looking for jobs and increases the obstacles for them to seek for the right career development path. And most job recommendation systems in the market need to enter user's profile into the text box manually. In our project, we train the model based on on the importance of skills, which recommends jobs to users in a meaningful way. Besides, our model also recommends similar jobs, which provides more options. 
We also design a full-stack system to interact with our users, in which job seekers can get the recommendation result just by uploading their resume.","uni":"sw3601","language":"Python, JavaScript, HTML, CSS, Flask","pid":"202205-19","m4uni":"","analytics":"Resume Parsing, Content-based recommendation, Designed weighting algorithm, Natural language processing algorithms","m4lname":"","industry":"Information","m3lname":"","dataset":"We use public datasets downloaded from Kaggle, which include a job dataset and a profile dataset.","m2uni":"yw3769","m2fname":"Yifei","m3uni":""},{"projectname":"Using News to Predict Stock Movement","timestring":"Sat Dec 22 18:30:49 2018","m1uni":"ho2271","m2lname":"Zhou","m1fname":"Hung-Yi","m4fname":"","m1lname":"Ou Yang","m3fname":"Chenyao","description":"The ubiquity and abundance of information available today, including market performance, news and media data have enabled investors to make better investment decisions. The challenge lies in how to ingest and interpret the data to find useful information and deterministic factors to assist the decision-making process. In this project, we use machine learning techniques to predict future stock price movement from two modalities of data – textual data from news, and numerical time series data from stock market. Our goal is to build a successful stock prediction model and find out which features are important. A successful model could yield significant profit for investors.","uni":"ho2271","language":"Python, Linux","pid":"201812-38","m4uni":"","analytics":"For data preprocessing, we conducted detailed visualization in the exploratory data analysis (EDA) part. We implemented several domain knowledge algorithms such as MA, EMA, and z-score for our feature engineering. We used lightGBM, a gradient boosted tree based machine learning algorithm, as our model. Dimensionality reduction methods such as SVD and t-SNE are also used to visualize high dimensional data. 
","m4lname":"","industry":"Finance","m3lname":"Liao","dataset":"The dataset we used in this project was not public and only available on the Kaggle platform. Interested users can head over to Kaggle and participate in the \"Two Sigma: Using News to Predict Stock Movements\" competition to gain access to full data. The dataset contains market data and news data. Market data was provided by Intrinio and news data was provided by Thomson Reuters. Our software can support any type of data that are transformed into a Pandas dataframe. ","m2uni":"yz3428","m2fname":"Yining","m3uni":"cl3757"},{"projectname":"Genome Splicing ","timestring":"Fri Dec 21 16:36:50 2018","m1uni":"zz2520","m2lname":"Yu","m1fname":"zhibo","m4fname":"","m1lname":"zhou","m3fname":"Yuwei ","description":"We build a pipeline for transform the data. Then Implement a spark-built palatiform adam on EMR clusters and finally we use bowtie2 to align 10g fasta or 14 G fastqs to aligned-sam file .
With the help of the ADAM platform, we can rapidly sequence whole genomes, zoom in to deeply sequence target regions, and analyze epigenetic factors such as genome-wide DNA methylation and DNA-protein interactions.

","uni":"zz2520","language":"scala, python , R","pid":"201812-45","m4uni":"","analytics":"End-to-end alignment
In this mode, the aligner searches for alignments involving all of the read characters. This is also called an “untrimmed” or “unclipped” alignment.

Local alignment
This is called a “local” alignment because some of the characters at the ends of the read do not participate;
some characters may be omitted (or “soft trimmed” or “soft clipped”) from the beginning and end.","m4lname":"","industry":"Life Science","m3lname":"Zeng","dataset":"","m2uni":"cy2475","m2fname":"Chenghao ","m3uni":"yz3302"},{"projectname":"Roommate & Apartment Finding Platform","timestring":"Sat Dec 18 01:41:45 2021","m1uni":"xz3014","m2lname":"Zhang","m1fname":"Edward","m4fname":"","m1lname":"Zhou","m3fname":"Xiaoyu","description":"There are over 100,000 international students currently studying in New York City[1]. With the improvement of the epidemic situation, more and more students are returning to university communities. However, dormitory space is limited at universities in NYC. The big gap between the number of dormitories and the number of international students means that most students need to rent an apartment. Leaving home and moving to a new city is a big deal. Most students tend to have little experience in renting a house. Therefore, it’s challenging for them to find an ideal apartment in NYC. On the other hand, due to students’ limited budget, finding roommates is another necessity and challenge. In university, students come from different countries and have diverse backgrounds, so it is very risky for students to choose roommates without knowing each other very well. Based on this situation, we want to design a roommate and apartment finding platform for students to find their ideal roommate and apartment efficiently. We noticed that it’s difficult for students to find an ideal apartment and roommates. When choosing an apartment, it often happens that the apartment is too far away from the school, the selected neighborhood is unsafe, or the selected house does not meet their needs and expectations. 
There are many different rental websites to look at, but none are designed for students, and one of the most important factors that students consider when renting apartments is the distance between their apartment and their school. Also, finding appropriate roommates is a big challenge. Since students come from different places in the world, each one has unique experiences and opinions. Different people have different hobbies and lifestyles. Some students are early birds, others are night owls. Some students may enjoy parties, while others do not. Living with people who have opposite lifestyles and hobbies may cause a lot of problems. In our roommate and apartment finding platform, users can input some basic information and then it will recommend some potential roommates and apartments. Our platform could benefit students by narrowing down their options for roommates and apartments to a few potential targets, and students can find ideal apartments and compatible roommates efficiently. ","uni":"xz3014","language":"Python, Javascript, CSS, HTML, Django","pid":"202112-25","m4uni":"","analytics":"Euclidean Distance: Distance from the target to all other users. Sort in ascending order and take the closest user
K-Means: Cluster all users inside database, Pinpoint the cluster that our target user is in, and return all other users within that cluster
Cosine Similarity: Novel recommendation system approach. Cosine of angle between two n-d vectors in n-dim space. The closer to one, the higher the similarity
","m4lname":"","industry":"Retail","m3lname":"Liu","dataset":"The data used for this project are divided into two tables: `apartments` table and `users` table. `users` table is obtained through meetup.com API, `apartments` table is found on course website datasets with Airbnb apartment data.
`Users` table stores user information. New user information is appended to the table, and when recommending roommates, an algorithm is applied to data in this table in search of an optimal candidate.","m2uni":"jz3275","m2fname":"Jiaxiang","m3uni":"xl3129"},{"projectname":"Understanding Clouds from Satellite Images","timestring":"Fri Dec 13 18:48:12 2019","m1uni":"yx2478","m2lname":"Chen","m1fname":"Yuqin","m4fname":"","m1lname":"Xu","m3fname":"Hongbo ","description":"Shallow clouds are an important indicator of Earth’s climate, but they are difficult to represent in climate models due to their varied and murky organizations. So we want to build better models to classify cloud organization patterns and help understand how clouds will shape our future climate.","uni":"yx2478","language":"Python, JavaScript, HTML, CSS, Spark SQL, GCP, Flask","pid":"201912-3","m4uni":"","analytics":"Segmentation model: U-Net, Post-processing model: ResNeXt101, bar chart and pie chart using matplotlib.","m4lname":"","industry":"Life Science","m3lname":"Du","dataset":"Our dataset comes from Kaggle. It is a 6 GB satellite image dataset, containing 5546 train images, 3698 test images and a train.csv with 2 features: 1) train image indices with label names: Sugar, Fish, Flower, Gravel. 2) Encoded pixels are provided for each label in each image.
","m2uni":"jc5029","m2fname":"Jiajia","m3uni":"hd2452"},{"projectname":"Change Detection of Deforestation Using Satellite Data and CNNs","timestring":"Thu Dec 22 04:31:22 2022","m1uni":"mjl2256","m2lname":"","m1fname":"Michael","m4fname":"","m1lname":"Lee","m3fname":"","description":"The goal of the project is to create a scalable image processing pipeline so that CNNs such as a ResUNet can be trained on large annotated satellite image datasets for deforestation detection.","uni":"mjl2256","language":"Python, Airflow, TensorFlow, numpy, QGIS, PyQGIS","pid":"202212-40","m4uni":"","analytics":"-QGIS for alignment of satellite images and PRODES deforestation geospatial data
-TensorFlow for cutting the large satellite images into patches to feed into the CNN
-numpy for stacking images from different years on top of each other for the CNN to learn to see the differences as deforestation
-TensorFlow for model training and evaluation
-Airflow for workflow orchestration","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The datasets used were LANDSAT 8 satellite images and Brazil's INPE PRODES dataset for tracking the deforestation of the Amazon rain forest. (http://terrabrasilis.dpi.inpe.br/en/download-2/). These two were combined in preprocessing to create deforestation mask annotations for satellite images of the rain forest.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Self-Persistent AI Financial Analyst","timestring":"Tue May 5 20:43:59 2026","m1uni":"as7824","m2lname":"","m1fname":"Anav","m4fname":"","m1lname":"Srinivas","m3fname":"","description":"This project develops a self-persistent AI financial analyst that continuously learns from its own interactions through a retrieval-augmented continual learning architecture. The system ingests real financial data from multiple sources including market prices via Alpha Vantage, SEC filings via EDGAR, macroeconomic indicators via FRED, and financial news via Alpha Vantage Sentiment API.

All data is embedded and stored in a ChromaDB vector database that grows and self-updates over time.
The central innovation is a two-layer self-persistence mechanism. At the memory layer, every analyst response is embedded and written back into ChromaDB as a tagged insight chunk.

Future queries retrieve these prior analyses as context, meaning the system builds on its own reasoning across sessions without any manual intervention. At the memory management layer, staleness decay down-weights outdated information, Jaccard-based deduplication removes redundant chunks before they reach the model, and positional context ordering counteracts the lost-in-the-middle attention problem in large language models.

The analyst is powered by Claude Sonnet via the Anthropic API and augmented with nine quantitative financial tools including DCF valuation, CAPM, WACC, Sharpe Ratio, Value at Risk, bond pricing, and portfolio metrics. These tools are invoked automatically by the model when a question requires quantitative analysis. A full evaluation framework measures retrieval quality across ten benchmark queries, answer quality across six graded questions, and memory health through staleness distribution and insight growth tracking.

The system is important because financial markets are fundamentally non-stationary. Models trained once and deployed statically become stale as regimes shift, new instruments emerge, and macroeconomic conditions evolve. A persistent memory architecture is uniquely well-suited to this domain because the value of financial information decays in a measurable and predictable way, making intelligent memory management both necessary and demonstrable. This project shows that retrieval-augmented continual learning is a viable production approach to building domain-specific AI systems that improve with use rather than degrading over time.","uni":"as7824","language":"Python (FastAPI backend), TypeScript (Next.js 14 frontend), Node.js runtime. ChromaDB as a local persistent vector store. Anthropic Claude Sonnet accessed via the Anthropic Python SDK. Hosted development environment using uvicorn ASGI server.","pid":"202605-20","m4uni":"","analytics":"The system implements the following algorithms and analytical modules. Retrieval-Augmented Generation (RAG) using cosine similarity search over ChromaDB vector embeddings for context retrieval. Sentence embeddings via the all-MiniLM-L6-v2 sentence-transformers model for converting text chunks to dense vectors.

Exponential staleness decay for time-weighting retrieved chunks, with separate half-life parameters per data source type. Jaccard similarity-based deduplication to remove near-duplicate chunks before prompt construction. Positional context ordering to counteract the lost-in-the-middle attention degradation in large language models. Nine quantitative financial models implemented as callable tools: Discounted Cash Flow valuation, Dividend Discount Model, Capital Asset Pricing Model, Sharpe Ratio, parametric Value at Risk, bond pricing and Macaulay duration, Weighted Average Cost of Capital, P/E fair value with PEG ratio, and portfolio return and volatility calculation.

A three-tier evaluation framework measuring retrieval quality (relevance score, source hit rate, ticker hit rate, latency), answer quality (data grounding score, tool compliance rate, analytical completeness), and memory health (staleness distribution, insight growth over time, source diversity). An automated ingestion scheduler using APScheduler for continuous background data updates. Visualizations implemented include area charts, bar charts, radar charts, pie charts, and cumulative line charts for memory growth tracking.","m4lname":"","industry":"Finance","m3lname":"","dataset":"The system uses four public data sources accessed via API. Market data is sourced from Alpha Vantage (alphavantage.co), a public financial data API providing daily OHLCV price data for equities.

SEC filing data is sourced from the SEC EDGAR public API (data.sec.gov), which provides free access to all company filings including 10-K annual reports, 10-Q quarterly reports, and 8-K current event disclosures.

Macroeconomic indicator data is sourced from the FRED API (fred.stlouisfed.org) operated by the Federal Reserve Bank of St. Louis, providing public economic time series including CPI, GDP, Federal Funds Rate, unemployment rate, and VIX. Financial news sentiment data is sourced from the Alpha Vantage News Sentiment API, which aggregates and scores news articles from major financial publications.

The software architecture is designed to support any additional data source that can be represented as text chunks, including earnings call transcripts, analyst reports, options flow data, and alternative data feeds such as satellite imagery or credit card transaction aggregates.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Topic Modeling on US Airline Tweets to Compare Airline Marketing Strategies","timestring":"Fri Dec 16 07:50:44 2022","m1uni":"rk3148","m2lname":"","m1fname":"Raiha","m4fname":"","m1lname":"Khan","m3fname":"","description":"This paper discusses related motivations and research regarding marketing strategies in the airline industry, where existing research has widely focused on one-way customer-to-airline engagement, while the work discussed in this paper is more aligned with airline-to-customer engagement.
What distinguishes this project is it seeks to assess company engagement with potential customers. As such, airline companies can benefit from visual results this paper’s work derives to better visualize where their social media strategy focus is currently directed, in order for them to, for example:
• Justify the creation or adjustment of marketing strategy: Determine where resources supporting airlines’ marketing/social media strategies need to be strengthened or relocated, which can be seen as necessary to do before infusing customer feedback analyses into campaigns, which may need directional and financial backing before approval.
• Perform competitor-topic analyses: See where competitor airline companies lie on spectrums pertaining to similar or new/emerging topics in their tweets.
","uni":"rk3148","language":"Python; Jupyter Lab IDE, PyCharm; Plotly, Dash","pid":"202212-34","m4uni":"","analytics":"Algorithms: Topic modeling via Latent Dirichlet Allocation (LDA), performance measurement via perplexity and UMass coherence
System modules: gensim package
Analytics: Assigned human-interpretable names to topics from LDA model, assigned dominant topic to each tweet, calculated average engagement per tweet for topic-keyword analysis
Visualization: Dashboard visualizing topic contributions to airline tweets and topic keyword analysis ","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Organic tweets and retweets were collected from 85 Twitter accounts associated with airline companies headquartered worldwide using the Twitter V2 API. For a given Twitter handle corresponding to an airline (e.g., @united, the Twitter handle for United Airlines), the following steps were taken to download its tweets:
1. Get the user ID of the airline company’s Twitter account (unique ID associated with one Twitter account) using the user lookup endpoint (e.g., 260907612, the user ID for Twitter handle @united).
2. Get up to the 800 most recent tweets (based on API request caps) from the airline company’s Twitter account, where the request response includes organic tweets and retweets and excludes replies.

In total, 67,540 tweets were downloaded on December 4, 2022; non-English tweets were filtered out to leave 57,404 in total for analysis. Figure 1 shows the distribution of number of tweets (in English language only) per airline account. 74% of airline Twitter accounts returned at least 700 tweets. Generally, the Twitter V2 API allows for downloading up to the 800 most recent tweets (and retweets, which are exclusive of the 800 number) when requesting tweets without replies, so accounts for which fewer than 800 tweets were downloaded are accounts that do not have more than 800 organic tweets. As each airline company’s Twitter account returned its own tweets, each account has its own minimum and maximum tweet creation date. Overall, the earliest tweet is from March 3, 2011 (which may correspond to an airline company that has not produced more than 800 organic tweets between March 2011 and December 2022), and the latest tweet is from December 5, 2022.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Optimized and Reflective SagaLLM ","timestring":"Wed May 14 04:17:28 2025","m1uni":"yw4473","m2lname":"Han","m1fname":"Yian","m4fname":"","m1lname":"Wang","m3fname":"Jinze","description":"This research aims to build a robust, end-to-end multi-agent planning system (SagaLLM) capable of translating natural language instructions into executable task structures with minimal human intervention. Key innovations include the integration of semantic validation using GPT-4o, a rollback-feedback mechanism for error recovery, and a CoT-based template generation process. The system supports visualization, modular execution, and adaptive reasoning across complex, dynamic tasks.

Importance:
These toolkits address critical limitations of traditional LLM pipelines, particularly in feasibility assessment, risk prediction, and execution consistency. By enabling more transparent, scalable, and resilient agent coordination, SagaLLM provides a practical foundation for real-world applications in autonomous planning, scheduling, and decision-making, areas where existing monolithic models often fail under uncertainty or task disruption.","uni":"yw4473","language":"Python (Jupyter Notebook)","pid":"202505-8","m4uni":"","analytics":"Our project is a modular multi-agent framework built around the Saga pattern for task orchestration and rollback. Key components include:

Agent Framework (agent.py, saga.py, crew.py): Manages agent execution, dependencies, and group coordination.

Planning & Reasoning (react_agent.py): Implements the ReAct paradigm for decision-making.

Reflection & Tool Integration (reflection_agent.py, tool.py): Supports agent self-reflection and tool-based interaction.

Core Algorithms:

Failure Prediction (failure_predictor.py): ML-based task risk classifier (HIGH/MEDIUM/LOW).

Validation & Conflict Detection (validation.py): Checks task validity and detects agent conflicts.

Topological Sort (saga.py): Resolves execution order based on dependencies.

Task Coordination: Includes scheduling and resource allocation logic.

Visualization & Interfaces:

Dependency Graphs: Visualized via NetworkX & Matplotlib.

Frontends: Gradio and Streamlit UIs for template creation and agent monitoring.

Real-time Execution Status: Tracks agent task progress and outcomes.","m4lname":"","industry":"Information","m3lname":"Shi","dataset":"We tested our software using a combination of datasets and custom-designed tasks. The main source was the REALM Benchmark (https://arxiv.org/abs/2502.18836), which provides a suite of reasoning-oriented problems rather than traditional tabular datasets. These are task descriptions, not in typical dataset formats, and the official repository does not include ground-truth answers.
To evaluate performance, we manually created a set of judgment criteria for each task. Additionally, we designed several original tasks inspired by real-world multi-agent scenarios (e.g., interview scheduling and sports event planning with disruptions), allowing for more comprehensive and customized evaluation.
Our software is capable of supporting structured task descriptions, logic-based problem settings, and planning problems that involve agents, constraints, and dependencies.","m2uni":"jh4861","m2fname":"Jiang'ao","m3uni":"js6605"},{"projectname":"A Web App for NYC Taxi Prediction","timestring":"Sat Dec 22 04:09:36 2018","m1uni":"jh3874","m2lname":"Yang","m1fname":"Jiaxu","m4fname":"","m1lname":"Huang","m3fname":"Zhi","description":"Our project goal is building a web app to provide travelers in NYC with a more comprehensive overview of the fare and duration of different transportation upon request and this would enable them to choose the most economical and convenient way of travelling. The design of a web app that would enable end users to request and receive information about both fare and trip duration of different transportation tools such as Uber, Lyft and taxi. In addition to the web app design, to increase the accuracy of predictions on the fare and duration of taking a taxi, different machine learning models are exhausted and other datasets are used to generate additional features.
","uni":"jh3874","language":"Python for model training, building and visualization. For web development, main programming languages are python and Javascript. In terms of development platforms, we use Flask as backend, React.js as frontend, Redux as data flow manager, and Buildpacks as deployment container.","pid":"201812-21","m4uni":"","analytics":"ML Algorithms include: Xgboost, Decision Tree, Ridge Regression, Bayesian Ridge Regression and Random Forest.
For analytics: We used Numpy, Pandas, Sklearn and Spark during the data processing and model training process.
System modules: We use Flask as the backend, React.js as the frontend, Redux as the data flow manager, and Buildpacks as the deployment container.
Visualization: Folium.Map from leaflet.js to build heatmap and Matplotlib to plot for visualization process.","m4lname":"","industry":"Transportation","m3lname":"Zheng","dataset":"There are four groups of datasets used for our project. The first dataset contains id, vendor_id, pickup_datetime, dropoff_datetime, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag and trip duration information. It was downloaded from NYC Taxi and Limousine Commission (TLC) website and combined with the one from Kaggle competition of trip duration. After data cleaning process, the file size is around 200 MB.
The second dataset was downloaded from the Kaggle NYC taxi fare prediction competition and it includes columns such as pickup_datetime, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, fare_amount and passenger count with size of 5.3GB. Besides these two original datasets, weather information dataset from 2016 and holidays information dataset are also collected. The weather dataset includes information about temperature, visibility, precip, humidity, etc.","m2uni":"wy2297","m2fname":"Wei","m3uni":"zz2560"},{"projectname":"OBJECT TRACKING USING YOLO + DEEPSORT","timestring":"Fri May 12 06:47:04 2023","m1uni":"hq2197","m2lname":"","m1fname":"Haoxiang","m4fname":"","m1lname":"Qi","m3fname":"","description":"This project provides a real-time object-tracking system for autonomous driving applications, integrating the state-of-the-art YOLOv5 object detection model and the Deep SORT algorithm for feature extraction and tracking.","uni":"hq2197","language":"HTML, JavaScript, Python","pid":"202305-6","m4uni":"","analytics":"YOLO, deepSORT, MOTR","m4lname":"","industry":"Transportation","m3lname":"","dataset":"BDD100K
The BDD100K dataset is a comprehensive large-scale dataset for autonomous driving applications. It comprises over 100,000 high-quality, diverse driving video clips, each lasting 40 seconds. The dataset includes various driving scenarios, covering different weather conditions, times of day, and geographical locations. BDD100K provides annotations for multiple tasks, including object detection, instance segmentation, drivable area segmentation, and lane marking segmentation.

UA-DETRAC
The UA-DETRAC dataset is a widely-used benchmark dataset for multi-object tracking and detection in surveillance scenarios. It consists of 100 challenging video sequences with over 140,000 annotated frames. The dataset contains various object classes, such as cars, trucks, buses, and pedestrians, and provides annotations for object detection, tracking, and trajectory forecasting.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYC Airbnb Data Analysis, Visualization and Price Prediction","timestring":"Fri Dec 13 16:17:37 2019","m1uni":"ty2417","m2lname":"Zhu","m1fname":"Tian","m4fname":"","m1lname":"Yang","m3fname":"Linwei","description":"1.Organize and analyze the relationships between different features among the listings. Create a geographical heatmap color-coded by the price of listings.

2.Use fundamental statistical learning methods to predict the Airbnb price and test the accuracy of our model and try to improve it.

3.Compare the accuracies between different models. Build a web application to show the prediction results.","uni":"ty2417","language":"Python, R, js, css, html","pid":"201912-21","m4uni":"","analytics":"Used pandas, numpy, matplotlib and seaborn to process and visualize data.
Used sklearn and statsmodels to build statistical models to fit the data and predict prices, such as linear models, Ridge/Lasso/ElasticNet, Random Forest and XGBRegressor.","m4lname":"","industry":"Information","m3lname":"Gong","dataset":"Airbnb listings and metrics in NYC, NY, USA

It's a public dataset in Kaggle, including 48k rows and 16 columns.
","m2uni":"yz3628","m2fname":"Yinyao","m3uni":"lg3085"},{"projectname":"Deep Video Understanding (Event and Story Understanding)","timestring":"Fri May 6 18:13:21 2022","m1uni":"yd2616","m2lname":"Chen","m1fname":"Yifei","m4fname":"","m1lname":"Dong","m3fname":"","description":"Deep video understanding is a difficult task which requires computer vision systems to develop a deep analysis and understanding of the relationships between different entities in video, to use known information to reason about other, more hidden information, and to populate a knowledge graph with all acquired information.
The aim of the proposed task is to push the limits of multimedia analysis techniques so that long-duration videos can be analysed holistically and useful knowledge extracted to answer different kinds of queries.
","uni":"yd2616","language":"python tensorflow open-cv","pid":"202205-4","m4uni":"","analytics":"ArcFace
Ground Video Description","m4lname":"","industry":"Media","m3lname":"","dataset":"TRECVid 2022","m2uni":"zc2628","m2fname":"Zifan","m3uni":""},{"projectname":"PharMe: A Pharmaceutical Informed LLM","timestring":"Fri Dec 20 04:25:03 2024","m1uni":"imk2133","m2lname":"Harvey","m1fname":"Ishraq","m4fname":"","m1lname":"Khan","m3fname":"Berk","description":"The rapid pace of pharmaceutical innovation and regulatory updates presents an unprecedented challenge in modern healthcare delivery. Providers are tasked with providing quality care in a timely manner, while simultaneously processing mountains of reporting and administrative duties. This makes it increasingly difficult to stay on top of novel drug approvals and existing drug modifications released by the Food and Drug Administration (FDA) on a monthly basis. In 2024 alone, the Food and Drug Administration (FDA) approved 44 novel drugs and over 10,000 updates to existing drugs. With potential implications on patient care management and regulatory compliance, it is necessary to find a way to integrate novel drug information into clinical practice in a streamlined fashion. Specifically, our project focuses on biosimilars, drugs developed to replicate the effects of an existing drug after its patent or exclusivity period expires, offering an alternative treatment option at a potentially lower cost.

Having a large language model fine-tuned on the latest FDA biosimilar data is important for a variety of reasons. With AI rapidly increasing the rate at which drugs are discovered, it is only a matter of time before the regulatory process sees rapid improvements as well. This coupled with the staggeringly high cost of prescription drugs opens the market to disruption. Doctors and patients benefit from easy access to this information and providing it in an easy-to-use interface is the ultimate goal of our project. In clinical settings, LLMs can support treatment decisions by providing doctors with comprehensive, up-to-date information on biosimilar safety and efficacy.

Ultimately, an LLM such as ours reduces the burden on providers who already struggle with an overwhelming influx of data (not only from patients, but from regulatory organizations). Pharmaceutical companies also benefit as the model can help stakeholders understand the economic advantage and financial implications of developing biosimilars. And lastly, the patient benefits from increased access to cost-effective treatment options.
","uni":"imk2133","language":"Python, Colab, Airflow, HuggingFace, HTML/CSS, Google Cloud Platform","pid":"202412-21","m4uni":"","analytics":"Analytics: Jupyter Notebook, Pandas
Algorithms: GPT-neo-2.7B, bloom-3b, Medical-Llama3-8B
System modules: Apache, Google Colab, GPU (A100), CPU
Visualization: Tkinter, HTML5/CSS, Matplotlib","m4lname":"","industry":"Life Science","m3lname":"Yilmaz","dataset":"We ingest publicly available data on “all FDA-licensed (approved) biological products regulated by the Center for Drug Evaluation and Research (CDER), including licensed biosimilar and interchangeable products, and their reference products”. This data is released in monthly updates and stored on an online database referred to as the FDA’s Purple Book. By using Airflow, our team automates the collection, processing, and application of this data into our LLM finetuning process.

The FDA also provides additional information on all biologic products stored in its database. While our team was not able to leverage this data, it includes documentation such as medication labels, patient package inserts, approval letters, medical reviews, chemistry reviews, pharmacology reviews, statistical reviews, proprietary name reviews, and officer/employee lists. These documents consist of natural language text, diagrams, graphs, and other data visualizations. An extension of our project would be to ingest the additional information.
","m2uni":"dyh2111","m2fname":"Dan","m3uni":"by2385"},{"projectname":"Electric Vehicle Data Analytics&Assistant Chatbot Platform","timestring":"Sat Dec 21 00:13:04 2024","m1uni":"jd4001","m2lname":"Zhang","m1fname":"Jingzhe","m4fname":"","m1lname":"Ding","m3fname":"Wenxuan","description":"Our goal is to build a platform based on electric vehicle-related data (sales, population, price, charging stations, policies, etc.) to help users visualize and summarize trends in EV status and development, and make predictions. Users on our platform can view analytics results, input variables for relevant predictions, and interact with a chatbot on their topics of interest in electric vehicles.","uni":"jd4001","language":"Python, Html/CSS/Javascript","pid":"202412-10","m4uni":"","analytics":"1.Data Visualization: Our platform provides users with insights to relevant EV topics through visual representation,Enabling them to explore and analyze data interactively.
2.Machine Learning Model Training: We trained machine learning models on our data and apply them to make predictions on newly input data (mainly EV sales and charging distributions). Our algorithms include random forest, KD-tree, etc.
3.Chatbot: We implement a RAG approach based on an EV knowledge base to deliver context-aware and highly relevant responses by combining information retrieval with large language models.
4.Web Application: We built a user-friendly web application that integrates all of these functionalities into a cohesive interface, where users can conveniently view our visualization results, use our machine learning models to extract or predict relevant information (charging station distributions, EV selling prices), and talk to our chatbot about EV topics.","m4lname":"","industry":"Information","m3lname":"Dong","dataset":"We use public data from different sources. Our data include EV sales, charging stations, charging patterns, policies, population distributions, etc. The data is used for visualization, training machine learning models, and building the knowledge base for the LLM-based chatbot. Our software and algorithms could support most other EV-related data.
","m2uni":"tz2583","m2fname":"Tiance","m3uni":"wd2375"},{"projectname":"AI Powered Medical Chatbot","timestring":"Fri Dec 20 21:39:13 2024","m1uni":"by2373","m2lname":"Zhang","m1fname":"Botong","m4fname":"","m1lname":"Yuan","m3fname":"Xiyao","description":"Advancements in artificial intelligence (AI) have revolutionized many industries, and the healthcare sector is no exception. The AI-powered medical chatbot project aims to bridge the gap between patients and reliable medical information while supporting researchers in obtaining detailed and evidence-based insights. This essay outlines the objectives, innovations, and broader implications of this groundbreaking initiative.

The primary objective of the AI-powered medical chatbot is to develop a robust system capable of accurately understanding and responding to diverse medical queries. This goal is driven by the needs of two primary user groups: patients and researchers. For patients, the chatbot serves as an accessible platform for obtaining easy-to-understand health advice, reducing their reliance on in-person consultations for preliminary concerns. For researchers, the system offers detailed, evidence-based responses that enhance their ability to conduct medical investigations efficiently.

The chatbot is distinguished by its deployment of three specialized large language models (LLMs), each tailored to address unique challenges in medical query processing:

Yes/No Classification Model: This model is designed to handle binary medical decisions effectively. Its architecture, based on the Llama-3.2-3B framework, features 28 transformer layers, 24 attention heads, and advanced Rotary Position Encoding (RoPE) for precise query interpretation.

Long Answer Generation Model: To cater to open-ended medical queries, this model excels at producing detailed and contextually relevant answers. Enhanced with extended context handling and high precision through a robust transformer architecture, it provides in-depth responses to complex medical questions.

Direct Preference Optimization (DPO) Model: This model optimizes long-form answers to align with user preferences through reinforcement learning and feedback-driven fine-tuning. The iterative feedback loop ensures the continuous improvement of response quality.

The integration of these models into a user-friendly web interface enables real-time interaction, making the system accessible to individuals with varying technical expertise.

The importance of this research lies in its ability to democratize access to healthcare information while advancing the technical boundaries of AI applications in medicine. For patients, the chatbot reduces barriers to obtaining reliable health advice, potentially decreasing the strain on healthcare facilities and enabling more informed decision-making. Researchers benefit from a powerful tool that delivers detailed, evidence-based insights, accelerating the pace of medical innovation.

From a technical perspective, the project introduces state-of-the-art AI techniques, including RoPE and reinforcement learning, to enhance the performance and accuracy of LLMs. These innovations not only improve the chatbot’s functionality but also contribute to the broader AI research community by refining methodologies for training and optimizing language models.

Moreover, the chatbot’s impact extends to real-world healthcare outcomes. By providing patients and researchers with accurate, easily accessible medical information, it fosters a more informed and empowered user base. This alignment of technological innovation with societal needs underscores the transformative potential of AI in healthcare.","uni":"by2373","language":"Python, Gradio, GCP","pid":"202412-15","m4uni":"","analytics":"The quality of responses was measured using BERTScore, a state-of-the-art metric that evaluates semantic similarity between generated and reference answers. User feedback was analyzed iteratively to optimize the chatbot's Direct Preference Optimization (DPO) model.

Algorithms:
1. Yes/No Classification Model
This binary classification model was built on Llama-3.2-3B, incorporating Rotary Position Encoding (RoPE) to improve the understanding of query context. Its configuration included 28 transformer layers and 24 attention heads, providing high accuracy in decision-making tasks.
2. Long Answer Generation Model
Designed for open-ended queries, this model featured extended context handling and a large positional embedding capacity. The model's architecture leveraged advanced transformer configurations, enabling the generation of detailed, contextually relevant responses to complex medical questions.
3. Direct Preference Optimization (DPO) Model
The DPO model utilized reinforcement learning to optimize long-form answers based on user feedback. A feedback loop ensured continuous improvement, aligning the chatbot's responses with user preferences.

System Modules:
Backend: The core processing system incorporated the three specialized models, handling model inference and integration.
Frontend: The chatbot's user interface was designed to be intuitive and user-friendly, enabling seamless interaction. It supported real-time query input and displayed generated answers effectively, catering to users with varying technical expertise.

Visualizations:
Dataset Analysis, Performance Metrics, User Interface","m4lname":"","industry":"Life Science","m3lname":"Wang","dataset":"The datasets tested in the AI-powered medical chatbot project were PubMedQA, specifically its subsets: PQA-L (Patient Question-Answer Labeled) and PQA-A (Patient Question-Answer Artificial). PubMedQA is a publicly available dataset derived from biomedical literature, specifically designed for question-answering tasks in the medical domain. It contains question-answer pairs relevant to patient inquiries and research-based medical discussions. PQA-L (Patient Question-Answer Labeled) includes labeled question-answer pairs suitable for training and testing models designed for clear, direct question-answering tasks. PQA-A (Patient Question-Answer Artificial) involves artificially generated question-answer pairs, allowing for robust testing and fine-tuning of models to handle diverse scenarios.","m2uni":"3209","m2fname":"Shiyang","m3uni":"xw2942"},{"projectname":"Deep Learning based Translator","timestring":"Fri Dec 15 20:02:44 2023","m1uni":"sw3828","m2lname":"Zheng","m1fname":"Shiyan","m4fname":"","m1lname":"Wang","m3fname":"Tangwen","description":"The goal of this project is to develop a translation application specifically for the biomedical field. Our focus is on the application of natural language processing between English and Chinese. Through fine-tuning, we can train a reliable model in a short time using a relatively small corpus. Our goal is to efficiently keep the model updated by automatically crawling new articles, tagging and aligning text.","uni":"sw3828","language":"Python, Github, Hugging Face, Google Colab","pid":"202312-17","m4uni":"","analytics":"Analytics: Bert Score and BLEU Score
Algorithms: Seq2Seq, t5-base simple paraphrasing, back translation
System Modules: Flask","m4lname":"","industry":"Information","m3lname":"Zhu","dataset":"This corpus comprises parallel pairs of articles in English and Chinese sourced from the New England Journal of Medicine website. It includes approximately 2000 pairs of articles dating back to 2011. The purpose of this dataset was to facilitate machine translation between English and Chinese within the medical domain. ","m2uni":"sz3196","m2fname":"Shifei ","m3uni":"tz2570"},{"projectname":"Reporting Trends in the NYT","timestring":"Fri Dec 13 18:57:44 2019","m1uni":"bmm2172","m2lname":"Mandic","m1fname":"Brian","m4fname":"","m1lname":"Midei","m3fname":"","description":"Use U.S. economic indicators and dataset of over 12 million NYT Articles (metadata).

Analyze features such as keyword occurrences, sentiment analysis, and other linguistic features from NYT Article dataset.

Identify correlation relationships between economic data (GDP, inflation, employment) and news data over time.","uni":"bmm2172","language":"Python, PySpark, Javascript, ReactJS, Gatsby, Netlify, Google BigQuery","pid":"201912-1","m4uni":"","analytics":"Spark and Pandas Dataframes
Pearson Correlation
t-SNE
K-means Clustering
Linguistic Inquiry and Word Count (LIWC 2015)
Scatter and Line Plots
","m4lname":"","industry":"Media","m3lname":"","dataset":"We used the New York Times developer API to query article metadata for all months between 1851 - 2019 (all available years for the API). The API returns JSON objects, which we then stored as JSON files in a google cloud bucket. This resulted in data for approximately 12.6 million articles.","m2uni":"mm5305","m2fname":"Marko","m3uni":""},{"projectname":"TripEase - Sentiment-Aware AI Agent for Intelligent Travel Itinerary Planning","timestring":"Tue May 5 22:58:47 2026","m1uni":"rk3408","m2lname":"","m1fname":"Rupeet","m4fname":"","m1lname":"Kaur","m3fname":"","description":"Objective: TripEase is an end-to-end AI agent that converts a plain-language travel query into a globally optimised, time-scheduled, sentiment-validated NYC itinerary - and then sustains a conversational dialogue to explain, refine, and explore that itinerary without re-running the pipeline.
Innovations and Capabilities:
Anti-hallucination architecture. Every stage that uses an LLM is immediately followed by a hard whitelist validator. Every field the LLM produces — zones, categories, types, Places of Interest (POI) names, day encodings — is cross-checked against closed enumerated lists derived from the actual database. If the LLM invents a value, it is silently dropped and a warning is logged. The agent never surfaces fabricated POI names or non-existent categories to the user.
Query-conditioned ABSA, not generic sentiment. Rather than assigning a static \"good/bad\" label to a POI, ABSA.py dynamically activates a subset of 11 travel aspects (crowd, family, solo, couple, friends, value, exhibits, accessibility, staff, location, overall) based on the specific query's group type and implicit needs. Aspect weights are then upscaled per query. This means the same POI scores differently for a solo traveller versus a family.
Unified VRP — assignment and sequence solved simultaneously. Most itinerary planners do two-pass planning: first cluster POIs to days, then sequence within each day locally. Google OR-Tools' CVRPTW (Capacitated Vehicle Routing with Time Windows) is used to solve both problems in one global optimisation. The solver enforces hard constraints (opening hours as time windows, pace-derived max POIs per day, must-include POIs, day-avoidance) while soft constraints (zone balance, day load balance) enter as penalties. The result is globally optimal across all days simultaneously.
Graceful degradation under infeasibility.
RAG with self-building knowledge base: Past itineraries are embedded via Sentence-BERT (all-MiniLM-L6-v2) and indexed in FAISS. Retrieved itineraries are cross-referenced with the current ABSA scores using a composite scoring formula that weights retrieval similarity, past satisfaction score, ABSA quality, and Bayesian-corrected ratings together. This means the agent learns from its own prior successful itineraries.
Bayesian-corrected ratings. RAG uses the IMDB-style Bayesian average: (n/(n+m)) * r + (m/(n+m)) * C, where n is a POI's number of ratings, r is its mean rating, m=100 is the minimum-votes threshold, and C is the database global mean. This prevents a POI with 2 five-star reviews from outranking one with 5,000 four-star reviews.
Why it matters: The combination of constraint-grounded NLP, query-conditioned sentiment analysis, global combinatorial optimisation, and a self-improving RAG layer addresses the core failure modes of typical LLM-based travel agents: hallucinated POIs, static sentiment scores disconnected from trip context, locally optimal but globally suboptimal day plans, and no memory of what worked before.
","uni":"rk3408","language":"Language: Python 3; Hardware target: Apple Silicon M1 Pro 16GB; UI platform: Gradio Blocks","pid":"202605-17","m4uni":"","analytics":"Stage 1 — NLP Perception runs Llama-3.2-3B-Instruct-4bit via Apple MLX-LM on MPS. A whitelist-injected prompt extracts structured trip constraints as JSON, which a QueryValidator immediately hard-checks against closed database vocabularies — zones, categories, types, POI names — dropping any value the LLM fabricated. No LLM output reaches the pipeline unless it matches the actual database.
Stage 2 — Aspect-Based Sentiment Analysis runs deberta-v3-base-absa-v1.1 across eleven travel aspects. Aspect activation and scoring weights are conditioned on the query's group type and implicit needs, so the same POI scores differently for a family trip versus a solo visit. All sentence–aspect pairs across all POIs are batched into a single inference pass. A final combined score fuses ABSA output (60%) with a Bayesian-corrected Google rating (40%).
Stage 3 - RAG encodes past itineraries and the current query via all-MiniLM-L6-v2 (Sentence-BERT) into a FAISS cosine similarity index. The top-3 most similar past trips are retrieved, and their POIs are re-ranked by blending retrieval weight, ABSA score, and flag penalties.
Stage 4 — Pre-processing and Travel Matrix removes closed, non-operational, and night-only POIs, computes trip days respecting avoid-day constraints, places a depot at the centroid of preferred-zone candidates, and builds a symmetric travel time matrix via the Haversine formula at NYC average walking speed.
Stage 5 — VRP Solver uses Google OR-Tools CVRPTW to jointly solve POI-to-day assignment and within-day sequencing in one global optimisation — not two separate passes. Opening hours enforce time windows, pace controls daily POI limits (3/5/7), and must-include POIs carry a higher penalty to force inclusion. Up to four solver modes handle infeasibility gracefully. Output includes a full constraint satisfaction scorecard used downstream by the RAG system.
Stage 6 — Temporal Scheduling walks each day's ordered route forward from 9 AM, inserting exact travel gaps, a 60-minute meal break after noon, and flagging any opening-hours violations per stop. Completed itineraries are archived with a UUID, query profile, and plain-text description for future RAG embedding.
Stage 7 — Conversational Agent routes each user message to one of five handlers — itinerary generation, Q&A, POI recommendation, conflict explanation, dropped POI explanation — via an LLM classifier. All post-generation handlers read exclusively from the SessionMemory dict; no pipeline stage re-runs on follow-up questions. Conversation history is managed via a collections.deque windowed buffer.
Visualization uses Folium for interactive HTML maps with day-coloured polylines, animated route arrows, and per-POI popup cards, and Plotly for a Gantt-style timeline with type-coloured bars and hover tooltips. The Gradio Blocks UI presents these across five tabs — Map, Timeline, Visit Insights, Trip Stats, and Itinerary — alongside a live chat panel. Map files are served via Gradio's file server to avoid base64 size limits.","m4lname":"","industry":"Information","m3lname":"","dataset":"The system was built and tested on two custom-constructed JSON files: nyc_pois_database.json and nyc_pois_by_category.json. These are collections of Points of Interest in New York City, sourced from the Google Places API (New Places API v1). Each POI record contains: place ID, display name, location (lat/lon), primary type, categories, zone, business status, regular opening hours (structured by day and hour), Google rating, user rating count, and up to 5 visitor reviews (with review text).

How it was obtained: The POI records were fetched via Google Places API Nearby Search and Place Details endpoints, filtered to NYC boroughs, and curated into the two JSON schemas — a flat list and a category-keyed index for efficient lookup.
Dataset submission: Since the POI data is derived from the Google Places API (a public commercial API, not a standalone public dataset), it does not have an independent dataset entry to submit. However, the review text attached to POIs is the primary input for the ABSA model, and those reviews are public Google Maps user reviews.
What other data the system can support: The pipeline is designed to be city-agnostic. Any dataset that conforms to the same schema — a flat POI list with id, displayName.text, location.latitude/longitude, zone, primaryType, categories, regularOpeningHours.periods, rating, userRatingCount, and reviews[].text — can be dropped in. This means the system could support London, Paris, Tokyo, or any city whose POI data is collected in the same structure from Google Places, OpenStreetMap Overpass API, Foursquare Places API, or TripAdvisor export. The zone and category taxonomies in the present dataset would need to be updated to reflect the new city's geography and attraction types.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Recommendation System For Tourists In New York City","timestring":"Fri Dec 17 21:13:33 2021","m1uni":"jg4305","m2lname":"Wang","m1fname":"Jian","m4fname":"","m1lname":"Gao","m3fname":"Jiongxin","description":"Objective:

Our goal is to design a system for users to find recommendations in New York City for tourists conveniently and easily. These recommendations include restaurants, hotels, and some interesting places such as museums and parks. Users could choose their own requirements for these places. For example, restaurants have three attributes: Price (High or Low), Rating (High or Low) and Safety (set High default). And hotels have four attributes: Price (High or Low), Rating (High or Low), Safety (Set High default) and Room_Type (King, Queen, Suite and Double).

Innovations and Capabilities:

This system covers not just one kind of place, but restaurants, hotels and different kinds of interesting places. It also uses mostly real-time data, processed and extracted from APIs and tweets. We tuned the weights in the algorithm by testing the system among 10 friends. Using a recommendation algorithm we designed ourselves, the system could accurately recommend places. By utilizing IBM Watson Assistant and socket programming, we enabled easy communication between users and the system.

","uni":"jg4305","language":"Python 3 (Above) , JavaScript, SQL","pid":"202112-28","m4uni":"","analytics":"Algorithms: Linear Programming to minimize total costs of features algorithm for recommendation. Tested weights among 10 friends and found the best one. And searched recommendations online to figure out whether those places matched our requirements.

System Modules: Front End User Interface. Back End information extraction, responses generation and recommendations generation. Back End data processing and storage.

Visualization: Used JavaScript, HTML/CSS to build a chatbot User Interface for visualization.","m4lname":"","industry":"Information","m3lname":"Ye","dataset":"1. NYPD Complaint Data Historic from NYC OpenData.
2. New York City Restaurants Dataset (generated and processed by our team from the Google Maps API, stored privately on Google Cloud BigQuery)
3. New York City Hotels Dataset (generated and processed by our team from the Google Maps API, stored privately on Google Cloud BigQuery)
4. New York City Interesting Places Dataset (generated and processed by our team from the Google Maps API, stored privately on Google Cloud BigQuery)
5. Twitter Interesting Places (Generated and processed by our team from Tweepy)

Our software can also support: Stores in New York City, Bars in New York City and Hospitals in New York City etc.","m2uni":"yw3606","m2fname":"Yaning","m3uni":"jy3114"},{"projectname":"Resource-efficient Method for Inter-route Transit Slowdown Analysis and Prediction","timestring":"Fri Dec 19 20:32:22 2025","m1uni":"sg4406","m2lname":"","m1fname":"Sam","m4fname":"","m1lname":"Gilmer","m3fname":"","description":"Develop a resource-efficient framework for real-time prediction of transit network delays.","uni":"sg4406","language":"Python, PySpark, JavaScript, node.js","pid":"202512-24","m4uni":"","analytics":"","m4lname":"","industry":"Transportation","m3lname":"","dataset":"Historic data from the public MTA Bus Route Segment Speeds 2025 Beginning dataset: https://data.ny.gov/Transportation/MTA-Bus-Route-Segment-Speeds-Beginning-2025/kufs-yh3x/about_data

Realtime data from the MTA GTFS-RT endpoints: https://bt.mta.info/wiki/Developers/GTFSRt","m2uni":"","m2fname":"","m3uni":""},{"projectname":"NYC Parking Ticket Analysis","timestring":"Fri Dec 23 21:47:01 2022","m1uni":"ty2484","m2lname":"Jonasson","m1fname":"Taolue","m4fname":"","m1lname":"Yang","m3fname":"Brian","description":"The goal of this project is to provide both analysis of the parking enforcement behavior and prediction of behavior to be wary of when going to any location in NYC. While other projects have worked with the same dataset, all of the previous projects have mapped raw data variables against each other, providing no meaningful insight to behavioral patterns or prediction capability. This project aims to provide both in the form of time-series visualizations that can be catered to a specific need as well as predictions on future incidents given a location and time. ","uni":"by2344","language":"Python, Pyspark, Flask","pid":"202212-21","m4uni":"","analytics":"K-Means Clustering, Label Visualization in 3D, PCA, Random Forest, Decision Tree, Naive Bayesian, KNN, Linear Regression, GeoPandas","m4lname":"","industry":"Transportation","m3lname":"Yao","dataset":"This project uses data from NYC Open Data for raw parking violation datapoints. ","m2uni":"gmj2122","m2fname":"Gudmundur ","m3uni":"by2344"},{"projectname":"Autonomous Learning","timestring":"Fri Apr 23 18:15:58 2021","m1uni":"pc2939","m2lname":"Su","m1fname":"Pin-Chun","m4fname":"","m1lname":"Chen","m3fname":"","description":"This project explores a way to have a system autonomously learn a task given a \"brain\" and a \"world.\" We used Markov Brains, which can be encoded as a genome, to learn to play Pacman by genetically evolving them over generations. Using MABE (Modular Agent-Based Evolution) platform, we can easily analyze the performance of brains with different settings. 
The approach of this project is to simulate real evolution so that agents learn tasks without much human interference. We hope our project provides some insight into which tasks are suitable for learning with this method.","uni":"pc2939","language":"C++ / Mac OS","pid":"202105-2","m4uni":"","analytics":"Analytics are provided by modules in the MABE framework. Visualizations are done in Excel using line graphs and bar graphs.","m4lname":"","industry":"Information","m3lname":"","dataset":"The dataset for this project is essentially the \"world\" the agent lives in, and in this case it's the Pacman game. The source code of the game is from https://github.com/daleharvey/pacman.","m2uni":"cs4026","m2fname":"Alex","m3uni":""},{"projectname":"Automated Competence Intelligence for Amazon Products Using Prompt-based Learning with Customer Reviews","timestring":"Thu May 12 04:17:21 2022","m1uni":"zl2889","m2lname":"Wu","m1fname":"Gary","m4fname":"","m1lname":"Liu","m3fname":"","description":"Competitor analysis requires comprehensive information gathering and data analysis over historical data. However, with the rapid development of eCommerce, conducting a competitor analysis is becoming increasingly difficult for human agents. Because of the volume, variety, and velocity of data, collecting information is time-consuming and laborious. In this project, Amazon merchants will have access to a competitor intelligence tool based on prompt-based learning with customers’ reviews. This tool will allow them to compare their products against their competitors' through a multidimensional graph as well as data-driven business insights.","uni":"zl2889","language":"Python and Jupyter Notebook with PyTorch","pid":"202205-17","m4uni":"","analytics":"Prompt-Based Learning, RoBERTa, LDA, W2V and Logistic Regression.","m4lname":"","industry":"Retail","m3lname":"","dataset":"We used Amazon review data from
http://deepyeti.ucsd.edu/jianmo/amazon/index.html to perform some of the experiments.
In this project, a web scraper is built with the selenium and beautifulsoup4 packages. Selenium can simulate mouse actions to scroll through a page, click on it, and extract information. Using Selenium, we extract product information including the product name, current price, general rating, number of ratings, product link, ASIN, etc.
","m2uni":"sw3607","m2fname":"Siyu","m3uni":""},{"projectname":"Identifying and Analysing Global Migration Trends","timestring":"Thu Dec 15 04:05:26 2022","m1uni":"mk4652","m2lname":"","m1fname":"Madeleine","m4fname":"","m1lname":"Kearns","m3fname":"","description":"Identifying migration patterns is of ongoing importance globally, both to governments and to global organisations. There is disagreement about the best methods to do this, and as the dataset is so large there are many ways the data can be visualised. Therefore, my work is building on an existing challenge by proposing a new method for a highly customizable program which can take a user input to analyse migration in that year. In the preliminary research I conducted I did not find any similar approaches in the literature on this subject, so this work tackles an existing challenge using a novel approach

The original goal was to conduct research into the causes and patterns of large-scale migration globally. However, the project was reformulated to focus on researching patterns and visualising them as it was becoming too focused on machine learning when looking to identify causes.

","uni":"mk4652","language":"Python, HTML, JavaScript, D3.js","pid":"202212-37","m4uni":"","analytics":"The programs used query-based visualisation, where the user could define the year, number of countries, or country to analyse. Geospatial visual analysis models were used to high effect to reduce the information overload caused by the high volume of data even when looking at only one country to reveal the useful information. Overall, the product has used bar plots, line graphs, choropleths, and connection analysis graphs. Geospatial connection maps were used for migration pattern analysis and choropleths were used for comparative analysis. This effectively reduced visual clutter on the visualisations and improved user cognition. This was furthered by only analysing a user-defined number of countries in the connection maps.

","m4lname":"","industry":"Social Science-Government","m3lname":"","dataset":"World Bank - Country refugee population by country/territory of origin/asylum
https://data.worldbank.org/indicator/

United Nations HCR - Migrant journeys with country of origin and destination
https://www.unhcr.org/en-us/data.html

Geographic Names Server - Sanctioned by the US Board on Geographic Names (BGN). Contains country names and locations – useful for plotting migrant journeys
https://geonames.nga.mil/geonames/GNSHome/

These datasets were combined and processed in a way easily applied to future updated versions of the same datasets. The main data, from the UNHCR and World Bank, is updated yearly.
International Organization for Standardization (ISO) - Standardised country codes, also used for plotting
https://www.iso.org/obp/ui/#search
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"M(iche)Langelo: Analysis on AI-generated Art","timestring":"Fri Dec 19 06:23:53 2025","m1uni":"aps2249","m2lname":"Yoon","m1fname":"Abhitay","m4fname":"","m1lname":"Shinde","m3fname":"Christine","description":"M(iche)Langelo: Analysis on AI-generated Art is an end-to-end system designed to detect, analyze, and evaluate AI-generated images in real-world settings. The project focuses on distinguishing authentic human-created artwork from images generated by modern text-to-image models, while also identifying the source model when possible. In addition to classification, the system provides interpretable outputs through artistic style predictions and natural language image captions, enabling transparency in AI-generated content analysis.

A core innovation of this work is the integration of multiple state-of-the-art multimodal models into a unified and automated pipeline. The system combines CLIP for artistic style recognition, BLIP for semantic image captioning, and SuSy for AI-authenticity detection. Unlike traditional approaches that rely on static benchmark datasets, this pipeline operates on continuously collected real-world data from Reddit, allowing ongoing evaluation as generative models and visual styles evolve. The project further introduces a domain-specific transfer learning approach that adapts the SuSy detector into a simplified three-class classifier focused on authentic images, MidJourney outputs, and DALL·E 3 outputs. This significantly improves detection performance on contemporary AI-generated images and demonstrates the importance of adapting detectors to realistic data distributions.

By emphasizing explainability, continuous evaluation, and real-world deployment, the project addresses critical challenges in content provenance, trust in visual media, and the limitations of existing AI-detection systems. The toolkit is applicable to research, platform moderation, and policy-driven analysis of generative AI systems.","uni":"aps2249","language":"The system is implemented in Python and leverages modern machine learning and data engineering platforms. Model training and inference are performed using PyTorch and the Hugging Face ecosystem, including Transformers and Diffusers. Apache Airflow is used for automated data collection, while Google BigQuery serves as the primary storage and analytics platform. The toolkit supports execution on CPU and GPU environments and is distributed as an open-source project through GitHub.","pid":"202512-10","m4uni":"","analytics":"The analytics pipeline integrates AI-generated image detection, multimodal interpretation, and continuous performance evaluation. SuSy is used as the core detection model, operating in both its original multi-class configuration and a fine-tuned three-class setting achieved through transfer learning. Artistic style analysis is performed using CLIP in a zero-shot classification framework, producing ranked style predictions with confidence scores. Image captioning is handled by BLIP, providing semantic descriptions that support interpretability and downstream tasks.

The system also includes an optional image restyling module based on Stable Diffusion XL, which uses generated captions and prompt engineering to transform images into different artistic styles. Evaluation is performed continuously by comparing predictions against ground truth labels derived from data sources, with results stored in analytics-ready tables and exportable for further analysis. This design enables ongoing study of detector robustness and performance in evolving real-world conditions.","m4lname":"","industry":"Media","m3lname":"Okubo","dataset":"The project evaluates its models using publicly available image data collected from Reddit and curated datasets used for training. Real-world evaluation data is sourced from r/dalle2, r/midjourney, r/aiArt, and r/Art, representing both AI-generated and authentic human-created artwork. Images are collected automatically using the Reddit API, filtered to include only valid image posts, and labeled using subreddit provenance as ground truth. Metadata such as image URLs, timestamps, and source subreddits are stored in Google BigQuery for scalable analysis.

For transfer learning, the project uses a balanced dataset of approximately one thousand images per class. Authentic images are drawn from WikiArt, while AI-generated images are sourced from Reddit communities associated with MidJourney and DALL·E. This combination allows the detector to better reflect real-world generative image characteristics. The software supports additional datasets in standard image formats, enabling extension to other generative models or private image collections.","m2uni":"gy2354","m2fname":"Grace","m3uni":"cco2138"},{"projectname":"","timestring":"Tue Jul 28 19:31:45 2026","m1uni":"","m2lname":"","m1fname":"","m4fname":"","m1lname":"","m3fname":"","description":"","uni":"","language":"","pid":"","m4uni":"","analytics":"","m4lname":"","industry":"","m3lname":"","dataset":"","m2uni":"","m2fname":"","m3uni":""},{"projectname":"House Price Prediction","timestring":"Mon Dec 19 15:49:44 2022","m1uni":"yg2537","m2lname":"","m1fname":"Yue","m4fname":"","m1lname":"Gu","m3fname":"","description":"In this project, the purpose is to understand the features of relationships in real estate markets. The approaches include loading the data from the source, processing the data and data cleaning, performing the data virtualization, and applying the machine models to the prepared dataset. The
data set used in this project covers houses sold in Washington State. I applied two models, ‘Gradient Boosting Regressor’ and ‘Linear Regressor’. The key result of this project is that the most important feature influencing the house price is the living area (sqft) of the house. The details of this project are described in the following paper.

In recent decades, the housing market has always been a hot topic in economics. People trade real estate not only to live in but also as a business. However, during the COVID period, the entire market has been full of risk. People are looking for a more efficient prediction model for housing prices to reduce that risk as much as possible. Meanwhile, despite the improving housing market, the country's overall housing supply continues to be constrained. Many people who bought homes during the past few years are still staying put, which has kept prices from falling further. For these reasons, I think house price prediction is a good topic and a project in which to practice the skills I have learned in EECS 6893 Big Data Analytics.
In EECS 6893, I learned Big Data platforms, including Hadoop, Spark, Cloud Storage, and HDFS; Big Data analytics algorithms, such as Pearson correlation similarity, Spark clustering, and Spark ML classification and regression; and data visualization and graph databases, using D3.
","uni":"yg2537","language":"Python, D3, HTML, Spark","pid":"202212-28","m4uni":"","analytics":"Sklearn Data modeling, Spark Data Modeling, Air WorkFlow, D3 visualization","m4lname":"","industry":"Finance","m3lname":"","dataset":"The house price dataset I used was downloaded from Kaggle. It describes the characteristics and sale prices of houses sold in Washington State. The dataset has around 5000 rows and 18 columns. It has categorical data and numerical data, with price as the target.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Semantic Alignment and LLM-Guided Detection for Chest X-Ray Understanding","timestring":"Sat Dec 20 01:16:13 2025","m1uni":"zc2799","m2lname":"Zhao","m1fname":"Zeqi","m4fname":"","m1lname":"Chen","m3fname":"","description":"The goal of this project is to develop a multimodal pathology detection system for chest X-ray images that integrates both visual information and radiology reports. Our objectives are to improve the accuracy of detecting clinically relevant pathologies, particularly small or subtle lesions that are challenging for image-only models. The key innovation is combining a YOLO-based detection backbone with contrastive pre-training between text and image modalities, enabling the model to align radiology findings with their visual manifestations. This capability allows more precise localization and a deeper understanding of clinical features. Such research is important as it demonstrates the potential of multimodal AI in enhancing diagnostic support and advancing medical imaging for clinical applications.","uni":"zc2799","language":"Implemented in Python using PyTorch, with ultralytics-based YOLO for image encoding and LLaMA-2-7B from Meta for text encoding. 
Experiments run on Linux with NVIDIA GPUs.","pid":"202512-6","m4uni":"","analytics":"Key algorithms include: vision-only YOLO for baseline detection, LLaMA-2-7B text encoder with learnable context prompts, and cross-modal contrastive pretraining to align ROI image patches with corresponding radiology report sentences. The BoxTextMatcher module fuses image and text features via cross-attention, followed by self-attention and a feed-forward network. The text-guided YOLO detector predicts bounding boxes and pathology labels. The visualization module overlays the predicted bounding box and the actual bounding box onto the X-ray image.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"We used the publicly available PadChest-GR dataset, containing 4,555 frontal chest X-ray studies with bilingual (English/Spanish) radiology reports and pathology-level bounding box annotations. The dataset was collected from the University Hospital Sant Joan d’Alacant, Spain (2009–2017) and obtained from the official BIMCV repository. For our project, we preprocessed it by selecting frontal images, filtering low-quality and pediatric cases, converting annotations to YOLO format, and keeping six clinically relevant pathologies for multimodal training and evaluation.","m2uni":"qz2541","m2fname":"Qiaochu","m3uni":""},{"projectname":"Machine Learning-Assisted Design of Novel EGFR Inhibitors for Lung Cancer","timestring":"Wed May 13 03:34:22 2026","m1uni":"ss7654","m2lname":"Cheng","m1fname":"Saha Dev","m4fname":"","m1lname":"Shanmugam","m3fname":"","description":"The objective of this project is to build an end-to-end AI workflow for EGFR inhibitor discovery that goes beyond molecule generation and supports practical candidate triage. The system generates candidate molecules, predicts potency (IC50/Kd), ranks candidates with medicinal chemistry constraints, and explains why each candidate is prioritized. 
Its key innovation is the integration of generation, tiered ranking, explainability, and reference-drug comparison in one toolkit. In practice, this is important because early discovery teams do not need only “more molecules”; they need transparent, defensible shortlists that balance predicted efficacy, chemical feasibility, and risk signals before moving to experiments.","uni":"ss7654","language":"Python, TypeScript, JavaScript, CSS, YAML, FastAPI, Next.js, React, RDKit, PyTorch, Pandas, NumPy","pid":"202605-04","m4uni":"","analytics":"The implemented analytics and algorithms include potency prediction (predicted IC50/Kd with pIC50/pKd ranking transforms), hard-gate medicinal chemistry checks (including Lipinski-related constraints), developability scoring, diversity shortlist construction, SA score estimation, and PAINS-style structural alert annotation. The system modules include data ingest/merge, model prediction, molecule generation, tiered ranking with explanation fields, structural-alert screening, and comparison APIs for benchmark and pairwise analysis. The visualization layer includes live generation tiles, candidate card lists with ranking metadata, 3D molecule viewing, benchmark bar charts for reference-vs-generated potency, Morgan-Tanimoto similarity strips, and detailed side-by-side candidate-versus-reference metric tables with structured ranking breakdowns.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The project is tested on EGFR-related bioactivity data from public sources, primarily ChEMBL and BindingDB, which are ingested and merged into pipeline inputs and then curated for model use. In the current workflow, these data support affinity modeling for IC50/Kd and downstream candidate ranking and comparison. 
The public dataset description can be summarized as “EGFR-targeted small-molecule bioactivity records from ChEMBL and BindingDB, curated and merged for affinity prediction and candidate prioritization.” Beyond this exact set, the software can support other datasets as long as they provide valid molecular structures (e.g., SMILES) and compatible activity labels (such as IC50/Kd in consistent units) through the same CSV/config-driven pipeline pattern.","m2uni":"hc3645","m2fname":"Eric","m3uni":""},{"projectname":"Deep Learning Methods for Art Generation","timestring":"Fri May 13 21:09:02 2022","m1uni":"ww2614","m2lname":"Zhu","m1fname":"Wu","m4fname":"","m1lname":"Wei","m3fname":"","description":"Due to the increasing availability of visual art collections, advanced deep learning methods, and computer vision tools, artificial intelligence is getting more opportunities to help the art community analyze and understand visual arts. The growing number of generative methods for image synthesis motivated us to explore and examine the potential for AI to understand and appreciate art. Our project experiments with several advanced generative models, including AC-EB GAN, W-GAN, and the combination of VQ-GAN and CLIP, for generating art images in two ways: either using images with the same genre or using the text prompt in the format of \"a (feeling) (genre)\". In addition, we built a web application for both art image classification and generation so that users can get AI-generated art images with a specific genre.","uni":"ww2614","language":"Python (TensorFlow, PyTorch, Flask), Javascript (ReactJS)","pid":"202205-5","m4uni":"","analytics":"Web Application: Backend (Python Flask), Frontend (ReactJS)
Model: AC-EB GAN, AC-W GAN, CLIP, VQ-GAN
Server: Google Cloud Platform, Amazon Web Services
","m4lname":"","industry":"Information","m3lname":"","dataset":"We use the WikiArt dataset. WikiArt is one of the largest online art collections of digitized paintings available (i.e., approximately 170,000 artworks). The paintings were obtained from the wikiart.org website. The dataset includes a large variety of artworks, such as paintings, sculptures, sketches, posters, etc. It also provides 14 genres ranging from different periods and series.","m2uni":"jz2969","m2fname":"Lynn","m3uni":""},{"projectname":"Generating Podcast Summaries with State-of-the-Art Language Models","timestring":"Sat Dec 18 21:41:58 2021","m1uni":"sy2953","m2lname":"","m1fname":"Siyu","m4fname":"","m1lname":"Yang","m3fname":"","description":"For this project, I will try to leverage a podcast dataset to create a model that meaningfully shortens the corpus into a subset that represents the most important or relevant information in the original content, which serves as the summary of the episode.
","uni":"sy2953","language":"Python, GCP, HuggingFace","pid":"20211256","m4uni":"","analytics":"BERT, BART, GPT-2","m4lname":"","industry":"Media","m3lname":"","dataset":"A podcast RSS feed is what allows users to subscribe to that podcast in order to listen to it without visiting the exact website where it is located. It also updates subscribers when new episodes are uploaded.
For this project, I built an RSS parser that takes the RSS feed provided by a podcast, parses the file, finds the .mp3 files, and saves them to Google Cloud Storage.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Detection of Traffic Participants in Complex Scenarios","timestring":"Thu May 14 03:02:15 2020","m1uni":"gy2278","m2lname":"Hu","m1fname":"Guanhua","m4fname":"","m1lname":"Yu","m3fname":"","description":"Our goal is to set up a traffic detection system for complex scenarios that runs on mobile devices and uses machine learning to detect cars, pedestrians, traffic signs, etc.
Traffic detection is a meaningful and useful class of object detection task with many applications, such as autonomous driving, traffic condition detection, traffic flow statistics, traffic accident scene restoration, and traffic law violation detection.

","uni":"gy2278","language":"C++, Python, Java, Swift","pid":"202005-4","m4uni":"","analytics":"We improved the tiny-YOLO network based on our dataset. We set up the Darknet framework, trained the whole model on the BDD100K dataset, and obtained a model and weights for mobile app development on both Android and iOS.","m4lname":"","industry":"Information","m3lname":"","dataset":"Berkeley Deep Drive","m2uni":"yh3214","m2fname":"Yunxiao","m3uni":""},{"projectname":"GeoLDM Fine-tuning for Conditional Ionizable Lipid Generation","timestring":"Wed May 14 04:04:32 2025","m1uni":"imk2133","m2lname":"Aiyengar","m1fname":"Ishraq","m4fname":"","m1lname":"Khan","m3fname":"","description":"The development of effective RNA-based therapeutics, such as mRNA vaccines and gene-editing technologies, hinges on efficient delivery systems. Lipid nanoparticles (LNPs) have emerged as the most successful non-viral vectors for this purpose. However, designing high-performing LNPs is a slow and experimentally expensive process. Our project aims to address this bottleneck by leveraging deep generative models for 3D molecular structures (GeoLDM) and prediction models for delivery performance (TransLNP), enabling in silico design and evaluation of novel LNPs.

","uni":"imk2133","language":"Python, Google Cloud Platform","pid":"202505-3","m4uni":"","analytics":"We present a pipeline for the conditional generation of ionizable lipid nanoparticles using a fine-tuned version of GeoLDM, a latent diffusion model for 3D molecular geometry generation. Our two-stage approach first adapts GeoLDM to the lipid chemical space using a large set of unlabeled lipid structures, then fine-tunes it on a labeled dataset with experimentally measured transfection efficiencies. The generated molecules are evaluated using TransLNP, a transformer-based model trained to predict transfection efficiency. Our work integrates state-of-the-art generative and predictive models in a novel application for mRNA drug delivery, although development challenges hindered the full completion of the generation-evaluation loop.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The dataset used came from the TransLNP Github repo which included publicly sourced molecular structures originally compiled by the Uni-Mol framework. TransLNP supports input formats including SMILES strings, atomic coordinates, and molecular graph representations, making it compatible with a wide range of molecular property prediction tasks beyond LNP screening.","m2uni":"aa5479","m2fname":"Aniruddh","m3uni":""},{"projectname":"Emotion Pattern on Art","timestring":"Fri Dec 20 06:04:42 2019","m1uni":"cz2517","m2lname":"Chen","m1fname":"Chi","m4fname":"","m1lname":"Zhang","m3fname":"","description":"Current computer vision techniques enable us to extract low-level information from an image, such as texture and color. However, when it comes to high-level feature, like emotion, an abstract and crucial human feeling, is really challenging to define them. We think it is fascinating to map the general human emotions to low level features in art paintings. Understanding the emotion behind an art painting has many potential benefits. 
It could help us better understand the relationship between history and art. Another application would be revealing artists’ preferences for certain emotion types. We trained a support vector regression model to score the paintings of famous artists. The input to our machine learning model is the low-level features extracted from images.","uni":"cz2517","language":"Python, JavaScript, HTML, CSS","pid":"201912-31","m4uni":"","analytics":"We implemented Support Vector Regression Model, K-Fold Cross validation, image low-level feature extraction and D3 data visualization. ","m4lname":"","industry":"Media","m3lname":"","dataset":"We used two datasets: WikiArts and Best Artwork of All Time. ","m2uni":"yc3523","m2fname":"Yixin","m3uni":""},{"projectname":"Stock Performance Prediction and Recommendation with Deep Learning Analysis","timestring":"Sat May 7 03:35:11 2022","m1uni":"ws2591","m2lname":"Chen","m1fname":"Wannuo","m4fname":"","m1lname":"Sun","m3fname":"","description":"Stock trading is a popular investment option among today's investors. The stock market, on the other hand, is notoriously difficult to anticipate, especially when it follows a random walk. As we enter the digital era, the quantity and quality of data give us a great opportunity to investigate the underlying factors of complex phenomena, particularly in the financial market. In this project, we aim to build an application that uses cutting-edge deep learning models to comprehensively analyze stocks and forecast stock prices from both short-term and long-term views for a certain number of companies. With well-designed AI analysis, it could serve as a powerful tool for investors to make their investment decisions.","uni":"ws2591","language":"We mainly use Python as our programming language and use Dash to build the dashboard webpage.","pid":"202205-16","m4uni":"","analytics":"- For historical stock data, we used the finTA package to calculate the technical indicators
- For Twitter sentiment analysis, we built a VADER model to compute the sentiment score
- For model building, we applied the sklearn package to build KNN, SVM, Random Forest, Naive Bayes and Logistic Regression models, and applied keras to build the LSTM and TCN models.
- We applied Dash to build our web-based dashboard, and used plotly to visualize the results.","m4lname":"","industry":"Finance","m3lname":"","dataset":"1. Historical stock data from Yahoo! Finance:
- extract the data with yfinance API
- get the stock price and volumes since 2018-01-01
- choose the following 12 companies: Google, Meta, Amazon, Apple, Microsoft, IBM, Dell, Intel, Tencent, Cisco, Sony, HP

2. Twitter:
- extract the data with sntwitter.TwitterSearchScraper API
- get 60 tweets per day since 2018-01-01 for each company stock

Our software can support any stock price and news as the training dataset.","m2uni":"jc5657","m2fname":"Jiaqing","m3uni":""},{"projectname":"Sentiment Prediction of Game Reviews","timestring":"Sat Dec 18 01:58:21 2021","m1uni":"gl2701","m2lname":"Liu","m1fname":"Gaoge","m4fname":"","m1lname":"Liu","m3fname":"Haichao","description":"The goal of our project is to offer game lovers a sentiment analysis tool that automatically transforms unlabeled game reviews into a sentiment preference (recommend or not). As far as we know, there is no such sentiment analysis tool in the video gaming community. As game lovers, we believe our program will not only help others decide whether to purchase a game, which saves time and money, but also help them form reasonable expectations of new games whose scores may not yet be available but whose reviews are available on social media (Twitter or forums). In terms of business value, our model could be a reference for video game studios. By identifying the sentiment of comments from a platform (for example, Twitter or forums), studios could receive more realistic
feedback on their games and polish them to attract more players.","uni":"gl2701","language":"Python","pid":"202112-20","m4uni":"","analytics":"For visualization, we use matplotlib.
For data processing in analytics, we use Spark operations because of their ability to process large-scale data.
For feature extraction, we chose 4 typical algorithms spanning statistical approaches, semantic
approaches, and deep learning. TF-IDF and Hashing TF are purely statistical measures that evaluate the relation
between words and documents based on term frequency. Secondly,
we implemented an Enhanced feature, a semantic approach combining N-grams with a lexicon:
it uses a pre-prepared sentiment lexicon to score a document by aggregating
the sentiment scores of all the words in the document. Finally, we applied deep learning methods,
which have become popular in recent years. word2vec and BERT use techniques such as
neural networks, CBOW, self-attention, and multi-head attention to obtain representations
for all words, which can easily be passed to downstream tasks like sentiment analysis.

Also, we chose 4 typical classifiers: logistic regression, naive Bayes, random forest, and neural networks.
There are many classifiers in research and industry, but in our task we are more interested in
feature extraction, since we want to mine the semantic patterns of game reviews
and give feedback to users in our later work. Also, once we obtain good feature representations,
it's easy to apply them to any fancy classifiers, so these 4 classifiers are enough for our experiment.","m4lname":"","industry":"Media","m3lname":"Yi","dataset":"The dataset processed in this project is fetched from Steam Web API.

The input of the software is a sentence.","m2uni":"zl2986","m2fname":"Zihao","m3uni":"hy2664"},{"projectname":"Get Your Entertainment Recommendations Based on Your Selfie","timestring":"Fri Dec 13 18:46:49 2019","m1uni":"xd2212","m2lname":"Wang","m1fname":"Xiangzhuo","m4fname":"","m1lname":"Ding","m3fname":"Shanhen","description":"We are trying to build a web application that personalizes media recommendations as accurately as possible using only facial sentiment data. Recommender systems typically rely on implicit feedback from user-item interactions for the purposes of developing a preference profile for users. Our project aims to approach the problem of generating feedback in a very novel way.

We developed a method of collecting feedback from users in the form of facial images from the user in the process of consuming video items. To get a representation of users' liking of an item using their facial expression, we used a variation of the FaceNet neural network from a paper published by Google in 2018. Using the same dataset as the original authors, and by training a similar neural network architecture on GCP, we developed a 16-dimensional vector space embedding of a user's facial expression. In order to use this embedding for recommendations, we map this 16-dimensional representation to a scalar representing whether the user likes or dislikes the item. The mapping is done by taking the proximity of the user's expression to expressions which we consider to represent enjoyment or interest in an item.

Now our system can generate different recommendations for different users. The data server can run both locally or on GCP with high stability and speed. With more user data collected, our system will become smarter and generate better recommends.","uni":"xd2212","language":"Python/Django/Spark/PyTorch/OpenCV/Javascript/HTML/CSS","pid":"201912-39","m4uni":"","analytics":"We used four different Deep Neural Networks and a Collaborative Filtering to construct the system. For face detection and analysis, CNN based Networks were applied here. The output wiil be sent to Web App where a Collaborative Filtering will generate recommends and update itself using user data.

We relied on GCP for training our CNN. We used the SparkML library's implementation of the alternating least squares algorithm to generate item recommendations for users. We tried both TensorFlow and PyTorch for facial identification and expression recognition, and decided to keep the PyTorch implementation.

The website is developed by Django and we have merged the models into the whole web app for functional support.","m4lname":"","industry":"Media","m3lname":"Mirzoyan","dataset":"Two datasets were used for this project. For emotion classification, we used the dataset from kaggle: https://www.kaggle.com/c/facial-keypoints-detector/data. For FEC model we got data from google: https://ai.google/tools/datasets/google-facial-expression/

The second dataset was over 6 GB in size and each individual image was hosted as a different web url. We wrote a script to download the images over the course of 48 hours.","m2uni":"qw2261","m2fname":"Qi","m3uni":"sm4775"},{"projectname":"Analyzing Correlation Between Public Sentiment on Tech Companies vs. Nasdaq-100 index","timestring":"Sat Dec 18 01:00:51 2021","m1uni":"bm3024","m2lname":"Chang","m1fname":"Brian","m4fname":"","m1lname":"Mao","m3fname":"","description":"Objectives - Our goal is to analyze the correlation between public sentiment on tech companies vs. Nasdaq-100 index.
Innovations - we are trying to see whether the public's sentiment towards big tech companies influences their stocks on a daily basis.
Capabilities - our tools can stream live Twitter data and live stock market data and save them to Google BigQuery. Our model can output public sentiment scores using an LSTM algorithm, as well as a regression between Twitter sentiment scores and stock market performance on a daily basis.
Why this research matters - it enables investors to make sell/buy decisions based on our prediction","uni":"bm3024","language":"Python","pid":"202112-48","m4uni":"","analytics":"Analytics: Analytics - sentiment analysis for public tweets, regression model to see correlations between public sentiment scores towards tech companies vs their stock price.
Algorithms - Support vector classifiers, Random forest, Logistic Regression, XGBoost Classifier, Simple Recurrent Neural Network (RNN), Long short-term memory (LSTM), Gated recurrent units (GRU)
System modules - Airflow to schedule and stream data, Python file to run ML algorithms and analysis
visualization - visualization for model evaluation and regression output","m4lname":"","industry":"Finance","m3lname":"","dataset":"Dataset:
1. Streamed Twitter data using Tweepy
2. Streamed Stock Index data using yfinance
3. Static tweets data with sentiment labels from NLTK- nltk.download('twitter_samples')","m2uni":"sc4755","m2fname":"Vincent ","m3uni":""},{"projectname":"AI Trader: Stock Price Prediction with Financial News Sentiment","timestring":"Sat Dec 20 03:59:10 2025","m1uni":"yh3881","m2lname":"Espinoza","m1fname":"Yaxuan","m4fname":"","m1lname":"Hu","m3fname":"jiexin","description":"The objective of this project is to design an end-to-end big data machine learning pipeline for stock price prediction by integrating historical market data with financial news sentiment. Traditional financial forecasting models rely primarily on numerical time-series data and often fail to capture abrupt market movements driven by external information such as breaking news or market sentiment shifts.

This project introduces a sentiment-aware forecasting framework that combines structured price data with unstructured textual data extracted from financial news. By incorporating sentiment signals as exogenous features, the system aims to improve predictive performance, particularly during periods of high information volatility.

The project emphasizes automation, scalability, and reproducibility. The pipeline supports multiple deep learning architectures under a unified experimental setup, enabling systematic comparison between price-only and sentiment-enhanced models. This research is important because it demonstrates how big data techniques and multimodal learning can be applied to real-world financial decision-making problems, bridging the gap between academic models and practical financial analytics systems.
","uni":"yh3881","language":"Python","pid":"202512-21","m4uni":"","analytics":"The system implements multiple analytics and machine learning components within a modular pipeline. Feature engineering includes lagged returns, rolling statistics, and aggregated daily sentiment features derived from financial news.

Several predictive models are evaluated, including baseline regression models, Long Short-Term Memory (LSTM) networks, Gated Recurrent Unit (GRU) networks, and hybrid architectures incorporating convolutional layers and attention mechanisms. Each model is trained under two configurations: using price-only features and using combined price and sentiment features.

Model performance is evaluated using standard regression metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), as well as directional accuracy. Visualization modules compare predicted price trajectories against ground truth values and provide comparative plots across different model configurations to support quantitative and qualitative analysis.
","m4lname":"","industry":"Finance","m3lname":"chen","dataset":"The project uses two primary types of datasets: historical stock market data and financial news data. Market data include daily open, high, low, close, and trading volume information for selected publicly traded stocks over a multi-year period. These data were obtained from publicly available financial data sources.

Financial news data consist of news articles related to the same stocks, including publication timestamps and sentiment-related metadata. News sentiment is aggregated at a daily level and temporally aligned with trading days to ensure consistency with market data.

All datasets used in this project are publicly accessible. The pipeline is designed to support additional assets, longer time horizons, and alternative data sources, such as social media sentiment, macroeconomic indicators, or intraday price data, with minimal modification. This flexibility allows the system to be extended to broader financial forecasting and analytics tasks.
","m2uni":"vne2102","m2fname":"Valeria","m3uni":"jc6630"},{"projectname":"Machine Learning Cellular Boundaries from Spatial Transcriptomics Data","timestring":"Tue May 5 19:28:52 2026","m1uni":"pmt2117","m2lname":"Chen Wu","m1fname":"Peter","m4fname":"","m1lname":"Tzelios","m3fname":"","description":"Our goal was to develop a pipeline for assigning high-resolution spatial transcriptomics expression bins to individual cell nuclei and then determining the cell type of each reconstructed cell. This is an important problem because technologies such as Visium HD provide spatially resolved gene expression at very fine resolution, but the measured bins do not directly correspond to individual cells. Accurately mapping expression bins to nuclei is therefore a critical step for reconstructing cell-level expression profiles.
Our key innovation is that the model uses both gene expression similarity and spatial positional information to assign bins to nuclei. Many simpler approaches rely primarily on spatial proximity or segmentation alone, while our approach trains a neural network to determine whether a given expression bin belongs to a candidate nearby nucleus using both transcriptomic and spatial features.
The pipeline uses Xenium data as ground truth, where cell identities are known, to train a multilayer perceptron model that predicts whether a bin–nucleus pair belongs to the same cell. The trained model is then applied to Visium HD data, where cell boundaries are not directly known. The final output is a set of reconstructed cells with assigned gene expression profiles and predicted cell types.
This toolkit is important because improved cell boundary definition and cell type assignment in spatial transcriptomics enables more accurate downstream analyses, including characterization of the tumor microenvironment, identification of cell–cell interactions, and improved understanding of spatial organization in cancer and other disease tissues.","uni":"pmt2117","language":"The project was implemented in Python. We used common scientific computing and machine learning libraries, including PyTorch for neural network modeling and Scanpy/AnnData for handling spatial transcriptomics data. The pipeline was run in a high-performance computing environment to support large spatial transcriptomics files and model training. We also incorporated publicly available state-of-the-art tools, including StarDist for nuclei segmentation from H&E images and CellTypist for cell type annotation. These tools were integrated into the overall pipeline to support image-based nuclei detection, expression-based cell type labeling, and downstream spatial transcriptomics analysis.","pid":"202605-2","m4uni":"","analytics":"The pipeline includes several major modules. First, Xenium and Visium HD datasets are processed separately to prepare expression matrices, spatial coordinates, nuclei annotations, and cell type labels. StarDist is used to segment nuclei from the H&E image, and CellTypist is used to annotate cell types based on expression profiles.
The core model is a multilayer perceptron classifier. Each model input is a candidate pair consisting of one gene expression bin and one nearby nucleus. The model receives two types of features: the gene expression difference between the bin and the nucleus-mean expression profile, and the spatial offset between the bin and nucleus. The model outputs the probability that the bin belongs to that nucleus. During inference, candidate bin–nucleus pairs are scored, and each bin is assigned to the most likely nucleus above a confidence threshold.
We evaluated model performance using several metrics, including F1 score, precision, recall, and accuracy for bin assignment. In the presentation, we focus primarily on F1 score for correctly predicting whether a bin belongs to a given nucleus, as well as cell size metrics based on the number of bins spanning the longest diagonal of the reconstructed cell. These metrics help assess both assignment accuracy and whether the reconstructed cells have biologically reasonable spatial sizes.
For visualization, we generated spatial plots of the tissue image. Expression bins are colored by their assigned cell type, and cell outlines or boundaries are shown to illustrate the reconstructed cell assignments. These visualizations allow qualitative evaluation of whether the model produces spatially coherent cell assignments that align with the underlying tissue morphology.","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Our project used a publicly available 10x Genomics dataset:
https://www.10xgenomics.com/datasets/xenium-human-lung-cancer-post-xenium-technote
This dataset contains matched Xenium and Visium HD spatial transcriptomics data from a human lung cancer sample. The Xenium data provide higher-confidence cell assignments and were used as ground truth for model training. The Visium HD data were used as the target dataset, where the goal was to assign 2µm expression bins to individual nuclei and reconstruct cell-level expression profiles.
The pipeline is designed to support other samples with similar paired or compatible spatial transcriptomics data. In particular, it can be adapted to datasets that include high-resolution expression bins, spatial coordinates, an associated tissue image, and either ground-truth or reference cell assignments for training. With appropriate preprocessing, the same framework could be applied to other tissues, disease contexts, or matched Xenium/Visium datasets.","m2uni":"lc3716","m2fname":"Lixia","m3uni":""},{"projectname":"Real-time Driver’s Emotion Recognition Based on Deep Learning","timestring":"Sat May 16 00:06:32 2020","m1uni":"cz2572","m2lname":"Guo","m1fname":"Chao","m4fname":"","m1lname":"Zhang","m3fname":"","description":"Negative emotions like anger, sadness and fear can have a significant impact on driver’s response time which may cause fatal car accidents. Therefore, an emotion monitor and a timely warning to the driver will help maintain safety on the roads. In our project, we proposed a novel real-time emotion recognition system based on CNN networks using facial expressions, eye aspect ratio and head position estimation to analyze a driver's emotion states so as to ensure driving safety and reduce the risk of accidents affected by emotions.","uni":"cz2572","language":"Python, Keras, Tensorflow, OpenCV, Dlib","pid":"202005-2","m4uni":"","analytics":"We used CNN, transfer learning VGG16 and different data augmentation methods to improve the performace of our model. We also combined emotion recognition module with drowsiness detection and distraction monitoring modules to enrich the functions of our real-time system.","m4lname":"","industry":"Information","m3lname":"","dataset":"We used three dataset FER-2013, CK+ and KMU-FED. All of these are public datasets and we can download through the websties. 
After preprocessing, the image data that contains the facial expressions can be used in our model.","m2uni":"lg3095","m2fname":"Longwei","m3uni":""},{"projectname":"CodeEvo: A Self-Improving Coding Agent","timestring":"Tue May 5 04:19:04 2026","m1uni":"ps3558","m2lname":"Ou","m1fname":"Peiyan ","m4fname":"","m1lname":"Sun","m3fname":"Zhuyun","description":"CodeEvo is a coding-focused self-improving agent that improves across sessions without retraining the underlying language model. The goal is to demonstrate system-level self-improvement by externalizing successful debugging experience into persistent memory, reusable skills, and searchable session history.

The project focuses on Python debugging workflows because coding tasks provide objective verification through tests. CodeEvo can run tests, inspect failure output, read and modify source files, verify fixes, save successful workflows as reusable skills, and reuse those skills on similar later bugs.

The innovation is not a new foundation model, but an implemented agent system that makes improvement visible and verifiable. A local dashboard shows live evidence including test status, saved skills, memory facts, session recall, and source code snapshots. The final demo shows a FAIL -> fix -> save skill -> similar bug -> reuse skill -> PASS workflow.
","uni":"ps3558","language":"Python 3, standard-library unittest, SQLite, JSON, Markdown, HTML/CSS/JavaScript dashboard, OpenAI-compatible chat completions API through OpenRouter.","pid":"202605-21","m4uni":"","analytics":"The implemented system includes an iterative LLM agent loop, tool-based code execution, persistent memory, reusable skill storage, session search, and a local evidence dashboard.

Main modules:
1. LLM loop: decides whether to answer directly or call tools
2. Tool executor: read files, write files, run commands, search code, search sessions
3. Memory store: saves durable project facts in JSON
4. Skill store: saves reusable debugging workflows as Markdown skill files
5. Session store: stores conversations in SQLite for cross-session recall
6. Self-improvement hook: extracts useful facts and workflows after successful multi-step tasks
7. Dashboard: visualizes chat, test status, saved skills, memory, session recall, and source code

The main self-improvement algorithm is:
Observe failing tests -> Act with tools -> Reflect on successful workflow -> Store memory/skills/sessions -> Reuse saved experience on future similar tasks.
","m4lname":"","industry":"Information","m3lname":"Jin","dataset":"This project does not use a traditional static dataset. Instead, the data consists of operational artifacts generated by the agent during coding tasks.

The tested data/artifacts include:
1. conversation history and user prompts
2. saved memory facts in JSON
3. reusable skill files in Markdown
4. command outputs and unit-test results
5. source code edits
6. SQLite session records for cross-session recall

For validation, we use controlled Python unit-test scenarios:
1. calculator arithmetic bugs: subtract and multiply bugs
2. string/parser bug: slug normalization with punctuation and repeated separators
3. edge-case bug: empty input and flat-score handling in statistics utilities

These scenarios are small but intentionally designed to test whether a debugging workflow can be saved and reused across similar coding tasks.
","m2uni":"do2487","m2fname":"Debang","m3uni":"zj2434"},{"projectname":"Development and Analysis of a Generative Algorithm for Universal Adversarial Perturbations (UAPs)","timestring":"Fri Dec 13 20:23:54 2019","m1uni":"jda2167","m2lname":"","m1fname":"Jonathan","m4fname":"","m1lname":"Armstrong","m3fname":"","description":"Note: UAP = Universal Adversarial Perturbation

Goals:
1) Use and modify source code (https://github.com/LTS4/universal) to generate UAPs
2) Analyze UAPs to build insight towards a generative algorithm
3) Design and evaluate a generative approach for building UAP algorithms

Innovations:
Successfully developed an algorithm for generating effective UAPs without dependency on images or models. This algorithm creates a UAP in seconds compared to hours using the source code from Goal 1

Why is this research important?
UAPs can be used in black box attacks against production models, e.g. trick a self driving car into thinking that a stop sign is a speed limit sign. The ability to quickly generate a UAP has the potential to wreak havoc on production systems. Making such algorithms available and understanding why they work can help produce safer and more robust systems. ","uni":"jda2167","language":"Google Cloud Storage, Big Query, and Virtual Machine; Python, HTML, javascript; Linux and Windows","pid":"201912-50","m4uni":"","analytics":"Original UAP Generation:
* Run on Linux Virtual Machine
* Code in Python with TensorFlow using a pre-trained Inception CNN.
* Training Images stored and downloaded from Google Cloud Storage Bucket

Analysis of UAP Generation:
* Run on Windows
* Code in Python with TensorFlow, reusing the pre-trained Inception CNN.
* Visualization via conversion of UAPs to images; histograms of UAP values
* Applied various transforms to UAPs to evaluate symmetries

Generative UAP Algorithm:
* UAP color layers are identically initialized from uniform distribution
* A 10 x 10 Convolution transform is selected randomly from a uniform distribution and adjusted to maintain equal positive and negative entries.
* The convolution is applied to the UAP over numerous iterations
* The resulting UAP is \"saturated\" (all values rounded up/down to +/- 10.0)
* Each generated UAP includes a GIF visualization of how the UAP evolved over each iteration

Analysis and Comparison of UAPs
* Django website
* Data queried from Google Cloud Big Query using Python Pandas-GBQ library
* Images retrieved from Google Cloud Storage Bucket using Python Google-Cloud library
* Visualization - demo: images + perturbations + effect (fooled or not)
* Visualization - analysis: bar charts comparing original UAP algorithm against my generative algorithm","m4lname":"","industry":"Information","m3lname":"","dataset":"Initial Dataset: ImageNet ILSVRC2012 Validation Data
Derivative Dataset: 44 UAP images
Self Generated: Table (4.7 Million rows) evaluating image classification over 50k images and 94 UAPS (50 using my generative algorithm)","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Events Linkage and Impact Prediction","timestring":"Fri Apr 23 20:58:41 2021","m1uni":"tc3075","m2lname":"Doan","m1fname":"Tzu Yi","m4fname":"","m1lname":"Chuang","m3fname":"","description":"Financial industry communicates by sharing the information and data through the internet, and eventually social media websites. With the tremendous growth of social media and its user traffics, many social media platforms have become the main ways for retail investors to actively gather information, share news and update the market trends. By interacting with others on these platforms, users generate data, but also consume data. Their actions are partly aligned with it. The recent event with GME stock and the subreddit group called walstreetbets, is one of many examples where social media are used to broadcast information, that eventually triggers widespread market reactions almost instantly.

Thus, this project aims to connect the dots between the sentiments and the market movements. The project focuses on modeling the impact of news and social media on the stock market, through building knowledge graph. The project topic will address the challenges of big data: volume, velocity and variety. This project will incorporate multi-disciplinary knowledge from machine learning and market intelligence, to more domain knowledge such as capital market pricing, financial engineering and risk management.
","uni":"tc3075","language":"Python, HTML, Flask, Neo4j, Spacy, NLTK","pid":"202105-10","m4uni":"","analytics":"We analyze the correlation of S&P 500 Companies by building the knowledge graph from DBpedia
We crawl the data, store it in Neo4j, and visualize it
We build a stock matching engine and triplet extractor with NLP tools and fuzzy search
We come up with a formula to build a company distance matrix and apply SVD to get hidden features
We host a website to give instant result when user input the tweet or news","m4lname":"","industry":"Finance","m3lname":"","dataset":"S&P 500 Companies info: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
DBpedia data (Symbol/Company name/GICS Sector/Description): https://github.com/dbpedia/lookup
Twitter (Tweet/Company keyword/Create date)
Nasdaq Symbol List: https://www.nasdaq.com/
","m2uni":"ad3801","m2fname":"Wendy","m3uni":""},{"projectname":"Real-time Analysis of Taxi Usage across New York City (NYC) ","timestring":"Sat Dec 17 02:50:53 2022","m1uni":"ra3141","m2lname":"Ammanamanchi","m1fname":"Rishav","m4fname":"","m1lname":"Agarwal","m3fname":"Rachana","description":"Objectives:
1) Incentivize car pooling by understanding cab’s capacity utilization
2) Improved car deployment for taxi services based on historical pickup density
3) Increase customer satisfaction by pre-allocating long-haul cabs based on predicted drop-off locations.

Capabilities:
1) Visualizing the cab utilization for carpooling across the city
2) Understand the trend of movement of people across New York City as the day progresses
3) Dynamic Pickup and Destination prediction using Machine Learning models

Innovations:
Although the NYC TLC dataset has been collected and updated since 2010, we didn’t find any project that covered the end-to-end big data workflow, from data streaming, stream processing, and data warehousing to insight and visualization generation, while also leveraging ML models to improve and predict custom use cases.
We now have a one stop solution that works with live streaming data and also supports incremental model updates.
","uni":"ra3141","language":"Google Cloud Platform, Python, PySpark (Mlib, structured streaming), DataProc, Big Query, Tableau, Google Cloud Storage, d3.js, HTML / CSS, Apache","pid":"202212-2","m4uni":"","analytics":"Algorithm 1:
Using PySpark, we applied the concepts of RDD and MapReduce to aggregate the frequency of passenger count into two groups: = 1 and >= 2.
First, we used the .map and .reduceByKey functionality to calculate the frequency of every passenger count from 1 to 9. Using the resulting RDD, we then summed the frequencies for passenger counts of 2 or more, to understand the carpooling behavior across the Green Taxi data.
After plotting the data obtained from the program as a pie chart using d3.js, we see that for the Green Taxi data in particular, a passenger count of 1 has a frequency of almost 83%, while higher counts account for just 17%.

Algorithm 2:
Visualize the movement of cabs across NYC and understand the trend based on the density of pickup and drop-off locations, leveraging DataProc clusters and Spark. This can be used to improve deployment of vehicles for taxi services based on the pickup locations through the day. The first pre-processing step is to add day-of-week and hour columns derived from the date/time column, which shows how the locations change by the hour over a week. Color indicates the density of pickups based on latitude and longitude: red regions indicate higher density, yellow indicates moderate density, and green and blue show lower density of pickups

Algorithm 3:
Data cleansing and ground-truth extraction: basic preprocessing of null values and erroneous latitude and longitude
Use Geohashing to bin geographic locations
Bin time of day into half hour time slots
Gather pickups based on geohashes.

Features:
Time, day, Pickup Location, is weekend?

Models Evaluated (RMSE is in log base 10 space):
Random Forest (R2: 0.79, RMSE: 0.127)
K Means (R2: 0.5507, RMSE: 0.186)
Neural Networks (R2: 0.5212, RMSE: 0.192)

Visualizations:
Pie Chart
HeatMap
Bar Plots

","m4lname":"","industry":"Transportation","m3lname":"Dereddy","dataset":"The Dataset tested are:

Green Taxi Data: NYC OpenData and Socrata OpenData REST API
Yellow Taxi Data: NYC OpenData and Socrata OpenData REST API.
Uber pickups Data: https://fivethirtyeight.com

Volume: The first dataset has a dimension of (8049224, 22) while the Uber Taxi dataset has a dimension of (4534327, 4). This gives a combined dataset of more than 12 million rows.
Velocity: The real-time data is processed in batches. Every second, we process a batch of 100 records.
Variety: Our data is captured in JSON and CSV formats and is aggregated from three different forms of transportation: Yellow taxi, Green taxi, and Uber vehicle trip records.

Data Collection through Structured Streaming:
1) Mock Stream Generator: Scrapes data from the REST APIs of NYC TLC and the pre-fetched Uber pickups data, and sends the data at a custom rate to the target socket.
2) Stream Parser: The incoming data from the various sources is handled by a Spark Structured Streaming socket listener, then refined, pre-processed, and finally written in micro-batches to Google BigQuery.

The software can support any taxi data that provides latitude and longitude in formats such as CSV, Parquet, and JSON.
","m2uni":"sa3979","m2fname":"Karthik","m3uni":"rd2998"},{"projectname":"Prediction of Helpfulness for the Amazon movie review","timestring":"Sat Dec 16 00:50:05 2023","m1uni":"cj2792","m2lname":"Yu","m1fname":"Chuyi ","m4fname":"","m1lname":"Jiang","m3fname":"","description":"Our goal is to build a model to predict a review whether it is helpful or not. According to modern research, the customer review could have a huge impact on purchase intentions in e-commence. However, there are plenty of review for a certain product. Although, there is a voting system and score system, some of them may not be useful.","uni":"cj2792","language":"python","pid":"202312-18","m4uni":"","analytics":"We used NLP and related language analysis system to generate the features

We also tried several machine learning models like random forest, and deep learning model to train the features.","m4lname":"","industry":"Retail","m3lname":"","dataset":"The dataset we used is a Amazon movie data review, according to J. McAuley and J. Leskovec. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. WWW, 2013.","m2uni":"zy2582","m2fname":"Zhizhen","m3uni":""},{"projectname":"Explainable AI for News Integrity: A Hybrid Classification and LLM Generation System","timestring":"Sat Dec 20 02:56:23 2025","m1uni":"hh3164","m2lname":"Chiu","m1fname":"Hung-Kai","m4fname":"","m1lname":"Huang","m3fname":"","description":"1. Project Objectives
The primary goal is to bridge the gap between accurate but opaque \"black box\" deep learning models and the human-interpretable verification needed for user trust. Specific objectives include:

Initial Credibility Assessment: Providing a rapid automated flag for potentially unreliable content using neural classifiers.

Granular Verification: Moving beyond simple binary (True/False) labels to verify individual claims within an article.

Evidence-Based Explanations: Generating human-readable verdicts grounded in specific evidence from trusted sources like Wikipedia and live web searches.

Scalability and Resilience: Building a system capable of handling large-scale knowledge bases and real-time news traffic using cloud-native infrastructure.

2. Key Innovations
The system introduces several novel architectural elements to improve automated fact-checking:

Hybrid Verification Pipeline: A unique architecture that integrates a fine-tuned RoBERTa encoder for classification with a modular Retrieval-Augmented Generation (RAG) pipeline for analysis.

Dual-Mode Claim Extraction: The implementation of \"Simple\" and \"Claimify\" modes, allowing the system to balance processing speed with high-precision decontextualization of claims.

Scalable Knowledge Retrieval: Using PostgreSQL with the pgvector extension to perform sub-second semantic searches across millions of Wikipedia document embeddings.

LLM-based Semantic Verifier: Utilizing Google's Gemini model to act as an \"intelligent judge\" that synthesizes multi-source inputs using Chain-of-Thought (CoT) reasoning.

3. System Capabilities
The architecture is designed to replicate the workflow of a human fact-checker through five primary stages:

Rapid Classification: Achieving 93.4% accuracy in initial fake news detection.

Atomic Claim Extraction: Decomposing unstructured text into standalone, verifiable statements.

Static & Dynamic Retrieval: Cross-referencing claims against a permanent Wikipedia database and live web indices via the Perplexity Search API.

Temporal Sensitivity: Successfully verifying breaking news events (e.g., a Dec 2025 shooting) that occur after a model's training data cutoff.

Conflict Resolution: Correcting \"common knowledge\" hallucinations by prioritizing retrieved evidence over the model’s internal parametric memory.

4. Importance of Research and Toolkits
This research is vital because it addresses the core limitations of modern AI in the information ecosystem:

Combating Misinformation: As digital falsehoods proliferate, automated systems must be both fast and reliable to maintain information integrity.

Building Trust through Transparency: By explaining why a verdict was reached and providing source citations, the system encourages practical adoption by human users.

Solving the Knowledge Cutoff: Unlike standard LLMs (like ChatGPT), which may hallucinate when faced with events beyond their training data, this system's use of real-time search ensures accuracy for current events.

Big Data Scalability: By decoupling compute (Google Cloud Run) and storage (PostgreSQL), the project demonstrates how to scale fact-checking to handle real-world traffic loads.","uni":"hh3164","language":"Python, Google Cloud Run, PostgreSQL with pgvector, Perplexity Search API, Groq API, Gemini API","pid":"202512-13","m4uni":"","analytics":"System Modules
FakeNewsDetector: Performs the initial rapid credibility assessment using fine-tuned transformer encoders.

ClaimExtractor: Utilizes Llama 3.1 to decompose articles into atomic, decontextualized, and verifiable claims.

Static & Dynamic Retrievers: Modules that fetch evidence from a localized Wikipedia knowledge base and the live web.

LLM-based Verifier (LLMExplainer): Powered by Google Gemini 2.5 Flash, this acts as the \"reasoning engine\" to synthesize all inputs into a final verdict.

Analytics & Algorithms
Deep Learning Classifiers: Fine-tuned RoBERTa and DistilBERT models used for binary classification of text veracity.

Semantic Search Algorithms:

Cosine Similarity: Uses the <=> operator to find relevant factual evidence based on semantic meaning rather than just keywords.

IVFFlat (Inverted File Flat): An indexing algorithm implemented to maintain sub-second query performance as the knowledge base grows.

Retrieval-Augmented Generation (RAG): A hybrid approach that combines neural text classification with external evidence retrieval.

Chain-of-Thought (CoT) Prompting: Directs the LLM to articulate intermediate reasoning steps before reaching a final verdict to ensure transparency.

Visualization and User Interface
The user-facing layer is designed to turn complex data into human-interpretable reports:

Streamlit Framework: A web-based frontend that serves as the central orchestrator for user configuration and analysis execution.

Interactive Sidebar: Allows users to toggle extraction modes (Simple vs. Claimify) and adjust retrieval depth.

Visual Verdict Cards: Uses color-coding (e.g., Red for \"Misleading,\" Green for \"Verified\") to provide immediate clarity on news integrity.

Expandable Evidence Panels: Enables users to inspect the specific source text and citations backing each claim for independent verification.","m4lname":"","industry":"Information","m3lname":"","dataset":"The omarkamali/wikipedia-monthly dataset provides comprehensive, regularly updated snapshots of Wikipedia content across over 341 languages. Unlike static archival dumps, this dataset is refreshed monthly to capture the latest articles, events, and cultural developments, addressing the \"knowledge cutoff\" problem common in NLP research. It features clean, pre-processed plain text where MediaWiki markup has been parsed and removed, making it immediately ready for training Large Language Models (LLMs), retrieval-augmented generation (RAG) pipelines, and multilingual analysis.

The WELFake (Word Embedding over Linguistic Features for Fake News Detection) dataset is a large-scale corpus designed for training robust fake news classification models. It consolidates four smaller, widely used open-source datasets—Kaggle, McIntire, Reuters, and BuzzFeed Political—into a single unified resource. With over 72,000 articles, it provides a broad and diverse range of news content, preventing overfitting to specific writing styles or narrow topics. The dataset is notable for its excellent class balance, making it an ideal foundational dataset for teaching models to distinguish between authentic journalism and fabricated stories.

The Cartinoe5930/Politifact_fake_news dataset is a collection of approximately 21,000 political claims and articles that have been manually fact-checked by experts from PolitiFact. Designed for tasks like fake news detection and automated fact-checking, it serves as a high-quality ground truth resource by aggregating real-world political statements and their verified truthfulness labels. The dataset is particularly valuable for training models to distinguish between nuanced degrees of truth in political discourse, ranging from \"True\" to \"Pants on Fire\" lies.","m2uni":"lc4021","m2fname":"Liang-Jie ","m3uni":""},{"projectname":"AI Companion: Speech, Emotion and Context Recognition ","timestring":"Thu May 14 23:47:27 2020","m1uni":"ly2451","m2lname":"Parmar","m1fname":"Liane","m4fname":"","m1lname":"Young","m3fname":"","description":"The goal of the project is to build an audio-based emotional companion. While there are many AI assistants on the market today (Siri, Alexa, etc), they lack the capability to provide empathetic responses in an emotionally-based situation. By identifying and incorporating emotion into our chatbot training, our system will gain emotional context from a conversation and be able to provide a logical reply in response.

We experimented with several datasets and methods to determine the best fit for our project goals. We also incorporated and trained our chatbot model with emotion-tagged conversation data to improve the training results. We built our own Seq2Seq model with an attention layer, which allowed us to process different types of emotions gained through speech and video recognition methods. Our exploration of different methods and datasets was important because our project goals needed a customized solution to achieve our expected results.","uni":"ly2451","language":"Python, Tensorflow environment, Keras","pid":"202005-3","m4uni":"","analytics":"The following algorithms and methods were used in their respective parts of our implementation:

- Speech Emotion Recognition: feature extraction (MFCCs, Mel Spectrogram, chroma), ANN, KNN, CNN, confusion matrix metrics
- Speech-to-Text: Google Cloud Speech-to-Text API with speechrecognition library
- Chatbot: Seq2Seq model with Attention layer
","m4lname":"","industry":"Information","m3lname":"","dataset":"We used several datasets for different areas of the project development: RAVDESS, Santa Barbara Corpus of Spoken American English, Cornell Movie-Dialogs Corpus, and DailyDialogs dataset.
All of these datasets were public at the time of this project.","m2uni":"prp2126","m2fname":"Prutha","m3uni":""},{"projectname":"B11: Customer Interaction — Insurance Product Sales & Marketing Strategy","timestring":"Fri May 3 19:59:23 2024","m1uni":"cw3512","m2lname":"Li","m1fname":"Chengyu","m4fname":"","m1lname":"Wang","m3fname":"","description":"
The goal is to analyze sales and marketing strategies using insurance-related Reddit data. This information helps insurance companies refine their products and services to better meet the needs of their target audience.
In our research, we first fetch post content for topic detection, and then fetch live-stream
comments from Reddit for sentiment and emotion analysis. By automatically
identifying the main themes or topics present in online comments, we gain a deeper understanding of clients' preferences and feedback regarding the product sales and marketing strategies used in the
insurance industry.","uni":"cw3512","language":"Reddit, MongoDB, Fast API, REACT, Google Pub/Sub","pid":"202405-8","m4uni":"","analytics":"We use live-stream comment sentiment and emotion scores stored in MongoDB to analyze the
sales and marketing strategies used in the insurance industry. We deploy the code on an EC2
instance.
VADER (Valence Aware Dictionary and sentiment Reasoner) is a lexicon and rule-based sentiment
analysis tool specifically designed for analyzing sentiment in text data. We use it to analyze the
sentiment of text data by considering both individual word sentiment scores and the contextual
information provided by grammatical and syntactical rules. The output is a categorical sentiment
label that reflects the overall sentiment expressed in the text.
Latent Dirichlet Allocation (LDA) is a generative probabilistic model in which each document is assumed to be a mixture of various topics, and each topic is characterized by a distribution over words. We use LDA to run topic detection on all the post data.
ChatGPT is used to generate summaries and human-readable topics for the posts.
On the React front end, we provide the TopicSection and CommentsSection. Users can choose the time range, select the keyword/emotion, and fetch the results.
","m4lname":"","industry":"Finance","m3lname":"","dataset":"We directly get and use data from Reddit. We did not base our research on an unchanged/fixed dataset.","m2uni":"hl3700","m2fname":"Huiyuan","m3uni":""},{"projectname":"Identifying Financial Trends for Effective Investment Portfolio Management using Machine Learning Techniques","timestring":"Fri May 5 23:09:20 2023","m1uni":"sa4084","m2lname":"Kamath","m1fname":"Suryanarayana ","m4fname":"","m1lname":"Akella","m3fname":"","description":"The motivation for investment strategy is to maximize returns and minimize risk. With the volatility associated with financial markets, we implemented suitable machine learning techniques to build investment strategies for managing portfolios keeping the fundamentals of minimizing risk and maximizing returns for long term in mind.

In summary, the motivation for investment strategy is to achieve long-term financial success by making informed decisions about how to allocate capital in a manner that is aligned with the investor's goals and risk tolerance.

Research in this domain is necessary because financial markets produce large amounts of data and are volatile. ML/AI makes it possible to solve the problems in this domain in an optimized way.

We have implemented LSTM and a few regression models for the ML part, followed by solving the MVO part for portfolio allocation.","uni":"sa4084","language":"Javascript, Python, ReactJs, Imgbb, Django, REST, SQLite","pid":"202305-9","m4uni":"","analytics":"Algorithms: LSTM, Linear Regression, Random Forest Regression, Gradient Boosting Regression.
Optimization: Mean Variance Optimization
System Modules: sklearn, keras, reactchartjs, pickle, numpy, matplotlib, django, CORS, etc.
Visualizations: Stock data for individual stocks, portfolio performance, LSTM predictions.","m4lname":"","industry":"Finance","m3lname":"","dataset":"The dataset used was the S&P 500 dataset, which has stock data such as High, Low, Open, Close, etc., for stocks on a day-to-day basis.
Link: https://www.kaggle.com/datasets/paultimothymooney/stock-market-data","m2uni":"ak4808","m2fname":"Aayush","m3uni":""},{"projectname":"Decoding and Analyzing Medical Data","timestring":"Sat May 16 01:41:14 2020","m1uni":"sl4653","m2lname":"Sun","m1fname":"Sirui","m4fname":"","m1lname":"Li","m3fname":"","description":"
Use medical information to build a model for similar patient retrieval and treatment recommendation.","uni":"sl4653","language":"Python, R","pid":"202005-20","m4uni":"","analytics":"Word2vec, Dynamic Time Warping, R Shiny","m4lname":"","industry":"Life Science","m3lname":"","dataset":"Our dataset is from MIMIC-III Critical Care Database. MIMIC-III (Medical Information Mart for Intensive Care III) is a freely-accessible database, and it includes health data between 2001 and 2012 from over 40,000 ICU patients of the Beth Israel Deaconess Medical Center. Submitting a request to access the dataset is required.
","m2uni":"xs2338","m2fname":"Xiaoli","m3uni":""},{"projectname":"AI Powered Investment Management System","timestring":"Wed Jan 5 21:11:34 2022","m1uni":"ycj2103","m2lname":"Raj","m1fname":"Yash","m4fname":"","m1lname":"Jain","m3fname":"Sunjana","description":"-Build a comprehensive wealth management system to help users make informed financial decisions
-An end-to-end system where a user can perform in-depth analysis for stocks, make a trade and watch their portfolio","uni":"ycj2103","language":"Python, Streamlit, MySQL Server, Tweepy","pid":"202112-57","m4uni":"","analytics":"Visualization: Candle Charts, Line charts, Forecasting charts
Algorithms: Sentiment Analysis, Recommendation system, FB Prophet
","m4lname":"","industry":"Finance","m3lname":"Ramana","dataset":"-We got the S&P500 stock price data from Yahoo finance using the Yahoo Finance API
-For Twitter data, we used Twitter's developer API to get the top 20000 tweets related to a particular stock in real time","m2uni":"ar4283","m2fname":"Ayush","m3uni":"sc4921"},{"projectname":"Automatic Finding Sales Leads for Concert Tickets","timestring":"Thu May 14 01:43:21 2020","m1uni":"ks3198","m2lname":"","m1fname":"Kexin","m4fname":"","m1lname":"Su","m3fname":"","description":"Sales leads, in simple terms, are people or businesses that are potential buyers or customers of a product. An accurate sales-lead-finding process is very important to a company because it enables the seller to spend less money on marketing, yet have a higher conversion rate. However, about 60% of marketers currently find it a great challenge to generate traffic and leads by marketing to someone who is not really interested in their product. Therefore, in this project, I want to design a system that automatically helps ticket sellers promote their concert tickets by recommending potential customers. This system will ideally make the sales-lead-finding process more accurate and efficient. Unlike a traditional recommendation system, which focuses on recommending products to buyers, the system that I want to create focuses on the sellers. The system takes in the concert information and returns a list of potential buyers and their corresponding personal information, so that the salesperson can promote to those selected users and have a higher chance of them buying the ticket. ","uni":"ks3198","language":"Python, html","pid":"202005-7","m4uni":"","analytics":"Sentiment Analysis, creating simple knowledge graphs, and topic modeling using LDA are implemented. I also implemented a web application as a visualization of the whole system. ","m4lname":"","industry":"Retail","m3lname":"","dataset":"The concert information dataset I used is scraped from the SeatGeek website using their official API. 
Google Trends information is gathered using pytrends, an unofficial Google Trends API for Python. Twitter information is accessed through Tweepy, a Python library for the Twitter API.

The system can support any form of concert information as long as the csv file contains a 'Performer Name' column and \"Performer Genre\" column.","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Investment Strategy – Automatic Dynamic Asset Allocation","timestring":"Fri May 5 16:26:18 2023","m1uni":"zcb2110","m2lname":"","m1fname":"Zachary","m4fname":"","m1lname":"Burpee","m3fname":"","description":" - Setup successful algorithmic portfolio adjuster using RL for decisions on a daily-basis, based off of calculated reward, analyst ratings, and potential earnings

- Using LSTM and DL models, classify trends of market downturn and potential entry/exit points

- Have an interactive user portal for entering portfolio capital, asset values, and risk tolerance, giving immediate feedback on annual profit potential versus the profit potential of the recommended moves over a year (or the chosen time frame)

- Have user input which asset class they are comfortable with investing in to diversify portfolio optimally

- Real-time implementation of portfolio balancer and pseudo-trades performed over a time period using assets chosen

- Evaluate individual asset gradients based on LSTM/DL/ML and incorporating popularity (volume, news)","uni":"zcb2110","language":"Python","pid":"202305-4","m4uni":"","analytics":"Analytics:
- EDA for covariance of different asset classes
- Volume, Value, and Velocity of data scraped and hosted all were taken into consideration
- Sector and asset correlation diagrams for training models with highly correlated and weakly correlated datasets

Algorithms:
- Bellman equation for RL
- Keras DL and LSTM models with MSE loss functions
- Webscraping algorithm for accessing real-time data
- Portfolio BCR for expected return and respective weights
- Feature extraction from data using pandas

System Modules:
- Data processing -> DL/LSTM models -> News Sentiment models -> One Hot indicators -> RL BUY/SELL/HOLD model -> One Hot indicators -> DL model for weighting -> Updating parameters -> App Display

Visualizations:
- Native Python App displaying portfolio values, overall gain, and trades made throughout the day
","m4lname":"","industry":"Finance","m3lname":"","dataset":"- Training data for model predictions was downloaded through the Y-finance API (publicly available)
- Training data for refinement of model predictions comes from Kaggle datasets of Lv.2 order books
- Training data for news sentiment was a combination of scraped data and previously available Kaggle headlines","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Automatic Story Telling on Public Events","timestring":"Fri May 15 03:05:01 2020","m1uni":"xt2233","m2lname":"Wang","m1fname":"Xinluan","m4fname":"","m1lname":"Tian","m3fname":"","description":"Social media is a new platform where people can learn about news. However, information on social media can be overwhelming and fragmented. It is important to have a platform that can aggregate all information on a public event and automatically generate stories that describe the event in a human style. Here we use the Twitter streaming feed for a particular public event and apply NLP techniques to extract key information about the event.
We use Apache Spark and Python to demonstrate a real-life example: processing, analyzing, and extracting insights from social network data in real time.
","uni":"xt2233","language":"Python, Spark","pid":"202005-5","m4uni":"","analytics":"PCA, Extractive Summarization, timeline visualization, bert-extractive-summarizer, word embedding","m4lname":"","industry":"Media","m3lname":"","dataset":"Online streaming from the Twitter API. We analyzed several events, including the Yosemite wildfire, Syria chemical weapons, the death of poet Maya Angelou, and the 2014 World Cup.
","m2uni":"jw3514","m2fname":"Jiayao","m3uni":""},{"projectname":"Antibody Developability ","timestring":"Wed May 14 04:26:32 2025","m1uni":"ar4383","m2lname":"Cheng","m1fname":"Anne Caroline","m4fname":"","m1lname":"Rabello Rolim","m3fname":"Andrew","description":"Our goal was to build a model to predict antibody developability from sequence data. We focused on CDR regions, which are the most important for binding. Key features include custom BPE tokenizer, region-aware masking, and CDR extraction from a rich dataset. This work can help screen antibody candidates before expensive and long lab experiments. ","uni":"ar4383","language":"Python","pid":"202505-2","m4uni":"","analytics":"CDR Extraction, BPE Tokenization, Masked Language Model with BERT-style transformer","m4lname":"","industry":"Life Science","m3lname":"Chang","dataset":"Trained on 728K unique antibody sequences from 4 public datasets: ABSD, DOTAD, SAbDab and PLabDab. ","m2uni":"cc5210","m2fname":"Julie","m3uni":"cc5223"},{"projectname":"Adaptive Planning and Execution System (APEX)","timestring":"Tue May 5 22:37:27 2026","m1uni":"ss7561","m2lname":"Natarajan","m1fname":"Sanskruti","m4fname":"","m1lname":"Shejwal","m3fname":"Vaishnavi","description":"APEX (Adaptive Planning and Execution System) addresses a fundamental gap in multi-agent warehouse automation: most systems either plan globally but can't react in real time, or react locally but lack strategic coordination. APEX bridges this gap through hierarchical planning where a strategic layer decomposes orders into tasks and a tactical layer executes them with real-time conflict resolution.

Key objectives: coordinate heterogeneous robot fleets across a two-layer planning hierarchy, handle dynamic disruptions without full replanning, and integrate a large language model as an optional strategic planner alongside classical algorithms.

Innovations: a clean escalation protocol between planning layers (LocalReplanner → EscalationSignal → StrategicCoordinator), a validated TaskGraphDelta framework ensuring LLM outputs are schema-verified before application, and a MAP-style Gemini pipeline that augments deterministic HTN planning with language-driven reasoning while guaranteeing automatic fallback.

Why important: warehouse automation is a billion-dollar industry problem. Current systems rely on brittle hand-coded rules that break under disruptions. In this project, we study how symbolic planning, search-based assignment, and LLM reasoning can be safely composed in dynamic multi-agent environments.","uni":"ss7561","language":"language: Python ; Platform: macOS, Windows","pid":"202605-15","m4uni":"","analytics":"Planning algorithms:
- HTN (Hierarchical Task Networks) — recursive goal decomposition using applicability checks. Decomposes fulfill_order into PICK → TRANSPORT → STAGE → DISPATCH
- UCT MCTS (Monte Carlo Tree Search with Upper Confidence Bound) - search over agent-to-task assignment space, scoring by feasibility, proximity, and load balance
- MAP pipeline (Gemini LLM) — four-stage specialist prompting (decompose → predict → monitor → coordinate) producing validated TaskGraphDelta edits to the HTN plan

Pathfinding algorithms:
- CBS (Conflict-Based Search) — multi-agent pathfinding via constraint tree, resolves vertex and edge conflicts across all agents simultaneously
- A* with constraints — low-level single-agent pathfinding within CBS, extended with vertex/edge constraint sets and a ReservationTable for space-time occupancy tracking

System modules:
- StrategicCoordinator — planning mode dispatcher (HTN / MCTS / MAP)
- DomainTranslator + TaskResolver — abstract task grounding to physical warehouse IDs
- TacticalExecutor — per-agent instruction queues, SimPy process loop
- LocalReplanner + EscalationSignal — disruption handling with escalation boundary
- EpisodeDriver — headless scenario runner with full telemetry
- ExperimentRunner — sequential scenario sweeps
- StochasticEventGenerator — random disruption injection at configurable rates
- TaskGraphDelta — validated incremental graph edits with schema enforcement

Visualization and analytics:
- Pygame grid renderer — real-time 2D warehouse visualization with agent status rings, path overlays, fleet status panel, and telemetry
- VideoRecorder — MP4 export via imageio/ffmpeg
- FastAPI dashboard — web UI showing metrics, event timeline, consistency hints, and embedded video playback per run
- MetricsCollector - event-driven KPI tracking (completion rate, idle fraction, collision count, planned vs executed conflicts, replan/escalation counts)
- plot_benchmarks.py - benchmark visualization script for cross-scenario comparison","m4lname":"","industry":"Retail","m3lname":"Varaghavenkatagiri","dataset":"APEX does not use an external dataset - it generates its own simulation data through a synthetic warehouse environment built on SimPy discrete-event simulation. All scenario data is procedurally generated and deterministic via fixed random seeds.
","m2uni":"kmn2161","m2fname":"Kirthana","m3uni":"vv2411"},{"projectname":"Spark-Powered Infectious Epidemic Dashboard for the US","timestring":"Thu Dec 19 16:37:24 2024","m1uni":"cl4404","m2lname":"Zhao","m1fname":"Chang","m4fname":"","m1lname":"Liu","m3fname":"","description":"In this study, we present a Spark-powered dashboard for analyzing and visualizing infectious epidemic data, such as COVID-19, Influenza, and Pneumonia, in the United States. Our workflow progresses step-by-step from raw datasets, where we first preprocess the data to clean and structure it. Kafka is used for stream processing, efficiently handling high-throughput data streams, while Spark performs the core data analysis to extract meaningful insights. Using Python code, we process the data and integrate the findings, which are then visualized using ECharts. We output the results in JSON format and generate rich bar charts and other visualizations. Finally, we consolidate everything into a unified HTML page to create an interactive and informative dashboard.
","uni":"cl4404","language":"python, spark, kafka, pycharm, AWS S3","pid":"202412-25","m4uni":"","analytics":"Analytics: Sum, Compare, Sort, Group, Prediction, Z-score for abnormal data
Algorithm: Linear Regression, Decision Tree
Visualization: Bar chart, line chart, funnel, pictorial bar, word cloud, map","m4lname":"","industry":"Life Science","m3lname":"","dataset":"The US COVID-19 county dataset, available on the Kaggle platform as us-counties.csv. This CSV file covers COVID-19 cases in the United States from the first reported case until May 12, 2022. It comprises approximately 2.5 million rows, each recording COVID-19 cases for a given county. The columns are the date of the report, the corresponding county and state, and the number of confirmed cases and deaths.
In addition to this primary dataset, we also incorporated data from the CDC’s Provisional Death Counts for Influenza, Pneumonia, and COVID-19. This dataset was used to conduct a simple workflow test to validate the accuracy and functionality of our dashboard generation process.
","m2uni":"yz4624","m2fname":"Yuncheng","m3uni":""},{"projectname":"Real-time Music Recommendation System","timestring":"Sat Dec 18 03:30:55 2021","m1uni":"zj2324","m2lname":"Wang","m1fname":"Zhejian","m4fname":"","m1lname":"Jin","m3fname":"Zhiqing","description":"To build a real-time music recommendation system. We build the system by using a lot of gcp services and we find that it is useful for our future work.","uni":"zj2324","language":"python, gcp","pid":"202112-18","m4uni":"","analytics":"Use K-means to get the types, and get rating by considering the relative ratio of a type of songs to all songs. Using ALS to get the recall results. We use gcp services such as BigQuery, Dataflow, Datastore, Cloud Build and Cloud Storage, Container Registry.","m4lname":"","industry":"Information","m3lname":"Wang","dataset":"We collected it ourselves","m2uni":"yw3747","m2fname":"Yixin","m3uni":"zw2780"},{"projectname":"ViSQL: A Vision-Augmented Autonomous Data Scientist for End-to-End Natural-Language Analytics","timestring":"Wed May 13 04:33:34 2026","m1uni":"ql2481","m2lname":"","m1fname":"Leah","m4fname":"","m1lname":"Li","m3fname":"","description":"ViSQL is an end-to-end natural-language analytics agent that takes a question (and optionally a reference chart) and produces an executed SQL query, a stylistically-matched visualization, and a grounded analytical report. It closes the full analyst loop — SQL plumbing, visualization, and narrative — that today's LLM tools address only one piece at a time.

The core innovation is a set of three architectural guardrails that suppress the LLM failure modes most damaging in a production analytics setting: (1) a strict A/B test gate that prevents causal claims on observational data by demoting non-experimental questions away from the ab_test route; (2) live schema introspection with a pre-execution table-existence check that catches hallucinated table names before incurring BigQuery cost; and (3) an empty-data guard that refuses to generate a narrative when SQL returns zero rows, eliminating \"fabricated findings\" failures. A multimodal style-imitation subsystem extracts a structured StyleSpec from a user-supplied reference chart and applies it via matplotlib — vision describes, matplotlib executes, so the vision model never produces pixels and cannot hallucinate data.

The project matters because it addresses the gap between LLM individual competence (writing SQL, summarizing results, generating charts) and the disciplined end-to-end workflow a senior analyst would impose: matching stakeholder visual style, refusing causal claims on observational data, and refusing to narrate on empty results. ViSQL also makes the contribution of self-correction measurable — we report retry-rate (17%), recovery-rate (55%), and a +6 pp marginal lift in execution accuracy from the retry loop — turning a usually-invisible reliability mechanism into a first-class metric. ","uni":"ql2481","language":"Python 3.10+ (~3,600 LOC across 14 modules). Frontend: Streamlit + Flask. ML/serving: PyTorch, Hugging Face Transformers, bitsandbytes (4-bit NF4 quantization), PEFT (LoRA). Retrieval: FAISS, sentence-transformers. Data: BigQuery (google-cloud-bigquery), pandas. APIs: Anthropic SDK for Claude Sonnet 4.5. Visualization: matplotlib. Statistics/ML: scipy, scikit-learn, PyTorch. Hardware: single NVIDIA A100 (40 GB) for local model serving and LoRA fine-tuning. ","pid":"202605-18","m4uni":"","analytics":"SYSTEM MODULES (six-stage core pipeline):
1. Task Router — Llama-3.1-8B-Instruct (4-bit) classifies questions into 5 buckets: single_chart, dashboard, ab_test, ml_modeling, sql_only.
2. Schema Linker — top-k=5 table retrieval via sentence-transformers (all-MiniLM-L6-v2) embeddings + FAISS index, with live BigQuery INFORMATION_SCHEMA introspection.
3. CoT Planner — Claude Sonnet 4.5 with chain-of-thought prompting; emits paired trace and structured blocks.
4. SQL Agent — Claude Sonnet 4.5 conditioned on plan + sub-schema + k=3 retrieved Spider exemplars; BigQuery dialect-aware system prompt; pre-execution table-existence check; self-correction retry loop with error-classifier (missing_object, syntax_error, wildcard_or_dataset_qualification, type_error) and up to 3 retries.
5. Analysis Dispatcher — 5 branches: chart, multi-chart dashboard, A/B test (chi-squared on 2x2 contingency + Wald CI on lift), ML modeling (sklearn linear + tree + PyTorch MLP with gradient-attribution feature importance), raw extraction.
6. Report Writer — Claude Sonnet 4.5 with hard empty-data guard.

MULTIMODAL SUBSYSTEM: Llama-3.2-11B-Vision-Instruct (4-bit) extracts a structured StyleSpec JSON (palette, gridlines, axes, spines, color mood, chart-type hint) from a reference chart; matplotlib renders the user's data in that style.

LOCAL BASELINE: SQLCoder-7B-2 with LoRA fine-tuning (rank-16 adapters on attention + FFN; α=32; QLoRA-style 4-bit base) on Spider 1.0 train.

ALGORITHMS: Few-shot retrieval (RAG over 7,000 Spider exemplars), chain-of-thought prompting, parameter-efficient fine-tuning (LoRA/QLoRA), chi-squared test, MLP gradient-attribution feature importance, CIELAB ΔE-76 color-difference metric.

VISUALIZATIONS: Style-conditioned matplotlib bar/line/scatter/dashboard rendering driven by extracted StyleSpec.

EVALUATION: Router macro-F1 (0.92), SQL exec-accuracy on Spider 1.0 dev (0.74), self-correction retry-rate (17%) and recovery-rate (55%), style imitation self-consistency via CIELAB ΔE-76 (mean 7.4, below JND threshold of 10), report quality via LLM-as-judge across four axes (4.1/5).","m4lname":"","industry":"Information","m3lname":"","dataset":"We evaluated on three public BigQuery datasets plus one synthesized A/B experiment, chosen to stress different aspects of the pipeline:

(1) SEC Quarterly Financials (Finance, 10 tables, XBRL long-format) — public via Google Cloud Public Datasets. Stresses join-planning and pivot logic on key–value-style fact tables.

(2) theLook E-commerce (Retail, 7 tables: orders, products, users, inventory, etc.) — public via BigQuery Public Datasets. Multi-table joins; serves as the gold-standard demo dataset.

(3) Google Analytics Sample (Web Analytics) — public via BigQuery Public Datasets. Nested STRUCT and repeated fields, sharded date-partitioned tables. Stresses BigQuery-specific dialect: UNNEST, wildcard-table queries, sub-selects on repeated records.

(4) Synthetic A/B experiment — variant assignment via a deterministic hash on user_id on top of the theLook users table; treatment receives a synthetic +8% conversion lift. Used as the positive case for the A/B test gate evaluation.

For SQL benchmarking we additionally used Spider 1.0 (Yu et al., EMNLP 2018): 7,000 training (question, SQL) pairs, used both for LoRA fine-tuning of SQLCoder-7B-2 and for the FAISS few-shot exemplar index, plus a 200-example dev slice for execution-accuracy evaluation.

Other data ViSQL can support: any BigQuery dataset with a queryable INFORMATION_SCHEMA. Schemas are fetched at runtime, so adding a new dataset requires no code changes. Porting to Snowflake or Postgres requires swapping the dialect rules in the SQL agent's system prompt and the INFORMATION_SCHEMA introspection module. ","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Trending Products Recommender for E-Commerce","timestring":"Sun Dec 23 05:22:50 2018","m1uni":"ad3363","m2lname":"Lee","m1fname":"Alex","m4fname":"","m1lname":"Dziena","m3fname":"Johan","description":"Product selection and pricing is a critical problem for retail businesses - an appropriate selection of available products, priced correctly for the market is highly correlated with revenue growth and is extremely important for retailers’ brand value. In e-commerce, this problem becomes even more urgent, as customers can easily visit another e-commerce retailer to find what they’re looking for at the price point they want. Reactivity to the market needs to happen extremely fast, with product inventories and pricing changes happening in near-realtime on many sites.

We've created a streaming analytics engine to provide buyers and product category owners at retailers market intelligence on trending products, public sentiment on those products, and demand. The output of this engine could be used to inform purchasing decisions and pricing at the retailer, and could also be integrated with personalized recommenders on an e-commerce site, allowing recommendations to customers of products that are trending in near-realtime, that match their favorite categories, and that they may not have discovered yet.
","uni":"ad3363","language":"Python 3.5, Javascript, HTML, CSS; Pyspark, HBase, pipenv, npm, Bootstrap, JQuery, Node.js, Flask; 8-core Google Compute Engine VM running Ubuntu 16.04.5","pid":"201812-44","m4uni":"","analytics":"Text / Sentiment Analysis:
TF-IDF
Tri-gram

Dimensionality reduction and Feature Selection:
Chi-Squared

Classification:
Logistic Regression
Naive Bayes

Visualization:
Bar chart
Sortable table
Wordcloud

Subsystems:
- Tweet Ingestion: This is a multiprocessing, concurrent system that ingests tweets via both Twitter’s Search API[8] and Streaming API[9]. Tweets are parsed and persisted to HBase for further processing.
- Sentiment Analysis: This is an analysis pipeline that processes all of the tweets ingested by Tweet Ingestion and classifies each as having negative or positive sentiment.
- Product Tagging: This is an analysis pipeline that classifies each tweet with a specific product name, using a probabilistic classification model built from Amazon product reviews.
- REST API: The REST API provides several endpoints for extracting data from HBase for use in our Data-Driven Visualization interface. The implementation is based on Flask[10] and SHC[11].
- Data-Driven Visualization: The Data-Driven visualization interface is a web application using Bootstrap[12], D3[13], and JQuery[14] for visualization and presentation, and Node.js[15] for web serving.","m4lname":"","industry":"Retail","m3lname":"Sulaiman","dataset":"“Amazon Customer Reviews Dataset.” [Online]. Available: https://s3.amazonaws.com/amazon-reviews-pds/readme.html. [Accessed: 22-Dec-2018]
“For Academics - Sentiment140 - A Twitter Sentiment Analysis Tool.” [Online]. Available: http://help.sentiment140.com/for-students/. [Accessed: 13-Dec-2018]
“Overview.” [Online]. Available: https://developer.twitter.com/en/docs/tweets/search/overview.html. [Accessed: 22-Dec-2018]
“Overview.” [Online]. Available: https://developer.twitter.com/en/docs/tweets/filter-realtime/overview.html. [Accessed: 22-Dec-2018]","m2uni":"jl5102","m2fname":"Jaewon","m3uni":"js5063"},{"projectname":"Generating Podcast Summaries with State-of-the-Art Language Models","timestring":"Sun Dec 19 07:13:35 2021","m1uni":"sy2953","m2lname":"","m1fname":"Siyu","m4fname":"","m1lname":"Yang","m3fname":"","description":"For this project, I will try to leverage a podcast dataset to create a model that shortens the corpus meaningfully, producing a subset that represents the most important or relevant information within the original content, which would be the summary of the episode
","uni":"sy2953","language":"Python, GCP, HuggingFace","pid":"202112-56","m4uni":"","analytics":"BERT, BART, GPT-2","m4lname":"","industry":"Media","m3lname":"","dataset":"A podcast RSS feed is what allows users to subscribe to that podcast in order to listen to it without visiting the exact website where it is located. It also updates subscribers when new episodes are uploaded.
For this project, I built an RSS parser that takes the RSS feed provided by a podcast, parses the file, finds the .mp3 files, and saves them to Google Cloud Storage.
","m2uni":"","m2fname":"","m3uni":""},{"projectname":"Back to Normality: Covid-19 Time Series Forecasting","timestring":"Sat Dec 18 03:46:49 2021","m1uni":"zj2297","m2lname":"Zhu","m1fname":"Zhiheng","m4fname":"","m1lname":"Jiang","m3fname":"Zezhong","description":"Provide a vision for the near future development of the pandemic in NYC to reduce uncertainty
","uni":"zj2297","language":"Python, Airflow, D3","pid":"202112-19","m4uni":"","analytics":"ARIMA, Vector Auto-regression, RNN","m4lname":"","industry":"Life Science","m3lname":"Fan","dataset":"NYC Health: provide case (600+ samples) and vaccination data (300+ samples)
CDC: provide more data for more complicated models (10,000+ after sampling)","m2uni":"gz2311","m2fname":"Guotian","m3uni":"zf2274"},{"projectname":"Prediction of Stock Trend with Media Sentiment Analysis ","timestring":"Sat Dec 22 02:56:43 2018","m1uni":"qw2264","m2lname":"Ma","m1fname":"Qinyuan","m4fname":"","m1lname":"Wei","m3fname":"Long","description":"Stock prices change every day. Many factors may influence a stock’s price, such as news releases on earnings and profits, the introduction of a new product, or a change of management. From the perspective of behavioral economics, the emotions and moods of individuals affect their decision-making process, leading to a direct correlation between “public sentiment” and “market sentiment”. So, if we know the “public sentiment”, we can estimate the “market sentiment” and then predict the stock market.
Twitter, a well-known social media platform, is a place where people talk publicly about their sentiment. With today’s technology tools, we can perform sentiment analysis on tweet text to extract people’s sentiment about a given topic. So, the problem we would like to solve is to build a model that analyzes Twitter data and tries to predict stock prices from the sentiment data.
","uni":"qw2264","language":"Python 3.6, Scala; jupyter notebook, Spark shell on Linux 16.0.4","pid":"201812-30","m4uni":"","analytics":"Analytics: Sentiment Analysis, machine learning on big data to make prediction;
Algorithms: Linear Regression, random forest, Multilayer perceptron (MLP);
System modules: Data pre-processing, data processing, data visualization (each module in one Jupyter notebook);
Visualization: Twitter wordclouds, web application, candlestick chart for stock prices, and most-related-words graphs based on D3 and Node.js;
","m4lname":"","industry":"Finance","m3lname":"Jiao","dataset":"The dataset we use is the twitter7 dataset which is a collection of 476 million tweets collected between June to December in 2009. The size of this dataset is about 25GB and comes from Stanford Large Network Data Collection (SNAP). It includes 17,069,982 users, 476,553,560 tweets, 181,611,080 URLs, 49,293,684 Hashtags and 71,835,017 retweets.","m2uni":"jm4750","m2fname":"Jiliang","m3uni":"lj2463"}]});