procDatasetInfo({"datasets":[{"timestring":"Tue Sep 22 16:21:10 2014","uni":"sy2518","groups":"Finance","url":"http://www.proshares.com/resources/etf_data_downloads.html","datasetname":"History data of proshared ETFs","description":"This webpage offers historical ETF data as .txt files, which may be helpful to students in the Finance group.","useterm":"public"},{"timestring":"Tue Sep 22 16:31:23 2014","uni":"sy2518","groups":"Finance","url":"http://www.livecharts.co.uk/historicaldata.php","datasetname":"History currency exchange rate data","description":"This page offers free downloads of historical exchange rate data. ","useterm":"public"},{"timestring":"Tue Sep 23 08:21:58 2014","uni":"pd2049","groups":"Energy-Transportation-Industry","url":"http://","datasetname":"NYC TLC 2013","description":"Available for academic use under the NYC OpenFOIL program. I can share as needed with students. Contains information about every metered taxicab ride in NYC in 2013.","useterm":"public"},{"timestring":"Tue Sep 23 08:31:12 2014","uni":"pd2049","groups":"Energy-Transportation-Industry","url":"http://","datasetname":"NYC TLC 2012","description":"Available for academic use under the NYC OpenFOIL program. I can share as needed with students. 
Contains information about every metered taxicab ride in NYC in 2012.","useterm":"public"},{"timestring":"Tue Sep 23 08:42:24 2014","uni":"pd2049","groups":"Energy-Transportation-Industry","url":"http://njtpa.org/Data-Maps/Surveys/Household-Travel-Survey/2010-11-Regional-Household-Travel-Survey-Data-Set.aspx","datasetname":"Regional Household Transportation Survey (NYC)","description":"RHTS - regional household transportation survey for NYC metropolitan area.","useterm":"public"},{"timestring":"Tue Sep 23 16:21:58 2014","uni":"jh3478","groups":"Media","url":"http://www.imdb.com/interfaces","datasetname":"IMDB databse","description":"The plain text database file from IMDB","useterm":"public"},{"timestring":"Tue Sep 23 16:21:58 2014","uni":"bas2226","groups":"Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"https://data.sfgov.org/browse?limitTo=datasets&utf8=%E2%9C%93","datasetname":"DataSF","description":"Welcome to DataSF! DataSF is the central clearinghouse for data published by the City and County of San Francisco","useterm":"public"},{"timestring":"Tue Sep 23 16:23:21 2014","uni":"bas2226","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"https://www.data.gov/","datasetname":"Data.gov","description":"The home of the U.S. 
Government’s open data Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more.","useterm":"public"},{"timestring":"Tue Sep 23 16:24:45 2014","uni":"bas2226","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"http://datahub.io/","datasetname":"DataHub","description":"The Datahub provides free access to many of CKAN's core features, letting you search for data, register published datasets, create and manage groups of datasets, and get updates from datasets and groups you're interested in. You can use the web interface or, if you are a programmer needing to connect the Datahub with another app, the CKAN API.","useterm":"public"},{"timestring":"Tue Sep 23 16:27:00 2014","uni":"bas2226","groups":"Finance, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government","url":"https://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1","datasetname":"Open Data (Amazon)","description":"Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications.","useterm":"public"},{"timestring":"Tue Sep 23 16:30:32 2014","uni":"bas2226","groups":"Finance, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government","url":"http://databib.org/","datasetname":"DataBib","description":"Databib is a searchable catalog / registry / directory / bibliography of research data repositories. 
","useterm":"public"},{"timestring":"Tue Sep 23 22:38:51 2014","uni":"jh3478","groups":"Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government","url":"http://www.kaggle.com/home","datasetname":"Various dataset for competition","description":"African soil data, Movie reviews, Forest cover, ...","useterm":"public"},{"timestring":"Wed Sep 24 17:09:54 2014","uni":"mw2969","groups":"Finance","url":"http://www.globalfinancialdata.com/platform/search.aspx?db=worldbank","datasetname":"Exchange rate dataset","description":"Global exchange rate dataset provided by GFD(Global Financial Dataset)","useterm":"public"},{"timestring":"Wed Sep 24 19:50:31 2014","uni":"hf2286","groups":"Media, Information, SocialScience-Government","url":"http://yann.lecun.com/exdb/mnist/","datasetname":"MNIST","description":"The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. 
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting.","useterm":"public"},{"timestring":"Wed Sep 24 20:03:33 2014","uni":"ss4716","groups":"Media","url":"http://stats.nba.com/players.html","datasetname":"NBA Player Data Set","description":"Comprehensive data of all the active and historical players in NBA.","useterm":"public"},{"timestring":"Thu Sep 25 09:54:49 2014","uni":"ss4609","groups":"Finance, Retail, Media, LifeScience, SocialScience-Government","url":"https://dev.twitter.com/streaming/public","datasetname":"Twitter Stream","description":"Public stream data from twitter filtered for news sources.","useterm":"public"},{"timestring":"Thu Sep 25 10:11:36 2014","uni":"yq2158","groups":"Finance","url":"http://finance.yahoo.com/stock-center/#mkt-movers","datasetname":"Yahoo Stock Dataset","description":"It provides real-time stock values which are available to the public","useterm":"public"},{"timestring":"Thu Sep 25 10:44:16 2014","uni":"rh2648","groups":"Information","url":"http://www.face-rec.org/databases/","datasetname":"Face Recognition","description":"Face Recognition homepage ( http://www.face-rec.org/databases/ ) is a website that contains the links for several databases. We will review every database and decide which one to choose based on the information the pictures in each database provides. Elva - There is another link in this point regarding the database information. This database is public. You can copy the same description in the “Dataset Description” part. 
","useterm":"public"},{"timestring":"Thu Sep 25 12:00:59 2014","uni":"yd2302","groups":"Finance, Retail","url":"http://www.scaleunlimited.com/datasets/public-datasets/","datasetname":"Public datasets","description":"Including every type of datasets.","useterm":"public"},{"timestring":"Thu Sep 25 17:54:22 2014","uni":"cl300","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"http://webscope.sandbox.yahoo.com/catalog.php","datasetname":"Yahoo Labs Datasets","description":"53 Big Data datasets for research purpose: -- Language Data -- Graph and Social Data -- Ratings and Classification Data -- Advertising and Market Data -- Competition Data -- Computing Systems Data -- Image Data ","useterm":"public"},{"timestring":"Thu Sep 25 18:00:27 2014","uni":"cl300","groups":"Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government","url":"http://stat-computing.org/dataexpo/","datasetname":"Statistical Computing Datasets","description":"2013: Soul of the Community 2011: Deepwater horizon oil spill 2009: Airline on time data 2006: NASA meteorological data. Electronic copy of entries 1997: Hospital Report Cards 1995: U.S. Colleges and Universities 1993: Oscillator time series & Breakfast Cereals 1991: Disease Data for Public Health Surveillance 1990: King Crab Data 1988: Baseball 1986: Geometric Features of Pollen Grains 1983: Automobiles ","useterm":"public"},{"timestring":"Thu Sep 25 18:05:26 2014","uni":"cl300","groups":"Finance","url":"http://compbio.cs.uic.edu/data/bitcoin/","datasetname":"Bitcoin dataset","description":"Bitcoin is a software-based online payment system. Payments are recorded in a public ledger using its own unit of account, which is also called bitcoin.Payments work peer-to-peer without a central repository or single administrator, which has led the US Treasury to call bitcoin a decentralized virtual currency. 
Although its status as a currency is disputed, media reports often refer to bitcoin as a cryptocurrency or digital currency.","useterm":"public"},{"timestring":"Thu Sep 25 20:03:42 2014","uni":"jh3478","groups":"Media","url":"http://www.baseball-databank.org/","datasetname":"baseball records","description":"comprehensive baseball databases.","useterm":"public"},{"timestring":"Thu Sep 25 21:42:38 2014","uni":"ww2339","groups":"Information, SocialScience-Government, Telecom","url":"https://snap.stanford.edu/data/","datasetname":"Stanford Large Network Dataset Collection","description":"Social networks : online social networks, edges represent interactions between people Networks with ground-truth communities : ground-truth network communities in social and information networks Communication networks : email communication networks with edges representing communication Citation networks : nodes represent papers, edges represent citations Collaboration networks : nodes represent scientists, edges represent collaborations (co-authoring a paper) Web graphs : nodes represent webpages and edges are hyperlinks Amazon networks : nodes represent products and edges link commonly co-purchased products Internet networks : nodes represent computers and edges communication Road networks : nodes represent intersections and edges roads connecting the intersections Autonomous systems : graphs of the internet Signed networks : networks with positive and negative edges (friend/foe, trust/distrust) Location-based online social networks : Social networks with geographic check-ins Wikipedia networks and metadata : Talk, editing and voting data from Wikipedia Twitter and Memetracker : Memetracker phrases, links and 467 million Tweets Online communities : Data from online communities such as Reddit and Flickr Online reviews : Data from online review systems such as BeerAdvocate and Amazon ","useterm":"public"},{"timestring":"Mon Oct 6 09:38:17 2014","uni":"tma2131","groups":"Finance, Retail, 
Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"http://www.gdeltproject.org","datasetname":"GDELT Project","description":"Taken from GDELT page: http://www.gdeltproject.org/data.html#rawdatafiles Event Database: The GDELT Event Database contains over a quarter-billion records organized into a set of tab-delimited files by date. Through March 31, 2013 records are stored in monthly and yearly files by the date the event took place. Beginning with April 1, 2013, files are created daily and records are stored by the date the event was found in the world's news media rather than the date it occurred (97%+ of events are reported within 24 hours of happening, but a small number of events each day are past events being mentioned for the first time - if an event has been seen before it will not be included again). Files are ZIP compressed in tab delimited format, but named with a \".CSV\" extension to address some software packages that will not accept .TXT or .TSV files. Knowledge Graph: The GDELT Global Knowledge Graph begins April 1, 2013 and consists of two parallel data streams, one encoding the entire knowledge graph with all of its fields, and the other encoding only the subset of the graph that records \"counts\" of a set of predefined categories like number of protesters, number killed, or number displaced or sickened. Such counts may occur independently of the CAMEO events in the primary GDELT event stream, such as mentions of those killed in industrial accidents (which are not captured in CAMEO) or those displaced by a natural disaster or sickened by a disease epidemic. In this way, the GKG Counts File can be used to produce a daily \"Death Tracker\" to map all mentions of death across the world each day, or an \"Affected Tracker\" to indicate how many persons were sickened/displaced/stranded each day (at least as recorded in the global news media). 
These files are named as \"YYYYMMDD.gkg.csv.zip\" and posted by 6AM EST each morning seven days a week. The second file is the full graph file, which contains the actual graph connecting all persons, organizations, locations, emotions, themes, counts, events, and sources together each day. It also contains a list of the EventIDs of each event found in the same article as the extracted information, allowing rich contextualization of events. These files are named as \"YYYYMMDD.gkgcounts.csv.zip\" and posted by 6AM EST each morning seven days a week. The Global Knowledge Graph is currently in \"alpha\" release and may change over time as we introduce new capabilities and expand its underlying algorithms.","useterm":"public"},{"timestring":"Thu Nov 6 19:24:34 2014","uni":"tkp2108","groups":"Finance","url":"http://www.histdata.com/download-free-forex-data/","datasetname":"Intraday forex prices","description":"Intraday forex prices, important ones (usd/eur/jpy/chd) since 2000. ","useterm":"public"},{"timestring":"Thu Nov 6 19:27:17 2014","uni":"tkp2108","groups":"Finance","url":"http://www.fxstreet.com/economic-calendar/","datasetname":"Historical financial events","description":"Various announcement dates, etc. ","useterm":"public"},{"timestring":"Mon Nov 10 14:42:46 2014","uni":"yl3199","groups":"Finance","url":"http://www.google.com/finance/getprices?i=[PERIOD]&p=[DAYS]d&f=d,o,h,l,c,v&df=cpct&q=[TICKER]","datasetname":"stock dataset","description":" [PERIOD] defines interval or frequency in seconds, [DAYS] denotes the historical data period, where \"10d\" means that we need historical stock prices data for the past 10 days and [TICKER] determines the ticker symbol of the stock. 
Forming the dataset can get the stock price in certain period.","useterm":"public"},{"timestring":"Wed Nov 19 14:06:21 2014","uni":"mz2417","groups":"Retail","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp Dataset Challenge","description":"The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh: 42,153 businesses 320,002 business attributes 31,617 check-in sets 252,898 users 955,999 edge social graph 403,210 tips 1,125,458 reviews","useterm":"public"},{"timestring":"Wed Nov 19 14:08:40 2014","uni":"mz2417","groups":"Retail","url":"http://myleott.com/op_spam/","datasetname":"Deceptive Opinion Spam Dataset","description":"400 truthful positive reviews from TripAdvisor (described in [1]) 400 deceptive positive reviews from Mechanical Turk (described in [1]) 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp (described in [2]) 400 deceptive negative reviews from Mechanical Turk (described in [2]) References [1] M. Ott, Y. Choi, C. Cardie, and J.T. Hancock. 2011. Finding Deceptive Opinion Spam by Any Stretch of the Imagination. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [2] M. Ott, C. Cardie, and J.T. Hancock. 2013. Negative Deceptive Opinion Spam. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.","useterm":"public"},{"timestring":"Wed Nov 19 20:43:33 2014","uni":"hx2168","groups":"LifeScience","url":"http://bluebuttonconnector.healthit.gov/developers/","datasetname":"Blue Button Connector ","description":"The Blue Button Connector is a website that helps people get started finding out about Blue Button and where they might be able to get their own health information online. 
This web site is about transparency—as it continues to get populated it aims to let people know what is and is not available today. The site also helps people find applications (apps) and tools that are available to manage and make use of their electronic health data. The Blue Button Connector is not a repository or government database that houses patients' health information or pulls that information from various sources into a single place. The Blue Button Connector also helps developers find out what type of electronic health data is being shared with people so they can build applications (apps) and tools that utilize that health data to help consumers better understand and use their health information. It’s a tool for use by multiple parties that creates transparency about the increasing availability of structured health data and where organizations are along the continuum of adopting and implementing ways that patients can access and use their health data online.","useterm":"public"},{"timestring":"Wed Nov 19 20:47:34 2014","uni":"hx2168","groups":"Retail","url":"http://www.programmableweb.com/category/food/apis?category=20048","datasetname":"food data","description":"Various APIs, you can also find other categories","useterm":"public"},{"timestring":"Mon Dec 8 17:18:11 2014","uni":"to2232","groups":"LifeScience","url":"https://tcga-data.nci.nih.gov/tcga/","datasetname":"The Cancer Genome Atlas","description":"The Cancer Genome Atlas (TCGA) Data Portal provides a platform for researchers to search, download, and analyze data sets generated by TCGA. It contains clinical information, genomic characterization data, and high level sequence analysis of the tumor genomes. 
","useterm":"public"},{"timestring":"Mon Dec 8 21:51:15 2014","uni":"cz2321","groups":"Telecom","url":"https://aws.amazon.com/datasets/41740","datasetname":"Common Crawl Corpus","description":"Common Crawl is a non-profit organization dedicated to providing an open repository of web crawl data that can be accessed and analyzed by everyone. The most current crawl data sets includes three different types of files: Raw Content, Text Only, and Metadata. The data sets from before 2012 contain only Raw Content files.","useterm":"public"},{"timestring":"Mon Dec 8 22:04:35 2014","uni":"sz2476","groups":"Finance, Retail","url":"http://www.icpsr.umich.edu/icpsrweb/ICPSR/studies/34441?q=34441&searchSource=revise","datasetname":"Consumer Expenditure Survey, 2011: Interview Survey and Detailed Expenditure Files (ICPSR 34441)","description":"The Consumer Expenditure Survey (CE) program provides a continuous and comprehensive flow of data on the buying habits of American consumers including data on their expenditures, income, and consumer unit (families and single consumers) characteristics. These data are used widely in economic research and analysis, and in support of revisions of the Consumer Price Index.","useterm":"public"},{"timestring":"Mon Dec 8 22:33:39 2014","uni":"rw2526","groups":"LifeScience","url":"https://www.cs.purdue.edu/commugrate/data_access/all_data_sets_more.php?search_fd0=87","datasetname":"ICML 2004 Physiological dataset","description":"This data set was made available for the Physiological Data Modeling Contest at ICML 2004. The data was collected from subjects using BodyMedia wearable body monitors while performing their usual activities. These monitors record acceleration, heat flux, galvanic skin response, skin temperature, and near-body temperature. The training data set includes several sessions for each of multiple subjects, with measurements stored each minute during a session. 
The test data set includes further sessions from the same subjects, as well as sessions recording measurements from new subjects who did not feature in the training data. Each record in the data includes an annotation code giving information about the kind of activity that the subject was performing at that time.","useterm":"public"},{"timestring":"Mon Dec 8 22:47:00 2014","uni":"zh2210","groups":"LifeScience","url":"http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/","datasetname":"fMRI Images","description":"fMRI data used for studying brain activity","useterm":"public"},{"timestring":"Mon Dec 8 23:09:44 2014","uni":"yz2575","groups":"Media, Information","url":"http://www.cp.jku.at/datasets/MMTD/","datasetname":"Million Music Tweet Dataset","description":"The data set contains listening histories inferred from microblogs. Each listening event identified via twitter-id and user-id is annotated with temporal (date, time, weekday, timezone), spatial (longitude, latitude, continent, country, county, state, city), and contextual (information on the country) information. In addition, pointers to artist and track are provided as a matter of course. 
Moreover, the data includes references to other music-related platforms (musicbrainz, 7digital, amazon).","useterm":"public"},{"timestring":"Tue Dec 9 01:48:19 2014","uni":"rjb2150","groups":"Information","url":"http://hoflink.com/~npollock/chess.html","datasetname":"chess games dataset","description":"A corpus of chess games in the public domain, compiled by Norman Pollock, available here: http://hoflink.com/~npollock/chess.html dataset id: gm2006.pgn number of games: 74,726 number of players: 1,227 minimum player Elo rating: 2475 years included: 2006 - 2014 gameplay restrictions: no blitz or correspondence games","useterm":"public"},{"timestring":"Wed Dec 10 20:23:24 2014","uni":"yc2911","groups":"Media, Information, SocialScience-Government","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp 2014 Challenge Dataset","description":"The Challenge Dataset includes data from Phoenix, Las Vegas, Madison, Waterloo and Edinburgh: 42,153 businesses 320,002 business attributes 31,617 check-in sets 252,898 users 955,999 edge social graph 403,210 tips 1,125,458 reviews","useterm":"public"},{"timestring":"Wed Dec 10 20:29:04 2014","uni":"yq2158","groups":"Finance","url":"http://finance.yahoo.com/","datasetname":"Yahoo Finance Dataset","description":"To get both historical and real-time stock data for free","useterm":"public"},{"timestring":"Wed Dec 10 21:48:49 2014","uni":"yy2496","groups":"Finance","url":"http://www.proshares.com/resources/etf_data_downloads.html","datasetname":"Historical HAVs for All ProShares ETFs","description":"Information for specific ProShares ETFs is available by clicking on the \"NAV History\" link from any product page. 
Obtain historical NAVs, NAV change (both dollar and percentage change), and shares outstanding.","useterm":"public"},{"timestring":"Wed Dec 10 22:56:06 2014","uni":"aoa2124","groups":"Energy-Transportation-Industry","url":"https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption","datasetname":"Individual Household Electric Power Consumption","description":"1.date: Date in format dd/mm/yyyy 2.time: time in format hh:mm:ss 3.global_active_power: household global minute-averaged active power (in kilowatt) 4.global_reactive_power: household global minute-averaged reactive power (in kilowatt) 5.voltage: minute-averaged voltage (in volt) 6.global_intensity: household global minute-averaged current intensity (in ampere) 7.sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered). 8.sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light. 9.sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.","useterm":"public"},{"timestring":"Wed Dec 10 23:13:53 2014","uni":"hz2361","groups":"Media, Information","url":"http://snap.stanford.edu/data/web-Movies.html ; http://webscope.sandbox.yahoo.com/catalog.php?datatype=r","datasetname":"movie_review, movie_test, movie_train","description":"Two datasets is used in the project. One database is from Amazon movie review center, which contains 7911684 movie reviews. Each review contains movie name, movie rating score, date of the review and comments. That data is public. The other dataset is from Yahoo lab. The dataset contains the rating of different movies from various customers. It contains train text and test text files. 
Test text files are gathered chronologically after the training data. In the train data there are 211231 ratings from 7642 customers covering 11915 movies. In the test data there are 10136 ratings. This one is restricted.","useterm":"restricted"},{"timestring":"Thu Dec 11 00:03:20 2014","uni":"sh3246","groups":"Finance","url":"http://finance.yahoo.com","datasetname":"Yahoo Finance Tick Data","description":"Tick data for S&P 500 Format : http://chartapi.finance.yahoo.com/instrument/1.0/[TICKER]/chartdata;type=quote;range=1d/csv Example: http://chartapi.finance.yahoo.com/instrument/1.0/GOOG/chartdata;type=quote;range=1d/csv [TICKER]: This is the ticker symbol of the security ","useterm":"public"},{"timestring":"Thu Dec 11 01:41:01 2014","uni":"rg2936","groups":"Retail, Media, Information","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp","description":"http://www.yelp.com/dataset_challenge","useterm":"public"},{"timestring":"Mon Dec 22 21:20:26 2014","uni":"hz2361","groups":"Media","url":"http://snap.stanford.edu/data/web-Movies.html","datasetname":"Amazon movie comments dataset","description":"Database is from Amazon movie review center, which contains 7911684 movie reviews. Each review contains movie name, movie rating score, helpfulness, date of the review and comments. ","useterm":"public"},{"timestring":"Mon May 18 16:08:51 2015","uni":"dl2943","groups":"Media, Information, SocialScience-Government","url":"http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010","datasetname":"AFINN-111","description":"AFINN is a list of English words rated for valence with an integer between -5 and 5. The words have been manually labeled by Finn Arup Nielsen in 2009-2011. The file is tab-separated. And AFINN-111 is the newest version with 2477 words and phrases. 
","useterm":"public"},{"timestring":"Mon May 18 19:20:43 2015","uni":"ss4609","groups":"Finance","url":"http://twitter.com","datasetname":"Twitter","description":"Twitter data","useterm":"public"},{"timestring":"Mon May 18 22:55:57 2015","uni":"mt2994","groups":"Finance","url":"https://dev.twitter.com/rest/public","datasetname":"Twitter dataset","description":"The REST APIs provide programmatic access to read and write Twitter data. Author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON. If your intention is to monitor or process Tweets in real-time, consider using the Streaming API instead.","useterm":"public"},{"timestring":"Tue May 19 03:13:36 2015","uni":"ph2439","groups":"Media, Information","url":"http://sail.usc.edu/iemocap/","datasetname":"IEMOCAP","description":"IEMOCAP (Interactive Emotional Dyadic Motion Capture) database contains audio, transcriptions, video, and motion-capture recordings of mixed gender pairs of actors. There are five sessions (ten actors total) in the database. Each short voice source is labeled by three human annotators using either dimensional or categorical labels. ","useterm":"public"},{"timestring":"Tue May 19 06:33:09 2015","uni":"ly2324","groups":"Media","url":"http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/","datasetname":"Oxford Building Image Set","description":"Some in-campus photos in Oxford","useterm":"public"},{"timestring":"Tue May 19 17:17:50 2015","uni":"sm3891","groups":"Information","url":"http://www.wordfrequency.info/sample.asp","datasetname":"Shortsample wordlist + genre frequency wordfrequency.info Corpus of Contemporary American English","description":"Wordlist + genre frequency Frequency in five main genres: spoken, fiction, popular magazine, newspaper, academicFrequency in each of the 40+ sub-genres (e.g. 
MAG-Sports, NEWS-Financial, ACAD-Medicine) With the frequency data for specific genres and sub-genres, you can create customized wordlists for specific purposes: medical, technology, sports, etc.","useterm":"public"},{"timestring":"Thu Dec 17 15:45:34 2015","uni":"sd2810","groups":"Media","url":"http://www2.informatik.uni-freiburg.de/~cziegler/BX/","datasetname":"Book-Crossing DataSet","description":"Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books The Book-Crossing dataset comprises 3 tables. BX-Users Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values. BX-Books Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site. BX-Book-Ratings Contains the book rating information. 
Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0.","useterm":"public"},{"timestring":"Thu Dec 17 17:01:30 2015","uni":"xl2523","groups":"Media","url":"http://ai.stanford.edu/~amaas/data/sentiment/","datasetname":"Stanford AI Lab Large Movie Review Dataset","description":"This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.","useterm":"public"},{"timestring":"Thu Dec 17 18:21:38 2015","uni":"qs2147","groups":"Retail","url":"https://www.kaggle.com/c/otto-group-product-classification-challenge/data","datasetname":"Otto group products information","description":"Each row corresponds to a single product. There are a total of 93 numerical features, which represent counts of different events. All features have been obfuscated and will not be defined any further. There are nine categories for all products. Each target category represents one of our most important product categories (like fashion, electronics, etc.). The products for the training and testing sets are selected randomly. ","useterm":"public"},{"timestring":"Thu Dec 17 18:57:04 2015","uni":"tj2330","groups":"Media","url":"https://webscope.sandbox.yahoo.com/download.php?r=26189&d=","datasetname":"Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0","description":"The dataset of this project is downloaded from Yahoo Webscope Dataset: Yahoo! Movies User Ratings and Descriptive Content Information, v.1.0. 
There are many movie rating datasets on the Internet; we chose this dataset because, besides movie id, user id and rating, it provides more information about the users: their gender and age. This additional information gives us more options for analyzing the dataset. The dataset is tested both in its original form and after classification. Here is a brief description of the dataset: (1) \"ydata-ymovies-user-movie-ratings-train-v1_0.txt\" contains a small sample of Yahoo! users' ratings of movies, with the following fields: 0 anonymized user_id 1 movie_id 2 rating(from 1(F) to 13(A+)) 3 converted rating(from 1 to 5: A-,A, A+ will be converted to 5) (2) \"ydata-ymovies-user-demographics-v1_0.txt\" contains user demographic information, with the following fields: 0 anonymized user_id 1 birthyear 2 gender (3) \"ydata-ymovies-mapping-to-eachmovie-v1_0.txt\" contains a mapping from the movie ids used in this Yahoo! Movies dataset to the corresponding movie ids and titles used in the EachMovie dataset. The mapping may be incomplete or incorrect. The EachMovie dataset was created by the Digital Equipment Corporation's Systems Research Center and is not associated with Yahoo! or available via Yahoo!. The file contains the following fields: 0 yahoo_movie_id 1 movie title 2 eachmovie_movie_id ","useterm":"restricted"},{"timestring":"Thu Dec 17 19:16:44 2015","uni":"ak3808","groups":"Finance","url":"http://quantquote.com/files/quantquote_daily_sp500_83986.zip","datasetname":"QuantQuote Free Historical Stock Data","description":"This dataset contains historical stock tick quotes for the stocks listed on the S&P 500 index","useterm":"public"},{"timestring":"Thu Dec 17 19:49:27 2015","uni":"sr3254","groups":"Information","url":"https://www.eventbrite.com/d/ny--new-york/events/","datasetname":"Events around New York on eventbrite","description":"The dataset is fetched from https://www.eventbrite.com/d/ny--new-york/events/ with a crawler. 
It contains The events available around New York and some attributes of the event like time, date, cost, etc...","useterm":"public"},{"timestring":"Thu Dec 17 20:16:00 2015","uni":"st2957","groups":"SocialScience-Government","url":"https://www.kaggle.com/c/sf-crime/data","datasetname":"Crime records from SFPD Crime Incident Reporting system","description":"This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set. Data fields: Dates - timestamp of the crime incident Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict. Descript - detailed description of the crime incident (only in train.csv) DayOfWeek - the day of the week PdDistrict - name of the Police Department District Resolution - how the crime incident was resolved (only in train.csv) Address - the approximate street address of the crime incident X - Longitude Y - Latitude","useterm":"public"},{"timestring":"Thu Dec 17 20:20:06 2015","uni":"sp3290","groups":"Media, Information, SocialScience-Government","url":"http://www.itl.nist.gov/iad/humanid/feret/feret_master.html","datasetname":"The Color Facial Recognition Technology (FERET) Database","description":"The FERET image corpus was assembled to support government monitored testing and evaluation of face recognition algorithms using standardized tests and procedures. The final corpus, presented here, consists of 14051 eight-bit grayscale images of human heads with views ranging from frontal to left and right profiles.","useterm":"restricted"},{"timestring":"Thu Dec 17 20:24:43 2015","uni":"sp3290","groups":"Media, Information, SocialScience-Government","url":"http://www.vision.caltech.edu/html-files/archive.html","datasetname":"Caltech Background Dataset","description":"Background image dataset. 
Collected by Markus Weber at California Institute of Technology. ","useterm":"public"},{"timestring":"Thu Dec 17 20:27:44 2015","uni":"sp3290","groups":"Media, Information, SocialScience-Government","url":"http://vasc.ri.cmu.edu/idb/images/face/frontal_images/images.html","datasetname":"CMU/VASC Image Database","description":"The image dataset is used by the CMU Face Detection Project and is provided for evaluating algorithms for detecting frontal views of human faces. This particular test set was originally assembled as part of work in Neural Network Based Face Detection. It combines images collected at CMU and MIT.","useterm":"public"},{"timestring":"Thu Dec 17 21:04:33 2015","uni":"dy2307","groups":"Retail","url":"http://www.irc.atr.jp/crest2010_HRI/ATC_dataset/","datasetname":"ATC pedestrian tracking dataset","description":"The data we used in the following analysis were obtained from the project primary purposed in enabling mobile social robots to work in public spaces , founded by JST/CREST in japan [1] The data provided here was collected between October 24, 2012 and November 29, 2013. The data collection was done every week on Wednesday and Sunday, from morning until evening (9:40-20:20). The dataset consists of 92 days in total. The data is provided as CSV files, one file for each day (file names are in the format atc-YYYYMMDD.csv). Each row in a CSV file corresponds to a single tracked person at a single instant, and it contains the following fields: time [ms] (unixtime + milliseconds/1000), person id, position x [mm], position y [mm], position z (height) [mm], velocity [mm/s], angle of motion [deg], facing angle [deg] Reference: [1] D. Brscic, T. Kanda, T. Ikeda, T. Miyashita, \"Person position and body direction tracking in large public spaces using 3D range sensors\", IEEE Transactions on Human-Machine Systems, Vol. 43, No. 6, pp. 
522-534, 2013","useterm":"public"},{"timestring":"Thu Dec 17 21:14:38 2015","uni":"zs2262","groups":"Media","url":"http://crcv.ucf.edu/data/UCF101.php","datasetname":"UCF101","description":"UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories.","useterm":"public"},{"timestring":"Thu Dec 17 21:14:51 2015","uni":"pz2209","groups":"Media","url":"http://www.netflixprize.com","datasetname":"NetFlix movie rating data set","description":"The movie rating files contain over 100 million ratings from 480 thousand randomly-chosen, anonymous Netflix customers over 17 thousand movie titles. The data were collected between October, 1998 and December, 2005 and reflect the distribution of all ratings received during this period. The ratings are on a scale from 1 to 5 (integral) stars. To protect customer privacy, each customer id has been replaced with a randomly-assigned id. The date of each rating and the title and year of release for each movie id are also provided.","useterm":"public"},{"timestring":"Thu Dec 17 21:21:19 2015","uni":"ts2957","groups":"Retail, Information","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp dataset","description":"Dataset from Yelp Dataset Challenge in 2016, including users, businesses, and reviews related information. 
","useterm":"public"},{"timestring":"Thu Dec 17 21:53:42 2015","uni":"ar3579","groups":"Energy-Transportation-Industry, Information, SocialScience-Government","url":"https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95","datasetname":"NYC OPEN DATA: NYPD Motor Vehicle Collisions","description":"Details of Motor Vehicle Collisions in New York City provided by the Police Department (NYPD)","useterm":"public"},{"timestring":"Thu Dec 17 22:12:11 2015","uni":"cz2393","groups":"Finance","url":"http://data.un.org/Data.aspx?q=gdp&d=SNAAMA&f=grID%3a101%3bcurrID%3aUSD%3bpcFlag%3a0;","datasetname":"GDP by Type of Expenditure at current prices - US dollars","description":"GDP by Type of Expenditure at current prices - US dollars","useterm":"public"},{"timestring":"Thu Dec 17 22:12:57 2015","uni":"cz2393","groups":"SocialScience-Government","url":"http://data.un.org/Data.aspx?q=student&d=UNESCO&f=series%3aED_FSOABS","datasetname":"Students from a given country studying abroad (outbound mobile students)","description":"Students from a given country studying abroad (outbound mobile students)","useterm":"public"},{"timestring":"Thu Dec 17 22:22:55 2015","uni":"ar3390","groups":"Information","url":"https://collegescorecard.ed.gov/data/","datasetname":"CollegeFactSet","description":"Data Set which has key statistics to every major university in the US, based on a variety of parameters for admissions","useterm":"public"},{"timestring":"Thu Dec 17 22:28:27 2015","uni":"yx2318","groups":"Finance, Retail, Media, Information, LifeScience","url":"http://www2.informatik.uni-freiburg.de/~cziegler/BX/","datasetname":"Book-Crossing Dataset","description":" Collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community with kind permission from Ron Hornbaker, CTO of Humankind Systems. 
Contains 278,858 users (anonymized but with demographic information) providing 1,149,780 ratings (explicit / implicit) about 271,379 books. The Book-Crossing dataset comprises 3 tables. BX-Users Contains the users. Note that user IDs (`User-ID`) have been anonymized and map to integers. Demographic data is provided (`Location`, `Age`) if available. Otherwise, these fields contain NULL-values. BX-Books Books are identified by their respective ISBN. Invalid ISBNs have already been removed from the dataset. Moreover, some content-based information is given (`Book-Title`, `Book-Author`, `Year-Of-Publication`, `Publisher`), obtained from Amazon Web Services. Note that in case of several authors, only the first is provided. URLs linking to cover images are also given, appearing in three different flavours (`Image-URL-S`, `Image-URL-M`, `Image-URL-L`), i.e., small, medium, large. These URLs point to the Amazon web site. BX-Book-Ratings Contains the book rating information. Ratings (`Book-Rating`) are either explicit, expressed on a scale from 1-10 (higher values denoting higher appreciation), or implicit, expressed by 0. ","useterm":"public"},{"timestring":"Thu Dec 17 22:45:36 2015","uni":"yo2265","groups":"Retail, Information","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp Dataset Challenge","description":"business { 'type': 'business', 'business_id': (encrypted business id), 'name': (business name), 'neighborhoods': [(hood names)], 'full_address': (localized address), 'city': (city), 'state': (state), 'latitude': latitude, 'longitude': longitude, 'stars': (star rating, rounded to half-stars), 'review_count': review count, 'categories': [(localized category names)] 'open': True / False (corresponds to closed, not business hours), 'hours': { (day_of_week): { 'open': (HH:MM), 'close': (HH:MM) }, ... }, 'attributes': { (attribute_name): (attribute_value), ... 
}, } review { 'type': 'review', 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'stars': (star rating, rounded to half-stars), 'text': (review text), 'date': (date, formatted like '2012-03-14'), 'votes': {(vote type): (count)}, } user { 'type': 'user', 'user_id': (encrypted user id), 'name': (first name), 'review_count': (review count), 'average_stars': (floating point average, like 4.31), 'votes': {(vote type): (count)}, 'friends': [(friend user_ids)], 'elite': [(years_elite)], 'yelping_since': (date, formatted like '2012-03'), 'compliments': { (compliment_type): (num_compliments_of_this_type), ... }, 'fans': (num_fans), } ","useterm":"public"},{"timestring":"Thu Dec 17 23:07:56 2015","uni":"zw2289","groups":"Information","url":"http://https://www.yelp.com/dataset_challenge/dataset","datasetname":"Yelp challenge cup dataset","description":"It is yelp challenge 2015 dataset which provide information about reviews of different shops in different cities","useterm":"restricted"},{"timestring":"Thu Dec 17 23:22:14 2015","uni":"ll2698","groups":"SocialScience-Government","url":"https://collegescorecard.ed.gov/data/","datasetname":"US College Scorecard","description":"The College Scorecard is designed to increase transparency by providing more data to help students and families compare college costs and outcomes as they weigh the trade-offs of different colleges, accounting for their own needs and educational goals.","useterm":"public"},{"timestring":"Thu Dec 17 23:27:25 2015","uni":"ab3955","groups":"LifeScience","url":"https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3","datasetname":"Inpatient Prospective Payment System IPPS Provider","description":"Each line of data gives the average price a specific hospital charged for a specific diagnosis related group and the total number of patients discharged from that hospital with that diagnosis related group.","useterm":"public"},{"timestring":"Thu Dec 17 
23:33:33 2015","uni":"xd2169","groups":"Energy-Transportation-Industry, Information","url":"https://data.cityofnewyork.us/view/n4kn-dy2y","datasetname":"Green Taxi Trip ","description":"Green Taxi Trip Information","useterm":"public"},{"timestring":"Thu Dec 17 23:34:31 2015","uni":"xd2169","groups":"Energy-Transportation-Industry, Information","url":"https://data.cityofnewyork.us/view/ba8s-jw6u","datasetname":"Yellow Taxi Trip","description":"Yellow Taxi Trip Information","useterm":"public"},{"timestring":"Thu Dec 17 23:35:38 2015","uni":"xd2169","groups":"Energy-Transportation-Industry, Information","url":"https://data.cityofnewyork.us/Transportation/Medallion- Drivers-Passenger-Assistance-Trained/td5q-ry6d ","datasetname":"Assistance_Trained_Data","description":"Driver's License information with disabled service","useterm":"public"},{"timestring":"Thu Dec 17 23:40:51 2015","uni":"pz2210","groups":"Energy-Transportation-Industry","url":"https://data.cityofnewyork.us/Public-Safety/collision/bpv4-gfc4","datasetname":"NYPD Motor Vehicle Collisions","description":"The dataset includes the details of Motor Vehicle Collisions, like location, time, type, etc. in New York City provided by the Police Department. ","useterm":"public"},{"timestring":"Thu Dec 17 23:43:54 2015","uni":"cz2351","groups":"SocialScience-Government","url":"https://www.kaggle.com/c/sf-crime/data","datasetname":"San Francisco Crime Classification","description":"This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set. Data fields: Dates - timestamp of the crime incident Category - category of the crime incident (only in train.csv). This is the target variable you are going to predict. 
Descript - detailed description of the crime incident (only in train.csv) DayOfWeek - the day of the week PdDistrict - name of the Police Department District Resolution - how the crime incident was resolved (only in train.csv) Address - the approximate street address of the crime incident X - Longitude Y - Latitude","useterm":"public"},{"timestring":"Thu Dec 17 23:46:41 2015","uni":"hp2414","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"https://dev.twitter.com/overview/documentation","datasetname":"Twitter API","description":"API of twitter","useterm":"public"},{"timestring":"Thu Dec 17 23:50:32 2015","uni":"tw2516","groups":"LifeScience","url":"http://www.ncbi.nlm.nih.gov/mesh","datasetname":"MEDLINE","description":"MEDLINE is the U.S. National Library of Medicine® (NLM) premier bibliographic database that contains more than 22 million references to journal articles in life sciences with a concentration on biomedicine.","useterm":"public"},{"timestring":"Thu Dec 17 23:55:35 2015","uni":"mc4081","groups":"Energy-Transportation-Industry","url":"http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","datasetname":"NYC Taxi and Limousine Commission","description":"This dataset includes trip records from all trips completed in yellow and green taxis in NYC in 2014 and select months of 2015. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). 
The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.","useterm":"public"},{"timestring":"Thu Dec 17 23:56:33 2015","uni":"sc3919","groups":"Information, SocialScience-Government","url":"http://www.wnyc.org/story/nyc-opens-traffic-crash-data-finally/","datasetname":"Traffic Accidents in NYC ","description":"The accidents information is compiled in the format of DATE, TIME, BOROUGH, ZIP CODE, LOCATION, CONTRIBUTING FACTORS, VEHICLE TYPE etc.","useterm":"public"},{"timestring":"Thu Dec 17 23:58:42 2015","uni":"jz2612","groups":"Media, Information","url":"http://qwone.com/~jason/20Newsgroups/","datasetname":"20 news data","description":"Please see the website description.","useterm":"public"},{"timestring":"Fri Dec 18 00:19:20 2015","uni":"jc4295","groups":"Information","url":"http://www.hikinginglacier.com/hiking-glacier-national-park.htm, http://www.hikinginthesmokys.com/difficulty.htm, http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm","datasetname":"National Park Hiking Trail Ratings","description":"This is the National Park Hiking Trail Ratings","useterm":"public"},{"timestring":"Fri Dec 18 01:23:19 2015","uni":"jc4295","groups":"Media","url":"http://www.hikinginglacier.com/hiking-glacier-national-park.htm, http://www.hikinginthesmokys.com/difficulty.htm, http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm","datasetname":"National Park Hiking Trails","description":"National Park Hiking Trails http://www.hikinginglacier.com/hiking-glacier-national-park.htm http://www.hikinginthesmokys.com/difficulty.htm http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm ","useterm":"public"},{"timestring":"Fri Dec 18 05:59:25 2015","uni":"cy2403","groups":"Information","url":"https://developer.riotgames.com/api/methods","datasetname":"Riot API","description":"Riot API will be used for collecting the dataset. 
Various APIs are available to provide data information of different aspects of the game. Code written in different programming languages can be easily found on github to utilize the API and fetch data. ","useterm":"public"},{"timestring":"Fri Dec 18 08:20:51 2015","uni":"zh2220","groups":"LifeScience","url":"https://www.kaggle.com/c/datasciencebowl","datasetname":"plankton image","description":"In total, Oregon State University’s Hatfield Marine Science Center has captured nearly 50 million plankton images over an 18-day period. This is more than 80 terabytes of data! They need your help creating an automated classification process to better understand the image contents. For this competition, Hatfield scientists have prepared a large collection of labeled images, approximately 30k of which are provided as a training set. Each raw image was run through an automatic process to extract regions of interest, resulting in smaller images that contain a single organism/entity. You must create an algorithm that assigns class probabilities to a given image. Several characteristics of this problem make this classification difficult: There are many different species, ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies. Representatives from each taxon can have any orientation within 3-D space. The ocean is replete with detritus (often decomposing plant or animal matter that scientists like to call “whale snot”) and fecal pellets that have no taxonomic identification but are important in other marine processes. Some images are so noisy or ambiguous that experts have a difficult time labeling them. Some amount of noise in the ground truth is thus inevitable. The presence of \"unknown\" classes require models to handle the special cases of unidentifiable objects. 
","useterm":"public"},{"timestring":"Fri Dec 18 08:27:58 2015","uni":"zh2220","groups":"LifeScience","url":"https://www.kaggle.com/c/datasciencebowl/data","datasetname":"plankton image","description":"In total, Oregon State University’s Hatfield Marine Science Center has captured nearly 50 million plankton images over an 18-day period. This is more than 80 terabytes of data! They need your help creating an automated classification process to better understand the image contents. For this competition, Hatfield scientists have prepared a large collection of labeled images, approximately 30k of which are provided as a training set. Each raw image was run through an automatic process to extract regions of interest, resulting in smaller images that contain a single organism/entity. You must create an algorithm that assigns class probabilities to a given image. Several characteristics of this problem make this classification difficult: There are many different species, ranging from the smallest single-celled protists to copepods, larval fish, and larger jellies. Representatives from each taxon can have any orientation within 3-D space. The ocean is replete with detritus (often decomposing plant or animal matter that scientists like to call “whale snot”) and fecal pellets that have no taxonomic identification but are important in other marine processes. Some images are so noisy or ambiguous that experts have a difficult time labeling them. Some amount of noise in the ground truth is thus inevitable. The presence of \"unknown\" classes require models to handle the special cases of unidentifiable objects. 
","useterm":"public"},{"timestring":"Fri Dec 18 10:22:31 2015","uni":"rw2611","groups":"Media","url":"http://www.sodasoccer.com/search/club/1/721/19E5DB7AC0E7303A.shtml","datasetname":"football","description":"footballgame","useterm":"public"},{"timestring":"Wed Dec 23 21:01:03 2015","uni":"cz2342","groups":"SocialScience-Government","url":"https://www.kaggle.com/c/sf-crime/download/test.csv.zip","datasetname":"san francisco crime rate","description":"This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set. ","useterm":"public"},{"timestring":"Wed Dec 23 21:24:42 2015","uni":"rw2611","groups":"SocialScience-Government","url":"http://www.sodasoccer.com/search/club/1/721/19E5DB7AC0E7303A.shtml","datasetname":"soccer data","description":"500 soccer games","useterm":"public"},{"timestring":"Wed Dec 23 23:29:54 2015","uni":"zw2327","groups":"Retail","url":"http://www.yelp.com/dataset_challenge","datasetname":"Yelp Data Set","description":"Contains Consumers' review and resturant's detail","useterm":"public"},{"timestring":"Wed Dec 23 23:32:16 2015","uni":"zw2327","groups":"Information","url":"http://jmcauley.ucsd.edu/data/amazon/links.html","datasetname":"Amazon","description":"Amazon data, including review on user base. 
Please contact Julian McAuley before using the data.","useterm":"public"},{"timestring":"Thu Dec 24 00:18:26 2015","uni":"jc4295","groups":"Media","url":"http://www.hikinginglacier.com/hiking-glacier-national-park.htm, http://www.hikinginthesmokys.com/difficulty.htm, http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm, http://www.tetonhikingtrails.com/hiking-grand-teton-national-park.htm","datasetname":"National Parks' Hiking Trails' Ratings","description":"National Parks' Hiking Trails' Ratings URL: http://www.hikinginglacier.com/hiking-glacier-national-park.htm http://www.hikinginthesmokys.com/difficulty.htm http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm http://www.tetonhikingtrails.com/hiking-grand-teton-national-park.htm ","useterm":"public"},{"timestring":"Thu Dec 24 00:30:44 2015","uni":"jc4295","groups":"Information","url":"http://www.hikinginglacier.com/hiking-glacier-national-park.htm, http://www.hikinginthesmokys.com/difficulty.htm, http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm, http://www.tetonhikingtrails.com/hiking-grand-teton-national-park.htm","datasetname":"National Parks' Hiking Trails' Ratings","description":"National Parks' Hiking Trails' Ratings URL: http://www.hikinginglacier.com/hiking-glacier-national-park.htm http://www.hikinginthesmokys.com/difficulty.htm http://www.rockymountainhikingtrails.com/hiking-rocky-mountain-national-park.htm http://www.tetonhikingtrails.com/hiking-grand-teton-national-park.htm ","useterm":"public"},{"timestring":"Thu Dec 24 02:00:17 2015","uni":"zl2406","groups":"Information","url":"https://archive.org/details/stackexchange","datasetname":"Stack Exchange Data Explorer","description":"This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. 
Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.","useterm":"public"},{"timestring":"Thu Dec 24 15:17:15 2015","uni":"cz2342","groups":"SocialScience-Government","url":"https://www.kaggle.com/c/sf-crime/download/test.csv.zip","datasetname":"san francisco crime rate","description":"This dataset contains incidents derived from SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning week 1,3,5,7... belong to test set, week 2,4,6,8 belong to training set","useterm":"public"},{"timestring":"Wed May 11 19:52:53 2016","uni":"aab2234","groups":"Retail, LifeScience","url":"https://www.kaggle.com/c/expedia-hotel-recommendations/data","datasetname":"Expedia-Hotel-Data","description":"This data set has been provided by Expedia as logs of customer behavior. These include what customers searched for, how they interacted with search results (click/book), whether or not the search result was a travel package (hotel booking + flight ticket). The data belongs to a Kaggle competition and is a random selection from Expedia and is not representative of the overall statistics. The image right below shows the description of the training data set for this competition. ","useterm":"public"},{"timestring":"Wed May 11 21:09:18 2016","uni":"up2131","groups":"LifeScience","url":"http://leafsnap.com/dataset/","datasetname":"Leafsnap","description":"the Leafsnap dataset consists of images of leaves taken from two different sources, as well as their automatically-generated segmentations: 23147 Lab images, consisting of high-quality images taken of pressed leaves, from the Smithsonian collection. These images appear in controlled backlit and front-lit versions, with several samples per species. 7719 Field images, consisting of \"typical\" images taken by mobile devices (iPhones mostly) in outdoor environments. 
These images contain varying amounts of blur, noise, illumination patterns, shadows, etc.","useterm":"public"},{"timestring":"Thu May 12 02:30:17 2016","uni":"kks2142","groups":"Finance","url":"https://www.lendingclub.com/info/download-data.action","datasetname":"LendingClub Data","description":"Loan performance of peer-to-peer loans originated by LendingClub","useterm":"public"},{"timestring":"Thu May 12 02:31:16 2016","uni":"kks2142","groups":"Media","url":"http://gdeltproject.org/","datasetname":"GDELT","description":"Enormous news database that scrapes articles from major newspapers in almost every country, in all major languages","useterm":"public"},{"timestring":"Thu May 12 06:29:29 2016","uni":"zj2203","groups":"Media, Information","url":"https://developer.valvesoftware.com/wiki/Steam_Web_API","datasetname":"Steam Web Api","description":"Steam exposes an HTTP based Web API which can be used to access many Steamworks features. The API contains public methods that can be accessed from any application capable of making an HTTP request, such as game client or server. The API also contains protected methods that require authentication and are intended to be accessed from trusted back-end applications.","useterm":"public"},{"timestring":"Thu May 12 10:10:31 2016","uni":"ys2867","groups":"Information, SocialScience-Government","url":"http://www.kasrl.org/jaffe.html","datasetname":"Japanese Female Facial Expression dataset","description":"It has 213 different people facial images with 7 different facial expressions.","useterm":"public"},{"timestring":"Thu May 12 10:37:08 2016","uni":"ab3955","groups":"Information, SocialScience-Government","url":"https://catalog.data.gov/dataset/nypd-7-major-felony-incidents/resource/b678d70c-b717-4fbe-9e0b-b81dbc5e0d61","datasetname":"NYPD 7 Major Felony Incidents","description":"Quarterly update of Seven Major Felonies at the incident level. 
For privacy reasons, incidents have been moved to the midpoint of the street segment on which they occur. ","useterm":"public"},{"timestring":"Thu May 12 10:37:56 2016","uni":"gtm2122","groups":"Information","url":"https://catalog.data.gov/dataset/nypd-7-major-felony-incidents/resource/b678d70c-b717-4fbe-9e0b-b81dbc5e0d61","datasetname":"NYPD 7 Major Felony","description":"Historical dataset of crimes reported in all boroughs of New York City. It also gives the location of each crime in GPS coordinates. Quarterly update of Seven Major Felonies at the incident level. For privacy reasons, incidents have been moved to the midpoint of the street segment on which they occur. ","useterm":"public"},{"timestring":"Thu May 12 11:04:00 2016","uni":"sx2172","groups":"Retail, LifeScience, SocialScience-Government","url":"https://www.census.gov/did/www/sahie/","datasetname":"SAHIE","description":"The U.S. Census Bureau’s Small Area Health Insurance Estimates (SAHIE) program produces timely estimates for all counties and states by detailed demographic and income groups. The SAHIE program produces single-year estimates of health insurance coverage for every county in the U.S. The estimates are model-based and consistent with the American Community Survey (ACS). They are based on an \"area-level\" model that uses survey estimates for domains of interest, rather than individual responses. The estimates are \"enhanced\" with administrative data, within a Hierarchical Bayesian framework. SAHIE data can be used to analyze geographic variation in health insurance coverage, as well as disparities in coverage by race/ethnicity, sex, age and income levels that reflect thresholds for state and federal assistance programs. 
Because consistent estimates are available from 2008 to 2014, SAHIE reflects annual changes over time.","useterm":"public"},{"timestring":"Thu May 12 11:04:44 2016","uni":"sx2172","groups":"Retail, LifeScience, SocialScience-Government","url":"http://www.census.gov/did/www/saipe/","datasetname":"SAIPE","description":"Small Area Income and Poverty Estimates (SAIPE) are produced for school districts, counties, and states. The main objective of this program is to provide updated estimates of income and poverty statistics for the administration of federal programs and the allocation of federal funds to local jurisdictions. Estimates for 2014 were released in December 2015. These estimates combine data from administrative records, postcensal population estimates, and the decennial census with direct estimates from the American Community Survey to provide consistent and reliable single-year estimates. These model-based single-year estimates are more reflective of current conditions than multi-year survey estimates.","useterm":"public"},{"timestring":"Thu May 12 11:06:09 2016","uni":"pz2210","groups":"LifeScience","url":"https://www.yelp.com/dataset_challenge","datasetname":"Yelp - NYC Nightlife Dataset","description":"The Challenge Dataset: 2.2M reviews and 591K tips by 552K users for 77K businesses; 566K business attributes, e.g., hours, parking availability, ambience; Social network of 552K users for a total of 3.5M social edges; Aggregated check-ins over time for each of the 77K businesses; 200,000 pictures from the included businesses;","useterm":"public"},{"timestring":"Thu May 12 11:35:53 2016","uni":"jy2736","groups":"LifeScience","url":"https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/3m9u-ws8e","datasetname":"The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified dataset.","description":"The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified file contains 
discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.","useterm":"public"},{"timestring":"Thu May 12 11:40:40 2016","uni":"gss2147","groups":"Energy-Transportation-Industry","url":"https://www.citibikenyc.com/system-data","datasetname":"Citibike System Data","description":"Citi Bike Trip Histories We publish downloadable files of Citi Bike trip data. The data includes: Trip Duration (seconds) Start Time and Date Stop Time and Date Start Station Name End Station Name Station ID Station Lat/Long Bike ID User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member) Gender (Zero=unknown; 1=male; 2=female) Year of Birth","useterm":"public"},{"timestring":"Thu May 12 11:42:44 2016","uni":"as4916","groups":"SocialScience-Government","url":"https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95","datasetname":"NYC Motor Vehicle Collision Data","description":"Details of all accidents over the past three years in NYC","useterm":"public"},{"timestring":"Thu May 12 11:48:06 2016","uni":"sl4017","groups":"Media","url":"http://yann.lecun.com/exdb/mnist/; https://www.cs.toronto.edu/~kriz/cifar.html","datasetname":"ImageNet","description":"MNIST Handwritten digits: a training set of 60,000 examples, and a test set of 10,000 examples Cifar-10/100 The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. 
The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool [Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012)] ","useterm":"public"},{"timestring":"Thu May 12 11:51:13 2016","uni":"ai2336","groups":"LifeScience","url":"https://www.yelp.com/dataset_challenge","datasetname":"Yelp academic dataset","description":" business { 'type': 'business', 'business_id': (encrypted business id), 'name': (business name), 'neighborhoods': [(hood names)], 'full_address': (localized address), 'city': (city), 'state': (state), 'latitude': latitude, 'longitude': longitude, 'stars': (star rating, rounded to half-stars), 'review_count': review count, 'categories': [(localized category names)] 'open': True / False (corresponds to closed, not business hours), 'hours': { (day_of_week): { 'open': (HH:MM), 'close': (HH:MM) }, ... }, 'attributes': { (attribute_name): (attribute_value), ... }, } review { 'type': 'review', 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'stars': (star rating, rounded to half-stars), 'text': (review text), 'date': (date, formatted like '2012-03-14'), 'votes': {(vote type): (count)}, } user { 'type': 'user', 'user_id': (encrypted user id), 'name': (first name), 'review_count': (review count), 'average_stars': (floating point average, like 4.31), 'votes': {(vote type): (count)}, 'friends': [(friend user_ids)], 'elite': [(years_elite)], 'yelping_since': (date, formatted like '2012-03'), 'compliments': { (compliment_type): (num_compliments_of_this_type), ... }, 'fans': (num_fans), } check-in { 'type': 'checkin', 'business_id': (encrypted business id), 'checkin_info': { '0-0': (number of checkins from 00:00 to 01:00 on all Sundays), '1-0': (number of checkins from 01:00 to 02:00 on all Sundays), ... '14-4': (number of checkins from 14:00 to 15:00 on all Thursdays), ... 
'23-6': (number of checkins from 23:00 to 00:00 on all Saturdays) }, # if there was no checkin for a hour-day block it will not be in the dict } tip { 'type': 'tip', 'text': (tip text), 'business_id': (encrypted business id), 'user_id': (encrypted user id), 'date': (date, formatted like '2012-03-14'), 'likes': (count), } photos (from the photos auxiliary file) This file is formatted as a JSON list of objects. [ { \"photo_id\": (encrypted photo id), \"business_id\" : (encrypted business id), \"caption\" : (the photo caption, if any), \"label\" : (the category the photo belongs to, if any) }, {...} ]","useterm":"public"},{"timestring":"Thu May 12 11:55:04 2016","uni":"rm3330","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience","url":"https://data.sfgov.org/","datasetname":"San Fransisco Open Data","description":"Dataset for all buildings, census data and recreation data in San Francisco","useterm":"public"},{"timestring":"Thu May 12 11:57:57 2016","uni":"jsc2226","groups":"LifeScience","url":"https://data.medicare.gov/data/archives/hospital-compare","datasetname":"Medicare Hospital Compare dataset","description":"Gives extensive data about detailed hospital ratings on a large variety of parameters including patient feedback. State, National and Hospital medicare spending. Number of Cases grouped by DRG (Diagnostic Related Group) IDs, etc. 
","useterm":"public"},{"timestring":"Thu May 12 12:16:59 2016","uni":"cy2403","groups":"Information","url":"http://datamine.mta.info/","datasetname":"MTA api","description":"Detailed information for each running train, including location, time, station, and delay","useterm":"public"},{"timestring":"Thu May 12 17:56:05 2016","uni":"kk3098","groups":"Information","url":"https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data","datasetname":"Challenges in Representation Learning: Facial Expression Recognition Challenge","description":"The data consists of 48x48 pixel grayscale images of faces. The faces have been automatically registered so that the face is more or less centered and occupies about the same amount of space in each image. The task is to categorize each face based on the emotion shown in the facial expression in to one of seven categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). train.csv contains two columns, \"emotion\" and \"pixels\". The \"emotion\" column contains a numeric code ranging from 0 to 6, inclusive, for the emotion that is present in the image. The \"pixels\" column contains a string surrounded in quotes for each image. The contents of this string a space-separated pixel values in row major order. test.csv contains only the \"pixels\" column and your task is to predict the emotion column. The training set consists of 28,709 examples. 
This dataset was prepared by Pierre-Luc Carrier and Aaron Courville, as part of an ongoing research project.","useterm":"public"},{"timestring":"Thu May 12 20:01:30 2016","uni":"sjb2198","groups":"SocialScience-Government","url":"http://www.cs.toronto.edu/~delve/data/adult/desc.html","datasetname":"Adult","description":"Contains US Census data for income classification","useterm":"public"},{"timestring":"Fri May 13 09:26:46 2016","uni":"np2544","groups":"LifeScience","url":"https://www.yelp.com/dataset_challenge/dataset","datasetname":"Yelp Academic Dataset","description":"It contains information about Yelp's businesses, reviews, users, check-ins and tips. The data is in JSON format. For our project, we only used the review and the business information files. ","useterm":"public"},{"timestring":"Thu Nov 10 14:43:16 2016","uni":"ya2366","groups":"Finance","url":"https://finance.yahoo.com/","datasetname":"yahoo finance data","description":"We will use the stock data provided on this website. The data records the historical movement of different stocks.","useterm":"public"},{"timestring":"Thu Nov 10 14:57:40 2016","uni":"bcs2149","groups":"Finance, SocialScience-Government","url":"http://build.kiva.org/","datasetname":"Kiva Microfinance Lending","description":"Kiva serves as a website to help finance small loans in developing countries. 
By providing these loans, individuals and communities can get access to capital that would otherwise be unprofitable for many lenders to originate.","useterm":"public"},{"timestring":"Thu Nov 10 17:34:09 2016","uni":"hl2915","groups":"Energy-Transportation-Industry","url":"https://data.cityofnewyork.us/view/ba8s-jw6u","datasetname":"taxi dataSet","description":"The dataset contains the departure time and arrival time of each trip, along with additional attributes such as fares, time, trip distance, and pick-up and drop-off locations","useterm":"public"},{"timestring":"Thu Nov 10 19:20:46 2016","uni":"kz2246","groups":"Media","url":"https://www.kaggle.com/c/word2vec-nlp-tutorial/data","datasetname":"IMDB movie reviews","description":"labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. sampleSubmission - A comma-delimited sample submission file in the correct format. ","useterm":"public"},{"timestring":"Thu Nov 10 20:27:21 2016","uni":"ll3057","groups":"Media","url":"https://musicbrainz.org/doc/MusicBrainz_Database","datasetname":"MusicBrainz","description":"The MusicBrainz Database is built on the PostgreSQL relational database engine and contains all of MusicBrainz' music metadata. This data includes information about artists, release groups, releases, recordings, works, and labels, as well as the many relationships between them. The database also contains a full history of all the changes that the MusicBrainz community has made to the data. 
","useterm":"public"},{"timestring":"Thu Nov 10 20:52:02 2016","uni":"yw2875","groups":"Information","url":"http://webscope.sandbox.yahoo.com/catalog.php?datatype=r","datasetname":"Yahoo Lab movie rating data","description":"This dataset contains a small sample of the Yahoo! Movies community's preferences for various movies, rated on a scale from A+ to F. Users are represented as meaningless anonymous numbers so that no identifying information is revealed. The dataset also contains a large amount of descriptive information about many movies released prior to November 2003, including cast, crew, synopsis, genre, average ratings, awards, etc. The dataset may be used by researchers to validate recommender systems or collaborative filtering algorithms, including hybrid content and collaborative filtering algorithms. The dataset may serve as a testbed for relational learning and data mining algorithms as well as matrix and graph algorithms including PCA and clustering algorithms. The size of this dataset is 23MB.","useterm":"public"},{"timestring":"Thu Nov 10 22:43:37 2016","uni":"xj2191","groups":"Finance","url":"https://quantquote.com/historical-stock-data","datasetname":"QuantQuote Free Historical Stock Data","description":"The dataset that used in our project is provided by QuantQuote. It includes the date, time, high price, low price, open price, close price and trading volume of SP500 company stocks from 1998 to 2013. 
The downloaded dataset is a zip file containing 501 CSV files, each corresponding to one company.","useterm":"public"},{"timestring":"Thu Nov 10 22:56:35 2016","uni":"yw2928","groups":"Media","url":"http://webscope.sandbox.yahoo.com/catalog.php?datatype=a","datasetname":"A4 - Yahoo Data Targeting User Modeling, Version 1.0 (Hosted on AWS)(3.7Gb)","description":"It is a data sample of user activities over several months at Yahoo webpages, which includes user interactions with pages, ads, and search results for a training period of 90 days and labels from a test period of 2 weeks.","useterm":"public"},{"timestring":"Thu Nov 10 22:56:40 2016","uni":"mz2597","groups":"Media","url":"http://webscope.sandbox.yahoo.com/catalog.php?datatype=a","datasetname":"A4 - Yahoo Data Targeting User Modeling, Version 1.0 (Hosted on AWS)(3.7Gb)","description":"It is a data sample of user activities over several months at Yahoo webpages, which includes user interactions with pages, ads, and search results for a training period of 90 days and labels from a test period of 2 weeks. ","useterm":"public"},{"timestring":"Thu Nov 10 23:15:53 2016","uni":"yr2301","groups":"LifeScience","url":"https://www.kaggle.com/sogun3/uspollution","datasetname":"U.S. Pollution Data Since 2000","description":"This dataset deals with pollution in the U.S. Pollution in the U.S. has been well documented by the U.S. EPA but it is a pain to download all the data and arrange them in a format that interests data scientists. Hence I gathered four major pollutants (Nitrogen Dioxide, Sulphur Dioxide, Carbon Monoxide and Ozone) for every day from 2000 - 2016 and placed them neatly in a csv file.","useterm":"public"},{"timestring":"Sat Nov 12 15:24:30 2016","uni":"jf3030","groups":"Retail, Information","url":"http://jmcauley.ucsd.edu/data/amazon/","datasetname":"Amazon Review Dataset","description":"This is the website for the Amazon review dataset. 
Particularly, we will only use the rating data.","useterm":"public"},{"timestring":"Sat Nov 12 15:27:41 2016","uni":"ik2338","groups":"Media","url":"http://labrosa.ee.columbia.edu/millionsong/","datasetname":"Million Song Dataset","description":"The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.","useterm":"public"},{"timestring":"Sat Nov 12 15:31:24 2016","uni":"mjc2261","groups":"Media","url":"http://labrosa.ee.columbia.edu/millionsong/","datasetname":"Million Song Data Set","description":"The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are: To encourage research on algorithms that scale to commercial sizes To provide a reference dataset for evaluating research As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's) To help new researchers get started in the MIR field The core of the dataset is the feature analysis and metadata for one million songs, provided by The Echo Nest. The dataset does not include any audio, only the derived features. ","useterm":"public"},{"timestring":"Sun Nov 13 12:00:01 2016","uni":"yx2382","groups":"SocialScience-Government","url":"https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95","datasetname":"NYPD Motor Vehicle Collisions ","description":"This dataset contains accident information in NYC, in which we can specify the locations, street names, contribution factors, number of injuries or casualties, etc.","useterm":"public"},{"timestring":"Sun Nov 13 12:06:39 2016","uni":"jl4564","groups":"Energy-Transportation-Industry","url":"https://www.kaggle.com/c/state-farm-distracted-driver-detection/data","datasetname":"Driver Image","description":"The dataset will be collected from a Kaggle competition: State Farm Distracted Driver Detection. 
It includes two files: imgs.zip, a zipped folder of all (train/test) images, and driver_imgs_list.csv, a list of training images, their subject (driver) id, and class id. ","useterm":"public"},{"timestring":"Sun Nov 13 13:21:57 2016","uni":"ra2659","groups":"Information","url":"http://isophonics.net/datasets","datasetname":"Chord Annotations Dataset","description":"This dataset is from the Centre for Digital Music (C4DM) at Queen Mary, University of London. It has hundreds of songs annotated with chords from artists such as Michael Jackson, Carole King, Queen and the Beatles.","useterm":"public"},{"timestring":"Sun Nov 13 15:55:59 2016","uni":"mz2584","groups":"Finance","url":"https://finance.yahoo.com/quote/AAPL/history?p=AAPL","datasetname":"Apple Stocks prizes","description":"Nasdaq stock price of Apple over a certain period of time.","useterm":"public"},{"timestring":"Sun Nov 13 16:12:07 2016","uni":"yw2902","groups":"Finance","url":"https://www.kaggle.com/c/santander-product-recommendation/data","datasetname":"sample_submission.csv; test_ver2.csv; train_ver2.csv","description":"The datasets provide 1.5 years of customer behavior data from Santander bank to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as \"credit card\", \"savings account\", etc. We will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind_(xyz)_ult1, which are the columns #25 - #48 in the training data. We will predict what a customer will buy in addition to what they already had at 2016-05-28. 
The test and train sets are split by time, and public and private leaderboard sets are split randomly.","useterm":"public"},{"timestring":"Sun Nov 13 17:02:49 2016","uni":"av2674","groups":"Retail, Information","url":"https://www.yelp.com/dataset_challenge/dataset","datasetname":"Yelp Dataset Challenge","description":"This is a dataset composed of json files and which is comprised of the various types of data that Yelp has. It includes information about businesses, reviews, checkins and users. ","useterm":"public"},{"timestring":"Sun Nov 13 18:16:32 2016","uni":"gc2662","groups":"Energy-Transportation-Industry, Information","url":"https://catalog.data.gov/dataset/traffic-violations-56dda","datasetname":"Traffic Violations","description":"We are using the Traffic Violations from Montgomery County dataset from data.gov. It is a public data set on the government website. There are a large number of interesting attributes in the data such as: Date Of Stop, Time Of Stop, Description, Location, Accident, Belts, Personal Injury, Property Damage, Fatal, Commercial License, Commercial Vehicle, Alcohol, Work Zone, Make, Model, Color Violation Type, Race, Gender, Geolocation We can combine these columns to find insightful results and predictions.","useterm":"public"},{"timestring":"Sun Nov 13 18:32:08 2016","uni":"gc2662","groups":"Energy-Transportation-Industry, Information","url":"https://catalog.data.gov/dataset/traffic-violations-56dda","datasetname":"Traffic Violations","description":"We are using the Traffic Violations from Montgomery County dataset from data.gov. It is a public data set on the government website. 
There are a large number of interesting attributes in the data such as: Date Of Stop, Time Of Stop, Description, Location, Accident, Belts, Personal Injury, Property Damage, Fatal, Commercial License, Commercial Vehicle, Alcohol, Work Zone, Make, Model, Color Violation Type, Race, Gender, Geolocation We can combine these columns to find insightful results and predictions. ","useterm":"public"},{"timestring":"Sun Nov 13 20:52:22 2016","uni":"yt2549","groups":"Information, LifeScience, SocialScience-Government","url":"https://www.yelp.com/developers/documentation/v2/business","datasetname":"Yelp business data, reviews data","description":"This is a dataset of restaurants' recommendation and reviews.","useterm":"public"},{"timestring":"Sun Nov 13 22:44:48 2016","uni":"wz2363","groups":"Information, LifeScience","url":"https://www.ncbi.nlm.nih.gov/genbank/","datasetname":"Human DNA Dataset","description":"The dataset is about human DNA sequence. ","useterm":"public"},{"timestring":"Sun Nov 13 23:54:49 2016","uni":"cr2826","groups":"Media","url":"https://www.nist.gov/itl/iad/image-group/color-feret-database","datasetname":"color FERET Database","description":"color FERET Database: The database contains 1564 sets of images for a total of 14,126 images that includes 1199 individuals and 365 duplicate sets of images. A duplicate set is a second set of images of a person already in the database and was usually taken on a different day. For some individuals, over two years had elapsed between their first and last sittings, with some subjects being photographed multiple times. 
This time lapse was important because it enabled researchers to study, for the first time, changes in a subject's appearance that occur over a year.","useterm":"public"},{"timestring":"Mon Nov 14 00:10:08 2016","uni":"mz2594","groups":"Information","url":"https://www.kaggle.com/stackoverflow/stacksample","datasetname":"StackSample: 10% of Stack Overflow Q&A","description":"Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website. This is organized as three tables: Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10. Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table. Tags contains the tags on each of these questions","useterm":"public"},{"timestring":"Mon Nov 14 16:29:00 2016","uni":"cx2187","groups":"Energy-Transportation-Industry","url":"https://www.kaggle.com/five-thirty-eight/uber-pickups-in-new-york-city","datasetname":"Uber Pickups in New York","description":"This directory contains data on over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015. Trip-level data on 10 other for-hire vehicle (FHV) companies, as well as aggregated data for 329 FHV companies, is also included. All the files are as they were received on August 3, Sept. 15 and Sept. 22, 2015. ","useterm":"public"},{"timestring":"Mon Nov 14 16:34:59 2016","uni":"cx2187","groups":"Energy-Transportation-Industry","url":"https://data.cityofnewyork.us/data?browseSearch=taxi&type=&agency=&cat=&scope=","datasetname":"Taxi Trip Data","description":"This dataset includes trip records from all trips completed in yellow and green taxis in NYC from 2009 to 2015. 
Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab Passenger Enhancement Program (TPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.","useterm":"public"},{"timestring":"Mon Nov 14 18:31:44 2016","uni":"yq2207","groups":"Media, Information, LifeScience, SocialScience-Government","url":"https://www.kaggle.com/datasets?sortBy=hottest&group=featured&search=voice","datasetname":"voice.csv","description":"This database was created to identify a voice as male or female, based upon acoustic properties of the voice and speech. The dataset consists of 3,168 recorded voice samples, collected from male and female speakers. The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0hz-280hz (human vocal range).","useterm":"public"},{"timestring":"Mon Nov 14 23:52:40 2016","uni":"yx2385","groups":"Energy-Transportation-Industry, LifeScience, SocialScience-Government","url":"http://faostat3.fao.org/download/FB/FBS/E","datasetname":"1967-2013 Food Balance Sheets for 42 selected countries (and updated regional aggregates)","description":"These are key countries in the calculation of the global prevalence of undernourishment (PoU) indicators that will be published in the upcoming ''The State of Food Insecurity in the World 2015'' (SOFI) Countries: Afghanistan, Algeria, Angola, Bangladesh, Belize, Brazil, Burkina Faso, Chad,China (Hong Kong SAR), China (Macao SAR), China (mainland), China (Taiwan Province of), Colombia, Cote d'Ivoire, Democratic People's Republic of Korea, Dominican Republic, Ethiopia, Guatemala, Haiti, India, Indonesia, 
Jamaica, Kenya, Madagascar, Mexico, Mozambique, Myanmar, Nepal, Nigeria, Oman, Pakistan, Panama, Paraguay, Peru, Philippines, Sri Lanka, Thailand, United Republic of Tanzania, Viet Nam, Yemen, Zambia, Zimbabwe","useterm":"public"},{"timestring":"Tue Nov 15 14:10:38 2016","uni":"ys2867","groups":"Information","url":"https://www.yelp.com/dataset_challenge/dataset","datasetname":"Yelp dataset","description":"Each file is composed of a single object type, one json-object per-line.","useterm":"public"},{"timestring":"Tue Nov 15 20:58:24 2016","uni":"jz2776","groups":"Retail, Information","url":"https://www.yelp.com/dataset_challenge","datasetname":"Yelp Dataset Challenge ","description":"The restaurants review texts from Yelp Dataset Challenge.","useterm":"public"},{"timestring":"Wed Nov 16 19:03:09 2016","uni":"km3194","groups":"SocialScience-Government","url":"http://snap.stanford.edu/data/cit-Patents.html","datasetname":"High-energy physics citation network Dataset ","description":"Stanford Large Network Dataset Collection ","useterm":"restricted"},{"timestring":"Wed Nov 16 19:35:53 2016","uni":"fy2207","groups":"Information","url":"https://snap.stanford.edu/data/loc-gowalla.html","datasetname":"loc-gowalla_edges.txt.gz loc-gowalla_totalCheckins.txt.gz","description":"Gowalla is a location-based social networking website where users share their locations by checking-in. The friendship network is undirected and was collected using their public API, and consists of 196,591 nodes and 950,327 edges. They have collected a total of 6,442,890 check-ins of these users over the period of Feb. 2009 - Oct. 2010. The first dataset is the undirected friendship connection edges from the users. 
The second dataset is the detailed check-in data of users, including the user-id, check-in time, latitude, longitude and location-id.","useterm":"public"},{"timestring":"Wed Nov 16 20:48:21 2016","uni":"tl2710","groups":"Media, SocialScience-Government","url":"https://www.kaggle.com/semioniy/predictemall","datasetname":"Pokemon_occur","description":"Dataset consists of roughly 293,000 pokemon sightings (historical appearances of Pokemon), having coordinates, time, weather, population density, distance to pokestops/ gyms etc. as features. The target is to train a machine learning algorithm so that it can predict where pokemon will appear in the future. ","useterm":"public"},{"timestring":"Wed Nov 16 22:15:04 2016","uni":"yc3228","groups":"Media","url":"http://grouplens.org/datasets/movielens/latest/","datasetname":"group lens","description":"• 24,000,000 ratings and 670,000 tag applications applied to 40,000 movies by 260,000 users. • Includes tag genome data with 12 million relevance scores across 1,100 tags. • Last updated 10/2016. ","useterm":"public"},{"timestring":"Wed Nov 16 23:21:02 2016","uni":"vm2486","groups":"Finance, Retail","url":"https://www.kaggle.com/c/bosch-production-line-performance/data","datasetname":"Production line data","description":"This dataset is from the Kaggle competition described on this page: https://www.kaggle.com/c/bosch-production-line-performance The data for this competition represents measurements of parts as they move through Bosch's production lines. Each part has a unique Id. The goal is to predict which parts will fail quality control (represented by a 'Response' = 1). The dataset contains an extremely large number of anonymized features. Features are named according to a convention that tells you the production line, the station on the line, and a feature number. E.g. L3_S36_F3939 is a feature measured on line 3, station 36, and is feature number 3939. 
On account of the large size of the dataset, files are separated by the type of feature they contain: numerical, categorical, and finally, a file with date features. The date features provide a timestamp for when each measurement was taken. Each date column ends in a number that corresponds to the previous feature number. E.g. the value of L0_S0_D1 is the time at which L0_S0_F0 was taken. ","useterm":"public"},{"timestring":"Thu Nov 17 11:37:17 2016","uni":"cl3406","groups":"Information","url":"https://github.com/emilkayumov/kaggle-dota2-win-prediction/blob/master/input/features.csv","datasetname":"feature.csv","description":"match_id: Match identifier in a data set start_time: The start time of the match (unixtime) lobby_type: Type of room in which players are going (in the transcript dictionaries/lobbies.csv) Sets attributes for each player (Radiant team players - the prefix rN, of Dire - dN): r1_heroPlayer character (transcript in dictionaries / heroes.csv) r1_level: Maximum character level achieved (in the first 5 minutes of the game) r1_xp: Maximum lessons learned r1_gold: Value achieved hero r1_lh: The number of dead units r1_kills: The number of dead players r1_deaths: The number of deaths of the hero r1_items: The number of items purchased first_blood_time: Playing time the first blood first_blood_team: The team that carried out the first blood (0 - Radiant, 1 - Dire) first_blood_player1: Player, was involved in the event first_blood_player2: The second player, was involved in the event The signs for each team (prefixes radiant_and dire_) radiant_bottle_time: The first acquisition object command \"bottle\" radiant_courier_time: The subject of the acquisition time, \"courier\" radiant_flying_courier_time: The subject of the acquisition time \"flying_courier\" radiant_tpscroll_count: Number of objects \"tpscroll\" in the first 5 minutes radiant_boots_count: Number of objects \"boots\" radiant_ward_observer_count: Number of objects \"ward_observer\" 
radiant_ward_sentry_count: Number of objects \"ward_sentry\" radiant_first_ward_time: Install the first team \"observer\", ie object, which allows you to see a part of the playing field The result of the match (data fields are missing from the test sample, because they contain information that goes beyond the first 5 minutes of the match) duration: duration radiant_win: 1 if won Radiant team 0 - otherwise Status towers and Barraco by the end of the match (see. Dataset fields description) tower_status_radiant tower_status_dire barracks_status_radiant barracks_status_dire","useterm":"public"},{"timestring":"Thu Nov 17 14:55:01 2016","uni":"jjg2188","groups":"LifeScience","url":"https://www.kaggle.com/c/seizure-prediction/data","datasetname":"American Epilepsy Society Seizure Prediction Challenge","description":"preictal_segment_N.mat - the Nth preictal training data segment interictal_segment_N.mat - the Nth non-seizure training data segment test_segment_N.mat - the Nth testing data segment Each .mat file contains a data structure with fields as follow: data: a matrix of EEG sample values arranged row x column as electrode x time. data_length_sec: the time duration of each data row sampling_frequency: the number of data samples representing 1 second of EEG data. channels: a list of electrode names corresponding to the rows in the data field sequence: the index of the data segment within the one hour series of clips. For example, preictal_segment_6.mat has a sequence number of 6, and represents the iEEG data from 50 to 60 minutes into the preictal data.","useterm":"public"},{"timestring":"Thu Nov 17 18:20:17 2016","uni":"sb3766","groups":"Finance","url":"https://www.kaggle.com/wendykan/lending-club-loan-data","datasetname":"Lending Club Loan Data","description":"These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. 
The file containing loan data through the \"present\" contains complete loan data for all loans issued through the previous completed calendar quarter.","useterm":"public"},{"timestring":"Thu Nov 17 18:26:02 2016","uni":"xj2178","groups":"Energy-Transportation-Industry","url":"https://www.kaggle.com/five-thirty-eight/uber-pickups-in-new-york-city","datasetname":"Uber pickups in NYC","description":"The dataset contains, roughly, four groups of files: Uber trip data from 2014 (April - September), separated by month, with detailed location information Uber trip data from 2015 (January - June), with less fine-grained location information Non-Uber FHV (For-Hire Vehicle) trips. The trip information varies by company, but include day of trip, time of trip, pickup location, driver's for-hire license number, and vehicle's for-hire license number. Aggregate ride and vehicle statistics for all FHV companies (and, occasionally, for taxi companies) ","useterm":"public"},{"timestring":"Thu Nov 17 18:28:50 2016","uni":"nc2663","groups":"Finance","url":"https://www.kaggle.com/wendykan/lending-club-loan-data/","datasetname":"Lending Club Loan Data","description":"These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the \"present\" contains complete loan data for all loans issued through the previous completed calendar quarter.","useterm":"public"},{"timestring":"Tue Dec 13 16:43:25 2016","uni":"xl2601","groups":"Finance, Retail, SocialScience-Government","url":"http://snap.stanford.edu/data/amazon-meta.html","datasetname":"Amazon product co-purchasing network metadata","description":"Amazon Co-purchasing directed network and Metadata dataset. We found this data on stanford dataset library. 
The dataset includes Title,Salesrank,List of similar products (that get co-purchased with the current product),Detailed product categorization,Product reviews: time, customer, rating, number of votes, number of people that found the review helpful.","useterm":"public"},{"timestring":"Wed Dec 14 12:36:06 2016","uni":"sc4097","groups":"Media","url":"http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-360K.html","datasetname":"lastfm 360k","description":"This dataset contains tuples (for ~360,000 users) collected from Last.fm API.","useterm":"public"},{"timestring":"Wed Dec 14 18:19:33 2016","uni":"bjz2107","groups":"Finance, Retail, Media, Energy-Transportation-Industry, Information, LifeScience, SocialScience-Government, Telecom","url":"https://archive.org/details/twitterstream","datasetname":"Twitter Dump","description":"All tweets","useterm":"public"},{"timestring":"Wed Dec 14 18:51:58 2016","uni":"kc3031","groups":"LifeScience","url":"https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/npsr-cm47/data","datasetname":"Hospital Inpatient Discharges (SPARCS De-Identified): 2013","description":"The hospital inpatient discharge 2013 dataset can be downloaded from the New York state ny.gov website. The dataset contains about 1.8 million records of hospital stays in New York state and non-clinical data elements such as patient demographics, total charges and length of stay for each visit. We also utilized hospital inpatient discharge 2010 and 2014 as compared dataset. The links to these two dataset pages are as below. 
Hospital Inpatient Discharges (SPARCS De-Identified): 2010 https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/mtfm-rxf4/data Hospital Inpatient Discharges (SPARCS De-Identified): 2014 https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/rmwa-zns4/data","useterm":"public"},{"timestring":"Wed Dec 14 19:00:43 2016","uni":"ds3516","groups":"Retail","url":"http://snap.stanford.edu/data/amazon-meta.html","datasetname":"Amazon product co-purchasing network metadata","description":"Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j. The data was collected by crawling Amazon website and contains product metadata and review information about 548,552 different products (Books, music CDs, DVDs and VHS video tapes).","useterm":"public"},{"timestring":"Wed Dec 14 19:14:44 2016","uni":"ll3078","groups":"Information","url":"https://data.cityofnewyork.us/Public-Safety/Historical-New-York-City-Crime-Data/hqhv-9zeg","datasetname":"Historical New York City Crime Data","description":"The New York City Police Department records reported crime and offense data based upon the New York State Penal Law and other New York State laws. For statistical presentation purposes the numerous law categories and subsections are summarized by law class, Felony, Misdemeanor and Violation. The tabular data compiles reported crime and offense data recorded by the New York City Police Department. 
Separate tables are presented for the Seven Major Felonies, Non-Seven Major Felony Crimes, Misdemeanors and Violations.","useterm":"public"},{"timestring":"Wed Dec 14 20:31:13 2016","uni":"xw2401","groups":"Energy-Transportation-Industry","url":"http://web.mta.info/developers/download.html","datasetname":"MTA Bus Historical Data","description":"Date_received is the specific date of message receipt by server. Hour_received is the specific hour of message receipt by server. Min_received is the specific minute that the bus data is received. Vehicle_id is a 3 or 4- digit bus number Route_id is what the bus was inferred to be serving Trip_id represents the stopping pattern inferred for the given bus at the given time. Trip ID’s are only representative; they may not actually represent the trip a bus was serving. Next_stop_id is the next_stop the bus will serve ","useterm":"public"},{"timestring":"Wed Dec 14 20:34:00 2016","uni":"ra2659","groups":"Media, Information","url":"http://ddmal.music.mcgill.ca/research/billboard","datasetname":"Billboard Hot 100 (years) Dataset","description":"The Billboard Hot 100 (years) Dataset is from the McGill Billboard Project. It contains hundreds of files of annotated songs. Each file (song) has three columns with the begin time frame, end time frame and human annotated chord for that duration. There are also additional files with more information about the songs. 
","useterm":"public"},{"timestring":"Wed Dec 14 21:21:06 2016","uni":"yc3171","groups":"Energy-Transportation-Industry","url":"http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","datasetname":"Yellow and green taxi trip records in 2016 in NYC","description":"yellow (~1.6G each) and green (~230M each)","useterm":"public"},{"timestring":"Wed Dec 14 21:54:50 2016","uni":"pw2406","groups":"Information","url":"https://www.kaggle.com/c/the-nature-conservancy-fisheries-monitoring/data","datasetname":"The Nature Conservancy Fisheries Monitoring","description":"The Conservancy is looking to the future by using cameras to dramatically scale the monitoring of fishing activities to fill critical science and compliance monitoring data gaps. You are encouraged to develop algorithms to automatically detect and classify species of tunas, sharks and more that fishing boats catch, which will accelerate the video review process. Faster review and more reliable data will enable countries to reallocate human capital to management and enforcement activities which will have a positive impact on conservation and our planet.","useterm":"public"},{"timestring":"Wed Dec 14 22:20:23 2016","uni":"tl2710","groups":"Retail, Media, Energy-Transportation-Industry","url":"https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9/data","datasetname":"311_dataset","description":"daily description of 311 complaints received, from 2010 to present","useterm":"public"},{"timestring":"Wed Dec 14 22:22:35 2016","uni":"tl2710","groups":"Retail, Energy-Transportation-Industry, Information, SocialScience-Government","url":"https://www.kaggle.com/semioniy/predictemall","datasetname":"PokemonGo_occurence","description":"A 200K records of pokemons that appeared in the game, together with the location, time, weather and various features. 
It's available from Kaggle","useterm":"public"},{"timestring":"Wed Dec 14 22:23:25 2016","uni":"xj2191","groups":"Finance","url":"https://www.quandl.com/data/DY4-Chinese-Stock-Prices/documentation/about","datasetname":"quandl Chinese stocks","description":"The dataset that used in our project is provided by www.quandl.com. It includes the date, time, high price, low price, open price, close price and trading volume of Chinese company stocks from 2006 to 2016.","useterm":"public"},{"timestring":"Wed Dec 14 22:32:29 2016","uni":"sz2629","groups":"Finance","url":"http://www.wind.com.cn/en/wft.html","datasetname":"Tick by Tick futures Data From Wind ","description":"It is the tick by tick market price data, of four types of futures: IFB1: CSI 300 XU1: FTSE China A50 HI1:Hang Seng HC1: H-shares ","useterm":"public"},{"timestring":"Wed Dec 14 22:33:37 2016","uni":"jg3752","groups":"Information","url":"https://snap.stanford.edu/data/web-Movies.html","datasetname":"Amazon Movie Data Reviews","description":"Amazon Movie Data Reviews","useterm":"public"},{"timestring":"Wed Dec 14 22:42:37 2016","uni":"ls3301","groups":"Information","url":"http://labrosa.ee.columbia.edu/millionsong/","datasetname":"million song data set","description":"million song data set","useterm":"public"},{"timestring":"Wed Dec 14 22:54:23 2016","uni":"hx2208","groups":"Information","url":"https://www.yelp.com/dataset_challenge","datasetname":"yelp","description":"Yelp users, reviews and business dataset","useterm":"public"},{"timestring":"Wed Dec 14 22:59:16 2016","uni":"yz2978","groups":"Media, Information","url":"https://www.kaggle.com/c/expedia-hotel-recommendations/data","datasetname":"expedia customers information","description":"We mainly utilize the dataset from Kaggle (with the size of 4.1GB) for our recommendation system. 
It is used for training and testing the recommendation model.","useterm":"public"},{"timestring":"Wed Dec 14 22:59:29 2016","uni":"av2674","groups":"Retail, Information","url":"https://www.yelp.com/dataset_challenge/dataset","datasetname":"Yelp Dataset","description":"The dataset contains reviews, information on businesses, users, check-ins, tips and photos uploaded on Yelp by users. ","useterm":"public"},{"timestring":"Wed Dec 14 23:04:11 2016","uni":"zy2233","groups":"Media","url":"http://jmcauley.ucsd.edu/data/amazon/","datasetname":"Amazon Books Review data","description":"This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).","useterm":"public"},{"timestring":"Wed Dec 14 23:14:38 2016","uni":"yz2993","groups":"Finance","url":"https://www.quandl.com/data/WIKI-Wiki-EOD-Stock-Prices/documentation/documentation?modal=null","datasetname":"WIKI_STOCK_PRICES","description":"End of day stock prices, dividends and splits for 3,000 US companies, curated by the Quandl community and released into the public domain.","useterm":"public"},{"timestring":"Wed Dec 14 23:17:54 2016","uni":"jy2736","groups":"SocialScience-Government","url":"http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","datasetname":"Green_tripdata_2016-06","description":"The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. 
","useterm":"public"},{"timestring":"Wed Dec 14 23:24:42 2016","uni":"jz2776","groups":"Information","url":"https://www.yelp.com/dataset_challenge","datasetname":"Yelp Dataset Challenge Round 8","description":"Dataset: The restaurant review text from the 8th Round Yelp Dataset Challenge 2.7M reviews and 649K tips by 687K users for 86K businesses 566K business attributes, e.g., hours, parking availability, ambiance Social network of 687K users for a total of 4.2M social edges ","useterm":"public"},{"timestring":"Wed Dec 14 23:28:02 2016","uni":"ha2399","groups":"Finance","url":"https://www.kaggle.com/c/santander-product-recommendation/data","datasetname":"Kaggle Santander dataset","description":"In this competition, you are provided with 1.5 years of customers behavior data from Santander bank to predict what new products customers will purchase. The data starts at 2015-01-28 and has monthly records of products a customer has, such as \"credit card\", \"savings account\", etc. You will predict what additional products a customer will get in the last month, 2016-06-28, in addition to what they already have at 2016-05-28. These products are the columns named: ind_(xyz)_ult1, which are the columns #25 - #48 in the training data. You will predict what a customer will buy in addition to what they already had at 2016-05-28. The test and train sets are split by time, and public and private leaderboard sets are split randomly.","useterm":"public"},{"timestring":"Wed Dec 14 23:47:37 2016","uni":"yw2928","groups":"Media","url":"https://webscope.sandbox.yahoo.com/catalog.php?datatype=a&did=78","datasetname":"Yahoo Data Targeting User Modeling","description":"This data set contains a small sample of user profiles and their interests generated from several months of user activities at Yahoo webpages. Each user is represented as one feature vector and its associated labels, where all user identifiers were removed. 
Feature vectors are derived from user activities during a training period of 90 days, and labels from a test period of 2 weeks that immediately followed the training period. Each dimension of the feature vector quantifies a user activity with a certain interest category from an internal Yahoo taxonomy (e.g., \"Sports/Baseball\", \"Travel/Europe\"), calculated from user interactions with pages, ads, and search results, all of which are internally classified into these interest categories. The labels are derived in a similar way, based on user interactions with classified pages, ads, and search results during the test period. It is important to note that there exists a hierarchical structure among the labels, which is also provided in the data set. All feature and label names are replaced with meaningless anonymous numbers so that no identifying information is revealed. The dataset is of particular interest to Machine Learning and Data Mining communities, as it may serve as a testbed for classification and multi-label algorithms, as well as for classifiers that account for structure among labels. 
","useterm":"public"},{"timestring":"Wed Dec 14 23:50:39 2016","uni":"ml3810","groups":"Retail","url":"http://jmcauley.ucsd.edu/data/amazon/","datasetname":"Amazon electronic review dataset","description":"Size: 1.48 GB Instance: 60k products with over 1.5 million reviews Features: 9 features Reviewer ID ASIN Reviewer Name Helpful Review Text Overall Summary Unix Review Time Review Time ","useterm":"public"},{"timestring":"Thu Dec 15 00:01:55 2016","uni":"axc2105","groups":"Media, Information","url":"http://press.liacs.nl/mirflickr/","datasetname":"MIRFLICKR-1M","description":"1 million Flickr images under the Creative Commons license (also a 25000 version)","useterm":"public"},{"timestring":"Thu Dec 15 00:09:30 2016","uni":"mz2594","groups":"Information","url":"https://www.kaggle.com/stackoverflow/stacksample","datasetname":"Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website ","description":"Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website. This is organized as three tables: Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10. Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table. Tags contains the tags on each of these questions ","useterm":"public"},{"timestring":"Thu Dec 15 01:07:44 2016","uni":"mjc2261","groups":"Media","url":"http://labrosa.ee.columbia.edu/millionsong/","datasetname":"Million Song Dataset","description":"The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks. 
Its purposes are: To encourage research on algorithms that scale to commercial sizes To provide a reference dataset for evaluating research As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest's) To help new researchers get started in the MIR field ","useterm":"public"},{"timestring":"Thu Dec 22 17:09:32 2016","uni":"xj2178","groups":"Energy-Transportation-Industry","url":"http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml","datasetname":"NYC Taxi Dataset","description":"The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).","useterm":"public"},{"timestring":"Thu Dec 22 19:11:48 2016","uni":"sz2629","groups":"Finance","url":"https://drive.google.com/drive/folders/0B8fQWea3OAC_WEd5b1NQU3VvQTQ?usp=drive_web","datasetname":"Futures price data for futures with index HC1, HI1, IFB1, XU1","description":"Four types of Chinese Futures tick data downloaded from Wind Financial Terminal. It contains the tick data of futures with index HC1, HI1, IFB1, XU1.","useterm":"restricted"},{"timestring":"Thu May 11 13:33:47 2017","uni":"lz2494","groups":"Information","url":"http://vintage.winklerbros.net/facescrub.html","datasetname":"FaceScrub, EmotiM","description":"FaceScrub: A Dataset With Over 100,000 Face Images of 530 People EmotiM: Contain emotion dataset. 
www.emotim.com/","useterm":"public"},{"timestring":"Thu May 11 14:27:27 2017","uni":"tp2522","groups":"Finance","url":"https://stocktwits.com/developers/docs","datasetname":"StockTwits Dataset","description":"StockTwits.com is a website that is similar to Twitter, but all the posts are related to stocks. People only post messages about information related to stock or their stock exchange record. StockTwits has thousands of experienced traders that posting and sharing their real-time investment ideas and suggestions. We believe that through analyzing the StockTwit data, people can get useful information about the public sentiment toward a stock, and can use it to predict the general movement of the stock.","useterm":"public"},{"timestring":"Thu May 11 21:54:57 2017","uni":"as5147","groups":"Information","url":"https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html","datasetname":"Cornell movie dialog corpus","description":"Fictional conversation from raw movie scripts 220,579 conversational exchanges between 10,292 pairs of movie characters - involves 9,035 characters from 617 movies - in total 304,713 utterances - movie metadata included: - genres - release year - IMDB rating - number of IMDB votes - IMDB rating - character metadata included: - gender (for 3,774 characters) - position on movie credits (3,321 characters) ","useterm":"public"},{"timestring":"Thu May 11 22:43:12 2017","uni":"sg2665","groups":"Retail","url":"https://neo4j.com/developer/guide-importing-data-and-etl/","datasetname":"Northwind Sample Database (Neo4j) - Graph DB-Model","description":"The Northwind database is a sample database used by Microsoft to demonstrate the features of some of its products, including SQL Server and Microsoft Access. The database contains the sales data for Northwind Traders, a fictitious specialty foods export-import company. 
Although the code taught in this class is not specific to Microsoft products, we use the Northwind database for many of our examples because many people are already familiar with it and because there are many resources for related learning that make use of the same database. The diagram below shows the table structure of the Northwind database.","useterm":"public"},{"timestring":"Fri Aug 4 02:06:44 2017","uni":"kc2980","groups":"Information, LifeScience","url":"http://image-net.org/challenges/LSVRC/2014/browse-synsets","datasetname":"ImageNet 1000-class dataset","description":"contains images with 1000 classes with labels.","useterm":"public"},{"url":"https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2019_07","datasetname":"Reddit Posts and Comments Since 2005","timestring":"Thu Dec 12 22:07:50 2019","uni":"ft2515","description":"The dataset contains features for every Reddit post and comment since 2005 to July 2019 (as of time of writing). Features include post/comment text, upvotes, downvotes, score, time created, etc.","groupstr":"Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/tsiaras/uk-road-safety-accidents-and-vehicles ","datasetname":"UK Road Safety: Traffic Accidents and Vehicles Detailed dataset of road accidents and involved vehicles in the UK (2005-2017)","timestring":"Fri Dec 13 19:35:33 2019","uni":"hz2620","description":"Context
The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, type of vehicles, number of casualties and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research.

The creation of this dataset was inspired by the one previously published by Dave Fisher-Hickey. However, this current dataset features the following significant improvements over its predecessor:

It covers a wider date range of events.
Most of the coded data variables have been transformed to textual strings using relevant lookup tables, enabling more efficient and \"human-readable\" analysis.
It features detailed information about the vehicles involved in the accidents.
Content
The data come from the Open Data website of the UK government, where they have been published by the Department of Transport.

The dataset comprises two CSV files:

Accident_Information.csv: every line in the file represents a unique traffic accident (identified by the Accident_Index column), featuring various properties related to the accident as columns. Date range: 2005-2017
Vehicle_Information.csv: every line in the file represents the involvement of a unique vehicle in a unique traffic accident, featuring various vehicle and passenger properties as columns. Date range: 2004-2016
The two above-mentioned files/datasets can be linked through the unique traffic accident identifier (Accident_Index column).

The dataset will continue to be updated as the Department of Transport makes more data available.

Acknowledgements
Thanks to Dave Fisher-Hickey for previously publishing what I consider to be the first solid and structured version of this dataset on Kaggle.

Also thanks to data.gov.uk for making this information publicly available.

Last but not least, thanks to The Data Lab for allocating me some much needed time to assemble this dataset.

Inspiration
Go crazy using the dataset. Don't go crazy while driving.","groupstr":"Information,SocialScience-Government","useterm":"public"},{"url":"https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge and http://www.consortium.ri.cmu.edu/ckagree/","datasetname":"Fer2013 and CK+","timestring":"Fri May 17 04:57:55 2019","uni":"jy2913","description":"CK+: video-based FER dataset containing 593 sequences across 123 subjects which are FACS coded at the peak frame

Fer-2013: static image-based FER dataset containing 28,709 image instances
","groupstr":"","useterm":"public"},{"url":"http://jmcauley.ucsd.edu/data/amazon/links.html","datasetname":"Amazon Product Data","timestring":"Sun Dec 23 05:54:24 2018","uni":"sb4027","description":"We used the Amazon Product Data posted by Julian McAuley from UCSD. This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014. It includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).

For our project, we used the Cell Phone and Accessories category, which had over 3 million rows, 9 columns, and 300,000 products. We emailed Julian to get access to the complete dataset.

","groupstr":"Retail","useterm":"public"},{"url":"https://www.kaggle.com/c/the-winton-stock-market-challenge/data","datasetname":"The Winton Stock Market Challenge","timestring":"Fri Dec 13 14:40:04 2019","uni":"hg2532","description":"File descriptions:
train.csv - the training set, including the columns of:
Feature_1 - Feature_25
Ret_MinusTwo, Ret_MinusOne
Ret_2 - Ret_120
Ret_121 - Ret_180: target variables
Ret_PlusOne, Ret_PlusTwo: target variables
Weight_Intraday, Weight_Daily
test.csv - the test set, including the columns of:
Feature_1 - Feature_25
Ret_MinusTwo, Ret_MinusOne
Ret_2 - Ret_120

","groupstr":"Finance","useterm":"public"},{"url":"https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html","datasetname":"KDD-1998","timestring":"Sat May 16 03:13:19 2020","uni":"zx2276","description":"","groupstr":"Finance","useterm":"public"},{"url":"","datasetname":"","timestring":"Tue Jul 28 06:51:54 2026","uni":"","description":"","groupstr":"","useterm":""},{"url":"https://www.kaggle.com/dahlia25/metacritic-video-game-comments/data","datasetname":"Games' information and their comments","timestring":"Fri Dec 13 03:52:12 2019","uni":"fy2252","description":"There are general information for 5000 games. And there are more than 270 thousands user comments for 3420 games.","groupstr":"Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/c/pubg-finish-placement-prediction/data","datasetname":"PUBG game stats","timestring":"Sat Dec 22 05:05:07 2018","uni":"yw3180","description":"Dataset introduction:
This dataset contains a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

Data fields:
DBNOs - Number of enemy players knocked.
assists - Number of enemy players this player damaged that were killed by teammates.
boosts - Number of boost items used.
damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
headshotKills - Number of enemy players killed with headshots.
heals - Number of healing items used.
Id - Player’s Id
killPlace - Ranking in match of number of enemy players killed.
killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
killStreaks - Max number of enemy players killed in a short amount of time.
kills - Number of enemy players killed.
longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
matchDuration - Duration of match in seconds.
matchId - ID to identify match. There are no matches that are in both the training and testing set.
matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
revives - Number of times this player revived teammates.
rideDistance - Total distance traveled in vehicles measured in meters.
roadKills - Number of kills while in a vehicle.
swimDistance - Total distance traveled by swimming measured in meters.
teamKills - Number of times this player killed a teammate.
vehicleDestroys - Number of vehicles destroyed.
walkDistance - Total distance traveled on foot measured in meters.
weaponsAcquired - Number of weapons picked up.
winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
numGroups - Number of groups we have data for in the match.
maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
","groupstr":"Information","useterm":"public"},{"url":"http://labrosa.ee.columbia.edu/millionsong/","datasetname":"million-song-dataset","timestring":"Sat Dec 22 16:18:38 2018","uni":"jl5173","description":"For dataset, we have three csv files:
features.csv: Timbre features for each song.
ratings.csv: Used for CF recommendation, only contains three columns: user_id, song_id, play counts
song_info.csv: Matches song names to song IDs, so that results can be shown to users with song names instead of song IDs.

Any dataset with user information, song information, and ratings can be used with our software package.

We fetched them from this website: http://labrosa.ee.columbia.edu/millionsong/","groupstr":"Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data","datasetname":"New York City Taxi Fare","timestring":"Fri Dec 21 17:32:36 2018","uni":"dw2834","description":"train.csv - Input features and target fare_amount values for the training set (about 55M rows).
test.csv - Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.

key - Unique string identifying each row in both the training and test sets. It consists of pickup_datetime plus a unique integer, but this doesn't matter; it should just be used as a unique ID field. Required in your submission CSV. Not necessarily needed in the training set, but could be useful to simulate a 'submission file' while doing cross-validation within the training set.
Features

pickup_datetime - timestamp value indicating when the taxi ride started.
pickup_longitude - float for longitude coordinate of where the taxi ride started.
pickup_latitude - float for latitude coordinate of where the taxi ride started.
dropoff_longitude - float for longitude coordinate of where the taxi ride ended.
dropoff_latitude - float for latitude coordinate of where the taxi ride ended.
passenger_count - integer indicating the number of passengers in the taxi ride.

fare_amount - float dollar amount of the cost of the taxi ride. This value is only in the training set; this is what you are predicting in the test set and it is required in your submission CSV.","groupstr":"Energy-Transportation-Industry","useterm":"public"},{"url":"https://mmspg.epfl.ch/food-image-datasets","datasetname":"Food-11","timestring":"Sat Dec 22 05:51:19 2018","uni":"by2267","description":"The dataset we use is Food-11 dataset. It totally contains 16,643 food images, which are divided into three parts. Training dataset includes 9,866 images, validation dataset includes 3,430 images and evaluation dataset includes 3,347 images. There are 11 food categories, which are Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. The total size of the dataset is about 1.16 GB.","groupstr":"Retail,Media","useterm":"public"},{"url":"https://www.kaggle.com/datasets/raghadalharbi/breast-cancer-gene-expression-profiles-metabric ","datasetname":"METABRIC Dataset","timestring":"Tue May 5 16:57:05 2026","uni":"drt2145","description":"Type: Gene expression and clinical metadata
Observations: Approximately 2,000 breast cancer patients
Features: Thousands of gene expression variables and clinical features
Target: Treatment response outcome proxy
Use: Model training and evaluation

The primary dataset used in this project is the METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) dataset. This dataset contains gene expression profiles and clinical metadata such as tumor stage, receptor status (ER, HER2, etc.), and treatment information. The dataset is publicly available and was accessed through Kaggle. ","groupstr":"LifeScience","useterm":"public"},{"url":"https://www.kaggle.com/adamschroeder/crimes-new-york-city#Crime%20Column%20Description.csv","datasetname":"Crime Dataset","timestring":"Sat Dec 22 15:41:38 2018","uni":"zd2212","description":"The crime data comes from Kaggle: 2014-2015 Crimes reported in all 5 boroughs of New York City. This dataset reports 2014-2015 crimes in all 5 boroughs of New York City and contains 23 fields. The attributes that we are going to use are id, timestamp, coordinate, and precinct number. The total size of this dataset is 53MB. This dataset will be further processed into two datasets: (1) processed data1 contains weekday, hour, precinct number, and the number of crimes; (2) processed data2 contains weekday, hour, coordinate, and the number of crimes.","groupstr":"Information","useterm":"public"},{"url":"http://web.mta.info/developers/turnstile.html","datasetname":"MTA Subway Dataset","timestring":"Sat Dec 22 15:43:29 2018","uni":"zd2212","description":"The MTA subway dataset is published on MTA’s official website. Since the crime dataset is from 2014-2015, we downloaded the MTA subway dataset from 2014-2015. The total size of this dataset is about 900MB, including 11 fields. 
The attributes that we care about are station id, timestamp, number of entries, number of exits, station coordinate.","groupstr":"Information","useterm":"public"},{"url":"https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9","datasetname":"311 Service Requests from 2010 to Present","timestring":"Fri Dec 13 09:44:07 2019","uni":"dso2119","description":"All 311 Service Requests from 2010 to present. This information is automatically updated daily.","groupstr":"Finance,Energy-Transportation-Industry,Information,SocialScience-Government","useterm":"public"},{"url":"https://www.kaggle.com/c/kkbox-music-recommendation-challenge/data","datasetname":"KKBOX music recommendation dataset","timestring":"Sun Dec 23 03:09:53 2018","uni":"wt2247","description":"KKBOX provides a training data set consists of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided.","groupstr":"Media","useterm":"public"},{"url":"https://www.zillow.com/research/data/","datasetname":"Zillow Research","timestring":"Sat Dec 22 04:07:12 2018","uni":"mg3847","description":"Open info for housing in US","groupstr":"","useterm":"public"},{"url":"https://snap.stanford.edu/data/twitter7.html","datasetname":"Twitter7","timestring":"Sat Dec 22 02:43:39 2018","uni":"qw2264","description":"The twitter7 dataset is a collection of 476 million tweets collected between June to December in 2009. The size of this dataset is about 25GB and comes from Stanford Large Network Data Collection (SNAP). It includes 17,069,982 users, 476,553,560 tweets, 181,611,080 URLs, 49,293,684 Hashtags and 71,835,017 retweets. 
","groupstr":"Finance,Retail,Media,Information,LifeScience,SocialScience-Government","useterm":"public"},{"url":"http://files.grouplens.org/datasets/movielens/ml-latest.zip","datasetname":"","timestring":"Thu Dec 12 23:47:48 2019","uni":"","description":"","groupstr":"","useterm":"public"},{"url":"http://insideairbnb.com/get-the-data.html","datasetname":"newyorklisting","timestring":"Fri Dec 13 06:54:17 2019","uni":"ms5904","description":"Quantities of datasets providing information about room renting in various cities all over the world.","groupstr":"Information","useterm":"public"},{"url":"https://archive.org/details/twitterstream?sort=-publicdate","datasetname":"twitter history data","timestring":"Sat Dec 14 01:29:11 2019","uni":"sw3385","description":"twitter data for every month, the latest data is 2019/05","groupstr":"Information","useterm":"public"},{"url":"https://www.10xgenomics.com/datasets/xenium-human-lung-cancer-post-xenium-technote","datasetname":"Post-Xenium Technical Note: Xenium v1 and Xenium Prime 5K for FFPE Human Lung Cancer","timestring":"Tue May 12 22:49:35 2026","uni":"pmt2117","description":"This dataset contains matched Xenium and Visium HD spatial transcriptomics data from a human lung cancer sample. The Xenium data provide higher-confidence cell assignments and were used as ground truth for model training. 
The Visium HD data were used as the target dataset, where the goal was to assign 2µm expression bins to individual nuclei and reconstruct cell-level expression profiles.","groupstr":"LifeScience","useterm":"public"},{"url":"https://www.kaggle.com/c/home-credit-default-risk/data","datasetname":"Home Credit ","timestring":"Sun May 19 00:45:26 2019","uni":"rl2987","description":"We used the data set from a Kaggle competition \textit{Predict Credit Default Risk}","groupstr":"","useterm":"public"},{"url":"https://www.kaggle.com/c/facial-keypoints-detector/data","datasetname":" Emotion and identity detection from face images","timestring":"Fri Dec 13 17:32:48 2019","uni":"xd2212","description":"input: 48x48 pixel gray values (between 0 and 255)
target: emotion category (between 0 and 6: anger=0, disgust=1, fear=2, happy=3, sad=4, surprise=5, neutral=6)
auxiliary information which can be used but not as input for predicting emotions at test time: identity (-1: unknown, positive integers = ID, not all contiguous integer values). The auxiliary file contains the identity associated with all the training examples.

Data (14 MB)","groupstr":"Media,LifeScience","useterm":"public"},{"url":"https://www.kaggle.com/grikomsn/amazon-cell-phones-reviews","datasetname":"AmazonPhone","timestring":"Fri Dec 13 16:39:13 2019","uni":"zl2696","description":"We got data set from Kaggle. Our data set is information of phone products from Amazon.com.
Here are some specific descriptions of our dataset.

1. items.csv
Contains 700+ cell phone items from Amazon.com with at least 1 star review.
2. reviews.csv
Contains 70000+ reviews for all products in items.csv.

","groupstr":"Retail","useterm":"public"},{"url":"https://grouplens.org/datasets/movielens/25m/","datasetname":"Movielens","timestring":"Sat Dec 21 02:31:45 2019","uni":"kl3157","description":"MovieLens 20M movie ratings. Stable benchmark dataset. 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Includes tag genome data with 15 million relevance scores across 1,129 tags. Released 12/2019","groupstr":"Information","useterm":"public"},{"url":"https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/data","datasetname":"TakingData Adtraking data","timestring":"Fri Dec 13 15:03:27 2019","uni":"jy3012","description":"Each row of the training data contains a click record, with the following features.

ip: ip address of click.
app: app id for marketing.
device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
os: os version id of user mobile phone
channel: channel id of mobile ad publisher
click_time: timestamp of click (UTC)
attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
is_attributed: the target that is to be predicted, indicating the app was downloaded
Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:

click_id: reference for making predictions
is_attributed: not included","groupstr":"Finance,Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/usdot/flight-delays#flights.csv","datasetname":"2015 Flight Delays and Cancellations","timestring":"Fri Dec 13 02:32:28 2019","uni":"yl4003","description":"The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of 2015 flight delays and cancellations.","groupstr":"Energy-Transportation-Industry","useterm":"public"},{"url":"https://nijianmo.github.io/amazon/index.html","datasetname":"Amazon Review Data (2018)","timestring":"Fri Dec 13 08:47:25 2019","uni":"my2570","description":"This dataset contains product reviews and metadata from Amazon, including 233.1 million reviews spanning May 1996 - October 2018.

This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). This is an updated dataset including: Product information, e.g. color (white or black), size (large or small), package type (hardcover or electronics), product images that are taken after the user received the product, bullet-point descriptions under product title, technical details table (attribute-value pairs), similar products table.","groupstr":"Retail","useterm":"public"},{"url":"https://www.kaggle.com/datasets/orvile/gene-expression-profiles-of-breast-cancer","datasetname":"GSE25066 Dataset","timestring":"Tue May 5 16:58:19 2026","uni":"drt2145","description":"Type: Gene expression and clinical metadata
Observations: Approximately 508 breast cancer patients
Use: External testing and real-world simulation of patient uploads

The GSE25066 dataset from the Gene Expression Omnibus (GEO) was used as an external validation and testing source. This dataset contains 508 patient gene expression samples that were used to simulate real world patient uploads to test the project’s ability to process new patient data and generate predictions. The dataset is publicly available and was accessed through Kaggle.","groupstr":"LifeScience","useterm":"public"},{"url":"http://www.fakenewschallenge.org/","datasetname":"Fake News Detection","timestring":"Sun Dec 23 06:22:35 2018","uni":"yd2466","description":"The dataset we use comes from Stage 1 of the Fake News Challenge. In total, the dataset is comprised of 1683 article bodies, 49972 headlines, and ground truth labels corresponding to each headline-body pair. The size of the joined dataframe is over 100MB. We split the entire dataset into training and validation according to a 85% to 15% ratio.","groupstr":"Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/c/nyc-taxi-trip-duration/data & https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data","datasetname":"New York City Taxi Trip Duration & New York City Taxi Fare Prediction","timestring":"Sat Dec 22 03:43:24 2018","uni":"jh3874","description":"The first dataset is the dataset we downloaded from the Kaggle competition, and its dataset is based on the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). The data was sampled and cleaned for the purposes of this playground competition. Based on individual trip attributes, participants should predict the duration of each trip in the test set.

The second dataset includes the following information:
train.csv - Input features and target fare_amount values for the training set (about 55M rows).
test.csv - Input features for the test set (about 10K rows). Your goal is to predict fare_amount for each row.
sample_submission.csv - a sample submission file in the correct format (columns key and fare_amount). This file 'predicts' fare_amount to be $11.35 for all rows, which is the mean fare_amount from the training set.","groupstr":"","useterm":"public"},{"url":"https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95","datasetname":"NYPD Motor Vehicle Collisions - Crashes","timestring":"Fri Dec 13 08:00:46 2019","uni":"jc5020","description":"The Motor Vehicle Collisions crash table contains details on the crash event. Each row represents a crash event. The Motor Vehicle Collisions data tables contain information from all police reported motor vehicle collisions in NYC.","groupstr":"Information","useterm":"public"},{"url":"https://gitlab.com/ucdavisnlp/persuasionforgood","datasetname":"persuasive chatlog","timestring":"Fri May 15 20:17:14 2020","uni":"ml4407","description":"The dataset PersuasionForGood is in directory data/. There are two subdirectories under data/, 1) data/AnnotatedData, which contains the 300 annotated dialogs along with the participants information, 2) data/FullData, which contains all the 1017 dialogs along the participants information.","groupstr":"Information,SocialScience-Government","useterm":"public"},{"url":" https://www.wikiart.org/","datasetname":"WikiArts","timestring":"Fri Dec 13 04:40:52 2019","uni":"cz2517","description":"This is a dataset containing more than 20,000 paintings of different artists from different period of time. ","groupstr":"Media","useterm":"public"},{"url":"https://www.kaggle.com/stackoverflow/stacksample","datasetname":"StackSample: 10% of Stack Overflow Q&A","timestring":"Sat Dec 22 17:15:06 2018","uni":"yl4020","description":"Dataset with the text of 10% of questions and answers from the Stack Overflow programming Q&A website.

This is organized as three tables:
Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.

Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.

Tags contains the tags on each of these questions","groupstr":"Information","useterm":"public"},{"url":"https://github.com/uchidalab/book-dataset","datasetname":"Normalized Book Cover Dataset","timestring":"Fri Dec 13 08:49:51 2019","uni":"my2570","description":"This dataset contains 207,572 books from the Amazon and is enriched by genre and full (224x224x3) images in one easy download instead of scraping each cover and normalizing it.

All book cover images are hosted by and copyright Amazon.com, Inc. The the use of the book cover images is fair use for academic purposes.","groupstr":"Retail","useterm":"public"},{"url":"https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html","datasetname":"Cornell Movie Dialogue Corpus ","timestring":"Sat May 18 02:38:06 2019","uni":"zl2697","description":"This corpus contains a large metadata-rich collection of fictional conversations extracted from raw movie scripts:","groupstr":"Information","useterm":"public"},{"url":"https://www.animecharactersdatabase.com/index.php","datasetname":"ANIME CHARACTERS DATABASE","timestring":"Fri Dec 13 08:10:21 2019","uni":"yz3365","description":"This is a dataset that contains lots of different anime characters. Users can search the wanted characters by category and key words.","groupstr":"Media","useterm":"public"},{"url":"https://s3.amazonaws.com/peyck.es/BDA_Project/oct18-oct19.csv","datasetname":"A year of Tweets about Nike. ","timestring":"Fri Dec 13 16:42:34 2019","uni":"zmp2105","description":"We gathered a year of tweets at or about the brand Nike. We collected this data using python libraries to ensure that all tweets containing the word or hashtag \"Nike\" and tweets to the official Nike accounts were included. We did not use the twitter API because of the limitations on query history and rate. The data set is (~1.1GB) and contains over 4 million samples. Note that there are less than ten days that have no tweets. We repeatedly received 403 errors while attempting to collect tweets for some days. ","groupstr":"Finance,Media","useterm":"public"},{"url":"https://archive.ics.uci.edu/ml/datasets/HIGGS","datasetname":"Higgs Boson ","timestring":"Sat Dec 22 00:33:52 2018","uni":"sp3567","description":"The size of the dataset is 8.3GB and consists of 11 million entries. The data was produced using Monte-Carlo simulations. 
There are a total of 28 features of which the first 21 are kinematic features and the last 7 features were functions of the first 21 features; these are high-level features derived by physicists to help discriminate between the two classes.

Download link: https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz","groupstr":"LifeScience","useterm":"public"},{"url":"https://www.kaggle.com/semioniy/predictemall/home","datasetname":"Predict'em All","timestring":"Fri Dec 21 21:50:49 2018","uni":"tl2861","description":"Overview

PokemonGo is a mobile augmented reality game developed by Niantic inc. for iOS, Android, and Apple Watch devices. It was initially released in selected countries in July 2016. In the game, players use a mobile device's GPS capability to locate, capture, battle, and train virtual creatures, called Pokémon, who appear on the screen as if they were in the same real-world location as the player.

Dataset

Dataset consists of roughly 293,000 pokemon sightings (historical appearances of Pokemon), having coordinates, time, weather, population density, distance to pokestops/gyms etc. as features. The target is to train a machine learning algorithm so that it can predict where pokemon appear in the future. So, can you predict'em all?

Feature description

pokemonId - the identifier of a pokemon, should be deleted to not affect predictions. (numeric; ranges between 1 and 151)
latitude, longitude - coordinates of a sighting (numeric)
appearedLocalTime - exact time of a sighting in format yyyy-mm-dd'T'hh-mm-ss.ms'Z' (nominal)
cellId 90-5850m - geographic position projected on a S2 Cell, with cell sizes ranging from 90 to 5850m (numeric)
appearedTimeOfDay - time of the day of a sighting (night, evening, afternoon, morning)
appearedHour/appearedMinute - local hour/minute of a sighting (numeric)
appearedDayOfWeek - week day of a sighting (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
appearedDay/appearedMonth/appearedYear - day/month/year of a sighting (numeric)
terrainType - terrain where pokemon appeared described with help of GLCF Modis Land Cover (numeric)
closeToWater - did pokemon appear close (100m or less) to water (Boolean, same source as above)
city - the city of a sighting (nominal)
continent (not always parsed right) - the continent of a sighting (nominal)
weather - weather type during a sighting (Foggy, Clear, PartlyCloudy, MostlyCloudy, Overcast, Rain, BreezyandOvercast, LightRain, Drizzle, BreezyandPartlyCloudy, HeavyRain, BreezyandMostlyCloudy, Breezy, Windy, WindyandFoggy, Humid, Dry, WindyandPartlyCloudy, DryandMostlyCloudy, DryandPartlyCloudy, DrizzleandBreezy, LightRainandBreezy, HumidandPartlyCloudy, HumidandOvercast, RainandWindy) // Source for all weather features
temperature - temperature in celsius at the location of a sighting (numeric)
windSpeed - speed of the wind in km/h at the location of a sighting (numeric)
windBearing - wind direction (numeric)
pressure - atmospheric pressure in bar at the location of a sighting (numeric)
weatherIcon - a compact representation of the weather at the location of a sighting (fog, clear-night, partly-cloudy-night, partly-cloudy-day, cloudy, clear-day, rain, wind)
sunriseMinutesMidnight-sunsetMinutesBefore - time of appearance relative to sunrise/sunset
population density - the population density per square km at the location of a sighting (numeric)
urban-rural - how urban is location where pokemon appeared (Boolean, built on Population density, <200 for rural, >=200 and <400 for midUrban, >=400 and <800 for subUrban, >800 for urban)
gymDistanceKm, pokestopDistanceKm - how far is the nearest gym/pokestop in km from a sighting? (numeric, extracted from this dataset)
gymIn100m-pokestopIn5000m - is there a gym/pokestop in 100/200/etc meters? (Boolean)
cooc 1-cooc 151 - co-occurrence with any other pokemon (pokemon ids range between 1 and 151) within 100m distance and within the last 24 hours (Boolean)
class - says which pokemonId it is, to be predicted.
Data dump

All pokemon sightings (in JSON file, without features) can be found in Discussion \"Datadump\"","groupstr":"Information","useterm":"public"},{"url":"https://www.kaggle.com/ikarus777/best-artworks-of-all-time","datasetname":"Best Artworks of All time","timestring":"Fri Dec 13 04:38:02 2019","uni":"cz2517","description":"The dataset contains a collection of paintings of 50 most influential artists in the history.

This dataset contains three files:
artists.csv: dataset of information for each artist
images.zip: collection of images (full size), divided in folders and sequentially numbered
resized.zip: same collection but images have been resized and extracted from folder structure","groupstr":"Media","useterm":"public"},{"url":"https://www.kaggle.com/annavictoria/speed-dating-experiment","datasetname":"Speed Dating Experiment","timestring":"Sat Dec 22 22:57:55 2018","uni":"sj2909","description":"The dataset used in the project is a questionnaire collection from speed dating experiments during the period of 2002 to 2004, collected by Columbia Business School. The dataset contains 8378 rows and 195 columns, including participants’ demographics, interests, attribute scores, etc. The dataset is publicly available at https://www.kaggle.com/annavictoria/speed-dating-experiment","groupstr":"","useterm":"public"},{"url":"https://mimic.physionet.org/about/mimic/","datasetname":"MIMIC-III Critical Care Database","timestring":"Sat May 16 01:38:13 2020","uni":"sl4653","description":"MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.","groupstr":"LifeScience","useterm":"public"},{"url":"https://s3.amazonaws.com/amazon-reviews-pds/readme.html","datasetname":"Amazon Customer Reviews Dataset.","timestring":"Sun Dec 23 05:25:18 2018","uni":"ad3363","description":"A large set of Amazon customer reviews, segmented by product category.","groupstr":"","useterm":"public"},{"url":"NYC Yellow Taxis Dataset","datasetname":"NYC Yellow Taxis Dataset","timestring":"Sat Dec 22 15:44:07 2018","uni":"zd2212","description":"Yellow taxis dataset covered Manhattan and this dataset contains detailed timestamp and number of passenger. We retrieve this dataset from NYC open data7. This dataset contains 19 fields. The total size of this dataset is about 25GB. 
The attributes that we care about are the timestamp, the number of passengers, pickup coordinate, drop off coordinate.","groupstr":"Information","useterm":"public"},{"url":"http://files.grouplens.org/datasets/movielens/ml-latest.zip | https://ai.stanford.edu/~amaas/data/sentiment/","datasetname":"MovieLens Dataset and IMdB Comments Dataset","timestring":"Thu Dec 12 23:59:14 2019","uni":"mka2156","description":"MovieLens Dataset: 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Includes tag genome data with 14 million relevance scores across 1,100 tags.

IMdB Dataset: This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided.","groupstr":"Media,Information,Telecom","useterm":"public"},{"url":"https://www.alphavantage.co/support/#api-key","datasetname":"Alpha Vantage","timestring":"Fri Dec 13 14:41:20 2019","uni":"hg2532","description":"Real time data for Stock predictions","groupstr":"Finance","useterm":"public"},{"url":"https://www.kaggle.com/rtatman/188-million-us-wildfires","datasetname":"1.88 Million US Wildfires (1992 to 2015)","timestring":"Fri Dec 13 16:42:57 2019","uni":"sb4283","description":"This data publication contains a spatial database of wildfires that occurred in the United States from 1992 to 2015. It is the third update of a publication originally generated to support the national Fire Program Analysis (FPA) system. The wildfire records were acquired from the reporting systems of federal, state, and local fire organizations. The following core data elements were required for records to be included in this data publication: discovery date, final fire size, and a point location at least as precise as Public Land Survey System (PLSS) section (1-square mile grid). The data were transformed to conform, when possible, to the data standards of the National Wildfire Coordinating Group (NWCG). Basic error-checking was performed and redundant records were identified and removed, to the degree possible. The resulting product, referred to as the Fire Program Analysis fire-occurrence database (FPA FOD), includes 1.88 million geo-referenced wildfire records, representing a total of 140 million acres burned during the 24-year period.

Reference: Taken from Kaggle description

Short, Karen C. 2017. Spatial wildfire occurrence data for the United States, 1992-2015 [FPA_FOD_20170508]. 4th Edition. Fort Collins, CO: Forest Service Research Data Archive. https://doi.org/10.2737/RDS-2013-0009.4","groupstr":"LifeScience,SocialScience-Government","useterm":"public"},{"url":"https://polygon.io","datasetname":"Daily Equities Data","timestring":"Tue May 13 23:24:45 2025","uni":"jl6962","description":"Daily OHLCV bars (Open, High, Low, Close, Volume) for a select group of large-cap stocks and the SPY ETF, spanning January to March 2023, used for feature engineering and backtesting in the multi-agent trading system.","groupstr":"Finance,Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/datasnaek/league-of-legends","datasetname":"(LoL) League of Legends Ranked Games","timestring":"Sat Dec 22 21:59:52 2018","uni":"kl3065","description":" This is a collection of over 50,000 ranked EUW games from the game League of Legends containing multiple game fields. This dataset was collected using the Riot Games API, which makes it easy to lookup and collect information on a users ranked history and collect their games. ","groupstr":"Information","useterm":"public"},{"url":"https://www.kaggle.com/netzone/eda-and-fraud-detection/data","datasetname":"Synthetic Financial Datasets For Fraud Detection","timestring":"Sat Dec 22 00:07:25 2018","uni":"ax2127","description":"There is a fraud detection data on Kaggle competition (https://www.kaggle.com/netzone/eda-and-fraud-detection/data), called Synthetic Financial Datasets For Fraud Detection. The dataset has 6,362,320 rows, The variables we would consider in this dataset are transaction amount (Column Name: amount), user IDs (Column Name: nameOrig, nameDest), and we could simulate other node attributes based on the transaction frequency or account information. 
We performed bootstrap, modification, and simulation according to the basic logics we defined for money laundering activities.","groupstr":"Finance","useterm":"public"},{"url":"https://bdd-data.berkeley.edu/","datasetname":"BDD100k ","timestring":"Thu May 14 02:56:42 2020","uni":"gy2278","description":"The largest and most diverse driving video and picture dataset with rich annotations called BDD100K.","groupstr":"Information","useterm":"public"},{"url":"https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data","datasetname":"Wikipedia Comments","timestring":"Fri Dec 21 18:58:43 2018","uni":"cs3736","description":"A large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

toxic
severe_toxic
obscene
threat
insult
identity_hate","groupstr":"Media,Information,LifeScience,SocialScience-Government","useterm":"public"},{"url":"https://archive.org/details/twitterstream","datasetname":"Twitter 2014 Dataset","timestring":"Sat Dec 22 15:45:37 2018","uni":"zd2212","description":"The twitter dataset that we had used in this project is extremely large. The total size of the data is 313.7GB. This dataset contains useful information like timestamp, context, user, location. To process this dataset, we read the data chunk by chunk because of its large size. Then we clean up the data and use the NLP-based algorithm to calculate sentiment using the context provided. Eventually, we aggregated the data by weekday and hour which eventually conduct the processed data, including weekday, hour, number of positive, number of neutral, number of negative information. It worth mentioning that we didn’t find the twitter dataset cover the whole year. We found the data covered Feb, Mar, Apr, May, Oct, Nov, and Dec in 2014.","groupstr":"Information","useterm":"public"},{"url":"https://osf.io/wxvth/","datasetname":"Johnsons-IPIP-NEO","timestring":"Sun May 17 08:34:15 2020","uni":"nc2677","description":"Data from Johnson, J. A. (2014). Measuring thirty facets of the five factor model with a 120-item public domain inventory: Development of the IPIP-NEO-120. J. of Research in Personality, 51, 78-89.

IPIP300.dat contains 307,313 cases of item responses to the IPIP-NEO-300 in ASCII format. IPIP300.por.zip contains those data in portable SPSS format. DAT300.doc is a MS Word file that describes the formatting of these data files.

IPIP120.dat contains 619,150 cases of item responses to the IPIP-NEO-120. IPIP120-.por.zip contains those data in portable SPSS format. DAT120.doc is a MS Word file that describes the formatting of these data files.","groupstr":"SocialScience-Government","useterm":"public"},{"url":"https://www.kaggle.com/rounakbanik/the-movies-dataset","datasetname":"TMDB Movie Data Set","timestring":"Fri Dec 13 06:01:01 2019","uni":"xz2809","description":"These files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.

This dataset also has files containing 26 million ratings from 270,000 users for all 45,000 movies. Ratings are on a scale of 1-5 and have been obtained from the official GroupLens website.","groupstr":"Information","useterm":"public"},{"url":"https://on.ny.gov/2qa8Qlm","datasetname":"Hospital Inpatient Discharges (SPARCS De-Identified): 2017","timestring":"Fri Dec 13 06:40:15 2019","uni":"yl4042","description":"The Statewide Planning and Research Cooperative System (SPARCS) Inpatient De-identified File contains discharge level detail on patient characteristics, diagnoses, treatments, services, and charges. This data file contains basic record level detail for the discharge. The de-identified data file does not contain data that is protected health information (PHI) under HIPAA. The health information is not individually identifiable; all data elements considered identifiable have been redacted. For example, the direct identifiers regarding a date have the day and month portion of the date removed.","groupstr":"LifeScience","useterm":"public"},{"url":"https://mimic.physionet.org/","datasetname":"MIMIC-III","timestring":"Fri May 15 20:50:41 2020","uni":"hl3353","description":"In this project, we used detailed and comprehensive patients’ data from MIMIC-III dataset which contains 26 tables and medical records for 41000+ critical care patients. MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database consisting of deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. It includes demographics, vital signs, laboratory tests, medications, and more. Details are available on the MIMIC website: https://mimic.physionet.org/
","groupstr":"LifeScience","useterm":"public"},{"url":"https://github.com/googlecreativelab/quickdraw-dataset","datasetname":"The Quick, Draw! Dataset","timestring":"Sat Dec 22 19:59:09 2018","uni":"xl2719","description":"The Quick Draw Dataset consists of 50 million drawings across 345 categories, contributed by players of the game Quick, Draw!.","groupstr":"","useterm":"public"},{"url":"https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data","datasetname":"NYC airbnb","timestring":"Fri Dec 13 15:03:28 2019","uni":"ty2417","description":"Airbnb listings and metrics in NYC, NY, USA. It's a public dataset from Airbnb, including 48k rows and 16 columns.","groupstr":"Information","useterm":"public"},{"url":"http://kahlan.eps.surrey.ac.uk/savee/Download.html","datasetname":"SAVEE","timestring":"Sat May 16 02:21:43 2020","uni":"ka2744","description":"4 subjects, 480 sequences used for training and evaluation of the audio pipeline","groupstr":"Information","useterm":"public"},{"url":"http://www.tse.com.tw/zh/","datasetname":"Taiwan Exchange Corporation Dataset","timestring":"Sat Dec 22 08:53:53 2018","uni":"cc4338","description":"1.Stock data from 1992 are provided
2.Up to 900 companies on the stock market
3.About 250 trading days for each year
4.Data size is 0.3KB for each trading day
5.The total data size is roughly 27*900*250*0.3≈1800MB
","groupstr":"","useterm":"public"},{"url":"https://github.com/Sapphirine/202005-23-Building-Clinical-Decision-Support-System-for-CVD/blob/master/admission.csv","datasetname":"admission","timestring":"Sat May 16 03:52:37 2020","uni":"xw2657","description":"The admission dataset used in this study contains electronic medical records of 61000+ patients, with 19 attributes containing the basic information of the patients, such as their admission number, insurance, language, and diagnosis from the doctor. With the understanding of the data, we found there exists 15,693 distinct diagnoses which means that different patients will share the same disease with different but similar symptoms. Other important attributes such as ADMITTIME provides the date and time the patient was admitted to the hospital, while DISCHTIME provides the date and time the patient was discharged from the hospital. If applicable, DEATHTIME provides the time of in-hospital death for the patient. ","groupstr":"LifeScience","useterm":"public"},{"url":"http://www.seanlahman.com/baseball-archive/statistics/","datasetname":"Lahmen’s Baseball Database","timestring":"Sat Dec 22 01:41:04 2018","uni":"yz3284","description":"This database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2017 with 28 datasets. It includes data from the two current leagues (American and National), the four other \"major\" leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875.","groupstr":"","useterm":"public"},{"url":"https://nlds.soe.ucsc.edu/node/34/done?sid=6810&token=13e8a7bd7d224be85d562a5edd1022c5","datasetname":"Movie Scripts","timestring":"Fri Dec 13 10:02:36 2019","uni":"dd2941","description":"We used data from UC Santa Cruz (Natural Language and Dialogue Systems, Film Corpus) and OMDB API. We got the scripts and box office data from them respectively. 
We then transformed the scripts to get the emotion and sentiment analysis.

Film Corpus has all scripts free to download. From OMDB we used their API and got the tier, which was assigned based on box office profits.

OMDB API: was removed not long ago. We got the data before it stopped working.","groupstr":"Media,Information,Telecom","useterm":"public"},{"url":"https://www.kaggle.com/daveianhickey/2000-16-traffic-flow-england-scotland-wales","datasetname":"","timestring":"Fri Dec 13 19:33:04 2019","uni":"","description":"","groupstr":"","useterm":"public"},{"url":"https://www.kaggle.com/wendykan/lending-club-loan-data","datasetname":"Lending Club Loan Data","timestring":"Sun Dec 23 06:05:20 2018","uni":"es3573","description":"Taken from Kaggle overview:

\"These files contain complete loan data for all loans issued through the 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the \"present\" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes, and state, and collections among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.\"","groupstr":"Finance","useterm":"public"},{"url":"https://tianchi.aliyun.com/getStart/information.htm?spm=5176.100067.5678.2.77b655052XICWe&raceId=231670","datasetname":"fashionAI apparel dataset ","timestring":"Sat Dec 22 17:40:12 2018","uni":"jb4076","description":"Machine analysis of apparel could easily be affected by the dimension and shape of clothes, the distance and angle of camera shooting, and even how the apparel is displayed or the model is posing. Detection of apparel key points in images can help to improve the performance of applications such as alignment of clothes, recognition of clothes local attributes and auto-editing of apparel images.

We provide a dataset for key point localization of the apparel in practical scenarios. In this dataset, a set of clothing key points is defined on the basis of fashion design. The set is further refined into six subsets according to the following six women apparel categories: blouse, outwear, trousers, skirt, dress and jumpsuit, respectively. Currently, the dataset of the first five categories is open for download (the jumpsuit category is omitted since it is uncommon in the real-world scenario), including 41 sub-categories and 24 kinds of key points in total. There are altogether 100,000 annotated images in this dataset.","groupstr":"","useterm":"public"},{"url":"https://developer.pubg.com/","datasetname":"PUBG Match and Telemetry Data","timestring":"Fri Dec 21 16:28:06 2018","uni":"yz3477","description":"The dataset is collected from the official PUBG Developer API, which returns data based on certain formats of URL that user requests, with a rate limit of 10 entries per minute.
The size of the dataset is around 55 GB.
Specifically, the dataset contains 30,000 instances of match data, each of which records general information of a match, and 5,000 instances of telemetry data, each of which records detailed information of events that happened in a match. Each instance of the dataset is stored in the JSON format. ","groupstr":"Information","useterm":"public"},{"url":"https://ai.google/tools/datasets/google-facial-expression/","datasetname":"Google facial expression comparison dataset","timestring":"Fri Dec 13 17:28:23 2019","uni":"xd2212","description":"This dataset is a large-scale facial expression dataset consisting of face image triplets along with human annotations that specify which two faces in each triplet form the most similar pair in terms of facial expression. Each triplet in this dataset was annotated by six or more human raters. This dataset is quite different from existing expression datasets that focus mainly on discrete emotion classification or action unit detection.

This dataset is intended to aid researchers working on topics related to facial expression analysis such as expression-based image retrieval, expression-based photo album summarization, emotion classification, expression synthesis, etc.","groupstr":"Media,LifeScience","useterm":"public"},{"url":"https://www.kaggle.com/c/understanding_cloud_organization/overview","datasetname":"Understanding Clouds from Satellite Images","timestring":"Fri Dec 13 18:44:30 2019","uni":"yx2478","description":"6 GB satellite image data including 5546 train images, 3698 test images and a train.csv with 2 columns: 1) train image indices with label names: Sugar, Fish, Flower, Gravel. 2) Encoded pixels are provided for each label in each image.
","groupstr":"LifeScience","useterm":"public"},{"url":"http://help.sentiment140.com/for-students/","datasetname":"Sentiment140","timestring":"Sun Dec 23 05:24:25 2018","uni":"ad3363","description":"For Academics - Sentiment140 - A Twitter Sentiment Analysis Tool","groupstr":"","useterm":"public"},{"url":"https://www.kaggle.com/shubhmamp/english-premier-league-match-data","datasetname":"English Premier League In-match Stats from Season 14-15 to Season 17-18","timestring":"Fri Dec 13 14:40:22 2019","uni":"qg2172","description":"This dataset contains all match stats from Season 14-15 to Season 17-18 of the English Premier League. It has four folders, each corresponding to records of one specific season, and each folder contains two JSON files.
One of the files describes in-game match stats for every match like player touches, passes, shots, yellow cards, saves, etc. Some of the stats are available as aggregate stats for the entire team and some of them are player specific.
The second file describes general match outcomes like the full time and half-time scores, etc.
This dataset contains stats of 1500+ matches, 1700+ players, and 1M+ records, which provides a solid foundation for predicting player performance and how a particular player/team plays against another.
","groupstr":"Information","useterm":"public"},{"url":"https://www.yelp.com/dataset/challenge","datasetname":"Yelp Dataset","timestring":"Sun May 19 01:12:43 2019","uni":"jz2985","description":"It is a public data set provided by Yelp.","groupstr":"","useterm":"public"},{"url":"https://s3.amazonaws.com/amazon-reviews-pds/tsv/index.txt","datasetname":"Amazon Reviews Dataset","timestring":"Sun Dec 23 04:56:25 2018","uni":"jl5102","description":"Contains product reviews from Amazon.com

SAMPLE CONTENT:
https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_us.tsv
https://s3.amazonaws.com/amazon-reviews-pds/tsv/sample_fr.tsv

DATA COLUMNS:
marketplace - 2 letter country code of the marketplace where the review was written.
customer_id - Random identifier that can be used to aggregate reviews written by a single author.
review_id - The unique ID of the review.
product_id - The unique Product ID the review pertains to. In the multilingual dataset the reviews
for the same product in different countries can be grouped by the same product_id.
product_parent - Random identifier that can be used to aggregate reviews for the same product.
product_title - Title of the product.
product_category - Broad product category that can be used to group reviews
(also used to group the dataset into coherent parts).
star_rating - The 1-5 star rating of the review.
helpful_votes - Number of helpful votes.
total_votes - Number of total votes the review received.
vine - Review was written as part of the Vine program.
verified_purchase - The review is on a verified purchase.
review_headline - The title of the review.
review_body - The review text.
review_date - The date the review was written.","groupstr":"Retail","useterm":"public"},{"url":"https://www.yelp.com/dataset","datasetname":"Yelp Open Dataset","timestring":"Fri Dec 13 10:20:27 2019","uni":"ab4685","description":"The Yelp dataset contains about 6.6M reviews for over 190k businesses. It contains Business Data, User Data, Reviews data, Checkin and Tips data and the data is in the json format. We found the dataset online from Yelp’s open data challenge. ","groupstr":"Information","useterm":"public"},{"url":"https://quickdraw.withgoogle.com/data","datasetname":"Quick Draw Dataset","timestring":"Fri Dec 13 14:58:59 2019","uni":"xs2291","description":"Over 15 million players have contributed millions of drawings playing Quick, Draw! These doodles are a unique data set that can help developers train new neural networks, help researchers see patterns in how people around the world draw, and help artists create things we haven’t begun to think of. That’s why we’re open-sourcing them, for anyone to play with.","groupstr":"Media,Information","useterm":"public"},{"url":"https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments","datasetname":"Google Public Bigquery: Reddit Comments and Posts","timestring":"Fri Dec 13 16:42:51 2019","uni":"tak2151","description":"Public BigQuery dataset containing Reddit comments/post data from all subreddits, dating back to 2005. Approximately 1.7 Billion Comments in total, containing also user information, flairs, upvotes/downvotes, etc. ","groupstr":"Media,Information","useterm":"public"},{"url":"https://davischallenge.org/index.html","datasetname":"Davis: Densely Annotated VIdeo Segmentation","timestring":"Fri Dec 13 05:34:32 2019","uni":"yl4238","description":"DAVIS (Densely Annotated VIdeo Segmentation), consists of fifty high quality, full HD video sequences, spanning multiple occurrences of common video object segmentation challenges such as occlusions, motion-blur and appearance changes. 
Each video is accompanied by densely annotated, pixel-accurate and per-frame ground truth segmentation.","groupstr":"Media,Information","useterm":"public"},{"url":"https://www.kaggle.com/c/ga-customer-revenue-prediction/data","datasetname":"Google Merchandise Store Data ","timestring":"Sat Dec 22 00:45:58 2018","uni":"xc2418","description":"1.fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
2.channelGrouping - The channel via which the user came to the Store.
3.date - The date on which the user visited the Store.
4.device - The specifications for the device used to access the Store.
5.geoNetwork - This section contains information about the geography of the user.
6.socialEngagementType - Engagement type, either \"Socially Engaged\" or \"Not Socially Engaged\".
7.totals - This section contains aggregate values across the session.
8.trafficSource - This section contains information about the Traffic Source from which the session originated.
9.visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
10.visitNumber - The session number for this user. If this is the first session, then this is set to 1.
11.visitStartTime - The timestamp (expressed as POSIX time).
12.hits - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
13.customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
","groupstr":"","useterm":"public"},{"url":"https://ai.stanford.edu/~acoates/stl10/","datasetname":"STL-10","timestring":"Sat May 16 02:58:04 2020","uni":"yh3223","description":"The STL-10 dataset is an image recognition dataset for developing unsupervised feature learning, deep learning, self-taught learning algorithms. It is inspired by the CIFAR-10 dataset but with some modifications. In particular, each class has fewer labeled training examples than in CIFAR-10, but a very large set of unlabeled examples is provided to learn image models prior to supervised training. The primary challenge is to make use of the unlabeled data (which comes from a similar but different distribution from the labeled data) to build a useful prior. We also expect that the higher resolution of this dataset (96x96) will make it a challenging benchmark for developing more scalable unsupervised learning methods.","groupstr":"Information","useterm":"public"},{"url":"https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data","datasetname":"FER2013, KMU-FED, CK+","timestring":"Sat May 16 00:10:05 2020","uni":"lg3095","description":"The data consists of 48x48 pixel grayscale images of faces. The faces have been automatically registered so that the face is more or less centered and occupies about the same amount of space in each image. The task is to categorize each face based on the emotion shown in the facial expression in to one of seven categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral).

train.csv contains two columns, \"emotion\" and \"pixels\". The \"emotion\" column contains a numeric code ranging from 0 to 6, inclusive, for the emotion that is present in the image. The \"pixels\" column contains a string surrounded in quotes for each image. The contents of this string are space-separated pixel values in row-major order. test.csv contains only the \"pixels\" column and your task is to predict the emotion column.

The training set consists of 28,709 examples. The public test set used for the leaderboard consists of 3,589 examples. The final test set, which was used to determine the winner of the competition, consists of another 3,589 examples.

This dataset was prepared by Pierre-Luc Carrier and Aaron Courville, as part of an ongoing research project. They have graciously provided the workshop organizers with a preliminary version of their dataset to use for this contest.","groupstr":"Information","useterm":"public"}]});