If you have ever worked on a personal data science project, you have probably spent a lot of time browsing the Internet and looking for interesting datasets to analyze. This article will show you some places to find such datasets.
Data.world describes itself as “the social network for data people”, but it could be more accurately described as “GitHub for data”. It is a place where you can search for, copy, analyze, and download datasets. In addition, you can upload your own data to data.world and use it to collaborate with others.
In a relatively short time it has become one of the “go to” places to acquire data, with many user-contributed datasets as well as fantastic datasets available through data.world’s partnerships with various organizations, including a large amount of data from the US Federal Government. …
MongoDB is a cross-platform, document-oriented, distributed database program. Classified as a NoSQL database, MongoDB stores JSON-like documents with optional schemas. It is designed for high read and write throughput, and its flexible document model lets you store varied kinds of data without defining a fixed schema up front.
Elasticsearch is an open source search and analytics engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr.
MongoDB is used for storage, and Elasticsearch is used to perform full-text indexing over the data. Hence, the combination of MongoDB for storing and Elasticsearch for indexing is a common architecture that many organizations follow. …
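One detail of this architecture can be sketched without a running server: reshaping a MongoDB-style document into an indexing action for Elasticsearch. The sketch below is an illustration under assumptions, not a prescribed pipeline. The function name `to_bulk_action`, the index name `articles`, and the sample document are all hypothetical. The action shape (`_index`, `_id`, `_source`) is the one commonly passed to the bulk helpers of the official Python client, and the `_id` field is moved out of the document body because Elasticsearch treats it as metadata.

```python
def to_bulk_action(mongo_doc, index_name):
    """Turn a MongoDB-style document into an Elasticsearch bulk action dict."""
    # Everything except the MongoDB primary key becomes the indexed body.
    source = {key: value for key, value in mongo_doc.items() if key != "_id"}
    return {
        "_index": index_name,
        # str() also handles ObjectId values returned by a real MongoDB driver.
        "_id": str(mongo_doc["_id"]),
        "_source": source,
    }


# Hypothetical document, as it might come back from a MongoDB query.
doc = {"_id": "64b7f0c2a1", "title": "Quarterly report", "body": "Revenue grew in Q2"}
action = to_bulk_action(doc, "articles")
```

In a real deployment, a list of such actions would be sent to Elasticsearch in batches, while MongoDB remains the system of record.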
Anomaly or outlier detection is the process of identifying data points, observations, or events that deviate from the normal behaviour or distribution of a dataset. Anomalous data can indicate potentially critical incidents, such as fraudulent transactions, network intrusions, or technical failures. In contrast to standard classification or prediction tasks, anomaly detection is often applied to unlabelled datasets, taking only the internal structure and correlations of the data into account.
Numerous machine learning models are suitable for outlier detection. However, supervised models are more constraining than unsupervised models as they need to be provided with labelled datasets. This requirement is particularly expensive when the labelling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to outlier detection, can also affect the efficiency of supervised models. …
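To make the unsupervised setting concrete, here is a minimal sketch that flags outliers without any labels, using a modified z-score based on the median absolute deviation (MAD). This is a simple statistical rule chosen for illustration, not one of the machine learning models the text refers to. The function name `mad_outliers`, the sample readings, and the 3.5 cutoff (a commonly used rule of thumb) are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold."""
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical sensor readings with one obvious anomaly at index 6.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
outlier_indices = mad_outliers(readings)  # [6]
```

Because the median and MAD are barely affected by a few extreme values, this rule tolerates the heavy class imbalance that is typical of outlier detection.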
Pandas is the most popular library for data wrangling and processing in Python. It offers many functions that make data manipulation and transformation simple and flexible. However, Pandas is known to have issues with scalability and efficiency.
By default, Pandas executes its functions in a single process using only one CPU core, so it does not take advantage of all the cores and computing power available on your system. When handling large datasets or extensive calculations, Pandas can become very slow.
Modin is a new lightweight library designed to parallelize and accelerate Pandas DataFrames by automatically distributing the computation across all of the system’s available CPU cores. As a result, Modin claims a speed-up that scales nearly linearly with the number of CPU cores on your system, for Pandas DataFrames of any size [1]. …
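The general idea of splitting data into partitions, processing each partition independently, and combining the partial results can be sketched with the standard library alone. This is not Modin's implementation. The names `partial_stats` and `chunked_mean` and the chunking scheme are assumptions made for illustration, and a thread pool is used only to keep the sketch portable. In CPython, threads do not speed up pure-Python CPU-bound work, so a real speed-up requires worker processes or a library such as Modin.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Compute the partial result for one partition: (sum, count)."""
    return sum(chunk), len(chunk)


def chunked_mean(values, n_workers=None):
    """Compute a mean by partitioning the data and combining partial results."""
    if not values:
        raise ValueError("values must not be empty")
    n_workers = n_workers or os.cpu_count() or 1
    # Ceiling division, so that at most n_workers partitions are created.
    size = max(1, -(-len(values) // n_workers))
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


data = list(range(1, 1001))
mean_value = chunked_mean(data, n_workers=4)  # 500.5
```

The combine step works here because a mean can be rebuilt from per-partition sums and counts. Operations that need to see all rows at once are harder to partition in this way.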