If you have ever worked on a personal data science project, you have probably spent a lot of time browsing the Internet and looking for interesting datasets to analyze. This article will show you some places to find such datasets.

1. Data.world


Data.world describes itself as “the social network for data people”, but it could be more accurately described as “GitHub for data”. It is a place where you can search for, copy, analyze, and download datasets. In addition, you can upload your own data to data.world and use it to collaborate with others.

In a relatively short time it has become one of the “go-to” places to acquire data, with many user-contributed datasets as well as datasets available through data.world’s partnerships with various organizations, including a large amount of data from the US Federal Government. …


Three Ways to Index MongoDB with Elasticsearch


1. Introduction

MongoDB is a cross-platform, document-oriented, distributed database. Classified as a NoSQL database, it stores JSON-like documents with optional schemas. It is designed for high read and write throughput, and its flexible document model lets you store loosely structured data without defining a schema up front.

Elasticsearch is an open-source search and analytics engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine, followed by Apache Solr.

MongoDB is used for storage, and Elasticsearch is used to perform full-text indexing over the data. Hence, the combination of MongoDB for storing and Elasticsearch for indexing is a common architecture that many organizations follow. …
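To make the storage-plus-indexing split concrete, here is a minimal sketch of one step any such pipeline needs: turning MongoDB-style documents into the newline-delimited JSON body that Elasticsearch's bulk endpoint accepts. It is not one of the three methods from the article, which are not shown in this excerpt. The function name `to_bulk_ndjson`, the index name `articles`, and the sample documents are hypothetical, and plain dictionaries stand in for documents read from a MongoDB collection.

```python
import json


def to_bulk_ndjson(docs, index_name):
    """Serialize MongoDB-style documents into an Elasticsearch bulk request body.

    Hypothetical helper for illustration only.
    """
    lines = []
    for doc in docs:
        source = dict(doc)  # shallow copy so the caller's document is not modified
        # Elasticsearch treats _id as metadata, so it moves from the
        # document body into the action line (as a string).
        doc_id = str(source.pop("_id"))
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body is newline-delimited and must end with a newline.
    return "\n".join(lines) + "\n"


# Sample documents standing in for the result of a MongoDB query (made up).
docs = [
    {"_id": "a1", "title": "Intro to MongoDB", "views": 10},
    {"_id": "b2", "title": "Full-text search", "views": 25},
]

body = to_bulk_ndjson(docs, "articles")
print(body)
```

The resulting string would be sent to the bulk endpoint by whatever client or connector the chosen approach uses; that transport step is deliberately left out here.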


Introduction and Performance Comparison of Various Outlier Detection Models

1. Introduction

Anomaly or outlier detection is the process of identifying data points, observations, or events that deviate from the normal behaviour or distribution of a dataset. Anomalous data can indicate potentially critical incidents, such as fraudulent transactions, network intrusions, or technical failures. In contrast to standard classification or prediction tasks, anomaly detection is often applied to unlabelled datasets, taking only the internal structure and correlations of the data into account.

Photo by Will Myers on Unsplash

Numerous machine learning models are suitable for outlier detection. However, supervised models are more constraining than unsupervised models as they need to be provided with labelled datasets. This requirement is particularly expensive when the labelling must be performed by humans. Dealing with a heavily imbalanced class distribution, which is inherent to outlier detection, can also affect the efficiency of supervised models. …
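As a small illustration of the unsupervised setting described above, the sketch below flags outliers in a one-dimensional sample using the modified z-score, which is based on the median and the median absolute deviation (MAD). It needs no labels, only the data itself. This is a simple statistical baseline chosen for illustration; the excerpt does not say which models the article compares, so it should not be read as one of them. The function name `mad_outliers`, the threshold of 3.5, and the sample readings are assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds the threshold.

    Hypothetical helper: a label-free baseline, not a model from the article.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half of the values equal the median, so this score
        # is undefined; report nothing rather than divide by zero.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data is roughly normal.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up sensor readings with one obvious anomaly at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
print(mad_outliers(readings))  # [5]
```

Because the median and MAD are barely affected by the extreme value itself, this score is more robust than one based on the mean and standard deviation, which the outlier would inflate.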


A Python library to achieve that with only one line of code

Photo by David Becker on Unsplash

Pandas Library

Pandas is the most popular library for data wrangling and processing in Python. It has many functions that make data manipulation and transformation simple and flexible. However, Pandas is known to have problems with scalability and efficiency.

By default, Pandas executes its functions in a single process using only one CPU core, so it does not natively take advantage of all the cores and computing power available on your system. When handling large datasets or extensive calculations, Pandas can become very slow.

Modin Library

Modin is a new lightweight library designed to parallelize and accelerate Pandas DataFrames by automatically distributing the computation across all of the system’s available CPU cores. As a result, Modin claims a speed-up that is nearly linear in the number of CPU cores on your system, for Pandas DataFrames of any size [1]. …
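Modin's documented usage is to replace `import pandas as pd` with `import modin.pandas as pd` and leave the rest of the code unchanged. The sketch below does not use Modin; it only illustrates the general idea of partitioning data and mapping a function over the partitions, using the Python standard library. It is a conceptual toy under stated assumptions, not Modin's implementation: the helper names `partition`, `column_sum`, and `parallel_sum` are hypothetical, and a plain list stands in for a DataFrame column.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into at most n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few rows
            chunks.append(rows[start:end])
        start = end
    return chunks


def column_sum(chunk):
    """Work done on one partition; stands in for a per-partition operation."""
    return sum(chunk)


def parallel_sum(rows, executor, n_parts):
    """Map column_sum over the partitions, then combine the partial results."""
    partials = executor.map(column_sum, partition(rows, n_parts))
    return sum(partials)


values = list(range(1, 1001))
# A thread pool keeps the example portable. For CPU-bound pure-Python work,
# a ProcessPoolExecutor would be needed to actually occupy several cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    total = parallel_sum(values, pool, n_parts=4)
print(total)  # 500500
```

The split, apply, and combine steps are what a library in this space has to automate for every DataFrame operation, which is why a drop-in import is attractive compared with writing this plumbing by hand.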

About

Nate Dong, Ph.D.

A full stack data scientist
