So I’m back to the LEGO dataset. In a previous post, the plot of the relative frequency of LEGO colors showed that, although there is a wide range of colors overall, just a few make up the majority of brick colors. The situation is similar to that of texts, where common words – articles and prepositions, for example – occur frequently but add little to a (statistical) understanding of the text.

# Mapping San Francisco's open data with leaflet

In this post I create an interactive map of San Francisco 311 service requests related to the city’s homeless residents. To make the maps I use the R leaflet package, which provides an R interface to the interactive JavaScript mapping library of the same name. The data are available through San Francisco’s open data portal, DataSF, which is powered by a Socrata backend. I use two packages, RSocrata and soql, to simplify the process of querying the Socrata API.
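The basic leaflet workflow can be sketched in a few lines. This is a minimal example, not the post's actual code: the `requests` data frame below is a hypothetical stand-in for the 311 service-request data pulled from DataSF.

```r
library(leaflet)

# Hypothetical stand-in for the 311 request coordinates from DataSF
requests <- data.frame(
  lat = c(37.7749, 37.7810),
  lng = c(-122.4194, -122.4120)
)

# Build the map: base tiles, then one circle marker per request
m <- leaflet(requests) %>%
  addTiles() %>%                            # OpenStreetMap base layer
  addCircleMarkers(~lng, ~lat, radius = 4)  # formula notation pulls columns from `requests`
```

Printing `m` in an interactive R session renders the map as an htmlwidget.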

# Exploring the Lego dataset with SQL and dplyr, part II

In the previous post I went over using R’s standardized relational database API, DBI, to create a database and build tables from the Lego CSV files. In this post we will use the dplyr package to query and manipulate the data. I will walk through how dplyr handles database queries, then use a few simple queries and ggplot to visualize how Lego brick colors have changed over the years.
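The key idea in how dplyr handles database queries is laziness: verbs build up SQL, and nothing runs until you ask for the result. A minimal sketch, using an in-memory SQLite database and a tiny made-up `colors` table in place of the Lego database from part I:

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "colors", data.frame(
  name     = c("Black", "White", "Trans-Red"),
  is_trans = c(0L, 0L, 1L)
))

colors_tbl <- tbl(con, "colors")   # lazy reference: no data pulled yet
query <- colors_tbl %>%
  count(is_trans)                  # still lazy: dplyr translates this to SQL

show_query(query)                  # inspect the generated SQL
result <- collect(query)           # only now does the query execute
dbDisconnect(con)
```

`show_query()` is a handy way to check what SQL dplyr generated before committing to a `collect()` on a large table.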

# Exploring the Lego dataset with SQL and dplyr

There are a number of reasons why using relational databases for your data analyses can be a good practice: doing so requires cleaning and tidying your data, it helps preserve data integrity, and it allows you to manipulate and query data without loading the full dataset into memory. In this tutorial we will be using the Lego dataset. It’s moderately sized at 3 MB, and because the data come from a relational database, they have already been cleaned and normalized.
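The DBI workflow described above can be sketched as follows. This is a minimal illustration assuming an in-memory SQLite backend and a tiny made-up `sets` data frame standing in for the Lego CSV files:

```r
library(DBI)

# Connect to an in-memory SQLite database (a file path would persist it)
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# A made-up stand-in for one of the Lego CSVs
sets <- data.frame(
  set_num = c("75192-1", "10179-1"),
  year    = c(2017L, 2007L)
)
dbWriteTable(con, "sets", sets)    # build a table from the data frame

# Query without ever holding more than the result in memory
oldest <- dbGetQuery(con, "SELECT MIN(year) AS year FROM sets")
dbDisconnect(con)
```

Because DBI is a standardized interface, swapping SQLite for Postgres or MySQL mostly means changing the driver passed to `dbConnect()`.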

# Fractal dimension as a statistical property

Take a line of unit length and remove the piece between 1/3 and 2/3. The two remaining pieces are each 1/3 the length of the original. We can think of each as a smaller image of the original line and again remove the middle third of these two pieces. This leaves us with 4 identical lines, each 1/3 × 1/3 = 1/9 the length of the original. The process never removes the endpoints of each piece, so repeating this middle-third removal ad infinitum leaves a set of infinitely many points, the ‘Cantor dust’.
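The construction above makes the fractal dimension of the title concrete. At step $n$ there are $N(n) = 2^n$ pieces, each of length $\varepsilon(n) = 3^{-n}$, and the similarity dimension follows directly:

```latex
At step $n$ the construction leaves $N(n) = 2^n$ segments, each of length
$\varepsilon(n) = 3^{-n}$. For a self-similar set, the dimension $d$ satisfies
$N = (1/\varepsilon)^d$, so
\[
  d = \frac{\log N(n)}{\log\bigl(1/\varepsilon(n)\bigr)}
    = \frac{n \log 2}{n \log 3}
    = \frac{\log 2}{\log 3} \approx 0.6309,
\]
a non-integer value strictly between the dimension of a point ($0$) and of a
line ($1$).
```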

# About

I am interested in math, algorithms, and creating things with data. I have a good deal of experience using R for statistical and predictive modeling, but I plan on exploring other languages and tools in later blog posts. I recently finished my master’s degree in mathematics and am looking for a data-science-adjacent role in the Bay Area. This blog was made with the help of Yihui Xie’s blogdown package.

# Projects

Master's Thesis (PDF | Website): Improvement of epsilon-complexity estimation and an application to seizure prediction, San Francisco State University, 2017. Epsilon-complexity, as defined by Darkhovsky and Piryatinska, is a time series feature designed to measure the intrinsic complexity of a time series. For this thesis, I implemented the epsilon-complexity estimation procedure in an `R` package. The estimation procedure relies on a family of approximations; I implemented several approximation methods and tested the classification performance of the epsilon-complexity coefficients on simulated time series.