Machine Learning & Deep Learning

  • Bedside books on Machine Learning

    Recommended books on Machine Learning

  • Define a Custom Metric function in Keras

    When developing a model, it is useful to quantify its performance with a predefined metric. For classification problems, for example, log loss is common, and for regression problems the mean squared error is typically used. Keras provides both of these metrics by default.
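
    To make the two metrics above concrete, here is a minimal pure-Python sketch of what they compute. It is not Keras's own implementation: the function names, the clipping constant `eps` and the sample values are assumptions chosen for illustration. A custom Keras metric follows the same idea, a function of `(y_true, y_pred)`, but operates on tensors rather than lists.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Made-up values, for illustration only
mse = mean_squared_error([3.0, 5.0, 2.0], [2.5, 5.0, 3.0])
log_loss = binary_log_loss([1, 0, 1], [0.9, 0.2, 0.6])
```

    Both functions return a single number where lower is better, which is the shape a metric needs in order to compare models.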

  • A Deep Neural Network Classifier of Traffic Signs with 99.0% accuracy

    The Traffic Sign Classification is the 2nd project of the Self-Driving Car Nanodegree. The goal is to implement a neural network that predicts, with good accuracy (hopefully better than human performance), the label of German traffic signs. It was an opportunity to implement a deep neural network from scratch, on a topic extremely relevant for self-driving cars.

    Data visualization

    The dataset (available here) includes the training set (about 39k images) and the test set (12k images), stored in separate pickle files. All the images are scaled to 32x32 px with 3 color channels. There are 43 different labels. Here are a few examples of label names:

    0: speed limit (20km/h)
    1: speed limit (30km/h)
    10: No passing for vehicles over 3.5 metric tons
    22: Bumpy Road
    41: End of No Passing
    42: End of no passing by vehicles over 3.5 metric tons

    Figure: 20 images drawn randomly from the training set. The red number is the associated label.

    An image-to-image comparison shows variation in illumination (brightness/contrast), filling factor (the sign size versus the image size), position of the sign in the image, and background scenery/colors. Also, within the same image, there are local variations in illumination, background color and pattern uniformity.

    Data Exploration

    Now, let's see how the label data is distributed.

    Figure: Class frequency in the original training dataset.

    The class distribution in the training set is not uniform: the label with the highest frequency (2250 counts for label=2) has about 10 times more images than the label with the lowest count (210 counts for label=0).

    What about the pixel intensity distribution? Below are the histograms of 5 images from the same class (label=38): the histograms are clearly different although the images belong to the same class.

    Figure: Red channel histogram of 5 images from class label 38.
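
    The class imbalance quoted above can be checked with a few lines of Python. This is only a sketch: the `labels` list below is a stand-in built from the two counts given in the text (labels 2 and 0), not the real training labels, and the other 41 classes are left out.

```python
from collections import Counter

# Stand-in for the training labels: only the two counts quoted in the text
labels = [2] * 2250 + [0] * 210

counts = Counter(labels)

# Most and least frequent classes, each as a (label, count) pair
most_label, most_count = counts.most_common(1)[0]
least_label, least_count = min(counts.items(), key=lambda item: item[1])

# How many times more images the largest class has than the smallest
imbalance_ratio = most_count / least_count
```

    With the real label array, the same `Counter` call gives the full class-frequency table behind the bar chart.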

  • Install an NVIDIA GeForce GTX card for Deep Learning GPU computing

    The following steps apply only if the NVIDIA card is used for GPU-accelerated computing, not for display. My machine has 2 graphics cards: an ATI Radeon (not built-in) for display and the NVIDIA GeForce for computing. Before starting, make sure to save all your work.

  • My notes of the Udacity class - Intro to Descriptive Statistics

  • Interactive plots with plotly

  • Creating boxplots with seaborn

    One of the first steps in analysing a dataset is data exploration. In this short post, I will focus on visualizing the data and its outliers using box plots. To learn more about data exploration, check this very thorough post.
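
    As a rough illustration of what a box plot treats as an outlier, here is a sketch of the usual 1.5 x IQR rule in plain Python (1.5 is also the default `whis` value of seaborn's `boxplot`). The helper name `boxplot_outliers` and the sample values are made up for this example, and the quartiles use linear interpolation, so they may differ slightly from other quartile conventions.

```python
import statistics


def boxplot_outliers(values, whis=1.5):
    """Return the values lying beyond whis * IQR from the quartiles."""
    # 'inclusive' uses linear interpolation between data points
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return [v for v in values if v < low or v > high]


# Made-up sample with one obviously extreme value
sample = [12, 14, 15, 15, 16, 17, 18, 19, 20, 55]
outliers = boxplot_outliers(sample)
```

    Here the quartiles are 15 and 18.75, so the fences sit at 9.375 and 24.375 and only the value 55 falls outside them.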

  • Creating Customer Segmentation

    In this post, we will identify customer segments using data collected from customers of a wholesale distributor in Lisbon (Portugal). The dataset includes the customers' annual spending amounts (reported in monetary units) on diverse product categories. The project includes several steps: explore the data (determine whether any product categories are highly correlated), scale each product category, identify and remove outliers, reduce dimensionality using PCA, implement a clustering algorithm to segment the customer data, and finally compare the segmentations. The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' are excluded from the analysis, with the focus instead on the six product categories recorded for customers.
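
    To give an idea of the clustering step, here is a minimal k-means sketch in plain Python. The excerpt does not say which clustering algorithm the post uses, so k-means is an assumption here, as are the function name, the two-feature toy data and the hand-picked starting centroids. The scaling, outlier-removal and PCA steps are not shown.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recenter."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for each point
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # empty cluster: keep as is
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return assignments, centroids


# Toy data: made-up, already-scaled spending on two product categories
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
assignments, centroids = kmeans(points, centroids=[points[0], points[3]])
```

    On this toy data the first three customers end up in one segment and the last three in the other. On real data the result depends on the starting centroids and on the number of clusters, which is why comparing segmentations is a step of its own.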

  • Building a Student Intervention Model

    As education has grown to rely more on technology, vast amounts of data have become available for examination and prediction. Logs of student activities, grades, interactions with teachers and fellow students, and more, are now captured in real time through learning management systems like Canvas and Edmodo. At all levels of education, there is a push to increase the likelihood of student success without watering down the education or engaging in behaviors that fail to address the underlying issues. Graduation rates are often the criterion of choice. A local school district has a goal to reach a 95% graduation rate by the end of the decade by identifying students who need intervention before they drop out of school. We will build a model that predicts how likely a student is to pass their high school final exam: the model must be effective while using the least possible computation cost.

  • Datasets to practice Machine Learning

    If you are looking for datasets to practice your data analysis and Machine Learning skills, here are a few websites. All the datasets listed are available for free.

  • Predict Housing Prices in Boston

    In this project, we analyze the prices of homes in suburbs of Boston. We build a predictive model and train/test it on collected data. The performance of the model is then evaluated. The model can then be used to make certain predictions about a home — in particular, its monetary value. This model would prove to be invaluable for someone like a real estate agent who could make use of such information on a daily basis.
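
    The train/test/evaluate workflow described above can be sketched in plain Python. Everything specific here is an assumption made for illustration: the synthetic (rooms, price) pairs, the single-feature straight-line model, the 80/20 split and the use of R² as the score. The actual project may use different features, models and metrics.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(pairs, slope, intercept):
    """Coefficient of determination of the line on the given pairs."""
    mean_y = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    return 1 - ss_res / ss_tot


# Synthetic data: price = 50 * rooms - 100, plus a small alternating noise
data = [
    (4.0 + 0.2 * i, 50.0 * (4.0 + 0.2 * i) - 100.0 + (2.0 if i % 2 else -2.0))
    for i in range(20)
]

train, test = train_test_split(data)
slope, intercept = fit_line(train)      # fit on the training rows only
test_r2 = r2_score(test, slope, intercept)  # evaluate on unseen rows
```

    The important design choice is that the score is computed on rows the model never saw during fitting, so it estimates how the model would do on a new home rather than how well it memorized the training data.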

  • How to choose an ML algorithm

    Here is a good guide on how to choose the right predictor depending on the characteristics of the dataset.