In Memory Data Mining Tools

luoq08@gmail.com OR hzluoqiang@corp.netease.com

general process of data mining

  • data acquisition
    • crawler
  • data wrangling
    • load data(sql, csv/xls, json, html)
    • clean
    • transformation (join, group, sort)
    • data visualization
  • model

focus of this talk

  • in a single machine
  • data fits in memory (sometimes out of core)
  • open source: Python and some shell tools

Not included

  • hadoop, mpi, spark

python

Why is Python so popular in machine learning?

pros:

  • easy to use
  • rich libraries; the Swiss Army knife of machine learning
  • can be sped up with C extensions (not easy to write)
  • dynamic (live coding, easy to manipulate)

cons:

  • parallelism: relies on multiprocessing, since the GIL limits threads
  • dynamic (tests are needed to catch errors)

setup

  1. Install anaconda

    conda install xxx
    pip install yyy
  2. start jupyter

    jupyter notebook

    Access the notebook server from a browser; an ssh tunnel may be useful when it runs on a remote machine.

  3. import libraries

import numpy as np
import scipy.sparse as sp
import pandas as pd

from sklearn.linear_model import LogisticRegression
import xgboost as xgb

import gensim
import nltk
from sklearn.feature_extraction import Counte

import matplotlib.pyplot as plt
import seaborn
seaborn.set()
%matplotlib inline

SciPy

The SciPy Stack: Scientific Computing Tools for Python

jupyter

MATLAB-like environment

  • numpy: dense matrix
  • scipy: scipy.sparse for sparse matrix; linear algebra, optimization ...
  • matplotlib: MATLAB-like plotting; image based, static
  • seaborn: built on matplotlib; high level; more attractive defaults
  • mpld3: brings matplotlib to d3
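A minimal sketch of the MATLAB-like workflow with numpy dense matrices. The matrix and vector values are made up for illustration.

```python
import numpy as np

# A small dense matrix and right-hand side (illustrative values).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Solve A @ x = b, the numpy counterpart of MATLAB's A \ b.
x = np.linalg.solve(A, b)

# Reductions work along an axis without explicit loops.
col_means = A.mean(axis=0)
```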

scipy

X = sp.csr_matrix((V, (I, J)))
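The constructor above takes coordinate triplets: `V[k]` is stored at row `I[k]`, column `J[k]`. A small sketch with made-up values, assuming a 3x3 matrix:

```python
import numpy as np
import scipy.sparse as sp

# Coordinate (COO) triplets: V[k] is stored at row I[k], column J[k].
I = np.array([0, 0, 1, 2])
J = np.array([0, 2, 1, 2])
V = np.array([1.0, 2.0, 3.0, 4.0])

# Passing the shape explicitly avoids inferring it from the largest indices.
X = sp.csr_matrix((V, (I, J)), shape=(3, 3))

dense = X.toarray()                            # only sensible for small matrices
row_sums = np.asarray(X.sum(axis=1)).ravel()   # sum() returns a 2-D matrix
```

Only the nonzero entries are stored, so `X.nnz` is 4 here rather than 9.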

matplotlib

pandas

  • io: pd.read_csv, pd.read_excel, pd.read_sql, pd.read_json, pd.read_hdf
  • DataFrame object: np.array with row and column labels; different types for columns; tabular data
  • Panel object: 3D DataFrame, sometimes useful (e.g. stock data); removed in later pandas releases
  • group by, sort, join (pd.merge), reshape/pivot
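The join, group by, sort, and pivot operations listed above can be sketched as follows. The `orders` and `users` tables and their column names are hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy tables for illustration.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "item": ["a", "b", "a", "c"],
    "amount": [10.0, 5.0, 7.0, 3.0],
})
users = pd.DataFrame({"user_id": [1, 2, 3], "city": ["HZ", "BJ", "HZ"]})

# join: attach the city of each user to every order
joined = pd.merge(orders, users, on="user_id", how="left")

# group by + sort: total amount per city, largest first
per_city = (joined.groupby("city")["amount"].sum()
                  .sort_values(ascending=False))

# pivot: one row per user, one column per item
wide = joined.pivot_table(index="user_id", columns="item",
                          values="amount", aggfunc="sum", fill_value=0)
```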

caution:

  • slow when used as a key-value store with per-cell lookups (vectorize instead)
  • many tricks
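A small sketch of the caution about per-cell access. Both computations give the same total, but the loop does one scalar lookup per cell while the vectorized form works on whole columns. The column names are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.arange(1.0, 1001.0),
                   "qty": np.ones(1000)})

# Slow pattern: treating the DataFrame as a key-value store,
# one scalar lookup per cell.
total_loop = 0.0
for i in range(len(df)):
    total_loop += df.loc[i, "price"] * df.loc[i, "qty"]

# Vectorized pattern: one expression over whole columns.
total_vec = (df["price"] * df["qty"]).sum()
```

Timing the two variants with `%timeit` in a notebook is a quick way to see the gap on real data.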

more:

scikit-learn

  • active community and development; clear interface
  • good documentation with reference
  • full set of algorithms; pipeline; parameter tuning
  • wraps well-known tools: libsvm, liblinear
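The pipeline and parameter tuning features mentioned above can be sketched like this. The data is synthetic, and the step names `scale` and `clf` and the grid of `C` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Chain preprocessing and the model into a single estimator.
# solver="liblinear" uses the wrapped liblinear library.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Tune the regularization strength C with 3-fold cross-validation.
# "clf__C" means: parameter C of the pipeline step named "clf".
search = GridSearchCV(pipe, param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
```

After fitting, `search` behaves like the best pipeline itself, so `search.predict(X_new)` applies scaling and the model in one call.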

sklearn API

from sklearn.linear_model import LogisticRegression
model = LogisticRegression(penalty='l1', solver='liblinear')
model.fit(X_train, y_train)
model.predict(X_test)
model.predict_proba(X_test)

``model.predict``    ``model.transform``
Classification       Preprocessing
Regression           Dimensionality Reduction
Clustering           Feature Extraction
                     Feature Selection
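A short sketch of the two interface families, using PCA as one example of a transformer and LogisticRegression as one example of a predictor. The two-cluster data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, only for illustration.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 5)),
               rng.normal(+2.0, 1.0, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# Transformer: fit learns the projection, transform applies it.
pca = PCA(n_components=2)
X_2d = pca.fit(X).transform(X)

# Predictor: fit learns the model, predict / predict_proba apply it.
clf = LogisticRegression()
clf.fit(X_2d, y)
labels = clf.predict(X_2d)
proba = clf.predict_proba(X_2d)   # one column per class
```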

shell

demo

function work(){
pv --rate -i 5 \
 | csvcut -c 'images_array_1,images_array_2' | csvjson --stream \
  | parallel --gnu -k --pipe -N 20  --jobs 16 python -m feature.image_feature

}

# Generate image feature for training data set and testing data set
cat data/data_files/image_itemPairs_train.csv | work > data/data_files/image_feature_train.csv
cat data/data_files/image_itemPairs_test.csv | work > data/data_files/image_feature_test.csv

feature/image_feature.py (run as `python -m feature.image_feature`):

if __name__ == '__main__':
    import sys
    for line in sys.stdin:
        line = line.rstrip()
        #do something with line
        ...
        print(result)
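A possible shape for such a worker, as a sketch only. `csvjson --stream` emits one JSON object per line, so each line can be parsed independently. The feature computed here (counting comma-separated image ids per column) is a hypothetical stand-in; the real logic of `feature.image_feature` is not shown in the talk, and the `process_line` and `run` names are invented for this example.

```python
import io
import json


def process_line(line):
    """Turn one JSON record into one CSV output line.

    Hypothetical feature: the number of comma-separated image ids
    in each of the two columns.
    """
    record = json.loads(line)
    left = str(record.get("images_array_1") or "")
    right = str(record.get("images_array_2") or "")
    n_left = len([s for s in left.split(",") if s.strip()])
    n_right = len([s for s in right.split(",") if s.strip()])
    return "%d,%d" % (n_left, n_right)


def run(stdin, stdout):
    """Process records one line at a time instead of loading the whole input."""
    for line in stdin:
        line = line.rstrip()
        if not line:
            continue
        stdout.write(process_line(line) + "\n")


# Usage with in-memory streams; in the shell pipeline these would be
# sys.stdin and sys.stdout.
src = io.StringIO('{"images_array_1": "1, 2", "images_array_2": "3"}\n'
                  '{"images_array_1": null, "images_array_2": "4, 5, 6"}\n')
dst = io.StringIO()
run(src, dst)
```

Because each record is handled on its own, `parallel --pipe` can split the input into chunks and feed them to several copies of the script.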

more data munging tools

bokeh

  • interactive visualization library that targets modern web browsers
  • quickstart

d3.js

Data-Driven Documents

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
  • bind data to DOM, and manipulate
  • powerful
  • be prepared to write code

data acquisition

web crawling

data cleaning

More Machine Learning tools

xgboost

  • performance proven in various Kaggle competitions
  • handles nonlinear relationships
  • handles missing values; no need for standardization
  • fast, scalable
  • supports R, Python, Julia, Scala, Java
  • sklearn interface

Text/ NLP

vowpal wabbit

  • out of core; online; scalable; $10^{12}$ sparse features; linear model
  • hashing trick for raw text feature
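The hashing trick and online learning can be illustrated without vowpal wabbit itself. The sketch below swaps in scikit-learn's `HashingVectorizer` and `SGDClassifier` as an analogue, not VW's own implementation, on a hypothetical toy corpus delivered in mini-batches.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy corpus, delivered in mini-batches as if streamed from disk.
batches = [
    (["good movie", "great film"], [1, 1]),
    (["bad movie", "terrible film"], [0, 0]),
    (["great great movie", "bad bad film"], [1, 0]),
]

# Hashing trick: tokens are hashed straight into a fixed number of columns,
# so no vocabulary has to be built or kept in memory.
vectorizer = HashingVectorizer(n_features=2 ** 18, alternate_sign=False)

# Online linear model updated one mini-batch at a time.
clf = SGDClassifier(random_state=0)
for _ in range(20):                      # a few passes over the tiny stream
    for texts, labels in batches:
        X = vectorizer.transform(texts)  # stateless: no fit needed
        clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["great movie", "terrible movie"]))
```

Memory use depends on `n_features` and the batch size, not on the corpus size, which is what makes the out-of-core setting workable.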

How to learn more

  • Google is your friend
  • youtube

My list