Data Science¶

https://en.wikipedia.org/wiki/Data_science

Science
- Open Science
- Scientific Method
Reproducibility
- Ten Simple Rules
- Linked Reproducibility

SageMathCloud (SageMath)
Jupyter Notebook
- Jupyter Docker Stacks (Conda)
- Jupyter Extensions
- Jupyter and Reproducibility

datasciencemasters.org¶

“The Open Source Data Science Masters”
http://datasciencemasters.org/

Ten Simple Rules¶

Homepage: http://collections.plos.org/ten-simple-rules
Hashtag: #TenSimpleRules
Twitter: https://twitter.com/hashtag/TenSimpleRules?src=hash

#TenSimpleRules for Reproducible Computational Research¶

“Ten Simple Rules for Reproducible Computational Research”
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
DOI: 10.1371/journal.pcbi.1003285 Featured in PLOS Collections

For Every Result, Keep Track of How It Was Produced

Avoid Manual Data Manipulation Steps

Archive the Exact Versions of All External Programs Used

Version Control All Custom Scripts

Record All Intermediate Results, When Possible in Standardized Formats

For Analyses That Include Randomness, Note Underlying Random Seeds

Always Store Raw Data behind Plots

Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

Connect Textual Statements to Underlying Results

Provide Public Access to Scripts, Runs, and Results

For Every Result, Keep Track of How It Was Produced
- RDF, JSON-LD (e.g. W3C PROV)
- Workflow
- Knowledge Engineering > Linked Data
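
As a minimal sketch of this rule, a result can be stored next to a small provenance record. The example below builds a JSON-LD document from W3C PROV-O terms using only the standard library. The `ex:` namespace, the script name `analysis.py`, the result identifier, and the input bytes are hypothetical placeholders, not part of any real project.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical raw input; in practice, read the bytes of the real input file.
raw_bytes = b"time,revenue\n1200,80\n1210,100\n1500,20\n"


def provenance_record(input_bytes, script_name, result_id):
    """Describe how a result was produced as a JSON-LD dict of PROV-O terms."""
    digest = hashlib.sha256(input_bytes).hexdigest()
    return {
        "@context": {
            "prov": "http://www.w3.org/ns/prov#",
            "ex": "http://example.org/",  # hypothetical namespace
        },
        "@id": "ex:" + result_id,
        "@type": "prov:Entity",
        "prov:wasGeneratedBy": {
            "@id": "ex:run-" + digest[:12],
            "@type": "prov:Activity",
            # The input is identified by a prefix of its SHA-256 digest.
            "prov:used": {
                "@id": "ex:input-" + digest[:12],
                "@type": "prov:Entity",
            },
            "prov:wasAssociatedWith": {
                "@id": "ex:" + script_name,
                "@type": "prov:SoftwareAgent",
            },
            "prov:endedAtTime": datetime.now(timezone.utc).isoformat(),
        },
    }


record = provenance_record(raw_bytes, "analysis.py", "result-1")
print(json.dumps(record, indent=2))
```

Deriving the identifiers from a content hash is one possible design choice: the same input bytes then map to the same identifier across runs.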
Avoid Manual Data Manipulation Steps
- Workflow
- Continuous Delivery
  - Test Automation (e.g. Test Driven Development)
Archive the Exact Versions of All External Programs Used
- Jupyter and Reproducibility (%version_information, %watermark) (should be “Reproducibility and Jupyter Notebook”)
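
Outside of a notebook, roughly the same information can be captured with the standard library. This is a sketch rather than a replacement for the magics named above. The helper name `environment_versions` and the package names passed to it are illustrative assumptions.

```python
import platform
import sys
from importlib import metadata


def environment_versions(packages):
    """Return the interpreter, platform, and installed package versions."""
    versions = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# The package names are examples only; list whatever the analysis imports.
versions = environment_versions(["numpy", "pandas", "matplotlib"])
for name, version in versions.items():
    print(name, version)
```

Saving this dictionary alongside each result records which versions produced it.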
Version Control All Custom Scripts
- Revision Control (e.g. Distributed Version Control)
Record All Intermediate Results, When Possible in Standardized Formats
- Linked Data (e.g. 5 ★ Linked Open Data)
For Analyses That Include Randomness, Note Underlying Random Seeds

Python random functions:
```
import os
import random

import numpy as np

# Prints None unless PYTHONHASHSEED is set in the environment
print(os.environ.get('PYTHONHASHSEED'))
RANDOMSEED = 1  # /dev/[x]random

random.seed(RANDOMSEED)

np.random.seed(RANDOMSEED)    # Seed
print(np.random.get_state())  # State
np.random.rand(4, 2)   # uniform on [0, 1); (rows, cols, [...])
np.random.randn(4, 2)  # "standard normal" distribution
```
- http://docs.scipy.org/doc/numpy/reference/routines.random.html#distributions
Python hash randomization and algorithmic determinism:

python -R

https://docs.python.org/3/using/cmdline.html#cmdoption-R

PYTHONHASHSEED

https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
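
Since Python 3.3, hash randomization is enabled by default, so `hash()` of a string normally differs between interpreter runs. The sketch below starts two fresh interpreters with the same `PYTHONHASHSEED` and compares the results. The helper name `string_hash_with_seed` and the sample text are illustrative assumptions.

```python
import os
import subprocess
import sys


def string_hash_with_seed(text, seed):
    """Run a fresh interpreter with PYTHONHASHSEED set and return hash(text)."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(completed.stdout)


first = string_hash_with_seed("data science", 1)
second = string_hash_with_seed("data science", 1)
print(first == second)
```

The two values should match because both child interpreters use the same seed on the same Python build. The seed has to be set before the interpreter starts; changing `os.environ` inside a running process does not affect its own hashing.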
Always Store Raw Data behind Plots
- Or, “Generate all plots from [source-controlled] [transforms-of] raw data”
- ./data
- ./tests/data
- ./nb/data (./notebooks)
- Data Visualization, Data Visualization Tools
Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- pandas:
  - http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking
  - http://pandas.pydata.org/pandas-docs/stable/reshaping.html#combining-with-stats-and-groupby
- Schema.org: https://schema.org/docs/full.html
- SKOS:
  
  http://www.w3.org/TR/skos-reference/
  
  http://www.w3.org/TR/skos-reference/skos.html
  
  skos:narrower, skos:narrowerTransitive, skos:broader, skos:broaderTransitive, [...]
- XKOS: “An SKOS extension for representing statistical classifications”
  
  http://rdf-vocabulary.ddialliance.org/xkos.html
- RDF Data Cubes: “The RDF Data Cube Vocabulary”
  
  qb:DataSet, qb:Dimension, qb:ObservationGroup, qb:Slice, [...]
  
  http://www.w3.org/TR/vocab-data-cube/
Connect Textual Statements to Underlying Results
- Linked Data: URIs, URLs, #uri-fragments
- Turtle / TriG: <> (this document, this named graph)
- ReStructuredText
  - http://sphinx-doc.org/rest.html#footnotes #citations #substitutions
  - https://github.com/yoloseem/awesome-sphinxdoc
- Linked Reproducibility: URIs, URLs, #uri-fragments
Provide Public Access to Scripts, Runs, and Results

#TenSimpleRules for Creating a Good Data Management Plan¶

“Ten Simple Rules for Creating a Good Data Management Plan”
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525
DOI: 10.1371/journal.pcbi.1004525

Determine the Research Sponsor Requirements

Identify the Data to Be Collected

Define How the Data Will Be Organized

Explain How the Data Will Be Documented

Describe How Data Quality Will Be Assured

Present a Sound Data Storage and Preservation Strategy

Define the Project’s Data Policies

Describe How the Data Will Be Disseminated

Assign Roles and Responsibilities

Prepare a Realistic Budget

http://journals.plos.org/plosone/s/data-availability

> PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Data, Information, Knowledge, & Wisdom¶

https://en.wikipedia.org/wiki/Data

https://en.wikipedia.org/wiki/Information

https://en.wikipedia.org/wiki/Knowledge (see: Knowledge Engineering)

https://en.wikipedia.org/wiki/Wisdom

# Lead -> Gold

Data is information
Information is data
Raw data is not knowledge
Wisdom compares knowledges

Optimization¶

https://en.wikipedia.org/wiki/Mathematical_optimization

Find local and global optima (maxima and minima) within an n-dimensional field which may be limited by resource constraints.

# Global optima of a 1-dimensional list
points = [10, 20, 100, 20, 10]
global_max, global_min = max(points), min(points)
assert global_max == 100
assert global_min == 10

# Local optima within a sample (the first two points) of the list
sample = points[:2]
local_max, local_min = max(sample), min(sample)
assert local_max == 20
assert local_min == 10

# A 2-dimensional list ...
points = [(-0.5, 0),
          (0,  0.5),
          (0.5,  0),
          (0, -0.5)]
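
One way to extend the snippets above is to look for local optima by comparing each interior point with its neighbors, and to pick optima of the 2-dimensional list along one coordinate. This is a sketch under stated assumptions: the `series` values and the helper names `local_maxima` and `local_minima` are hypothetical, and only strict interior optima are reported.

```python
def local_maxima(values):
    """Return indexes of interior points strictly greater than both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]


def local_minima(values):
    """Return indexes of interior points strictly less than both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] > values[i] < values[i + 1]]


series = [10, 30, 20, 100, 20, 10]  # hypothetical 1-dimensional data
assert local_maxima(series) == [1, 3]   # 30 and 100
assert local_minima(series) == [2]      # the 20 between them
assert max(series) == series[3]         # the global maximum is also a local one

# Optima of the 2-dimensional list along the second coordinate (y)
points_2d = [(-0.5, 0), (0, 0.5), (0.5, 0), (0, -0.5)]
highest = max(points_2d, key=lambda p: p[1])
lowest = min(points_2d, key=lambda p: p[1])
assert highest == (0, 0.5)
assert lowest == (0, -0.5)
```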

Smoothies¶

Data

Inputs, Outputs

Revenue:

2014-01-01 1200 CDT  $80
2014-01-01 1210 CDT  $100
2014-01-01 1500 CDT  $20

Expenses:

2014-01-01 wages     $256 ($8/hr * 8hrs * 4 people)
2014-01-01 utilities $100

Information

Aggregations, Tendencies

Revenue (gross):

2014-01-01  total: $200

Expenses:

2014-01-01  total: $356

Net:

2013-01-01  net:  -$200
2014-01-01  net:  -$156

On Mondays, we make about $500 on (simple) average.
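
The step from data to information for 2014-01-01 can be sketched as a simple aggregation. The variable names are illustrative; the amounts are the ones listed above.

```python
# Data: individual revenue and expense records for 2014-01-01
revenue = [
    ("2014-01-01 1200 CDT", 80),
    ("2014-01-01 1210 CDT", 100),
    ("2014-01-01 1500 CDT", 20),
]
expenses = [
    ("wages", 8 * 8 * 4),  # $8/hr * 8hrs * 4 people
    ("utilities", 100),
]

# Information: aggregations over the raw records
total_revenue = sum(amount for _, amount in revenue)
total_expenses = sum(amount for _, amount in expenses)
net = total_revenue - total_expenses

assert total_revenue == 200
assert total_expenses == 356
assert net == -156
print(f"total revenue: ${total_revenue}, "
      f"total expenses: ${total_expenses}, net: {net}")
```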

Knowledge

Positive net revenue is good.
One customer is worth the world to us.

Wisdom

We could save money by not being open on New Year's Day, but our loyal customers would not be happy about that.

Body Temperature¶

Data

time, body temp, outdoor temp, indoors/outdoors
time, exercise type, intensity, duration

Information

Daily temperature variance is about n degrees

Knowledge

Walking outside when it is warm increases body temperature
Walking outside when it is cold decreases body temperature
Exercise increases body temperature

Wisdom

If it’s 1745, and body temperature is n degrees above baseline, I’m probably walking outside and it is hot out.

Theory¶

Linked Reproducibility¶

Hashtag: #LinkedReproducibility
Twitter: https://twitter.com/hashtag/LinkedReproducibility
Wrdrddocs: LinkedReproducibility

Note

This heading is now merged into a separate page: LinkedReproducibility

Math¶

https://en.wikipedia.org/wiki/Mathematics

https://en.wikipedia.org/wiki/Outline_of_mathematics

https://en.wikipedia.org/wiki/Mathematics_education#Methods

http://www.iflscience.com/brain/math-gifs-will-help-you-understand-these-concepts-better-your-teacher-ever-did

Math Courses¶

Project Euler¶

https://en.wikipedia.org/wiki/Project_Euler

https://projecteuler.net/

Math Algorithm Problems

Rosalind¶

Web: http://rosalind.info/

Bioinformatics and Data Science Algorithm Problems and Exercises

Mathematical Notation¶

See:

Knowledge Engineering > Symbols
Units > Units and RDF

LaTeX¶

Wikipedia: https://en.wikipedia.org/wiki/LaTeX

https://en.wikipedia.org/wiki/LaTeX#Examples

MathJax¶

Wikipedia: https://en.wikipedia.org/wiki/MathJax

Docs: http://docs.mathjax.org/en/latest/tex.html

MathJax is a Javascript library for displaying MathML, LaTeX, and ASCIIMathML markup in a browser.

http://meta.math.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

MathJax and IPython Notebook / Jupyter Notebook:

MathML¶

Wikipedia: https://en.wikipedia.org/wiki/MathML

ASCIIMathML¶

Wikipedia: https://en.wikipedia.org/wiki/ASCIIMathML

ASCII
MathML

Information Theory¶

https://en.wikipedia.org/wiki/Information_theory

https://en.wikipedia.org/wiki/Entropy_(information_theory)

https://en.wikipedia.org/wiki/Signal_(electrical_engineering)

https://en.wikipedia.org/wiki/Noise_(signal_processing)

https://en.wikipedia.org/wiki/Signal-to-noise_ratio

https://en.wikipedia.org/wiki/Probability_theory

https://www.khanacademy.org/math/probability

Linear Algebra¶

https://en.wikipedia.org/wiki/Linear_algebra

https://www.khanacademy.org/math/linear-algebra
http://www.ulaff.net/
https://github.com/ULAFF/notebooks/ (Jupyter Notebooks)

Calculus¶

https://en.wikipedia.org/wiki/Calculus

Statistics¶

https://en.wikipedia.org/wiki/Statistics

https://en.wikipedia.org/wiki/Outline_of_statistics

https://en.wikipedia.org/wiki/Category:Statistics

Parametric Statistics¶

https://en.wikipedia.org/wiki/Parametric_statistics

Regression Analysis¶

https://en.wikipedia.org/wiki/Regression_analysis

https://en.wikipedia.org/wiki/Template:Regression_bar

Nonparametric Statistics¶

https://en.wikipedia.org/wiki/Nonparametric_statistics

Descriptive Statistics¶

https://en.wikipedia.org/wiki/Descriptive_statistics

Statistical Inference¶

https://en.wikipedia.org/wiki/Statistical_inference

Causality¶

https://en.wikipedia.org/wiki/Causality

https://en.wikipedia.org/wiki/Correlation_and_dependence

https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation

https://en.wikipedia.org/wiki/Sensitivity_analysis

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc

Analysis¶

https://en.wikipedia.org/wiki/Data_analysis

https://en.wikipedia.org/wiki/Big_data

https://en.wikipedia.org/wiki/Data_processing#Data_processing_functions

Learning¶

https://en.wikipedia.org/wiki/Learning

https://en.wikipedia.org/wiki/Autodidacticism

https://en.wikipedia.org/wiki/Perceptual_learning

https://en.wikipedia.org/wiki/Pattern_recognition_(psychology)#False_pattern_recognition

https://en.wikipedia.org/wiki/Rhetoric

https://en.wikipedia.org/wiki/Socratic_method

https://en.wikipedia.org/wiki/Socratic_questioning

https://en.wikipedia.org/wiki/Platonic_dialogue#The_dialogues

https://en.wikipedia.org/wiki/Dialectic

https://en.wikipedia.org/wiki/Dialogue

https://en.wikipedia.org/wiki/Perturbation_theory_(quantum_mechanics)

https://en.wikipedia.org/wiki/Validated_learning

https://en.wikipedia.org/wiki/Organizational_learning

See: Knowledge Engineering

Data Mining¶

https://en.wikipedia.org/wiki/Data_mining

https://en.wikipedia.org/wiki/Knowledge_extraction

https://en.wikipedia.org/wiki/Extract,_transform,_load

Data Dredging¶

Wikipedia: https://en.wikipedia.org/wiki/Data_dredging

!
Causality
spurious correlations
- http://tylervigen.com/spurious-correlations
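
A small simulation can illustrate why searching many variables for a correlation is risky. The sketch below compares one random series against 1000 other, unrelated random series and reports the strongest correlation found. The series length, the number of candidates, and the seed are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so that the run is repeatable
target = [rng.gauss(0, 1) for _ in range(10)]
candidates = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]

correlations = [pearson(target, candidate) for candidate in candidates]
strongest = max(correlations, key=abs)

print(f"first candidate:   r = {correlations[0]:+.2f}")
print(f"strongest of 1000: r = {strongest:+.2f}")
```

Every series here is independent noise, so any strong correlation that turns up is spurious. With short series and many candidates, some are expected to correlate strongly by chance alone.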

Machine Learning¶

Wikipedia: https://en.wikipedia.org/wiki/Machine_learning
Awesome: https://github.com/onurakpolat/awesome-bigdata
Awesome: https://github.com/josephmisiti/awesome-machine-learning

https://en.wikipedia.org/wiki/Online_machine_learning

Deep Learning¶

Wikipedia: https://en.wikipedia.org/wiki/Deep_learning

Datasets¶

awesome-public-datasets¶

https://github.com/caesar0301/awesome-public-datasets

https://github.com/caesar0301/awesome-public-datasets#search-engines

Awesome¶

https://github.com/bayandin/awesome-awesomeness

Tools¶

ETL¶

Wikipedia: https://en.wikipedia.org/wiki/Extract,_transform,_load

https://en.wikipedia.org/wiki/Extract,_transform,_load#Real-life_ETL_cycle

Workflow¶

Scientific Method
Project Management
https://en.wikipedia.org/wiki/Checklist
https://en.wikipedia.org/wiki/Scientific_workflow_system
Units of measure
I/O Transforms of information(/energy)

“Data Provenance”, “Data Lineage”

See:

Techniques¶

Automated Workflows¶

Standard, Automated Workflows

Q: Is there confirmation bias in starting with e.g. simple regression analysis?

Q: Which factors did we know we were capturing?

5 ★ Linked Open Data¶

http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆

Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).

☆☆

Publish structured data on the Web in a machine-readable format (e.g. XML).

☆☆☆

Publish structured data on the Web in a documented, non-proprietary data format (e.g. CSV, KML).

☆☆☆☆

Publish structured data on the Web as RDF (e.g. Turtle, RDFa, JSON-LD, SPARQL).

☆☆☆☆☆

In your RDF, have the identifiers be links (URLs) to useful data sources.

—http://5stardata.info/

See: Knowledge Engineering, Semantic Web Standards

Data Visualization¶

Wikipedia: https://en.wikipedia.org/wiki/Data_visualization

Visualizing Data Science¶

The Data Science Venn Diagram

Field representations

Data Visualization Tools¶

Matplotlib¶

Wikipedia: https://en.wikipedia.org/wiki/Matplotlib
Homepage: https://matplotlib.org/
Src: https://github.com/matplotlib/matplotlib
Docs: https://matplotlib.org/contents.html

Scipy lectures:

http://scipy-lectures.github.io/intro/matplotlib/matplotlib.html
Scientific-python-lectures:

http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb
http://stanford.edu/~mwaskom/software/seaborn/index.html
http://tonysyu.github.com/mpltools/auto_examples/index.html#style-package
http://mpld3.github.io/ (Matplotlib + D3.js)
conda install matplotlib (Conda (Anaconda))

pandas plot functions generate matplotlib charts.

Seaborn¶

Src: https://github.com/mwaskom/seaborn
Docs: http://seaborn.pydata.org/
Docs: http://seaborn.pydata.org/examples/

“Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.”

Mayavi¶

Wikipedia: https://en.wikipedia.org/wiki/MayaVi
Src: https://github.com/enthought/mayavi
Docs: http://docs.enthought.com/mayavi/mayavi/

“Mayavi: 3D scientific data visualization and plotting in Python”
Scipy lectures:

https://scipy-lectures.github.io/packages/3d_plotting/

Bokeh¶

Src: https://github.com/bokeh/bokeh

Docs: https://bokeh.pydata.org/

VisPy¶

Homepage: http://vispy.org/ (OpenGL)

Src: https://github.com/vispy/vispy

Vega¶

Homepage: https://trifacta.github.io/vega/

Vincent¶

Src: https://github.com/wrobstory/vincent

Plotly¶

Wikipedia: https://en.wikipedia.org/wiki/Plotly

Homepage: https://plot.ly/

PyQtGraph¶

http://www.pyqtgraph.org/ (OpenGL)

qgrid¶

Src: https://github.com/quantopian/qgrid

SlickGrid with IPython Notebook / Jupyter Notebook and pandas support

D3.js¶

Wikipedia: https://en.wikipedia.org/wiki/D3.js

Homepage: http://d3js.org/

Three.js¶

Wikipedia: https://en.wikipedia.org/wiki/Three.js

Homepage: http://threejs.org/

(WebGL)

Google ARCore Web is built on Three.js
React VR is built on Three.js

Sigmajs¶

Homepage: http://sigmajs.org/

Graphs in Javascript