Data Science¶
https://en.wikipedia.org/wiki/Data_science
- Science
- Open Science
- Scientific Method
- Reproducibility
Open Source Tools:
- SageMathCloud (SageMath)
- Jupyter Notebook
- Jupyter Docker Stacks (Conda)
- Jupyter Extensions
- Jupyter and Reproducibility
datasciencemasters.org¶
Ten Simple Rules¶
#TenSimpleRules for Reproducible Computational Research¶
- For Every Result, Keep Track of How It Was Produced
- Avoid Manual Data Manipulation Steps
- Archive the Exact Versions of All External Programs Used
- Version Control All Custom Scripts
- Record All Intermediate Results, When Possible in Standardized Formats
- For Analyses That Include Randomness, Note Underlying Random Seeds
- Always Store Raw Data behind Plots
- Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- Connect Textual Statements to Underlying Results
- Provide Public Access to Scripts, Runs, and Results
For Every Result, Keep Track of How It Was Produced
- RDF, JSON-LD (e.g. W3C PROV)
- Workflow
- Knowledge Engineering > Linked Data
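As a rough illustration of this rule, the provenance of a result could be recorded as JSON-LD using terms from the W3C PROV ontology. This is only a sketch: the `http://example.org/` identifiers and file names are hypothetical placeholders, and only the `prov:` terms come from the PROV vocabulary.

```python
import json

# Hypothetical provenance record; every http://example.org/ identifier is made up.
provenance = {
    "@context": {"prov": "http://www.w3.org/ns/prov#"},
    "@graph": [
        {
            # The result (e.g. a figure) is an Entity generated by an Activity.
            "@id": "http://example.org/results/figure1.png",
            "@type": "prov:Entity",
            "prov:wasGeneratedBy": {"@id": "http://example.org/runs/42"},
        },
        {
            # The Activity (one analysis run) used the raw data and the script.
            "@id": "http://example.org/runs/42",
            "@type": "prov:Activity",
            "prov:used": [
                {"@id": "http://example.org/data/raw.csv"},
                {"@id": "http://example.org/scripts/plot.py"},
            ],
        },
    ],
}

provenance_json = json.dumps(provenance, indent=2, sort_keys=True)
print(provenance_json)
```

Storing a record like this next to each output makes it possible to trace a figure back to the run, the script, and the data that produced it.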
Avoid Manual Data Manipulation Steps
- Workflow
- Continuous Delivery
Archive the Exact Versions of All External Programs Used
- Jupyter and Reproducibility (%version_information, %watermark) (should be “Reproducibility and Jupyter Notebook”)
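Outside of a notebook, similar version information can be collected with the standard library alone. The helper below is a minimal sketch, not a replacement for the magics named above; the function name `environment_versions` and the package names passed to it are assumptions chosen for illustration.

```python
import platform
from importlib import metadata


def environment_versions(packages):
    """Return the Python and platform versions plus the installed version of each package."""
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return versions


# Hypothetical package list; the second name is deliberately not installed.
versions = environment_versions(["pip", "surely-not-an-installed-package-xyz"])
print(versions)
```

Writing this dictionary to a file alongside each result is one way to archive the exact versions that were in use.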
Version Control All Custom Scripts
Record All Intermediate Results, When Possible in Standardized Formats
- Linked Data (e.g. 5 ★ Linked Open Data)
For Analyses That Include Randomness, Note Underlying Random Seeds
Python random functions:
import os
import random

import numpy as np

print(os.environ.get('PYTHONHASHSEED'))  # None if the variable is not set
RANDOMSEED = 1  # /dev/[x]random
random.seed(RANDOMSEED)
np.random.seed(RANDOMSEED)  # Seed
print(np.random.get_state())  # State
np.random.rand(4, 2)  # (rows, cols, [...])
np.random.randn(4, 2)  # "standard normal" distribution
Python hash randomization and algorithmic determinism:
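A minimal sketch of the effect, assuming a CPython interpreter is available as `sys.executable`: string hashes are salted per process unless `PYTHONHASHSEED` is fixed, so two fresh interpreters started with the same seed should report the same hash. The helper name `string_hash` and the hashed string are illustrative only.

```python
import os
import subprocess
import sys


def string_hash(seed):
    """Return hash('data science') as printed by a fresh interpreter run with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('data science'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Same seed, two separate processes: the hashes should match.
first = string_hash("0")
second = string_hash("0")
print(first == second)
```

Recording the value of `PYTHONHASHSEED` with each run matters mainly when results depend on the iteration order of sets or on hash values themselves.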
Always Store Raw Data behind Plots
- Or, “Generate all plots from [source-controlled] [transforms-of] raw data”
  - ./data
  - ./tests/data
  - ./nb/data (./notebooks)
- Data Visualization, Data Visualization Tools
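One way to follow this rule is to write the exact values behind each plot to a file before (or while) drawing it. The sketch below uses only the standard library; the function names, the `figure1.csv` file name, and the sample values are hypothetical, and a temporary directory stands in for `./data`.

```python
import csv
import tempfile
from pathlib import Path


def save_plot_data(rows, header, path):
    """Write the values behind a plot to a CSV file, creating parent directories."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def load_plot_data(path):
    """Read the CSV back as a header list and a list of float tuples."""
    with Path(path).open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, [tuple(float(value) for value in row) for row in reader]


# Hypothetical series for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = save_plot_data([(0.0, 1.0), (1.0, 4.0)], ["x", "y"],
                              Path(tmp) / "data" / "figure1.csv")
    header, rows = load_plot_data(csv_path)

print(header, rows)
```

If the plotting code reads from the saved file rather than from in-memory variables, the stored data and the figure cannot drift apart.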
Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- SKOS: skos:narrower, skos:narrowerTransitive, skos:broader, skos:broaderTransitive, [...]
- XKOS: “An SKOS extension for representing statistical classifications”
- RDF Data Cubes: “The RDF Data Cube Vocabulary”: qb:DataSet, qb:Dimension, qb:ObservationGroup, qb:Slice, [...]
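To make the hierarchy idea concrete, the sketch below follows direct "broader" links upward to collect every ancestor of a concept, which is roughly what skos:broaderTransitive expresses relative to skos:broader. It is plain Python rather than an RDF library, and the `ex:` concept names are hypothetical.

```python
def broader_transitive(broader, concept):
    """Return every concept reachable from `concept` by following broader links."""
    seen = set()
    stack = list(broader.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(broader.get(current, ()))
    return seen


# Hypothetical concept scheme: concept -> set of directly broader concepts.
broader = {
    "ex:DifferentialCalculus": {"ex:Calculus"},
    "ex:Calculus": {"ex:Mathematics"},
    "ex:Statistics": {"ex:Mathematics"},
}

ancestors = broader_transitive(broader, "ex:DifferentialCalculus")
print(sorted(ancestors))
```

Analysis output organized this way can be inspected at the top level ("ex:Mathematics") or drilled into one layer at a time.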
Connect Textual Statements to Underlying Results
- Linked Data: URIs, URLs, #uri-fragments
  - Turtle / TriG: <> (this document, this named graph)
- ReStructuredText
  - http://sphinx-doc.org/rest.html#footnotes #citations #substitutions
  - https://github.com/yoloseem/awesome-sphinxdoc
- Linked Reproducibility: URIs, URLs, #uri-fragments
Provide Public Access to Scripts, Runs, and Results
#TenSimpleRules for Creating a Good Data Management Plan¶
- Determine the Research Sponsor Requirements
- Identify the Data to Be Collected
- Define How the Data Will Be Organized
- Explain How the Data Will Be Documented
- Describe How Data Quality Will Be Assured
- Present a Sound Data Storage and Preservation Strategy
- Define the Project’s Data Policies
- Describe How the Data Will Be Disseminated
- Assign Roles and Responsibilities
- Prepare a Realistic Budget
http://journals.plos.org/plosone/s/data-availability
> PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.
Data, Information, Knowledge, & Wisdom¶
https://en.wikipedia.org/wiki/Data
https://en.wikipedia.org/wiki/Information
https://en.wikipedia.org/wiki/Knowledge (see: Knowledge Engineering)
https://en.wikipedia.org/wiki/Wisdom
# Lead -> Gold
- Data is information
- Information is data
- Raw data is not knowledge
- Wisdom compares knowledges
Optimization¶
https://en.wikipedia.org/wiki/Mathematical_optimization
Find local and global optima (maxima and minima) within an n-dimensional field which may be limited by resource constraints.
# Global optima of a 1-dimensional list
points = [10, 20, 100, 20, 10]
global_max, global_min = max(points), min(points)
assert global_max == 100
assert global_min == 10
# Local optima within a sample (the first two points) of a 1-dimensional list
sample = points[:2]
local_max, local_min = max(sample), min(sample)
assert local_max == 20
assert local_min == 10
# A 2-dimensional list ...
points = [(-0.5, 0),
(0, 0.5),
(0.5, 0),
(0, -0.5)]
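Continuing the example, one possible way to find optima in the 2-dimensional list is to pick the coordinate to optimize and pass it as a `key`. Reading each tuple as `(x, y)` and optimizing over `y` is an assumption made here for illustration.

```python
# The same four points as above, read as (x, y) pairs (an assumption).
points = [(-0.5, 0),
          (0, 0.5),
          (0.5, 0),
          (0, -0.5)]

# Optimize over the second coordinate (y).
highest = max(points, key=lambda point: point[1])
lowest = min(points, key=lambda point: point[1])

assert highest == (0, 0.5)
assert lowest == (0, -0.5)
```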
Smoothies¶
Data
Inputs, Outputs
Revenue:
2014-01-01 1200 CDT $80
2014-01-01 1210 CDT $100
2014-01-01 1500 CDT $20
Expenses:
2014-01-01 wages $256 ($8/hr * 8hrs * 4 people)
2014-01-01 utilities $100
Information
Aggregations, Tendencies
Revenue (gross):
2014-01-01 total: $200
Expenses:
2014-01-01 total: $356
Net:
2013-01-01 net: -$200
2014-01-01 net: -$156
On Mondays, we usually (on (simple) average) make about $500.
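The 2014-01-01 totals above can be reproduced from the raw Data records. The sketch below simply re-enters those records as Python tuples; the variable names are illustrative.

```python
# Raw data for 2014-01-01, copied from the Data section above.
revenue = [
    ("2014-01-01 1200 CDT", 80),
    ("2014-01-01 1210 CDT", 100),
    ("2014-01-01 1500 CDT", 20),
]
expenses = [
    ("2014-01-01", "wages", 8 * 8 * 4),  # $8/hr * 8hrs * 4 people
    ("2014-01-01", "utilities", 100),
]

# Information: aggregate the raw records.
total_revenue = sum(amount for _, amount in revenue)
total_expenses = sum(amount for _, _, amount in expenses)
net = total_revenue - total_expenses

assert total_revenue == 200
assert total_expenses == 356
assert net == -156
```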
Knowledge
- Positive net revenue is good.
- One customer is worth the world to us.
Wisdom
We could save money by closing on New Year’s Day, but our loyal customers would not be happy about that.
Body Temperature¶
Data
time, body temp, outdoor temp, indoors/outdoors
time, exercise type, intensity, duration
Information
Daily temperature variance is about n degrees
Knowledge
- Walking outside when it is warm increases body temperature
- Walking outside when it is cold decreases body temperature
- Exercise increases body temperature
Wisdom
If it’s 1745, and body temperature is n degrees above baseline, I’m probably walking outside and it is hot out.
Theory¶
Science¶
https://en.wikipedia.org/wiki/Science
https://en.wikipedia.org/wiki/Outline_of_science
https://en.wikipedia.org/wiki/Category:Science
Cognitive Biases¶
https://en.wikipedia.org/wiki/Cognitive_bias
https://en.wikipedia.org/wiki/Heuristics_in_judgment_and_decision-making
https://en.wikipedia.org/wiki/List_of_cognitive_biases
- https://en.wikipedia.org/wiki/Confirmation_bias
- https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc
- https://en.wikipedia.org/wiki/Logical_fallacies#See_also
- https://en.wikipedia.org/wiki/List_of_fallacies
- https://en.wikipedia.org/wiki/Controlling_for_a_variable
- “distance walked per day”
- “sports played” (sport, years)
Open Science¶
Scientific Method¶
https://en.wikipedia.org/wiki/Scientific_method
https://en.wikipedia.org/wiki/Argument
https://en.wikipedia.org/wiki/Empirical_evidence
https://en.wikipedia.org/wiki/Hypothesis
- https://en.wikipedia.org/wiki/Statistical_hypothesis_testing
- https://en.wikipedia.org/wiki/Null_hypothesis
- https://en.wikipedia.org/wiki/Alternative_hypothesis
- https://en.wikipedia.org/wiki/Dependent_and_independent_variables
Reproducibility¶
Systematic Review¶
Linked Reproducibility¶
#LinkedReproducibility
Note
This heading is now merged into a separate page: LinkedReproducibility
Math¶
https://en.wikipedia.org/wiki/Mathematics
https://en.wikipedia.org/wiki/Outline_of_mathematics
https://en.wikipedia.org/wiki/Mathematics_education#Methods
Math Courses¶
“Mathematics for Computer Science” (CC-BY-SA 3.0)
https://en.wikipedia.org/wiki/Kaggle#How_Kaggle_competitions_work
Rosalind¶
Bioinformatics and Data Science Algorithm Problems and Exercises
Mathematical Notation¶
- https://en.wikipedia.org/wiki/Outline_of_mathematics#Mathematical_notation
- https://en.wikipedia.org/wiki/List_of_mathematical_symbols
- https://en.wikipedia.org/wiki/List_of_mathematical_symbols_by_subject
- https://en.wikipedia.org/wiki/Greek_letters_used_in_mathematics,_science,_and_engineering
- https://en.wikipedia.org/wiki/Latin_letters_used_in_mathematics
See:
- Knowledge Engineering > Symbols
- Units > Units and RDF
MathJax¶
MathJax is a JavaScript library for displaying MathML, LaTeX, and ASCIIMathML markup in a browser.
MathJax and IPython Notebook / Jupyter Notebook:
MathML¶
Information Theory¶
https://en.wikipedia.org/wiki/Information_theory
https://en.wikipedia.org/wiki/Entropy_(information_theory)
https://en.wikipedia.org/wiki/Signal_(electrical_engineering)
https://en.wikipedia.org/wiki/Noise_(signal_processing)
Linear Algebra¶
Calculus¶
https://en.wikipedia.org/wiki/Calculus
- https://www.khanacademy.org/math/precalculus
- https://www.khanacademy.org/math/differential-calculus
- https://www.khanacademy.org/math/integral-calculus
- https://www.khanacademy.org/math/multivariable-calculus
- https://www.khanacademy.org/math/differential-equations
- https://en.wikipedia.org/wiki/AP_Calculus
- http://apcentral.collegeboard.com/apc/public/courses/teachers_corner/2178.html
- http://www.sagemath.org/calctut/
- http://boxen.math.washington.edu/home/wdj/teaching/calc1-sage/
- http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-5-Sympy.ipynb
- http://scipy-lectures.github.io/advanced/sympy.html#calculus
- https://www.class-central.com/subject/calculus-and-mathematical-analysis
Statistics¶
https://en.wikipedia.org/wiki/Statistics
https://en.wikipedia.org/wiki/Outline_of_statistics
https://en.wikipedia.org/wiki/Category:Statistics
- https://en.wikipedia.org/wiki/Notation_in_probability_and_statistics
- http://apcentral.collegeboard.com/apc/public/courses/teachers_corner/2151.html
- https://www.class-central.com/search?q=statistics
Parametric Statistics¶
Nonparametric Statistics¶
https://en.wikipedia.org/wiki/Nonparametric_statistics
Descriptive Statistics¶
Statistical Inference¶
Analysis¶
https://en.wikipedia.org/wiki/Data_analysis
https://en.wikipedia.org/wiki/Big_data
https://en.wikipedia.org/wiki/Data_processing#Data_processing_functions
Learning¶
https://en.wikipedia.org/wiki/Learning
- http://plato.stanford.edu/entries/learning-formal/
- http://plato.stanford.edu/entries/logic-inductive/
https://en.wikipedia.org/wiki/Autodidacticism
https://en.wikipedia.org/wiki/Perceptual_learning
https://en.wikipedia.org/wiki/Pattern_recognition_(psychology)#False_pattern_recognition
https://en.wikipedia.org/wiki/Rhetoric
https://en.wikipedia.org/wiki/Socratic_method
https://en.wikipedia.org/wiki/Socratic_questioning
https://en.wikipedia.org/wiki/Platonic_dialogue#The_dialogues
https://en.wikipedia.org/wiki/Dialectic
https://en.wikipedia.org/wiki/Dialogue
https://en.wikipedia.org/wiki/Perturbation_theory_(quantum_mechanics)
https://en.wikipedia.org/wiki/Validated_learning
Data Mining¶
https://en.wikipedia.org/wiki/Data_mining
Data Dredging¶
- !
- Causality
- spurious correlations
Machine Learning¶
Deep Learning¶
- https://en.wikipedia.org/wiki/Biological_neural_network
- https://en.wikipedia.org/wiki/Artificial_neural_network
- https://en.wikipedia.org/wiki/Recurrent_neural_network
- http://www.scholarpedia.org/article/Recurrent_neural_networks
- https://en.wikipedia.org/wiki/Feedforward_neural_network
- https://en.wikipedia.org/wiki/Convolutional_neural_network
- https://en.wikipedia.org/wiki/Perceptron
- https://en.wikipedia.org/wiki/Reservoir_computing
- http://deeplearning.net/
Tools¶
ETL¶
Workflow¶
- Scientific Method
- Project Management
- https://en.wikipedia.org/wiki/Checklist
- https://en.wikipedia.org/wiki/Scientific_workflow_system
- Units of measure
- I/O Transforms of information(/energy)
“Data Provenance”, “Data Lineage”
- https://en.wikipedia.org/wiki/Provenance#Data_provenance
- https://en.wikipedia.org/wiki/Data_lineage#Data_Provenance
- W3C PROV Provenance Ontology
See:
Techniques¶
Automated Workflows¶
Standard, Automated Workflows
Q: Is there confirmation bias in starting with e.g. simple regression analysis?
Q: Which factors did we know we were capturing?
5 ★ Linked Open Data¶
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
☆
Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆
Publish structured data on the Web in a machine-readable format (e.g. XML).
☆☆☆
Publish structured data on the Web in a documented, non-proprietary data format (e.g. CSV, KML).
☆☆☆☆
Publish structured data on the Web as RDF (e.g. Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆
In your RDF, have the identifiers be links (URLs) to useful data sources.
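As a rough sketch of moving from three stars (CSV) toward five (RDF with link identifiers), the snippet below turns CSV rows into a JSON-LD document whose identifiers are URLs. The `http://example.org/` identifier and the column names are hypothetical; the `name` and `sameAs` terms are mapped to schema.org properties in the `@context`.

```python
import csv
import io
import json

# Hypothetical 3-star input: a small CSV table.
csv_text = (
    "id,name,same_as\n"
    "http://example.org/topics/data-science,Data science,"
    "https://en.wikipedia.org/wiki/Data_science\n"
)


def rows_to_jsonld(text):
    """Convert CSV rows into a JSON-LD document with URL identifiers."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        "@context": {
            "name": "http://schema.org/name",
            # "@type": "@id" marks sameAs values as links rather than strings.
            "sameAs": {"@id": "http://schema.org/sameAs", "@type": "@id"},
        },
        "@graph": [
            {"@id": row["id"], "name": row["name"], "sameAs": row["same_as"]}
            for row in reader
        ],
    }


doc = rows_to_jsonld(csv_text)
print(json.dumps(doc, indent=2))
```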
Data Visualization¶
Visualizing Data Science¶
The Data Science Venn Diagram
- http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
- http://datascienceassn.org/content/fourth-bubble-data-science-venn-diagram-social-sciences
Field representations
Data Visualization Tools¶
- https://github.com/vinta/awesome-python#data-visualization
- https://github.com/sorrycc/awesome-javascript#data-visualization
- https://pandas.pydata.org/pandas-docs/stable/ecosystem.html#visualization
Matplotlib¶
Scipy lectures:
http://scipy-lectures.github.io/intro/matplotlib/matplotlib.html
Scientific-python-lectures:
http://tonysyu.github.com/mpltools/auto_examples/index.html#style-package
http://mpld3.github.io/ (Matplotlib + D3.js)
- Install with Conda (Anaconda): conda install matplotlib
- pandas plot functions generate matplotlib charts.
Seaborn¶
- “Seaborn is a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.”
Mayavi¶
“Mayavi: 3D scientific data visualization and plotting in Python“
Scipy lectures:
Bokeh¶
VisPy¶
Plotly¶
PyQtGraph¶
D3.js¶
Three.js¶
(WebGL)
- Google ARCore Web is built on Three.js
- React VR is built on Three.js
See Also¶
- Tools > Semantic Web Tools
- Art & Design
- Machine Learning