Data Science

https://en.wikipedia.org/wiki/Data_science

Open Source Tools:

datasciencemasters.org

“The Open Source Data Science Masters”

Ten Simple Rules

#TenSimpleRules for Reproducible Computational Research

“Ten Simple Rules for Reproducible Computational Research”
DOI: 10.1371/journal.pcbi.1003285 Featured in PLOS Collections
  1. For Every Result, Keep Track of How It Was Produced
  2. Avoid Manual Data Manipulation Steps
  3. Archive the Exact Versions of All External Programs Used
  4. Version Control All Custom Scripts
  5. Record All Intermediate Results, When Possible in Standardized Formats
  6. For Analyses That Include Randomness, Note Underlying Random Seeds
  7. Always Store Raw Data behind Plots
  8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
  9. Connect Textual Statements to Underlying Results
  10. Provide Public Access to Scripts, Runs, and Results
  1. For Every Result, Keep Track of How It Was Produced

  2. Avoid Manual Data Manipulation Steps

  3. Archive the Exact Versions of All External Programs Used

  4. Version Control All Custom Scripts

  5. Record All Intermediate Results, When Possible in Standardized Formats

  6. For Analyses That Include Randomness, Note Underlying Random Seeds

    Python random functions:

    import os
    print(os.environ.get('PYTHONHASHSEED'))  # None if the variable is unset
    RANDOMSEED = 1  # /dev/[x]random
    
    import random
    random.seed(RANDOMSEED)
    
    import numpy as np
    np.random.seed(RANDOMSEED)    # Seed
    print(np.random.get_state())  # State
    np.random.rand(4, 2) # (rows, cols, [...])
    np.random.randn(4, 2) # "standard normal" distribution
    

    Python hash randomization and algorithmic determinism:
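A minimal sketch of why this matters: since Python 3.3, `str` and `bytes` hashes are salted per process unless `PYTHONHASHSEED` is set, so anything that depends on `hash()` values or on the iteration order of a set of strings can differ between runs. The helper name `hash_in_subprocess` is hypothetical; it starts a fresh interpreter with a fixed seed and reads back the hash it computes.

```python
import os
import subprocess
import sys


def hash_in_subprocess(text, seed):
    """Return hash(text) as computed by a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


# Two separate interpreter processes, same seed: the hashes should agree.
first = hash_in_subprocess("data", 0)
second = hash_in_subprocess("data", 0)
print(first == second)
```

Recording the value of `PYTHONHASHSEED` next to the other random seeds is one way to keep this source of variation under Rule 6.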

  7. Always Store Raw Data behind Plots

  8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected

  9. Connect Textual Statements to Underlying Results

  10. Provide Public Access to Scripts, Runs, and Results

#TenSimpleRules for Creating a Good Data Management Plan

“Ten Simple Rules for Creating a Good Data Management Plan”
DOI: 10.1371/journal.pcbi.1004525
  1. Determine the Research Sponsor Requirements
  2. Identify the Data to Be Collected
  3. Define How the Data Will Be Organized
  4. Explain How the Data Will Be Documented
  5. Describe How Data Quality Will Be Assured
  6. Present a Sound Data Storage and Preservation Strategy
  7. Define the Project’s Data Policies
  8. Describe How the Data Will Be Disseminated
  9. Assign Roles and Responsibilities
  10. Prepare a Realistic Budget

http://journals.plos.org/plosone/s/data-availability

> PLOS journals require authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception.

Data, Information, Knowledge, & Wisdom

https://en.wikipedia.org/wiki/Data

https://en.wikipedia.org/wiki/Information

https://en.wikipedia.org/wiki/Knowledge (see: Knowledge Engineering)

https://en.wikipedia.org/wiki/Wisdom

# Lead -> Gold
  • Data is information
  • Information is data
  • Raw data is not knowledge
  • Wisdom compares knowledges

Optimization

https://en.wikipedia.org/wiki/Mathematical_optimization

Find local and global optima (maxima and minima) within an n-dimensional field which may be limited by resource constraints.

# Global optima of a 1-dimensional list
points = [10, 20, 100, 20, 10]
global_max, global_min = max(points), min(points)
assert global_max == 100
assert global_min == 10

# Local optima of a 1-dimensional list (restricted to the first two points)
sample = points[:2]
local_max, local_min = max(sample), min(sample)
assert local_max == 20
assert local_min == 10

# A 2-dimensional list ...
points = [(-0.5, 0),
          (0,  0.5),
          (0.5,  0),
          (0, -0.5)]
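As a rough sketch of the distinction between local and global optima, the hypothetical helper `local_optima` below reports every interior point that is strictly higher or lower than both neighbors. The `series` values are illustrative and not taken from the text above. For the 2-dimensional points, an objective has to be chosen before "best" means anything; here it is assumed to be the second coordinate.

```python
def local_optima(values):
    """Return (maxima, minima): indices of strict interior local optima."""
    maxima, minima = [], []
    for i in range(1, len(values) - 1):
        left, here, right = values[i - 1], values[i], values[i + 1]
        if here > left and here > right:
            maxima.append(i)
        elif here < left and here < right:
            minima.append(i)
    return maxima, minima


series = [1, 3, 2, 5, 4]  # illustrative values
maxima, minima = local_optima(series)
print(maxima, minima)

# 2-dimensional points: optimize an assumed objective, the second coordinate.
points_2d = [(-0.5, 0), (0, 0.5), (0.5, 0), (0, -0.5)]
highest = max(points_2d, key=lambda p: p[1])
lowest = min(points_2d, key=lambda p: p[1])
print(highest, lowest)
```

The global maximum of `series` is the largest of its local maxima only because the endpoints happen to be smaller; endpoints need a separate check in general.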

Smoothies

Data

Inputs, Outputs

Revenue:

2014-01-01 1200 CDT  $80
2014-01-01 1210 CDT  $100
2014-01-01 1500 CDT  $20

Expenses:

2014-01-01 wages     $256 ($8/hr * 8hrs * 4 people)
2014-01-01 utilities $100

Information

Aggregations, Tendencies

Revenue (gross):

2014-01-01  total: $200

Expenses:

2014-01-01  total: $356

Net:

2013-01-01  net:  -$200
2014-01-01  net:  -$156

On Mondays, we make about $500 on (simple) average.
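The 2014-01-01 totals above can be recomputed from the raw rows in the Data section. This is only a sketch: the variable names are assumptions, and the amounts are the ones listed under Revenue and Expenses.

```python
# Raw data for 2014-01-01, copied from the Data section.
revenue = [80, 100, 20]                             # individual sales, in dollars
expenses = {"wages": 8 * 8 * 4, "utilities": 100}   # $8/hr * 8 hrs * 4 people

# Information: aggregations derived from the raw data.
total_revenue = sum(revenue)
total_expenses = sum(expenses.values())
net = total_revenue - total_expenses

print(f"revenue:  ${total_revenue}")
print(f"expenses: ${total_expenses}")
print(f"net:      {net}")
```

Keeping the raw rows and deriving the totals in code means the "Information" layer can always be regenerated from the "Data" layer.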

Knowledge

  • Positive net revenue is good.
  • One customer is worth the world to us.

Wisdom

We could save money by not being open on New Year's Day, but our loyal customers would not be happy about that.

Body Temperature

Data

time, body temp, outdoor temp, indoors/outdoors
time, exercise type, intensity, duration

Information

Daily temperature variance is about n degrees

Knowledge

  • Walking outside when it is warm increases body temperature
  • Walking outside when it is cold decreases body temperature
  • Exercise increases body temperature

Wisdom

If it’s 1745, and body temperature is n degrees above baseline, I’m probably walking outside and it is hot out.

Theory

Science

https://en.wikipedia.org/wiki/Science

https://en.wikipedia.org/wiki/Outline_of_science

https://en.wikipedia.org/wiki/Category:Science

Math

https://en.wikipedia.org/wiki/Mathematics

https://en.wikipedia.org/wiki/Outline_of_mathematics

https://en.wikipedia.org/wiki/Mathematics_education#Methods

Rosalind

Bioinformatics and Data Science Algorithm Problems and Exercises

Analysis

https://en.wikipedia.org/wiki/Data_analysis

https://en.wikipedia.org/wiki/Big_data

https://en.wikipedia.org/wiki/Data_processing#Data_processing_functions

Techniques

Automated Workflows

Standard, Automated Workflows

Q: Is there confirmation bias in starting with e.g. simple regression analysis?

Q: Which factors did we know we were capturing?

5 ★ Linked Open Data

http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆

Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).

☆☆

Publish structured data on the Web in a machine-readable format (e.g. XML).

☆☆☆

Publish structured data on the Web in a documented, non-proprietary data format (e.g. CSV, KML).

☆☆☆☆

Publish structured data on the Web as RDF (e.g. Turtle, RDFa, JSON-LD, SPARQL).

☆☆☆☆☆

In your RDF, have the identifiers be links (URLs) to useful data sources.
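As an illustration of the upper levels of the scale, the sketch below builds a small JSON-LD document (one of the RDF serializations named above) whose identifiers are URLs. The `example.org` URL, the dataset name, and the choice of the schema.org vocabulary are all assumptions made for the example.

```python
import json

# A dataset description as JSON-LD; the identifier and license are links (URLs).
doc = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/1",  # placeholder URL
    "name": "Example Dataset",               # placeholder name
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

serialized = json.dumps(doc, indent=2)
print(serialized)
```

A consumer that understands JSON-LD can follow the `@id` and `license` links, which is what separates this from the same record published as a bare CSV row.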

http://5stardata.info/

See: Knowledge Engineering, Semantic Web Standards

Data Visualization Tools

Mayavi

qgrid

  • SlickGrid w/ IPython Notebook / Jupyter Notebook
  • pandas support

Three.js

(WebGL)

  • Google ARCore Web is built on Three.js
  • React VR is built on Three.js

Sigmajs

See Also