Data Analysis with Python – The Tools and the Data

Welcome to the Inaugural Post for my Learning Python Series

This post is the start of a series of walkthroughs, from start to finish, of my journey into data analysis and data science with Python.

 

The Tools and Loading the Data

What tools do I need to download to get started building in Python?

How do I load the data and construct the domain?

How do I do some basic analysis on the data to get a feel for the relationships?

The Next Part

The next post will use pandas to perform quicker, more structured data analysis.

 

 

The Tools

I downloaded and installed the Anaconda distribution along with the Visual Studio Python Tools.

Anaconda is a Python distribution that bundles many scientific libraries.
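
A quick way to confirm the install worked is to check that the libraries used in this series can be found. This is only a small sketch of my own: the helper name `check_libraries` is hypothetical (it is not part of Anaconda), and it relies solely on the standard library's `importlib`.

```python
import importlib.util


def check_libraries(names):
    """Return a dict mapping each module name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report on the modules this series relies on
for name, found in check_libraries(["csv", "collections", "matplotlib", "pandas"]).items():
    print(name, "OK" if found else "missing")
```

If `matplotlib` or `pandas` shows up as missing, the interpreter you are running is probably not the one Anaconda installed.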

The Data

Fortunately there is a massive amount of data that you can have fun and experiment with.

The UCI Machine Learning Repository has a huge amount of data.

Examples

  • Mice Protein Expression
    • Expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.
  • Car Evaluation Data Set
    • Derived from simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.
  • Adult Data Set
    • Predict whether income exceeds $50K/yr based on census data. Also known as “Census Income” dataset.

I am going to use the Adult Data Set in my Examples.

 

Loading the Data and Graphing the Data

Grab the Data from

http://archive.ics.uci.edu/ml/datasets/Adult

import csv

# Raw string so that backslashes in the Windows path are not treated as escapes
with open(r'C:\adult.test', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        # Skip blank or incomplete lines
        if len(row) < 7:
            continue
        age = row[0]
        workclass = row[1]
        fnlweight = row[2]
        education = row[3]
        educationnum = row[4]
        maritalstatus = row[5]
        occupation = row[6]

        print(workclass)

This will print out the workclass column in the data.

Resulting in an output like

 State-gov
 Federal-gov
 Private
 Private
 Private
 Local-gov
 Private
 Local-gov
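
Each value above prints with a leading space, which suggests the file puts a space after every comma. As a small sketch, `csv.reader` accepts `skipinitialspace=True` to drop that space. The two sample rows below are illustrative only, cut down to seven columns, and are read from an in-memory string rather than from the real file. The helper name `read_workclass` is my own.

```python
import csv
import io

# Two illustrative rows laid out like the first seven columns of the file
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical\n"
    "50, Private, 83311, HS-grad, 9, Married-civ-spouse, Exec-managerial\n"
)


def read_workclass(text):
    """Return the second column of every non-empty row, without leading spaces."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [row[1] for row in reader if len(row) > 1]


workclasses = read_workclass(sample)
print(workclasses)
```

Swapping `io.StringIO(text)` for an open file handle would apply the same cleanup to the real data.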

To start looking into the data, we can use some of Python's built-in magic to bucket the data and create some histograms.

Histograms tell us how the data is distributed and start to give us clues about its shape.

To create a histogram, we will use Counter from the collections library to count the data.

 

import csv
from collections import Counter

import matplotlib.pyplot as plt


def create_histogram(labels, values, bucket_size, title):
    # bucket_size is not used yet; every distinct age gets its own bar
    plt.bar(labels, values)
    plt.title(title)
    plt.show()


agelist = list()

# Raw string so that backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\Jon\Documents\adult.test', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            agelist.append(int(row[0]))
        except (IndexError, ValueError):
            # Blank lines and non-data lines have no numeric age; skip them
            continue

agedist = Counter(agelist)

labels, values = zip(*agedist.items())

create_histogram(labels, values, 5, "Age Distribution Simple")

Once the file has been read, I take my agelist and run it through Counter.

Counter allows for rapid tallying of data. It returns a Counter object, a dict subclass that maps each age to the number of occurrences.

We then unzip the (age, count) pairs into labels and values. The * unpacks the pairs so that zip reverses the earlier pairing.

Finally we call pyplot.bar(labels, values) and pyplot.show() to draw and display the graph.
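
To see the Counter and zip(*...) steps in isolation, here is a minimal sketch on a handful of made-up ages (not taken from the data set):

```python
from collections import Counter

ages = [25, 38, 25, 44, 38, 25]  # made-up values for illustration

age_counts = Counter(ages)
print(age_counts[25])  # how many times 25 appears
print(age_counts[99])  # a missing key counts as 0 instead of raising KeyError

# items() yields (age, count) pairs; zip(*...) splits them into two tuples
labels, values = zip(*age_counts.items())
print(labels)
print(values)
```

Here `labels` holds the distinct ages and `values` holds the matching counts, in the same order, which is the shape `pyplot.bar` expects.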

Simple Histogram

We can see that the ages follow a right-skewed distribution. The data covers the income-generating population: it starts at around 18, and after a peak at around 40 it gradually tails off.
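
The chart above draws one bar per distinct age. A common way to smooth that out is to group ages into fixed-width buckets before counting. The sketch below shows one way to do it; the helper names `bucketize` and `bucket_counts` and the sample ages are my own assumptions, not part of the code above.

```python
from collections import Counter


def bucketize(value, bucket_size):
    """Round value down to the nearest multiple of bucket_size."""
    return bucket_size * (value // bucket_size)


def bucket_counts(values, bucket_size):
    """Count how many values fall into each bucket."""
    return Counter(bucketize(v, bucket_size) for v in values)


ages = [18, 19, 23, 37, 38, 41, 42, 44, 67]  # made-up values for illustration
counts = bucket_counts(ages, 5)

# Print each bucket's lower bound next to its count
for start in sorted(counts):
    print(start, counts[start])
```

The bucket starts and their counts could then be passed to `plt.bar` in place of the raw ages, with `width=bucket_size` so that each bar spans its bucket.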

 

First Refactoring – Making the code a bit more compact and readable

I mapped each row to a named tuple so that I could iterate through the data a bit more intuitively.

First I created a list of the column names in the data.

    import collections
    from collections import Counter

    economic_columns = ['age', 'workclass', 'fnlwght', 'education', 'educationNum', 'maritalStatus', 'occupation', 'relationship', 'race', 'sex', 'capitalGain', 'capitalLoss', 'hoursPerWeek','nativeCountry', 'income']

    EconRecord = collections.namedtuple('econ',economic_columns)

 

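I then built a list that holds one named tuple per row.

    rowlist = list()

    with open(r'C:\Users\Jon\Documents\adult.txt', 'r') as f:
        # Keep only complete rows; blank lines would make _make raise a TypeError
        rows = (row for row in csv.reader(f) if len(row) == len(economic_columns))
        for econ in map(EconRecord._make, rows):
            rowlist.append(econ)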

I used the map function to apply EconRecord._make to every row in the file, creating a new econ record for each one.

The result is that I can aggregate the items more cleanly, with more concise and readable code.

    agelistint = [int(x.age) for x in rowlist]
    agedist = Counter(agelistint)
    labels, values = zip(*agedist.items())
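
To see the named tuple mapping on its own, here is a cut-down sketch. The three-column `Person` tuple and the two sample rows are hypothetical stand-ins for the full 15-column record, read from an in-memory string instead of the file.

```python
import collections
import csv
import io

# Hypothetical three-column record, standing in for the full 15-column one
Person = collections.namedtuple('Person', ['age', 'workclass', 'education'])

sample = "39, State-gov, Bachelors\n50, Private, HS-grad\n"  # illustrative rows
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)

# _make builds one Person from each row (a list of strings)
people = list(map(Person._make, reader))

print(people[0].workclass)
print([int(p.age) for p in people])
```

Fields are read by name (`p.age`) rather than by position (`row[0]`), which is what makes the aggregation code easier to follow.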

In the next post in the series, we will start to draw scatter plots and look for relationships in the data using pandas.

 

