Tuesday, 17 June 2014

SQL and Databases

Data is great, but where do you put it so that you can access it easily?
One popular option is a relational database accessed using SQL (Structured Query Language). Whilst SQL has a standard core of commands, each database system adds its own extensions. Examples of these systems are PostgreSQL, SQLite, and MariaDB (open source), with commercial solutions from, for example, Oracle, IBM, Microsoft and SAP.

I'm currently going through Udacity's Intro to Data Science, which introduces SQL and has you carry out some basic queries. However, I wanted to go further and learn SQL in more detail. I found two really nice resources for learning SQL: sqlzoo and w3schools. I can really recommend them!

Here are a few important basic commands:

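SELECT parameter1, parameter2
FROM table_name                       -- FROM names a table, not a database
WHERE parameter1 > 10 AND parameter2 < 10
  AND parameter3 IN ('X', 'Y', 'Z')
  AND parameter4 =
    (SELECT AVG(parameter4)           -- subquery returning a single value
     FROM table_name
     WHERE parameter5 > 10)
GROUP BY parameter1, parameter2;      -- standard SQL: group by every non-aggregated selected column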
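
To see these commands run against real rows, here is a minimal sketch using Python's built-in sqlite3 module. The countries table, its columns and its values are all made up for illustration, and the subquery uses >= rather than = so that some rows survive the filter.

```python
import sqlite3

# Hypothetical in-memory table; the names and numbers are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (name TEXT, continent TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO countries VALUES (?, ?, ?)",
    [
        ("A", "X", 5),
        ("B", "X", 20),
        ("C", "Y", 36),
        ("D", "Y", 40),
        ("E", "Z", 50),
        ("F", "W", 60),
    ],
)

# WHERE filters rows, IN matches a list of values, the subquery supplies the
# overall average population, and GROUP BY aggregates the rows that remain.
query = """
    SELECT continent, COUNT(*), SUM(population)
    FROM countries
    WHERE population > 10
      AND continent IN ('X', 'Y', 'Z')
      AND population >= (SELECT AVG(population) FROM countries)
    GROUP BY continent
    ORDER BY continent;
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)  # [('Y', 2, 76), ('Z', 1, 50)]
```

Only C, D and E pass every condition: F is on a continent outside the IN list, and A and B fall below the average of about 35.2.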





Thursday, 12 June 2014

Simple Social Network Project

After completing the course part of Udacity's Intro to Computer Science, there was a final project, which was to create and then (in my case)* analyse a social network centred around users, who they are connected to, and which games they like to play.

The starting point was a string of sentences, e.g. "John is connected to Peter, Paul and Sarah. John likes to play Arrow Stars, Super Captain Jim, and I am Pilot."
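
One possible way to pull the pieces out of such a string is sketched below. This is not the parser from my project, just an illustration: the function name parse_profile is hypothetical, and it assumes every user is described by exactly these two sentences and that no user or game name contains a comma or the word " and ".

```python
import re


def parse_profile(text):
    """Parse one user's two sentences into (name, connections, games)."""
    match = re.match(
        r"(?P<name>.+?) is connected to (?P<friends>.*?)\. "
        r"(?P=name) likes to play (?P<games>.*?)\.?$",
        text.strip(),
    )
    if match is None:
        raise ValueError("unrecognised profile format")

    def split_list(items):
        # Split on commas (optionally followed by "and") and on a bare " and ".
        parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", items)
        return [part.strip() for part in parts if part.strip()]

    return (
        match.group("name"),
        split_list(match.group("friends")),
        split_list(match.group("games")),
    )


example = (
    "John is connected to Peter, Paul and Sarah. "
    "John likes to play Arrow Stars, Super Captain Jim, and I am Pilot"
)
name, friends, games = parse_profile(example)
print(name)     # John
print(friends)  # ['Peter', 'Paul', 'Sarah']
print(games)    # ['Arrow Stars', 'Super Captain Jim', 'I am Pilot']
```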

There was a design brief that one should create a data structure to hold information about users and a set of procedures that should also be available, for example, to add a new user, to see if two users are connected and to get the common connections of two users.
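
A minimal sketch of how such a data structure and its procedures could look is given below. The function names and the dictionary layout are my own assumptions for illustration, not necessarily those from the course brief, and connections are treated as one-directional (John being connected to Peter does not imply the reverse).

```python
def create_network():
    """Return an empty network: user -> {"connections": [...], "games": [...]}."""
    return {}


def add_new_user(network, user, games):
    """Add a user with their games; an existing user is left unchanged."""
    if user not in network:
        network[user] = {"connections": [], "games": list(games)}
    return network


def add_connection(network, user_a, user_b):
    """Record that user_a is connected to user_b; False if either is unknown."""
    if user_a not in network or user_b not in network:
        return False
    if user_b not in network[user_a]["connections"]:
        network[user_a]["connections"].append(user_b)
    return True


def is_connected(network, user_a, user_b):
    """True if user_a lists user_b as a direct connection."""
    return user_a in network and user_b in network[user_a]["connections"]


def common_connections(network, user_a, user_b):
    """Sorted list of users both are connected to; None if either is unknown."""
    if user_a not in network or user_b not in network:
        return None
    shared = set(network[user_a]["connections"]) & set(network[user_b]["connections"])
    return sorted(shared)


net = create_network()
for person in ["John", "Peter", "Paul", "Sarah"]:
    add_new_user(net, person, [])
add_connection(net, "John", "Peter")
add_connection(net, "John", "Sarah")
add_connection(net, "Paul", "Sarah")

print(is_connected(net, "John", "Peter"))       # True
print(is_connected(net, "Peter", "John"))       # False
print(common_connections(net, "John", "Paul"))  # ['Sarah']
```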

In the end, I also chose to analyse the network, creating a histogram of the most popular games and a histogram of the most popular users. In addition, using the Python package NetworkX, I visualized the network between the n most popular users, highlighting the m most popular users (where n > m). The result is shown below, where bold lines indicate that both users have "liked" each other.

* One had to create a procedure of one's own that either did something to the network or analysed it. I chose to analyse the network but, for me, this also included adding to the network.
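
The counting behind the popularity analysis in this post can be sketched with the standard library alone. The small network below is invented for illustration, "popular" is assumed here to mean "listed as a connection by the most users", and the NetworkX drawing step is left out.

```python
from collections import Counter

# Hypothetical network in a user -> {"connections", "games"} layout.
network = {
    "John": {"connections": ["Peter", "Paul", "Sarah"], "games": ["Arrow Stars", "I am Pilot"]},
    "Peter": {"connections": ["John", "Sarah"], "games": ["Arrow Stars"]},
    "Paul": {"connections": ["Sarah"], "games": ["Super Captain Jim", "Arrow Stars"]},
    "Sarah": {"connections": ["John"], "games": ["I am Pilot"]},
}

# How many users like each game.
game_counts = Counter(game for profile in network.values() for game in profile["games"])

# How many users list each person as a connection (incoming connections).
user_counts = Counter(friend for profile in network.values() for friend in profile["connections"])

# Pairs who list each other: the candidates for "bold" edges in a drawing.
mutual = sorted(
    (a, b)
    for a, profile in network.items()
    for b in profile["connections"]
    if a < b and a in network.get(b, {}).get("connections", [])
)

print(game_counts.most_common(1))  # [('Arrow Stars', 3)]
print(user_counts.most_common(2))  # [('Sarah', 3), ('John', 2)]
print(mutual)                      # [('John', 'Peter'), ('John', 'Sarah')]
```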

Monday, 9 June 2014

Git and GitHub

This week I started the Johns Hopkins University Data Science Specialization via Coursera. This course has 9 modules, plus a Capstone Project.

The first course is The Data Scientist's Toolbox, which introduces the version control system Git and the online repository hosting service GitHub. If you are going to develop programs, especially if you are going to be collaborating with others, version control is a must - how else will you be able to keep track of everything you have done and everything everyone else is doing?

Here are some common commands you will need in the command line:

git init
initialize a git repository in the current directory.

git remote add origin https://github.com/username/repository.git
associate your local repository with a repository on GitHub.

git add -A
stage all new, modified and deleted files for the next commit.

git commit -m "message relating what you have done for this version"
commit the staged changes as a new version in your repository.

git push
push the committed version of your (local) repository to the associated remote repository. The very first push may need git push -u origin master (or whatever your branch is called) to set the upstream branch.

As a first commit to my GitHub account, I have added my simple social network, which was my final project from the Udacity course Intro to Computer Science. I'll discuss this in my next post.


Wednesday, 4 June 2014

So, where is the beginning?

Everyone has a different starting point.

In my case, it's admittedly not zero - I've been working with experimental data and theoretical models for a while, mainly using MATLAB and Mathematica. However, the field of Data Science (as it is, in general, talked about) has little place* for MATLAB. In fact, data science is a whole other beast, from the point of view of languages and technology.

One of those seemingly important languages is Python. So, to begin my journey, so to speak, I have just completed the Intro to Computer Science course from Udacity. I found this really useful because I've never had any formal computer science training - my programming to date has been learning as and when I needed to do something, rather than learning important fundamental concepts. The course is based solely on Python, so I can really recommend it as a way to learn the fundamental parts of the Python language, plus all the general computer programming concepts, e.g. if statements, while loops, for loops, recursion, as well as how to build a search engine! In fact, the course revolves around understanding and building a search engine, so, in the end, you really have built a basic search engine, using the ranking algorithm used by Google when it first appeared all those years ago.
Then, the final part of the Udacity course was a Final Project, which was to build and analyse a social network from strings of user data. I'll post about my final project a little later.



* I must say, though, I definitely subscribe to the idea that you should choose the best tool for the job (which could be MATLAB!). First the problem, then the solution through the best technology you can muster.