Monday, 18 August 2014
I signed up to the mailing list of the Data Science Association a little while ago. They are a non-profit organisation whose guiding principle is "To promote data science to improve life, business and government." What I find neat is that they want to set standards for the data science industry, so that it doesn't just become some hocus-pocus world where anyone can claim to be a data scientist (and hence potentially give serious data scientists a bad name).
The other mailing list I signed up to is Data Science Central. This site is great! I can really recommend it! It's packed full of interesting articles, for example: 38 Seminal Articles Every Data Scientist Should Read and 66 job interview questions for data scientists.
Tuesday, 17 June 2014
SQL and Databases
Data is great, but where do you put it so that you can access it easily?
One popular option is a relational database accessed using SQL (structured query language). Whilst SQL has a general set of commands, different database systems have their own additional commands. Examples of these systems are PostgreSQL, SQLite, and MariaDB (open source), with commercial solutions from, for example, Oracle, IBM, Microsoft and SAP.
I'm currently going through Udacity's Intro to Data Science, which introduces SQL and has you carry out some basic queries. However, I wanted to go further and learn SQL in more detail. I found two really nice resources for learning SQL: sqlzoo and w3schools. I can really recommend them!
Here are a few important basic commands:
SELECT parameter1, parameter2
FROM table1                          -- FROM takes a table name
WHERE parameter1 > 10 AND parameter2 < 10
  AND parameter3 IN ('X', 'Y', 'Z')  -- match against a list of values
  AND parameter4 =
    (SELECT AVG(parameter4)          -- subquery: average of parameter4 over the filtered rows
     FROM table1
     WHERE parameter5 > 10)
GROUP BY parameter1, parameter2;     -- group the results by the selected columns
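To try queries like this from Python, the built-in sqlite3 module is one easy option. Below is a minimal sketch of my own (the players table and its columns are made up purely for illustration, not something from the course or tutorials):

import sqlite3

# a throwaway in-memory database, purely for illustration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE players (name TEXT, game TEXT, score INTEGER)")
cur.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("John", "Arrow Stars", 12), ("Sarah", "Arrow Stars", 25),
     ("Peter", "I am Pilot", 7)],
)

# average score per game, counting only scores above 10
cur.execute("""
    SELECT game, AVG(score)
    FROM players
    WHERE score > 10
    GROUP BY game
""")
print(cur.fetchall())  # [('Arrow Stars', 18.5)]

conn.close()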
Thursday, 12 June 2014
Simple Social Network Project
After completing the course part of Udacity's Intro to Computer Science, I moved on to the final project, which was to create and then (in my case)* analyse a social network centred around users, who they are connected to, and which games they like to play.
The starting point was a string of sentences, e.g. "John is connected to Peter, Paul and Sarah. John likes to play Arrow Stars, Super Captain Jim, and I am Pilot".
The design brief was to create a data structure to hold information about the users, together with a set of procedures, for example to add a new user, to check whether two users are connected, and to get the common connections of two users.
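As a rough illustration of what such a structure and procedures could look like (this is just my own sketch using a plain dictionary, not the actual project code):

# each user maps to their connections and the games they like (illustrative only)
network = {}

def add_user(network, user, connections=None, games=None):
    network[user] = {"connections": connections or [], "games": games or []}

def is_connected(network, user_a, user_b):
    return user_b in network[user_a]["connections"]

def common_connections(network, user_a, user_b):
    return [u for u in network[user_a]["connections"]
            if u in network[user_b]["connections"]]

add_user(network, "John", ["Peter", "Paul", "Sarah"],
         ["Arrow Stars", "Super Captain Jim", "I am Pilot"])
add_user(network, "Peter", ["John", "Sarah"])
print(is_connected(network, "John", "Peter"))        # True
print(common_connections(network, "John", "Peter"))  # ['Sarah']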
In the end, I chose to analyse the network: I created a histogram of the most popular games and a histogram of the most popular users. In addition, using the Python package networkx, I visualised the network between the n most popular users, highlighting the m most popular users (where n > m), which is shown below (bold lines indicate that both users have "liked" each other).
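The networkx part can be sketched roughly like this (my own simplified illustration with a tiny made-up graph, using degree as the popularity measure rather than the "likes" counting from the real project):

import networkx as nx
import matplotlib.pyplot as plt

# build a small example graph of user connections (made up for illustration)
G = nx.Graph()
G.add_edges_from([("John", "Peter"), ("John", "Paul"),
                  ("John", "Sarah"), ("Peter", "Sarah")])

# rank users by degree (number of connections) as a simple popularity measure
popular = sorted(G.nodes(), key=G.degree, reverse=True)
top_m = set(popular[:2])

# highlight the m most popular users with a different node colour
colours = ["tomato" if user in top_m else "lightblue" for user in G.nodes()]
nx.draw(G, node_color=colours, with_labels=True)
plt.show()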
* One had to create a procedure of one's own that either did something to the network or analysed it. I chose to analyse the network, but, for me, this also included adding to the network.
Monday, 9 June 2014
Git and GitHub
This week I started the Johns Hopkins University Data Science Specialization via Coursera. The specialization has nine modules, plus a Capstone Project.
The first course is The Data Scientist's Toolbox, which introduces the version control system Git and the online repository hosting site GitHub. If you are going to develop programs, especially if you are going to be collaborating with others, version control is a must - how else will you keep track of everything you have done and everything everyone else is doing?
Here are some common commands you will need on the command line:
git init
initialize a Git repository in the current directory.
git remote add origin https://github.com/username/repository.git
associate your local repository with a repository on GitHub.
git add -A
stage all new, modified and deleted files for the next commit.
git commit -m "message describing what you have done for this version"
commit the staged changes to your (local) repository.
git push
push the committed version of your (local) repository to the associated remote repository.
As a first commit to my GitHub account, I have added my simple social network, which was my final project from the Udacity course Intro to Computer Science. I'll discuss this in my next post.
Wednesday, 4 June 2014
So, where is the beginning?
Everyone has a different starting point.
In my case, it's admittedly not zero - I've been working with experimental data and theoretical models for a while, mainly using MATLAB and Mathematica. However, the field of Data Science (as it is generally talked about) has little place* for MATLAB. In fact, data science is a whole other beast in terms of languages and technology.
One of those seemingly important languages is Python. So, to begin my journey, I have just completed the Intro to Computer Science course from Udacity. I found this really useful because I've never had any formal computer science training - my programming to date has been learnt as and when I needed to do something, rather than by learning the fundamental concepts. The course is based solely on Python, so I can really recommend it as a way to learn the fundamental parts of the language, plus general programming concepts, e.g. if statements, while loops, for loops and recursion, as well as how to build a search engine. In fact, the course revolves around understanding and building a search engine, so, in the end, you really have built a basic search engine, using the ranking algorithm Google used when it first appeared all those years ago.
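To give a flavour of that ranking idea, here is a deliberately simplified sketch of my own (not code from the course): each page's rank is repeatedly redistributed along its outgoing links, with a damping factor.

# simplified PageRank-style ranking: rank flows along links, damped by d
# (assumes every page has at least one outgoing link)
def compute_ranks(links, d=0.85, iterations=50):
    n = len(links)
    ranks = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            inbound = sum(ranks[other] / len(links[other])
                          for other in links if page in links[other])
            new_ranks[page] = (1 - d) / n + d * inbound
        ranks = new_ranks
    return ranks

# a tiny made-up link graph: page -> pages it links to
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
print(compute_ranks(links))  # C, with two inbound links, gets the highest rank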
Then, the final part of the Udacity course was a Final Project, which was to build and analyse a social network from strings of user data. I'll post about my final project a little later.
* I must say, though, I definitely subscribe to the idea that you should choose the best tool for the job (which could be MATLAB!). First the problem, then the solution through the best technology you can muster.
Saturday, 31 May 2014
data science, from the beginning
So, here it is. The beginning.
Right now I'm an experimental quantum physicist, but I just can't shake the feeling that it's time for something new.
I've been looking around at various different industries and, having read a few articles about 'data science' and 'big data', I went along to DataScienceDay in Berlin. I was completely sold: data, modelling, machine learning, predictive analytics - using data analytics to solve business problems and social issues. This is what I want to do!
So, I'm taking the plunge. However, first, I have a lot of new things to learn in the world of data science. This blog is going to be a diary of what I learn, as a reference for me to come back to (and, if anyone is reading this, I hope it helps you too).
So, here goes...