As I mentioned in my comments on Coursera’s Executive Data Science specialisation, I have looked at a lot of online data science and statistics courses to find useful training material, to understand the skills of people who have done these courses, and to learn a bit myself.
One of the best-known sets of courses is Coursera’s Data Science Specialisation, created by Johns Hopkins University. It is a ten-course program that covers the data science process from data collection to the production of data science products. It focuses on implementing the data science process in R.
This specialisation is a signal that someone is familiar with data analysis in R – and the units are not bad if learning R is your goal. But neither this specialisation nor any other course of similar length that I have reviewed to date offers a shortcut to the statistical knowledge necessary for good data science. A few university-length units seem to be the minimum, and even they need to be paired with experience and self-directed study (not to mention some skepticism about what we can determine).
The specialisation assessments are such that you can often pass the courses without understanding what you have been taught. Points for some courses are awarded for “effort” (see Statistical Inference below). The multiple choice quizzes are capped at three attempts per eight hours, which in effect allows unlimited attempts. I don’t have a great deal of faith in university assessment processes either – particularly in Australia where no-one wants to disrupt the flood of fees from international students by failing someone – but the assessment in this specialisation requires even less knowledge or effort. It is not much of a signal of anything.
If you are wondering whether you should audit or pay for the specialisation, note that you can’t submit the assignments under the audit option. But the quizzes are basic, and you can find plenty of assignment submissions on GitHub or RPubs against which you can check your work.
Here are some notes on each course. I worked through them over a year or so, so the earlier courses may have been updated since (although a quick revisit suggests my comments still apply).
- The Data Scientist’s Toolbox: Little more than an exercise in installing R and git, together with an overview of the other courses in the specialisation. If you are familiar with R and git, skip.
- R Programming: In some ways the specialisation could have been called R Programming. This unit is one of the better of the ten, and gives a basic grounding in R.
- Getting and Cleaning Data: Not bad for getting a grasp of the various ways of extracting data into R, but watching video after video of imports of different formats makes for less-than-exciting viewing. The principles on tidy data are important – the unit is worth doing for this alone.
- Exploratory Data Analysis: Really a course in charting in R, but a decent one at that. There is some material on principal components analysis and clustering that will likely go over most people’s heads – too much material in too little time.
- Reproducible Research: The subject of this unit – literate (statistical) programming – is one of the more important subjects covered in the specialisation. However, this unit seemed cobbled together – lectures repeated points and didn’t seem to follow a logical structure. The last lecture is a conference video (albeit one worth watching). Compared with the (outstanding) production effort that has gone into the Applied Data Science with Python specialisation, this unit fares poorly.
- Statistical Inference: Likely too basic for someone with a decent stats background, but confusing for someone without. This unit drives home that it isn’t possible to build a stats background in a couple of hours a week over four weeks. The peer assessment caters to this through criteria such as “Here’s your opportunity to give this project +1 for effort.”, with the option “Yes, this was a nice attempt (regardless of correctness)”.
- Regression Models: As per Statistical Inference, but possibly even more confusing for those without a stats background.
- Practical Machine Learning: Not a bad course for getting across implementing a few machine learning models in R, but there are better background courses. Start with Andrew Ng’s Machine Learning, and then work through Stanford’s Statistical Learning (which also has great R materials). Then return to this unit for a slightly different perspective. As with many of the other specialisation units, it is pitched at too high a level for someone with no background. For instance, there is no point where they actually describe what machine learning is.
- Developing Data Products: This course is quite good, covering some of the major publishing tools, such as Shiny, R Markdown and Plotly (although skip the videos on Swirl). The strength of this specialisation is training in R, and that is what this unit focuses on.
- Data Science Capstone: This course can be best thought of as a commitment device that will force you to learn a certain amount about natural language processing in R (the topic of the project). You are given a task with a set of milestones, and you’re left to figure it out for yourself. Unless you already know something about natural language processing, you will have to review other courses and materials and spend a lot of time on the discussion boards to get yourself across the line. Skip it and do a natural language processing course such as Coursera’s Applied Text Mining in Python (although this assumes a fair bit of skill in Python). Besides, you can only access the capstone if you have paid for and completed the other nine units in the specialisation.