PTP 5. Data Science Stream / A. Introduction to data science
PTP Outcomes:
Learn about the history of data science: where it has come from and where it is going
Get an overview of the main principles, methods and applications of data science
Understand how data scientists develop insights from raw data, and how they present those insights to others
Become familiar with the most important tools, languages and platforms being used by data scientists
Learn about the fundamentals of data engineering
Some thoughts before I begin
My Personalised Training Plan has been set after reasonably extensive research online, taking account of topics covered by various online and offline schools/courses, the advice of Analytics Vidhya and Ajit Jaokar on how to move into a data science career, as well as a review of key skills required by data scientists based on current jobs advertised online (e.g. on LinkedIn).
I’m grateful for the opportunity to study the Data Science specialisation offered by Johns Hopkins University (link here) and the Python for Everybody specialisation by the University of Michigan (link here), both available on Coursera.org. Studying these two courses will, I believe, provide good coverage of the main items in the Programming and Data Science streams of my PTP. I will supplement these courses with further reading/research/study where applicable.
Thoughts and notes as I go along
Leek/JHU refers to data science as a science, i.e. it should be conducted in a scientific way and produce results which are ‘reproducible’. I agree there needs to be this aspect of scientific rigour. But my own take is that we have an opportunity to move things (life, business, the world in general) in the direction of art and science together, working in synergy. Surely there is an art/creative element to data science too? Creativity in the framing of the problem, in the gathering and sorting of the data. Then, above all, in the developing of insights from that data, and in the presenting of those insights to others (through story-telling, visualisation) in a way which moves them to take action. This sounds like more than just science to me.
The Leek/JHU course aims to (1) teach us how to do data science scientifically, meaning we get results which are ‘reproducible’ by others (i.e. objective results); (2) show us the full life cycle from gathering data to presenting it; (3) ensure we practise the skills to embed our learning through doing [and also, one assumes, so we become conversant enough in those skills/techniques that we can replicate them across multiple problems and in various industries/settings].
These are the key outcomes I’ve set in the Data Science stream of my PTP so I believe I’ve made an excellent choice enrolling on this course. Above all, studying this course will sort out for me quite quickly whether or not I really am suited (intellectually, skills-wise, etc.) to a career shift to Data Science/Analytics. My motivation at this stage in the journey is sky-high (isn’t it always when starting out on something new?) but I do believe it’s not just the novelty factor at work here. We’ll see!
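To make the idea of ‘reproducible’ results a little more concrete for myself, here is a minimal sketch in Python. It is my own illustration, not something taken from the Leek/JHU course: the data, the function name `sample_mean` and the seed value are all made up. The point is only that if the random step in an analysis is pinned down with a seed, anyone re-running the same code on the same data gets the same answer.

```python
import random
import statistics


def sample_mean(data, sample_size, seed):
    """Draw a random sample from data and return its mean.

    The seed fixes which items are drawn, so the result can be repeated.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)  # local generator, global random state is untouched
    sample = rng.sample(data, sample_size)
    return statistics.mean(sample)


# Made-up data: the numbers 1..100 stand in for any raw measurements.
data = list(range(1, 101))

first_run = sample_mean(data, sample_size=10, seed=42)
second_run = sample_mean(data, sample_size=10, seed=42)

# Same data, same code, same seed: the two runs agree.
print(first_run == second_run)
```

Without the fixed seed, each run would draw a different sample and could report a slightly different mean, which is exactly the kind of result someone else could not check.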
For the Leek/JHU course I need to set up an account on GitHub, which is a platform for working collaboratively with others on developing, sharing and editing programming code. It’s free to sign up and use for open-source/public projects, with paid options available for private developments (currently starts at $7/month, so not onerous). I’ll set up a public/free account initially; later on, as I develop further into the Data Science/programming fields, I can convert to a private account as necessary.
One important point here (how I’m feeling right now):
This is my first time entering this world of programming/etc. (apart from way back at school or during maths undergrad – they’re both sooo long ago I’m pretty much discounting them right now) and I feel a bit apprehensive. I feel out of place in an environment I perceive to be populated by geeks and hard-core coding. (Thoughts running through my mind: Am I too old? Not intelligent enough? Too female?).
Being a complete novice is intimidating. But I am determined to crack on as this is my future I’m looking to improve by following this training plan. And after all, I cannot know at all at this stage how this new venture is going to turn out (succeed or fail). I can only commit myself to do my best and just see what happens. It’s with this approach that I’ll proceed.
Key thought coming up for me initially:
Data science is the “art of the possible”: it’s primarily a problem-solving discipline (wrapped up in those fancy, modern-day buzzwords). What each of us learns as we progress our careers (in whatever field) is that we are being paid not to do task x or task y, but rather to solve problems for people. And this is becoming the key skill in the 21st century career: to frame problems and to solve them (and not when the problem is easy, but specifically when the problem is hard). If I progress this study course with one eye on the hard skills and techniques of data science/analytics and the other eye firmly on ‘getting better at solving problems’ then it will be a valuable exercise regardless of whether I ‘succeed’ or ‘fail’ at becoming a data scientist in my late 40s.
Some take-aways from the Leek/JHU introductory video:
Data is everywhere and getting easier to gather and store [due specifically to the internet/social media, but as Ajit Jaokar also points out, increasingly due to IoT ecosystems – with more devices coming online/becoming connected, that data pool will also rise exponentially]
Data science came very much out of the statistical sciences. As Jeff Leek points out, data science is about helping us to solve problems, and often where there is a problem it’s because there is uncertainty. Statistics is very good at helping us draw out conclusions in those environments of uncertainty, so it inevitably plays a very large part in data science.
So, why now? Exponential growth in data (as highlighted above) means now is a great time to get involved in data. There’s a growing demand for these skills. There’s also a growing number of prizes/competitions for using data to solve big real-world problems which can positively affect humanity going forward. So this needn’t just be about earning more money for the shareholders, but can be useful for helping to make a better world for all of us.
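The point above about statistics helping us draw conclusions under uncertainty can be sketched in a few lines of Python. This is my own illustration rather than anything from the course: the observations are invented, `bootstrap_mean_interval` is a hypothetical helper, and the bootstrap is just one common way of putting a rough range around an estimate. The idea is to resample the data many times and see how much the mean moves around.

```python
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=2000, level=0.95, seed=0):
    """Return an approximate percentile interval for the mean of values.

    Each resample draws len(values) items with replacement; the spread of
    the resampled means indicates how uncertain the observed mean is.
    (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))
        means.append(statistics.mean(resample))
    means.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Invented measurements, standing in for any small set of raw data.
observations = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7]

low, high = bootstrap_mean_interval(observations)
print(f"observed mean: {statistics.mean(observations):.2f}")
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")
```

Instead of reporting a single number as if it were certain, the analysis reports a range, which is a more honest summary when there are only a handful of data points.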
On programming languages:
The Leek/JHU course uses R, although both R and Python are seen as industry standards within data science. Both are available for free and have large and growing libraries of packages/algorithms available to perform commonly required functions. Python has been described by LifeHacker.com as ‘fun and easy to learn’. I first came across R on the Folio Society Devotees group on LibraryThing.com and fell in love with it there (promising myself one day I’d learn how to use it). I’m really looking forward to learning both.
Key skills of data scientists (as defined by Leek/JHU course):
Hacker mind-set – able to go out and find the answer yourself; knows where to go to get the answers; not fazed by dealing with new data or new algorithms/libraries
Curiosity – willingness to ask new questions and see things new ways
I would add one more to the list:
Willingness to drink from the ‘poison chalice’ that is analytics. I’m thinking particularly of the previous company I worked at, where the finance head said data and analytics was the responsibility of marketing and systems; the systems head said it was the responsibility of marketing and finance; and (yep, you guessed it) the head of marketing said it was down to finance and systems to sort out. All the while I was there saying, “guess what, the whole company needs to get a grip on the data and analytics because that’s what’s going to help us drive the business forward in the new economy.” Interesting that I am now leading the new Management Information project (including the data and analytics side) at that same company. Poison chalice or not, herein lies the future!
References
Jeff Leek/JHU: Data Science specialisation by Jeff Leek, Johns Hopkins University on Coursera