I taught myself data science through online courses, books, and YouTube videos. After almost a year of self-learning, I am now working as a data scientist.
Somewhere along this journey, I’d often find myself lost in self-doubt.
I read countless articles emphasizing that the only way to break into data science was to gain a strong grasp of statistics, mathematics, linear algebra, and predictive modelling.
While this is true to some extent, it has led to an assumption that only a data science Master’s graduate can become a data scientist.
How much math is required for a data scientist?
When I was teaching myself data science, I’d often hit a wall and find myself unsure of how to proceed. I felt like my mathematical background wasn’t strong enough.
To make the transition into data science, I decided to take online courses in calculus, statistics, and linear algebra.
I already had a mathematical background, since I took mathematics and further math during my A-levels.
I spent around two months practicing differentiation, integration and matrix manipulation. Then, I took multiple courses on statistics and probability.
It took me around 6 months to brush up on all the recommended math prerequisites for a data scientist.
Did this help my transition into data science?
Yes and no.
Most calculus and linear algebra concepts I learnt had no direct application to the model building process.
My day-to-day job doesn’t require me to know how to compute a Taylor series without a calculator.
However, I don’t regret spending time on these courses. My problem-solving skills really improved as I took these math courses and did homework every day.
Out of all the math courses I took, I found statistics to have the biggest direct impact on the work I was doing.
I learnt about the different sampling techniques, the different types of distributions and how they can be applied to normalize large datasets, hypothesis testing, and feature selection.
Having a solid understanding of how the algorithms work really improved my model-building process.
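To make hypothesis testing a little more concrete, here is a minimal sketch of a permutation test for a difference in means, written with only the Python standard library. The function name `permutation_test` and the two small samples are made up for illustration; they are not from my actual work, and a real project would more likely use a statistics library.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign values to the two groups
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Made-up illustrative samples (hypothetical metric for two groups).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
variant = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2, 12.6, 13.3]

p_value = permutation_test(control, variant)
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value here suggests the gap between the two groups is unlikely to be due to chance alone. The idea is the same one the statistics courses teach, just spelled out as a simulation instead of a formula.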
To take some statistics courses, however, you need to be familiar with some mathematical notation. This is where a basic (pre-university, or maybe high school level) understanding of calculus comes in.
You will need to know basic concepts like summation and differentiation to follow along with the statistics courses, but too much depth isn’t required.
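As a rough illustration of the level of notation involved, the sketch below writes the sample mean and sample variance as plain sums and approximates a derivative numerically. The function names and the data are my own placeholders, chosen only to show how the symbols map to code.

```python
def sample_mean(xs):
    # x_bar = (1 / n) * sum of x_i for i = 1..n
    return sum(xs) / len(xs)


def sample_variance(xs):
    # s^2 = (1 / (n - 1)) * sum of (x_i - x_bar)^2 for i = 1..n
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


def derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers

print(sample_mean(data))
print(sample_variance(data))
print(derivative(lambda x: x ** 2, 3.0))  # close to 6, since d/dx x^2 = 2x
```

If you can read the comments above and see how they match the code, you have enough notation to get through most introductory statistics material.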
The good news is that you can ignore gatekeepers.
You don’t need to have a Master’s degree or a PhD in statistics to get a job in data science.
Everything you need to know can be self-taught.
In fact, many of my colleagues (including the head of my data science team) do not come from a math background. A lot of them have business degrees, and managed to teach themselves data science.
Going beyond math and statistics
More important than the math requirement, however, is the ability to implement and scale the algorithms you build.
Most of my work as a data scientist has been within a Jupyter Notebook. However, one of the datasets I had to work with a couple of days back had around 30 million rows. This meant that I couldn’t build the model locally.
I spent most of my time setting up an environment on AWS to build and train my model.
I had to change the code I was writing from plain Python to PySpark. PySpark is the Python API for Apache Spark, which splits the data into partitions and runs your code on them in parallel. This means that it can process large amounts of data quickly and allows you to work with big data.
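The sketch below is not PySpark. It is a standard-library illustration of the general split, process, and combine pattern that parallel frameworks rely on: break the data into partitions, compute a partial result on each, then merge the partial results. The helper names are hypothetical, and Python threads are used only to keep the example self-contained; they will not give the kind of speed-up a Spark cluster provides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_partitions):
    """Deal rows into n_partitions roughly equal chunks."""
    return [rows[i::n_partitions] for i in range(n_partitions)]


def partial_sum_count(partition):
    """Work done independently on a single partition."""
    return sum(partition), len(partition)


def parallel_mean(rows, n_partitions=4):
    """Compute the mean by combining per-partition (sum, count) pairs."""
    partitions = split_into_partitions(rows, n_partitions)
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


values = list(range(1, 1_000_001))  # stand-in for a large column of numbers
print(parallel_mean(values))
```

The key point is that each partition only needs its own slice of the data, so the work can be spread across many machines when a single laptop is not enough.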
All of this had to be done quickly, and I had to rewrite all my code for a different framework within a day.
As a data scientist, you need to have enough programming and SQL knowledge to be able to adapt and scale your model.
This is a skill that comes with practice, so I strongly suggest coding and solving problems for at least 3–4 hours every day.
Managing the end-to-end workflow
Even more important than the model building process is the ability to work on an end-to-end process.
This starts with first understanding the business requirement.
Then, you need to know the different data points required to solve the business problem.
To collect this data, you will need strong technical and programming skills. You will need to know how to work with APIs, and how to build web scrapers.
After collecting data, you need to clean it, preprocess it, and do some preliminary analysis before you start building the model.
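Here is a minimal sketch of what that cleaning step can look like, using only the standard library. The column names and the tiny inline dataset are invented for illustration; real data will be messier and will usually call for a dedicated data-frame library.

```python
import csv
import io
import statistics

# Made-up raw data with a missing value, a bad value, and a duplicate row.
RAW_CSV = """customer_id,age,monthly_spend
1,34,120.50
2,,89.99
3,41,not available
4,29,45.00
4,29,45.00
5,52,310.25
"""


def clean_rows(csv_text):
    """Parse rows, drop ones that fail type conversion, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            record = (
                int(row["customer_id"]),
                int(row["age"]),
                float(row["monthly_spend"]),
            )
        except ValueError:
            continue  # missing or malformed value: skip this row
        if record in seen:
            continue  # exact duplicate: skip
        seen.add(record)
        cleaned.append(record)
    return cleaned


rows = clean_rows(RAW_CSV)
spend = [r[2] for r in rows]

# Preliminary analysis: how much data survived, and what does it look like?
print(f"rows kept: {len(rows)}")
print(f"mean spend: {statistics.mean(spend):.2f}")
print(f"median spend: {statistics.median(spend):.2f}")
```

Dropping bad rows is the simplest policy, not necessarily the right one. Depending on the business problem, you may want to impute missing values or flag them for review instead.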
Also, not all problems require machine learning to solve.
If you are asked to build a model to solve a business problem and you realize that machine learning isn’t necessary in that context, say so and suggest a simpler approach.
The model-building part is probably the least time-consuming part of the entire data science project, and your work doesn’t end there.
Once you build the model, you will need to present it in a way that is comprehensible to someone from a non-technical background.
You will need to create use-cases and scenarios explaining what you have built, with easy-to-understand visualizations.
For all these reasons, a seemingly simple data science project can take a lot longer than expected to complete. Business requirements are constantly changing, so expect to be called back to make changes in the model or incorporate new features.
Mathematics and statistics are important in learning data science, but their role in the industry is overstated.
Competent programming skills and the ability to solve business problems with data will take you a lot further in the industry than simply being good at math.