One of the questions my students and trainees ask when they want to begin a career as a data scientist is: "Where do I start my data science project as a beginner?" There is no data science without data, so many think that gathering data is the very first step. But gathering data is a very challenging task if you are not clear about your data science objective or the business value. If you do not know clearly what problem to solve, you will not be able to collect the proper data, let alone create efficient models or solutions. Thus, I tell my students that "problem definition" is the very first step.
Problem definition means knowing the exact problem and being clear about the objectives to achieve. It also gives you an overview of the data needed to achieve those objectives, as well as the techniques for gathering reliable and useful data from the relevant sources.
Real-life data is dirty and raw: it can be very random and can come from multiple sources in different formats, sizes and features. Little information can be extracted from such a random heap of digital mess. It is a daunting task for data scientists and data engineers to convert this mess into understandable and valuable data to which suitable learning algorithms can be applied. When we apply an algorithm to a training data set, we create a trained model. The trained model by itself is not enough; it needs to be properly deployed and presented so that we can use it on test data sets.
Hence, the whole data science approach, after suitable data has been collected, is categorised into three phases: pre-processing, processing and post-processing.
The pre-processing phase is the phase of sensing, feeling, observing, understanding, visualising and, if necessary, modifying the data. Sensing and feeling the data means that the core essence of the data should be preserved while removing the inconsistencies and noise observed in it. Understanding means that all the features in the data should be clear and sensible. One should be able to visualise specific statistical and/or non-statistical properties and modify the results and data if needed. Data pre-processing steps can thus be roughly categorised as:
- Data understanding: get some domain knowledge, create a data dictionary, understand what each data feature refers to, learn from discussions or observation etc.
- Data cleaning: smooth noisy features, remove inconsistencies and outliers if any, balance imbalanced data features, fill in missing values if required etc.
- Data integration: integrate data from different sources, for example data from structured and unstructured databases, data warehouses, data cubes and files.
- Data transformation: transform the data set, for example by data normalisation (rescale the features to a common range of values, so that differences in scale between features do not distort the relationships between them) and data standardisation (convert the relevant features to have zero mean and unit variance, so that results from different features can be properly compared).
- Data modification (reduction/aggregation): remove unwanted or dependent features in a way that does not change the essence of the data and the problem, and/or add relevant features so that the problem is better described. For numerical data, algorithms such as principal component analysis (PCA) are used for dimensionality reduction.
- Data discretisation: split the whole range of numbers into intervals of equal size or equal frequency, merge intervals if there are too many of them etc.
- Data visualisation: perform exploratory data analysis and observe the statistical and/or non-statistical results. Make meaning from the observations.
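The transformation and discretisation steps in the list above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library so that the arithmetic stays visible; the function names and the `ages` column are hypothetical, and in a real project one would more likely rely on a data science library.

```python
from statistics import mean, pstdev


def min_max_normalise(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardise(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Give each value the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 25, 31, 40, 58]  # hypothetical feature column
normalised = min_max_normalise(ages)    # values between 0.0 and 1.0
standardised = standardise(ages)        # mean 0, variance 1
bins = equal_width_bins(ages, n_bins=4)  # -> [0, 0, 0, 1, 2, 3]
```

Normalisation keeps the shape of the distribution but fixes the range, whereas standardisation fixes the mean and spread, which is why the two are listed as separate transformations.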
- Supervised learning algorithms: Supervised learning algorithms are used when the available dataset has labels. In other words, they are used when we know the input parameters as well as the corresponding output values for those parameters. One of the challenges of supervised learning is thus obtaining such labelled data. Supervised learning problems are mainly of regression type or classification type. Regression algorithms are used to solve problems with continuous labels, such as forecasting the price of a house or predicting the buy rate of logistic goods. Some of the well-known regression algorithms are Linear Regression, Polynomial Regression, Ridge Regression, LASSO, Elastic Net, Least Angle Regression (LARS), Orthogonal Matching Pursuit (OMP), Bayesian Ridge Regression, Automatic Relevance Determination Regression, and robust estimators such as Random Sample Consensus (RANSAC), the Theil-Sen estimator and Huber Regression. On the other hand, if we have discrete labels, as in separating cats from dogs or recognising handwritten digits or letters, we use classification algorithms such as Logistic Regression, Decision Trees, Random Forest, Support Vector Classifiers, K Nearest Neighbours, Neural Networks etc.
- Unsupervised learning algorithms: Unsupervised learning algorithms are used when the dataset contains only the input features, without labels. We use clustering and dimensionality reduction approaches in unsupervised learning. Clustering includes algorithms such as K-Means clustering, Mean Shift clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Expectation Maximisation clustering using Gaussian mixture models and Agglomerative Hierarchical clustering, whereas dimensionality reduction includes algorithms such as principal component analysis (PCA), multi-dimensional scaling, Isomap and t-distributed stochastic neighbour embedding (t-SNE). Linear Discriminant Analysis is also used for dimensionality reduction, although it needs class labels and is therefore a supervised technique.
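To make the unsupervised case concrete, here is a minimal sketch of K-Means clustering on one-dimensional data, written in plain Python. The function name, the starting centroids and the sample points are assumptions chosen for illustration; real implementations handle many dimensions and choose starting centroids more carefully.

```python
def k_means_1d(points, centroids, n_iterations=10):
    """Cluster 1-D points with k-means, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabelled data with two visible groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])
# clusters -> [[1.0, 1.2, 0.8], [9.0, 9.5, 10.1]]
```

No labels are passed in anywhere: the grouping comes only from the distances between the input values, which is what distinguishes this from the supervised algorithms.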
- Occam's Razor: Among models that perform comparably well, the simplest model is the best choice.
- No Free Lunch Theorem: There is no single machine learning model that works best for all problems.
The data pre-processing phase provides us with "processed data". Then comes the processing phase, where we train our machine learning model on a suitable dataset. Before that, we divide the available processed dataset into training data, validation data and test data. We then select a machine learning model, with suitable algorithms, to be trained on the training dataset. This is called the training phase, and care should be taken not to use any dataset other than the training data. Also be aware that different learning algorithms are suitable for different types of data, so always take care to choose the most suitable algorithms for your data. There are different categorisations of machine learning approaches. One categorisation, based on the available input and output data, is supervised learning and unsupervised learning.
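The split into training, validation and test data, followed by training on the training data only, might look like the sketch below. The 60/20/20 proportions, the helper names and the synthetic house-price data are all assumptions made for the example, and the least-squares line stands in for whichever supervised algorithm suits your data.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * validation)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])


def fit_line(pairs):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical labelled data: floor area x and price y, with y = 3x + 5.
data = [(x, 3 * x + 5) for x in range(20)]
train_set, validation_set, test_set = split_dataset(data)

# Only the training set is used for fitting; the validation and test
# sets stay untouched until the model is evaluated.
slope, intercept = fit_line(train_set)
```

Shuffling before splitting matters here because the synthetic rows are ordered by `x`; without it the model would be trained on small houses and tested only on large ones.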
We should know how well the trained model works. We measure the performance of our model on the validation dataset, using metrics such as accuracy, the confusion matrix, area under the curve (AUC) etc. Sometimes there are parameters that are not learned during training and need to be tuned in order to improve the performance of the model. This is called hyperparameter tuning, and the validation dataset is also used for that purpose. We use cross-validation techniques in order to make full use of the available data for both training and validation. While choosing the model, one should always remember two principles: Occam's razor and the No Free Lunch theorem.
Hence, selecting a machine learning model requires trying out different models on the given dataset and selecting the simplest among the best-performing ones. Once everything (hyperparameters and other parameters) is set for the chosen model, it can be deployed on the given system, which is the post-processing phase.
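The validation steps described above (accuracy, confusion matrix, cross-validation and hyperparameter tuning) can be sketched together on a toy problem. The example below assumes a K Nearest Neighbours classifier on one-dimensional data, with the number of neighbours `k` as the hyperparameter; the data, including one deliberately noisy label, and all function names are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]


def accuracy(train, held_out, k):
    """Fraction of held-out rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in held_out)
    return hits / len(held_out)


def cross_validated_accuracy(rows, k, n_folds=4):
    """Average accuracy over n_folds; each fold is held out exactly once."""
    scores = []
    for fold in range(n_folds):
        held_out = rows[fold::n_folds]
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        scores.append(accuracy(train, held_out, k))
    return sum(scores) / n_folds


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


# Hypothetical (value, label) rows; (1.0, 1) is a deliberately noisy label.
rows = [(0.1, 0), (0.4, 0), (0.5, 0), (0.9, 0), (1.0, 1), (1.3, 0),
        (3.0, 1), (3.2, 1), (3.5, 1), (3.9, 1), (4.1, 1), (4.4, 1)]

# Hyperparameter tuning: score every candidate k by cross-validation.
# max() keeps the first of several equally good candidates, so ties
# go to the smallest k in the list.
candidate_ks = [1, 3, 5]
scores = {k: cross_validated_accuracy(rows, k) for k in candidate_ks}
best_k = max(candidate_ks, key=lambda k: scores[k])

# Confusion matrix for one fold, using the chosen k.
held_out = rows[0::4]
train = [r for i, r in enumerate(rows) if i % 4 != 0]
predicted = [knn_predict(train, x, best_k) for x, _ in held_out]
matrix = confusion_matrix([label for _, label in held_out], predicted)
```

With this data, `k = 1` is thrown off by the noisy point, while `k = 3` and `k = 5` score equally well, so the smaller of the two is kept. The test set plays no part in this selection and is reserved for the final check of the chosen model.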
The post-processing phase includes deploying our trained machine learning model. There are many approaches to deploying a model. One can use APIs or other available platforms. Nowadays many ML models are also deployed using Amazon SageMaker on Amazon Web Services (AWS).
However, real-time deployment of an ML model is still a challenging task because the processing time needs to be very short. Some ML models are converted to Predictive Model Markup Language (PMML), and the PMML model is then deployed. TensorFlow can now also be used from JavaScript (TensorFlow.js), so a model can be deployed for real-time use in JavaScript environments.
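Whatever platform is used, deployment usually comes down to two things: exporting the trained parameters in a portable form, and answering prediction requests with them. The sketch below illustrates that idea with JSON and a plain function. It is not PMML, SageMaker or any particular web framework; the function names, the request format and the parameter values are assumptions for illustration only.

```python
import json


def export_model(slope, intercept):
    """Serialise the trained parameters so another process can load them."""
    return json.dumps({"slope": slope, "intercept": intercept})


def handle_request(model_json, request_body):
    """Answer one prediction request, as an API endpoint might."""
    model = json.loads(model_json)
    features = json.loads(request_body)
    prediction = model["slope"] * features["x"] + model["intercept"]
    return json.dumps({"prediction": prediction})


# Hypothetical parameters of an already trained linear model.
model_json = export_model(slope=3.0, intercept=5.0)

# A client sends its input as JSON and receives the prediction as JSON.
response = handle_request(model_json, '{"x": 10}')
```

In a real service `handle_request` would sit behind an HTTP endpoint and the model would be loaded once at start-up rather than on every request, which is one of the things that keeps the processing time short.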
As a data scientist, be very clear about each of these three phases. A single person does not have to be perfect in all of them, but one has to understand them clearly. The rest are tools and techniques that one can learn after starting to work. Also, the best way to learn is by solving real-life problems. Start analysing, thinking about and finding problems from daily life that can be solved using data science. Never underestimate the power of problem definition.
Be aware of the following tools and platforms that are commonly used in data science. Nobody knows them all, but start learning by choosing any of them that you find easy and valuable.