# Amchi Mumbai Vs Dilwalon ki Delhi

Whose Dream will come true?

Since its beginning in 2008, the IPL has attracted viewership from all around the world. Immense uncertainty, last-moment nail-biters and super overs have compelled fans to watch the matches. In a very short period, the IPL has become the highest revenue-generating cricket league.

Analytics has been a part of the sports industry for a long time. Data scientists accompany teams to give direct inputs to the coaches in specific departments, deciphering an opponent's bowler/batsman weaknesses and building match-winning strategies against key players.

During the matches, you must also have come across several instances where a particular team's chance of winning is predicted through real-time online polls.

In this blog, we will be:

- **Predicting the score** for the team batting first
- **Predicting the winner of the finals** — right after the toss, post completion of the first innings, and after completion of the powerplay in the second innings

**LEARNING MORE ABOUT THE DATA SCIENCE CYCLE**

The data science cycle involves a series of steps that ultimately leads to the ‘perfect’ information insight that a data analyst strives to get. So, let’s decipher each and every step and see how they have been practically applied in visualising our beloved IPL’s data set.

**1. Business understanding**:

Did you expect a direct jump into data analysis and stuff? Well, hold your horses, it is not as easy and quick as it seems to be.

Before you directly dive into the numbers and start analysing them, knowing and understanding the business/project objective is of utmost importance. It is critical to comprehend the problem that you are trying to solve. But the question is, how to really go about it? According to Microsoft Azure’s blog, we typically use data science to answer five types of questions:

- How much or how many? (regression)
- Which category? (classification)
- Which group? (clustering)
- Is this weird? (anomaly detection)
- Which option should be taken? (recommendation)

The variables that need to be predicted should be identified correctly. Then the focus should shift to developing a detailed problem statement and analysing its effect on the targeted client/customer.

In our case, the objective is to analyze the past 12 years of IPL data in order to attain some insights for the current season of the IPL. Yes, you read it right: we will be using the past 12 years of IPL data to draw inferences about season 13.

We will also analyze the data to determine the highest run scorer, the highest wicket taker, the highest boundary-scoring team at a particular stadium, and the probability of a team winning a match if it wins the toss.

**2. Data mining**:

Just as typical mining involves ‘digging out’ valuable material from a source, data mining works along similar lines.

Having set up the detailed problem statement, it is now time to gather data from various viable sources. But in this day and age of ‘information overload’, how do you find the right data? That too is a mammoth task, isn’t it? But don’t worry, we have your back. The following questions need to be answered while collecting data:

- What data do I need for my project?
- Where does it live?
- How can I obtain it?
- What is the most efficient way to store and access all of it?

Ask yourself these questions and mine what you need!

With respect to the data set we had, we used tools like Pivot Tables to determine which part of the data would be of use in solving our problem statement. Apart from this, the technique of aggregation was used to collate data points belonging to the same category. This step really helps in getting an overall idea of the data set.
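As a rough illustration, the same pivot-table style summary can be built in Python with pandas. The team names, venues and column names below are invented for the example and are not the actual dataset schema:

```python
import pandas as pd

# Hypothetical ball-by-ball rows; column names are illustrative only.
balls = pd.DataFrame({
    "match_id":     [1, 1, 1, 2, 2, 2],
    "batting_team": ["MI", "MI", "DC", "DC", "MI", "MI"],
    "venue":        ["Wankhede", "Wankhede", "Wankhede", "Kotla", "Kotla", "Kotla"],
    "runs":         [4, 1, 6, 0, 2, 4],
})

# Pivot: total runs per team per venue, mirroring an Excel Pivot Table
# with a sum aggregation.
pivot = pd.pivot_table(balls, values="runs", index="batting_team",
                       columns="venue", aggfunc="sum", fill_value=0)
print(pivot)
```

The same idea extends to counts of wickets, boundaries, or any other aggregate you want to slice by team and venue.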

**3. Data cleaning**:

If you were thinking that mining out data is the end of the game, well it is not, more is yet to come. In fact, now comes the most time-consuming step of all- data cleaning. According to interviews with data scientists, this process (also referred to as ‘data janitor work’) can often take 50 to 80 percent of their time.

Well, is it safe to say that the Data Science Cycle suffers from Obsessive Compulsive Disorder (OCD), just like our mothers do? Anyone? No? Let us explain.

It’s because data cleaning is a very important step. Data in its raw form cannot be used directly: it comes riddled with problems like missing values and outliers, which totally skew the judgement when included in decision making.

In our analysis of the IPL dataset, mean imputation has been used to fill in the missing values. When a particular value is missing from a column, the mean of the remaining values of that column is taken and used as a filler for the vacant position.
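A minimal sketch of mean imputation in pandas, using made-up placeholder scores rather than the real dataset:

```python
import pandas as pd

# Toy column with one missing first-innings total (values are invented).
scores = pd.Series([160, None, 180, 200])

# Mean imputation: fill the gap with the mean of the observed values.
filled = scores.fillna(scores.mean())
print(filled.tolist())  # mean of 160, 180, 200 is 180
```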

Further, to remove the outliers, the InterQuartile Range (IQR = Quartile 3 − Quartile 1) is multiplied by 1.5; all values above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR are removed from the data, as they are fit to be called outliers.
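The IQR rule can be sketched in a few lines of Python; the scores below are invented just to show one obvious outlier being dropped:

```python
import pandas as pd

# Invented innings totals with one obvious outlier (320).
scores = pd.Series([150, 160, 165, 170, 175, 180, 320])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr = q3 - q1

# Keep only values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = scores[(scores >= lower) & (scores <= upper)]
print(clean.tolist())  # 320 is removed as an outlier
```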

**4. Data exploration and feature engineering:**

Once the data is clean, it is ready for use. This is the stage where the analysis of the data starts. The patterns, biases and relations between the variables are deciphered and understood. It can involve creating graphs, studying the outliers and many other things. The hypotheses of the problem statement are established and are rejected or not rejected based on our data exploration.

The data by now is totally fit for usage. It is complete as it has no missing values and is capable of providing right and meaningful insights.

Here, we separate the required data from the pool of cleaned data so that it can be analysed and the required insights obtained; this is called data extraction.

**Data Extraction in Excel**

When there is a lot of data, using Find (Ctrl+F) to locate every cell containing a certain value leaves the results scattered all over the place, and manually checking each column can be very tedious. If you want to extract all the finds to a target worksheet, preserving the values and the column format, Excel macros are the tool for the job.

Say you have some data in a range: we can write macros to pull the rows we need out of it. If you need more background on macros, https://chandoo.org/wp/find-and-extract-results/ is a nice reference point.

By creating macros in Excel using VBA, data extraction tasks can be automated. To tabulate each team's performance, information was derived from the ball-by-ball data for each innings in a match.

We divided each innings into phases — powerplay (overs 1–6), middle (overs 7–15) and death (overs 16–20). In addition, the runs scored till the fall of the 2nd wicket, the number of extras and the number of boundaries were calculated for each innings. Using a loop over all the IPL matches of the last 12 seasons, every innings’ data was compiled.
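For readers who prefer Python over VBA, the same phase-wise aggregation can be sketched with pandas. The ball-by-ball rows below are made up for illustration; our actual macros operate on the Excel sheets instead:

```python
import pandas as pd

# Hypothetical ball-by-ball rows for one innings; 'over' is 1-based.
balls = pd.DataFrame({
    "over":   [1, 3, 6, 8, 12, 15, 17, 20],
    "runs":   [4, 1, 6, 2, 4, 1, 6, 6],
    "wicket": [0, 1, 0, 0, 1, 0, 1, 0],
})

# Same phase split as the macros: powerplay 1-6, middle 7-15, death 16-20.
phase = pd.cut(balls["over"], bins=[0, 6, 15, 20],
               labels=["powerplay", "middle", "death"])
summary = balls.groupby(phase, observed=True)[["runs", "wicket"]].sum()
print(summary)
```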

**Macros Used:**

1) Powerplay

Total number of runs scored and the number of wickets lost during overs 1–6 of an innings.

2) Middle overs (7-15)

Total number of runs scored and the number of wickets lost during overs 7–15 of an innings.

3) Death overs (16-20)

Total number of runs scored and the number of wickets lost during overs 16–20 of an innings.

4) Runs scored till the fall of 2nd wicket

Total number of runs scored till the fall of the 2nd wicket in an innings.

5) Extras

Total number of runs accumulated through extras in an innings.

6) Boundaries

Total number of boundaries scored in an innings.

For instance, the macro below is used to find the total runs scored till the 2nd wicket falls.

Here, the column runs_top3 holds the result of the macro.
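A pandas equivalent of the runs_top3 logic, using invented ball-by-ball rows, could look like this (the real calculation runs as a VBA macro over the Excel sheets):

```python
import pandas as pd

# Hypothetical ball-by-ball rows for one innings, in ball order.
balls = pd.DataFrame({
    "runs":   [4, 1, 0, 6, 1, 2, 4],
    "wicket": [0, 0, 1, 0, 1, 0, 0],
})

# Running wicket count; keep every ball bowled up to and including
# the delivery on which the 2nd wicket fell.
fallen = balls["wicket"].cumsum()
runs_top3 = balls.loc[fallen.shift(fill_value=0) < 2, "runs"].sum()
print(runs_top3)  # 4 + 1 + 0 + 6 + 1 = 12
```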

Similarly, we can find the number of runs and the number of wickets in the powerplay using the macro below.

Now let’s go a step ahead…

Data analytics is not only about studying the existing data; there is more to it than that.

**5. Predictive modelling:**

As the name suggests, predictive modelling attempts to answer the question, “what might possibly happen in the future?”, using a mathematical process. This is the stage where machine learning finally comes into the frame. It is basically the process of using already-known results to create, process and validate a model that can be used to forecast future outcomes.

**Models for IPL Prediction**

**Predicting the Score for the team batting first**

Artificial neural networks (ANNs) predict an output value as a function of the input parameters. To predict the score of the team batting first, we used an **ANN regression model** that takes input parameters such as the chasing team, the team batting first, the venue, wickets in the powerplay, and runs in the powerplay. The output variable is the total number of runs scored. We are currently getting a mean absolute error of roughly 22, which means the predicted score carries an error of about ±22 runs.
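As a hedged illustration of this kind of model (not our actual training pipeline), here is a tiny sketch using sklearn's MLPRegressor, a neural-network regressor, with invented teams, venues and scores:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up training set; the real features come from the compiled
# innings data described above.
X = pd.DataFrame({
    "batting_team": ["MI", "DC", "MI", "DC"],
    "chasing_team": ["DC", "MI", "DC", "MI"],
    "venue":        ["Wankhede", "Kotla", "Kotla", "Wankhede"],
    "pp_runs":      [55, 40, 62, 35],
    "pp_wickets":   [1, 2, 0, 3],
})
y = [185, 150, 200, 140]  # first-innings totals (invented)

# One-hot encode the categorical inputs, pass numeric ones through,
# then fit a small neural-network regressor.
pre = ColumnTransformer(
    [("cat", OneHotEncoder(), ["batting_team", "chasing_team", "venue"])],
    remainder="passthrough",
)
model = Pipeline([("pre", pre),
                  ("ann", MLPRegressor(hidden_layer_sizes=(16,),
                                       max_iter=2000, random_state=0))])
model.fit(X, y)
preds = model.predict(X)
print(mean_absolute_error(y, preds))
```

On the real dataset, the same pattern applies with many more rows and a proper train/test split before reporting the mean absolute error.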

**Predicting the Match winner after the toss, first innings and powerplay of second innings**

Here we use logistic regression to predict the winning team. The input variables are the team batting first, the chasing team, the venue, the team winning the toss, runs scored by the batting team in the powerplay, wickets in the powerplay, total runs and total wickets. The output variable is the winning team. We tried several classification models (Logistic Regression from sklearn, a Decision Tree classifier, an SVM and a Random Forest classifier) and picked the best model based on accuracy.

Here we are getting the best accuracy of 0.900 with sklearn’s logistic regression model.
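A sketch of that model comparison with sklearn, using synthetic features and labels in place of the real match data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the encoded match features
# (teams, venue, toss winner, powerplay runs/wickets, totals).
rng = np.random.default_rng(0)
X = rng.random((80, 6))
y = (X[:, 0] + 0.3 * X[:, 1] > 0.6).astype(int)  # synthetic "winner" label

# Fit each candidate classifier and score it, then keep the best.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(random_state=0),
    "svm":    SVC(),
    "forest": RandomForestClassifier(random_state=0),
}
scores = {name: m.fit(X, y).score(X, y) for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

In practice the accuracies should come from held-out data (a train/test split or cross-validation), since training-set accuracy flatters flexible models like trees and forests.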

**6. Data visualization:**

Phew. Finally comes the most interesting part, the part where huge chunks of data are converted into easily comprehensible pieces of information, just for people like you and me.

But is it as simple as it looks? Perhaps not. Data visualization is a tricky field, because it involves not only statistics and mathematics, but also psychology, communication and art. Visual elements such as maps, charts and graphs make it easy for us to understand trends, outliers and patterns in data.

Here we used the data visualization software **Tableau**, which makes it possible to depict all the insights drawn in the form of graphs. This step makes the insights easier to understand and visually appealing as well. Here is a snippet of one of the graphs we created using Tableau.

# Summing it up

Analytics is a huge part of sports today; do check out Moneyball and Inside Edge, available on Netflix and Amazon Prime Video respectively, to see how far the applications reach. Follow our social media handles on Facebook, Instagram and LinkedIn to get real-time predictions of the match. Stay tuned for more interesting blogs from us in the future! :)

#DataAnalytics #DataScience #Prediction #MachineLearning #Cricket #IPL2020 #Dream11