Retail Company Data Analytics (Predicting Future Sales)

In this project, I analyzed a dataset containing sales data for a retail company using linear algebra concepts. I used linear algebra techniques to identify trends and patterns in the sales figures, then applied regression analysis to predict future sales from historical data.

DATA ACQUISITION AND PREPARATION

The retail store dataset was curated personally around these data attributes: Date, Sales Amount, Number of Products Sold, Marketing Expenditure, and Region. Sample datasets from Kaggle were used as a template for this dataset. The retail dataset is an Excel file that contains 8 columns and 500 rows. The data under the customer ID attribute are descriptive, specifically nominal, as are the data under the date, location, and product_category_preferences attributes. The data under the number of products sold and frequency attributes are discrete, while sales amount and marketing expenditure are continuous. Discrete and continuous data are both subsets of numerical data.

Below is an image showing the retail dataset in an xlsx format.

Image description

Below is an image showing how the retail dataset was loaded into a dataframe

Image description

Data Cleaning

After opening Google Colab, I imported the pandas library as pd. This enabled me to load the retail dataset into a dataframe, which I stored in the variable ‘data’.
Cleaning is a necessary step because it ensures that the data is of good quality and consistent for analysis. I first checked that the column names were correct using ‘data.columns’. Afterwards, I converted the sales column into a numeric column by removing the dollar sign; the dollar sign attached to the figures makes pandas read them as strings.
Before performing this cleaning, I used ‘data.info()’ to get an overview of the retail dataset. It showed that the dataset has 500 entries, listed the data type of each column, and reported a memory usage of 31.4+ KB.
Next, I converted the date column into datetime format. I then checked for missing values using data.isnull(), and the result indicated that there were none. As a safeguard, I still applied the fillna method so that any missing value would be replaced with the column mean.
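The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the tiny dataframe and its column names ("Date", "Sales", "Marketing Expenditure") are assumptions standing in for the Excel file, and one missing value is added on purpose so that the fillna step has something to do.

```python
import pandas as pd

# Hypothetical stand-in for the loaded Excel file; column names are assumptions.
data = pd.DataFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03"],
    "Sales": ["$49.90", "$12.50", "$87.35"],
    "Marketing Expenditure": [10.0, None, 30.0],
})

# Remove the dollar sign so the column can be parsed as a number.
data["Sales"] = pd.to_numeric(data["Sales"].str.replace("$", "", regex=False))

# Parse the date strings into datetime values.
data["Date"] = pd.to_datetime(data["Date"])

# Count missing values per column before filling them.
missing_before = data.isnull().sum()

# Fill gaps in the numeric columns with each column's mean.
numeric_cols = data.select_dtypes(include="number").columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

print(missing_before)
print(data.dtypes)
```

Restricting fillna to the numeric columns avoids trying to take the mean of the date column.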

Below is an image of the various data cleaning that was performed on the retail dataset

Image description

Image description

Exploratory Data Analysis
Moving forward, I performed an exploratory data analysis (EDA) on the retail dataset. For the first part of the EDA I used two methods: data.describe() and data.info(). The data.describe() method derives descriptive statistics from the dataset. The data.info() method reports information about the dataframe, including the index dtype and columns, non-null values, and memory usage (pandas, 2023). Before running them, I performed some data type conversions.
First, I converted the sales column into a numeric column by removing the dollar sign. Next, I converted the date column into datetime format.
The data.describe() method output the count, mean, standard deviation, min, max, and percentiles of each numeric column. Every column has a count of 500, indicating no missing values in these columns. For example, the mean of the Number of Products Sold column is 7.68 and the mean of the Sales column is 49.91. These and more descriptive stats are displayed in the images below.
The data.info() method showed the memory usage of the dataframe to be 31.4+ KB, along with the dtype of each column.
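As a rough illustration of what these two calls return, here is a small sketch on made-up numbers. The four-row dataframe and its column names are assumptions, so the statistics do not match the 500-row retail dataset.

```python
import pandas as pd

# Hypothetical miniature dataset; values and column names are assumptions.
data = pd.DataFrame({
    "Sales": [49.9, 12.5, 87.35, 50.0],
    "Number of Products Sold": [7, 3, 12, 8],
})

# describe() returns a dataframe with count, mean, std, min, quartiles, and max
# for every numeric column.
summary = data.describe()
print(summary.loc[["count", "mean", "std"]])

# info() prints the dtypes, non-null counts, and memory usage.
data.info()
```

A count equal to the number of rows in every column is what signals that no values are missing.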

Data Visualization

For the next part of the EDA, I plotted a histogram of each numeric column, as well as a pairplot to see the relationships between the numeric variables.
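A sketch of these two plots is shown here under stated assumptions. The data is randomly generated, the column names are hypothetical, and pandas' scatter_matrix is used as a stand-in for a pairplot, since the plotting library used in the notebook is not named in the text.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated stand-in data; column names are assumptions.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Sales": rng.normal(50, 29, 100),
    "Number of Products Sold": rng.integers(1, 15, 100),
    "Marketing Expenditure": rng.normal(100, 20, 100),
})

numeric = data.select_dtypes(include="number")

# One histogram per numeric column.
hist_axes = numeric.hist(bins=20, figsize=(9, 6))

# Grid of pairwise scatter plots, with histograms on the diagonal.
pair_axes = scatter_matrix(numeric, figsize=(9, 9), diagonal="hist")

plt.tight_layout()
```

In a notebook the figures render when the cell finishes; in a script, plt.show() or plt.savefig() would be needed.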

Below are images showing the code snippets of the Exploratory Data Analysis.

Image description

Image description

Image description

Image description

Image description

Linear Algebra & Model Training
In this sales prediction project, I applied two linear algebra concepts. Linear algebra is a branch of mathematics that aims at solving systems of linear equations with a finite number of unknowns (Schilling, Nachtergaele, and Lankham, n.d.). The concepts implemented are the covariance matrix and singular value decomposition. A covariance matrix holds the covariance values of each pair of variables in multivariable data (Builtin, 2023). Singular value decomposition of a matrix is the factorization of that matrix into the product of three matrices, for example $A = UDV^{T}$, where the columns of U and V are orthonormal and the matrix D is diagonal with positive real entries (Guruswami, n.d.).
For this project, I converted the sales column to numeric values because the attached dollar sign made it a string; the column has to be numeric because it is used as the target variable. I created a variable X (effectively a new dataframe) containing all the columns except the ones I dropped, and a target variable y containing the values in the sales column. Afterwards, I calculated the covariance matrix by calling NumPy's cov function (cov_matrix = np.cov(X.T)). The T transposes the dataframe so that each row represents a variable, which is the layout np.cov expects by default. Each entry of the covariance matrix measures how much a pair of features in X change together.
The singular value decomposition (SVD) is performed on the features in X and returns three factors. In the line of code, U is the unitary matrix of left singular vectors, S holds the singular values, which are non-negative numbers in decreasing order, and Vt is a unitary matrix of right singular vectors.
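To make the two computations concrete, here is a minimal NumPy sketch. The feature matrix is random and only mimics the shape described (500 rows, a handful of numeric features), so the numbers are not those of the retail dataset. The last step is an added illustration of how the two concepts relate: for mean-centered data, the squared singular values divided by n - 1 equal the eigenvalues of the covariance matrix.

```python
import numpy as np

# Random stand-in for the feature matrix X: 500 rows, 3 hypothetical features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))

# np.cov treats each ROW as a variable by default, hence the transpose.
cov_matrix = np.cov(X.T)  # shape (3, 3)

# Thin SVD: X = U @ diag(S) @ Vt, with S sorted in decreasing order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(S) @ Vt

# Link between the two: SVD of the centered data vs. covariance eigenvalues.
X_centered = X - X.mean(axis=0)
S_centered = np.linalg.svd(X_centered, compute_uv=False)
eigvals = np.sort(np.linalg.eigvalsh(cov_matrix))[::-1]

print(cov_matrix.shape, U.shape, S.shape, Vt.shape)
print(np.allclose(S_centered**2 / (len(X) - 1), eigvals))
```

Note that NumPy returns S as a one-dimensional array of singular values, so np.diag(S) is needed to rebuild the diagonal matrix.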

Below are the code snippets for the covariance matrix and the SVD

Image description

Image description

Model Training
The machine learning model used to predict future sales is linear regression. Linear regression is a model in which one or more independent variables are used to estimate a continuous dependent variable.
First, I split the retail dataset into training and testing sets. Then I extracted the index of the marketing expenditure column in order to identify that column among the features. Next, I created a list of feature columns and dropped the sales column. A simple linear regression was then performed with marketing expenditure as the only predictor.
Afterward, a multiple linear regression was performed using all features. Multiple linear regression is a statistical technique that models the relationship between a dependent variable and two or more independent variables (Frost, 2023). It can be thought of as an extension of simple linear regression.
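The training procedure could look roughly like the sketch below, assuming scikit-learn was the library used (the text does not say). The data is synthetic, the feature names are hypothetical, and the 80/20 split and random_state are arbitrary choices. Unlike the retail dataset, the synthetic target is deliberately built to depend on marketing expenditure so that the fit is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; feature names are assumptions.
rng = np.random.default_rng(0)
feature_names = ["Number of Products Sold", "Marketing Expenditure", "Frequency"]
X = rng.normal(size=(500, 3))
y = 50 + 2.0 * X[:, 1] + rng.normal(scale=1.0, size=500)  # hypothetical sales

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Simple linear regression: marketing expenditure as the only predictor.
marketing_idx = feature_names.index("Marketing Expenditure")
simple_model = LinearRegression().fit(X_train[:, [marketing_idx]], y_train)

# Multiple linear regression: all features.
multi_model = LinearRegression().fit(X_train, y_train)

simple_pred = simple_model.predict(X_test[:, [marketing_idx]])
multi_pred = multi_model.predict(X_test)

print(simple_model.coef_, simple_model.intercept_)
print(multi_model.coef_, multi_model.intercept_)
```

Indexing with [marketing_idx] in a list keeps the single feature two-dimensional, which is the shape scikit-learn estimators expect.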

Below are the code snippets of the linear regression model and multiple linear regression model

Image description

Evaluation Metrics

The R-squared for the simple regression (-0.005806533949882953) and for the multiple linear regression (0.008004018071779306) indicate that the models explain virtually none of the variability of the response data around the mean. Moreover, the negative value for the simple regression suggests that it performs slightly worse than a horizontal line (the mean of the dependent variable). The RMSE for the multiple linear regression is 29.246565814382784. This value (29.25) means that the model’s predictions are typically off by about 29.25. The coefficients of the multiple linear regression are as follows: feature 1 is -0.21182487, feature 2 is -0.00075948, and feature 3 is -0.27559952. Each value represents the change in the dependent variable for a one-unit change in the respective independent variable, with the other features held constant. A negative coefficient indicates an inverse relationship between the feature and the target variable. The intercept of the multiple linear regression is 55.55450162269854. This value (55.55) represents the expected value of the dependent variable when all the independent variables are zero.
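To show how these two metrics behave, here is a small sketch that computes them by hand with NumPy. The helper names r_squared and rmse and the toy numbers are illustrative assumptions, not taken from the notebook. It demonstrates why always predicting the mean gives an R-squared of exactly 0, and why predictions worse than that give a negative value.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values, not taken from the retail dataset.
y_true = np.array([10.0, 20.0, 30.0, 40.0])

# Baseline: always predict the mean (the "horizontal line").
mean_pred = np.full_like(y_true, y_true.mean())

# Predictions that are worse than the baseline.
bad_pred = np.array([40.0, 30.0, 20.0, 10.0])

print(r_squared(y_true, mean_pred))  # 0.0
print(r_squared(y_true, bad_pred))   # -3.0
print(rmse(y_true, bad_pred))
```

An R-squared close to zero, as reported above, therefore means the models do about as well as predicting the average every time.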

Business Implication

The evaluation metrics have some implications for the retail company's sales forecasting. The low and negative R-squared values indicate that the models do not fit the data well and have poor predictive capability. This suggests that the features used are not good predictors of the target variable, so feature engineering will be required. The high RMSE confirms that the models’ predictions are not accurate.
This means the models need improvement. One improvement is to collect data that is more relevant to sales. Also, feature selection should be implemented to ensure that the best features are selected for the models.
Overall, the current models are not reliable for predicting future sales for the retail company.

Conclusion

In conclusion, the sales prediction analysis of the retail company was carried out using two linear algebra concepts: the covariance matrix and singular value decomposition. These concepts provided valuable insights before the model training. The covariance matrix helped me understand which variables to include in the simple and multiple linear regression models based on their relationships with the target variable. In simple terms, it was used to identify which variables have the strongest relationships with sales.
The SVD helped to reduce the dimensionality and the multicollinearity and improved the stability and performance of the regression models. This was achieved by focusing on the most significant components.
The linear regression models implemented in the sales prediction analysis provided some insight into the relationships between various factors and their impact on sales performance. After the modeling, performance was checked using the R-squared values and the root mean squared error (RMSE). The coefficients indicated the direction and magnitude of the relationships between the independent variables and the dependent variable. Moreover, the negative coefficients indicated an inverse relationship.

Project Reflection

This reflection on the sales prediction project covers the challenges I encountered, including anomalies in the dataset that made it unfit for the linear algebra concepts (covariance matrix and SVD), the insights that I derived, and future prediction.
The dataset had some anomalies that would have made analysis difficult. Overall the retail dataset was fairly good, but the sales column was recognized as a string because of the dollar sign attached to it. I had to remove the dollar sign from the figures to make the column numeric, since it was needed for the covariance matrix and SVD.
The results of the simple and multiple regression models indicated that they are not good predictors of future sales. Thus, feature selection or hyperparameter tuning is essential to make sure the right columns are selected for the training and testing of the model.
