Data Analyst | ML Engineer
Data Analysis enthusiast with a strong academic background in Mathematics and Computer Science, experienced in Data Analysis and Machine Learning, and proficient in Advanced Excel, Python, SQL, and Power BI. Eager to gain hands-on experience, combining a constant hunger for new skills with a desire to apply cutting-edge data science technology.
Parul University, Gujarat [2020 – 2024]
Undergraduate: B.Tech in Computer Science Engineering with a Specialization in Artificial Intelligence
- Operating Systems: Windows
- Tools: Power BI, Excel
- Programming Languages: Python, SQL, HTML, CSS
- Frameworks: Django, Streamlit
- Soft Skills: Self-motivated, Quick learner, Social, Resilient, Team Player, Public Speaker
Objective: Build an app that predicts breast cancer and visualizes the given cell nuclei measurements.
Dataset Description:
Attribute Information:
The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
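This naming scheme can be reconstructed programmatically; the sketch below assumes the column-name convention of the common CSV version of the Wisconsin breast cancer dataset:

```python
# The raw file's first two fields are the ID and diagnosis, so the 30
# features occupy fields 3-32: ten base measurements, each with a mean,
# standard error, and "worst" value (assumed column-name convention).
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave points", "symmetry",
        "fractal_dimension"]
features = ([f"{b}_mean" for b in base]
            + [f"{b}_se" for b in base]
            + [f"{b}_worst" for b in base])
print(len(features))                            # 30
print(features[0], features[10], features[20])  # radius_mean radius_se radius_worst
```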
Project Structure:

- `app/`
  - `main.py`: contains the main function of the app
- `data/`
  - `data.csv`: contains the dataset
- `model/`
  - `main.py`: contains the logistic regression model code to predict
  - `model.pkl`: pickle file of the model
  - `scaler.pkl`: pickle file of the scaler

Libraries Used:
- `pandas`: used to clean the dataset by removing unnecessary columns and transforming categorical variables into numerical ones.
- `StandardScaler`: used to preprocess numerical data before model training to ensure that different features have the same scale.
- `train_test_split`: used to split the dataset into training and testing subsets.
- `LogisticRegression`: used to train a logistic regression model on labeled data where the target variable has two classes, predicting the probability of a sample belonging to a certain class.
- `accuracy_score`: used to compare the predicted labels to the true labels and return the accuracy as the ratio of correctly predicted samples to the total number of samples.
- `pickle`: used to save the machine learning model and scaler object into binary files for later use or in a different environment.
- `streamlit`: used to create the user interface, generate sliders, visualize graphs, and display predictions.
- `plotly.graph_objects`: used to create interactive plots and charts.
- `numpy`: used for numerical operations in data preparation.
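Taken together, a minimal sketch of the training pipeline in `model/main.py` might look like this (the column names and cleaning steps are assumptions based on the common CSV version of the Wisconsin breast cancer dataset):

```python
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Clean the dataset: drop unused columns and map the categorical
# diagnosis (M/B) to a numerical target (assumed column names).
data = pd.read_csv("data/data.csv")
data = data.drop(["Unnamed: 32", "id"], axis=1)
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})

X = data.drop("diagnosis", axis=1)
y = data["diagnosis"]

# Scale the features so they share the same scale before training.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Train the two-class logistic regression model and check its accuracy.
model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Save the model and scaler as pickle files for the Streamlit app.
with open("model/model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model/scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```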
Data Preparation: The dataset is cleaned with `pandas` and the numerical features are scaled with `StandardScaler` before model training.

Visualization:
A radar chart (`go.Scatterpolar`) is used for visualizing the mean, standard error, and worst values of various features related to breast cancer diagnosis.
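A minimal sketch of such a radar chart follows; the feature subset and values below are hypothetical placeholders, whereas in the app the values would come from the scaled slider inputs:

```python
import plotly.graph_objects as go

# Hypothetical subset of features with already-scaled values in [0, 1].
categories = ["Radius", "Texture", "Perimeter", "Area", "Smoothness"]
mean_vals = [0.70, 0.52, 0.71, 0.56, 0.60]
se_vals = [0.30, 0.20, 0.31, 0.25, 0.22]
worst_vals = [0.80, 0.61, 0.82, 0.70, 0.65]

fig = go.Figure()
# One trace per group of values: mean, standard error, and worst.
for name, values in [("Mean Value", mean_vals),
                     ("Standard Error", se_vals),
                     ("Worst Value", worst_vals)]:
    fig.add_trace(go.Scatterpolar(
        r=values, theta=categories, fill="toself", name=name
    ))
fig.update_layout(
    polar=dict(radialaxis=dict(visible=True, range=[0, 1])),
    showlegend=True,
)
fig.show()  # inside the Streamlit app this would be st.plotly_chart(fig)
```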
App Snippet:
Results:
Conclusion:
Objective: Create a one-line Exploratory Data Analysis (EDA) experience.
Working: This app analyzes the uploaded CSV files, providing in-depth insights into the dataset’s characteristics through exploratory data analysis techniques.
Features:
Libraries Used:
- `numpy`: used for various numerical operations, especially in generating example datasets or handling numerical data within the analysis.
- `pandas`: used to read and manipulate CSV files, organize data into DataFrames, and facilitate data exploration and presentation.
- `streamlit`: used as the primary framework for building the EDA web application, allowing easy integration of data visualizations, user inputs, and data analysis tools.
- `ydata_profiling`: used to create comprehensive and interactive exploratory data analysis reports based on the uploaded or example datasets, providing detailed insights into the data's characteristics.
- `streamlit_pandas_profiling`: facilitates the direct embedding of the profiling reports generated by ydata_profiling into the Streamlit app, allowing users to visualize and interact with the analysis within the app interface.

App Snippet:
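A minimal sketch of how such an app can be wired up (the widget labels and the fallback example dataset are assumptions):

```python
import numpy as np
import pandas as pd
import streamlit as st
from ydata_profiling import ProfileReport
from streamlit_pandas_profiling import st_profile_report

st.title("One-Line EDA")

# Analyze an uploaded CSV, or fall back to a small random example dataset.
uploaded = st.file_uploader("Upload your CSV file", type=["csv"])
if uploaded is not None:
    df = pd.read_csv(uploaded)
else:
    df = pd.DataFrame(np.random.rand(100, 3), columns=["a", "b", "c"])

# ydata_profiling builds the full EDA report in one line;
# st_profile_report embeds it directly in the Streamlit page.
report = ProfileReport(df, explorative=True)
st_profile_report(report)
```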
Result: Based on the uploaded CSV file or the example dataset, the app generates the EDA report with the help of ydata-profiling and Streamlit.
Conclusion: The Streamlit framework made it easy to build a web application for machine learning by simplifying the creation of interactive and data-driven apps.
Objective: Analyse the number of illiterates in Telangana during the year 2014 and compare it with the current year (2023).
Data Source: The dataset is taken from Open Data Telangana.
Dataset Description: The dataset provides information about the number of illiterates in the rural areas of Telangana State, by gender, down to the gram panchayat level. This data is organized according to the old districts for the period 2014.
Dashboard Snippet:
Insights Gained:
Result: Conducted an in-depth analysis of illiteracy rates in rural Telangana using data sourced from Open Data Telangana, comparing the number of illiterates in 2014 with the current year (2023) to gauge changes and trends in literacy rates over the years.
Conclusion: Even after 9 years, the district that was once known as Mahabubnagar has the highest percentage of illiterates in Telangana.
Objective: Collect comprehensive data on electronic gadgets commonly used by software employees or students, such as laptops, tablets, smartphones, smartwatches, headphones, earphones, and earbuds, during the Diwali season, with particular emphasis on the offer deals.
Data Source: The data is scraped from the 'amazon.in' website.
Dataset Description: The fields required from the webpage:

- `name`: Title of the product
- `brand`: Brand of the product
- `model_name`: Model name of the product
- `screen_size`: Display size of the screen
- `colour`: Colour of the product
- `cpu_model`: CPU model of the product
- `ram_memory_installed_size`: Installed size of RAM memory in the product
- `operating_system`: Operating system of the product
- `mrp`: Actual price of the product
- `offer`: Offer on the product
- `number_of_purchase_in_last_month`: Number of purchases of the product in the last month
- `number_of_ratings`: Number of ratings received for the product
- `rating`: Overall rating of the product

Data Preparation: Required data was extracted from the webpage by finding the mentioned tags; if no such tag was found, the value was replaced with an empty string. Products with no title value were removed from the dataset, which was then saved as a CSV file.
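A minimal sketch of this extraction logic (the URL, tag names, and CSS classes below are assumptions; Amazon's actual markup differs and changes often):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# A browser-like User-Agent header; without it Amazon usually blocks requests.
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

def get_text(soup, tag, attrs):
    """Return the tag's text, or an empty string if the tag is missing."""
    element = soup.find(tag, attrs=attrs)
    return element.get_text(strip=True) if element else ""

url = "https://www.amazon.in/s?k=laptop"  # hypothetical search URL
page = requests.get(url, headers=HEADERS)
soup = BeautifulSoup(page.content, "html.parser")

rows = []
for product in soup.find_all("div", attrs={"data-component-type": "s-search-result"}):
    rows.append({
        "name": get_text(product, "span", {"class": "a-text-normal"}),
        "mrp": get_text(product, "span", {"class": "a-price-whole"}),
        "rating": get_text(product, "span", {"class": "a-icon-alt"}),
    })

# Drop products with no title and save the rest as a CSV file.
df = pd.DataFrame(rows)
df = df[df["name"] != ""]
df.to_csv("amazon_gadgets.csv", index=False)
```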
Libraries Used:
- `requests`: Used for sending HTTP requests to websites to fetch the HTML content of web pages.
- `BeautifulSoup` (from `bs4`): Used for parsing the HTML content obtained using the `requests` library.
- `pandas`: Used to create the DataFrame to organize and structure the scraped data.
- `numpy`: Used for numerical operations, especially in handling numerical data.

Data Snippet:
Result: Successfully web scraped the amazon.in data once, before the header I used was blocked or restricted.
Conclusion: Learned that the success of web scraping depends on several factors:

- Headers: Many websites require a valid user-agent string, which determines whether requests are allowed, blocked, or restricted.
- Website Policies: Some websites, like Amazon, have strict anti-scraping policies that prevent or limit scraping attempts.
- Website Changes: Since the scraper relies on specific HTML tags, the code may fail to find the required HTML elements if the website changes its structure.