An End to End Machine Learning Project (Part 1)
The DS Job Hunt Project
Part of a series solving The Data Science Job Hunt
Typically, when I develop a machine learning project for a company, I try to follow a standard process. The process is not just about delivering a quality product; it also forces effective communication with my team and stakeholders. Textbooks and instructors always harp on these methods (CRISP-DM and others), and it is easy to dismiss them or assume the answers on the “template” are obvious. But in the real world, I have caught many BIG directional issues by slowing down at the very beginning and setting up a proper process. I have also been burned by skipping this template when urgent yet misguided requests came from high-level stakeholders.
The process I follow is:
1. Clarify the Problem and Constraints
2. Establish Metrics for Success
3. Understand the Data Sources
4. Explore the Data
5. Preprocess the Data
6. Feature Engineering
7. Model Selection
8. Model Training
9. Model Deployment
10. Model Monitoring and Maintenance
Clarify the Problem and Constraints
This is the most important step in the process. It is crucial to understand the problem you are trying to solve and the constraints you are working within. This is the step where you will ask questions like:
- What is the business problem we are trying to solve?
- What is the value to the business of solving this problem?
- How will the rest of the business use the output of this project?
- What are the constraints we are working within? “We can only use data that is publicly available”, “We need to be able to explain the model to the regulators”, or “We need to be able to update the model every month.”
- What are the risks of a bad output?
- Are there any ethical considerations?
This is also the step in the process that benefits from a Data Scientist with plenty of real-world experience and confidence inside an organization. Here are two real examples of projects I was a part of where this step was crucial:
“Can you build a model to predict the number of customers that will churn next month?”
This seemed like a fairly straightforward request, but after building a “Biz Problem” presentation, we found that the business had many interpretations of what “churn” meant. Some thought it meant a customer who had cancelled their subscription; others thought it meant a customer who had not logged in for a month. This was a crucial distinction that needed to be clarified before we could proceed. Given the nature of modern corporate America, this actually turned into an ugly political issue that needed to be resolved by the CEO and took weeks to settle. It also meant that we saved weeks of work by not starting on the wrong project.
“Build a clustering model to segment our customers.”
This was a major failure on my part. The request came from a high-level stakeholder who claimed to be very familiar with machine learning but was largely unavailable for questions and preferred a strict political hierarchy (my direct manager took the request at face value, as did I). I should have asked more questions about the business problem and the constraints, and insisted on a kickoff presentation before getting started. It turned out that the stakeholder was not familiar with machine learning and was actually looking for a way to filter customers based on their purchase history. This was a simple SQL query that could have been written in a few minutes and built into a dashboard in a day. Instead, I spent weeks collecting data and building a clustering model that wasn’t needed at the time. We did end up using the model later, but it was a rush job that could have been avoided with a proper “Problem & Constraints” presentation before moving forward.
For The Data Science Job Hunt:
- Business Problem: Help me find a job.
- Value to the Business: Finding patterns in the search process that signal success will help me find a _better_ job faster.
- Use of Output: I will use the output to track my job search progress and optimize the process. These patterns might be “I get more interviews when I apply to jobs on LinkedIn” or “Following up with a recruiter increases my chances of getting an interview.”
- Constraints: I can only use publicly available data. I need to be able to explain the model to potential employers. I need to be able to update the model as I get more data.
- Risks: I could miss out on a great job because I didn’t apply. I could waste time applying to jobs that I’m not qualified for.
- Ethical Considerations: I need to make sure that I’m not discriminating against any group of people or exposing any sensitive information about a potential employer.
As this project is a little atypical, the problem statement is more vague than usual. Typically in a problem like this, another question we ask is “Does this even need this level of effort (or a machine learning model)?” In this case, I believe it does, because I want to practice my data science skills and build a portfolio of projects that I can showcase to potential employers. But in a real business setting, probably not.
However, an important learning from the problem statement above: this is not a machine learning project. Sad face emoji. Instead, to solve the problem above, we really only need a thorough analysis to find some basic patterns of success. That still requires some data engineering, data visualization, and data analysis, which lets the project showcase a few skills. And I will shoehorn some machine learning in along the way, but it is not the focus of the project, and our Problem and Constraints step shows us that.
Establish Metrics for Success
I like to establish metrics as early in the process as possible: before collecting data and certainly before modeling. The success metrics will guide the rest of the project. If you know what you are trying to optimize for, you can make better decisions about what data to collect, what features to engineer, what model to use, and so on. More importantly, at this early phase of the project it’s likely you are in frequent communication with stakeholders: holding kickoff meetings, building project briefs, or creating Jira tickets. By agreeing on metrics early, you can ensure that everyone is on the same page and that you are not wasting time on a project that will never be used because no one can define or agree on success.
I like to define two metrics for a project.
- Model metric: Accuracy, Precision, Recall, F1, AUC, etc.
- Business metric: Revenue, Cost, Customer Satisfaction, etc.
A model metric is problem dependent and is firmly in the domain of the data scientist, but defining a business metric and how those two relate is usually a hot topic of conversation across departments.
Discussing these metrics early gives me an opportunity to explain the difference between the two and why both are important. I also like to explain that the model metric is a proxy for the business metric: not the end goal itself, but a tool to help us reach it. This step also gives me a chance to lightly explain the more technical model metric (tailored to the audience, of course). Then, when results are presented later, I find it easier to revisit what the model metric means and why it matters.
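To make that concrete, here is a minimal sketch of how a model metric can be translated into a business metric, using a hypothetical churn model. The labels, predictions, and dollar figures are invented for illustration:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels (1 = churned) and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# Model metrics: firmly in the data scientist's domain.
precision = precision_score(y_true, y_pred)  # of the customers we flag, how many actually churn
recall = recall_score(y_true, y_pred)        # of the churners, how many we catch

# Business metric: expected value of acting on the model's flags.
# These dollar figures are assumptions made up for this sketch.
value_of_retained_customer = 500  # revenue saved per churner we win back
cost_of_intervention = 50         # cost of a retention offer per flagged customer

n_flagged = sum(y_pred)
n_caught = sum(t and p for t, p in zip(y_true, y_pred))
expected_value = n_caught * value_of_retained_customer - n_flagged * cost_of_intervention

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(f"expected value of the campaign: ${expected_value}")
```

Framing results this way keeps the conversation anchored on the number the business cares about, while the precision and recall explain where that number comes from.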
As an example, I once worked on modeling new customer acquisitions. The business had a two-step funnel: application, then purchase. Because applications had higher volume, predicting which users were likely to apply was easier and more accurate for the model, and could be considered a strong proxy for sales. However, some folks at the company believed that certain customers converted from application to purchase at a higher rate than others; by simply increasing applications, we might just be increasing the number of low-converting customers. A deeper dive into the data showed this was not the case and that the model metric was a strong proxy for the business metric. Without this conversation early in the process, I could have wasted time and resources on a project that might never have been used.
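Here is a sketch of the kind of sanity check that deeper dive amounts to, assuming a hypothetical table of applications with the model’s score bucket and an eventual purchase flag:

```python
import pandas as pd

# Hypothetical records: the model's score bucket for "likely to apply"
# and whether each application converted to a purchase.
df = pd.DataFrame({
    "score_bucket": ["high", "high", "high", "mid", "mid", "mid", "low", "low", "low", "low"],
    "purchased":    [1, 1, 0, 1, 0, 1, 0, 1, 0, 0],
})

# If conversion dropped sharply in the buckets the model favors, "more
# applications" would not reliably mean "more purchases" and the proxy
# argument would break down.
print(df.groupby("score_bucket")["purchased"].mean())
```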
Data Buy-In: It is usually at this point in the conversation that I can also influence the data collection process. By connecting the business goal with the model metrics, I find greater buy-in from various departments to help me collect the data we need. “I see how this could help us increase revenue, I’ll talk with all of our CS reps and be sure they are entering specific information from calls correctly.”
For The Data Science Job Hunt:
Regarding the metrics for this project, I have a few ideas:
- Model Metric: Or, in this case, an Analysis Metric. I will need to define what a “successful” job application is and what a “successful” job search looks like. This will be a little tricky because I don’t have a lot of data to work with. An immediate proxy for a successful job application is a response (sketched below). I will need to define these metrics more clearly as I collect more data.
- Business Metric: Again, the “business” metric is a little more abstract. The “success” event really only happens once, but I am looking to optimize my job search. So reducing time-to-offer and increasing the number of offers are the obvious metrics, but again, there is little data here and nothing to compare against. So, this project is more about the journey than the destination.
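As a rough sketch of what these analysis metrics could look like, here is how response rate and time-to-response might be computed from a made-up application log (the column names are my working assumptions; the actual tracker is described in the next section):

```python
import pandas as pd

# Made-up application log; the real columns come from the tracker
# described in the next section.
apps = pd.DataFrame({
    "source":       ["LinkedIn", "LinkedIn", "Referral", "Company site"],
    "applied_on":   pd.to_datetime(["2024-01-02", "2024-01-05", "2024-01-06", "2024-01-09"]),
    "responded_on": pd.to_datetime(["2024-01-10", None, "2024-01-08", None]),
})

apps["responded"] = apps["responded_on"].notna()
apps["days_to_response"] = (apps["responded_on"] - apps["applied_on"]).dt.days

# Proxy success metrics: response rate and response speed, sliced by source.
print(apps.groupby("source")[["responded", "days_to_response"]].mean())
```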
Understand the Data Sources
This is the step where I typically start to organize data around the problem. I will need to understand where the data is coming from, how it is collected, and what it looks like. Very often an organization has a fixed data source that it has been collecting over time. I’ll need to use experience and intuition to judge whether the data is relevant to making a prediction: does it signal information about the problem? If not, can we acquire more (purchase, scrape, etc.) or engineer features that will help us make a prediction?
This is also the step where I will start to think about the quality of the data and what I can do to improve it, the privacy and security of the data and what I can do to protect it, and the volume of the data and what I can do to scale it.
I like to start by asking questions like:
- Where is the data coming from?
- How is the data collected?
- When is the data collected?
- Where is the data stored?
- Is the historical data the same as future data?
- How is the data accessed?
- How is the data processed?
- How is the data labeled?
- How is the data updated?
- Is the data archived on a schedule?
- How is the data secured?
Asking these questions now can once again stop the process early. If a company asks for a customer segmentation model but knows nothing of its customers other than their email addresses and purchase history, the project is unlikely to succeed. Or if a company wants to understand propensity to buy and believes it has plenty of data on its current customers, but has no way to identify new customers or track them through the sales funnel, again, the project will not succeed.
For The Data Science Job Hunt:
For this DS Job tracker, I will create new data sources. The business problem is to “find me a new job,” so I will need data on the jobs I have applied for. I will write a longer post on setting up the data collection process, but here are a few of the data points I would like to collect:
- Job Descriptions: I will be collecting data on the job applications I submit. This will include the company, the job title, the job description, requirements and responsibilities.
- Job Categories: I will be collecting data on the job categories I apply to. This will include the industry, the job function, the job level, and specific domains such as marketing, healthcare, or finance.
- Job Sources: I will be collecting data on the job sources I use. This will include the job board, the company website, the recruiter, personal recommendation, etc. I have an early hypothesis that the source of the job application will have a large impact on the success of the application.
- Job Status: I will collect information about each step in the process in order to measure response time and success rate. This will include the date of application, the date of response, the date of interview, the date of offer, etc.
- Interview Type: Do I have a better success rate with a phone interview or an in-person interview? Do I have a better success rate with a technical interview or a take-home assignment?
- Salary: Are there patterns in the salary of the responses I get? Do I get more responses from higher paying jobs? Do I get more responses from lower paying jobs?
This is a good time to think about the volume, velocity, and variety of the data. It’s unlikely that I’ll be collecting petabytes of job application data, so automation has quickly diminishing returns. And while there is some merit in showing the process of scraping sites and connecting to APIs, I think the best use of my time is to collect the data manually and focus on the analysis. However, I will build an entry form that allows for efficient, validated Create, Read, Update, and Delete (CRUD) operations, along with a database to store the data (sketched below).
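As a minimal sketch of what that might look like, here is a version of the tracker table and the “Create” step of the CRUD flow, using sqlite3 as a lightweight stand-in for the database (this series plans to use BigQuery; the column names are my working assumptions based on the list above):

```python
import sqlite3

# A stand-in schema mirroring the data points listed above; the production
# version will live in BigQuery behind a validated entry form.
conn = sqlite3.connect("job_hunt.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS applications (
        id             INTEGER PRIMARY KEY,
        company        TEXT NOT NULL,
        job_title      TEXT NOT NULL,
        source         TEXT NOT NULL,   -- job board, recruiter, referral, ...
        category       TEXT,            -- industry / function / level
        salary_low     INTEGER,
        salary_high    INTEGER,
        applied_on     TEXT NOT NULL,   -- ISO date strings keep this simple
        responded_on   TEXT,
        interview_type TEXT,
        status         TEXT DEFAULT 'applied'
    )
""")

# The "Create" of CRUD: a parameterized insert from the entry form.
conn.execute(
    "INSERT INTO applications (company, job_title, source, applied_on) VALUES (?, ?, ?, ?)",
    ("Acme Corp", "Data Scientist", "LinkedIn", "2024-01-02"),
)
conn.commit()
conn.close()
```

Keeping the schema to a single flat table fits the goal of staying simple enough to pivot quickly as the project evolves.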
Additionally, as I iterate on this project, keeping the data collection and storage simple will allow me to quickly pivot and change the data I collect as I learn more about the problem.
That is the end of Part 1, which includes a detailed setup of the project and thoughtful consideration of the problem and constraints, metrics for success, and data sources. In Part 2, I will continue with the next steps in the process: Exploring the Data, Preprocessing the Data, and Feature Engineering.
Articles in the series (and work coming soon):
- A detailed description of the project — Following something similar to the CRISP-DM process (this article)
- The data sources and data collection process
- Plotly Dash — building the front end (I’m not a front-end developer)
- Using Google Cloud Platform App Engine to host the project
- Cover Letter Generator — LLM, prompting, and more
- BigQuery for data storage and analysis
- Data visualization in Plotly — histograms, Sankey diagrams, and more
- CI/CD with Google Cloud Build
- Segmentation Project to look for patterns in job applications — PCA, t-SNE, clustering, and more