Salary, whether we like it or not, plays an important role when planning a career path. As a result, it is important for Sokanu to report salaries as accurately as possible. Someone researching the salary for a career will typically visit multiple sources to get the most accurate picture possible. Combining information from multiple sources is likewise core to how Sokanu calculates salary. This article details how Sokanu combines our own internal data with external data from publicly available sources.
How Does Sokanu Calculate Salary?
Throughout the Sokanu assessment, salary data is collected from users when they tell us their past careers. We incorporate this data into the salary information we then report to other users, but there are some difficulties when trying to do this. When Sokanu reports salaries per career to our users, one option is to take the average of this collected data. However, this is problematic because we may not have collected enough data points per career in order to confidently report the salaries for certain careers. As a result, we perform a calculation in two steps:
- We first filter out data which we believe to be fake or careless data. This step is not covered in this blog post.
- Next, we combine our own data with external data from the Bureau of Labor Statistics (BLS). This blog post details how this calculation takes place.
To perform this calculation, we call upon Bayesian statistics. In a nutshell, Bayesian statistics allows us to systematically balance expert opinion (external salary data from BLS) against what our own data is telling us (salary data collected by Sokanu).
To demonstrate this, let’s start off with an illustrative scenario in which we are comparing the salary of Actors and Welders with the following data collected:
|Career||Number Surveyed||Average Salary|
Which of these two careers has a higher average salary? Purely looking at the data collected, we would have to say that Actors earn more on average than Welders.
However, most people would intuitively disagree with this result. Reporting the results above is questionable for two main reasons:
- The number of people surveyed. We are less sure about the average salary reported for Actors because only three people were surveyed. Perhaps one of them was a famous celebrity who reported earning $1 million a year, while the other two reported a more modest income of $25,000. In contrast, we should feel more comfortable with the value reported for Welders, since we have more respondents.
- Most people’s intuition will tell them that the average welder would have a higher income than the average actor. Even without seeing the number of people surveyed, the results above are still questionable.
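To make the first point concrete: if the three Actor responses really were the hypothetical ones described above (one at $1 million and two at $25,000), the plain average would land far above what two of the three respondents earn:

```latex
\frac{1{,}000{,}000 + 25{,}000 + 25{,}000}{3} = 350{,}000
```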
The results are called further into question if another data source or a “salary expert” comes along and tells us that the salaries should look more like this:
|Career||Expert Opinion of Salary|
Clearly, it would be irresponsible for us to report our own “Average Salary” data from the first table as is. At the same time, we should not rely purely on expert opinion, since we have collected a substantial amount of data for Welders and feel that we have a pretty good idea of how much they earn. Like most things in life, the “right answer” probably lies somewhere between the two options. The question now becomes: how do we combine “expert opinion” with the data that we have collected?
Intuition Behind the Methodology:
Although the example above is a contrived one, it illustrates the problem quite well. We would like to strike a balance between expert opinion and the data we have collected. Here is a bird’s-eye view of the equation that we use:
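As a sketch consistent with the description that follows, the combination can be written as a weighted average. The constraint that the two weights sum to one is an assumption made here for readability; the article does not state how the weights are normalized.

```latex
\text{CombinedSalary} = W_{data} \cdot \text{AverageSalary} + W_{expert} \cdot \text{ExpertOpinionOfSalary},
\qquad W_{data} + W_{expert} = 1
```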
At the heart of the methodology is a careful balancing act between the AverageSalary that our data is telling us and the ExpertOpinionOfSalary that external sources are telling us. In the context of the equation above, Wdata controls how much weight we place on the AverageSalary data collected: the higher Wdata, the more weight we place on the data. The same is true of Wexpert and ExpertOpinionOfSalary. As we collect more data, Wdata and Wexpert change accordingly.
Let’s try to gain some intuition as to how Wdata and Wexpert change as we collect more data:
- If the number of data points (number of people surveyed) increases, Wdata increases and Wexpert decreases.
- If the “spread” of the data collected is high, Wdata decreases and Wexpert increases.
The first point above has already been discussed. As the number of data points increases, we have more confidence in the resulting average salary. This makes sense!
The second point introduces a new concept: “data spread”. To measure how spread out the collected data is, we use a statistical concept called “variance”. In the example above, where one of the three Actors has a drastically higher salary than the other two, the data is very spread out and is therefore said to have a high variance. With a high variance, we have less confidence that we know where the average salary lies, so we rely more on expert opinion.
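Both rules fall out naturally if each source is weighted by its precision (the inverse of its variance). The sketch below assumes a normal model with a known data variance and a normal prior centred on the expert opinion; the article does not say which model Sokanu actually uses, and the function names and the `prior_variance` parameter are hypothetical.

```python
def precision_weights(n, data_variance, prior_variance):
    """Return (w_data, w_expert), normalized so they sum to 1.

    Assumption: normal likelihood with known variance and a normal prior,
    so each source is weighted by its precision (inverse variance).
    """
    data_precision = n / data_variance       # grows with n, shrinks with spread
    prior_precision = 1.0 / prior_variance   # fixed confidence in the expert
    total = data_precision + prior_precision
    return data_precision / total, prior_precision / total


def combined_salary(average_salary, expert_salary, n, data_variance, prior_variance):
    """Weighted average of our data and the expert opinion."""
    w_data, w_expert = precision_weights(n, data_variance, prior_variance)
    return w_data * average_salary + w_expert * expert_salary


# Illustrative numbers only (not Sokanu or BLS figures).
for n, spread in [(3, 1e8), (300, 1e8), (3, 1e10)]:
    w_data, w_expert = precision_weights(n, spread, prior_variance=1e6)
    print(f"n={n:>3} variance={spread:.0e} w_data={w_data:.4f} w_expert={w_expert:.4f}")
```

Running the loop shows the weight on the data rising when `n` grows and falling when the variance grows, which matches the two points above.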
Let’s now take a look at some results of using this methodology:
|Career||Number Surveyed||Average Salary||Expert Opinion of Salary||Combined Salary|
In the above table, we can see that for Actors, the resulting “Combined Salary” is very heavily skewed towards expert opinion. This result is great as an average salary of $37,800 is a lot more reasonable than an average salary of $350,000. As discussed before, the result is skewed heavily because we have very little data collected and a high variance.
In contrast, Welders are skewed heavily towards the actual data collected under “Average Salary” because there is enough data to support that result.
Now, what would happen to our data if we asked 497 additional Actors for their salary and all of them reported earning $25,000 a year?
|Career||Number Surveyed||Average Salary||Expert Opinion of Salary||Combined Salary|
The first thing to notice is that the “Average Salary” for Actors has changed drastically after accounting for the extra data. Secondly, the “Combined Salary” is now skewed more towards the “Average Salary” because we have collected enough data to change our minds about the expert opinion.
In fact, we have collected so much data that the expert opinion almost seems irrelevant. This is great! We trust expert opinion until enough evidence has been collected to suggest otherwise.
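The shift from three to 500 Actors can be replayed in code. The sketch below assumes precision weighting under a normal model, which the article does not confirm. `EXPERT_SALARY` and `PRIOR_VARIANCE` are hypothetical placeholders rather than the BLS figures behind the tables, and the three initial salaries are the illustrative ones from the celebrity example.

```python
from statistics import mean, variance

EXPERT_SALARY = 40_000        # hypothetical expert opinion, not a BLS figure
PRIOR_VARIANCE = 10_000 ** 2  # hypothetical: expert is "sure" to within ~$10,000


def combine(salaries, expert_salary, prior_variance):
    """Return (average, weight on data, combined salary).

    Assumption: each source is weighted by its precision (inverse variance).
    """
    n = len(salaries)
    avg = mean(salaries)
    data_precision = n / variance(salaries)   # sample variance of the responses
    prior_precision = 1 / prior_variance
    w_data = data_precision / (data_precision + prior_precision)
    combined = w_data * avg + (1 - w_data) * expert_salary
    return avg, w_data, combined


three = [1_000_000, 25_000, 25_000]       # average is 350,000
five_hundred = three + [25_000] * 497     # average drops to 26,950

for label, salaries in [("3 actors", three), ("500 actors", five_hundred)]:
    avg, w_data, combined = combine(salaries, EXPERT_SALARY, PRIOR_VARIANCE)
    print(f"{label}: average={avg:,.0f} w_data={w_data:.4f} combined={combined:,.0f}")
```

With only three responses and a huge spread, almost all of the weight sits on the expert value; with 500 mostly identical responses, the weight moves to the data and the combined figure follows the new average.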
Up until now, we have seen the intuition behind the balancing act between data collected and expert opinion. To see the benefits of this methodology more clearly, let’s imagine another scenario, concocted by David Robinson in his blog. Here, we assume the role of a baseball recruiter comparing these two players:
| Player | Number of Home Runs | Number of Batting Attempts | Home Run Percentage |
|---|---|---|---|
| Rookie Ryan | 1 | 2 | 50% |
| Veteran Victoria | 400 | 1000 | 40% |
Which of the two players above would you rather recruit?
In addition, suppose an expert baseball analyst tells us that, according to historical data, the typical home run percentage of any player is 5%. I think most people would choose Veteran Victoria, as she has been tested more and we have more evidence to support her reported home run percentage. For Rookie Ryan, in contrast, we just haven’t seen him play enough.
In our brains, we start off with a preconceived idea of what a typical player is able to achieve (a 5% home run percentage). Unless we have collected enough data to make us think otherwise, we tend to stick to that belief. In the case of Veteran Victoria, we have enough evidence to convince ourselves that she is better than the typical player. This is not the case for Rookie Ryan. This kind of reasoning is exactly what Bayesian statistics allows us to do mathematically.
The last ingredient in this Bayesian methodology is the concept of “prior/expert variance”. This parameter controls how “sure” we are of our expert opinion and how much evidence we must be presented with in order to shift our prior beliefs. The higher the prior variance, the less sure we are of our prior beliefs and the less data is required for us to change our minds.
Interestingly, the highest recorded home run percentage of any player within a single season is 15.34% by Barry Bonds in 2001. With this additional knowledge, we would be especially foolish not to choose Veteran Victoria.
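The recruiter's reasoning can be sketched with a Beta prior on the home run percentage, a common choice for rates and an assumption here rather than something the article specifies. The 5% prior mean comes from the analyst above; `prior_strength` (how many "pseudo-attempts" the prior is worth) is a hypothetical knob standing in for the prior variance, with larger values meaning lower prior variance.

```python
def shrunk_rate(successes, attempts, prior_mean, prior_strength):
    """Posterior mean of a rate under a Beta prior.

    prior_strength is the number of pseudo-attempts the prior is worth
    (hypothetical parameter): the larger it is, the lower the prior variance
    and the more data is needed to move away from prior_mean.
    """
    alpha = prior_mean * prior_strength
    beta = (1 - prior_mean) * prior_strength
    return (successes + alpha) / (attempts + alpha + beta)


PRIOR_MEAN = 0.05  # the analyst's typical home run percentage

for strength in (20, 100, 500):  # illustrative prior strengths
    ryan = shrunk_rate(1, 2, PRIOR_MEAN, strength)
    victoria = shrunk_rate(400, 1000, PRIOR_MEAN, strength)
    print(f"prior_strength={strength:>3} Ryan={ryan:.3f} Victoria={victoria:.3f}")
```

Under every prior strength tried here, Rookie Ryan's estimate stays close to the 5% prior while Veteran Victoria's stays far above it, so she remains the better pick.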
Throughout this article, we have discussed how Sokanu uses Bayesian statistics to combine salary data collected from our users with external sources. This method is particularly useful in situations in which we don’t have a lot of data. When one does not have a lot of “experience” or data, it makes sense to rely on the opinions of others. As we collect more data, we start to place more trust in the data we collect and rely less on expert input. It is interesting how closely this mathematical method mirrors the way people reason through problems. Regardless, this method has allowed us to gain a higher level of confidence in the salaries that we report.
In future blog posts, we plan to provide details on the mathematical models that Sokanu uses to provide career matches to our users.