Scaler Clustering

Problem statement

Objective

  • The objective of this project is to cluster Scaler learners based on their company, job position, experience, and compensation (CTC) to identify homogeneous groups with similar professional and salary characteristics.
  • Both manual clustering (business-rule based) and unsupervised learning techniques (K-Means and Hierarchical Clustering) are applied to:
    • Identify high-earning and low-earning learners
    • Benchmark salaries across companies and roles
    • Provide actionable insights to learners and Scaler’s business team

Import Libraries & Load Data

In [2]:
!pip install kneed
Collecting kneed
  Downloading kneed-0.8.5-py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: numpy>=1.14.2 in /usr/local/lib/python3.12/dist-packages (from kneed) (2.0.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from kneed) (1.16.3)
Downloading kneed-0.8.5-py3-none-any.whl (10 kB)
Installing collected packages: kneed
Successfully installed kneed-0.8.5
In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import re
from datetime import datetime

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import silhouette_score
from sklearn.utils import resample
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

import warnings
warnings.filterwarnings('ignore')

from scipy.stats import kruskal, spearmanr, pearsonr, mannwhitneyu

from kneed import KneeLocator
In [4]:
!gdown 1Sg45aJZsKNEZdOSjJ5BnNoPhpBgyQf0L
Downloading...
From: https://drive.google.com/uc?id=1Sg45aJZsKNEZdOSjJ5BnNoPhpBgyQf0L
To: /content/scaler_clustering.csv
100% 24.7M/24.7M [00:00<00:00, 47.9MB/s]
In [5]:
df = pd.read_csv('/content/scaler_clustering.csv')

Basic EDA

Preliminary analysis

In [6]:
df.shape
Out[6]:
(205843, 7)
In [7]:
df.info()
df.describe(include='all')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205843 entries, 0 to 205842
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        205843 non-null  int64  
 1   company_hash      205799 non-null  object 
 2   email_hash        205843 non-null  object 
 3   orgyear           205757 non-null  float64
 4   ctc               205843 non-null  int64  
 5   job_position      153279 non-null  object 
 6   ctc_updated_year  205843 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 11.0+ MB
Out[7]:
Unnamed: 0 company_hash email_hash orgyear ctc job_position ctc_updated_year
count 205843.000000 205799 205843 205757.000000 2.058430e+05 153279 205843.000000
unique NaN 37299 153443 NaN NaN 1016 NaN
top NaN nvnv wgzohrnvzwj otqcxwto bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7... NaN NaN Backend Engineer NaN
freq NaN 8337 10 NaN NaN 43554 NaN
mean 103273.941786 NaN NaN 2014.882750 2.271685e+06 NaN 2019.628231
std 59741.306484 NaN NaN 63.571115 1.180091e+07 NaN 1.325104
min 0.000000 NaN NaN 0.000000 2.000000e+00 NaN 2015.000000
25% 51518.500000 NaN NaN 2013.000000 5.300000e+05 NaN 2019.000000
50% 103151.000000 NaN NaN 2016.000000 9.500000e+05 NaN 2020.000000
75% 154992.500000 NaN NaN 2018.000000 1.700000e+06 NaN 2021.000000
max 206922.000000 NaN NaN 20165.000000 1.000150e+09 NaN 2021.000000

👁️ Observation

  • Dataset contains both categorical (Company, Job_position) and numerical (CTC, orgyear) variables
  • Email_hash and Company_hash are anonymized identifiers
  • Unnamed: 0 is an index column → can be dropped
In [8]:
df.drop(columns=['Unnamed: 0'], inplace=True)

Unique Email Analysis

In [9]:
df['email_hash'].nunique()
Out[9]:
153443
In [10]:
df['email_hash'].value_counts().head(10)
Out[10]:
count
email_hash
bbace3cc586400bbc65765bc6a16b77d8913836cfc98b77c05488f02f5714a4b 10
3e5e49daa5527a6d5a33599b238bf9bf31e85b9efa9a94f1c88c5e15a6f31378 9
298528ce3160cc761e4dc37a07337ee2e0589df251d73645aae209b010210eee 9
6842660273f70e9aa239026ba33bfe82275d6ab0d20124021b952b5bc3d07e6c 9
d598d6f1fb21b45593c2afc1c2f76ae9f4cb7167156cdf93246d4192a89d8065 8
b4d5afa09bec8689017d8b29701b80d664ca37b83cb883376b2e95191320da66 8
d15041f58bb01c8ee29f72e33b136e26bc32f3169a40b53d75fe7ae9cbb9a551 8
c0eb129061675da412b0deb15871dd06ef0d7cd86eb5f7e8cc6a20b0d1938183 8
4818edfd67ed8563dde5d083306485d91d19f4f1c95d193a1700e79dd245b75c 8
faf40195f8c58d5c7edc758cc725a762d51920da996410b80ac4a4d85c803da0 8

👁️ Observation

  • Multiple records for the same email indicate job switches, promotions, or salary updates.
  • This justifies clustering at the learner-record level, not unique learner level.

CTC checking

In [11]:
df['ctc'].describe()
Out[11]:
ctc
count 2.058430e+05
mean 2.271685e+06
std 1.180091e+07
min 2.000000e+00
25% 5.300000e+05
50% 9.500000e+05
75% 1.700000e+06
max 1.000150e+09

In [12]:
df['ctc'].min(), df['ctc'].max()
Out[12]:
(np.int64(2), np.int64(1000150000))
In [13]:
df[df['ctc'] < 200]
Out[13]:
company_hash email_hash orgyear ctc job_position ctc_updated_year
54820 uqvpqxnx voogwxvnto 8786759b95d673466e94f62f1b15e4f8c6bd7de6164074... 2020.0 24 Other 2020.0
91552 ftm ongqt 512f761579fb116e215cabc9821c7f81153f0763e16018... 2016.0 25 Android Engineer 2018.0
114164 xzntqcxtfmxn 23ad96d6b6f1ecf554a52f6e9b61677c7d73d8a409a143... 2013.0 14 NaN 2018.0
118236 xzntqcxtfmxn f2b58aeed3c074652de2cfd3c0717a5d21d6fbcf342a78... 2013.0 6 NaN 2018.0
135435 xzntqcxtfmxn 3505b02549ebe2c95840ac6f0a35561a3b4cbe4b79cdb1... 2014.0 2 Backend Engineer 2019.0
183804 xm 75357254a31f133e2d3870057922feddeba82b88056a07... 2019.0 16 NaN 2018.0
184946 xm b8a0bb340583936b5a7923947e9aec21add5ebc50cd60b... 2016.0 15 NaN 2018.0

CTC is stored in mixed units:

  • Some values are in lakhs (2, 6, 14, 15)
  • Some values are in thousands (200, 600, 1000)
  • Some are in actual amounts (1,000,000+)

Standardizing CTC

In [14]:
a = set(df['ctc'].unique())
a
Out[14]:
{np.int64(5120000),
 np.int64(2048000),
 np.int64(2),
 np.int64(6),
 np.int64(8200),
 np.int64(213000),
 np.int64(2261000),
 np.int64(1237000),
 np.int64(14),
 np.int64(15),
 np.int64(1450000),
 np.int64(16400),
 np.int64(426000),
 np.int64(16),
 np.int64(24),
 np.int64(25),
 np.int64(639000),
 np.int64(1663000),
 np.int64(1875999),
 np.int64(2900000),
 np.int64(100180000),
 np.int64(852000),
 np.int64(1065000),
 np.int64(41000),
 np.int64(147500),
 np.int64(4350000),
 np.int64(1278000),
 np.int64(254000),
 np.int64(467000),
 np.int64(2515000),
 np.int64(680000),
 np.int64(185000000),
 np.int64(5800000),
 np.int64(2728000),
 np.int64(893000),
 np.int64(3965000),
 np.int64(7250000),
 np.int64(2130000),
 np.int64(82000),
 np.int64(1106000),
 np.int64(3154000),
 np.int64(1319000),
 np.int64(3367000),
 np.int64(295000),
 np.int64(401500),
 np.int64(3579999),
 np.int64(8700000),
 np.int64(3580000),
 np.int64(1532000),
 np.int64(98400),
 np.int64(508000),
 np.int64(1745000),
 np.int64(721000),
 np.int64(8300),
 np.int64(5030000),
 np.int64(934000),
 np.int64(2982000),
 np.int64(16500),
 np.int64(123000),
 np.int64(1147000),
 np.int64(1360000),
 np.int64(336000),
 np.int64(11600000),
 np.int64(37200000),
 np.int64(2384000),
 np.int64(6480000),
 np.int64(1573000),
 np.int64(3621000),
 np.int64(13050000),
 np.int64(2810000),
 np.int64(762000),
 np.int64(975000),
 np.int64(14500000),
 np.int64(164000),
 np.int64(1188000),
 np.int64(3236000),
 np.int64(4260000),
 np.int64(377000),
 np.int64(2425000),
 np.int64(1401000),
 np.int64(590000),
 np.int64(1614000),
 np.int64(131250),
 np.int64(802999),
 np.int64(803000),
 np.int64(2039999),
 np.int64(2040000),
 np.int64(1016000),
 np.int64(1015999),
 np.int64(43000000),
 np.int64(68600000),
 np.int64(17400000),
 np.int64(204999),
 np.int64(205000),
 np.int64(200),
 np.int64(1229000),
 np.int64(2253000),
 np.int64(90530000),
 np.int64(3490000),
 np.int64(8400),
 np.int64(418000),
 np.int64(2572500),
 np.int64(1442000),
 np.int64(3703000),
 np.int64(631000),
 np.int64(1655000),
 np.int64(4940000),
 np.int64(844000),
 np.int64(639200),
 np.int64(10060000),
 np.int64(33000),
 np.int64(1057000),
 np.int64(33004),
 np.int64(1270000),
 np.int64(6390000),
 np.int64(1679600),
 np.int64(246000),
 np.int64(1483000),
 np.int64(1278200),
 np.int64(459000),
 np.int64(100000000),
 np.int64(672000),
 np.int64(2720000),
 np.int64(57600),
 np.int64(885000),
 np.int64(2933000),
 np.int64(34890000),
 np.int64(74000),
 np.int64(4170000),
 np.int64(2122000),
 np.int64(1098000),
 np.int64(1311000),
 np.int64(6431000),
 np.int64(2335000),
 np.int64(4383000),
 np.int64(499999),
 np.int64(500000),
 np.int64(5620000),
 np.int64(1524000),
 np.int64(8480),
 np.int64(1737000),
 np.int64(713000),
 np.int64(300),
 np.int64(1950000),
 np.int64(7070000),
 np.int64(926000),
 np.int64(8500),
 np.int64(115000),
 np.int64(221500),
 np.int64(100680000),
 np.int64(3400000),
 np.int64(29000000),
 np.int64(328000),
 np.int64(8520000),
 np.int64(157000000),
 np.int64(1352000),
 np.int64(2376000),
 np.int64(1565000),
 np.int64(541000),
 np.int64(82250),
 np.int64(434500),
 np.int64(4850000),
 np.int64(754000),
 np.int64(967000),
 np.int64(3015000),
 np.int64(147805),
 np.int64(6300000),
 np.int64(1180000),
 np.int64(156000),
 np.int64(8348000),
 np.int64(57500000),
 np.int64(31900000),
 np.int64(369000),
 np.int64(9585000),
 np.int64(2630000),
 np.int64(7750000),
 np.int64(1401200),
 np.int64(1606000),
 np.int64(582000),
 np.int64(795000),
 np.int64(1819000),
 np.int64(4079999),
 np.int64(9200000),
 np.int64(1008000),
 np.int64(4080000),
 np.int64(11248000),
 np.int64(2245000),
 np.int64(197000),
 np.int64(409999),
 np.int64(410000),
 np.int64(10650000),
 np.int64(57750),
 np.int64(5743000),
 np.int64(8600),
 np.int64(623000),
 np.int64(1860000),
 np.int64(836000),
 np.int64(16800),
 np.int64(12100000),
 np.int64(25000),
 np.int64(18670000),
 np.int64(3310000),
 np.int64(2286000),
 np.int64(8430000),
 np.int64(238000),
 np.int64(1475000),
 np.int64(451000),
 np.int64(15000000),
 np.int64(4760000),
 np.int64(664000),
 np.int64(1688000),
 np.int64(877000),
 np.int64(2925000),
 np.int64(1090000),
 np.int64(66000),
 np.int64(2327000),
 np.int64(43500000),
 np.int64(2540000),
 np.int64(120300000),
 np.int64(492000),
 np.int64(1516000),
 np.int64(705000),
 np.int64(1729000),
 np.int64(2966000),
 np.int64(101270000),
 np.int64(918000),
 np.int64(3990000),
 np.int64(500),
 np.int64(107000),
 np.int64(1131000),
 np.int64(8700),
 np.int64(123200000),
 np.int64(320000),
 np.int64(200000000),
 np.int64(72000000),
 np.int64(1344000),
 np.int64(2368000),
 np.int64(5440000),
 np.int64(533000),
 np.int64(1769999),
 np.int64(1770000),
 np.int64(746000),
 np.int64(1982999),
 np.int64(3007000),
 np.int64(3220000),
 np.int64(100500000),
 np.int64(148000),
 np.int64(1172000),
 np.int64(361000),
 np.int64(1385000),
 np.int64(467500),
 np.int64(574000),
 np.int64(3646000),
 np.int64(787000),
 np.int64(999999),
 np.int64(1000000),
 np.int64(52200000),
 np.int64(82500),
 np.int64(1213000),
 np.int64(189000),
 np.int64(1000008),
 np.int64(401999),
 np.int64(2450000),
 np.int64(402000),
 np.int64(1426000),
 np.int64(615000),
 np.int64(600),
 np.int64(3899999),
 np.int64(8800),
 np.int64(3900000),
 np.int64(1852000),
 np.int64(828000),
 np.int64(827999),
 np.int64(17000),
 np.int64(1041000),
 np.int64(123500),
 np.int64(229999),
 np.int64(230000),
 np.int64(5350000),
 np.int64(1254000),
 np.int64(443000),
 np.int64(98937),
 np.int64(1680000),
 np.int64(58000000),
 np.int64(6800000),
 np.int64(32400000),
 np.int64(656000),
 np.int64(2917000),
 np.int64(58000),
 np.int64(3130000),
 np.int64(8250000),
 np.int64(1082000),
 np.int64(100410000),
 np.int64(2106000),
 np.int64(1295000),
 np.int64(35300000),
 np.int64(4580000),
 np.int64(40420000),
 np.int64(2532000),
 np.int64(484000),
 np.int64(9700000),
 np.int64(1508000),
 np.int64(1721000),
 np.int64(909999),
 np.int64(910000),
 np.int64(6030000),
 np.int64(1934000),
 np.int64(11150000),
 np.int64(99000),
 np.int64(1123000),
 np.int64(115000000),
 np.int64(2360000),
 np.int64(12600000),
 np.int64(312000),
 np.int64(1336000),
 np.int64(8900),
 np.int64(525000),
 np.int64(1549000),
 np.int64(3810000),
 np.int64(738000),
 np.int64(1762000),
 np.int64(2786000),
 np.int64(1975000),
 np.int64(951000),
 np.int64(811000),
 np.int64(140000),
 np.int64(199820000),
 np.int64(15500000),
 np.int64(5260000),
 np.int64(353000),
 np.int64(3425000),
 np.int64(49900),
 np.int64(1590000),
 np.int64(566000),
 np.int64(3638000),
 np.int64(2827000),
 np.int64(779000),
 np.int64(992000),
 np.int64(3040000),
 np.int64(44000000),
 np.int64(2016000),
 np.int64(181000),
 np.int64(1205000),
 np.int64(4490000),
 np.int64(394000),
 np.int64(1418000),
 np.int64(90900),
 np.int64(607000),
 np.int64(819999),
 np.int64(820000),
 np.int64(72500000),
 np.int64(5940000),
 np.int64(52020000),
 np.int64(1844000),
 np.int64(9000),
 np.int64(1033000),
 np.int64(2269999),
 np.int64(2270000),
 np.int64(1246000),
 np.int64(7390000),
 np.int64(222000),
 np.int64(434999),
 np.int64(435000),
 np.int64(1459000),
 np.int64(101000000),
 np.int64(3720000),
 np.int64(1672000),
 np.int64(8840000),
 np.int64(648000),
 np.int64(33600),
 np.int64(24200000),
 np.int64(861000),
 np.int64(1885000),
 np.int64(50000),
 np.int64(1074000),
 np.int64(5170000),
 np.int64(3122000),
 np.int64(1287000),
 np.int64(1500000),
 np.int64(476000),
 np.int64(2524000),
 np.int64(2737000),
 np.int64(1713000),
 np.int64(3761000),
 np.int64(901999),
 np.int64(100230000),
 np.int64(2950000),
 np.int64(91000),
 np.int64(1115000),
 np.int64(4400000),
 np.int64(30000000),
 np.int64(106800000),
 np.int64(1328000),
 np.int64(304000),
 np.int64(2352000),
 np.int64(517000),
 np.int64(2565000),
 np.int64(9100),
 np.int64(730000),
 np.int64(5850000),
 np.int64(9946000),
 np.int64(132000),
 np.int64(7300000),
 np.int64(2180000),
 np.int64(1156000),
 np.int64(58500000),
 np.int64(345000),
 np.int64(2393000),
 np.int64(1369000),
 np.int64(3629999),
 np.int64(100910000),
 np.int64(3630000),
 np.int64(1795000),
 np.int64(771000),
 np.int64(983999),
 np.int64(10200000),
 np.int64(87000000),
 np.int64(5080000),
 np.int64(66500),
 np.int64(173000),
 np.int64(33740),
 np.int64(1410000),
 np.int64(386000),
 np.int64(11650000),
 np.int64(2647000),
 np.int64(811999),
 np.int64(2860000),
 np.int64(13100000),
 np.int64(1835999),
 np.int64(89900000),
 np.int64(1000),
 np.int64(1025000),
 np.int64(11265000),
 np.int64(107500),
 np.int64(9200),
 np.int64(40150000),
 np.int64(4310000),
 np.int64(214000),
 np.int64(426999),
 np.int64(17400),
 np.int64(2475000),
 np.int64(1451000),
 np.int64(640000),
 np.int64(16000000),
 np.int64(41600000),
 np.int64(5760000),
 np.int64(1664000),
 np.int64(25600),
 np.int64(852999),
 np.int64(2089999),
 np.int64(2090000),
 np.int64(42000),
 np.int64(1066000),
 np.int64(230421),
 np.int64(254999),
 np.int64(255000),
 np.int64(3327000),
 np.int64(8660000),
 np.int64(3540000),
 np.int64(468000),
 np.int64(2516000),
 np.int64(1492000),
 np.int64(1705000),
 np.int64(681000),
 np.int64(20350000),
 np.int64(894000),
 np.int64(83000),
 np.int64(2131000),
 np.int64(1107000),
 np.int64(1320000),
 np.int64(73000000),
 np.int64(6440000),
 np.int64(2770000),
 np.int64(722000),
 np.int64(1746000),
 np.int64(9300),
 np.int64(935000),
 np.int64(17500),
 np.int64(50300000),
 np.int64(101500000),
 np.int64(4220000),
 np.int64(124000),
 np.int64(24700000),
 np.int64(9340000),
 np.int64(2172000),
 np.int64(337000),
 np.int64(3409000),
 np.int64(2385000),
 np.int64(550000),
 np.int64(1574000),
 np.int64(3835000),
 np.int64(763000),
 np.int64(1999999),
 np.int64(155600000),
 np.int64(7120000),
 np.int64(2000000),
 np.int64(130000000),
 np.int64(27600000),
 np.int64(165000),
 np.int64(3450000),
 np.int64(378000),
 np.int64(8570000),
 np.int64(1402000),
 np.int64(1614999),
 np.int64(1615000),
 np.int64(803999),
 np.int64(4900000),
 np.int64(10020000),
 np.int64(2852000),
 np.int64(1017000),
 np.int64(1230000),
 np.int64(6350000),
 np.int64(1200),
 np.int64(206000),
 np.int64(11470000),
 np.int64(419000),
 np.int64(1443000),
 np.int64(7800000),
 np.int64(33400000),
 np.int64(2680000),
 np.int64(59000000),
 np.int64(632000),
 np.int64(17600),
 np.int64(844999),
 np.int64(25800),
 np.int64(845000),
 np.int64(4129999),
 np.int64(34000),
 np.int64(1058000),
 np.int64(4130000),
 np.int64(91347),
 np.int64(140500),
 np.int64(1271000),
 np.int64(247000),
 np.int64(459999),
 np.int64(460000),
 np.int64(61900000),
 np.int64(189900000),
 np.int64(10700000),
 np.int64(369888),
 np.int64(5580000),
 np.int64(2508000),
 np.int64(673000),
 np.int64(5793000),
 np.int64(1910000),
 np.int64(886000),
 np.int64(7030000),
 np.int64(75000),
 np.int64(3360000),
 np.int64(288000),
 np.int64(1312000),
 np.int64(90400000),
 np.int64(13600000),
 np.int64(501000),
 np.int64(1525000),
 np.int64(5621000),
 np.int64(2549000),
 np.int64(4810000),
 np.int64(714000),
 np.int64(1300),
 np.int64(2975000),
 np.int64(9500),
 np.int64(1140000),
 np.int64(16500000),
 np.int64(116000),
 np.int64(2164000),
 np.int64(4425000),
 np.int64(2590000),
 np.int64(1566000),
 np.int64(542000),
 np.int64(19998000),
 np.int64(755000),
 np.int64(4040000),
 np.int64(45000000),
 np.int64(968000),
 np.int64(157000),
 np.int64(1181000),
 np.int64(370000),
 np.int64(1394000),
 np.int64(583000),
 np.int64(1607000),
 np.int64(1820000),
 np.int64(22300000),
 np.int64(6940000),
 np.int64(2032999),
 np.int64(1009000),
 np.int64(2033000),
 np.int64(100550000),
 np.int64(198000),
 np.int64(3270000),
 np.int64(1222000),
 np.int64(1435000),
 np.int64(127600000),
 np.int64(50800000),
 np.int64(25200000),
 np.int64(9600),
 np.int64(102000000),
 np.int64(4720000),
 np.int64(624000),
 np.int64(1648000),
 np.int64(4933000),
 np.int64(6768000),
 np.int64(837000),
 np.int64(1050000),
 np.int64(6170000),
 np.int64(26000),
 np.int64(1263000),
 np.int64(345500),
 np.int64(451999),
 np.int64(2500000),
 np.int64(7620000),
 np.int64(1440),
 np.int64(452000),
 np.int64(665000),
 np.int64(75180),
 np.int64(877999),
 np.int64(3950000),
 np.int64(878000),
 np.int64(58800),
 np.int64(67000),
 np.int64(2115000),
 np.int64(1091000),
 np.int64(56600000),
 np.int64(31000000),
 np.int64(280000),
 np.int64(5400000),
 np.int64(20760000),
 np.int64(199960000),
 np.int64(181700),
 np.int64(3565000),
 np.int64(493000),
 np.int64(1730000),
 np.int64(6850000),
 np.int64(706000),
 np.int64(2754000),
 np.int64(812500),
 np.int64(919000),
 np.int64(1500),
 np.int64(8300000),
 np.int64(3180000),
 np.int64(108000),
 np.int64(2156000),
 np.int64(9700),
 np.int64(1132000),
 np.int64(1345000),
 np.int64(17900),
 np.int64(534000),
 np.int64(4630000),
 np.int64(75250),
 np.int64(1558000),
 np.int64(1771000),
 np.int64(959999),
 np.int64(960000),
 np.int64(11200000),
 np.int64(88000000),
 np.int64(1984000),
 np.int64(6293000),
 np.int64(149000),
 np.int64(2410000),
 np.int64(362000),
 np.int64(1386000),
 np.int64(3434000),
 np.int64(575000),
 np.int64(2623000),
 np.int64(14100000),
 np.int64(3860000),
 np.int64(65300000),
 np.int64(788000),
 np.int64(2025000),
 np.int64(1001000),
 np.int64(83500),
 np.int64(190000),
 np.int64(5310000),
 np.int64(1214000),
 np.int64(402999),
 np.int64(403000),
 np.int64(1427000),
 np.int64(1639999),
 np.int64(17000000),
 np.int64(1640000),
 np.int64(616000),
 np.int64(2664000),
 np.int64(1600),
 np.int64(2877000),
 np.int64(10045000),
 np.int64(18000),
 np.int64(3090000),
 np.int64(18450000),
 np.int64(58960),
 np.int64(1255000),
 np.int64(231000),
 np.int64(45500000),
 np.int64(5564000),
 np.int64(101820000),
 np.int64(444000),
 np.int64(50620000),
 np.int64(42600),
 np.int64(657000),
 np.int64(869999),
 np.int64(870000),
 np.int64(5990000),
 np.int64(1650),
 np.int64(1894000),
 np.int64(2107000),
 np.int64(59000),
 np.int64(1083000),
 np.int64(2320000),
 np.int64(272000),
 np.int64(17680000),
 np.int64(1296000),
 np.int64(99600000),
 np.int64(484999),
 np.int64(2533000),
 np.int64(485000),
 np.int64(101050000),
 np.int64(3770000),
 np.int64(698000),
 np.int64(1722000),
 np.int64(910999),
 np.int64(1935000),
 np.int64(911000),
 np.int64(100000),
 np.int64(102500000),
 np.int64(10340000),
 np.int64(5220000),
 np.int64(1124000),
 np.int64(2148000),
 np.int64(1337000),
 np.int64(313000),
 np.int64(108200),
 np.int64(3385000),
 np.int64(9900),
 np.int64(1550000),
 np.int64(526000),
 np.int64(3000000),
 np.int64(28600000),
 np.int64(952000),
 np.int64(1165000),
 np.int64(141000),
 np.int64(2189000),
 np.int64(354000),
 np.int64(4450000),
 np.int64(1378000),
 np.int64(567000),
 np.int64(2615000),
 np.int64(780000),
 np.int64(5900000),
 np.int64(1804000),
 np.int64(31500000),
 np.int64(2017000),
 np.int64(7137000),
 np.int64(2230000),
 np.int64(182000),
 np.int64(1206000),
 np.int64(1312500),
 np.int64(395000),
 np.int64(501500),
 np.int64(3679999),
 np.int64(8800000),
 np.int64(60000000),
 np.int64(188000000),
 np.int64(608000),
 np.int64(2656000),
 np.int64(3680000),
 np.int64(1632000),
 np.int64(6752000),
 np.int64(821000),
 np.int64(1800),
 np.int64(1845000),
 np.int64(10000),
 np.int64(10250000),
 np.int64(1034000),
 np.int64(2058000),
 np.int64(223000),
 np.int64(1460000),
 np.int64(57780000),
 np.int64(436000),
 np.int64(11700000),
 np.int64(88500000),
 np.int64(6580000),
 np.int64(649000),
 np.int64(1673000),
 np.int64(861999),
 np.int64(2910000),
 np.int64(1075000),
 np.int64(51000),
 np.int64(264000),
 np.int64(91400000),
 np.int64(1288000),
 np.int64(4360000),
 np.int64(476999),
 np.int64(2525000),
 np.int64(1501000),
 np.int64(690000),
 np.int64(5810000),
 np.int64(902999),
 np.int64(1107800),
 np.int64(1927000),
 np.int64(3975000),
 np.int64(92000),
 np.int64(2140000),
 np.int64(17500000),
 np.int64(1116000),
 np.int64(198500),
 np.int64(43100000),
 np.int64(68700000),
 np.int64(305000),
 np.int64(2353000),
 np.int64(5425000),
 np.int64(1900),
 np.int64(518000),
 np.int64(1542000),
 np.int64(3590000),
 np.int64(731000),
 np.int64(1755000),
 np.int64(18300),
 np.int64(5040000),
 np.int64(944000),
 np.int64(71600000),
 np.int64(26500),
 np.int64(133000),
 np.int64(1370000),
 np.int64(346000),
 np.int64(559000),
 np.int64(100100000),
 np.int64(2820000),
 np.int64(1796000),
 np.int64(772000),
 np.int64(984999),
 np.int64(985000),
 np.int64(67500),
 np.int64(174000),
 np.int64(1198000),
 np.int64(4270000),
 np.int64(2222000),
 np.int64(280500),
 np.int64(387000),
 np.int64(493499),
 np.int64(1623999),
 np.int64(600000),
 np.int64(103000000),
 np.int64(51800000),
 np.int64(26200000),
 np.int64(2861000),
 np.int64(813000),
 np.int64(1837000),
 np.int64(2000),
 np.int64(2050000),
 np.int64(1026000),
 np.int64(7170000),
 np.int64(108500),
 np.int64(10200),
 np.int64(215000),
 np.int64(3647448),
 np.int64(2014),
 np.int64(3500000),
 np.int64(23980000),
 np.int64(428000),
 np.int64(2476000),
 np.int64(1664999),
 np.int64(1665000),
 np.int64(641000),
 np.int64(6785000),
 np.int64(26600),
 np.int64(853999),
 np.int64(4950000),
 np.int64(854000),
 np.int64(1067000),
 np.int64(43000),
 np.int64(3115000),
 np.int64(1280000),
 np.int64(6400000),
 np.int64(160000000),
 np.int64(32000000),
 np.int64(1493000),
 np.int64(2730000),
 np.int64(682000),
 np.int64(67600),
 np.int64(100010000),
 np.int64(894999),
 np.int64(9300000),
 np.int64(84000),
 np.int64(2132000),
 np.int64(1108000),
 np.int64(142420000),
 np.int64(10274),
 np.int64(4180000),
 np.int64(2345000),
 np.int64(509999),
 np.int64(510000),
 np.int64(3582000),
 np.int64(1534000),
 np.int64(2558000),
 np.int64(2100),
 np.int64(5630000),
 np.int64(1747000),
 np.int64(723000),
 np.int64(10300),
 np.int64(1959999),
 np.int64(12200000),
 np.int64(1960000),
 np.int64(5032000),
 np.int64(7080000),
 np.int64(18500),
 np.int64(936000),
 np.int64(125000),
 np.int64(1149000),
 np.int64(1362000),
 np.int64(8530000),
 np.int64(338000),
 np.int64(3410000),
 np.int64(1575000),
 np.int64(551000),
 np.int64(2140),
 np.int64(764000),
 np.int64(1788000),
 np.int64(4860000),
 np.int64(2000999),
 np.int64(977000),
 np.int64(3025000),
 np.int64(59500),
 np.int64(10350),
 np.int64(1190000),
 np.int64(166000),
 np.int64(6310000),
 np.int64(18550),
 np.int64(1403000),
 np.int64(379000),
 np.int64(4475000),
 np.int64(26750),
 np.int64(18000000),
 np.int64(2640000),
 np.int64(43600000),
 np.int64(99920000),
 np.int64(592000),
 np.int64(1616000),
 np.int64(805000),
 np.int64(4090000),
 np.int64(9210000),
 np.int64(1018000),
 np.int64(207000),
 np.int64(2255000),
 np.int64(3279000),
 np.int64(1231000),
 np.int64(420000),
 np.int64(1444000),
 np.int64(46500000),
 np.int64(18600),
 np.int64(633000),
 np.int64(1869999),
 np.int64(1870000),
 np.int64(26800),
 np.int64(846000),
 np.int64(35000),
 np.int64(75000000),
 np.int64(100600000),
 np.int64(23800000),
 np.int64(3320000),
 np.int64(248000),
 np.int64(1272000),
 np.int64(1485000),
 np.int64(674000),
 np.int64(1501400),
 np.int64(887000),
 np.int64(1100000),
 np.int64(52300000),
 np.int64(76000),
 np.int64(26700000),
 np.int64(1313000),
 np.int64(18670),
 np.int64(501999),
 np.int64(2550000),
 np.int64(502000),
 np.int64(2290),
 np.int64(1526000),
 np.int64(3786999),
 np.int64(715000),
 np.int64(2300),
 np.int64(927999),
 np.int64(4000000),
 np.int64(80800000),
 np.int64(55200000),
 np.int64(928000),
 np.int64(10500),
 np.int64(117000),
 np.int64(2165000),
 np.int64(2066700),
 np.int64(330000),
 np.int64(5450000),
 np.int64(2378000),
 np.int64(1354000),
 np.int64(90442000),
 np.int64(10519),
 np.int64(543000),
 np.int64(1779999),
 np.int64(1780000),
 np.int64(6900000),
 np.int64(58100000),
 np.int64(32500000),
 np.int64(862500),
 ...}
In [15]:
def convert_ctc_to_lpa(ctc):
    """
    Converts CTC into Lakhs Per Annum (LPA)
    """

    if pd.isna(ctc):
        return np.nan

    # Monthly salary → convert to annual
    if ctc < 100000:
        annual_ctc = ctc * 12

    # Annual salary
    elif ctc <= 50000000:
        annual_ctc = ctc

    # Unrealistic values → mark as NaN
    else:
        return np.nan

    # Convert to LPA
    return annual_ctc / 100000


df['ctc_lpa'] = df['ctc'].apply(convert_ctc_to_lpa)

During EDA, observed that the raw CTC values were recorded in inconsistent units (monthly and annual), so we standardized them into Lakhs Per Annum using domain-driven rules before performing any outlier analysis or clustering.

Missing Values Handling (Mean / KNN Imputation)

In [16]:
df.isnull().sum()
Out[16]:
0
company_hash 44
email_hash 0
orgyear 86
ctc 0
job_position 52564
ctc_updated_year 0
ctc_lpa 1454

Strategy:

Numerical → Mean or KNN Imputation

Categorical → Most frequent

In [17]:
num_cols = ['ctc_lpa', 'orgyear', 'ctc_updated_year']
cat_cols = ['company_hash', 'job_position']
In [18]:
num_imputer = KNNImputer(n_neighbors=5)
cat_imputer = SimpleImputer(strategy='most_frequent')

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

Checking the years

In [19]:
invalid_years = df[df['ctc_updated_year'] < df['orgyear']]
invalid_years.shape
Out[19]:
(8847, 7)
In [20]:
invalid_years.sample(5)
Out[20]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa
20278 zgn vuurxwvmrt vwwghzn 4bc6fb265911f8f98622c91bb76469ea9a422b3e293bd7... 2020.0 640000 Frontend Engineer 2019.0 6.40
12294 ugrxwjmvkvvqwgb 24e229fc825f8892225e333bc9fe5ca53c75c5c27ddfcc... 2022.0 5000000 Co-founder 2020.0 50.00
179705 eqtoytq 51192bd41dff3a3ffad14f3525e98549b0e94a3e340f02... 2020.0 2600000 Backend Engineer 2019.0 26.00
39204 x vb v eqtoytq 48b00207f75dd25ca9d518103e2ddc3c9a9706e51ae393... 2020.0 200000000 Frontend Engineer 2019.0 6.56
74532 rtvqzwgatgzrxzt xzw 8b3a332f31177737f0491fc758ede2393a84f37d087262... 2020.0 600000 SDET 2019.0 6.00

👁️ Observation

  • Some records have CTC updated year earlier than organization joining year, which is logically inconsistent and indicates data quality issues.
  • Let assume salary update happened at joining.
In [21]:
df['ctc_updated_year'] = np.where(
    df['ctc_updated_year'] < df['orgyear'],
    df['orgyear'],
    df['ctc_updated_year']
)

We observed records where ctc_updated_year was earlier than orgyear, which is logically inconsistent. To maintain data integrity, we corrected such cases by ensuring that the CTC updated year is not earlier than the joining year.

In [22]:
invalid_years = df[df['ctc_updated_year'] < df['orgyear']]
invalid_years.shape
Out[22]:
(0, 7)

Regex Cleaning (Company & Job Position)

In [23]:
def clean_text(text):
    return re.sub('[^A-Za-z0-9 ]+', '', str(text)).strip()

df['company_hash'] = df['company_hash'].apply(clean_text)
df['job_position'] = df['job_position'].apply(clean_text)
  • Regex removes tabs, special characters, and inconsistent naming issues.
In [24]:
df['job_position_clean'] = (df['job_position'].str.lower().str.strip())
In [25]:
df['job_position_clean'].unique()
Out[25]:
array(['other', 'fullstack engineer', 'backend engineer', 'ios engineer',
       'data analyst', 'frontend engineer', 'qa engineer',
       'engineering leadership', 'data scientist', 'engineering intern',
       'android engineer', 'research engineers', 'sdet',
       'support engineer', 'security leadership', 'devops engineer',
       'product manager', 'program manager', 'cofounder',
       'member of technical staff 3', 'product designer',
       'system engineer', 'backend architect', 'database administrator',
       'senior  software engineer', 'release engineer', 'ase',
       'non coder', 'trainer', 'associate consultant', 'administrator',
       'software development engineer 2', 'sr', 'sr executive',
       'software engineer android', 'owner',
       'tool room engineer mechanical engineer', 'senior data scientist',
       'sde 2', 'project engineer', 'mean stack developer',
       'software development engineer  i', 'sde3', 'digital marketing',
       'app developer', 'principal consultant',
       'business analyst  product owner', 'senior business analyst',
       'architect team lead', 'graduate engineer  trainee',
       'technical consulting', 'machine learning engineer', 'software',
       '', 'instructor', 'worker', 'senior software engineer backend',
       'student', 'operations manager', 'senior engineer', 'toyota',
       'interndata analyst', 'senior data engineer', 'operation',
       'associate  software  engineer', 'technical manager',
       'senior sdet', 'escalation engineer', 'assistant professor',
       'software engineer i', 'exchange coordinator', 'associate',
       'senior qa engineer iii', 'process executive',
       'software engineer backend', 'assistant system engineer',
       'sde intern', 'intern', 'software development engineer  ii',
       'senior associate technology level1',
       'product development engineer', 'engineer i', 'l4',
       'solution engineer', 'application development analyst', 'typing',
       'engineer trainee', 'associate manager', 'cant reveal',
       'full stack web developer', 'sde1', 'lead analyst',
       'business consultant', 'sdet2', 'software  engineer',
       'software engineer ii', 'senior developer',
       'integration developer', 'networking', 'business analysts',
       'associate analyst', 'java devloper', 'traineeintern',
       'sales manager', 'lead engineer ii', 'webcenter consultant',
       'project associate', 'logistics supply chain',
       'data science analyst', 'software engineer 2b',
       'analyst programmer', 'software engineer 2',
       'senior software developer', 'entrepreneur',
       'atm interface system', 'sap abap developer', 'mts  ii',
       'business man', 'senior software engineeer', 'programmer analyst',
       'tech lead team lead', 'member of technical staff',
       'software engineer full stack', 'engineer', 'teacher', 'no job',
       'associate system engineer  trainee', 'front end developer',
       'user experience specialist', 'driver', 'reseller',
       'technical leader', 'senior lead analyst', 'executive operation',
       'basis administrator', 'risk analyst', 'senior system engineer',
       'sde 3', 'product engineer backend', 'catagory manager',
       'senior consultant', 'salesforce developer', 'senior associate',
       'power platform developer', 'freelance sde', 'quality enginner',
       'software engineeradvanced analytics', 'student intern',
       'automation developer', 'software engineer 1', 'project trainee',
       'engineer 1', 'senior administrator', 'deputy general manager',
       'senior software  development engineer', 'senior test engineer',
       'sde2', 'associate system engineer', 'computer scientist 1',
       'support escalation engineer', 'technical  consultant',
       'associate software engineer', 'hod', 'radio frequency engineer',
       'learning web designing', 'software analyst', 'ucm',
       'solutions architect', 'ml engineer',
       'senior staff software engineer', 'pa', 'not employed',
       'project management profession', 'pdp executive', 'fb service',
       'office management', 'risk model developer', 'nce',
       'engineering lead  ai', 'reseach analyst', 'mainframe developer',
       'software engineer  i', 'solution architect cloud and infra',
       '857628', '7', 'software developer automation', 'null',
       'erp application developer', 'ui architect', 'senior executive',
       'software development', 'research analyst', 'kam', 'maricketing',
       'ui developer intern', 'operations lead',
       'cloud software engineer', 'a group chat application',
       'senior manager it', 'sr software engineer', 'bta',
       'data engineer', 'member technical', 'senior front end engineer',
       'software developer data engineer', 'associate product lead',
       'software engineer analyst', 'programmer analyst 2',
       'senior systems engineer', 'not applicable', 'team lead',
       'hardware engineer', 'mts', 'assistant system engineer  trainee',
       'voice president', 'senior software engineer qa',
       'associate engineer', 'data eingineer', 'data analayst',
       'quality associate', 'consultant', 'sde 1',
       'software development consultant 1', 'lead software engineer',
       'programmer analyst trainee', 'mulesoft developer', 'se4',
       'cashier', 'data associate', 'compliance operation',
       'electric power supply', 'system admin', 'software engineer  ii',
       'others', 'assistant manager', 'senior web engineer',
       'application developer analyst', 'lead android developer',
       'teaching field', 'trainee', 'talent acquisition specialist',
       'finance and accounting', 'cto and co founder',
       'senior member technical staff',
       'software engineering  engineer 2', 'advisor',
       'associate processor', 'dot net developer',
       'site reliability engineer',
       'services it development program senior associate',
       'staff software engineer', 'full stack engineer', 'manager',
       'software prod  plat eng analyst', 'personal banker',
       'analytics and insite devloper', 'developer  production support',
       'platform engineer',
       'senior software development engineer in test',
       'software engineering intern', 'solutions designer',
       'fibre engineer', 'quantitative analyst', 'entrepreneurship',
       'automation engineer', 'administrative clerk', 'freelancer',
       'software development analyst', 'design specialist',
       'principal software engineer', 'loan processing', 'sr consultant',
       'principal solution architect', 'software developer intern',
       'lead  mis', 'embedded software development engineer', 'pop',
       'intern  software developer', 'assisatnt system engineer',
       'cloud security analyst', 'software developer  ios', 'sdet senior',
       'tech lead software engineer', 'lead', 'associate data scientist',
       'process associate', 'associate developer', 'sde ii',
       'technology associate', 'sr developer',
       'associate software developer', 'assistant software engineer',
       'research engineer', 'system software engineer', 'serd',
       'module lead software engineer', 'seniorsoftwareengineer',
       'sr lead engineer', 'android developer sr executive',
       'machine learning data associate', 'principal engineer  1',
       'business technology analyst', 'fresher', 'na', 'cloud',
       'computer faculty', 'employee', 'it executive', 'quality analyst',
       'safety officer',
       'some data entry operator like some copys writetype and upload',
       'managercx', 'seleceman', 'agency collection manager',
       'now iam not working waiting for job',
       'business development manager', 'qa intern', 'auditor',
       'graduate programmer', 'financial analyst',
       'cloud support associate', 'software development engineer  2',
       'senior manager', 'data engineer 2', 'cto', 'module lead',
       'business analytics', 'software development enginner',
       'automation test engineer', 'vice president', 'automation expert',
       'software developer  ui', 'software eng',
       'senior quantitative researcher', 'salesforce consultant',
       'devlopment engineer', 'senior software analyst',
       'associate product manager', 'firmware software engineer l',
       'tech architecture', 'any technical', 'young professional ii',
       'technical consultant', '896651', 'technology analyst',
       'operation executive', 'software  web developer',
       'full stack devloper', 'r  d engineer2', 'software engineer rd',
       'product develloper', 'software development intern',
       'member technical staff', 'software engineer iv',
       'operations analyst', 'scientist  c', 'software engineer iii',
       'sdei', 'senior development engineer',
       'application developmentaassociate', 'network administrator',
       'assosiate product manager', 'software developer  android',
       'lecturer', 'softwear engineer', 'engineet',
       'senior software development engineer ii', 'q',
       'software engineer  frontend', 'techops engineer 1',
       'customer service executive', 'eno',
       'system engineerfullstack developer', 'applications engineer',
       'graduate software developer', 'software developer ii',
       'summer technology analyst software developer intern',
       'senior product development engineer senior developer',
       'highway engineer', 'performance engineer',
       'senior software consultant', 'application developer frontend',
       'associate applications developer', 'software enginner',
       'sales and collection executive', 'software engineering coop',
       'software development engineer  intern', 'manual tester',
       'it analyst', 'research assistant',
       'senior fusion applications engineer', 'jr prod specialist',
       'pgt computer science', 'integration',
       'talent acquisition sspecialist', 'software developer 2',
       'asset management coordinator', 'software engineer in test',
       'front end engineer', 'leisure manager', 'pat',
       'core tech engineer  researcher', 'senior engineer software',
       'software enginnering specialist', 'android lead',
       'senior analysts', 'director engineering', 'aspnet developer',
       'product developer', 'area operations manager', 'test analyst',
       'so', 'server engineer', 'application development associate',
       'android application developer', 'risk investigator', 'x',
       'machine learning engineer intern', 'third party provider',
       'web designer', 'compliance auditor', '91',
       'senior software development engineer backend',
       'associate consultant  developer', 'messenger come driver',
       'associate technical engineer', 'senior associate platform l1',
       'sw engineer 2', 'senior web developer', 'pop engineer',
       'senior frontend developer', 'no',
       'senior member of technical staff', 'tsc', 'tx sme',
       'mechanical engineers', 'sps associate', 'staff engineermgr',
       'finance lead', 'seller support', 'data scientist ii',
       'software developer sde3', 'calculator', 'application engineer',
       'staff consultant', 'assistant engineer', 'human resource',
       'associate l1', 'rd engineer', 'test technician specialist',
       'software test engineer', 'sr technical architect', 'phd student',
       'frontend developer', 'ays', 'principal product engineer',
       'senior director of engineering', 'seller support associate',
       'software development engineer  iii', 'rd engineer ii',
       'matlab programmer', 'software development engineer intern',
       'assistant system engineer trainee', 'sdet 3',
       'associate application developer', 'project lead',
       'senior software engineerl2', 'big data developer',
       'associate technology', 'data specialist', 'pune',
       'instructional apprentice', 'trainee decision scientist',
       'teaching assistant', 'research assistant  software engineer',
       'sr solution architect', 'sr software engineerandroid',
       'sr product engineer', 'air force',
       'principal engineer and imaging leader', 'lead product engineer',
       'software engineer intern', 'sr representative', 'associate coder',
       'marketing professional', 'crc', 'specialist',
       'underwriting officer', 'fresher student',
       'business operations new associate', 'software engineerrd intern',
       'project leader', 'specialist programmer', 'smts',
       'technical support engineer', 'senior technical specialist',
       'senior mobile developer', 'vendor management', 'techinical lead',
       'key account manager', 'programmer analyst 1',
       'applied scientist ii', 'data visualization engineer', 'sde 4',
       'service delivery specialist', 'software developer 1',
       'delivery project lead', 'production engineering senior analyst',
       'database developer', 'senior chat associate', 'systems analyst',
       'oracle dba', 'solutions specialist', 'senior technical assistant',
       'web designerdeveloper intern', 'credit risk manager',
       'web application developer', 'staff developer',
       'software engineer iii  sde2', 'advisory software engineer',
       'growth manager', 'executive software developer',
       'software engineer  testing', 'sr ios engineer',
       'engineering program manager',
       'member of technical staff at nineleaps',
       'mobile application developer', 'specialist programmer pp',
       'associate data engineer', 'engineering manager ii',
       'advisory system analyat', 'customer support specialist',
       'business development executive', 'it coordinator',
       'technical head', 'asst manager', 'mts3', 'engeneering lead',
       'software development engineer 3', 'forex',
       'java microservices developer', 'teaching', 'coo', 'telar',
       'senior software engineer  infrastructure', 'software consultant',
       'web developer full stack', 'business process lead',
       'senior quantitative analyst',
       'senior business operations analyst',
       'senior applications engineer', 'software qa engineer',
       'senior ui enginner', 'senior software engineer net backend',
       'sdeii', 'product specialist', 'application engineer ii',
       'program manager intern', 'head scientist',
       'ios application developer', 'implementation engineer',
       'cofounderg', 'sdet  intern', 'sde lead', 'junior front end eng',
       'sales professional', 'account', 'computer scientist 2',
       'member of technical staff 4', 'sr associate platform l2',
       'advisory consultant  uiux expert',
       'associate director online products', 'technical associate',
       'ios software developer', 'abap developer', 'sr mis executive',
       'lead consultant technology', 'info technology sr associate',
       'associate technical support', 'lead software engineer in test',
       'qaqc engineer', 'lead ui engineer',
       'software developerapplication developer', 'se ii',
       'application developer 1', 'principle member of technical staff',
       'senior analyst', 'software engineer traineeml', 'pmo',
       'associate futures engineering team', 'researcher', 'edp admin',
       'zomato', 'software development engineer iii',
       'software development engineer', 'online reputation manager',
       'farmer', 'relationship officer', 'middleware engineer',
       'senior manager online products', 'senior tech architect', 'bda',
       'junior software developer', 'linux administrator',
       'junior developer', 'solution architect  apprentice',
       'it mmanager', 'project officer', 'bdm',
       'sap c4c functional consultant',
       'lead engineer  software development', 'graphic designer',
       'jharkhand', 'associate project manager', 'technical officer',
       'design engineer', 'recovery officer',
       'package iimplementation associate', 'erp consultant',
       'senior engineer aiml', 'assistant regional sales manager',
       'cloud data architect', 'technical program program',
       'application developer 2', 'consultant development',
       'oracle apps technical consultant', 'android facilitator',
       'internship', 'intern developer', 'power plant engineer',
       'content manager intern', 'etl developer', 'technical test lead',
       'machine learning intern', 'senior analyst programmer',
       'sr software engg  android', 'professional',
       'systems operations engineer', 'lead software developer',
       'business development', 'ba', 'sde  3', 'technology architect',
       'decision analytics associate', 'engineering team lead', 'n',
       'strategy and new initiatives', 'senior qa engineer',
       'erp associate', 'associate professional',
       'assistant manager  capital market', 'senior sofware engineer',
       'senior webmethods developer', 'principal applications engineer',
       'software engineer  2', 'data warehouse developer',
       'fusion applications engineer', 'digital content associate',
       'technology specialist', 'maintenance engineer',
       'network function virtualisation', 'software engineer 3',
       'unempolyed', 'member of technical staff java',
       'senior application engineer', 'automation architect',
       'software engineer 2 backend', 'wordpress developer',
       'oracle lead dba', 'management executive',
       'sr operation executive', 'associate professor',
       'project assistant', 'network software engineer',
       'manager candidate', 'associate professional developer',
       'technical engineer', 'member of technical staff 2',
       'lead software development engineer', 'sales', 'insurance agent',
       'data entry', 'senior solution engineer', 'flutter app engineer',
       'senior android developer', 'senior software engineer front end',
       'application development', 'retail asscociate', 'consultanat',
       'branch manager', 'tester', 'digital marketing manager',
       'senior software engineer lead', 'assistant system enginner',
       'rd enginner ii', 'software engineer  operations', 'mts2',
       'developer asociate', 'professional services engineer',
       'senior enginner', 'member technical staff 2', 'sr hr executive',
       'software developer i', 'technology analyst ii', 'professor',
       '737', 'tech person', 'engineer 2', 'test automation engineer',
       'servicenow developer', 'sr software engg',
       'application development senior analyst',
       'software engineer 2 full stack', 'quality engineer',
       'program analyst trainee', 'technician', 'ta',
       'mechanical engineering', 'security delivery specialist',
       'senior frontend engineer', 'react native developer',
       'python devloper', 'business analyst lead',
       'electrical technician', 'member technical staff zoho',
       'software developer  pl10', 'developer 2', 'technical mentor',
       'growth', 'data scientist 2', 'sde  ii',
       'software developer consultant', 'audit manager',
       'lead consultant', 'snr analyst programmerjava developer',
       'fullstack web developer', 'quality assurance manager',
       'student in computer application', 'city transport lead',
       'senior devops', 'software development engineer in test',
       'integration engineer', 'transaction risk investigator',
       'subject matter expert', 'business system analyst',
       'software engineering lead', 'firmware engineer',
       'hvac algorithm developer', 'area business manager',
       'software dev engineer 1', 'senior software engineer 1',
       'associate system engineertrainee', 'software associate',
       'automation test enginner', 'security intern',
       'research intern  machine learning', 'bengaluru',
       'security researcher', 'r  d engineer', 'junior web developer',
       'backend engineering', 'graduate research assistant',
       'embedded software engineer', 'member of technical staff mts',
       'member technical staff 1', 'member of technical staff  2',
       'software engineering contractor', 'team lead mobile',
       'applications engineer 2', 'junior associate technology',
       'credit risk', 'senior software development enggineer',
       'associate technical lead', 'senior research associate',
       'software engineering', 'sr technical manager',
       'mobile application engineerandroid  ios',
       'software development engineer 1', 'software development manager',
       'java back end developer', 'developer associate',
       'senior associate  index technology  production support',
       'senior software development engineer', 'team manager',
       'associate it engineer', 'medical field', 'security analyst',
       'back office executive admin', 'sharepoint developer',
       'deputy manager', 'international coding instructor',
       'chief people officer', 'software engineer associate',
       'software qa', 'tuition teacher', 'technology consultant',
       'assosiate software engineer', 'teammate',
       'data operations manager', 'freelance', 'software devloper',
       'assistant', 'software development specialist',
       'senior process executive', 'reasearch engineer', 'it engineer',
       'member technial staff', 'solution integration consultant i',
       'ios  swift developer', 'devopscloud engineer',
       'associate tech lead', 'head of engineering',
       'associate interactive developer l2', 'linux system administrator',
       'software development engineer iv', 'cyber security engineer',
       'operations expert', 'shift controller', 'subinspector', 'h',
       'soft skill trainer at skill india government project',
       'secretary', 'sap technical consultant', 'some random title',
       'professional application delivery i', 'senior consulant',
       'senior member of technical staff  rd  machine learning',
       'chief technology officer', 'co foundercto',
       'software engineer l2', 'machine learning developer',
       'web engineer', 'iot consultant', 'tecchnical lead',
       'software engineer   programmer analyst',
       'senior database engineer', 'tech manager',
       'principal application engineer', 'sap cpq consultant', 'clerk',
       'mts 3', 'founders office', 'research engineer 2',
       'data engineer iii', 'self job', 'it trainee',
       'application development team lead', 'mts intern',
       'software engineering analyst', 'graduate intern',
       'associate application engineer', 'assistant engineer trainee',
       'front end web developer', 'associate principal engineer',
       'junior consultant', 'senior software engineering',
       'senior system specialist', 'lead design engineer', 'sre',
       'front end dev', 'senior software test engineer', 'auditing',
       'operation associate', '7033771951', 'storw', 'support',
       'senior project engineer', 'head of analytics',
       'full stack developer intern', 'ai engineer', 'programmar analyst',
       'software developer grade1', 'analyst consultant', 'new graduate',
       'pragrammer analyst', 'member of technical staff level 2',
       'senior mobile applications developer androidios',
       'trainee engineer java developer', 'iaas  engineer',
       'cloud engineer', 'team lead  android',
       'software development engineering intern', 'technical architecct',
       'qaeintern', 'java software engineer', 'product solution engineer',
       'jr software engineer', 'software dev engineer',
       'sr ios developer', 'software engineer android intern',
       'sql plsql developer', 'graduate engineer trainee',
       'core engineer software developer', 'analyst  software engineer',
       'dataproduct engineer', 'applied scientist',
       'member technical staff iii', 'web  ui designer',
       'azure data factory'], dtype=object)
In [26]:
def map_job_category(title):
    if pd.isna(title) or title in ['', 'na', 'null', 'no', 'not applicable']:
        return 'Other / Non-Tech'

    title = title.lower()

    # Data roles
    if any(x in title for x in [
        'data scientist', 'data analyst', 'data engineer',
        'machine learning', 'ml', 'ai', 'analytics', 'applied scientist'
    ]):
        return 'Data'

    # Software engineering roles
    if any(x in title for x in [
        'software engineer', 'developer', 'sde', 'full stack',
        'backend', 'frontend', 'android', 'ios', 'mobile',
        'web developer', 'programmer', 'fullstack'
    ]):
        return 'Software Engineer'

    # QA / Testing
    if any(x in title for x in [
        'qa', 'test engineer', 'testing', 'sdet', 'automation'
    ]):
        return 'QA / Testing'

    # DevOps / Infra
    if any(x in title for x in [
        'devops', 'sre', 'site reliability', 'cloud',
        'platform engineer', 'infra', 'network'
    ]):
        return 'DevOps / Infra'

    # Product / Management
    if any(x in title for x in [
        'product manager', 'program manager', 'manager',
        'lead', 'director', 'head', 'cto', 'vp', 'architect'
    ]):
        return 'Product / Program / Manager'

    # Design roles
    if any(x in title for x in [
        'designer', 'ui', 'ux'
    ]):
        return 'Design'

    # Intern / Student
    if any(x in title for x in [
        'intern', 'trainee', 'student', 'apprentice', 'graduate'
    ]):
        return 'Intern / Student'

    # Business / Ops / Support
    if any(x in title for x in [
        'business', 'operations', 'support', 'consultant',
        'analyst', 'sales', 'marketing', 'hr'
    ]):
        return 'Business / Operations / Support'

    return 'Other / Non-Tech'
In [27]:
df['job_category'] = df['job_position_clean'].apply(map_job_category)
In [28]:
df['job_category'].value_counts()
Out[28]:
count
job_category
Software Engineer 147513
Other / Non-Tech 21699
Product / Program / Manager 9134
Data 8375
QA / Testing 6610
DevOps / Infra 4632
Business / Operations / Support 3773
Intern / Student 2781
Design 1326

In [29]:
df[df['job_position']=='Other'].shape[0]
Out[29]:
18072
In [30]:
df[df['job_category'] == 'Other / Non-Tech'].sample(5)
Out[30]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category
134173 gqvwrt f87b680307676acc9248f3b037071649fe63ad61ce6f50... 2016.0 1200000 Other 2017.0 12.00 other Other / Non-Tech
67182 btaxvztn 78ec9030de84d08430d948ddf076d19c85c64ce8adb81a... 2017.0 1325000 Other 2020.0 13.25 other Other / Non-Tech
125768 wno wgqugqvnxgz 809b2cb327d582fa1e30801e4cae518a94d27c4056b890... 2004.0 100000 Other 2021.0 1.00 other Other / Non-Tech
52516 mn ca8eeeafcb990599c2f8af02f20de9b341fedddf20332c... 2010.0 1100000 Other 2016.0 11.00 other Other / Non-Tech
67788 qtctqxt d75e72c4d6b6c34ffafa217c50fc5c1650c89651fc5eb3... 2009.0 1000000 Non Coder 2021.0 10.00 non coder Other / Non-Tech

Duplicate Handling

In [31]:
df.duplicated().sum()
Out[31]:
np.int64(8717)
In [32]:
df.drop_duplicates(inplace=True)
In [33]:
org_rows = df.shape[0]
In [34]:
df.shape
Out[34]:
(197126, 9)

Feature Engineering – Years of Experience

In [35]:
df['years_of_exp'] = df['ctc_updated_year'] - df['orgyear']

df['years_of_exp'].describe()
Out[35]:
years_of_exp
count 197126.000000
mean 4.999788
std 29.557376
min 0.000000
25% 2.000000
50% 4.000000
75% 6.000000
max 2021.000000

In [36]:
df[df['years_of_exp'] == 2021]
Out[36]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp
13424 9xntwyzgrgsj 854ff163ded87211b944dfcaebdcf9e8efa45defc9582f... 0.0 700000 Backend Engineer 2021.0 7.0 backend engineer Software Engineer 2021.0
32086 lxg ae1b500192dcd0b6c2d5c69b51f6caf19c630640c0aeb1... 0.0 600000 Backend Engineer 2021.0 6.0 backend engineer Software Engineer 2021.0
33117 mvzp ge vbtqxwv 8ef16126bd9a4691801d2830156dc5528142a45d314593... 0.0 1200000 Backend Engineer 2021.0 12.0 backend engineer Software Engineer 2021.0
40553 mqvpto xzaxv rna a0f794db04d5c13cae6f07c6ce9aee8ff731176cc4d4a1... 0.0 600000 Backend Engineer 2021.0 6.0 backend engineer Software Engineer 2021.0
56770 vngo ojzntr 536e217527d1101538c70b7001d7e28344dba87cd8e246... 0.0 400000 Backend Engineer 2021.0 4.0 backend engineer Software Engineer 2021.0
74313 ovbohzs trtwnqgzxwo b9fac647e08fe47c6d112466c338e847e82e8f24e16236... 0.0 1000000 Backend Engineer 2021.0 10.0 backend engineer Software Engineer 2021.0
100415 rgsfvqa b511b17a1aaaa822c6b5d6a04498ff3121808101697407... 0.0 1600000 Frontend Engineer 2021.0 16.0 frontend engineer Software Engineer 2021.0
101678 awm mvzp 37a5cce01b1e049307d73bbb2138fa54cbb02555c2137a... 0.0 700000 Backend Engineer 2021.0 7.0 backend engineer Software Engineer 2021.0
In [37]:
df[df['years_of_exp'] == -18146]
Out[37]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp
In [38]:
df[(df['orgyear'] > 2025) | (df['orgyear'] < 1970)].sample(10)
Out[38]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp
202210 mqvmtzatq d66f939c4318c1958be5bc9e7b70b741aa61be7493ff58... 2028.0 1300000 Backend Engineer 2028.0 13.0 backend engineer Software Engineer 0.0
167456 xzwnhqt ntwyzgrgsj 8b670b30a7588a36cbe0cea33f8c840c3ab25c649d970c... 2.0 500000 Backend Engineer 2019.0 5.0 backend engineer Software Engineer 2017.0
62717 wyvqstugxzn a1da4f131e2efea24f6d64c66519724aa5c8c5c11a23d9... 0.0 1850000 FullStack Engineer 2019.0 18.5 fullstack engineer Software Engineer 2019.0
183768 ftxzcton 4007e5caadc3f52c3e18bf2b4eacbadf17b114208c2d04... 20165.0 1560000 Backend Engineer 20165.0 15.6 backend engineer Software Engineer 0.0
33670 sgltp a8a779493f5844846ae536b0f48437cf9f0f6ebcac340f... 2107.0 3000000 Backend Engineer 2107.0 30.0 backend engineer Software Engineer 0.0
73302 xzntqcxtfmxn 91e4562ab8bab639b859082d519722a91b8e6f3d55c109... 2026.0 300000 Backend Engineer 2026.0 3.0 backend engineer Software Engineer 0.0
84882 hzxntaytvrny sqghu 82e77fdd3e43e37ec6b805bbfe624d7cd24b37cb0a4317... 3.0 540000 QA Engineer 2019.0 5.4 qa engineer QA / Testing 2016.0
94689 xzntr otwhqxnj bwvett ogenfvqt 7126263909db5c37b278b05ea56740688062821f452ef6... 5.0 2720000 QA Engineer 2019.0 27.2 qa engineer QA / Testing 2014.0
143238 wxowg 19bd55263d4e8cb5659b50814dc18b1fbdd1b7e0689e4b... 5.0 1000000 Backend Engineer 2019.0 10.0 backend engineer Software Engineer 2014.0
19920 zgn vuurxwvmrt vwwghzn 6aa38b497c73367a7dd6eafb95bdd5b07cca83ed14c588... 2026.0 500000 Backend Engineer 2026.0 5.0 backend engineer Software Engineer 0.0
  • From the above it seems that there is an error in the org year. We will remove these by doing outlier treatment.

Outlier treatment

In [39]:
def outlier_treatment(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3-q1

    lb = q1 - (1.5*iqr)
    ub = q3 + (1.5*iqr)

    outliers = data[(data[col]<lb) | (data[col]>ub)]

    print(f"Column: {col}")
    print(f"Lower Bound: {lb}")
    print(f"Upper Bound: {ub}")
    print(f"Number of outliers: {outliers.shape[0]}")
    print("\nOutliers:")
    print(outliers)
In [40]:
def remove_outliers(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1

    lb = q1 - 1.5 * iqr
    ub = q3 + 1.5 * iqr

    before = data.shape[0]

    data = data[(data[col] >= lb) & (data[col] <= ub)]

    after = data.shape[0]

    print(f"Number of rows before removal from {col}: {before}")
    print(f"Outliers removed from {col}: {before - after}")
    print(f"Number of rows after removal from {col}: {after}\n")
    return data
In [41]:
for col in num_cols:
    outlier_treatment(df, col)
Column: ctc_lpa
Lower Bound: -11.0
Upper Bound: 33.0
Number of outliers: 12047

Outliers:
                company_hash  \
25            nxbto xzntqztn   
38                evzzxt bvt   
93             xb  v onhatzn   
118                 bgzgnjut   
189                     tmvj   
...                      ...   
205817          wxowg ojntbo   
205821  rxnyxhb ntwyzgrgsxto   
205825                vbvkgz   
205833                vbvkgz   
205841        zgn vuurxwvmrt   

                                               email_hash  orgyear       ctc  \
25      d2668cb959e5657c3881413257b9850caa1359c7ce959a...   2015.0   9500000   
38      69ef6838be8ee5b628375b4cc160ba54c1f7cab8c3b130...   2006.0  11800000   
93      2b0daf6243847309e75a47fb22e9e3dc3f79031edf464a...   2021.0   5000000   
118     a37740bc0cedb23fd029276cbb7f4e821cca7df8dd0f70...   2011.0   6300000   
189     0ba214e9b65397dd7815b2909c714240781c590b2dc7d0...   2015.0  12000000   
...                                                   ...      ...       ...   
205817  a74c83782bc88928dd12b103c09697fbef87d15bd308d3...   2011.0   3327000   
205821  3def1401b5a8879eeb4a8465663a9c6c66702c85444d7e...   2012.0   3500000   
205825  95023bca0172ad67bfc3453550c5cf056557bc2c8c7169...   2016.0   4800000   
205833  f4415be48a1ef885e086dcd72181f667a289641e66f828...   2014.0   3800000   
205841  fb46a1a2752f5f652ce634f6178d0578ef6995ee59f6c8...   2019.0   5100000   

              job_position  ctc_updated_year  ctc_lpa  job_position_clean  \
25        Backend Engineer            2019.0    95.00    backend engineer   
38             QA Engineer            2019.0   118.00         qa engineer   
93        Android Engineer            2021.0    50.00    android engineer   
118       Backend Engineer            2019.0    63.00    backend engineer   
189     FullStack Engineer            2020.0   120.00  fullstack engineer   
...                    ...               ...      ...                 ...   
205817    Backend Engineer            2019.0    33.27    backend engineer   
205821    Backend Engineer            2016.0    35.00    backend engineer   
205825    Backend Engineer            2020.0    48.00    backend engineer   
205833    Backend Engineer            2019.0    38.00    backend engineer   
205841    Backend Engineer            2019.0    51.00    backend engineer   

             job_category  years_of_exp  
25      Software Engineer           4.0  
38           QA / Testing          13.0  
93      Software Engineer           0.0  
118     Software Engineer           8.0  
189     Software Engineer           5.0  
...                   ...           ...  
205817  Software Engineer           8.0  
205821  Software Engineer           4.0  
205825  Software Engineer           4.0  
205833  Software Engineer           5.0  
205841  Software Engineer           0.0  

[12047 rows x 10 columns]
Column: orgyear
Lower Bound: 2005.5
Upper Bound: 2025.5
Number of outliers: 7600

Outliers:
                     company_hash  \
20            pqgzgo xzwgqugqvnta   
52             wqgoogctq egq fgqp   
306                        xznhxn   
385               xmb wvzvav rvmo   
392     utgurt10 ntwyzgrgsxto xzw   
...                           ...   
205654                   tqxwoogz   
205667                    nfxnntq   
205686            mvzp ge vbtqxwv   
205693                       uvjh   
205803              wxowg ojontbo   

                                               email_hash  orgyear      ctc  \
20      b7d0b9cd894ab871c547063df449d03e4138050c0463c6...   2003.0   600000   
52      70656ece5fd6fc7c81783c6a83b8b34302c8d89d1e5e4c...   2004.0    60000   
306     466948a62063ec83e128d9d47df4795df4d59d953ad3e0...   2005.0  1000000   
385     6c95f6eb2b19d0375405cdfb99a420461f432e96cc92d4...   2004.0   105000   
392     83c951dfe2e20c805d4ce917e3212b6885b8d0a2416a03...   2005.0    34000   
...                                                   ...      ...      ...   
205654  e6e7a1f22cb01cf95650076c7dc8c6858f14778b7079ce...   1999.0  6500000   
205667  40b12fe32800569a02e210b92019479800c4ead15d5241...   1996.0   166000   
205686  46aba3e728d1bac5913ee45b0f5251db0dbeef3cf23bd5...   2002.0   120000   
205693  a1543b6aaed65b7fec970489fb0d792b0c55919f1fcc55...   2005.0  2500000   
205803  97f9acab2edf988fcec48e072b88f3731affd6db4487cb...   2005.0   200000   

                  job_position  ctc_updated_year  ctc_lpa  \
20                 QA Engineer            2021.0     6.00   
52            Backend Engineer            2019.0     7.20   
306     Engineering Leadership            2019.0    10.00   
385           Backend Engineer            2019.0     1.05   
392     Engineering Leadership            2019.0     4.08   
...                        ...               ...      ...   
205654        Backend Engineer            2017.0    65.00   
205667        Backend Engineer            2017.0     1.66   
205686        Backend Engineer            2016.0     1.20   
205693        Backend Engineer            2017.0    25.00   
205803        Backend Engineer            2019.0     2.00   

            job_position_clean                 job_category  years_of_exp  
20                 qa engineer                 QA / Testing          18.0  
52            backend engineer            Software Engineer          15.0  
306     engineering leadership  Product / Program / Manager          14.0  
385           backend engineer            Software Engineer          15.0  
392     engineering leadership  Product / Program / Manager          14.0  
...                        ...                          ...           ...  
205654        backend engineer            Software Engineer          18.0  
205667        backend engineer            Software Engineer          21.0  
205686        backend engineer            Software Engineer          14.0  
205693        backend engineer            Software Engineer          12.0  
205803        backend engineer            Software Engineer          14.0  

[7600 rows x 10 columns]
Column: ctc_updated_year
Lower Bound: 2016.0
Upper Bound: 2024.0
Number of outliers: 1759

Outliers:
                           company_hash  \
2211                             phrxkv   
3651                          wgszxkvzn   
10076                           xzegojo   
11081   exqon vacvznvst uqxcvnt rxbxnta   
19011                            gqvwrt   
...                                 ...   
205747       nyxzsfgqd v unw ntwyzgrgsj   
205748                           gnytqo   
205778                           gqvwrt   
205806                  ztdnowb xzwqtee   
205828                          cxqnhov   

                                               email_hash  orgyear      ctc  \
2211    3394674bb6bb1de6289e931853fa0bd131c811e0054a92...   2031.0  1500000   
3651    2cc6bae4e52677d27ce3fca38d7a01ecbe537e1dc1c48d...   2106.0   600000   
10076   4c171381270155fb87b885f89cd71ca37ebbb8fd9da58b...   2025.0   360000   
11081   d6df76c2b61fa3a068e4e3812be12a58f86f78a31fe888...   2029.0   310000   
19011   651e571741251182066dd1a7d5faaedd31a77eb25b8c08...   2010.0  1600000   
...                                                   ...      ...      ...   
205747  b06f730fb9012de1a5204b302391980e75d99fd471787a...   2010.0   170000   
205748  6a2bf858e0ed2829d21d994ce1e04fa5b8cf6c52662179...   2015.0   550000   
205778  dc896e8fb7bff82ff5c059a63c23bd9e4e8f9d6125b981...   2014.0  1268000   
205806  29f02bc29797857294cfcd44273ad42f64281ea79203f0...   2014.0  5266000   
205828  90a5e17931d705e48bf6a36144cad0954e7444574a7cf6...   2014.0  1200000   

            job_position  ctc_updated_year  ctc_lpa job_position_clean  \
2211    Backend Engineer            2031.0    15.00   backend engineer   
3651    Backend Engineer            2106.0     6.00   backend engineer   
10076              Other            2025.0     3.60              other   
11081              Other            2029.0     3.10              other   
19011   Backend Engineer            2015.0    16.00   backend engineer   
...                  ...               ...      ...                ...   
205747  Backend Engineer            2015.0     1.70   backend engineer   
205748  Backend Engineer            2015.0     5.50   backend engineer   
205778  Backend Engineer            2015.0    12.68   backend engineer   
205806  Backend Engineer            2015.0    52.66   backend engineer   
205828  Backend Engineer            2015.0    12.00   backend engineer   

             job_category  years_of_exp  
2211    Software Engineer           0.0  
3651    Software Engineer           0.0  
10076    Other / Non-Tech           0.0  
11081    Other / Non-Tech           0.0  
19011   Software Engineer           5.0  
...                   ...           ...  
205747  Software Engineer           5.0  
205748  Software Engineer           0.0  
205778  Software Engineer           1.0  
205806  Software Engineer           1.0  
205828  Software Engineer           1.0  

[1759 rows x 10 columns]
In [42]:
for col in ['orgyear', 'ctc_updated_year']:
    df = remove_outliers(df, col)
Number of rows before removal from orgyear: 197126
Outliers removed from orgyear: 7600
Number of rows after removal from orgyear: 189526

Number of rows before removal from ctc_updated_year: 189526
Outliers removed from ctc_updated_year: 1669
Number of rows after removal from ctc_updated_year: 187857

In [43]:
# For CTC we will cap the outliers instead of removing them.
def cap_outliers(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1

    lb = q1 - 1.5 * iqr
    ub = q3 + 1.5 * iqr

    data[col] = np.clip(data[col], lb, ub)
    return data

df = cap_outliers(df, 'ctc_lpa')
In [44]:
rows_after_clean = df.shape[0]

print(f"Percentage of data removed after outlier removal: {round((org_rows-rows_after_clean)/org_rows *100, 2)}%")
Percentage of data removed after outlier removal: 4.7%

Manual Clustering

Company + Job + Experience CTC Summary

In [45]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 187857 entries, 0 to 205842
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   company_hash        187857 non-null  object 
 1   email_hash          187857 non-null  object 
 2   orgyear             187857 non-null  float64
 3   ctc                 187857 non-null  int64  
 4   job_position        187857 non-null  object 
 5   ctc_updated_year    187857 non-null  float64
 6   ctc_lpa             187857 non-null  float64
 7   job_position_clean  187857 non-null  object 
 8   job_category        187857 non-null  object 
 9   years_of_exp        187857 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 15.8+ MB
In [46]:
summary_cje = (
    df.groupby(['company_hash', 'job_position', 'years_of_exp'])
      .agg(
          avg_ctc=('ctc', 'mean'),
          min_ctc=('ctc', 'min'),
          max_ctc=('ctc', 'max'),
          count=('ctc', 'count')
      )
      .reset_index()
)
In [47]:
df = df.merge(summary_cje,
              on=['company_hash', 'job_position', 'years_of_exp'],
              how='left')
In [48]:
df.sample(5)
Out[48]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count
92540 atbtztf 38d1fa6b6127c5b0fcf7a32125b6f723e93b33b0e2dc49... 2016.0 1300000 FullStack Engineer 2021.0 13.00 fullstack engineer Software Engineer 5.0 1.300000e+06 1300000 1300000 1
105421 xzegojo 95db365454f5570f1204f281375ff894432f094659fee0... 2009.0 390000 Backend Engineer 2019.0 3.90 backend engineer Software Engineer 10.0 1.072222e+06 390000 2000000 18
61609 ogenfvqt tzsxzttq ogzj xzaxv ogenfvqt wtznqt 6253bb1144050e607521e595eb36c1b4d05544bff55666... 2017.0 750000 Backend Engineer 2019.0 7.50 backend engineer Software Engineer 2.0 7.500000e+05 750000 750000 1
117674 zxoovz axsxnvr b0835cd9a41974819eeec33de7e9f0bc4217ddde3bf0b9... 2018.0 630000 Backend Engineer 2019.0 6.30 backend engineer Software Engineer 1.0 6.900000e+05 625000 1000000 6
147579 nyghsynfgqpo 0fcd2634281ad1676af00981f09ac9e3ec9bb89a6b805b... 2011.0 126000 Backend Engineer 2016.0 1.26 backend engineer Software Engineer 5.0 1.152889e+06 126000 1700000 9

Designation Flag

Logic

  • 1 → Above average
  • 2 → Around average
  • 3 → Below average
In [49]:
def designation_flag(row):
    if row['ctc'] > row['avg_ctc']:
        return 1
    elif row['ctc'] < row['avg_ctc']:
        return 3
    else:
        return 2

df['Designation'] = df.apply(designation_flag, axis=1)
df.sample(5)
Out[49]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count Designation
42048 zxr 9690b44afccd5011ce3b4c6959ca3b98b2bb555ae8bfd1... 2020.0 2310000 Backend Engineer 2020.0 23.1 backend engineer Software Engineer 0.0 2.578333e+06 360000 7400000 6 3
141204 rgjvrnj lhsstqzvhn ba9499599c7e4b9702a84da141719b8b2592ccfed6eb63... 2013.0 1950000 Backend Engineer 2020.0 19.5 backend engineer Software Engineer 7.0 1.950000e+06 1950000 1950000 1 2
12563 vwnxctvx b31490e2eac8b671b38bf9ceabb85a5a56ee9d9be52322... 2018.0 500000 Backend Engineer 2021.0 5.0 backend engineer Software Engineer 3.0 5.000000e+05 500000 500000 1 2
9661 ohmtd 53ce43655503bc47faaf85d52e5fee4d3d475a62e5f726... 2016.0 1100000 Other 2020.0 11.0 other Other / Non-Tech 4.0 1.100000e+06 1100000 1100000 1 2
45618 exo 402c7c4f4508f9cc4c7d31b11d9085eb4702be6b06b915... 2014.0 1050000 QA Engineer 2021.0 10.5 qa engineer QA / Testing 7.0 8.900000e+05 730000 1050000 2 1

Designation 1 learners are outperformers within same role & experience.

Class Flag (Company + Job Position)

In [50]:
summary_cj = (
    df.groupby(['company_hash', 'job_position'])
      .agg(avg_ctc_class=('ctc', 'mean'))
      .reset_index()
)
In [51]:
df = df.merge(summary_cj, on=['company_hash', 'job_position'], how='left')
In [52]:
df['class'] = np.where(
    df['ctc'] > df['avg_ctc_class'], 1,
    np.where(df['ctc'] < df['avg_ctc_class'], 3, 2)
)

Tier Flag (Company Level)

In [53]:
summary_company = (
    df.groupby('company_hash')
      .agg(avg_ctc_company=('ctc', 'mean'))
      .reset_index()
)
In [54]:
df = df.merge(summary_company, on='company_hash', how='left')
In [55]:
df['Tier'] = np.where(
    df['ctc'] > df['avg_ctc_company'], 1,
    np.where(df['ctc'] < df['avg_ctc_company'], 3, 2)
)

Visual Analysis

Univariate Analysis

In [56]:
for col in num_cols:
    plt.figure(figsize=(6,4))
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col.upper()}")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [57]:
for col in num_cols:
    plt.figure(figsize=(6,4))
    sns.boxplot(y=df[col])
    plt.title(f"Boxplot of {col.upper()}")
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

“CTC is right-skewed”

“Outliers significantly reduced after cleaning”

Categorical Variable Analysis

In [58]:
for col in cat_cols:
    plt.figure(figsize=(10,4))
    df[col].value_counts().head(10).plot(kind='bar')
    plt.title(f"Top categories in {col.upper()}")
    plt.show()
No description has been provided for this image
No description has been provided for this image

Bivariate Analysis

CTC vs Experience

In [59]:
plt.figure(figsize=(12,6))
sns.boxplot(x=round(df['years_of_exp']), y='ctc_lpa', data=df)
plt.xticks(rotation=45)
plt.title('CTC Distribution by Years of Experience')
plt.ylabel('CTC LPA')
plt.xlabel('Years of Experience')
plt.show()
No description has been provided for this image

CTC increases with experience, but the widening salary spread and presence of high outliers across all experience levels indicate that role and company significantly influence compensation, making clustering a suitable approach.

CTC vs Job position

In [60]:
plt.figure(figsize=(12,6))
sns.boxplot(x='job_category', y='ctc_lpa', data=df)
plt.xticks(rotation=45)
plt.title('CTC Distribution by Job position')
plt.ylabel('CTC LPA')
plt.xlabel('Job position')
plt.show()
No description has been provided for this image

Product / Program / Manager roles command the highest median CTC, while Intern/Student and Business/Operations roles earn the lowest, with Software Engineering and DevOps showing strong but more variable pay growth.

Top 10 employees (Tier 1)

In [61]:
df[df['Tier'] == 1].sort_values('ctc_lpa', ascending=False).head(10)
Out[61]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count Designation avg_ctc_class class avg_ctc_company Tier
24 nxbto xzntqztn d2668cb959e5657c3881413257b9850caa1359c7ce959a... 2015.0 9500000 Backend Engineer 2019.0 31.9 backend engineer Software Engineer 4.0 2.091250e+06 700000 9500000 16 1 1.862625e+06 1 1.894941e+06 1
187847 vbvkgz f4415be48a1ef885e086dcd72181f667a289641e66f828... 2014.0 3800000 Backend Engineer 2019.0 31.9 backend engineer Software Engineer 5.0 3.436423e+06 14000 100000000 189 1 3.320822e+06 1 3.455085e+06 1
187781 zvz 298933dd70b9152e589a7e4c263253f354606521dee832... 2022.0 7500000 Backend Engineer 2022.0 31.9 backend engineer Software Engineer 0.0 1.772605e+06 600 80000000 438 1 1.962582e+06 1 2.097492e+06 1
187795 vnytzvytvrny 12f85a43be88b0fc64a36648e2de0200325344dc223624... 2013.0 4400000 Backend Engineer 2020.0 31.9 backend engineer Software Engineer 7.0 2.836667e+06 2200000 4400000 6 1 2.212306e+06 1 2.354785e+06 1
187759 vbvkgz 5dfef7c1643f0b744456f47f72901027dfad9c6b24d443... 2013.0 4000000 Backend Engineer 2020.0 31.9 backend engineer Software Engineer 7.0 4.066214e+06 61000 100000000 112 3 3.320822e+06 1 3.455085e+06 1
187799 zcxaxv df6bc6b9b560dd3faae14c182743a31a315680c2051a94... 2019.0 4240000 Backend Engineer 2020.0 31.9 backend engineer Software Engineer 1.0 3.570000e+06 2900000 4240000 2 1 1.833545e+06 1 3.675157e+06 1
63427 zgho xzegojontbo 350e59e722f73a630057cbcea88b545f23e66399e509d2... 2011.0 3500000 FullStack Engineer 2021.0 31.9 fullstack engineer Software Engineer 10.0 3.500000e+06 3500000 3500000 1 2 1.518750e+06 1 1.541250e+06 1
63454 sggsrt 0ff25e37f8d62d03695882afd203ce667ee6c90bae0074... 2013.0 6500000 Software Engineer Backend 2020.0 31.9 software engineer backend Software Engineer 7.0 6.500000e+06 6500000 6500000 1 2 3.350000e+06 1 5.961943e+06 1
63463 uvjovet sqghu fe67fa1b7d4a93ea1e8d6bf0ee2c8dd5964c5656b2d378... 2013.0 3500000 FullStack Engineer 2021.0 31.9 fullstack engineer Software Engineer 8.0 3.500000e+06 3500000 3500000 1 2 1.535000e+06 1 2.145368e+06 1
63472 zgn vuurxwvmrt vwwghzn fecac8d9ff2e0a852943534d6cdc9a3c496c8496c4b9e6... 2018.0 10000000 Other 2021.0 31.9 other Other / Non-Tech 3.0 4.125690e+06 60000 100000000 29 1 3.983624e+06 1 2.337827e+06 1

Bottom 10 employees (Tier 3)

In [62]:
df[df['Tier'] == 3].sort_values('ctc_lpa').head(10)
Out[62]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count Designation avg_ctc_class class avg_ctc_company Tier
123912 xzntqcxtfmxn 3505b02549ebe2c95840ac6f0a35561a3b4cbe4b79cdb1... 2014.0 2 Backend Engineer 2019.0 0.00024 backend engineer Software Engineer 5.0 1.092503e+06 2 3000000 8 3 1.143685e+06 3 1.496672e+06 3
108500 xzntqcxtfmxn f2b58aeed3c074652de2cfd3c0717a5d21d6fbcf342a78... 2013.0 6 Backend Engineer 2018.0 0.00072 backend engineer Software Engineer 5.0 1.092503e+06 2 3000000 8 3 1.143685e+06 3 1.496672e+06 3
104842 xzntqcxtfmxn 23ad96d6b6f1ecf554a52f6e9b61677c7d73d8a409a143... 2013.0 14 Backend Engineer 2018.0 0.00168 backend engineer Software Engineer 5.0 1.092503e+06 2 3000000 8 3 1.143685e+06 3 1.496672e+06 3
168752 xm b8a0bb340583936b5a7923947e9aec21add5ebc50cd60b... 2016.0 15 Backend Engineer 2018.0 0.00180 backend engineer Software Engineer 2.0 1.500000e+01 15 15 1 2 1.550000e+01 3 1.550000e+01 3
107392 hzxctqoxnj ge fvoyxzsngz f7e5e788676100d7c4146740ada9e2f8974defc01f571d... 2022.0 200 Backend Engineer 2022.0 0.02400 backend engineer Software Engineer 0.0 2.000000e+02 200 200 1 2 2.000000e+02 2 3.140000e+04 3
156199 nvnv wgzohrnvzwj otqcxwto 80ba0259f9f59034c4927cf3bd38dc9ce2eb60ff18135b... 2012.0 600 Backend Engineer 2017.0 0.07200 backend engineer Software Engineer 5.0 2.258645e+06 600 200000000 242 3 1.453687e+06 3 1.728092e+06 3
137661 zvz 9af3dca6c9d705d8d42585ccfce2627f00e1629130d14e... 2023.0 600 Backend Engineer 2023.0 0.07200 backend engineer Software Engineer 0.0 1.772605e+06 600 80000000 438 3 1.962582e+06 3 2.097492e+06 3
91490 gjg b995d7a2ae5c6f8497762ce04dc5c04ad6ec734d70802a... 2018.0 600 FullStack Engineer 2021.0 0.07200 fullstack engineer Software Engineer 3.0 1.185120e+06 600 1875000 5 3 1.412721e+06 3 2.249174e+06 3
71039 vwwtznhqt f0f2005505c707dbdd2c86ca1587c26f822a004e86a8ec... 2021.0 1000 Backend Engineer 2021.0 0.12000 backend engineer Software Engineer 0.0 2.419641e+06 1000 200000000 217 3 2.002374e+06 3 2.448008e+06 3
81717 wgd vhngbgnxct xzw 4d18008fc2cb66e4b90f3798ccbbc4792dfd4bad5a7a87... 2016.0 1000 FullStack Engineer 2019.0 0.12000 fullstack engineer Software Engineer 3.0 5.050000e+04 1000 100000 2 3 6.700000e+04 3 7.525000e+04 3

Top 10 Data Science employees per company

In [63]:
df[
    (df['job_position'].str.contains('Data', case=False)) &
    (df['class'] == 1)
].sort_values('ctc', ascending=False).groupby('company_hash').head(10)
Out[63]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count Designation avg_ctc_class class avg_ctc_company Tier
82784 vwwtznhqt 2f9a4241053f76b2f8c50ea593a90586d38b3f0e08c141... 2016.0 200000000 Data Analyst 2019.0 9.86 data analyst Data 3.0 1.094865e+07 360000 200000000 23 1 6.240684e+06 1 2.448008e+06 1
93516 ytqt ntwyzgrgsxto 98a90272cbba6e6e9ca94981824f3465f3d34567cb065f... 2015.0 200000000 Data Analyst 2020.0 12.74 data analyst Data 5.0 2.000000e+08 200000000 200000000 1 2 4.029800e+07 1 4.259825e+06 1
63577 wgzahtzn 9ce2995b2221fe627e861daea9d0603872cce8cc128390... 2016.0 200000000 Data Analyst 2020.0 20.60 data analyst Data 4.0 1.002300e+08 460000 200000000 2 1 3.387917e+07 1 3.238675e+06 1
123172 nvnv wgzohrnvzwj otqcxwto 1b95e7ba0ee82100ca5a034239fa0203a1bec14280b82a... 2018.0 200000000 Database Administrator 2021.0 6.00 database administrator Other / Non-Tech 3.0 2.899457e+07 390000 200000000 7 1 7.809081e+06 1 1.728092e+06 1
49179 zgzt 268a5aa92f0b6d0c675fc9cc1e300eb0c5930a3a139a23... 2022.0 200000000 Data Scientist 2022.0 4.14 data scientist Data 0.0 2.995714e+07 100000 200000000 7 1 1.020609e+07 1 5.838392e+06 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
114614 xzatrrxtzn cafbe82bdd822f80cdf232311a09fb8d8c6ccdac5786f2... 2018.0 180000 Data Scientist 2019.0 1.80 data scientist Data 1.0 1.800000e+05 180000 180000 1 2 1.450000e+05 1 1.450000e+05 1
62153 x vb v eqttrvzwtq 8b1b2cf96ee10506f1d937c9045a923233f122512b8df4... 2014.0 180000 Data Analyst 2021.0 1.80 data analyst Data 7.0 1.800000e+05 180000 180000 1 2 1.307500e+05 1 1.405905e+06 3
65172 urhbxzw 89c50595727487bfa7d687bb168d2c2b55a57646876a4d... 2017.0 155000 Data Scientist 2018.0 1.55 data scientist Data 1.0 1.400000e+05 125000 155000 2 1 1.400000e+05 1 1.400000e+05 1
45509 mvjtq db2c70fea469a7f1456457812fe94a01c337eb6ce75bd5... 2018.0 115000 Data Scientist 2019.0 1.15 data scientist Data 1.0 1.150000e+05 115000 115000 1 2 1.055000e+05 1 1.833250e+06 3
140134 exznqhon ogrhnxgzo ucn rna ab2dc9db23c3104f0b6b3dbd4cdd5bfb9e5829b8b7943d... 2017.0 10000 Data Scientist 2020.0 1.20 data scientist Data 3.0 1.000000e+04 10000 10000 1 2 8.600000e+03 1 9.066667e+03 1

1670 rows × 19 columns

Top 10 Companies by CTC

In [64]:
df.groupby('company_hash')['ctc'].mean().sort_values(ascending=False).head(10)
Out[64]:
ctc
company_hash
whmxw rgsxwo uqxcvnt rxbxnta 1.000150e+09
aveegaxr xzntqzvnxgzvr hzxctqoxnj 2.500000e+08
ntwyzg xzaxv sqghu 2.000000e+08
touxqxn ntwyzgrgsxto ucnrna 2.000000e+08
onttr onqxuo fyttro rxbxnta 2.000000e+08
i wgzztin mhoxztoo ogrhnxgzo ucn rna 2.000000e+08
agyv tdnqvwg 2.000000e+08
hzxiht sqghu ge owyggro 2.000000e+08
xzaxv ugftq wgqugqvnxgz rna 2.000000e+08
ctqexohayv ntwyzgrgsxto 2.000000e+08

Top 2 Positions per Company

In [65]:
df.groupby(['company_hash', 'job_position'])['ctc'].mean() \
  .reset_index() \
  .sort_values(['company_hash', 'ctc'], ascending=[True, False]) \
  .groupby('company_hash').head(2)
Out[65]:
company_hash job_position ctc
0 0 Backend Engineer 100000.0
1 0 Other 100000.0
2 0000 Other 300000.0
4 01 ojztqsj Frontend Engineer 830000.0
3 01 ojztqsj Android Engineer 270000.0
... ... ... ...
63516 zz Other 1370000.0
63515 zz Backend Engineer 500000.0
63517 zzb ztdnstz vacxogqj ucn rna Backend Engineer 600000.0
63518 zzb ztdnstz vacxogqj ucn rna FullStack Engineer 600000.0
63519 zzgato Backend Engineer 130000.0

46883 rows × 3 columns

Across companies, technical roles such as Backend and Full-Stack Engineering consistently appear among the top two highest-paying positions based on average CTC.

Statistical Analysis

Normality test

In [66]:
from scipy.stats import shapiro

sample_ctc = df['ctc_lpa'].sample(5000, random_state=42)
stat, p = shapiro(sample_ctc)

if p < 0.05:
    print(f"p-Value is {p}. Hence CTC is not normally distributed")
else:
    print(f"p-Value is {p}. Hence CTC is normally distributed")
p-Value is 1.0426914677433449e-52. Hence CTC is not normally distributed

Since CTC is not normally distributed, we will use non-parametric tests.

Kruskal–Wallis Test

In [67]:
alpha = 0.05

CTC vs Job position

Hypotheses:

  • H₀: Median CTC is the same across job categories
  • H₁: At least one category differs
In [68]:
groups = [
    group['ctc_lpa'].values
    for _, group in df.groupby('job_position')
]

stat, p = kruskal(*groups)

if p < alpha:
    print("Kruskal-Wallis p-value:", p)
    print("Salary differs significantly by job position")

else:
    print("Kruskal-Wallis p-value:", p)
    print("No significant difference")
Kruskal-Wallis p-value: 0.0
Salary differs significantly by job position
  • CTC differs significantly across job categories/positions, validating the visual salary gaps observed.

CTC vs Experience

In [69]:
corr, p = pearsonr(df['years_of_exp'], df['ctc_lpa'])

print("Pearson Correlation:", corr)
print("p-value:", p)
Pearson Correlation: 0.31235430225317334
p-value: 0.0
In [70]:
corr, p = spearmanr(df['years_of_exp'], df['ctc_lpa'])

print("Spearman Correlation:", corr)
print("p-value:", p)
Spearman Correlation: 0.30573054623289775
p-value: 0.0

There is a statistically significant positive relationship between experience and CTC.

Tiers vs Salary Distributions

In [71]:
tier_groups = [
    df[df['Tier'] == t]['ctc_lpa']
    for t in df['Tier'].unique()
]

stat, p = kruskal(*tier_groups)

if p < 0.05:
    print(f"p-Value is {p}. Salary distributions differ significantly across Tier groups.")
else:
    print(f"p-Value is {p}. Salary distributions not differ across Tier groups.")
p-Value is 0.0. Salary distributions differ significantly across Tier groups.

Since CTC is highly skewed and non-normally distributed, non-parametric statistical tests were applied.

Kruskal–Wallis tests confirmed that salary distributions differ significantly across job categories and tier levels.

Spearman correlation showed a statistically significant positive relationship between years of experience and CTC.

These results validate the patterns observed during exploratory data analysis.

ML Development

Data Processing for Unsupervised Learning

In [72]:
df.sample(5)
Out[72]:
company_hash email_hash orgyear ctc job_position ctc_updated_year ctc_lpa job_position_clean job_category years_of_exp avg_ctc min_ctc max_ctc count Designation avg_ctc_class class avg_ctc_company Tier
134669 vagmt e0615a4e98c07837b967537bbfe0614fb24b042421cd6e... 2013.0 1230000 Backend Engineer 2016.0 12.3 backend engineer Software Engineer 3.0 2.114379e+06 10000 5100000 91 3 2.746472e+06 3 2.880667e+06 3
163694 tqxwoogz 57cc33fe71eab1d8aabf167c1ec150cb5b4a137d060f5a... 2019.0 350000 Other 2019.0 3.5 other Other / Non-Tech 0.0 8.800000e+05 350000 1600000 3 3 8.578387e+05 3 1.602591e+06 3
102177 xmb ec6e90591ff23db1e78ed0d3b8a8e0094c9aa6c61c5343... 2012.0 600000 QA Engineer 2019.0 6.0 qa engineer QA / Testing 7.0 7.399998e+05 409999 1800000 6 3 9.546428e+05 3 1.411994e+06 3
6824 rv tdvwnrrj ucn rna 66dc6ea9d6dbd7d33741385706770f8d06d69bc6b831c5... 2017.0 200000 Backend Engineer 2020.0 2.0 backend engineer Software Engineer 3.0 2.000000e+05 200000 200000 1 2 2.000000e+05 2 2.000000e+05 2
73 ptntp x 7fe7375ea26f4acf0c3e589616df27bb4379a96ae6d855... 2021.0 2200000 Backend Engineer 2021.0 22.0 backend engineer Software Engineer 0.0 2.200000e+06 2200000 2200000 1 2 2.200000e+06 2 2.200000e+06 2
In [73]:
company_le = LabelEncoder()
job_le = LabelEncoder()
scaler = MinMaxScaler()
In [74]:
df['company_enc'] = company_le.fit_transform(df['company_hash'])
df['job_enc'] = job_le.fit_transform(df['job_position_clean'])

df['ctc_scaled'] = scaler.fit_transform(df[['ctc_lpa']])

X = df[['company_enc', 'job_enc', 'ctc_scaled', 'years_of_exp']]
In [75]:
X.head(5)
Out[75]:
company_enc job_enc ctc_scaled years_of_exp
0 920 408 0.344823 4.0
1 18750 255 0.141059 1.0
2 14746 110 0.626956 5.0
3 11518 110 0.219430 2.0
4 19222 255 0.438867 2.0

Elbow Method

In [76]:
inertia = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
In [77]:
kneedle = KneeLocator(
    K,
    inertia,
    curve='convex',
    direction='decreasing'
)

elbow_k = kneedle.elbow
elbow_inertia = inertia[K.index(elbow_k)]
In [78]:
plt.plot(K, inertia, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")

plt.annotate(
    f'Elbow Point (K={elbow_k})',
    xy=(elbow_k, elbow_inertia),
    xytext=(elbow_k + 1, elbow_inertia + 30000),
    arrowprops=dict(facecolor='red', arrowstyle='->'),
    fontsize=10
)

plt.show()
No description has been provided for this image
  • We selected $K = 3$ because the inertia drops sharply up to $3$ clusters, after which the curve flattens.
  • This indicates diminishing returns beyond $K = 3$, making it the optimal balance between cluster compactness and model simplicity.

ML Model

Silhouette score

In [79]:
def sil_score(x, labels, sample_frac=0.1, random_state=42):
    sample_size = int(0.1 * X.shape[0])
    X_sample, labels_sample = resample(
        x,
        labels,
        n_samples=sample_size,
        random_state=random_state,
        stratify=labels
    )

    score = silhouette_score(X_sample, labels_sample)

    return f"Silhouette Score (10% sample): {score:.4f}"

KMeans++

In [80]:
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)

np.random.seed(45)
kmeans.fit(X)
klabels = kmeans.fit_predict(X)

sil_score(X, klabels)
Out[80]:
'Silhouette Score (10% sample): 0.6158'
In [81]:
X_km = X.copy()

X_km['klabels']=klabels

X_km.head(5)
Out[81]:
company_enc job_enc ctc_scaled years_of_exp klabels
0 920 408 0.344823 4.0 2
1 18750 255 0.141059 1.0 0
2 14746 110 0.626956 5.0 0
3 11518 110 0.219430 2.0 0
4 19222 255 0.438867 2.0 0

Hierarchical Clustering (Sample)

In [82]:
np.random.seed(45)
sample = X[:1000]
linked = linkage(sample, method='ward')

plt.figure(figsize=(16, 6))
cut_height = 125000
plt.axhline(y=cut_height, color='red', linestyle='--', linewidth=1)
dendrogram(linked)
plt.show()
No description has been provided for this image
  • Hierarchical clustering shows a clear separation into three major clusters when cut at distance $≈125,000$.
  • These clusters likely represent low, medium, and high compensation groups, validating the KMeans result ($k=3$).
  • The large vertical gaps indicate strong natural grouping in the data.

Agglomerative clustering

In [83]:
n_samples = int(0.1 * X.shape[0])
np.random.seed(42)
idx = np.random.choice(X.shape[0], size=n_samples, replace=False)

X_sample = X.iloc[idx]

model = AgglomerativeClustering(n_clusters=4, linkage='ward')

agg_labels = model.fit_predict(X_sample)
In [84]:
print(f"Agglomerative - Silhouette  score: {round(silhouette_score(X_sample, agg_labels), 4)}")
Agglomerative - Silhouette  score: 0.5906
In [85]:
X_sample_agg = X_sample.copy(deep=True)

X_sample_agg['agg_labels'] = agg_labels

X_sample_agg.head(5)
Out[85]:
company_enc job_enc ctc_scaled years_of_exp agg_labels
41151 14396 110 0.125385 1.0 3
57235 5424 110 0.470215 0.0 1
41897 7267 110 0.470215 0.0 1
43861 13019 215 0.815046 10.0 3
108260 23933 255 0.188082 0.0 0

GMM (Gaussian Mixture Model)

In [86]:
X_gmm_sample = X_sample.copy(deep=True)
In [87]:
gmm = GaussianMixture(
    n_components=4,
    covariance_type='full',
    random_state=42
)
gmm_labels = gmm.fit_predict(X_gmm_sample)

sil = silhouette_score(X_gmm_sample, gmm_labels)
print(f"GMM Silhouette Score: {sil:.4f}")
GMM Silhouette Score: 0.0889
In [88]:
X_gmm_sample['gmm_lables'] = gmm_labels

X_gmm_sample.head(5)
Out[88]:
company_enc job_enc ctc_scaled years_of_exp gmm_lables
41151 14396 110 0.125385 1.0 3
57235 5424 110 0.470215 0.0 2
41897 7267 110 0.470215 0.0 2
43861 13019 215 0.815046 10.0 0
108260 23933 255 0.188082 0.0 3
  • GMM score is very low and seems very poor clustering of the data.
In [89]:
X_km['cluster'] = X_km['klabels'].map({0: 'Entry / Low CTC', 1: 'Mid-level Professionals', 2: 'High Earners / Leaders'})
In [90]:
chart = X_km['cluster'].value_counts().reset_index()
In [91]:
plt.pie(x=chart['count'],
        labels=chart['cluster'],
        explode = (0.1, 0, 0),
        colors = ['#4C5D6A', '#2F8F8B', '#6FBF73'],
        autopct='%1.1f%%',
        startangle=120)

plt.show()
No description has been provided for this image

📊 Business Recommendations

1. Compensation Benchmarking

Companies can use these clusters to:

  • Compare their employee pay distribution against the market
  • Identify overpaying or underpaying roles

➡️ Example: If most employees fall into Mid-level cluster but competitors place similar roles in High Earners, there is a retention risk.

2. Role-Based Salary Bands

HR teams should define salary bands by job position, not just experience.

✔ Recommended:

  • Maintain role-specific salary ranges
  • Adjust compensation policies based on market cluster positioning

3. Talent Retention & Career Planning

Employees can be mapped to clusters to:

  • Identify high performers stuck in lower pay clusters
  • Design promotion and upskilling paths to move employees upward

4. Hiring & Offer Strategy

For new hires:

  • Predict likely cluster placement based on role + experience
  • Make competitive but controlled offers aligned with market clusters

5. Strategic Workforce Planning

Leadership can use cluster insights to:

  • Forecast payroll growth
  • Plan leadership hiring
  • Balance cost vs expertise across teams