Scaler Clustering¶

Problem statement¶

Objective¶

The objective of this project is to cluster Scaler learners based on their company, job position, experience, and compensation (CTC) to identify homogeneous groups with similar professional and salary characteristics.
Both manual clustering (business-rule based) and unsupervised learning techniques (K-Means and Hierarchical Clustering) are applied to:
- Identify high-earning and low-earning learners
- Benchmark salaries across companies and roles
- Provide actionable insights to learners and Scaler’s business team

Import Libraries & Load Data¶

!pip install kneed

Collecting kneed
  Downloading kneed-0.8.5-py3-none-any.whl.metadata (5.5 kB)
Requirement already satisfied: numpy>=1.14.2 in /usr/local/lib/python3.12/dist-packages (from kneed) (2.0.2)
Requirement already satisfied: scipy>=1.0.0 in /usr/local/lib/python3.12/dist-packages (from kneed) (1.16.3)
Downloading kneed-0.8.5-py3-none-any.whl (10 kB)
Installing collected packages: kneed
Successfully installed kneed-0.8.5

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import re
from datetime import datetime

from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.cluster import KMeans, DBSCAN
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import silhouette_score
from sklearn.utils import resample
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering

import warnings
warnings.filterwarnings('ignore')

from scipy.stats import kruskal, spearmanr, pearsonr, mannwhitneyu

from kneed import KneeLocator

!gdown 1Sg45aJZsKNEZdOSjJ5BnNoPhpBgyQf0L

Downloading...
From: https://drive.google.com/uc?id=1Sg45aJZsKNEZdOSjJ5BnNoPhpBgyQf0L
To: /content/scaler_clustering.csv
100% 24.7M/24.7M [00:00<00:00, 47.9MB/s]

df = pd.read_csv('/content/scaler_clustering.csv')

Basic EDA¶

Preliminary analysis¶

df.shape

(205843, 7)

df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205843 entries, 0 to 205842
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        205843 non-null  int64  
 1   company_hash      205799 non-null  object 
 2   email_hash        205843 non-null  object 
 3   orgyear           205757 non-null  float64
 4   ctc               205843 non-null  int64  
 5   job_position      153279 non-null  object 
 6   ctc_updated_year  205843 non-null  float64
dtypes: float64(2), int64(2), object(3)
memory usage: 11.0+ MB

👁️ Observation

Dataset contains both categorical (Company, Job_position) and numerical (CTC, orgyear) variables
Email_hash and Company_hash are anonymized identifiers
Unnamed: 0 is an index column → can be dropped

df.drop(columns=['Unnamed: 0'], inplace=True)

Unique Email Analysis¶

df['email_hash'].nunique()

153443

df['email_hash'].value_counts().head(10)

👁️ Observation

Multiple records for the same email indicate job switches, promotions, or salary updates.
This justifies clustering at the learner-record level, not unique learner level.

CTC checking¶

df['ctc'].describe()

df['ctc'].min(), df['ctc'].max()

(np.int64(2), np.int64(1000150000))

df[df['ctc'] < 200]

CTC is stored in mixed units:

Some values are in lakhs (2, 6, 14, 15)
Some values are in thousands (200, 600, 1000)
Some are in actual amounts (1,000,000+)

Standardizing CTC¶

a = set(df['ctc'].unique())
a

{np.int64(5120000),
 np.int64(2048000),
 np.int64(2),
 np.int64(6),
 np.int64(8200),
 np.int64(213000),
 np.int64(2261000),
 np.int64(1237000),
 np.int64(14),
 np.int64(15),
 np.int64(1450000),
 np.int64(16400),
 np.int64(426000),
 np.int64(16),
 np.int64(24),
 np.int64(25),
 np.int64(639000),
 np.int64(1663000),
 np.int64(1875999),
 np.int64(2900000),
 np.int64(100180000),
 np.int64(852000),
 np.int64(1065000),
 np.int64(41000),
 np.int64(147500),
 np.int64(4350000),
 np.int64(1278000),
 np.int64(254000),
 np.int64(467000),
 np.int64(2515000),
 np.int64(680000),
 np.int64(185000000),
 np.int64(5800000),
 np.int64(2728000),
 np.int64(893000),
 np.int64(3965000),
 np.int64(7250000),
 np.int64(2130000),
 np.int64(82000),
 np.int64(1106000),
 np.int64(3154000),
 np.int64(1319000),
 np.int64(3367000),
 np.int64(295000),
 np.int64(401500),
 np.int64(3579999),
 np.int64(8700000),
 np.int64(3580000),
 np.int64(1532000),
 np.int64(98400),
 np.int64(508000),
 np.int64(1745000),
 np.int64(721000),
 np.int64(8300),
 np.int64(5030000),
 np.int64(934000),
 np.int64(2982000),
 np.int64(16500),
 np.int64(123000),
 np.int64(1147000),
 np.int64(1360000),
 np.int64(336000),
 np.int64(11600000),
 np.int64(37200000),
 np.int64(2384000),
 np.int64(6480000),
 np.int64(1573000),
 np.int64(3621000),
 np.int64(13050000),
 np.int64(2810000),
 np.int64(762000),
 np.int64(975000),
 np.int64(14500000),
 np.int64(164000),
 np.int64(1188000),
 np.int64(3236000),
 np.int64(4260000),
 np.int64(377000),
 np.int64(2425000),
 np.int64(1401000),
 np.int64(590000),
 np.int64(1614000),
 np.int64(131250),
 np.int64(802999),
 np.int64(803000),
 np.int64(2039999),
 np.int64(2040000),
 np.int64(1016000),
 np.int64(1015999),
 np.int64(43000000),
 np.int64(68600000),
 np.int64(17400000),
 np.int64(204999),
 np.int64(205000),
 np.int64(200),
 np.int64(1229000),
 np.int64(2253000),
 np.int64(90530000),
 np.int64(3490000),
 np.int64(8400),
 np.int64(418000),
 np.int64(2572500),
 np.int64(1442000),
 np.int64(3703000),
 np.int64(631000),
 np.int64(1655000),
 np.int64(4940000),
 np.int64(844000),
 np.int64(639200),
 np.int64(10060000),
 np.int64(33000),
 np.int64(1057000),
 np.int64(33004),
 np.int64(1270000),
 np.int64(6390000),
 np.int64(1679600),
 np.int64(246000),
 np.int64(1483000),
 np.int64(1278200),
 np.int64(459000),
 np.int64(100000000),
 np.int64(672000),
 np.int64(2720000),
 np.int64(57600),
 np.int64(885000),
 np.int64(2933000),
 np.int64(34890000),
 np.int64(74000),
 np.int64(4170000),
 np.int64(2122000),
 np.int64(1098000),
 np.int64(1311000),
 np.int64(6431000),
 np.int64(2335000),
 np.int64(4383000),
 np.int64(499999),
 np.int64(500000),
 np.int64(5620000),
 np.int64(1524000),
 np.int64(8480),
 np.int64(1737000),
 np.int64(713000),
 np.int64(300),
 np.int64(1950000),
 np.int64(7070000),
 np.int64(926000),
 np.int64(8500),
 np.int64(115000),
 np.int64(221500),
 np.int64(100680000),
 np.int64(3400000),
 np.int64(29000000),
 np.int64(328000),
 np.int64(8520000),
 np.int64(157000000),
 np.int64(1352000),
 np.int64(2376000),
 np.int64(1565000),
 np.int64(541000),
 np.int64(82250),
 np.int64(434500),
 np.int64(4850000),
 np.int64(754000),
 np.int64(967000),
 np.int64(3015000),
 np.int64(147805),
 np.int64(6300000),
 np.int64(1180000),
 np.int64(156000),
 np.int64(8348000),
 np.int64(57500000),
 np.int64(31900000),
 np.int64(369000),
 np.int64(9585000),
 np.int64(2630000),
 np.int64(7750000),
 np.int64(1401200),
 np.int64(1606000),
 np.int64(582000),
 np.int64(795000),
 np.int64(1819000),
 np.int64(4079999),
 np.int64(9200000),
 np.int64(1008000),
 np.int64(4080000),
 np.int64(11248000),
 np.int64(2245000),
 np.int64(197000),
 np.int64(409999),
 np.int64(410000),
 np.int64(10650000),
 np.int64(57750),
 np.int64(5743000),
 np.int64(8600),
 np.int64(623000),
 np.int64(1860000),
 np.int64(836000),
 np.int64(16800),
 np.int64(12100000),
 np.int64(25000),
 np.int64(18670000),
 np.int64(3310000),
 np.int64(2286000),
 np.int64(8430000),
 np.int64(238000),
 np.int64(1475000),
 np.int64(451000),
 np.int64(15000000),
 np.int64(4760000),
 np.int64(664000),
 np.int64(1688000),
 np.int64(877000),
 np.int64(2925000),
 np.int64(1090000),
 np.int64(66000),
 np.int64(2327000),
 np.int64(43500000),
 np.int64(2540000),
 np.int64(120300000),
 np.int64(492000),
 np.int64(1516000),
 np.int64(705000),
 np.int64(1729000),
 np.int64(2966000),
 np.int64(101270000),
 np.int64(918000),
 np.int64(3990000),
 np.int64(500),
 np.int64(107000),
 np.int64(1131000),
 np.int64(8700),
 np.int64(123200000),
 np.int64(320000),
 np.int64(200000000),
 np.int64(72000000),
 np.int64(1344000),
 np.int64(2368000),
 np.int64(5440000),
 np.int64(533000),
 np.int64(1769999),
 np.int64(1770000),
 np.int64(746000),
 np.int64(1982999),
 np.int64(3007000),
 np.int64(3220000),
 np.int64(100500000),
 np.int64(148000),
 np.int64(1172000),
 np.int64(361000),
 np.int64(1385000),
 np.int64(467500),
 np.int64(574000),
 np.int64(3646000),
 np.int64(787000),
 np.int64(999999),
 np.int64(1000000),
 np.int64(52200000),
 np.int64(82500),
 np.int64(1213000),
 np.int64(189000),
 np.int64(1000008),
 np.int64(401999),
 np.int64(2450000),
 np.int64(402000),
 np.int64(1426000),
 np.int64(615000),
 np.int64(600),
 np.int64(3899999),
 np.int64(8800),
 np.int64(3900000),
 np.int64(1852000),
 np.int64(828000),
 np.int64(827999),
 np.int64(17000),
 np.int64(1041000),
 np.int64(123500),
 np.int64(229999),
 np.int64(230000),
 np.int64(5350000),
 np.int64(1254000),
 np.int64(443000),
 np.int64(98937),
 np.int64(1680000),
 np.int64(58000000),
 np.int64(6800000),
 np.int64(32400000),
 np.int64(656000),
 np.int64(2917000),
 np.int64(58000),
 np.int64(3130000),
 np.int64(8250000),
 np.int64(1082000),
 np.int64(100410000),
 np.int64(2106000),
 np.int64(1295000),
 np.int64(35300000),
 np.int64(4580000),
 np.int64(40420000),
 np.int64(2532000),
 np.int64(484000),
 np.int64(9700000),
 np.int64(1508000),
 np.int64(1721000),
 np.int64(909999),
 np.int64(910000),
 np.int64(6030000),
 np.int64(1934000),
 np.int64(11150000),
 np.int64(99000),
 np.int64(1123000),
 np.int64(115000000),
 np.int64(2360000),
 np.int64(12600000),
 np.int64(312000),
 np.int64(1336000),
 np.int64(8900),
 np.int64(525000),
 np.int64(1549000),
 np.int64(3810000),
 np.int64(738000),
 np.int64(1762000),
 np.int64(2786000),
 np.int64(1975000),
 np.int64(951000),
 np.int64(811000),
 np.int64(140000),
 np.int64(199820000),
 np.int64(15500000),
 np.int64(5260000),
 np.int64(353000),
 np.int64(3425000),
 np.int64(49900),
 np.int64(1590000),
 np.int64(566000),
 np.int64(3638000),
 np.int64(2827000),
 np.int64(779000),
 np.int64(992000),
 np.int64(3040000),
 np.int64(44000000),
 np.int64(2016000),
 np.int64(181000),
 np.int64(1205000),
 np.int64(4490000),
 np.int64(394000),
 np.int64(1418000),
 np.int64(90900),
 np.int64(607000),
 np.int64(819999),
 np.int64(820000),
 np.int64(72500000),
 np.int64(5940000),
 np.int64(52020000),
 np.int64(1844000),
 np.int64(9000),
 np.int64(1033000),
 np.int64(2269999),
 np.int64(2270000),
 np.int64(1246000),
 np.int64(7390000),
 np.int64(222000),
 np.int64(434999),
 np.int64(435000),
 np.int64(1459000),
 np.int64(101000000),
 np.int64(3720000),
 np.int64(1672000),
 np.int64(8840000),
 np.int64(648000),
 np.int64(33600),
 np.int64(24200000),
 np.int64(861000),
 np.int64(1885000),
 np.int64(50000),
 np.int64(1074000),
 np.int64(5170000),
 np.int64(3122000),
 np.int64(1287000),
 np.int64(1500000),
 np.int64(476000),
 np.int64(2524000),
 np.int64(2737000),
 np.int64(1713000),
 np.int64(3761000),
 np.int64(901999),
 np.int64(100230000),
 np.int64(2950000),
 np.int64(91000),
 np.int64(1115000),
 np.int64(4400000),
 np.int64(30000000),
 np.int64(106800000),
 np.int64(1328000),
 np.int64(304000),
 np.int64(2352000),
 np.int64(517000),
 np.int64(2565000),
 np.int64(9100),
 np.int64(730000),
 np.int64(5850000),
 np.int64(9946000),
 np.int64(132000),
 np.int64(7300000),
 np.int64(2180000),
 np.int64(1156000),
 np.int64(58500000),
 np.int64(345000),
 np.int64(2393000),
 np.int64(1369000),
 np.int64(3629999),
 np.int64(100910000),
 np.int64(3630000),
 np.int64(1795000),
 np.int64(771000),
 np.int64(983999),
 np.int64(10200000),
 np.int64(87000000),
 np.int64(5080000),
 np.int64(66500),
 np.int64(173000),
 np.int64(33740),
 np.int64(1410000),
 np.int64(386000),
 np.int64(11650000),
 np.int64(2647000),
 np.int64(811999),
 np.int64(2860000),
 np.int64(13100000),
 np.int64(1835999),
 np.int64(89900000),
 np.int64(1000),
 np.int64(1025000),
 np.int64(11265000),
 np.int64(107500),
 np.int64(9200),
 np.int64(40150000),
 np.int64(4310000),
 np.int64(214000),
 np.int64(426999),
 np.int64(17400),
 np.int64(2475000),
 np.int64(1451000),
 np.int64(640000),
 np.int64(16000000),
 np.int64(41600000),
 np.int64(5760000),
 np.int64(1664000),
 np.int64(25600),
 np.int64(852999),
 np.int64(2089999),
 np.int64(2090000),
 np.int64(42000),
 np.int64(1066000),
 np.int64(230421),
 np.int64(254999),
 np.int64(255000),
 np.int64(3327000),
 np.int64(8660000),
 np.int64(3540000),
 np.int64(468000),
 np.int64(2516000),
 np.int64(1492000),
 np.int64(1705000),
 np.int64(681000),
 np.int64(20350000),
 np.int64(894000),
 np.int64(83000),
 np.int64(2131000),
 np.int64(1107000),
 np.int64(1320000),
 np.int64(73000000),
 np.int64(6440000),
 np.int64(2770000),
 np.int64(722000),
 np.int64(1746000),
 np.int64(9300),
 np.int64(935000),
 np.int64(17500),
 np.int64(50300000),
 np.int64(101500000),
 np.int64(4220000),
 np.int64(124000),
 np.int64(24700000),
 np.int64(9340000),
 np.int64(2172000),
 np.int64(337000),
 np.int64(3409000),
 np.int64(2385000),
 np.int64(550000),
 np.int64(1574000),
 np.int64(3835000),
 np.int64(763000),
 np.int64(1999999),
 np.int64(155600000),
 np.int64(7120000),
 np.int64(2000000),
 np.int64(130000000),
 np.int64(27600000),
 np.int64(165000),
 np.int64(3450000),
 np.int64(378000),
 np.int64(8570000),
 np.int64(1402000),
 np.int64(1614999),
 np.int64(1615000),
 np.int64(803999),
 np.int64(4900000),
 np.int64(10020000),
 np.int64(2852000),
 np.int64(1017000),
 np.int64(1230000),
 np.int64(6350000),
 np.int64(1200),
 np.int64(206000),
 np.int64(11470000),
 np.int64(419000),
 np.int64(1443000),
 np.int64(7800000),
 np.int64(33400000),
 np.int64(2680000),
 np.int64(59000000),
 np.int64(632000),
 np.int64(17600),
 np.int64(844999),
 np.int64(25800),
 np.int64(845000),
 np.int64(4129999),
 np.int64(34000),
 np.int64(1058000),
 np.int64(4130000),
 np.int64(91347),
 np.int64(140500),
 np.int64(1271000),
 np.int64(247000),
 np.int64(459999),
 np.int64(460000),
 np.int64(61900000),
 np.int64(189900000),
 np.int64(10700000),
 np.int64(369888),
 np.int64(5580000),
 np.int64(2508000),
 np.int64(673000),
 np.int64(5793000),
 np.int64(1910000),
 np.int64(886000),
 np.int64(7030000),
 np.int64(75000),
 np.int64(3360000),
 np.int64(288000),
 np.int64(1312000),
 np.int64(90400000),
 np.int64(13600000),
 np.int64(501000),
 np.int64(1525000),
 np.int64(5621000),
 np.int64(2549000),
 np.int64(4810000),
 np.int64(714000),
 np.int64(1300),
 np.int64(2975000),
 np.int64(9500),
 np.int64(1140000),
 np.int64(16500000),
 np.int64(116000),
 np.int64(2164000),
 np.int64(4425000),
 np.int64(2590000),
 np.int64(1566000),
 np.int64(542000),
 np.int64(19998000),
 np.int64(755000),
 np.int64(4040000),
 np.int64(45000000),
 np.int64(968000),
 np.int64(157000),
 np.int64(1181000),
 np.int64(370000),
 np.int64(1394000),
 np.int64(583000),
 np.int64(1607000),
 np.int64(1820000),
 np.int64(22300000),
 np.int64(6940000),
 np.int64(2032999),
 np.int64(1009000),
 np.int64(2033000),
 np.int64(100550000),
 np.int64(198000),
 np.int64(3270000),
 np.int64(1222000),
 np.int64(1435000),
 np.int64(127600000),
 np.int64(50800000),
 np.int64(25200000),
 np.int64(9600),
 np.int64(102000000),
 np.int64(4720000),
 np.int64(624000),
 np.int64(1648000),
 np.int64(4933000),
 np.int64(6768000),
 np.int64(837000),
 np.int64(1050000),
 np.int64(6170000),
 np.int64(26000),
 np.int64(1263000),
 np.int64(345500),
 np.int64(451999),
 np.int64(2500000),
 np.int64(7620000),
 np.int64(1440),
 np.int64(452000),
 np.int64(665000),
 np.int64(75180),
 np.int64(877999),
 np.int64(3950000),
 np.int64(878000),
 np.int64(58800),
 np.int64(67000),
 np.int64(2115000),
 np.int64(1091000),
 np.int64(56600000),
 np.int64(31000000),
 np.int64(280000),
 np.int64(5400000),
 np.int64(20760000),
 np.int64(199960000),
 np.int64(181700),
 np.int64(3565000),
 np.int64(493000),
 np.int64(1730000),
 np.int64(6850000),
 np.int64(706000),
 np.int64(2754000),
 np.int64(812500),
 np.int64(919000),
 np.int64(1500),
 np.int64(8300000),
 np.int64(3180000),
 np.int64(108000),
 np.int64(2156000),
 np.int64(9700),
 np.int64(1132000),
 np.int64(1345000),
 np.int64(17900),
 np.int64(534000),
 np.int64(4630000),
 np.int64(75250),
 np.int64(1558000),
 np.int64(1771000),
 np.int64(959999),
 np.int64(960000),
 np.int64(11200000),
 np.int64(88000000),
 np.int64(1984000),
 np.int64(6293000),
 np.int64(149000),
 np.int64(2410000),
 np.int64(362000),
 np.int64(1386000),
 np.int64(3434000),
 np.int64(575000),
 np.int64(2623000),
 np.int64(14100000),
 np.int64(3860000),
 np.int64(65300000),
 np.int64(788000),
 np.int64(2025000),
 np.int64(1001000),
 np.int64(83500),
 np.int64(190000),
 np.int64(5310000),
 np.int64(1214000),
 np.int64(402999),
 np.int64(403000),
 np.int64(1427000),
 np.int64(1639999),
 np.int64(17000000),
 np.int64(1640000),
 np.int64(616000),
 np.int64(2664000),
 np.int64(1600),
 np.int64(2877000),
 np.int64(10045000),
 np.int64(18000),
 np.int64(3090000),
 np.int64(18450000),
 np.int64(58960),
 np.int64(1255000),
 np.int64(231000),
 np.int64(45500000),
 np.int64(5564000),
 np.int64(101820000),
 np.int64(444000),
 np.int64(50620000),
 np.int64(42600),
 np.int64(657000),
 np.int64(869999),
 np.int64(870000),
 np.int64(5990000),
 np.int64(1650),
 np.int64(1894000),
 np.int64(2107000),
 np.int64(59000),
 np.int64(1083000),
 np.int64(2320000),
 np.int64(272000),
 np.int64(17680000),
 np.int64(1296000),
 np.int64(99600000),
 np.int64(484999),
 np.int64(2533000),
 np.int64(485000),
 np.int64(101050000),
 np.int64(3770000),
 np.int64(698000),
 np.int64(1722000),
 np.int64(910999),
 np.int64(1935000),
 np.int64(911000),
 np.int64(100000),
 np.int64(102500000),
 np.int64(10340000),
 np.int64(5220000),
 np.int64(1124000),
 np.int64(2148000),
 np.int64(1337000),
 np.int64(313000),
 np.int64(108200),
 np.int64(3385000),
 np.int64(9900),
 np.int64(1550000),
 np.int64(526000),
 np.int64(3000000),
 np.int64(28600000),
 np.int64(952000),
 np.int64(1165000),
 np.int64(141000),
 np.int64(2189000),
 np.int64(354000),
 np.int64(4450000),
 np.int64(1378000),
 np.int64(567000),
 np.int64(2615000),
 np.int64(780000),
 np.int64(5900000),
 np.int64(1804000),
 np.int64(31500000),
 np.int64(2017000),
 np.int64(7137000),
 np.int64(2230000),
 np.int64(182000),
 np.int64(1206000),
 np.int64(1312500),
 np.int64(395000),
 np.int64(501500),
 np.int64(3679999),
 np.int64(8800000),
 np.int64(60000000),
 np.int64(188000000),
 np.int64(608000),
 np.int64(2656000),
 np.int64(3680000),
 np.int64(1632000),
 np.int64(6752000),
 np.int64(821000),
 np.int64(1800),
 np.int64(1845000),
 np.int64(10000),
 np.int64(10250000),
 np.int64(1034000),
 np.int64(2058000),
 np.int64(223000),
 np.int64(1460000),
 np.int64(57780000),
 np.int64(436000),
 np.int64(11700000),
 np.int64(88500000),
 np.int64(6580000),
 np.int64(649000),
 np.int64(1673000),
 np.int64(861999),
 np.int64(2910000),
 np.int64(1075000),
 np.int64(51000),
 np.int64(264000),
 np.int64(91400000),
 np.int64(1288000),
 np.int64(4360000),
 np.int64(476999),
 np.int64(2525000),
 np.int64(1501000),
 np.int64(690000),
 np.int64(5810000),
 np.int64(902999),
 np.int64(1107800),
 np.int64(1927000),
 np.int64(3975000),
 np.int64(92000),
 np.int64(2140000),
 np.int64(17500000),
 np.int64(1116000),
 np.int64(198500),
 np.int64(43100000),
 np.int64(68700000),
 np.int64(305000),
 np.int64(2353000),
 np.int64(5425000),
 np.int64(1900),
 np.int64(518000),
 np.int64(1542000),
 np.int64(3590000),
 np.int64(731000),
 np.int64(1755000),
 np.int64(18300),
 np.int64(5040000),
 np.int64(944000),
 np.int64(71600000),
 np.int64(26500),
 np.int64(133000),
 np.int64(1370000),
 np.int64(346000),
 np.int64(559000),
 np.int64(100100000),
 np.int64(2820000),
 np.int64(1796000),
 np.int64(772000),
 np.int64(984999),
 np.int64(985000),
 np.int64(67500),
 np.int64(174000),
 np.int64(1198000),
 np.int64(4270000),
 np.int64(2222000),
 np.int64(280500),
 np.int64(387000),
 np.int64(493499),
 np.int64(1623999),
 np.int64(600000),
 np.int64(103000000),
 np.int64(51800000),
 np.int64(26200000),
 np.int64(2861000),
 np.int64(813000),
 np.int64(1837000),
 np.int64(2000),
 np.int64(2050000),
 np.int64(1026000),
 np.int64(7170000),
 np.int64(108500),
 np.int64(10200),
 np.int64(215000),
 np.int64(3647448),
 np.int64(2014),
 np.int64(3500000),
 np.int64(23980000),
 np.int64(428000),
 np.int64(2476000),
 np.int64(1664999),
 np.int64(1665000),
 np.int64(641000),
 np.int64(6785000),
 np.int64(26600),
 np.int64(853999),
 np.int64(4950000),
 np.int64(854000),
 np.int64(1067000),
 np.int64(43000),
 np.int64(3115000),
 np.int64(1280000),
 np.int64(6400000),
 np.int64(160000000),
 np.int64(32000000),
 np.int64(1493000),
 np.int64(2730000),
 np.int64(682000),
 np.int64(67600),
 np.int64(100010000),
 np.int64(894999),
 np.int64(9300000),
 np.int64(84000),
 np.int64(2132000),
 np.int64(1108000),
 np.int64(142420000),
 np.int64(10274),
 np.int64(4180000),
 np.int64(2345000),
 np.int64(509999),
 np.int64(510000),
 np.int64(3582000),
 np.int64(1534000),
 np.int64(2558000),
 np.int64(2100),
 np.int64(5630000),
 np.int64(1747000),
 np.int64(723000),
 np.int64(10300),
 np.int64(1959999),
 np.int64(12200000),
 np.int64(1960000),
 np.int64(5032000),
 np.int64(7080000),
 np.int64(18500),
 np.int64(936000),
 np.int64(125000),
 np.int64(1149000),
 np.int64(1362000),
 np.int64(8530000),
 np.int64(338000),
 np.int64(3410000),
 np.int64(1575000),
 np.int64(551000),
 np.int64(2140),
 np.int64(764000),
 np.int64(1788000),
 np.int64(4860000),
 np.int64(2000999),
 np.int64(977000),
 np.int64(3025000),
 np.int64(59500),
 np.int64(10350),
 np.int64(1190000),
 np.int64(166000),
 np.int64(6310000),
 np.int64(18550),
 np.int64(1403000),
 np.int64(379000),
 np.int64(4475000),
 np.int64(26750),
 np.int64(18000000),
 np.int64(2640000),
 np.int64(43600000),
 np.int64(99920000),
 np.int64(592000),
 np.int64(1616000),
 np.int64(805000),
 np.int64(4090000),
 np.int64(9210000),
 np.int64(1018000),
 np.int64(207000),
 np.int64(2255000),
 np.int64(3279000),
 np.int64(1231000),
 np.int64(420000),
 np.int64(1444000),
 np.int64(46500000),
 np.int64(18600),
 np.int64(633000),
 np.int64(1869999),
 np.int64(1870000),
 np.int64(26800),
 np.int64(846000),
 np.int64(35000),
 np.int64(75000000),
 np.int64(100600000),
 np.int64(23800000),
 np.int64(3320000),
 np.int64(248000),
 np.int64(1272000),
 np.int64(1485000),
 np.int64(674000),
 np.int64(1501400),
 np.int64(887000),
 np.int64(1100000),
 np.int64(52300000),
 np.int64(76000),
 np.int64(26700000),
 np.int64(1313000),
 np.int64(18670),
 np.int64(501999),
 np.int64(2550000),
 np.int64(502000),
 np.int64(2290),
 np.int64(1526000),
 np.int64(3786999),
 np.int64(715000),
 np.int64(2300),
 np.int64(927999),
 np.int64(4000000),
 np.int64(80800000),
 np.int64(55200000),
 np.int64(928000),
 np.int64(10500),
 np.int64(117000),
 np.int64(2165000),
 np.int64(2066700),
 np.int64(330000),
 np.int64(5450000),
 np.int64(2378000),
 np.int64(1354000),
 np.int64(90442000),
 np.int64(10519),
 np.int64(543000),
 np.int64(1779999),
 np.int64(1780000),
 np.int64(6900000),
 np.int64(58100000),
 np.int64(32500000),
 np.int64(862500),
 ...}

def convert_ctc_to_lpa(ctc):
    """
    Converts CTC into Lakhs Per Annum (LPA)
    """

    if pd.isna(ctc):
        return np.nan

    # Monthly salary → convert to annual
    if ctc < 100000:
        annual_ctc = ctc * 12

    # Annual salary
    elif ctc <= 50000000:
        annual_ctc = ctc

    # Unrealistic values → mark as NaN
    else:
        return np.nan

    # Convert to LPA
    return annual_ctc / 100000


df['ctc_lpa'] = df['ctc'].apply(convert_ctc_to_lpa)

During EDA, observed that the raw CTC values were recorded in inconsistent units (monthly and annual), so we standardized them into Lakhs Per Annum using domain-driven rules before performing any outlier analysis or clustering.

Missing Values Handling (Mean / KNN Imputation)¶

df.isnull().sum()

Strategy:

Numerical → Mean or KNN Imputation

Categorical → Most frequent

num_cols = ['ctc_lpa', 'orgyear', 'ctc_updated_year']
cat_cols = ['company_hash', 'job_position']

num_imputer = KNNImputer(n_neighbors=5)
cat_imputer = SimpleImputer(strategy='most_frequent')

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

Checking the years¶

invalid_years = df[df['ctc_updated_year'] < df['orgyear']]
invalid_years.shape

(8847, 7)

invalid_years.sample(5)

👁️ Observation

Some records have CTC updated year earlier than organization joining year, which is logically inconsistent and indicates data quality issues.

Let assume salary update happened at joining.

df['ctc_updated_year'] = np.where(
    df['ctc_updated_year'] < df['orgyear'],
    df['orgyear'],
    df['ctc_updated_year']
)

We observed records where ctc_updated_year was earlier than orgyear, which is logically inconsistent. To maintain data integrity, we corrected such cases by ensuring that the CTC updated year is not earlier than the joining year.

invalid_years = df[df['ctc_updated_year'] < df['orgyear']]
invalid_years.shape

(0, 7)

Regex Cleaning (Company & Job Position)¶

def clean_text(text):
    return re.sub('[^A-Za-z0-9 ]+', '', str(text)).strip()

df['company_hash'] = df['company_hash'].apply(clean_text)
df['job_position'] = df['job_position'].apply(clean_text)

Regex removes tabs, special characters, and inconsistent naming issues.

df['job_position_clean'] = (df['job_position'].str.lower().str.strip())

df['job_position_clean'].unique()

array(['other', 'fullstack engineer', 'backend engineer', 'ios engineer',
       'data analyst', 'frontend engineer', 'qa engineer',
       'engineering leadership', 'data scientist', 'engineering intern',
       'android engineer', 'research engineers', 'sdet',
       'support engineer', 'security leadership', 'devops engineer',
       'product manager', 'program manager', 'cofounder',
       'member of technical staff 3', 'product designer',
       'system engineer', 'backend architect', 'database administrator',
       'senior  software engineer', 'release engineer', 'ase',
       'non coder', 'trainer', 'associate consultant', 'administrator',
       'software development engineer 2', 'sr', 'sr executive',
       'software engineer android', 'owner',
       'tool room engineer mechanical engineer', 'senior data scientist',
       'sde 2', 'project engineer', 'mean stack developer',
       'software development engineer  i', 'sde3', 'digital marketing',
       'app developer', 'principal consultant',
       'business analyst  product owner', 'senior business analyst',
       'architect team lead', 'graduate engineer  trainee',
       'technical consulting', 'machine learning engineer', 'software',
       '', 'instructor', 'worker', 'senior software engineer backend',
       'student', 'operations manager', 'senior engineer', 'toyota',
       'interndata analyst', 'senior data engineer', 'operation',
       'associate  software  engineer', 'technical manager',
       'senior sdet', 'escalation engineer', 'assistant professor',
       'software engineer i', 'exchange coordinator', 'associate',
       'senior qa engineer iii', 'process executive',
       'software engineer backend', 'assistant system engineer',
       'sde intern', 'intern', 'software development engineer  ii',
       'senior associate technology level1',
       'product development engineer', 'engineer i', 'l4',
       'solution engineer', 'application development analyst', 'typing',
       'engineer trainee', 'associate manager', 'cant reveal',
       'full stack web developer', 'sde1', 'lead analyst',
       'business consultant', 'sdet2', 'software  engineer',
       'software engineer ii', 'senior developer',
       'integration developer', 'networking', 'business analysts',
       'associate analyst', 'java devloper', 'traineeintern',
       'sales manager', 'lead engineer ii', 'webcenter consultant',
       'project associate', 'logistics supply chain',
       'data science analyst', 'software engineer 2b',
       'analyst programmer', 'software engineer 2',
       'senior software developer', 'entrepreneur',
       'atm interface system', 'sap abap developer', 'mts  ii',
       'business man', 'senior software engineeer', 'programmer analyst',
       'tech lead team lead', 'member of technical staff',
       'software engineer full stack', 'engineer', 'teacher', 'no job',
       'associate system engineer  trainee', 'front end developer',
       'user experience specialist', 'driver', 'reseller',
       'technical leader', 'senior lead analyst', 'executive operation',
       'basis administrator', 'risk analyst', 'senior system engineer',
       'sde 3', 'product engineer backend', 'catagory manager',
       'senior consultant', 'salesforce developer', 'senior associate',
       'power platform developer', 'freelance sde', 'quality enginner',
       'software engineeradvanced analytics', 'student intern',
       'automation developer', 'software engineer 1', 'project trainee',
       'engineer 1', 'senior administrator', 'deputy general manager',
       'senior software  development engineer', 'senior test engineer',
       'sde2', 'associate system engineer', 'computer scientist 1',
       'support escalation engineer', 'technical  consultant',
       'associate software engineer', 'hod', 'radio frequency engineer',
       'learning web designing', 'software analyst', 'ucm',
       'solutions architect', 'ml engineer',
       'senior staff software engineer', 'pa', 'not employed',
       'project management profession', 'pdp executive', 'fb service',
       'office management', 'risk model developer', 'nce',
       'engineering lead  ai', 'reseach analyst', 'mainframe developer',
       'software engineer  i', 'solution architect cloud and infra',
       '857628', '7', 'software developer automation', 'null',
       'erp application developer', 'ui architect', 'senior executive',
       'software development', 'research analyst', 'kam', 'maricketing',
       'ui developer intern', 'operations lead',
       'cloud software engineer', 'a group chat application',
       'senior manager it', 'sr software engineer', 'bta',
       'data engineer', 'member technical', 'senior front end engineer',
       'software developer data engineer', 'associate product lead',
       'software engineer analyst', 'programmer analyst 2',
       'senior systems engineer', 'not applicable', 'team lead',
       'hardware engineer', 'mts', 'assistant system engineer  trainee',
       'voice president', 'senior software engineer qa',
       'associate engineer', 'data eingineer', 'data analayst',
       'quality associate', 'consultant', 'sde 1',
       'software development consultant 1', 'lead software engineer',
       'programmer analyst trainee', 'mulesoft developer', 'se4',
       'cashier', 'data associate', 'compliance operation',
       'electric power supply', 'system admin', 'software engineer  ii',
       'others', 'assistant manager', 'senior web engineer',
       'application developer analyst', 'lead android developer',
       'teaching field', 'trainee', 'talent acquisition specialist',
       'finance and accounting', 'cto and co founder',
       'senior member technical staff',
       'software engineering  engineer 2', 'advisor',
       'associate processor', 'dot net developer',
       'site reliability engineer',
       'services it development program senior associate',
       'staff software engineer', 'full stack engineer', 'manager',
       'software prod  plat eng analyst', 'personal banker',
       'analytics and insite devloper', 'developer  production support',
       'platform engineer',
       'senior software development engineer in test',
       'software engineering intern', 'solutions designer',
       'fibre engineer', 'quantitative analyst', 'entrepreneurship',
       'automation engineer', 'administrative clerk', 'freelancer',
       'software development analyst', 'design specialist',
       'principal software engineer', 'loan processing', 'sr consultant',
       'principal solution architect', 'software developer intern',
       'lead  mis', 'embedded software development engineer', 'pop',
       'intern  software developer', 'assisatnt system engineer',
       'cloud security analyst', 'software developer  ios', 'sdet senior',
       'tech lead software engineer', 'lead', 'associate data scientist',
       'process associate', 'associate developer', 'sde ii',
       'technology associate', 'sr developer',
       'associate software developer', 'assistant software engineer',
       'research engineer', 'system software engineer', 'serd',
       'module lead software engineer', 'seniorsoftwareengineer',
       'sr lead engineer', 'android developer sr executive',
       'machine learning data associate', 'principal engineer  1',
       'business technology analyst', 'fresher', 'na', 'cloud',
       'computer faculty', 'employee', 'it executive', 'quality analyst',
       'safety officer',
       'some data entry operator like some copys writetype and upload',
       'managercx', 'seleceman', 'agency collection manager',
       'now iam not working waiting for job',
       'business development manager', 'qa intern', 'auditor',
       'graduate programmer', 'financial analyst',
       'cloud support associate', 'software development engineer  2',
       'senior manager', 'data engineer 2', 'cto', 'module lead',
       'business analytics', 'software development enginner',
       'automation test engineer', 'vice president', 'automation expert',
       'software developer  ui', 'software eng',
       'senior quantitative researcher', 'salesforce consultant',
       'devlopment engineer', 'senior software analyst',
       'associate product manager', 'firmware software engineer l',
       'tech architecture', 'any technical', 'young professional ii',
       'technical consultant', '896651', 'technology analyst',
       'operation executive', 'software  web developer',
       'full stack devloper', 'r  d engineer2', 'software engineer rd',
       'product develloper', 'software development intern',
       'member technical staff', 'software engineer iv',
       'operations analyst', 'scientist  c', 'software engineer iii',
       'sdei', 'senior development engineer',
       'application developmentaassociate', 'network administrator',
       'assosiate product manager', 'software developer  android',
       'lecturer', 'softwear engineer', 'engineet',
       'senior software development engineer ii', 'q',
       'software engineer  frontend', 'techops engineer 1',
       'customer service executive', 'eno',
       'system engineerfullstack developer', 'applications engineer',
       'graduate software developer', 'software developer ii',
       'summer technology analyst software developer intern',
       'senior product development engineer senior developer',
       'highway engineer', 'performance engineer',
       'senior software consultant', 'application developer frontend',
       'associate applications developer', 'software enginner',
       'sales and collection executive', 'software engineering coop',
       'software development engineer  intern', 'manual tester',
       'it analyst', 'research assistant',
       'senior fusion applications engineer', 'jr prod specialist',
       'pgt computer science', 'integration',
       'talent acquisition sspecialist', 'software developer 2',
       'asset management coordinator', 'software engineer in test',
       'front end engineer', 'leisure manager', 'pat',
       'core tech engineer  researcher', 'senior engineer software',
       'software enginnering specialist', 'android lead',
       'senior analysts', 'director engineering', 'aspnet developer',
       'product developer', 'area operations manager', 'test analyst',
       'so', 'server engineer', 'application development associate',
       'android application developer', 'risk investigator', 'x',
       'machine learning engineer intern', 'third party provider',
       'web designer', 'compliance auditor', '91',
       'senior software development engineer backend',
       'associate consultant  developer', 'messenger come driver',
       'associate technical engineer', 'senior associate platform l1',
       'sw engineer 2', 'senior web developer', 'pop engineer',
       'senior frontend developer', 'no',
       'senior member of technical staff', 'tsc', 'tx sme',
       'mechanical engineers', 'sps associate', 'staff engineermgr',
       'finance lead', 'seller support', 'data scientist ii',
       'software developer sde3', 'calculator', 'application engineer',
       'staff consultant', 'assistant engineer', 'human resource',
       'associate l1', 'rd engineer', 'test technician specialist',
       'software test engineer', 'sr technical architect', 'phd student',
       'frontend developer', 'ays', 'principal product engineer',
       'senior director of engineering', 'seller support associate',
       'software development engineer  iii', 'rd engineer ii',
       'matlab programmer', 'software development engineer intern',
       'assistant system engineer trainee', 'sdet 3',
       'associate application developer', 'project lead',
       'senior software engineerl2', 'big data developer',
       'associate technology', 'data specialist', 'pune',
       'instructional apprentice', 'trainee decision scientist',
       'teaching assistant', 'research assistant  software engineer',
       'sr solution architect', 'sr software engineerandroid',
       'sr product engineer', 'air force',
       'principal engineer and imaging leader', 'lead product engineer',
       'software engineer intern', 'sr representative', 'associate coder',
       'marketing professional', 'crc', 'specialist',
       'underwriting officer', 'fresher student',
       'business operations new associate', 'software engineerrd intern',
       'project leader', 'specialist programmer', 'smts',
       'technical support engineer', 'senior technical specialist',
       'senior mobile developer', 'vendor management', 'techinical lead',
       'key account manager', 'programmer analyst 1',
       'applied scientist ii', 'data visualization engineer', 'sde 4',
       'service delivery specialist', 'software developer 1',
       'delivery project lead', 'production engineering senior analyst',
       'database developer', 'senior chat associate', 'systems analyst',
       'oracle dba', 'solutions specialist', 'senior technical assistant',
       'web designerdeveloper intern', 'credit risk manager',
       'web application developer', 'staff developer',
       'software engineer iii  sde2', 'advisory software engineer',
       'growth manager', 'executive software developer',
       'software engineer  testing', 'sr ios engineer',
       'engineering program manager',
       'member of technical staff at nineleaps',
       'mobile application developer', 'specialist programmer pp',
       'associate data engineer', 'engineering manager ii',
       'advisory system analyat', 'customer support specialist',
       'business development executive', 'it coordinator',
       'technical head', 'asst manager', 'mts3', 'engeneering lead',
       'software development engineer 3', 'forex',
       'java microservices developer', 'teaching', 'coo', 'telar',
       'senior software engineer  infrastructure', 'software consultant',
       'web developer full stack', 'business process lead',
       'senior quantitative analyst',
       'senior business operations analyst',
       'senior applications engineer', 'software qa engineer',
       'senior ui enginner', 'senior software engineer net backend',
       'sdeii', 'product specialist', 'application engineer ii',
       'program manager intern', 'head scientist',
       'ios application developer', 'implementation engineer',
       'cofounderg', 'sdet  intern', 'sde lead', 'junior front end eng',
       'sales professional', 'account', 'computer scientist 2',
       'member of technical staff 4', 'sr associate platform l2',
       'advisory consultant  uiux expert',
       'associate director online products', 'technical associate',
       'ios software developer', 'abap developer', 'sr mis executive',
       'lead consultant technology', 'info technology sr associate',
       'associate technical support', 'lead software engineer in test',
       'qaqc engineer', 'lead ui engineer',
       'software developerapplication developer', 'se ii',
       'application developer 1', 'principle member of technical staff',
       'senior analyst', 'software engineer traineeml', 'pmo',
       'associate futures engineering team', 'researcher', 'edp admin',
       'zomato', 'software development engineer iii',
       'software development engineer', 'online reputation manager',
       'farmer', 'relationship officer', 'middleware engineer',
       'senior manager online products', 'senior tech architect', 'bda',
       'junior software developer', 'linux administrator',
       'junior developer', 'solution architect  apprentice',
       'it mmanager', 'project officer', 'bdm',
       'sap c4c functional consultant',
       'lead engineer  software development', 'graphic designer',
       'jharkhand', 'associate project manager', 'technical officer',
       'design engineer', 'recovery officer',
       'package iimplementation associate', 'erp consultant',
       'senior engineer aiml', 'assistant regional sales manager',
       'cloud data architect', 'technical program program',
       'application developer 2', 'consultant development',
       'oracle apps technical consultant', 'android facilitator',
       'internship', 'intern developer', 'power plant engineer',
       'content manager intern', 'etl developer', 'technical test lead',
       'machine learning intern', 'senior analyst programmer',
       'sr software engg  android', 'professional',
       'systems operations engineer', 'lead software developer',
       'business development', 'ba', 'sde  3', 'technology architect',
       'decision analytics associate', 'engineering team lead', 'n',
       'strategy and new initiatives', 'senior qa engineer',
       'erp associate', 'associate professional',
       'assistant manager  capital market', 'senior sofware engineer',
       'senior webmethods developer', 'principal applications engineer',
       'software engineer  2', 'data warehouse developer',
       'fusion applications engineer', 'digital content associate',
       'technology specialist', 'maintenance engineer',
       'network function virtualisation', 'software engineer 3',
       'unempolyed', 'member of technical staff java',
       'senior application engineer', 'automation architect',
       'software engineer 2 backend', 'wordpress developer',
       'oracle lead dba', 'management executive',
       'sr operation executive', 'associate professor',
       'project assistant', 'network software engineer',
       'manager candidate', 'associate professional developer',
       'technical engineer', 'member of technical staff 2',
       'lead software development engineer', 'sales', 'insurance agent',
       'data entry', 'senior solution engineer', 'flutter app engineer',
       'senior android developer', 'senior software engineer front end',
       'application development', 'retail asscociate', 'consultanat',
       'branch manager', 'tester', 'digital marketing manager',
       'senior software engineer lead', 'assistant system enginner',
       'rd enginner ii', 'software engineer  operations', 'mts2',
       'developer asociate', 'professional services engineer',
       'senior enginner', 'member technical staff 2', 'sr hr executive',
       'software developer i', 'technology analyst ii', 'professor',
       '737', 'tech person', 'engineer 2', 'test automation engineer',
       'servicenow developer', 'sr software engg',
       'application development senior analyst',
       'software engineer 2 full stack', 'quality engineer',
       'program analyst trainee', 'technician', 'ta',
       'mechanical engineering', 'security delivery specialist',
       'senior frontend engineer', 'react native developer',
       'python devloper', 'business analyst lead',
       'electrical technician', 'member technical staff zoho',
       'software developer  pl10', 'developer 2', 'technical mentor',
       'growth', 'data scientist 2', 'sde  ii',
       'software developer consultant', 'audit manager',
       'lead consultant', 'snr analyst programmerjava developer',
       'fullstack web developer', 'quality assurance manager',
       'student in computer application', 'city transport lead',
       'senior devops', 'software development engineer in test',
       'integration engineer', 'transaction risk investigator',
       'subject matter expert', 'business system analyst',
       'software engineering lead', 'firmware engineer',
       'hvac algorithm developer', 'area business manager',
       'software dev engineer 1', 'senior software engineer 1',
       'associate system engineertrainee', 'software associate',
       'automation test enginner', 'security intern',
       'research intern  machine learning', 'bengaluru',
       'security researcher', 'r  d engineer', 'junior web developer',
       'backend engineering', 'graduate research assistant',
       'embedded software engineer', 'member of technical staff mts',
       'member technical staff 1', 'member of technical staff  2',
       'software engineering contractor', 'team lead mobile',
       'applications engineer 2', 'junior associate technology',
       'credit risk', 'senior software development enggineer',
       'associate technical lead', 'senior research associate',
       'software engineering', 'sr technical manager',
       'mobile application engineerandroid  ios',
       'software development engineer 1', 'software development manager',
       'java back end developer', 'developer associate',
       'senior associate  index technology  production support',
       'senior software development engineer', 'team manager',
       'associate it engineer', 'medical field', 'security analyst',
       'back office executive admin', 'sharepoint developer',
       'deputy manager', 'international coding instructor',
       'chief people officer', 'software engineer associate',
       'software qa', 'tuition teacher', 'technology consultant',
       'assosiate software engineer', 'teammate',
       'data operations manager', 'freelance', 'software devloper',
       'assistant', 'software development specialist',
       'senior process executive', 'reasearch engineer', 'it engineer',
       'member technial staff', 'solution integration consultant i',
       'ios  swift developer', 'devopscloud engineer',
       'associate tech lead', 'head of engineering',
       'associate interactive developer l2', 'linux system administrator',
       'software development engineer iv', 'cyber security engineer',
       'operations expert', 'shift controller', 'subinspector', 'h',
       'soft skill trainer at skill india government project',
       'secretary', 'sap technical consultant', 'some random title',
       'professional application delivery i', 'senior consulant',
       'senior member of technical staff  rd  machine learning',
       'chief technology officer', 'co foundercto',
       'software engineer l2', 'machine learning developer',
       'web engineer', 'iot consultant', 'tecchnical lead',
       'software engineer   programmer analyst',
       'senior database engineer', 'tech manager',
       'principal application engineer', 'sap cpq consultant', 'clerk',
       'mts 3', 'founders office', 'research engineer 2',
       'data engineer iii', 'self job', 'it trainee',
       'application development team lead', 'mts intern',
       'software engineering analyst', 'graduate intern',
       'associate application engineer', 'assistant engineer trainee',
       'front end web developer', 'associate principal engineer',
       'junior consultant', 'senior software engineering',
       'senior system specialist', 'lead design engineer', 'sre',
       'front end dev', 'senior software test engineer', 'auditing',
       'operation associate', '7033771951', 'storw', 'support',
       'senior project engineer', 'head of analytics',
       'full stack developer intern', 'ai engineer', 'programmar analyst',
       'software developer grade1', 'analyst consultant', 'new graduate',
       'pragrammer analyst', 'member of technical staff level 2',
       'senior mobile applications developer androidios',
       'trainee engineer java developer', 'iaas  engineer',
       'cloud engineer', 'team lead  android',
       'software development engineering intern', 'technical architecct',
       'qaeintern', 'java software engineer', 'product solution engineer',
       'jr software engineer', 'software dev engineer',
       'sr ios developer', 'software engineer android intern',
       'sql plsql developer', 'graduate engineer trainee',
       'core engineer software developer', 'analyst  software engineer',
       'dataproduct engineer', 'applied scientist',
       'member technical staff iii', 'web  ui designer',
       'azure data factory'], dtype=object)

def map_job_category(title):
    if pd.isna(title) or title in ['', 'na', 'null', 'no', 'not applicable']:
        return 'Other / Non-Tech'

    title = title.lower()

    # Data roles
    if any(x in title for x in [
        'data scientist', 'data analyst', 'data engineer',
        'machine learning', 'ml', 'ai', 'analytics', 'applied scientist'
    ]):
        return 'Data'

    # Software engineering roles
    if any(x in title for x in [
        'software engineer', 'developer', 'sde', 'full stack',
        'backend', 'frontend', 'android', 'ios', 'mobile',
        'web developer', 'programmer', 'fullstack'
    ]):
        return 'Software Engineer'

    # QA / Testing
    if any(x in title for x in [
        'qa', 'test engineer', 'testing', 'sdet', 'automation'
    ]):
        return 'QA / Testing'

    # DevOps / Infra
    if any(x in title for x in [
        'devops', 'sre', 'site reliability', 'cloud',
        'platform engineer', 'infra', 'network'
    ]):
        return 'DevOps / Infra'

    # Product / Management
    if any(x in title for x in [
        'product manager', 'program manager', 'manager',
        'lead', 'director', 'head', 'cto', 'vp', 'architect'
    ]):
        return 'Product / Program / Manager'

    # Design roles
    if any(x in title for x in [
        'designer', 'ui', 'ux'
    ]):
        return 'Design'

    # Intern / Student
    if any(x in title for x in [
        'intern', 'trainee', 'student', 'apprentice', 'graduate'
    ]):
        return 'Intern / Student'

    # Business / Ops / Support
    if any(x in title for x in [
        'business', 'operations', 'support', 'consultant',
        'analyst', 'sales', 'marketing', 'hr'
    ]):
        return 'Business / Operations / Support'

    return 'Other / Non-Tech'

df['job_category'] = df['job_position_clean'].apply(map_job_category)

df['job_category'].value_counts()

df[df['job_position']=='Other'].shape[0]

18072

df[df['job_category'] == 'Other / Non-Tech'].sample(5)

Duplicate Handling¶

df.duplicated().sum()

np.int64(8717)

df.drop_duplicates(inplace=True)

org_rows = df.shape[0]

df.shape

(197126, 9)

Feature Engineering – Years of Experience¶

df['years_of_exp'] = df['ctc_updated_year'] - df['orgyear']

df['years_of_exp'].describe()

df[df['years_of_exp'] == 2021]

df[df['years_of_exp'] == -18146]

df[(df['orgyear'] > 2025) | (df['orgyear'] < 1970)].sample(10)

From the above it seems that there is an error in the org year. We will remove these by doing outlier treatment.

Outlier treatment¶

def outlier_treatment(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3-q1

    lb = q1 - (1.5*iqr)
    ub = q3 + (1.5*iqr)

    outliers = data[(data[col]<lb) | (data[col]>ub)]

    print(f"Column: {col}")
    print(f"Lower Bound: {lb}")
    print(f"Upper Bound: {ub}")
    print(f"Number of outliers: {outliers.shape[0]}")
    print("\nOutliers:")
    print(outliers)

def remove_outliers(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1

    lb = q1 - 1.5 * iqr
    ub = q3 + 1.5 * iqr

    before = data.shape[0]

    data = data[(data[col] >= lb) & (data[col] <= ub)]

    after = data.shape[0]

    print(f"Number of rows before removal from {col}: {before}")
    print(f"Outliers removed from {col}: {before - after}")
    print(f"Number of rows after removal from {col}: {after}\n")
    return data

for col in num_cols:
    outlier_treatment(df, col)

Column: ctc_lpa
Lower Bound: -11.0
Upper Bound: 33.0
Number of outliers: 12047

Outliers:
                company_hash  \
25            nxbto xzntqztn   
38                evzzxt bvt   
93             xb  v onhatzn   
118                 bgzgnjut   
189                     tmvj   
...                      ...   
205817          wxowg ojntbo   
205821  rxnyxhb ntwyzgrgsxto   
205825                vbvkgz   
205833                vbvkgz   
205841        zgn vuurxwvmrt   

                                               email_hash  orgyear       ctc  \
25      d2668cb959e5657c3881413257b9850caa1359c7ce959a...   2015.0   9500000   
38      69ef6838be8ee5b628375b4cc160ba54c1f7cab8c3b130...   2006.0  11800000   
93      2b0daf6243847309e75a47fb22e9e3dc3f79031edf464a...   2021.0   5000000   
118     a37740bc0cedb23fd029276cbb7f4e821cca7df8dd0f70...   2011.0   6300000   
189     0ba214e9b65397dd7815b2909c714240781c590b2dc7d0...   2015.0  12000000   
...                                                   ...      ...       ...   
205817  a74c83782bc88928dd12b103c09697fbef87d15bd308d3...   2011.0   3327000   
205821  3def1401b5a8879eeb4a8465663a9c6c66702c85444d7e...   2012.0   3500000   
205825  95023bca0172ad67bfc3453550c5cf056557bc2c8c7169...   2016.0   4800000   
205833  f4415be48a1ef885e086dcd72181f667a289641e66f828...   2014.0   3800000   
205841  fb46a1a2752f5f652ce634f6178d0578ef6995ee59f6c8...   2019.0   5100000   

              job_position  ctc_updated_year  ctc_lpa  job_position_clean  \
25        Backend Engineer            2019.0    95.00    backend engineer   
38             QA Engineer            2019.0   118.00         qa engineer   
93        Android Engineer            2021.0    50.00    android engineer   
118       Backend Engineer            2019.0    63.00    backend engineer   
189     FullStack Engineer            2020.0   120.00  fullstack engineer   
...                    ...               ...      ...                 ...   
205817    Backend Engineer            2019.0    33.27    backend engineer   
205821    Backend Engineer            2016.0    35.00    backend engineer   
205825    Backend Engineer            2020.0    48.00    backend engineer   
205833    Backend Engineer            2019.0    38.00    backend engineer   
205841    Backend Engineer            2019.0    51.00    backend engineer   

             job_category  years_of_exp  
25      Software Engineer           4.0  
38           QA / Testing          13.0  
93      Software Engineer           0.0  
118     Software Engineer           8.0  
189     Software Engineer           5.0  
...                   ...           ...  
205817  Software Engineer           8.0  
205821  Software Engineer           4.0  
205825  Software Engineer           4.0  
205833  Software Engineer           5.0  
205841  Software Engineer           0.0  

[12047 rows x 10 columns]
Column: orgyear
Lower Bound: 2005.5
Upper Bound: 2025.5
Number of outliers: 7600

Outliers:
                     company_hash  \
20            pqgzgo xzwgqugqvnta   
52             wqgoogctq egq fgqp   
306                        xznhxn   
385               xmb wvzvav rvmo   
392     utgurt10 ntwyzgrgsxto xzw   
...                           ...   
205654                   tqxwoogz   
205667                    nfxnntq   
205686            mvzp ge vbtqxwv   
205693                       uvjh   
205803              wxowg ojontbo   

                                               email_hash  orgyear      ctc  \
20      b7d0b9cd894ab871c547063df449d03e4138050c0463c6...   2003.0   600000   
52      70656ece5fd6fc7c81783c6a83b8b34302c8d89d1e5e4c...   2004.0    60000   
306     466948a62063ec83e128d9d47df4795df4d59d953ad3e0...   2005.0  1000000   
385     6c95f6eb2b19d0375405cdfb99a420461f432e96cc92d4...   2004.0   105000   
392     83c951dfe2e20c805d4ce917e3212b6885b8d0a2416a03...   2005.0    34000   
...                                                   ...      ...      ...   
205654  e6e7a1f22cb01cf95650076c7dc8c6858f14778b7079ce...   1999.0  6500000   
205667  40b12fe32800569a02e210b92019479800c4ead15d5241...   1996.0   166000   
205686  46aba3e728d1bac5913ee45b0f5251db0dbeef3cf23bd5...   2002.0   120000   
205693  a1543b6aaed65b7fec970489fb0d792b0c55919f1fcc55...   2005.0  2500000   
205803  97f9acab2edf988fcec48e072b88f3731affd6db4487cb...   2005.0   200000   

                  job_position  ctc_updated_year  ctc_lpa  \
20                 QA Engineer            2021.0     6.00   
52            Backend Engineer            2019.0     7.20   
306     Engineering Leadership            2019.0    10.00   
385           Backend Engineer            2019.0     1.05   
392     Engineering Leadership            2019.0     4.08   
...                        ...               ...      ...   
205654        Backend Engineer            2017.0    65.00   
205667        Backend Engineer            2017.0     1.66   
205686        Backend Engineer            2016.0     1.20   
205693        Backend Engineer            2017.0    25.00   
205803        Backend Engineer            2019.0     2.00   

            job_position_clean                 job_category  years_of_exp  
20                 qa engineer                 QA / Testing          18.0  
52            backend engineer            Software Engineer          15.0  
306     engineering leadership  Product / Program / Manager          14.0  
385           backend engineer            Software Engineer          15.0  
392     engineering leadership  Product / Program / Manager          14.0  
...                        ...                          ...           ...  
205654        backend engineer            Software Engineer          18.0  
205667        backend engineer            Software Engineer          21.0  
205686        backend engineer            Software Engineer          14.0  
205693        backend engineer            Software Engineer          12.0  
205803        backend engineer            Software Engineer          14.0  

[7600 rows x 10 columns]
Column: ctc_updated_year
Lower Bound: 2016.0
Upper Bound: 2024.0
Number of outliers: 1759

Outliers:
                           company_hash  \
2211                             phrxkv   
3651                          wgszxkvzn   
10076                           xzegojo   
11081   exqon vacvznvst uqxcvnt rxbxnta   
19011                            gqvwrt   
...                                 ...   
205747       nyxzsfgqd v unw ntwyzgrgsj   
205748                           gnytqo   
205778                           gqvwrt   
205806                  ztdnowb xzwqtee   
205828                          cxqnhov   

                                               email_hash  orgyear      ctc  \
2211    3394674bb6bb1de6289e931853fa0bd131c811e0054a92...   2031.0  1500000   
3651    2cc6bae4e52677d27ce3fca38d7a01ecbe537e1dc1c48d...   2106.0   600000   
10076   4c171381270155fb87b885f89cd71ca37ebbb8fd9da58b...   2025.0   360000   
11081   d6df76c2b61fa3a068e4e3812be12a58f86f78a31fe888...   2029.0   310000   
19011   651e571741251182066dd1a7d5faaedd31a77eb25b8c08...   2010.0  1600000   
...                                                   ...      ...      ...   
205747  b06f730fb9012de1a5204b302391980e75d99fd471787a...   2010.0   170000   
205748  6a2bf858e0ed2829d21d994ce1e04fa5b8cf6c52662179...   2015.0   550000   
205778  dc896e8fb7bff82ff5c059a63c23bd9e4e8f9d6125b981...   2014.0  1268000   
205806  29f02bc29797857294cfcd44273ad42f64281ea79203f0...   2014.0  5266000   
205828  90a5e17931d705e48bf6a36144cad0954e7444574a7cf6...   2014.0  1200000   

            job_position  ctc_updated_year  ctc_lpa job_position_clean  \
2211    Backend Engineer            2031.0    15.00   backend engineer   
3651    Backend Engineer            2106.0     6.00   backend engineer   
10076              Other            2025.0     3.60              other   
11081              Other            2029.0     3.10              other   
19011   Backend Engineer            2015.0    16.00   backend engineer   
...                  ...               ...      ...                ...   
205747  Backend Engineer            2015.0     1.70   backend engineer   
205748  Backend Engineer            2015.0     5.50   backend engineer   
205778  Backend Engineer            2015.0    12.68   backend engineer   
205806  Backend Engineer            2015.0    52.66   backend engineer   
205828  Backend Engineer            2015.0    12.00   backend engineer   

             job_category  years_of_exp  
2211    Software Engineer           0.0  
3651    Software Engineer           0.0  
10076    Other / Non-Tech           0.0  
11081    Other / Non-Tech           0.0  
19011   Software Engineer           5.0  
...                   ...           ...  
205747  Software Engineer           5.0  
205748  Software Engineer           0.0  
205778  Software Engineer           1.0  
205806  Software Engineer           1.0  
205828  Software Engineer           1.0  

[1759 rows x 10 columns]

for col in ['orgyear', 'ctc_updated_year']:
    df = remove_outliers(df, col)

Number of rows before removal from orgyear: 197126
Outliers removed from orgyear: 7600
Number of rows after removal from orgyear: 189526

Number of rows before removal from ctc_updated_year: 189526
Outliers removed from ctc_updated_year: 1669
Number of rows after removal from ctc_updated_year: 187857

# For CTC we will cap the outliers instead of removing them.
def cap_outliers(data, col):
    q1 = data[col].quantile(0.25)
    q3 = data[col].quantile(0.75)
    iqr = q3 - q1

    lb = q1 - 1.5 * iqr
    ub = q3 + 1.5 * iqr

    data[col] = np.clip(data[col], lb, ub)
    return data

df = cap_outliers(df, 'ctc_lpa')

rows_after_clean = df.shape[0]

print(f"Percentage of data removed after outlier removal: {round((org_rows-rows_after_clean)/org_rows *100, 2)}%")

Percentage of data removed after outlier removal: 4.7%

Manual Clustering¶

Company + Job + Experience CTC Summary¶

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 187857 entries, 0 to 205842
Data columns (total 10 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   company_hash        187857 non-null  object 
 1   email_hash          187857 non-null  object 
 2   orgyear             187857 non-null  float64
 3   ctc                 187857 non-null  int64  
 4   job_position        187857 non-null  object 
 5   ctc_updated_year    187857 non-null  float64
 6   ctc_lpa             187857 non-null  float64
 7   job_position_clean  187857 non-null  object 
 8   job_category        187857 non-null  object 
 9   years_of_exp        187857 non-null  float64
dtypes: float64(4), int64(1), object(5)
memory usage: 15.8+ MB

summary_cje = (
    df.groupby(['company_hash', 'job_position', 'years_of_exp'])
      .agg(
          avg_ctc=('ctc', 'mean'),
          min_ctc=('ctc', 'min'),
          max_ctc=('ctc', 'max'),
          count=('ctc', 'count')
      )
      .reset_index()
)

df = df.merge(summary_cje,
              on=['company_hash', 'job_position', 'years_of_exp'],
              how='left')

df.sample(5)

Designation Flag¶

Logic

1 → Above average
2 → Around average
3 → Below average

def designation_flag(row):
    if row['ctc'] > row['avg_ctc']:
        return 1
    elif row['ctc'] < row['avg_ctc']:
        return 3
    else:
        return 2

df['Designation'] = df.apply(designation_flag, axis=1)
df.sample(5)

Designation 1 learners are outperformers within same role & experience.

Class Flag (Company + Job Position)¶

summary_cj = (
    df.groupby(['company_hash', 'job_position'])
      .agg(avg_ctc_class=('ctc', 'mean'))
      .reset_index()
)

df = df.merge(summary_cj, on=['company_hash', 'job_position'], how='left')

df['class'] = np.where(
    df['ctc'] > df['avg_ctc_class'], 1,
    np.where(df['ctc'] < df['avg_ctc_class'], 3, 2)
)

Tier Flag (Company Level)¶

summary_company = (
    df.groupby('company_hash')
      .agg(avg_ctc_company=('ctc', 'mean'))
      .reset_index()
)

df = df.merge(summary_company, on='company_hash', how='left')

df['Tier'] = np.where(
    df['ctc'] > df['avg_ctc_company'], 1,
    np.where(df['ctc'] < df['avg_ctc_company'], 3, 2)
)

Visual Analysis¶

Univariate Analysis¶

for col in num_cols:
    plt.figure(figsize=(6,4))
    sns.histplot(df[col], kde=True)
    plt.title(f"Distribution of {col.upper()}")
    plt.show()

for col in num_cols:
    plt.figure(figsize=(6,4))
    sns.boxplot(y=df[col])
    plt.title(f"Boxplot of {col.upper()}")
    plt.show()

“CTC is right-skewed”

“Outliers significantly reduced after cleaning”

Categorical Variable Analysis¶

for col in cat_cols:
    plt.figure(figsize=(10,4))
    df[col].value_counts().head(10).plot(kind='bar')
    plt.title(f"Top categories in {col.upper()}")
    plt.show()

Bivariate Analysis¶

CTC vs Experience¶

plt.figure(figsize=(12,6))
sns.boxplot(x=round(df['years_of_exp']), y='ctc_lpa', data=df)
plt.xticks(rotation=45)
plt.title('CTC Distribution by Years of Experience')
plt.ylabel('CTC LPA')
plt.xlabel('Years of Experience')
plt.show()

CTC increases with experience, but the widening salary spread and presence of high outliers across all experience levels indicate that role and company significantly influence compensation, making clustering a suitable approach.

CTC vs Job position¶

plt.figure(figsize=(12,6))
sns.boxplot(x='job_category', y='ctc_lpa', data=df)
plt.xticks(rotation=45)
plt.title('CTC Distribution by Job position')
plt.ylabel('CTC LPA')
plt.xlabel('Job position')
plt.show()

Product / Program / Manager roles command the highest median CTC, while Intern/Student and Business/Operations roles earn the lowest, with Software Engineering and DevOps showing strong but more variable pay growth.

Top 10 employees (Tier 1)¶

df[df['Tier'] == 1].sort_values('ctc_lpa', ascending=False).head(10)

Bottom 10 employees (Tier 3)¶

df[df['Tier'] == 3].sort_values('ctc_lpa').head(10)

Top 10 Data Science employees per company¶

df[
    (df['job_position'].str.contains('Data', case=False)) &
    (df['class'] == 1)
].sort_values('ctc', ascending=False).groupby('company_hash').head(10)

Top 10 Companies by CTC¶

df.groupby('company_hash')['ctc'].mean().sort_values(ascending=False).head(10)

Top 2 Positions per Company¶

df.groupby(['company_hash', 'job_position'])['ctc'].mean() \
  .reset_index() \
  .sort_values(['company_hash', 'ctc'], ascending=[True, False]) \
  .groupby('company_hash').head(2)

Across companies, technical roles such as Backend and Full-Stack Engineering consistently appear among the top two highest-paying positions based on average CTC.

Statistical Analysis¶

Normality test¶

from scipy.stats import shapiro

sample_ctc = df['ctc_lpa'].sample(5000, random_state=42)
stat, p = shapiro(sample_ctc)

if p < 0.05:
    print(f"p-Value is {p}. Hence CTC is not normally distributed")
else:
    print(f"p-Value is {p}. Hence CTC is normally distributed")

p-Value is 1.0426914677433449e-52. Hence CTC is not normally distributed

Since CTC is not normally distributed, we will use non-parametric tests.

Kruskal–Wallis Test¶

alpha = 0.05

CTC vs Job position¶

Hypotheses:

H₀: Median CTC is the same across job categories
H₁: At least one category differs

groups = [
    group['ctc_lpa'].values
    for _, group in df.groupby('job_position')
]

stat, p = kruskal(*groups)

if p < alpha:
    print("Kruskal-Wallis p-value:", p)
    print("Salary differs significantly by job position")

else:
    print("Kruskal-Wallis p-value:", p)
    print("No significant difference")

Kruskal-Wallis p-value: 0.0
Salary differs significantly by job position

CTC differs significantly across job categories/positions, validating the visual salary gaps observed.

CTC vs Experience¶

corr, p = pearsonr(df['years_of_exp'], df['ctc_lpa'])

print("Pearson Correlation:", corr)
print("p-value:", p)

Pearson Correlation: 0.31235430225317334
p-value: 0.0

corr, p = spearmanr(df['years_of_exp'], df['ctc_lpa'])

print("Spearman Correlation:", corr)
print("p-value:", p)

Spearman Correlation: 0.30573054623289775
p-value: 0.0

There is a statistically significant positive relationship between experience and CTC.

Tiers vs Salary Distributions¶

tier_groups = [
    df[df['Tier'] == t]['ctc_lpa']
    for t in df['Tier'].unique()
]

stat, p = kruskal(*tier_groups)

if p < 0.05:
    print(f"p-Value is {p}. Salary distributions differ significantly across Tier groups.")
else:
    print(f"p-Value is {p}. Salary distributions not differ across Tier groups.")

p-Value is 0.0. Salary distributions differ significantly across Tier groups.

Since CTC is highly skewed and non-normally distributed, non-parametric statistical tests were applied.

Kruskal–Wallis tests confirmed that salary distributions differ significantly across job categories and tier levels.

Spearman correlation showed a statistically significant positive relationship between years of experience and CTC.

These results validate the patterns observed during exploratory data analysis.

ML Development¶

Data Processing for Unsupervised Learning¶

df.sample(5)

company_le = LabelEncoder()
job_le = LabelEncoder()
scaler = MinMaxScaler()

df['company_enc'] = company_le.fit_transform(df['company_hash'])
df['job_enc'] = job_le.fit_transform(df['job_position_clean'])

df['ctc_scaled'] = scaler.fit_transform(df[['ctc_lpa']])

X = df[['company_enc', 'job_enc', 'ctc_scaled', 'years_of_exp']]

X.head(5)

Elbow Method¶

inertia = []
K = range(1, 11)

for k in K:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)

kneedle = KneeLocator(
    K,
    inertia,
    curve='convex',
    direction='decreasing'
)

elbow_k = kneedle.elbow
elbow_inertia = inertia[K.index(elbow_k)]

plt.plot(K, inertia, marker='o')
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.title("Elbow Method")

plt.annotate(
    f'Elbow Point (K={elbow_k})',
    xy=(elbow_k, elbow_inertia),
    xytext=(elbow_k + 1, elbow_inertia + 30000),
    arrowprops=dict(facecolor='red', arrowstyle='->'),
    fontsize=10
)

plt.show()

We selected $K = 3$ because the inertia drops sharply up to $3$ clusters, after which the curve flattens.
This indicates diminishing returns beyond $K = 3$, making it the optimal balance between cluster compactness and model simplicity.

ML Model¶

Silhouette score¶

def sil_score(x, labels, sample_frac=0.1, random_state=42):
    sample_size = int(0.1 * X.shape[0])
    X_sample, labels_sample = resample(
        x,
        labels,
        n_samples=sample_size,
        random_state=random_state,
        stratify=labels
    )

    score = silhouette_score(X_sample, labels_sample)

    return f"Silhouette Score (10% sample): {score:.4f}"

KMeans++¶

kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)

np.random.seed(45)
kmeans.fit(X)
klabels = kmeans.fit_predict(X)

sil_score(X, klabels)

'Silhouette Score (10% sample): 0.6158'

X_km = X.copy()

X_km['klabels']=klabels

X_km.head(5)

Hierarchical Clustering (Sample)¶

np.random.seed(45)
sample = X[:1000]
linked = linkage(sample, method='ward')

plt.figure(figsize=(16, 6))
cut_height = 125000
plt.axhline(y=cut_height, color='red', linestyle='--', linewidth=1)
dendrogram(linked)
plt.show()

Hierarchical clustering shows a clear separation into three major clusters when cut at distance $≈125,000$.
These clusters likely represent low, medium, and high compensation groups, validating the KMeans result ($k=3$).
The large vertical gaps indicate strong natural grouping in the data.

Agglomerative clustering¶

n_samples = int(0.1 * X.shape[0])
np.random.seed(42)
idx = np.random.choice(X.shape[0], size=n_samples, replace=False)

X_sample = X.iloc[idx]

model = AgglomerativeClustering(n_clusters=4, linkage='ward')

agg_labels = model.fit_predict(X_sample)

print(f"Agglomerative - Silhouette  score: {round(silhouette_score(X_sample, agg_labels), 4)}")

Agglomerative - Silhouette  score: 0.5906

X_sample_agg = X_sample.copy(deep=True)

X_sample_agg['agg_labels'] = agg_labels

X_sample_agg.head(5)

GMM (Gaussian Mixture Model)¶

X_gmm_sample = X_sample.copy(deep=True)

gmm = GaussianMixture(
    n_components=4,
    covariance_type='full',
    random_state=42
)
gmm_labels = gmm.fit_predict(X_gmm_sample)

sil = silhouette_score(X_gmm_sample, gmm_labels)
print(f"GMM Silhouette Score: {sil:.4f}")

GMM Silhouette Score: 0.0889

X_gmm_sample['gmm_lables'] = gmm_labels

X_gmm_sample.head(5)

GMM score is very low and seems very poor clustering of the data.

X_km['cluster'] = X_km['klabels'].map({0: 'Entry / Low CTC', 1: 'Mid-level Professionals', 2: 'High Earners / Leaders'})

chart = X_km['cluster'].value_counts().reset_index()

plt.pie(x=chart['count'],
        labels=chart['cluster'],
        explode = (0.1, 0, 0),
        colors = ['#4C5D6A', '#2F8F8B', '#6FBF73'],
        autopct='%1.1f%%',
        startangle=120)

plt.show()

📊 Business Recommendations¶

1. Compensation Benchmarking

Companies can use these clusters to:

Compare their employee pay distribution against the market
Identify overpaying or underpaying roles

➡️ Example: If most employees fall into Mid-level cluster but competitors place similar roles in High Earners, there is a retention risk.

2. Role-Based Salary Bands

HR teams should define salary bands by job position, not just experience.

✔ Recommended:

Maintain role-specific salary ranges
Adjust compensation policies based on market cluster positioning

3. Talent Retention & Career Planning

Employees can be mapped to clusters to:

Identify high performers stuck in lower pay clusters
Design promotion and upskilling paths to move employees upward

4. Hiring & Offer Strategy

For new hires:

Predict likely cluster placement based on role + experience
Make competitive but controlled offers aligned with market clusters

5. Strategic Workforce Planning

Leadership can use cluster insights to:

Forecast payroll growth
Plan leadership hiring
Balance cost vs expertise across teams

	count
email_hash
bbace3cc586400bbc65765bc6a16b77d8913836cfc98b77c05488f02f5714a4b	10
3e5e49daa5527a6d5a33599b238bf9bf31e85b9efa9a94f1c88c5e15a6f31378	9
298528ce3160cc761e4dc37a07337ee2e0589df251d73645aae209b010210eee	9
6842660273f70e9aa239026ba33bfe82275d6ab0d20124021b952b5bc3d07e6c	9
d598d6f1fb21b45593c2afc1c2f76ae9f4cb7167156cdf93246d4192a89d8065	8
b4d5afa09bec8689017d8b29701b80d664ca37b83cb883376b2e95191320da66	8
d15041f58bb01c8ee29f72e33b136e26bc32f3169a40b53d75fe7ae9cbb9a551	8
c0eb129061675da412b0deb15871dd06ef0d7cd86eb5f7e8cc6a20b0d1938183	8
4818edfd67ed8563dde5d083306485d91d19f4f1c95d193a1700e79dd245b75c	8
faf40195f8c58d5c7edc758cc725a762d51920da996410b80ac4a4d85c803da0	8

	ctc
count	2.058430e+05
mean	2.271685e+06
std	1.180091e+07
min	2.000000e+00
25%	5.300000e+05
50%	9.500000e+05
75%	1.700000e+06
max	1.000150e+09

	years_of_exp
count	197126.000000
mean	4.999788
std	29.557376
min	0.000000
25%	2.000000
50%	4.000000
75%	6.000000
max	2021.000000

	ctc
company_hash
whmxw rgsxwo uqxcvnt rxbxnta	1.000150e+09
aveegaxr xzntqzvnxgzvr hzxctqoxnj	2.500000e+08
ntwyzg xzaxv sqghu	2.000000e+08
touxqxn ntwyzgrgsxto ucnrna	2.000000e+08
onttr onqxuo fyttro rxbxnta	2.000000e+08
i wgzztin mhoxztoo ogrhnxgzo ucn rna	2.000000e+08
agyv tdnqvwg	2.000000e+08
hzxiht sqghu ge owyggro	2.000000e+08
xzaxv ugftq wgqugqvnxgz rna	2.000000e+08
ctqexohayv ntwyzgrgsxto	2.000000e+08

	company_hash	job_position	ctc
0	0	Backend Engineer	100000.0
1	0	Other	100000.0
2	0000	Other	300000.0
4	01 ojztqsj	Frontend Engineer	830000.0
3	01 ojztqsj	Android Engineer	270000.0
...	...	...	...
63516	zz	Other	1370000.0
63515	zz	Backend Engineer	500000.0
63517	zzb ztdnstz vacxogqj ucn rna	Backend Engineer	600000.0
63518	zzb ztdnstz vacxogqj ucn rna	FullStack Engineer	600000.0
63519	zzgato	Backend Engineer	130000.0

	Unnamed: 0	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year
count	205843.000000	205799	205843	205757.000000	2.058430e+05	153279	205843.000000
unique	NaN	37299	153443	NaN	NaN	1016	NaN
top	NaN	nvnv wgzohrnvzwj otqcxwto	bbace3cc586400bbc65765bc6a16b77d8913836cfc98b7...	NaN	NaN	Backend Engineer	NaN
freq	NaN	8337	10	NaN	NaN	43554	NaN
mean	103273.941786	NaN	NaN	2014.882750	2.271685e+06	NaN	2019.628231
std	59741.306484	NaN	NaN	63.571115	1.180091e+07	NaN	1.325104
min	0.000000	NaN	NaN	0.000000	2.000000e+00	NaN	2015.000000
25%	51518.500000	NaN	NaN	2013.000000	5.300000e+05	NaN	2019.000000
50%	103151.000000	NaN	NaN	2016.000000	9.500000e+05	NaN	2020.000000
75%	154992.500000	NaN	NaN	2018.000000	1.700000e+06	NaN	2021.000000
max	206922.000000	NaN	NaN	20165.000000	1.000150e+09	NaN	2021.000000

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year
54820	uqvpqxnx voogwxvnto	8786759b95d673466e94f62f1b15e4f8c6bd7de6164074...	2020.0	24	Other	2020.0
91552	ftm ongqt	512f761579fb116e215cabc9821c7f81153f0763e16018...	2016.0	25	Android Engineer	2018.0
114164	xzntqcxtfmxn	23ad96d6b6f1ecf554a52f6e9b61677c7d73d8a409a143...	2013.0	14	NaN	2018.0
118236	xzntqcxtfmxn	f2b58aeed3c074652de2cfd3c0717a5d21d6fbcf342a78...	2013.0	6	NaN	2018.0
135435	xzntqcxtfmxn	3505b02549ebe2c95840ac6f0a35561a3b4cbe4b79cdb1...	2014.0	2	Backend Engineer	2019.0
183804	xm	75357254a31f133e2d3870057922feddeba82b88056a07...	2019.0	16	NaN	2018.0
184946	xm	b8a0bb340583936b5a7923947e9aec21add5ebc50cd60b...	2016.0	15	NaN	2018.0

	0
company_hash	44
email_hash	0
orgyear	86
ctc	0
job_position	52564
ctc_updated_year	0
ctc_lpa	1454

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa
20278	zgn vuurxwvmrt vwwghzn	4bc6fb265911f8f98622c91bb76469ea9a422b3e293bd7...	2020.0	640000	Frontend Engineer	2019.0	6.40
12294	ugrxwjmvkvvqwgb	24e229fc825f8892225e333bc9fe5ca53c75c5c27ddfcc...	2022.0	5000000	Co-founder	2020.0	50.00
179705	eqtoytq	51192bd41dff3a3ffad14f3525e98549b0e94a3e340f02...	2020.0	2600000	Backend Engineer	2019.0	26.00
39204	x vb v eqtoytq	48b00207f75dd25ca9d518103e2ddc3c9a9706e51ae393...	2020.0	200000000	Frontend Engineer	2019.0	6.56
74532	rtvqzwgatgzrxzt xzw	8b3a332f31177737f0491fc758ede2393a84f37d087262...	2020.0	600000	SDET	2019.0	6.00

	count
job_category
Software Engineer	147513
Other / Non-Tech	21699
Product / Program / Manager	9134
Data	8375
QA / Testing	6610
DevOps / Infra	4632
Business / Operations / Support	3773
Intern / Student	2781
Design	1326

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category
134173	gqvwrt	f87b680307676acc9248f3b037071649fe63ad61ce6f50...	2016.0	1200000	Other	2017.0	12.00	other	Other / Non-Tech
67182	btaxvztn	78ec9030de84d08430d948ddf076d19c85c64ce8adb81a...	2017.0	1325000	Other	2020.0	13.25	other	Other / Non-Tech
125768	wno wgqugqvnxgz	809b2cb327d582fa1e30801e4cae518a94d27c4056b890...	2004.0	100000	Other	2021.0	1.00	other	Other / Non-Tech
52516	mn	ca8eeeafcb990599c2f8af02f20de9b341fedddf20332c...	2010.0	1100000	Other	2016.0	11.00	other	Other / Non-Tech
67788	qtctqxt	d75e72c4d6b6c34ffafa217c50fc5c1650c89651fc5eb3...	2009.0	1000000	Non Coder	2021.0	10.00	non coder	Other / Non-Tech

	company_hash	email_hash	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp
13424	9xntwyzgrgsj	854ff163ded87211b944dfcaebdcf9e8efa45defc9582f...	700000	Backend Engineer	2021.0	7.0	backend engineer	Software Engineer	2021.0
32086	lxg	ae1b500192dcd0b6c2d5c69b51f6caf19c630640c0aeb1...	600000	Backend Engineer	2021.0	6.0	backend engineer	Software Engineer	2021.0
33117	mvzp ge vbtqxwv	8ef16126bd9a4691801d2830156dc5528142a45d314593...	1200000	Backend Engineer	2021.0	12.0	backend engineer	Software Engineer	2021.0
40553	mqvpto xzaxv rna	a0f794db04d5c13cae6f07c6ce9aee8ff731176cc4d4a1...	600000	Backend Engineer	2021.0	6.0	backend engineer	Software Engineer	2021.0
56770	vngo ojzntr	536e217527d1101538c70b7001d7e28344dba87cd8e246...	400000	Backend Engineer	2021.0	4.0	backend engineer	Software Engineer	2021.0
74313	ovbohzs trtwnqgzxwo	b9fac647e08fe47c6d112466c338e847e82e8f24e16236...	1000000	Backend Engineer	2021.0	10.0	backend engineer	Software Engineer	2021.0
100415	rgsfvqa	b511b17a1aaaa822c6b5d6a04498ff3121808101697407...	1600000	Frontend Engineer	2021.0	16.0	frontend engineer	Software Engineer	2021.0
101678	awm mvzp	37a5cce01b1e049307d73bbb2138fa54cbb02555c2137a...	700000	Backend Engineer	2021.0	7.0	backend engineer	Software Engineer	2021.0

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp
202210	mqvmtzatq	d66f939c4318c1958be5bc9e7b70b741aa61be7493ff58...	2028.0	1300000	Backend Engineer	2028.0	13.0	backend engineer	Software Engineer	0.0
167456	xzwnhqt ntwyzgrgsj	8b670b30a7588a36cbe0cea33f8c840c3ab25c649d970c...	2.0	500000	Backend Engineer	2019.0	5.0	backend engineer	Software Engineer	2017.0
62717	wyvqstugxzn	a1da4f131e2efea24f6d64c66519724aa5c8c5c11a23d9...	0.0	1850000	FullStack Engineer	2019.0	18.5	fullstack engineer	Software Engineer	2019.0
183768	ftxzcton	4007e5caadc3f52c3e18bf2b4eacbadf17b114208c2d04...	20165.0	1560000	Backend Engineer	20165.0	15.6	backend engineer	Software Engineer	0.0
33670	sgltp	a8a779493f5844846ae536b0f48437cf9f0f6ebcac340f...	2107.0	3000000	Backend Engineer	2107.0	30.0	backend engineer	Software Engineer	0.0
73302	xzntqcxtfmxn	91e4562ab8bab639b859082d519722a91b8e6f3d55c109...	2026.0	300000	Backend Engineer	2026.0	3.0	backend engineer	Software Engineer	0.0
84882	hzxntaytvrny sqghu	82e77fdd3e43e37ec6b805bbfe624d7cd24b37cb0a4317...	3.0	540000	QA Engineer	2019.0	5.4	qa engineer	QA / Testing	2016.0
94689	xzntr otwhqxnj bwvett ogenfvqt	7126263909db5c37b278b05ea56740688062821f452ef6...	5.0	2720000	QA Engineer	2019.0	27.2	qa engineer	QA / Testing	2014.0
143238	wxowg	19bd55263d4e8cb5659b50814dc18b1fbdd1b7e0689e4b...	5.0	1000000	Backend Engineer	2019.0	10.0	backend engineer	Software Engineer	2014.0
19920	zgn vuurxwvmrt vwwghzn	6aa38b497c73367a7dd6eafb95bdd5b07cca83ed14c588...	2026.0	500000	Backend Engineer	2026.0	5.0	backend engineer	Software Engineer	0.0

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count
92540	atbtztf	38d1fa6b6127c5b0fcf7a32125b6f723e93b33b0e2dc49...	2016.0	1300000	FullStack Engineer	2021.0	13.00	fullstack engineer	Software Engineer	5.0	1.300000e+06	1300000	1300000	1
105421	xzegojo	95db365454f5570f1204f281375ff894432f094659fee0...	2009.0	390000	Backend Engineer	2019.0	3.90	backend engineer	Software Engineer	10.0	1.072222e+06	390000	2000000	18
61609	ogenfvqt tzsxzttq ogzj xzaxv ogenfvqt wtznqt	6253bb1144050e607521e595eb36c1b4d05544bff55666...	2017.0	750000	Backend Engineer	2019.0	7.50	backend engineer	Software Engineer	2.0	7.500000e+05	750000	750000	1
117674	zxoovz axsxnvr	b0835cd9a41974819eeec33de7e9f0bc4217ddde3bf0b9...	2018.0	630000	Backend Engineer	2019.0	6.30	backend engineer	Software Engineer	1.0	6.900000e+05	625000	1000000	6
147579	nyghsynfgqpo	0fcd2634281ad1676af00981f09ac9e3ec9bb89a6b805b...	2011.0	126000	Backend Engineer	2016.0	1.26	backend engineer	Software Engineer	5.0	1.152889e+06	126000	1700000	9

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count	Designation
42048	zxr	9690b44afccd5011ce3b4c6959ca3b98b2bb555ae8bfd1...	2020.0	2310000	Backend Engineer	2020.0	23.1	backend engineer	Software Engineer	0.0	2.578333e+06	360000	7400000	6	3
141204	rgjvrnj lhsstqzvhn	ba9499599c7e4b9702a84da141719b8b2592ccfed6eb63...	2013.0	1950000	Backend Engineer	2020.0	19.5	backend engineer	Software Engineer	7.0	1.950000e+06	1950000	1950000	1	2
12563	vwnxctvx	b31490e2eac8b671b38bf9ceabb85a5a56ee9d9be52322...	2018.0	500000	Backend Engineer	2021.0	5.0	backend engineer	Software Engineer	3.0	5.000000e+05	500000	500000	1	2
9661	ohmtd	53ce43655503bc47faaf85d52e5fee4d3d475a62e5f726...	2016.0	1100000	Other	2020.0	11.0	other	Other / Non-Tech	4.0	1.100000e+06	1100000	1100000	1	2
45618	exo	402c7c4f4508f9cc4c7d31b11d9085eb4702be6b06b915...	2014.0	1050000	QA Engineer	2021.0	10.5	qa engineer	QA / Testing	7.0	8.900000e+05	730000	1050000	2	1

	company_enc	job_enc	ctc_scaled	years_of_exp
0	920	408	0.344823	4.0
1	18750	255	0.141059	1.0
2	14746	110	0.626956	5.0
3	11518	110	0.219430	2.0
4	19222	255	0.438867	2.0

	company_enc	job_enc	ctc_scaled	years_of_exp	klabels
0	920	408	0.344823	4.0	2
1	18750	255	0.141059	1.0	0
2	14746	110	0.626956	5.0	0
3	11518	110	0.219430	2.0	0
4	19222	255	0.438867	2.0	0

	company_enc	job_enc	ctc_scaled	years_of_exp	agg_labels
41151	14396	110	0.125385	1.0	3
57235	5424	110	0.470215	0.0	1
41897	7267	110	0.470215	0.0	1
43861	13019	215	0.815046	10.0	3
108260	23933	255	0.188082	0.0	0

	company_enc	job_enc	ctc_scaled	years_of_exp	gmm_lables
41151	14396	110	0.125385	1.0	3
57235	5424	110	0.470215	0.0	2
41897	7267	110	0.470215	0.0	2
43861	13019	215	0.815046	10.0	0
108260	23933	255	0.188082	0.0	3

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count	Designation	avg_ctc_class	class	avg_ctc_company	Tier
24	nxbto xzntqztn	d2668cb959e5657c3881413257b9850caa1359c7ce959a...	2015.0	9500000	Backend Engineer	2019.0	31.9	backend engineer	Software Engineer	4.0	2.091250e+06	700000	9500000	16	1	1.862625e+06	1	1.894941e+06	1
187847	vbvkgz	f4415be48a1ef885e086dcd72181f667a289641e66f828...	2014.0	3800000	Backend Engineer	2019.0	31.9	backend engineer	Software Engineer	5.0	3.436423e+06	14000	100000000	189	1	3.320822e+06	1	3.455085e+06	1
187781	zvz	298933dd70b9152e589a7e4c263253f354606521dee832...	2022.0	7500000	Backend Engineer	2022.0	31.9	backend engineer	Software Engineer	0.0	1.772605e+06	600	80000000	438	1	1.962582e+06	1	2.097492e+06	1
187795	vnytzvytvrny	12f85a43be88b0fc64a36648e2de0200325344dc223624...	2013.0	4400000	Backend Engineer	2020.0	31.9	backend engineer	Software Engineer	7.0	2.836667e+06	2200000	4400000	6	1	2.212306e+06	1	2.354785e+06	1
187759	vbvkgz	5dfef7c1643f0b744456f47f72901027dfad9c6b24d443...	2013.0	4000000	Backend Engineer	2020.0	31.9	backend engineer	Software Engineer	7.0	4.066214e+06	61000	100000000	112	3	3.320822e+06	1	3.455085e+06	1
187799	zcxaxv	df6bc6b9b560dd3faae14c182743a31a315680c2051a94...	2019.0	4240000	Backend Engineer	2020.0	31.9	backend engineer	Software Engineer	1.0	3.570000e+06	2900000	4240000	2	1	1.833545e+06	1	3.675157e+06	1
63427	zgho xzegojontbo	350e59e722f73a630057cbcea88b545f23e66399e509d2...	2011.0	3500000	FullStack Engineer	2021.0	31.9	fullstack engineer	Software Engineer	10.0	3.500000e+06	3500000	3500000	1	2	1.518750e+06	1	1.541250e+06	1
63454	sggsrt	0ff25e37f8d62d03695882afd203ce667ee6c90bae0074...	2013.0	6500000	Software Engineer Backend	2020.0	31.9	software engineer backend	Software Engineer	7.0	6.500000e+06	6500000	6500000	1	2	3.350000e+06	1	5.961943e+06	1
63463	uvjovet sqghu	fe67fa1b7d4a93ea1e8d6bf0ee2c8dd5964c5656b2d378...	2013.0	3500000	FullStack Engineer	2021.0	31.9	fullstack engineer	Software Engineer	8.0	3.500000e+06	3500000	3500000	1	2	1.535000e+06	1	2.145368e+06	1
63472	zgn vuurxwvmrt vwwghzn	fecac8d9ff2e0a852943534d6cdc9a3c496c8496c4b9e6...	2018.0	10000000	Other	2021.0	31.9	other	Other / Non-Tech	3.0	4.125690e+06	60000	100000000	29	1	3.983624e+06	1	2.337827e+06	1

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count	Designation	avg_ctc_class	class	avg_ctc_company	Tier
123912	xzntqcxtfmxn	3505b02549ebe2c95840ac6f0a35561a3b4cbe4b79cdb1...	2014.0	2	Backend Engineer	2019.0	0.00024	backend engineer	Software Engineer	5.0	1.092503e+06	2	3000000	8	3	1.143685e+06	3	1.496672e+06	3
108500	xzntqcxtfmxn	f2b58aeed3c074652de2cfd3c0717a5d21d6fbcf342a78...	2013.0	6	Backend Engineer	2018.0	0.00072	backend engineer	Software Engineer	5.0	1.092503e+06	2	3000000	8	3	1.143685e+06	3	1.496672e+06	3
104842	xzntqcxtfmxn	23ad96d6b6f1ecf554a52f6e9b61677c7d73d8a409a143...	2013.0	14	Backend Engineer	2018.0	0.00168	backend engineer	Software Engineer	5.0	1.092503e+06	2	3000000	8	3	1.143685e+06	3	1.496672e+06	3
168752	xm	b8a0bb340583936b5a7923947e9aec21add5ebc50cd60b...	2016.0	15	Backend Engineer	2018.0	0.00180	backend engineer	Software Engineer	2.0	1.500000e+01	15	15	1	2	1.550000e+01	3	1.550000e+01	3
107392	hzxctqoxnj ge fvoyxzsngz	f7e5e788676100d7c4146740ada9e2f8974defc01f571d...	2022.0	200	Backend Engineer	2022.0	0.02400	backend engineer	Software Engineer	0.0	2.000000e+02	200	200	1	2	2.000000e+02	2	3.140000e+04	3
156199	nvnv wgzohrnvzwj otqcxwto	80ba0259f9f59034c4927cf3bd38dc9ce2eb60ff18135b...	2012.0	600	Backend Engineer	2017.0	0.07200	backend engineer	Software Engineer	5.0	2.258645e+06	600	200000000	242	3	1.453687e+06	3	1.728092e+06	3
137661	zvz	9af3dca6c9d705d8d42585ccfce2627f00e1629130d14e...	2023.0	600	Backend Engineer	2023.0	0.07200	backend engineer	Software Engineer	0.0	1.772605e+06	600	80000000	438	3	1.962582e+06	3	2.097492e+06	3
91490	gjg	b995d7a2ae5c6f8497762ce04dc5c04ad6ec734d70802a...	2018.0	600	FullStack Engineer	2021.0	0.07200	fullstack engineer	Software Engineer	3.0	1.185120e+06	600	1875000	5	3	1.412721e+06	3	2.249174e+06	3
71039	vwwtznhqt	f0f2005505c707dbdd2c86ca1587c26f822a004e86a8ec...	2021.0	1000	Backend Engineer	2021.0	0.12000	backend engineer	Software Engineer	0.0	2.419641e+06	1000	200000000	217	3	2.002374e+06	3	2.448008e+06	3
81717	wgd vhngbgnxct xzw	4d18008fc2cb66e4b90f3798ccbbc4792dfd4bad5a7a87...	2016.0	1000	FullStack Engineer	2019.0	0.12000	fullstack engineer	Software Engineer	3.0	5.050000e+04	1000	100000	2	3	6.700000e+04	3	7.525000e+04	3

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count	Designation	avg_ctc_class	class	avg_ctc_company	Tier
82784	vwwtznhqt	2f9a4241053f76b2f8c50ea593a90586d38b3f0e08c141...	2016.0	200000000	Data Analyst	2019.0	9.86	data analyst	Data	3.0	1.094865e+07	360000	200000000	23	1	6.240684e+06	1	2.448008e+06	1
93516	ytqt ntwyzgrgsxto	98a90272cbba6e6e9ca94981824f3465f3d34567cb065f...	2015.0	200000000	Data Analyst	2020.0	12.74	data analyst	Data	5.0	2.000000e+08	200000000	200000000	1	2	4.029800e+07	1	4.259825e+06	1
63577	wgzahtzn	9ce2995b2221fe627e861daea9d0603872cce8cc128390...	2016.0	200000000	Data Analyst	2020.0	20.60	data analyst	Data	4.0	1.002300e+08	460000	200000000	2	1	3.387917e+07	1	3.238675e+06	1
123172	nvnv wgzohrnvzwj otqcxwto	1b95e7ba0ee82100ca5a034239fa0203a1bec14280b82a...	2018.0	200000000	Database Administrator	2021.0	6.00	database administrator	Other / Non-Tech	3.0	2.899457e+07	390000	200000000	7	1	7.809081e+06	1	1.728092e+06	1
49179	zgzt	268a5aa92f0b6d0c675fc9cc1e300eb0c5930a3a139a23...	2022.0	200000000	Data Scientist	2022.0	4.14	data scientist	Data	0.0	2.995714e+07	100000	200000000	7	1	1.020609e+07	1	5.838392e+06	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
114614	xzatrrxtzn	cafbe82bdd822f80cdf232311a09fb8d8c6ccdac5786f2...	2018.0	180000	Data Scientist	2019.0	1.80	data scientist	Data	1.0	1.800000e+05	180000	180000	1	2	1.450000e+05	1	1.450000e+05	1
62153	x vb v eqttrvzwtq	8b1b2cf96ee10506f1d937c9045a923233f122512b8df4...	2014.0	180000	Data Analyst	2021.0	1.80	data analyst	Data	7.0	1.800000e+05	180000	180000	1	2	1.307500e+05	1	1.405905e+06	3
65172	urhbxzw	89c50595727487bfa7d687bb168d2c2b55a57646876a4d...	2017.0	155000	Data Scientist	2018.0	1.55	data scientist	Data	1.0	1.400000e+05	125000	155000	2	1	1.400000e+05	1	1.400000e+05	1
45509	mvjtq	db2c70fea469a7f1456457812fe94a01c337eb6ce75bd5...	2018.0	115000	Data Scientist	2019.0	1.15	data scientist	Data	1.0	1.150000e+05	115000	115000	1	2	1.055000e+05	1	1.833250e+06	3
140134	exznqhon ogrhnxgzo ucn rna	ab2dc9db23c3104f0b6b3dbd4cdd5bfb9e5829b8b7943d...	2017.0	10000	Data Scientist	2020.0	1.20	data scientist	Data	3.0	1.000000e+04	10000	10000	1	2	8.600000e+03	1	9.066667e+03	1

	company_hash	email_hash	orgyear	ctc	job_position	ctc_updated_year	ctc_lpa	job_position_clean	job_category	years_of_exp	avg_ctc	min_ctc	max_ctc	count	Designation	avg_ctc_class	class	avg_ctc_company	Tier
134669	vagmt	e0615a4e98c07837b967537bbfe0614fb24b042421cd6e...	2013.0	1230000	Backend Engineer	2016.0	12.3	backend engineer	Software Engineer	3.0	2.114379e+06	10000	5100000	91	3	2.746472e+06	3	2.880667e+06	3
163694	tqxwoogz	57cc33fe71eab1d8aabf167c1ec150cb5b4a137d060f5a...	2019.0	350000	Other	2019.0	3.5	other	Other / Non-Tech	0.0	8.800000e+05	350000	1600000	3	3	8.578387e+05	3	1.602591e+06	3
102177	xmb	ec6e90591ff23db1e78ed0d3b8a8e0094c9aa6c61c5343...	2012.0	600000	QA Engineer	2019.0	6.0	qa engineer	QA / Testing	7.0	7.399998e+05	409999	1800000	6	3	9.546428e+05	3	1.411994e+06	3
6824	rv tdvwnrrj ucn rna	66dc6ea9d6dbd7d33741385706770f8d06d69bc6b831c5...	2017.0	200000	Backend Engineer	2020.0	2.0	backend engineer	Software Engineer	3.0	2.000000e+05	200000	200000	1	2	2.000000e+05	2	2.000000e+05	2
73	ptntp x	7fe7375ea26f4acf0c3e589616df27bb4379a96ae6d855...	2021.0	2200000	Backend Engineer	2021.0	22.0	backend engineer	Software Engineer	0.0	2.200000e+06	2200000	2200000	1	2	2.200000e+06	2	2.200000e+06	2