Ziro2Mach dream it... build it!

The Only Thing about Machine Learning that's more Important than Machine Learning


perhaps the only thing about machine learning that’s more important than machine learning itself is data pre-processing 🙃

that’s cuz as defined before machine learning is:

the math of taking in real world info, converting it into numbers and then learning a pattern out of it

and info out in the real world brings along with it a ton of noise

Example Data from the Real World

as an advocate of learning by getting your hands dirty, here’s an example

there’s something called russell’s circumplex

source: the pennsylvania state university

something that helps quantify emotions

cuz ML algorithms often learn better when the data they work with is continuous numbers instead of traditionally encoded class data, where each emotion is just mapped to a number

while the classified data does represent numbers, the numerical value of a class doesn’t always represent the intensity of an emotion, while russell’s model gives you an activation and a pleasantness value that are already intensities of an emotion

Unclean Data

let’s say we find a dataset with the parameters we are looking for

here the column pic represents a 3d array of red, green and blue pixel values of an image containing an emotion, and the rest are pretty straightforward

Step - 1 : Splitting The Data

the whole goal of training an ML model is so that we could use it to actively predict output on unseen data/situations. a simple way of doing that is to set aside part of the dataset and train the model on the rest, say 80% of the rows

the remaining 20% can be used to evaluate the performance of the model developed
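
here’s a minimal sketch of that split in plain python, assuming the dataset is just a list of rows. the helper name split_rows, the seed and the toy rows are made up for illustration and not part of any library

```python
import random


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into (train, test)."""
    shuffled = list(rows)                  # leave the original list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# hypothetical rows, only here to show the sizes of the two parts
rows = [{"id": i} for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 8 2
```

shuffling before cutting matters when the rows are stored in some order (by date, by class, etc), otherwise the testing part might not look like the training part at all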

Step - 2 : Dealing with Missing Data

notice that there’s some missing data in the age column, so there are 2 common ways of dealing with that missing data

1. deleting entire rows if a required column is missing

note: works great for super ultra large datasets, but since more data = better, throwing rows away hurts on smaller ones

2. substituting a middle value of the column, like the mean or the median (depends on type of data)
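
both options can be sketched in a few lines of plain python. the rows below are invented for illustration, None stands for a missing age, and the median is picked as the middle value (the mean would work the same way)

```python
from statistics import median

# hypothetical training rows; None marks a missing age
train_rows = [
    {"age": 22, "gender": "female"},
    {"age": None, "gender": "male"},
    {"age": 35, "gender": "male"},
    {"age": 41, "gender": "female"},
]

# option 1: delete entire rows where the required column is missing
dropped = [row for row in train_rows if row["age"] is not None]

# option 2: substitute the median of the ages we do know
known_ages = [row["age"] for row in train_rows if row["age"] is not None]
age_median = median(known_ages)
filled = [
    {**row, "age": age_median if row["age"] is None else row["age"]}
    for row in train_rows
]

print(len(dropped))      # 3
print(filled[1]["age"])  # 35
```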

Step - 3 : Dealing with Class Data

many times, the data in datasets is class data, and while encoded class data might not always accurately represent the intensity of a parameter, something is better than nothing

there are 2 common ways of dealing with class data, let’s take the gender column

1. one hot encoding

when one column is split into as many columns as it has classes, like gender has 2 classes: male and female, so the gender column gets split into 2 columns: a male column and a female column

2. label encoding

handy for columns with binary classes, like true or false, male or female, yes or no, etc: one of the class labels is replaced with 0 and the other with 1
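
a rough sketch of both encodings in plain python, using a made up gender column. which class ends up as 0 and which as 1 is just alphabetical order here, an arbitrary choice for the example

```python
# hypothetical gender column
genders = ["male", "female", "female", "male"]

# the classes found in the column, in a fixed (alphabetical) order
classes = sorted(set(genders))  # ['female', 'male']

# one hot encoding: one 0/1 column per class
one_hot = [{c: int(g == c) for c in classes} for g in genders]

# label encoding: each class is replaced by a single number
label_of = {c: i for i, c in enumerate(classes)}  # female -> 0, male -> 1
labels = [label_of[g] for g in genders]

print(one_hot[0])  # {'female': 0, 'male': 1}
print(labels)      # [1, 0, 0, 1]
```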

Step - 4 : Feature Scaling

different columns usually represent different parameters, and not all parameters have the same range. assuming a dataset of age and height, the age column has a range of 1 to 100, while the height column perhaps has a range of 100cm to 200cm

why is this important?

when we plot these values without scaling em to the same range it would look like

and let’s say we tried to find a line that best fit through the points it would look like

however if we scaled the inputs to the same range, it would look like this

even at a glance we can tell that this line fits the data better, i.e. there is less error when predicting for unseen data

now feature scaling is commonly done using 2 methods

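  1. normalization: (x - min) / (max - min)
  2. standardization: (x - mean) / standard deviation

where x is the current input we want to scale, and min, max, mean and standard deviation are computed over its column. here’s an example of normalization on the dataset we were working on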

this leaves us with a dataset that’s ready for training
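
for the curious, here’s a small sketch of normalization and standardization using their usual textbook definitions: (x - min) / (max - min) and (x - mean) / standard deviation. the age and height values are invented, and the population standard deviation (pstdev) is an assumption, some tools use the sample version instead

```python
from statistics import mean, pstdev


def normalize(values):
    """Min-max normalization: rescales the values into the range 0 to 1."""
    lo, hi = min(values), max(values)
    # note: breaks with a division by zero if every value is identical
    return [(x - lo) / (hi - lo) for x in values]


def standardize(values):
    """Standardization: rescales the values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


# hypothetical columns on very different ranges
ages = [20, 30, 40, 60]
heights_cm = [150, 160, 170, 190]

print(normalize(ages))        # [0.0, 0.25, 0.5, 1.0]
print(normalize(heights_cm))  # [0.0, 0.25, 0.5, 1.0]
print(standardize(ages))
```

after normalization both columns live on the same 0 to 1 range, so neither one dominates just because its raw numbers are bigger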

Step - 5 : Dealing with the Testing Data

we’ve done a lot of pre-processing on the training dataset, but the testing data is still going to look like the unclean training data

so we have to remember to

  1. replace missing data with the middle value of the training data
  2. encode class data to match the training data
  3. feature scale using parameters of the training data
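
a minimal sketch of that idea for a single age column, assuming median filling and min-max normalization were the choices made on the training data. the numbers and the helper name prepare are made up for illustration

```python
from statistics import median

# hypothetical data; None marks a missing age
train_ages = [20, 30, None, 60]
test_ages = [25, None, 70]

# learn every parameter from the training data only
known = [a for a in train_ages if a is not None]
fill_value = median(known)       # 30
lo, hi = min(known), max(known)  # 20, 60


def prepare(ages):
    """Fill gaps and scale using the parameters learned from the training data."""
    filled = [fill_value if a is None else a for a in ages]
    return [(a - lo) / (hi - lo) for a in filled]


print(prepare(train_ages))  # [0.0, 0.25, 0.25, 1.0]
print(prepare(test_ages))   # [0.125, 0.25, 1.25]
```

notice the 1.25 in the testing output: a testing value can land outside the 0 to 1 range when it’s bigger than anything seen in training, and that’s expected since the scaling parameters come from the training data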

you get the point: we have to use the exact same operation tools on the testing dataset that we used for the operation on the training dataset

yuppp data people shouldn’t become doctors 😝

and with that we have testing data that is ready to be taken for a ride in our ML model

until next time! ✌

or you could spot me in the wild 🤭 i mean instagram, twitter, linkedin and maybe even youtube where i excalidraw those diagrams