Machine Learning – Scale

In this tutorial we cover scaling data for machine learning in Python, with worked examples along the way.



Feature Scaling

It can be hard to compare data when the values are measured on different scales and in different units.

How do pounds (lbs) compare to kilograms (kg)?

How about altitude versus time?

Scaling is the solution to this problem: once the data is scaled, new values are much easier to compare.

Check out the table below, where liabilities are given in USD.

Firm                                     Department      employees   liabilities (USD)   Performance
Amazon Inc                               Warehouse       30000       90000000            89
Amazon Inc                               IT              5000        45000000            200
Amazon Inc                               Support         10000       25000000            400
Apple Inc                                Designer Dep    1000        11000000            80
Apple Inc                                Audit           500         1750000             90
Apple Inc                                Tech            25000       187500000           130
BlackRock Inc                            Advisors        5000        91500000            90
BlackRock Inc                            Analysts        2000        56300000            134
BlackRock Inc                            Tech            7500        18000000            100
BlackRock Inc                            Sales           10000       34500000            66
BlackRock Inc                            Consultants     22000       97680000            76
China Petroleum & Chemical Corp. (SNP)   Mechanical      75000       166500000           68
China Petroleum & Chemical Corp. (SNP)   Research        2500        12500000            30
China Petroleum & Chemical Corp. (SNP)   Supply          110000      220000000           240
CVS Health Corp                          Research        10000       92300000            30
CVS Health Corp                          Maintenance     5000        8390000             69
CVS Health Corp                          PharmD          2500        8500000             508
Google Inc                               AI              25000       49975000            157
Google Inc                               Advert          40000       180000000           782
Google Inc                               Research        10000       134300000           52
Google Inc                               Finance         5000        14000000            89
Microsoft Inc                            Research        24000       244800000           50
Microsoft Inc                            AI              5000        468000000           29
Microsoft Inc                            OS              27000       151200000           1130
Royal Dutch Shell PLC                    Research        15000       184500000           40
Royal Dutch Shell PLC                    Finance         10000       43800000            78
Tesla Inc                                Engineer        8000        48000000            210
Tesla Inc                                Assembly        17000       81600000            330
Tesla Inc                                Finance         3000        10500000            99
Tesla Inc                                Advisors        1000        18860000            40
Tesla Inc                                Audit           1300        6501300             94
Walmart Inc                              Supply          134000      368500000           566
Walmart Inc                              Finance         100000      335700000           79
Walmart Inc                              Warehouse       155000      376960000           198
Walmart Inc                              Tech            200000      800000000           164
Walmart Inc                              Support         360000      1620000000          303

 

It is hard to compare the liabilities value 90000000 with the employees value 30000 directly, but once both are scaled to comparable values, it is easy to see how one value relates to the other.

Scaling data can be achieved in a variety of ways. We will use a method called standardization in this tutorial.

This formula is used in the standardization method:

z = (x - u) / s

In this equation, z represents the new value, x represents the original value, u represents the mean, and s represents the standard deviation.

Based on the above data set, the first value of liabilities is 90000000, and the scaled value is:

(90000000 - 175100452.77) / 293691058.741973 = -0.28976181

According to the data set above, the first value of the employees column is 30000, and the scaled value is:

(30000 - 40647.222222222) / 72005.90441111 = -0.147865961

Instead of comparing 90000000 with 30000, you can now compare -0.28 with -0.14.
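
To check the arithmetic, here is a minimal sketch that reproduces both calculations in plain Python (the means and standard deviations are the values quoted above):

def standardize(x, u, s):
    # z = (x - u) / s
    return (x - u) / s

print(standardize(90000000, 175100452.77, 293691058.741973))  # approx. -0.28976181
print(standardize(30000, 40647.222222222, 72005.90441111))    # approx. -0.147865961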

There are several common utility functions and transformer classes in the sklearn.preprocessing package for converting raw feature vectors into a format suitable for downstream estimators.

The standardization of data sets generally benefits learning algorithms.

The sklearn.preprocessing module provides a class called StandardScaler(), which returns a scaler object with methods for transforming data sets.

Scaling should be done as follows:

Example

import pandas
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Load the data set and select the two columns to scale.
vardf = pandas.read_csv("dataset.csv")
mrx = vardf[['liabilities', 'employees']]

# fit_transform() learns each column's mean and standard deviation,
# then returns the standardized values.
scaledmrx = scale.fit_transform(mrx)

print(scaledmrx)

Result:

As you can see, the first row matches our manual calculations: -0.28 and -0.14.

[[-0.28976181 -0.14786596]
[-0.44298404 -0.49505971]
[-0.51108281 -0.42562096]
[-0.55875195 -0.55061071]
[-0.59024763 -0.55755459]
[ 0.0422197 -0.21730471]
[-0.2846544 -0.49505971]
[-0.40450824 -0.53672296]
[-0.53491738 -0.46034034]
[-0.4787359 -0.42562096]
[-0.26361188 -0.25896796]
[-0.02928401 0.47708279]
[-0.55364455 -0.52977909]
[ 0.1528802 0.96315404]
[-0.28193045 -0.42562096]
[-0.56763884 -0.49505971]
[-0.5672643 -0.52977909]
[-0.42604447 -0.21730471]
[ 0.01668266 -0.00898846]
[-0.13892303 -0.42562096]
[-0.54853714 -0.49505971]
[ 0.23732267 -0.23119246]
[ 0.99730495 -0.49505971]
[-0.08137957 -0.18952921]
[ 0.03200488 -0.35618221]
[-0.44706997 -0.42562096]
[-0.43276923 -0.45339646]
[-0.31836329 -0.32840671]
[-0.56045442 -0.52283521]
[-0.53198914 -0.55061071]
[-0.57406975 -0.54644439]
[ 0.65851357 1.29646004]
[ 0.54683159 0.82427654]
[ 0.68731935 1.58810279]
[ 2.12774454 2.21305154]
[ 4.91979413 4.43509154]]
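
The mean and standard deviation used in the manual calculations are stored on the fitted scaler itself, so you can verify them by continuing from the example above:

# Learned per-column statistics, in the order ['liabilities', 'employees'].
print(scale.mean_)   # approx. [1.751e+08  4.065e+04]
print(scale.scale_)  # approx. [2.937e+08  7.201e+04]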

Predict Performance

Once the data set has been scaled, you must apply the same scale when predicting values:

Predict the Performance of a department with 10 employees and 100,000 dollars in liabilities:

Example

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

df = pandas.read_csv("dataset.csv")

# Features (independent variables) and the target (dependent variable).
mrx = df[['liabilities', 'employees']]
ample = df['Performance']

# Standardize the features, then fit a linear regression model
# on the scaled values.
scaledmrx = scale.fit_transform(mrx)

regr = linear_model.LinearRegression()
regr.fit(scaledmrx, ample)

# A new observation must be transformed with the same fitted scaler
# before it is passed to predict().
scaled = scale.transform([[100000, 10]])

predictedPerformance = regr.predict([scaled[0]])
print(predictedPerformance)

Result:

[Image: the predicted Performance value printed by the script]
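
The transform step is what puts the new observation on the same footing as the training data. As a quick sketch of what would go wrong without it (continuing from the script above):

raw = [[100000, 10]]

# Wrong: the model was trained on standardized values, so passing
# raw values produces a meaningless prediction.
print(regr.predict(raw))

# Correct: transform with the same fitted scaler first.
print(regr.predict(scale.transform(raw)))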


Now you know:

  • When working with many machine learning algorithms, data scaling is recommended as a pre-processing step.
  • Input and output variables can be normalized or standardized to achieve data scaling (see the normalization sketch after this list).
  • Standardization and normalization can be applied to improve the performance of predictive modeling algorithms.
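
Normalization was not covered above, but it works in much the same way as standardization. Here is a minimal sketch using sklearn's MinMaxScaler, which rescales each column to the [0, 1] range (assuming the same dataset.csv as before):

import pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.read_csv("dataset.csv")
mrx = df[['liabilities', 'employees']]

# MinMaxScaler applies x_scaled = (x - min) / (max - min) per column.
normalized = MinMaxScaler().fit_transform(mrx)
print(normalized)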

 
