Machine Learning – Scale

In this tutorial we cover how to scale data for machine learning in Python, with worked examples.



Scaling Features

You can have a hard time comparing data when the values are measured in different units and on different scales.

How do pounds (lbs) compare to kilograms (kg)?

How about altitude versus time?

Scaling is the solution to this problem. Scaling data allows us to compare new values more easily.

Check out the table below, where liabilities are given in USD.

Firm                                     Department     Employees  Liabilities (USD)  Performance
Amazon Inc                               Warehouse      30000      90000000           89
Amazon Inc                               IT             5000       45000000           200
Amazon Inc                               Support        10000      25000000           400
Apple Inc                                Designer Dep   1000       11000000           80
Apple Inc                                Audit          500        1750000            90
Apple Inc                                Tech           25000      187500000          130
BlackRock Inc                            Advisors       5000       91500000           90
BlackRock Inc                            Analysts       2000       56300000           134
BlackRock Inc                            Tech           7500       18000000           100
BlackRock Inc                            Sales          10000      34500000           66
BlackRock Inc                            Consultants    22000      97680000           76
China Petroleum & Chemical Corp. (SNP)   Mechanical     75000      166500000          68
China Petroleum & Chemical Corp. (SNP)   Research       2500       12500000           30
China Petroleum & Chemical Corp. (SNP)   Supply         110000     220000000          240
CVS Health Corp                          Research       10000      92300000           30
CVS Health Corp                          Maintenance    5000       8390000            69
CVS Health Corp                          PharmD         2500       8500000            508
Google Inc                               AI             25000      49975000           157
Google Inc                               Advert         40000      180000000          782
Google Inc                               Research       10000      134300000          52
Google Inc                               Finance        5000       14000000           89
Microsoft Inc                            Research       24000      244800000          50
Microsoft Inc                            AI             5000       468000000          29
Microsoft Inc                            OS             27000      151200000          1130
Royal Dutch Shell PLC                    Research       15000      184500000          40
Royal Dutch Shell PLC                    Finance        10000      43800000           78
Tesla Inc                                Engineer       8000       48000000           210
Tesla Inc                                Assembly       17000      81600000           330
Tesla Inc                                Finance        3000       10500000           99
Tesla Inc                                Advisors       1000       18860000           40
Tesla Inc                                Audit          1300       6501300            94
Walmart Inc                              Supply         134000     368500000          566
Walmart Inc                              Finance        100000     335700000          79
Walmart Inc                              Warehouse      155000     376960000          198
Walmart Inc                              Tech           200000     800000000          164
Walmart Inc                              Support        360000     1620000000         303

 

It is hard to compare a liabilities value of 90000000 with an employee count of 30000 directly, but once both are scaled to comparable values, we can easily see how one value relates to another.

Scaling data can be achieved in a variety of ways. We will use a method called standardization in this tutorial.

This formula is used in the standardization method:

z = (x - u) / s

In this equation, z represents the new value, x represents the original value, u represents the mean, and s represents the standard deviation.

Based on the above data set, the first value of liabilities is 90000000, and the scaled value is:

(90000000 - 175100452.77) / 293691058.741973 = -0.28976181

According to the data set above, the first value of the employees column is 30000, and the scaled value is:

(30000 - 40647.222222222) / 72005.90441111 = -0.147865961

Instead of comparing 90000000 with 30000, you can now compare -0.28 with -0.14.
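
As a quick sanity check, the same two z-scores can be computed by hand in plain Python. The mean and standard deviation figures below are the ones quoted above for this data set:

# Mean and standard deviation of each column, as quoted above
liab_mean, liab_std = 175100452.77, 293691058.741973
emp_mean, emp_std = 40647.222222222, 72005.90441111

# z = (x - u) / s for the first row of the table
print((90000000 - liab_mean) / liab_std)  # approximately -0.2898
print((30000 - emp_mean) / emp_std)       # approximately -0.1479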

There are several common utility functions and transformer classes in the sklearn.preprocessing package for converting raw feature vectors into a format suitable for downstream estimation.

The standardization of data sets generally benefits learning algorithms.

The Python sklearn module provides a class called StandardScaler(), which returns a scaler object with methods for transforming data sets.

Scaling should be done as follows:

Example

import pandas
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Load the data set and select the two feature columns
vardf = pandas.read_csv("dataset.csv")
mrx = vardf[['liabilities', 'employees']]

# Compute each column's mean and standard deviation, then standardize
scaledmrx = scale.fit_transform(mrx)

print(scaledmrx)

Result:

Note that the first two values match our manual calculations above: -0.28 and -0.14.

[[-0.28976181 -0.14786596]
[-0.44298404 -0.49505971]
[-0.51108281 -0.42562096]
[-0.55875195 -0.55061071]
[-0.59024763 -0.55755459]
[ 0.0422197 -0.21730471]
[-0.2846544 -0.49505971]
[-0.40450824 -0.53672296]
[-0.53491738 -0.46034034]
[-0.4787359 -0.42562096]
[-0.26361188 -0.25896796]
[-0.02928401 0.47708279]
[-0.55364455 -0.52977909]
[ 0.1528802 0.96315404]
[-0.28193045 -0.42562096]
[-0.56763884 -0.49505971]
[-0.5672643 -0.52977909]
[-0.42604447 -0.21730471]
[ 0.01668266 -0.00898846]
[-0.13892303 -0.42562096]
[-0.54853714 -0.49505971]
[ 0.23732267 -0.23119246]
[ 0.99730495 -0.49505971]
[-0.08137957 -0.18952921]
[ 0.03200488 -0.35618221]
[-0.44706997 -0.42562096]
[-0.43276923 -0.45339646]
[-0.31836329 -0.32840671]
[-0.56045442 -0.52283521]
[-0.53198914 -0.55061071]
[-0.57406975 -0.54644439]
[ 0.65851357 1.29646004]
[ 0.54683159 0.82427654]
[ 0.68731935 1.58810279]
[ 2.12774454 2.21305154]
[ 4.91979413 4.43509154]]
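
If you want to inspect the statistics the scaler learned, a fitted StandardScaler exposes them through its mean_ and scale_ attributes. A minimal sketch, assuming the scale object from the example above has already been fitted:

# Per-column statistics learned by fit_transform()
print(scale.mean_)   # [mean of liabilities, mean of employees]
print(scale.scale_)  # [standard deviation of liabilities, standard deviation of employees]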

Predict Performance

When the data set has been scaled, you must use the same scale when you predict values:

Predict the Performance of a department with 10 employees and liabilities of 100,000 dollars:

Example

import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

# Load the data set, select the features and the target
df = pandas.read_csv("dataset.csv")
mrx = df[['liabilities', 'employees']]
ample = df['Performance']

# Standardize the features, then fit a linear regression on the scaled values
scaledmrx = scale.fit_transform(mrx)
regr = linear_model.LinearRegression()
regr.fit(scaledmrx, ample)

# Scale the new input with the SAME fitted scaler before predicting
scaled = scale.transform([[100000, 10]])

predictedPerformance = regr.predict([scaled[0]])
print(predictedPerformance)

Result:

[Image: scale prediction result]
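
A common way to avoid scaling new inputs by hand is to chain the scaler and the regression model together. The sketch below uses sklearn's make_pipeline and assumes the same dataset.csv and column names as the examples above:

import pandas
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pandas.read_csv("dataset.csv")

# The pipeline standardizes the features, then fits the regression on them
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(df[['liabilities', 'employees']], df['Performance'])

# New inputs are given in their original units; the pipeline scales them itself
print(model.predict([[100000, 10]]))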


Now you know:

  • When working with many machine learning algorithms, data scaling is recommended as a pre-processing step.
  • Input and output variables can be normalized or standardized to achieve data scaling (a normalization sketch follows this list).
  • Standardization and normalization can be applied to improve the performance of predictive modeling algorithms.
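
Normalization, mentioned in the list above, rescales each feature to a fixed range, typically 0 to 1, instead of centering it on the mean. A minimal sketch using sklearn's MinMaxScaler on the same two columns:

import pandas
from sklearn.preprocessing import MinMaxScaler

df = pandas.read_csv("dataset.csv")

# MinMaxScaler maps each column to the 0-1 range: (x - min) / (max - min)
normalized = MinMaxScaler().fit_transform(df[['liabilities', 'employees']])
print(normalized)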

 
