Machine Learning – Scale
This chapter covers scaling data for machine learning in Python, with worked examples.
Scale Features
When your data has values in different units and on different numeric ranges, it can be hard to compare them.
How do pounds (lbs) compare to kilograms (kg)?
How about altitude versus time?
Scaling is the solution to this problem. Scaling data allows us to compare new values more easily.
Check out the table below, where liabilities are given in USD.
Firm | Department | Employees | Liabilities (USD) | Performance |
Amazon Inc | warehouse | 30000 | 90000000 | 89 |
Amazon Inc | IT | 5000 | 45000000 | 200 |
Amazon Inc | Support | 10000 | 25000000 | 400 |
Apple Inc | Designer Dep | 1000 | 11000000 | 80 |
Apple Inc | Audit | 500 | 1750000 | 90 |
Apple Inc | Tech | 25000 | 187500000 | 130 |
BlackRock Inc | Advisors | 5000 | 91500000 | 90 |
BlackRock Inc | Analysts | 2000 | 56300000 | 134 |
BlackRock Inc | Tech | 7500 | 18000000 | 100 |
BlackRock Inc | Sales | 10000 | 34500000 | 66 |
BlackRock Inc | Consultants | 22000 | 97680000 | 76 |
China Petroleum & Chemical Corp. (SNP) | Mechanical | 75000 | 166500000 | 68 |
China Petroleum & Chemical Corp. (SNP) | Research | 2500 | 12500000 | 30 |
China Petroleum & Chemical Corp. (SNP) | Supply | 110000 | 220000000 | 240 |
CVS Health Corp | Research | 10000 | 92300000 | 30 |
CVS Health Corp | Maintenance | 5000 | 8390000 | 69 |
CVS Health Corp | PharmD | 2500 | 8500000 | 508 |
Google Inc | AI | 25000 | 49975000 | 157 |
Google Inc | Advert | 40000 | 180000000 | 782 |
Google Inc | Research | 10000 | 134300000 | 52 |
Google Inc | Finance | 5000 | 14000000 | 89 |
Microsoft Inc | Research | 24000 | 244800000 | 50 |
Microsoft Inc | AI | 5000 | 468000000 | 29 |
Microsoft Inc | OS | 27000 | 151200000 | 1130 |
Royal Dutch Shell PLC | Research | 15000 | 184500000 | 40 |
Royal Dutch Shell PLC | Finance | 10000 | 43800000 | 78 |
Tesla Inc | Engineer | 8000 | 48000000 | 210 |
Tesla Inc | Assemble | 17000 | 81600000 | 330 |
Tesla Inc | Finance | 3000 | 10500000 | 99 |
Tesla Inc | Advisors | 1000 | 18860000 | 40 |
Tesla Inc | Audit | 1300 | 6501300 | 94 |
Walmart Inc | Supply | 134000 | 368500000 | 566 |
Walmart Inc | Finance | 100000 | 335700000 | 79 |
Walmart Inc | warehouse | 155000 | 376960000 | 198 |
Walmart Inc | Tech | 200000 | 800000000 | 164 |
Walmart Inc | Support | 360000 | 1620000000 | 303 |
If we scale both columns into comparable values, it becomes much easier to see how a liabilities figure such as 90000000 relates to an employee count such as 30000.
Scaling data can be achieved in a variety of ways. We will use a method called standardization in this tutorial.
This formula is used in the standardization method:
z = (x - u) / s
In this equation, z represents the new value, x represents the original value, u represents the mean, and s represents the standard deviation.
Based on the above data set, the first value of liabilities is 90000000, and the scaled value is:
(90000000 - 175100452.77) / 293691058.741973 = -0.28976181
According to the data set above, the first value of the employees column is 30000, and the scaled value is:
(30000 - 40647.222222222) / 72005.90441111 = -0.147865961
Instead of comparing 90000000 with 30000, you can now compare -0.28 with -0.14.
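The two worked examples above can be checked directly in Python, using the mean and standard deviation values quoted in the text:

```python
# Mean and standard deviation quoted in the text above
z_liabilities = (90000000 - 175100452.77) / 293691058.741973
z_employees = (30000 - 40647.222222222) / 72005.90441111

print(z_liabilities)  # roughly -0.2898
print(z_employees)    # roughly -0.1479
```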
The sklearn.preprocessing package provides several common utility functions and transformer classes for converting raw feature vectors into a representation suitable for downstream estimators.
The standardization of data sets generally benefits learning algorithms.
Python's sklearn module has a class called StandardScaler(), which returns a scaler object with methods for transforming data sets.
Scaling should be done as follows:
Example
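A minimal sketch of the scaling step. The two columns are embedded inline from the table above so the snippet is self-contained; in practice you would more likely load them from a file, and the variable names here are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Liabilities (USD) and employee counts, row by row from the table above
liabilities = [
    90000000, 45000000, 25000000, 11000000, 1750000, 187500000,
    91500000, 56300000, 18000000, 34500000, 97680000, 166500000,
    12500000, 220000000, 92300000, 8390000, 8500000, 49975000,
    180000000, 134300000, 14000000, 244800000, 468000000, 151200000,
    184500000, 43800000, 48000000, 81600000, 10500000, 18860000,
    6501300, 368500000, 335700000, 376960000, 800000000, 1620000000,
]
employees = [
    30000, 5000, 10000, 1000, 500, 25000, 5000, 2000, 7500, 10000,
    22000, 75000, 2500, 110000, 10000, 5000, 2500, 25000, 40000,
    10000, 5000, 24000, 5000, 27000, 15000, 10000, 8000, 17000,
    3000, 1000, 1300, 134000, 100000, 155000, 200000, 360000,
]

# Each column is standardized independently: z = (x - u) / s
X = np.column_stack([liabilities, employees])
scale = StandardScaler()
scaledX = scale.fit_transform(X)

print(scaledX)
```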
Result:
As you can see, the first two values correspond to our calculations: -0.28 and -0.14.
[[-0.28976181 -0.14786596]
 [-0.44298404 -0.49505971]
 [-0.51108281 -0.42562096]
 [-0.55875195 -0.55061071]
 [-0.59024763 -0.55755459]
 [ 0.0422197  -0.21730471]
 [-0.2846544  -0.49505971]
 [-0.40450824 -0.53672296]
 [-0.53491738 -0.46034034]
 [-0.4787359  -0.42562096]
 [-0.26361188 -0.25896796]
 [-0.02928401  0.47708279]
 [-0.55364455 -0.52977909]
 [ 0.1528802   0.96315404]
 [-0.28193045 -0.42562096]
 [-0.56763884 -0.49505971]
 [-0.5672643  -0.52977909]
 [-0.42604447 -0.21730471]
 [ 0.01668266 -0.00898846]
 [-0.13892303 -0.42562096]
 [-0.54853714 -0.49505971]
 [ 0.23732267 -0.23119246]
 [ 0.99730495 -0.49505971]
 [-0.08137957 -0.18952921]
 [ 0.03200488 -0.35618221]
 [-0.44706997 -0.42562096]
 [-0.43276923 -0.45339646]
 [-0.31836329 -0.32840671]
 [-0.56045442 -0.52283521]
 [-0.53198914 -0.55061071]
 [-0.57406975 -0.54644439]
 [ 0.65851357  1.29646004]
 [ 0.54683159  0.82427654]
 [ 0.68731935  1.58810279]
 [ 2.12774454  2.21305154]
 [ 4.91979413  4.43509154]]
Predict Performance
When the data set is scaled, you must apply the same scale when predicting new values:
Predict the performance of a department with 10 employees and liabilities of 100,000 dollars:
Example
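A sketch of the prediction step under the same assumptions (data embedded inline from the table; a linear regression relates the two scaled columns to Performance). The new observation must be transformed with the same fitted scaler before predicting:

```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

# Data from the table above, row by row
liabilities = [
    90000000, 45000000, 25000000, 11000000, 1750000, 187500000,
    91500000, 56300000, 18000000, 34500000, 97680000, 166500000,
    12500000, 220000000, 92300000, 8390000, 8500000, 49975000,
    180000000, 134300000, 14000000, 244800000, 468000000, 151200000,
    184500000, 43800000, 48000000, 81600000, 10500000, 18860000,
    6501300, 368500000, 335700000, 376960000, 800000000, 1620000000,
]
employees = [
    30000, 5000, 10000, 1000, 500, 25000, 5000, 2000, 7500, 10000,
    22000, 75000, 2500, 110000, 10000, 5000, 2500, 25000, 40000,
    10000, 5000, 24000, 5000, 27000, 15000, 10000, 8000, 17000,
    3000, 1000, 1300, 134000, 100000, 155000, 200000, 360000,
]
performance = [
    89, 200, 400, 80, 90, 130, 90, 134, 100, 66, 76, 68, 30, 240,
    30, 69, 508, 157, 782, 52, 89, 50, 29, 1130, 40, 78, 210, 330,
    99, 40, 94, 566, 79, 198, 164, 303,
]

X = np.column_stack([liabilities, employees])
y = performance

# Fit the scaler on the training data, then train a regression
# on the scaled features
scale = StandardScaler()
scaledX = scale.fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# Scale the new observation with the SAME scaler before predicting
scaled = scale.transform([[100000, 10]])
predicted = regr.predict(scaled)
print(predicted)
```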
Result:
Now you know:
- When working with many machine learning algorithms, data scaling is recommended as a pre-processing step.
- Input and output variables can be normalized or standardized to achieve data scaling.
- Standardization and normalization can be applied to improve the performance of predictive modeling algorithms.
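Normalization, mentioned above but not used in this tutorial's examples, rescales each column to a fixed range instead of centering on the mean. A minimal sketch with sklearn's MinMaxScaler, applied to the first three rows of the table:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First three rows of the table: liabilities and employees
X = np.array([
    [90000000, 30000],
    [45000000, 5000],
    [25000000, 10000],
])

# Each column is mapped to [0, 1]: (x - min) / (max - min)
normalized = MinMaxScaler().fit_transform(X)
print(normalized)
```

The column minimum becomes 0, the maximum becomes 1, and everything else falls in between.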