Solving a classification problem with Deep Learning and classical Machine Learning

A small benchmark (like this one): we generate data, then train a neural network (DL, deep learning) and statistical models (ML, machine learning) on it. We evaluate the results by accuracy (Confusion Matrix), by the Decision Boundary contour plot, and by training time. We classify the synthetic data in three ways (on different amounts of data, from 1,000 to 100,000 examples):

  • A DL model with one layer of 8 neurons

  • Support Vector Classifier

  • Decision Tree Classifier

The article covers the main settings of the DL model. The conclusions follow, and the detailed results are given at the end for reference, so as not to clutter the main text. The same results can be reproduced by running the code from Google Colab. Only the necessary code is shown in the article; the details are in the notebook.

Decision Boundary and TensorFlow Playground

Only one neural network architecture is considered in this article. To explore network architectures, you can also try TensorFlow Playground. There you can visually edit a neural network and watch the result, stepping through training and observing the coefficient values. It uses its own neural network library, but you can dig into the source code. Note the Output column in the figure below: it shows the Test Loss and the Decision Boundary. We will produce the same kind of picture and use it as a visual characteristic.

The dataset consists of two-dimensional points, each with a specific color (blue or orange). The two coordinates are the features, the color is the label. Reaching acceptable training and a test loss of 3.6% took 369 epochs. The right side shows the contour plot of the decision boundary.

Data separability and the number of neurons

For the data to be classified with three neurons, there must be a gap between the classes. In the figure above you can see a ring-shaped gap between the groups of orange and blue points. In my case there is no such gap (the figure is shown below), and a minimum of 6 neurons is needed (see the code in Google Colab at the link below). This requirement on the number of neurons arises because the classes sit tightly against each other and the gap between the blue and orange labels in my dataset is small (it is almost nonexistent). If, in the data generation section, instead of:

blue_points_separable = blue_points[distance_from_origin_blue > 1.0]

you write something like this:

blue_points_separable = blue_points[distance_from_origin_blue > 1.4]

then the data can be classified with three neurons as well. For example, like this (the picture above shows 369 epochs, but 200 epochs will be enough):

history = model.fit(X_train, y_train, epochs=200, batch_size=10, validation_split=0.2, verbose=1)

This is what the Decision Boundary looks like for a three-neuron DL model run with the above script (Google Colab). As you can see, it cannot be accurate enough if there is not enough of a gap between the classes. The data generated by TensorFlow Playground does have such a gap, which is why a three-neuron network classifies it successfully.
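For reference, the three-neuron variant differs from the main model only in the size of the hidden layer; a minimal sketch (the activations mirror the 8-neuron model shown later, the exact settings are in the notebook):

import tensorflow as tf

# Hypothetical three-neuron variant: same structure, smaller hidden layer
model_3n = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(2,)),              # two features: X1, X2
    tf.keras.layers.Dense(3, activation='relu'),    # hidden layer of 3 neurons
    tf.keras.layers.Dense(1, activation='sigmoid')  # output layer for binary classification
])
model_3n.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])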

In our examples we will also use datasets of orange and blue points, but we will not aim for the minimum number of neurons, only for approximately equally good prediction results across the models.

Data Generation

The Google Colab is here. The notebook contains all the code and all the calculations: generation, training, predictions, and graphs. It is divided into 4 parts, collapsed under headings.

When generated, the data is saved to a CSV file (circle_classification_separable_dataset.csv). The generator code is commented and quite simple, so I'll just immediately show a graph of the generated data (for 1000 points):

The generated labeled data (circle_classification_separable_dataset.csv) is centered around the origin. It is noticeable that the classes lie close together, and this makes classification with a minimum number of neurons difficult. As stated earlier, with three neurons the Decision Boundary area will be a quadrilateral, and in that case the model cannot pick a quadrilateral that captures the orange points without picking up the blue ones. Example for 1000 points.
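For orientation, a minimal sketch of how such a generator might be structured, reusing the variable names from the filtering snippet above; the distributions, the radius threshold, the label encoding, and the column names are assumptions, the actual generator is in the notebook:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000  # points per class (before filtering)

# Both classes are sampled around the origin; the blue class is then kept
# only outside a radius of 1.0, as in the filtering snippet shown earlier.
blue_points = rng.normal(0.0, 1.0, size=(n, 2))
orange_points = rng.normal(0.0, 0.5, size=(n, 2))

distance_from_origin_blue = np.linalg.norm(blue_points, axis=1)
distance_from_origin_orange = np.linalg.norm(orange_points, axis=1)

blue_points_separable = blue_points[distance_from_origin_blue > 1.0]
orange_points_separable = orange_points[distance_from_origin_orange <= 1.0]

# Label 0 = orange, 1 = blue (assumed encoding)
X = np.vstack([orange_points_separable, blue_points_separable])
y = np.concatenate([np.zeros(len(orange_points_separable)),
                    np.ones(len(blue_points_separable))])
pd.DataFrame({'X1': X[:, 0], 'X2': X[:, 1], 'label': y}).to_csv(
    'circle_classification_separable_dataset.csv', index=False)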

So, the purpose of classification is for the neural network to guess the color of a point depending on the coordinates.

Data classification by neural network

A neural network, unlike the SVC and Decision Tree models, is specified by more than one command and has a number of parameters, ranging from the number of neurons to compilation parameters.
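For contrast, each of the classical ML models is essentially one command. A minimal sketch, assuming default hyperparameters and that X_train, y_train are the same training split used for the DL model (the exact settings are in the notebook):

import time
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Train each classical model and measure the elapsed training time
for name, clf in [('SVC', SVC()), ('Decision Tree', DecisionTreeClassifier())]:
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    print(f"{name}: elapsed time {time.perf_counter() - start:.4f} seconds")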

The classification code is also available from the link given above. The network is sequential with one hidden layer of 8 neurons. The output layer of this neural network must have one neuron and an activation function suitable for binary classification, which is sigmoid. Two features are supplied as input (the point coordinates X1 and X2), so the input shape is declared as shape=(2,):

model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(2,)),             # input shape declared separately
    tf.keras.layers.Dense(8, activation='relu'),   # 1st hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid') # output layer for binary classification
])

Compiling the model

A little about optimization parameters. They are determined when the model is compiled:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

  • optimizer='adam': an adaptive optimizer suitable for general problems (those without special demands such as reinforcement learning, constraints, or more complex network architectures); it is often used in binary classification.

  • loss='binary_crossentropy': how the error is calculated when training the network. The formula involves logarithms (see the sketch after this list). The sigmoid function outputs a number between 0 and 1, which is interpreted as the probability of belonging to class 0 or 1, so the error is computed on this probability rather than simply as "one minus the other"; this is also where the concept of entropy comes in. It is the standard choice for binary classification problems.

  • metrics=['accuracy']: a visually easy-to-understand metric of the quality of the trained model at each epoch, especially when both classes are equally important (that is, we are not particularly worried about mispredicting one specific class, as in disease diagnosis or credit scoring problems).
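For reference, here is a minimal sketch of the binary cross-entropy formula as a plain NumPy function (the clipping constant is an assumption to avoid log(0); Keras computes the same quantity internally):

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    p = np.clip(y_pred, eps, 1 - eps)  # guard against log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# A confident correct prediction gives a small loss,
# a confident wrong prediction gives a large one.
print(binary_crossentropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
print(binary_crossentropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303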

Accuracy vs Loss

When you start training the neural-network model (fit()), you will see that Loss and Accuracy are printed at each epoch.

Output of fit() with the parameter verbose=1 (the default). With verbose=0 the per-epoch output is not printed.

loss is a measure of how much the result produced for a given input differs from the true label, and accuracy is the percentage of correct predictions out of the total. The val prefix comes from validation (data): an evaluation on the validation data. Its share of the training data is set when training is started, via the validation_split parameter:

history = model.fit(X_train, y_train, epochs=100, batch_size=32, validation_split=0.2)
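The Loss(Epoch) convergence graphs mentioned in the conclusions can be drawn from the returned history object; a minimal sketch (the figure size and titles are assumptions):

import matplotlib.pyplot as plt

# history.history holds per-epoch values of loss/accuracy and their val_ counterparts
plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch'); plt.legend(); plt.title('Loss(Epoch)')

plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['val_accuracy'], label='val_accuracy')
plt.xlabel('Epoch'); plt.legend(); plt.title('Accuracy(Epoch)')

plt.tight_layout()
plt.show()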

Results of the DL model

Plot of the test data with the true labels (filled circles) and overlaid labels based on the prediction results (open circles). The model made a mistake wherever an orange marker is surrounded by blue, or vice versa.

Results of the neural network model's predictions on a segment of the test data. If the colors of the filled circle and the surrounding circle do not match, the prediction is erroneous.
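The Accuracy and Confusion Matrix values shown in the reference results below can be obtained in the usual way; a minimal sketch, assuming X_test, y_test hold the held-out test split and a 0.5 threshold on the sigmoid output:

from sklearn.metrics import accuracy_score, confusion_matrix

# Sigmoid output -> class label via a 0.5 threshold
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5).astype(int).ravel()

print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))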

Results of DL and ML models

To visually evaluate the prediction, we can use the Decision Boundary, built on a uniform grid over the area of the source data. In the center there is a gray area in which the test points are marked in orange. We will present three options, giving the accuracy, the Decision Boundary picture, and the model training time. You can see that the training times of the models differ significantly.

There is a built-in scikit-learn function, or rather a class, for working with the Decision Boundary. Our script requires working with both the DL model (Keras) and the ML models, so we stayed with a custom one. In the custom plotting function plot_decision_boundary(X, y, model, resolution=100) there are two plt.scatter() variants so that you don't end up with a mess of dots. If the number of points is more than 5000, it is recommended to display every 20th point (of the original set) on the Decision Boundary, for which you swap the comments in this code:

plt.scatter(X[y==0][:, 0], X[y==0][:, 1], color='orange', label='Orange (True 0)')
#plt.scatter(X[y==0][::20, 0], X[y==0][::20, 1], color='orange', label='Orange (True 0)')
plt.scatter(X[y==1][:, 0], X[y==1][:, 1], color='blue', label='Blue (True 1)')
#plt.scatter(X[y==1][::20, 0], X[y==1][::20, 1], color='blue', label='Blue (True 1)')
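For context, a minimal sketch of what such a custom plot_decision_boundary could look like; the grid margins, the contour colors, and the handling of the Keras/scikit-learn prediction difference are assumptions, the actual function is in the notebook:

import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(X, y, model, resolution=100):
    # Uniform grid over the area of the source data
    x1 = np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, resolution)
    x2 = np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, resolution)
    g1, g2 = np.meshgrid(x1, x2)
    grid = np.c_[g1.ravel(), g2.ravel()]

    # Thresholding at 0.5 works both for sigmoid probabilities (Keras)
    # and for hard 0/1 labels (scikit-learn)
    z = (np.asarray(model.predict(grid)).ravel() > 0.5).astype(int).reshape(g1.shape)

    plt.contourf(g1, g2, z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[y==0][:, 0], X[y==0][:, 1], color='orange', label='Orange (True 0)')
    plt.scatter(X[y==1][:, 0], X[y==1][:, 1], color='blue', label='Blue (True 1)')
    plt.legend()
    plt.show()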

Conclusions

  1. With a small dataset, similar visual signs of correct classifier operation, and approximately the same metrics, the classical ML algorithms significantly outperform the DL algorithm in training time.

  2. When the size of the initial data was increased from 1,000 to 100,000, the DL model's training time did not grow in proportion to the number of examples (points): in the last case a 100-fold increase in data led to roughly a 46-fold increase in training time. Meanwhile, both classical models' training time grew by a factor larger than the increase in input data.

  3. With a small amount of data (1,000 points), the classical models outperform the DL model very significantly, but as the number of points grows this advantage may shrink. It had not disappeared by 100,000 points, though: the DL model was still significantly slower to train.

  4. As the number of samples (points) increases, the DL accuracy converges to acceptable values faster; this can be seen from the Loss(Epoch) convergence graphs plotted in the DL model training code.

One more point should be noted (a consequence of the 4th conclusion). The DL model was always trained for the full 100 epochs. In many situations this is not necessary: in the 100,000-point example the Loss(Epoch) graph flattened out much earlier. Therefore, the training performance of the DL model on a large amount of data may be even more favorable and closer to the SVC model.
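If you want to stop training as soon as the loss curve flattens out, Keras offers the EarlyStopping callback; a minimal sketch (the patience value of 5 epochs is an assumption):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # stop when the validation loss stops improving
    patience=5,                   # wait 5 epochs before stopping
    restore_best_weights=True)    # keep the weights of the best epoch

history = model.fit(X_train, y_train, epochs=100, batch_size=32,
                    validation_split=0.2, callbacks=[early_stop])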

Results for reference

1000 points

1000 blue, 1000 orange points. From left to right: DL (8 neurons), SVC, Decision Tree Classifier.

DL 8 neurons
Accuracy: 99.71%
Confusion Matrix:
[[201   1]
 [  0 148]]
Elapsed time: 19.2379 seconds

Support Vector Classifier (SVC)
Accuracy: 99.14%
Confusion Matrix:
[[199   3]
 [  0 148]]
Elapsed time: 0.0068 seconds

Decision Tree Classifier
Accuracy: 98.29%
Confusion Matrix:
[[191   4]
 [  2 153]]
Elapsed time: 0.0045 seconds

2000 points

2000 blue, 2000 orange points. The xN multiplier shows the increase in training time relative to the 1000-point run of the same model.

DL 8 neurons
Accuracy: 100.00%
Confusion Matrix:
[[433   0]
 [  0 264]]
Elapsed time: 32.8104 seconds x1.7

Support Vector Classifier (SVC)
Accuracy: 99.71%
Confusion Matrix:
[[431   2]
 [  0 264]]
Elapsed time: 0.0166 seconds x2.35

Decision Tree Classifier
Accuracy: 99.57%
Confusion Matrix:
[[432   1]
 [  2 262]]
Elapsed time: 0.0067 seconds x1.48

4000 points

4000 blue, 4000 orange points.

DL 8 neurons
Accuracy: 99.35%
Confusion Matrix:
[[816   5]
 [  4 562]]
Elapsed time: 66.0951 seconds x3.43

Support Vector Classifier (SVC)
Accuracy: 99.64%
Confusion Matrix:
[[816   5]
 [  0 566]]
Elapsed time: 0.0439 seconds x6.45

Decision Tree Classifier
Accuracy: 99.50%
Confusion Matrix:
[[817   4]
 [  3 563]]
Elapsed time: 0.0131 seconds x2.91

100,000 points

100,000 blue, 100,000 orange points. Every 20th point is plotted.

DL 8 neurons
Accuracy: 99.69%
Confusion Matrix:
[[19811   103]
 [    5 14948]]
Elapsed time: 893.0035 seconds x46.51

Support Vector Classifier (SVC)
Accuracy: 99.62%
Confusion Matrix:
[[19780   134]
 [    0 14953]]
Elapsed time: 12.5210 seconds x1841

Decision Tree Classifier
Accuracy: 99.48%
Confusion Matrix:
[[19817    97]
 [   83 14870]]
Elapsed time: 0.6075 seconds x135

When the number of examples in the data is increased 100-fold, the DL model takes 46 times longer, the SVC takes 1841 times longer, and the Decision Tree takes 135 times longer. The training time multiplier for the classical models exceeded the multiplier of the data size; for the SVC model it was always larger.
