VC-Dimension V.S. Inductive Bias V.S. Biology V.S. Physical Laws : Comprehensive Multi-Disciplinary Table of Machine Learning Classifiers | by Medium_AI_CS_ML

Open but important questions:

– How can the fields marked “~” be explained better?

– How should we understand the effect of non-Euclidean geometry on shuffling node indices in graph neural (or graph convolutional) networks?

– What is the role of Taylor approximation in the formulas and proofs?

– What is the role of sparse regularization, and why has deep learning not required a sparsity condition as a regularization term? What are the effects of dropout, the zero region of ReLU, etc.?

– Why does the current state of stochastic approximation (gradient descent) have trouble capturing nested nonlinear features, and why does it learn all weights together?

Is fuzzification an inductive bias? If fuzzification yields a hypothesis that restricts the classifier to a limited set of configurations, it can be seen as quantizing qualitative features, or as relying more on categorized features, to limit the degrees of freedom of the input feature states. So it may be interpreted as an inductive bias.

Does a lookup-table classifier (with operators, comparison, and rounding) have any inductive bias? Depending on the precision of the rounding operation, the inputs undergo a quantization process so that they match the training elements in the lookup table. If no rounding operation is used, no inductive bias is applied.
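The rounding argument above can be sketched in a few lines. This is only an illustration under assumptions: the helper names `build_table` and `predict`, the `decimals` parameter, and the toy data are all hypothetical and not part of the original table.

```python
def quantize(x, decimals):
    """Round each feature; decimals=None means no rounding (exact match only)."""
    if decimals is None:
        return tuple(x)
    return tuple(round(v, decimals) for v in x)


def build_table(samples, labels, decimals):
    """Store each training input under its (possibly rounded) key."""
    return {quantize(x, decimals): y for x, y in zip(samples, labels)}


def predict(table, x, decimals):
    """Return the stored label for the quantized input, or None if unseen."""
    return table.get(quantize(x, decimals))


# Toy data (made up for illustration).
train_x = [(0.12, 0.98), (0.91, 0.07)]
train_y = ["a", "b"]
query = (0.14, 1.01)  # close to the first training point, but not identical

coarse = build_table(train_x, train_y, decimals=1)
exact = build_table(train_x, train_y, decimals=None)

coarse_answer = predict(coarse, query, decimals=1)    # rounding merges nearby inputs
exact_answer = predict(exact, query, decimals=None)   # no rounding: unseen input
```

With coarse rounding the query falls into the same cell as a training point and inherits its label; without rounding the table only answers for inputs it has already seen, which is the "no inductive bias" case.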

What is the relationship between inductive bias and VC dimension? An inductive bias imposes a condition on the classifier that restricts its decisions under certain circumstances. The VC dimension is the largest number of inputs that the classifier can fit to every possible assignment of outputs. The stricter the inductive bias, the less flexible the resulting classifier. For example, SoftMax networks are less flexible than ReLU networks because of the stricter inductive bias imposed by a very complex algorithm.

If FC + shift + SoftMax is not a bias, which column name is a better choice for this kind of information? A fully connected layer simply maps the input into a subspace, rotating or linearly transforming it, with or without a shift (bias) term and a SoftMax activation function. These components do have some impact on the decision. An FC layer is affected most by the most highly weighted dimensions. Compared with a lookup table, which checks all dimensions, it selectively favors certain dimensions and approximately ignores some features.
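A minimal sketch of the point about highly weighted dimensions, assuming a single FC layer followed by SoftMax. The weights, inputs, and function names (`fc_softmax`, `predicted_class`) are hand-picked for illustration and are not taken from any trained model.

```python
import math


def softmax(logits):
    """Numerically stable SoftMax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fc_softmax(x, weights, bias):
    """Fully connected layer (weights @ x + bias) followed by SoftMax."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def predicted_class(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical weights: feature 0 is weighted heavily, feature 1 barely at all.
W = [[5.0, 0.01],
     [-5.0, -0.01]]
b = [0.0, 0.0]

base = fc_softmax([1.0, 1.0], W, b)
flip_minor = fc_softmax([1.0, -1.0], W, b)   # flip the lightly weighted feature
flip_major = fc_softmax([-1.0, 1.0], W, b)   # flip the heavily weighted feature
```

Flipping the lightly weighted feature leaves the predicted class unchanged, while flipping the heavily weighted one reverses it. A lookup table would treat both changes as equally different inputs.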

Is there a reason why an FFNN treats its input as a multinomial distribution? ~ Transforming the inputs by the inverse of the multinomial pdf yields extracted variables with maximum variation (maximum entropy). Maximum entropy here means that any function other than SoftMax would produce variables whose diversity (entropy) is lower than in this case, because the negative cross-entropy of a linear-combination SoftMax is maximal when the inputs are multinomial. ~ Passing every variable through the cumulative distribution function of its own distribution yields a uniform variable.
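The last remark (a variable passed through its own cumulative distribution function becomes uniform) can be checked numerically. The sketch below assumes an exponential distribution purely as a convenient example. The sample size, seed, and bin count are arbitrary choices.

```python
import math
import random


def exponential_cdf(x, rate=1.0):
    """CDF of the exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.expovariate(1.0) for _ in range(20000)]

# Pass every sample through the CDF of the distribution it came from.
u = [exponential_cdf(x) for x in samples]

mean_u = sum(u) / len(u)

# Fraction of transformed values landing in each quarter of [0, 1].
counts = [0, 0, 0, 0]
for v in u:
    counts[min(int(v * 4), 3)] += 1
fractions = [c / len(u) for c in counts]
```

If the transformed values are uniform on [0, 1], `mean_u` should be close to 0.5 and each entry of `fractions` close to 0.25. The uniform distribution is the maximum-entropy distribution on a bounded interval, which is the link to the entropy argument above.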

Is there another name for the inherent bias in fuzzification and in FC + shift + SoftMax?

How can optimization terms (ridge, prior) be rewritten as a differential equation or a state-space model?

TODO: Colorize

Generative
Discriminative
Supervised
Logistic regression
LDA
SVM
Perceptron
Kernel LDA
Unsupervised
K-means
KNN
Kernel PCA
Dictionary learning
Functional
Linear combination
Rectifiers
Deep models
Shallow models
Attention/ filters
cos(w) exp(w)
Sufficient statistics
Average
Standard deviation
Median
Mean
Quotients
Gain functions

Unified view of ML:

Shattering
Regularization
Nonlinearity learner
Optimization
Unsupervised vs supervised
Generative vs discriminative

Ideas and background that may help in understanding the table:

Shattering power has nothing to do with the true labels of the inputs. The VC dimension is not comprehensive enough to take the true labels into account. If it did, it would automatically group same-labelled inputs close to each other and demand less than the conventional VC dimension does.
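To make the shattering idea concrete, here is a brute-force sketch for a linear classifier in the plane. It is an illustration under assumptions: the perceptron with a fixed epoch budget is used as a stand-in separability test (failing to converge within the budget is treated as "not separable"), and the function names and point sets are hypothetical.

```python
from itertools import product


def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find a separating line with the perceptron rule.

    Returns True if a full pass makes no mistakes within the epoch budget.
    A False result means no separator was found within the budget.
    """
    w = [0.0] * (len(points[0]) + 1)  # last entry is the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            xa = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if y * score <= 0:
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def shatters(points):
    """Check every one of the 2^n labelings, ignoring any 'true' labels."""
    return all(perceptron_separates(points, labels)
               for labels in product([-1, 1], repeat=len(points)))


three_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
four_points = three_points + [(1.0, 1.0)]

three_ok = shatters(three_points)  # every labeling of 3 such points is separable
four_ok = shatters(four_points)    # the XOR labeling of the square is not
```

Note that `shatters` never looks at true labels: it only asks whether all 2^n assignments can be realized. The four-point result only covers this particular square, but it is consistent with the known VC dimension of 3 for linear classifiers in two dimensions.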

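SVM objective function (Lagrangian form; c_i = 1 in the standard hard-margin case):

min over w, b and max over alpha_i ≥ 0 of: 1/2 ||w||² - Σ_i alpha_i (y_i (w·x_i + b) - c_i)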

An SVM is determined by a set of boundary inputs (the support vectors), which control the optimization. A parametric function that wants to stay flexible across as many input variations as possible has to preserve a maximum margin, which is not a requirement in LDA. So the maximum number of inputs n that an SVM can classify in all 2^n ways is higher than half the number of support vectors.
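The role of the boundary inputs can be illustrated on a toy set. In this sketch the hyperplane (`w`, `b`) is chosen by hand rather than produced by an optimizer, and the four points are made up. For this particular set, the hand-picked hyperplane coincides with the maximum-margin one.

```python
import math

# Toy, linearly separable data (made up for illustration).
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]

# Hand-picked hyperplane w.x + b = 0, scaled so the closest points have margin 1.
w = (0.5, 0.0)
b = 0.0


def functional_margins(points, labels, w, b):
    """y_i * (w . x_i + b) for every input."""
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            for x, y in zip(points, labels)]


margins = functional_margins(points, labels, w, b)

# Boundary inputs: those that sit exactly on the margin (functional margin 1).
support_idx = [i for i, m in enumerate(margins) if math.isclose(m, 1.0)]

# Geometric margin: distance from the hyperplane to the closest inputs.
geometric_margin = 1.0 / math.hypot(*w)
```

Only the inputs listed in `support_idx` constrain the hyperplane. Moving the other two points further from the boundary would leave `w`, `b`, and the margin unchanged.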