Important open questions:

– How to better explain the fields marked with "~"?

– How to understand the effect of non-Euclidean geometry on shuffling indices in Graph Neural (/Convolutional) Networks?

– What is the role of Taylor approximation in formulas and proofs?

– What is the role of sparse regularization, and why does Deep Learning not require a sparsity condition as a regularization term? What are the effects of dropout, the zero section of ReLU, etc.?

– Why does the current state of Stochastic Approximation (Gradient Descent) have trouble capturing nests of nonlinear features, and why does it learn all weights together?

**Is fuzzification an inductive bias?** If fuzzification imposes a hypothesis that restricts the classifier to certain limited configurations, for example quantization of continuous features into qualitative levels, or heavier use of categorical features to limit the degrees of freedom of the input feature states, then it can be interpreted as an inductive bias.
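As a minimal sketch of this reading (the binning scheme and the `fuzzify` helper are hypothetical illustrations, not from any fixed library): quantizing a continuous feature into a few qualitative levels shrinks the set of distinct input states the classifier can ever see, which is one way to view fuzzification as an inductive bias.

```python
# Sketch: quantizing a continuous feature in [0, 1] into a few fuzzy/
# categorical levels collapses many raw values into few states, restricting
# the degrees of freedom of the input representation.

def fuzzify(x, levels=("low", "medium", "high")):
    """Map a value in [0, 1] to one of a few qualitative levels (hypothetical helper)."""
    idx = min(int(x * len(levels)), len(levels) - 1)
    return levels[idx]

raw = [0.05, 0.12, 0.48, 0.51, 0.93, 0.97]
fuzzy = [fuzzify(x) for x in raw]

print(fuzzy)                                  # each raw value collapses to a level
print(len(set(raw)), "->", len(set(fuzzy)))   # 6 distinct states -> 3
```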

**Does a lookup-table classifier (with operators, comparison, rounding) have any inductive bias?** Depending on the precision of the **rounding operation**, the inputs undergo a quantization process to match the training elements in the lookup table. If no rounding operation is used, no inductive bias is applied.
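A minimal sketch of such a classifier, assuming rounding to a fixed number of decimal digits as the quantization step (the `LookupClassifier` name and data are hypothetical):

```python
# Sketch: a lookup-table classifier whose only inductive bias is the rounding
# precision. With coarse rounding, unseen inputs can hit stored keys; with no
# rounding it can only memorize exact training inputs.

class LookupClassifier:
    def __init__(self, precision=None):
        self.precision = precision  # decimal digits kept; None = no rounding
        self.table = {}

    def _key(self, x):
        if self.precision is None:
            return tuple(x)
        return tuple(round(v, self.precision) for v in x)

    def fit(self, X, y):
        for x, label in zip(X, y):
            self.table[self._key(x)] = label

    def predict(self, x):
        return self.table.get(self._key(x))  # None if no matching entry

coarse = LookupClassifier(precision=0)
coarse.fit([(0.1, 0.2), (3.9, 4.1)], ["a", "b"])
print(coarse.predict((0.04, 0.26)))          # "a": rounds into the same cell

exact = LookupClassifier(precision=None)
exact.fit([(0.1, 0.2)], ["a"])
print(exact.predict((0.1000001, 0.2)))       # None: no generalization at all
```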

**What is the relationship of inductive bias to VC dimension?** An inductive bias imposes conditions on the classifier that limit its decisions. The VC dimension is the maximum number of inputs for which the classifier can adapt to every possible output labeling. The stricter the inductive bias, the less flexible the resulting classifier. For example, SoftMax networks are less flexible than ReLU networks because they carry a stricter inductive bias, realized by a more complex algorithm.
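The claim that a stricter bias yields a less flexible classifier can be checked by brute force on a toy 1D family (the threshold sets below are illustrative assumptions, not from the text): the stricter family realizes fewer labelings of the same points.

```python
# Sketch: a 1D threshold classifier h(x) = [x > t] with a fixed orientation
# is a stricter family than one that may also flip the sign. Counting the
# distinct label vectors each family can produce on fixed points shows the
# stricter bias realizes fewer labelings (lower effective capacity).

points = [0.1, 0.4, 0.7]

def labelings(classifiers):
    """All distinct label vectors the family can produce on `points`."""
    return {tuple(h(x) for x in points) for h in classifiers}

thresholds = [0.0, 0.25, 0.55, 0.85, 1.0]
strict = [lambda x, t=t: int(x > t) for t in thresholds]
flexible = strict + [lambda x, t=t: int(x <= t) for t in thresholds]

print(len(labelings(strict)), len(labelings(flexible)))  # strict realizes fewer
```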

**If FC + shift + SoftMax is not a bias, which column name is a better choice for this kind of information?** A fully connected layer simply maps the input into a subspace, rotating or linearly transforming it, with or without a shift (bias term), followed by a SoftMax activation. The FC layer is affected most by the most highly weighted dimensions. Compared to a lookup table, which checks all dimensions equally, it preferentially selects certain dimensions and approximately ignores some features.
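A sketch of this effect, assuming a hypothetical 2x2 weight matrix in which the first input dimension carries almost all of the weight:

```python
# Sketch: a single FC layer with shift and SoftMax in plain Python.
# A large change in a near-zero-weight dimension barely moves the output,
# illustrating how the FC layer "chooses" highly weighted dimensions and
# approximately ignores the rest (a lookup table treats all dimensions alike).

import math

def fc_softmax(x, W, b):
    z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    m = max(z)                        # subtract max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

W = [[2.0, 0.02],                     # first input dim dominates both outputs
     [-2.0, 0.01]]
b = [0.0, 0.0]

p1 = fc_softmax([1.0, 0.0], W, b)
p2 = fc_softmax([1.0, 5.0], W, b)     # big change in the low-weight dim...
print(p1)
print(p2)                             # ...barely changes the output
```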

**Is there a reason why an FFNN takes its input as a Multinomial distribution? ~** This way, transforming the inputs by the inverse of the multinomial pdf yields extracted variables with maximum variation (maximum entropy). Maximum entropy here means that any function other than SoftMax would lead to variables whose diversity (entropy) is lower, because the negative cross-entropy of a linear combination followed by SoftMax is maximized when the inputs are Multinomial. ~ Passing a variable through its own CDF yields a uniform variable (the probability integral transform).
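The probability integral transform at the end of this note can be checked numerically; the choice of an exponential distribution and the sample size below are arbitrary illustrations:

```python
# Sketch of the probability integral transform: passing samples of a variable
# through that variable's own CDF yields values distributed as Uniform(0, 1).
# Checked here for an Exponential(rate) distribution via the sample mean.

import math
import random

random.seed(0)
rate = 1.5
samples = [random.expovariate(rate) for _ in range(50_000)]

# CDF of Exponential(rate): F(x) = 1 - exp(-rate * x)
u = [1 - math.exp(-rate * x) for x in samples]

mean_u = sum(u) / len(u)
print(round(mean_u, 2))   # close to 0.5, the mean of Uniform(0, 1)
```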

**Is there any other name for the inherent bias in fuzzification and FC + shift + SoftMax?**

**How can optimization terms (the ridge penalty, the prior) be rewritten as a differential equation or a state-space model?**

**TODO: Colorize**

- Generative
- Discriminative
- Supervised
- Logistic regression
- LDA
- SVM
- Perceptron
- Kernel LDA
- Unsupervised
- K-means
- k-NN
- Kernel PCA
- Dictionary learning
- Functional
- Linear combination
- Rectifiers
- Deep models
- Shallow models
- Attention / filters
- cos(w), exp(w)
- Sufficient statistics
- Average
- Standard deviation
- Median
- Mean
- Quotients
- Gain functions

Unified view of ML:

- Shattering
- Regularization
- Nonlinearity learner
- Optimization
- Unsupervised vs supervised
- Generative vs discriminative

**Ideas and background that may help in understanding the table:**

The power of shattering has nothing to do with the true labels of the inputs: the VC dimension is not comprehensive enough to take true labels into account. If it did, it would automatically group same-labeled inputs close to each other and demand less capacity than the conventional VC dimension.

**SVM objective function:**

min_{w,b} max_{αᵢ ≥ 0}  (1/2)‖w‖² − Σᵢ αᵢ [ yᵢ (w·xᵢ + b) − 1 ]

SVM is based on a series of boundary inputs (the support vectors), which control the optimization. A parametric function that wants to be flexible enough to fit as many input variations as possible has to preserve a maximum margin, which is not a necessity in LDA. So the maximum number of inputs n for which the SVM can realize all 2^n classifications is higher than half the number of support vectors.
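A sketch of the role of boundary inputs, assuming a fixed separating hyperplane and hypothetical 2D data: the geometric margin is determined only by the closest points on each side (the supports), while interior points do not affect it.

```python
# Sketch: for a fixed hyperplane (w, b), the geometric margin of each point
# is y_i * (w . x_i + b) / ||w||; the margin of the classifier is the minimum
# over points, attained only at the boundary inputs (support vectors).

import math

w, b = (1.0, 1.0), -3.0            # hyperplane x + y = 3 (hypothetical)

X = [(1.0, 1.0), (0.0, 1.0), (2.0, 2.0), (4.0, 4.0)]
y = [-1, -1, 1, 1]

norm_w = math.sqrt(sum(v * v for v in w))
margins = [yi * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
           for x, yi in zip(X, y)]

geometric_margin = min(margins)
supports = [x for x, m in zip(X, margins) if abs(m - geometric_margin) < 1e-9]
print(round(geometric_margin, 3))
print(supports)    # only the closest point on each side defines the margin
```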