Training and test data are assumed to be i.i.d.: independently and identically distributed. Put another way, every data point in the training and test sets is a sample from the same probability distribution p(X), and no data point is influenced by any other. Only under this assumption can you diagnose overfitting or underfitting by looking at the gap between training error and test error. Something I found interesting is that sequential data violates the i.i.d. assumption, since each element of a sequence depends on the elements that came before it.
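
To make the contrast between i.i.d. and sequential data concrete, here is a minimal sketch using only the Python standard library. It compares the lag-1 autocorrelation of independent Gaussian draws with that of a simple autoregressive sequence. The AR(1) process, the coefficient `phi = 0.9`, the sample size, and the helper `lag1_autocorr` are all assumptions chosen for illustration, not anything from a particular dataset or library.

```python
import random


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation: how much x[t+1] tracks x[t]."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# i.i.d. draws: every sample comes from the same distribution
# and ignores every other sample.
iid = [rng.gauss(0.0, 1.0) for _ in range(n)]

# AR(1) sequence: each element is built from the previous one plus noise,
# so neighbouring elements are dependent.
phi = 0.9  # hypothetical dependence strength
seq = [0.0]
for _ in range(n - 1):
    seq.append(phi * seq[-1] + rng.gauss(0.0, 1.0))

iid_corr = lag1_autocorr(iid)
seq_corr = lag1_autocorr(seq)
print(f"i.i.d. lag-1 autocorrelation: {iid_corr:.3f}")
print(f"AR(1)  lag-1 autocorrelation: {seq_corr:.3f}")
```

The i.i.d. samples should show an autocorrelation close to zero, while the sequence shows one close to `phi`. That dependence is why shuffling a time series into random train/test splits can make the test error look better than it should.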

Within deep learning, standard fully-connected feedforward neural nets are the most general models. Theoretically (to the best of my knowledge), nothing stops a feedforward neural network from getting incredible results on any task: it’s the most open, general way of mapping inputs to outputs. Feedforward neural nets can learn all sorts of patterns *implicitly*. The catch is that a stupid amount of data is required to converge. The model needs to search a vast parameter space before it finds a good arrangement. Improvements introduce specialization and add a regularizing effect, effectively cutting down the size of the parameter space being searched. RNNs explicitly define and look for sequential/temporal patterns. CNNs explicitly define and look for spatial patterns. No Free Lunch.
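
One rough way to see how specialization shrinks the search space is to count parameters. The sketch below is plain arithmetic, not tied to any framework. The layer sizes (a 28x28 grayscale image, a 3x3 kernel, a 100-step sequence with 16 input features and 32 hidden units) and the helper names are hypothetical, picked only to show the order-of-magnitude gap.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def conv2d_params(kernel, c_in, c_out):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + c_out


def rnn_params(n_in, n_hidden):
    """Input weights, recurrent weights, and biases of a vanilla RNN cell."""
    return n_hidden * (n_in + n_hidden) + n_hidden


# Hypothetical image task: 28x28 grayscale input mapped to a same-sized output.
pixels = 28 * 28
dense_image = dense_params(pixels, pixels)
conv_image = conv2d_params(kernel=3, c_in=1, c_out=1)  # one shared 3x3 filter

# Hypothetical sequence task: 100 steps, 16 features in, 32 hidden units per step.
steps, n_in, n_hidden = 100, 16, 32
dense_sequence = dense_params(steps * n_in, steps * n_hidden)  # flattened sequence
rnn_sequence = rnn_params(n_in, n_hidden)  # weights reused at every step

print(f"image:    dense={dense_image:,}  conv={conv_image:,}")
print(f"sequence: dense={dense_sequence:,}  rnn={rnn_sequence:,}")
```

The convolution and the recurrent cell reuse the same weights at every spatial position or time step, so their parameter counts do not grow with the image size or the sequence length. That weight sharing is the built-in assumption about spatial or temporal structure, and it is what the dense layer would otherwise have to discover from data.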