Learning with neural networks depends on the particular parametrization of the functions represented by the network, that is, on how parameters are assigned to functions. It also depends on which functions the network typically realizes at initialization and which parameters arise later during training. The choice of activation function is a critical aspect of network design that influences these properties and merits investigation. This thesis analyzes the expected behavior of networks with maxout (multi-argument) activation functions. Beyond enhancing the practical applicability of maxout networks, these findings contribute to the theoretical exploration of activation functions beyond the common choices. We believe this work can advance the study of activation functions and of more complex neural network architectures.
We begin by taking the number of activation regions as a complexity measure and showing that the practical complexity of deep networks with maxout activation functions is often far from the theoretical maximum. This analysis extends previous results that held for deep networks with single-argument activation functions such as ReLU. We then demonstrate that a similar phenomenon occurs for decision boundaries in classification tasks. We also show that the parameter space contains a multitude of full-dimensional regions of widely different complexity and obtain nontrivial lower bounds on the expected complexity. Finally, we investigate different parameter initialization procedures and show that they can speed up the convergence of gradient descent in training.
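To make the activation-region complexity measure concrete, the following minimal Python sketch (not the thesis's code; the architecture, sampling scheme, and all names are illustrative assumptions) lower-bounds the number of activation regions of a small maxout network by counting the distinct activation patterns, i.e., the per-neuron argmax indices, hit by random inputs. Each distinct pattern witnesses at least one linear region, so the count is a sampling-based lower bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_maxout_layer(fan_in, fan_out, rank):
    # One weight matrix and bias per pre-activation argument of the max.
    W = rng.standard_normal((rank, fan_out, fan_in)) / np.sqrt(fan_in)
    b = rng.standard_normal((rank, fan_out))
    return W, b

def maxout_forward(x, layers):
    # Returns the output and the activation pattern: for every neuron,
    # the index of the pre-activation argument that attains the max.
    pattern = []
    for W, b in layers:
        pre = np.einsum('kof,f->ko', W, x) + b  # shape (rank, fan_out)
        pattern.append(tuple(pre.argmax(axis=0)))
        x = pre.max(axis=0)
    return x, tuple(pattern)

layers = [init_maxout_layer(2, 8, 3), init_maxout_layer(8, 8, 3)]
patterns = set()
for _ in range(50_000):
    x = rng.uniform(-10, 10, size=2)  # sample a square in input space
    _, p = maxout_forward(x, layers)
    patterns.add(p)

# A lower bound on the number of activation regions in the sampled square.
print(len(patterns))
```

Sampling can only underestimate the true count, which is why it pairs naturally with the theoretical upper bounds and expected-complexity lower bounds discussed above.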
Continuing the investigation of the expected behavior, we study the gradients of a maxout network with respect to its inputs and parameters and bound their moments in terms of the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully connected and convolutional networks show that these strategies improve SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve-length distortion, and results on the neural tangent kernel (NTK). As part of this research, we develop multiple experiments and reusable components and make the code publicly available.
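The following is a minimal sketch of the kind of fan-in Gaussian initialization such a strategy prescribes, with a rank-dependent scaling constant c. The exact value of c for each maxout rank is derived in the thesis from the moments of the gradients; the value below is only an assumed placeholder, and the depth check is an illustrative sanity test rather than the thesis's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_maxout_params(fan_in, fan_out, rank, c=0.5):
    # c = 0.5 is an assumed placeholder, not the thesis's derived constant.
    std = np.sqrt(c / fan_in)
    W = rng.normal(0.0, std, size=(rank, fan_out, fan_in))
    b = np.zeros((rank, fan_out))  # zero biases keep pre-activations centered
    return W, b

# Depth sanity check: with a suitable c, the activation norm should stay
# roughly constant across many layers instead of vanishing or exploding.
x = rng.standard_normal(256)
for _ in range(50):
    W, b = init_maxout_params(256, 256, rank=5)
    x = (np.einsum('kof,f->ko', W, x) + b).max(axis=0)
print(np.linalg.norm(x) / np.sqrt(256))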
Identifier | oai:union.ndltd.org:DRESDEN/oai:qucosa:de:qucosa:87962
Date | 10 November 2023
Creators | Tseran, Hanna |
Contributors | Universität Leipzig |
Source Sets | Hochschulschriftenserver (HSSS) der SLUB Dresden |
Language | English |
Detected Language | English |
Type | info:eu-repo/semantics/publishedVersion, doc-type:doctoralThesis, info:eu-repo/semantics/doctoralThesis, doc-type:Text |
Rights | info:eu-repo/semantics/openAccess |