Global ETD Search

1	Robust portfolio construction: using resampled efficiency in combination with covariance shrinkage Combrink, James January 2017 The thesis considers the general area of robust portfolio construction. In particular, it considers two techniques that aim to improve portfolio construction, and consequently portfolio performance. The first technique focusses on estimation error in the sample covariance matrix (one of the portfolio optimisation inputs): shrinkage techniques applied to the sample covariance matrix are considered and their merits assessed. The second technique focusses on the portfolio construction/optimisation process itself. Here the thesis adopts the 'resampled efficiency' proposal of Michaud (1989), which uses Monte Carlo simulation from the sampled distribution to generate a range of resampled efficient frontiers. Thereafter the thesis assesses the merits of combining these two techniques in the portfolio construction process. Portfolios are constructed using a quadratic programming algorithm requiring two inputs: (i) expected returns; and (ii) cross-sectional behaviour and individual risk (the covariance matrix). The output is a set of 'optimal' investment weights, one for each share whose returns were fed into the algorithm. This thesis looks at identifying and removing avoidable risk through a statistical robustification of the algorithms, attempting to improve upon the 'optimal' weights they provide. Performance is assessed by comparing the out-of-period results with standard optimisation results, which are highly sensitive and prone to sampling error and extreme weightings. The methodology applies various shrinkage techniques to the historical covariance matrix and then takes a resampling portfolio optimisation approach using the shrunken matrix. 
We use Monte Carlo simulation techniques to replicate sets of statistically equivalent portfolios and find optimal weightings for each; aggregating these weightings then reduces the sensitivity to anomalies in the historical time series. We also consider the trade-off between sampling error and specification error of models. Advanced Analytics and Data Sciences
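The combination of shrinkage and resampling described in this abstract can be sketched for two assets as follows. This is an illustrative sketch only, not the thesis's implementation: the scaled-identity shrinkage target, the fixed shrinkage intensity, the closed-form minimum-variance weights (used here in place of the quadratic programme with expected returns), and all function names are assumptions.

```python
import random

def sample_cov(rows):
    """Sample covariance matrix of a list of equally long return rows."""
    t, n = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / t for j in range(n)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (t - 1)
             for j in range(n)] for i in range(n)]

def shrink(cov, intensity):
    """Convex combination of cov and a scaled-identity target (assumed target)."""
    n = len(cov)
    avg_var = sum(cov[i][i] for i in range(n)) / n
    return [[(1 - intensity) * cov[i][j] + (intensity * avg_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def min_var_weights_2(cov):
    """Fully invested minimum-variance weights for two assets (closed form)."""
    (a, b), (_, c) = cov
    w1 = (c - b) / (a + c - 2 * b)
    return [w1, 1 - w1]

def resampled_weights(rows, intensity=0.3, n_sims=100, seed=0):
    """Average the optimal weights over simulated return histories."""
    rng = random.Random(seed)
    t = len(rows)
    means = [sum(r[j] for r in rows) / t for j in range(2)]
    (a, b), (_, c) = shrink(sample_cov(rows), intensity)
    # Cholesky factor of the 2x2 shrunk covariance, used to draw correlated normals.
    l11 = a ** 0.5
    l21 = b / l11
    l22 = (c - l21 ** 2) ** 0.5
    total = [0.0, 0.0]
    for _ in range(n_sims):
        sim = []
        for _ in range(t):
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            sim.append([means[0] + l11 * z1, means[1] + l21 * z1 + l22 * z2])
        w = min_var_weights_2(shrink(sample_cov(sim), intensity))
        total = [total[0] + w[0], total[1] + w[1]]
    return [x / n_sims for x in total]
```

Averaging weights across simulated histories is what damps the extreme weightings of a single optimisation run; the shrinkage step keeps each simulated covariance matrix well conditioned.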
2	Managing uncertain data over distributed environments / Gestion des données incertaines dans un environnement distribué Benaissa, Adel 02 March 2017 In recent years, data has become increasingly uncertain, as flourishing advanced technologies continuously produce ever larger amounts of uncertain data. 
Many modern applications in which uncertainty occurs are distributed in nature, e.g., distributed sensor networks, information extraction, data integration, social networks, etc. Consequently, even though data uncertainty has been studied in the past for centralized databases, managing uncertainty in distributed databases remains a challenging issue. In this work, we focus on data records that are composed of a set of descriptive attributes which are neither numeric nor inherently ordered in any way, namely categorical data. We propose two approaches to managing uncertain categorical data over distributed environments. These approaches are built upon a hierarchical indexing technique and distributed algorithms to efficiently process queries on uncertain data in distributed environments. In the first approach, we propose a distributed indexing technique based on an inverted index structure for efficiently searching uncertain categorical data over distributed environments. By leveraging this indexing technique, we address two kinds of queries on distributed uncertain databases: (1) a distributed probabilistic threshold query, whose answers satisfy the probability threshold requirement; and (2) a distributed top-k query, optimizing the transfer of tuples from the distributed sources to the coordinator site and the processing time. Experiments are conducted to verify the effectiveness and efficiency of the proposed method in terms of communication costs and response time. The second approach focuses on answering top-k queries and proposes a distributed algorithm, namely TDUD. Its aim is to efficiently answer top-k queries over distributed uncertain categorical data in a single round of communication. For that purpose, we enrich the global uncertain index provided in the first approach with richer summarizing information from the local indexes, and use it to minimize the amount of communication needed to answer a top-k query. 
Moreover, the approach maintains the mean dispersion of the probability distribution at each site; these summaries are merged at the coordinator site, which can then predict how many sites must be queried to obtain the top-k answers and prune sites that contribute none, reducing execution time and tuple transfer. Extensive experiments are conducted to verify the effectiveness and efficiency of the proposed method in terms of communication costs and response time. We show empirically that the algorithm is near-optimal in that it can typically retrieve the top-k query answers by communicating only a small number of tuples in a single round. Sciences de données Data sciences 005.74
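The two query types in this abstract can be illustrated with a small sketch over per-site inverted indexes. This is a simplified baseline under assumed data layouts, not the thesis's hierarchical index or the TDUD algorithm: each uncertain tuple is assumed to be a mapping from categorical value to probability, and all names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(tuples):
    """Map each categorical value to [(tuple_id, probability), ...], highest probability first."""
    index = defaultdict(list)
    for tuple_id, distribution in tuples.items():
        for value, prob in distribution.items():
            index[value].append((tuple_id, prob))
    for postings in index.values():
        postings.sort(key=lambda p: p[1], reverse=True)
    return index

def local_threshold_query(index, value, tau):
    """Tuples at one site whose probability of taking `value` is at least tau."""
    out = []
    for tuple_id, prob in index.get(value, []):
        if prob < tau:
            break  # postings are sorted, so nothing further can qualify
        out.append((tuple_id, prob))
    return out

def distributed_threshold_query(site_indexes, value, tau):
    """Coordinator side: gather qualifying tuples from every site."""
    results = []
    for site, index in site_indexes.items():
        results.extend((site, tid, p) for tid, p in local_threshold_query(index, value, tau))
    return sorted(results, key=lambda r: r[2], reverse=True)

def distributed_top_k(site_indexes, value, k):
    """Naive one-round top-k: every site ships its local top k, the coordinator merges."""
    candidates = []
    for site, index in site_indexes.items():
        candidates.extend((site, tid, p) for tid, p in index.get(value, [])[:k])
    return sorted(candidates, key=lambda r: r[2], reverse=True)[:k]
```

The naive top-k ships up to k tuples from every site; the summaries described in the abstract are aimed at avoiding exactly that cost by skipping sites that cannot contribute.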
3	Learning Compact Architectures for Deep Neural Networks Srinivas, Suraj January 2017 Deep neural networks with millions of parameters are at the heart of many state-of-the-art computer vision models. However, recent works have shown that models with a much smaller number of parameters can often perform just as well. A smaller model has the advantage of being faster to evaluate and easier to store, both of which are crucial for real-time and embedded applications. While prior work on compressing neural networks has looked at methods based on sparsity, quantization and factorization of neural network layers, we look at the alternate approach of pruning neurons. Training neural networks is often described as a kind of 'black magic', as successful training requires setting the right hyper-parameter values (such as the number of neurons in a layer, depth of the network, etc.). It is often not clear what these values should be, and these decisions often end up being either ad hoc or driven by extensive experimentation. It would be desirable to automatically set some of these hyper-parameters for the user so as to minimize trial and error. Combining this objective with our earlier preference for smaller models, we ask the following question: for a given task, is it possible to come up with small neural network architectures automatically? In this thesis, we propose methods to achieve this. The work is divided into four parts. First, given a neural network, we look at the problem of identifying important and unimportant neurons. We look at this problem in a data-free setting, i.e., assuming that the data the neural network was trained on is not available. We propose two rules for identifying wasteful neurons and show that these suffice in such a data-free setting. By removing neurons based on these rules, we are able to reduce model size without significantly affecting accuracy. 
Second, we propose an automated learning procedure to remove neurons during the process of training. We call this procedure 'Architecture-Learning', as it automatically discovers the optimal width and depth of neural networks. We empirically show that this procedure is preferable to trial-and-error based Bayesian Optimization procedures for selecting neural network architectures. Third, we connect 'Architecture-Learning' to a popular regularizer called 'Dropout', and propose a novel regularizer which we call 'Generalized Dropout'. From a Bayesian viewpoint, this method corresponds to a hierarchical extension of the Dropout algorithm. Empirically, we observe that Generalized Dropout corresponds to a more flexible version of Dropout, and works in scenarios where Dropout fails. Finally, we apply our procedure for removing neurons to the problem of removing weights in a neural network, and achieve state-of-the-art results in sparsifying neural networks. Deep Neural Networks Learning Compact Architectures Machine Learning Binary Neural Nets Architecture Learning Sparse Neural Networks Bayesian Neural Networks Neural Network Architectures Computational and Data Sciences
4	Enjeux et place des data sciences dans le champ de la réutilisation secondaire des données massives cliniques : une approche basée sur des cas d’usage / Issues and place of the data sciences for reusing clinical big data : a case-based study Bouzillé, Guillaume 21 June 2019 The dematerialization of health data, which started several years ago, now generates a huge amount of data produced by all actors in health care. 
These data are very heterogeneous and are produced at different scales and in different domains. Their reuse in the context of clinical research, public health or patient care requires appropriate approaches based on methods from data science. The aim of this thesis is to evaluate, through three use cases, the current issues and the place of data sciences in the reuse of massive health data. To meet this objective, the first section sets out the characteristics of health big data and the technical aspects related to their reuse. The second section presents the organizational aspects of the exploitation and sharing of health big data. The third section describes the main methodological approaches in data sciences currently applied in the field of health. The fourth section illustrates, through three use cases, the contribution of these methods in the following fields: syndromic surveillance, pharmacovigilance and clinical research. Finally, we discuss the limits and challenges of data science in the context of reusing health big data. Réutilisation secondaire des données Données massives en santé Sciences des données Surveillance syndromique Recherche clinique Pharmacovigilance Data reuse Health big data Data sciences Syndromic surveillance Clinical research Drug safety
5	Studies on Kernel Based Edge Detection and Hyper Parameter Selection in Image Restoration and Diffuse Optical Image Reconstruction Narayana Swamy, Yamuna January 2017 Computational imaging has been playing an important role in understanding and analysing captured images. Both image segmentation and restoration have been integral parts of computational imaging. The studies performed in this thesis focus on developing novel algorithms for image segmentation and restoration. A study of the Morozov discrepancy principle in diffuse optical imaging is also presented to show that hyper-parameter selection can be performed with ease. The Laplacian of Gaussian (LoG) and Canny operators use Gaussian smoothing before applying the derivative operator for edge detection in real images. The LoG kernel is based on the second derivative and is highly sensitive to noise when compared to the Canny edge detector. A new edge detection kernel, called Helmholtz of Gaussian (HoG), which provides higher diffusivity, is developed in this thesis and is shown to be more robust to noise. The formulation of the HoG kernel is similar to that of LoG. It is also shown, both theoretically and experimentally, that LoG is a special case of HoG. When used as an edge detector, this kernel exhibited superior performance compared to LoG, Canny and wavelet-based edge detectors on the standard test cases in both one and two dimensions. The linear inverse problem encountered in the restoration of blurred noisy images is typically solved via Tikhonov minimization. The outcome (restored image) of such minimization is highly dependent on the choice of regularization parameter. In the absence of prior information about the noise levels in the blurred image, finding this regularization/hyper parameter in an automated way becomes extremely challenging. 
The available methods, such as Generalized Cross Validation (GCV), may not yield optimal results in all cases. A novel method that relies on the minimal residual method to find the regularization parameter automatically is proposed here and systematically compared with the GCV method. It is shown that the proposed method outperforms the GCV method in providing high-quality restored images in cases where the noise levels are high. Diffuse optical tomography uses near infrared (NIR) light as the probing medium to recover the distributions of tissue optical properties, with an ability to provide functional information about the tissue under investigation. As NIR light propagation in tissue is dominated by scattering, the image reconstruction problem (inverse problem) is non-linear and ill-posed, requiring advanced computational methods to compensate for this. An automated method for selection of the regularization/hyper parameter that incorporates the Morozov discrepancy principle (MDP) into the Tikhonov method is proposed and shown to be a promising method for dynamic diffuse optical tomography. Biomedical Optical Imaging Diffuse Optical Tomography Dynamic Diffuse Optical Imaging Medical Imaging Computational Methods Edge Detection Edge Operators Helmholtz of Gaussian [HoG] Laplacian of Gaussian [LoG] Inverse Problems Image Restoration Image Denoising Diffuse Optical Image Reconstruction Medical Imaging Computational and Data Sciences
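The idea of choosing the Tikhonov regularization parameter by the Morozov discrepancy principle can be sketched on a toy linear problem. The sketch below assumes a diagonal forward operator (as obtained after a singular value decomposition) and a known noise level delta; it is not the thesis's method for the non-linear diffuse optical tomography problem, and the function names are hypothetical.

```python
import math

def tikhonov_diag(s, b, lam):
    """Minimiser of ||S x - b||^2 + lam * ||x||^2 for a diagonal operator S = diag(s)."""
    return [si * bi / (si * si + lam) for si, bi in zip(s, b)]

def residual_norm(s, b, lam):
    """Norm of S x - b at the Tikhonov solution; it increases monotonically with lam."""
    x = tikhonov_diag(s, b, lam)
    return math.sqrt(sum((si * xi - bi) ** 2 for si, xi, bi in zip(s, x, b)))

def discrepancy_lambda(s, b, delta, lo=1e-12, hi=1e6, iters=200):
    """Bisection on a log scale for the lam whose residual norm equals delta.

    Assumes delta lies between the residual norms at lo and hi.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if residual_norm(s, b, mid) < delta:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

The principle selects the largest amount of smoothing that still fits the data to within the noise level, which is why an estimate of delta is the one input it needs.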
6	Visual Flow Analysis and Saliency Prediction Srinivas, Kruthiventi S S January 2016 Nowadays, millions of cameras in public places such as traffic junctions, railway stations, etc., capture video data round the clock. This huge volume of data has resulted in an increased need for automation of visual surveillance. Analysis of crowd and traffic flows is an important step towards achieving this goal. In this work, we present our algorithms for identifying and segmenting dominant flows in surveillance scenarios. In the second part, we present our work on predicting visual saliency. The ability of humans to discriminate and selectively pay attention to a few regions in the scene over the others is a key attentional mechanism. Here, we present our algorithms for predicting human eye fixations and segmenting salient objects in the scene. (i) Flow Analysis in Surveillance Videos: We propose algorithms for segmenting flows of static and dynamic nature in surveillance videos in an unsupervised manner. In static flow scenarios, we assume the motion patterns to be consistent over the entire duration of the video and analyze them in the compressed domain using H.264 motion vectors. Our approach is based on modeling the motion vector field as a Conditional Random Field (CRF) and obtaining oriented motion segments which are merged to obtain the final flow segments. This approach in the compressed domain is shown to be both accurate and computationally efficient. In the case of dynamic flow videos (e.g. flows at a traffic junction), we propose a method for segmenting the individual object flows over long durations. This long-term flow segmentation is achieved in the framework of CRF using local color and motion features. We propose a Dynamic Time Warping (DTW) based distance measure between flow segments for clustering them and generating representative dominant flow models. 
Using these dominant flow models, we perform path prediction for the vehicles entering the camera's field-of-view and detect anomalous motions. (ii) Visual Saliency Prediction using Deep Convolutional Neural Networks: We propose a deep fully convolutional neural network (CNN), DeepFix, for accurately predicting eye fixations in the form of saliency maps. Unlike classical works which characterize the saliency map using various hand-crafted features, our model automatically learns features in a hierarchical fashion and predicts the saliency map in an end-to-end manner. DeepFix is designed to capture visual semantics at multiple scales while taking global context into account. Generally, fully convolutional nets are spatially invariant, which prevents them from modeling location-dependent patterns (e.g. centre-bias). Our network overcomes this limitation by incorporating a novel Location Biased Convolutional layer. We experimentally show that our network outperforms other recent approaches by a significant margin. In general, human eye fixations correlate with locations of salient objects in the scene. However, only a handful of approaches have attempted to simultaneously address these related aspects of eye fixations and object saliency. In our work, we also propose a deep convolutional network capable of simultaneously predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both tasks, such that they capture the global contextual aspects of saliency, while the deeper layers of the network address task-specific aspects. Our network shows a significant improvement over the current state-of-the-art for both eye fixation prediction and salient object segmentation across a number of challenging datasets. 
Visual Flow Analysis Saliency Prediction Visual Saliency Static Flow Analysis Surveillance Videos Dynamic Flow Analysis DeepFix Convolutional Network Eye Fixation Prediction Salient Object Segmentation Convolutional Neural Networks (CNNs) Saliency Unified Computational and Data Sciences
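The Dynamic Time Warping distance mentioned in this abstract can be sketched with the textbook recurrence. The sketch assumes each flow segment is a sequence of 2-D points compared by Euclidean distance; the thesis's actual features and any normalisation are not stated in the abstract, so those choices are assumptions here.

```python
import math

def dtw_distance(path_a, path_b):
    """Classic DTW cost between two point sequences, using Euclidean point distance."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of the first i points of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because points may be repeated in the alignment, two trajectories that follow the same route at different speeds get a distance near zero, which is the property that makes DTW suitable for clustering flows.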
7	Balancing Money and Time for OLAP Queries on Cloud Databases Sabih, Rafia January 2016 Enterprise Database Management Systems (DBMSs) have to contend with resource-intensive and time-varying workloads, making them well-suited candidates for migration to cloud platforms: specifically, they can dynamically leverage the resource elasticity while retaining affordability through the pay-as-you-go rental interface. The current design of database engine components lays emphasis on maximizing computing efficiency, but to fully capitalize on the cloud's benefits, the outlays of these computations also need to be factored into the planning exercise. In this thesis, we investigate this contemporary problem in the context of industrial-strength deployments of relational database systems on real-world cloud platforms. Specifically, we consider how the traditional metric used to compare query execution plans, namely response-time, can be augmented to incorporate monetary costs in the decision process. The challenge here is that execution-time and monetary costs are adversarial metrics, with a decrease in one entailing a rise in the other. For instance, a Virtual Machine (VM) with rich physical resources (RAM, cores, etc.) decreases the query response-time, but is expensive with regard to rental rates. In a nutshell, there is a tradeoff between money and time, and our goal therefore is to identify the VM that offers the best tradeoff between these two competing considerations. In our study, we profile the behavior of money versus time for a given query, and define the best tradeoff as the "knee", that is, the location on the profile with the minimum Euclidean distance from the origin. To study the performance of industrial-strength database engines on real-world cloud infrastructure, we have deployed a commercial DBMS on Google cloud services. 
On this platform, we have carried out extensive experimentation with the TPC-DS decision-support benchmark, an industry-wide standard for evaluating database system performance. Our experiments demonstrate that the choice of VM for hosting the database server is a crucial decision, because: (i) variation in time and money across VMs is significant for a given query, and (ii) no one VM offers the best money-time tradeoff across all queries. To efficiently identify the VM with the best tradeoff from a large suite of available configurations, we propose a technique to characterize the money-time profile for a given query. The core of this technique is a VM pruning mechanism that exploits the fact that the VMs form a partially ordered set (poset) on their resources. It processes the minimal and maximal VMs of this poset for estimated query response-time. If the response-times on these extreme VMs are similar, then all the VMs sandwiched between them are pruned from further consideration. Otherwise, the already processed VMs are set aside, and the minimal and maximal VMs of the remaining unprocessed VMs are evaluated for their response-times. Finally, the knee VM is identified from the processed VMs as the one with the minimum Euclidean distance from the origin in the money-time space. We theoretically prove that this technique always identifies the knee VM; further, if it is acceptable to find a "near-optimal" knee by providing a relaxation-factor on the response-time distance from the optimal knee, then it is also capable of finding a satisfactory knee more efficiently under these relaxed conditions. We propose two flavors of this approach: the first prunes the VMs using complete plan information received from the database engine API, and is named Plan-based Identification of Knee (PIK). 
To further increase the efficiency of identifying the knee VM, we propose a second, sub-plan based pruning algorithm called Sub-Plan-based Identification of Knee (SPIK), which requires modifications in the query optimizer. We have evaluated PIK on a commercial system and found that it often requires processing only 20% of the total VMs. The efficiency of the algorithm is increased significantly further by using a 10-20% relaxation in response-time. For evaluating SPIK, we prototyped it on an open-source engine, PostgreSQL 9.3, and also implemented it as a Java wrapper program with the commercial engine. Experimentally, the processing done by SPIK is found to be only 40% of that of the PIK approach. Therefore, from an overall perspective, this thesis facilitates the desired migration of enterprise databases to cloud platforms, by identifying the VM(s) that offer competitive tradeoffs between money and time for the given query. Database Management System (DBMS) Virtual Machine Google Cloud Services Cloud Platforms Cloud Databases Cloud Query Processing Model Plan-based Identification of Knee (PIK) Knee VM Computational and Data Sciences
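The knee definition and the poset extremes used by the pruning step can be sketched as follows. This is a minimal illustration, not the PIK or SPIK algorithm: the VM names, the (cores, RAM) resource tuples, and the (money, time) profile values are hypothetical, and money and time are assumed to be expressed in comparable units before distances are taken.

```python
import math

def knee_vm(profile):
    """profile: {vm_name: (money, time)}; returns the VM closest to the origin."""
    return min(profile, key=lambda vm: math.hypot(*profile[vm]))

def dominates(a, b):
    """True if resource tuple a has at least as much of every resource as b."""
    return all(x >= y for x, y in zip(a, b))

def minimal_and_maximal(resources):
    """Minimal and maximal VMs of the partial order induced by `dominates`."""
    names = list(resources)
    minimal = [v for v in names
               if not any(o != v and resources[o] != resources[v]
                          and dominates(resources[v], resources[o]) for o in names)]
    maximal = [v for v in names
               if not any(o != v and resources[o] != resources[v]
                          and dominates(resources[o], resources[v]) for o in names)]
    return minimal, maximal
```

In the pruning scheme described in the abstract, only the minimal and maximal VMs are costed first; if their estimated response-times are similar, the VMs lying between them in the partial order need not be evaluated.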
