61 |
Evaluating the Performance of Propensity Scores to Address Selection Bias in a Multilevel Context: A Monte Carlo Simulation Study and Application Using a National Dataset
Lingle, Jeremy Andrew 16 October 2009
When researchers are unable to randomly assign students to treatment conditions, selection bias is introduced into the estimates of treatment effects. Random assignment to treatment conditions, which has historically been the scientific benchmark for causal inference, is often impossible or unethical to implement in educational systems. For example, researchers cannot deny services to those who stand to gain from participation in an academic program. Additionally, students select into a particular treatment group through processes that are impossible to control, such as those that result in a child dropping out of high school or attending a resource-starved school. Propensity score methods provide valuable tools for removing selection bias from quasi-experimental research designs and observational studies by modeling the treatment assignment mechanism. The utility of propensity scores for removing selection bias has been validated when the observations are assumed to be independent; however, their ability to remove selection bias in a multilevel context, in which group membership plays a role in the treatment assignment, is relatively unknown. A central purpose of the current study was to begin filling the gaps in knowledge regarding the performance of propensity scores for removing selection bias, as defined by covariate balance, in multilevel settings using a Monte Carlo simulation study. The performance of propensity scores was also examined using a large-scale national dataset. Results from this study support the conclusion that the multilevel characteristics of a sample affect how well propensity scores balance covariates between treatment and control groups.
Findings suggest that propensity score estimation models should take into account the cluster-level effects when working with multilevel data; however, the numbers of treatment and control group individuals within each cluster must be sufficiently large to allow estimation of those effects. Propensity scores that take into account the cluster-level effects can have the added benefit of balancing covariates within each cluster as well as across the sample as a whole.
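The abstract defines selection bias removal in terms of covariate balance, both across the whole sample and within each cluster. One common way to quantify balance is the standardized mean difference; the sketch below is a minimal illustration under that assumption, not the dissertation's own procedure. The function names, the row layout (`cluster`, `treated`), and the toy data are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """(mean_t - mean_c) divided by the pooled SD, sqrt((var_t + var_c) / 2)."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        raise ValueError("covariate has no variation in either group")
    return (mean(treated) - mean(control)) / pooled_sd


def balance_by_cluster(rows, covariate):
    """Compute the balance statistic separately inside each cluster.

    rows: list of dicts with keys 'cluster', 'treated' (bool), and the covariate.
    Clusters with fewer than two treated or two control units are skipped,
    because a sample variance cannot be computed for them.
    """
    clusters = {}
    for row in rows:
        groups = clusters.setdefault(row["cluster"], {True: [], False: []})
        groups[row["treated"]].append(row[covariate])
    return {
        name: standardized_mean_difference(groups[True], groups[False])
        for name, groups in clusters.items()
        if len(groups[True]) >= 2 and len(groups[False]) >= 2
    }


# Hypothetical toy data: cluster "b" has a single treated unit and is skipped.
ROWS = (
    [{"cluster": "a", "treated": True, "score": s} for s in (2, 4, 6)]
    + [{"cluster": "a", "treated": False, "score": s} for s in (1, 3, 5)]
    + [{"cluster": "b", "treated": True, "score": 1}]
    + [{"cluster": "b", "treated": False, "score": s} for s in (2, 3)]
)
```

The skipped cluster mirrors the abstract's caveat: within-cluster quantities can only be estimated when each cluster holds enough treated and control individuals.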
|
62 |
Novel Image Representations and Learning Tasks
January 2017
abstract: Computer vision as a field has gone through significant changes in the last decade. The field has seen tremendous success in designing learning systems with hand-crafted features and in using representation learning to extract better features. In this dissertation some novel approaches to representation learning and task learning are studied.
Multiple-instance learning, which is a generalization of supervised learning, is one example of task learning that is discussed. In particular, a novel non-parametric k-NN-based multiple-instance learning method is proposed, which is shown to outperform other existing approaches. This solution is applied effectively to a diabetic retinopathy pathology detection problem.
For representation learning, the generality of neural features is investigated first. This investigation leads to some critical understanding of, and results on, feature generality among datasets. The possibility of learning from a mentor network instead of from labels is then investigated. Distillation of dark knowledge is used to efficiently mentor a small network from a pre-trained large mentor network. These studies help in understanding representation learning with smaller and compressed networks. / Dissertation/Thesis / Doctoral Dissertation Computer Science 2017
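The idea of mentoring a small network from a large one, rather than from labels, can be illustrated with the widely used temperature-softened formulation of knowledge distillation. The sketch below assumes that formulation; the abstract does not say which loss the dissertation uses, and the function names and default temperature are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher values flatten the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(mentor_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened mentor and student distributions."""
    mentor_probs = softmax(mentor_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(mentor_probs, student_probs))
```

For a fixed mentor output, this loss is smallest when the student reproduces the mentor's softened distribution, which is how the small network picks up the relative class similarities that hard labels discard.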
|
63 |
Metamorphic malware identification through Annotated Data Dependency Graphs' datasets indexing
Aguilera, Luis Miguel Rojas 23 March 2018
Previous issue date: 2018-03-23 / CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior / Code mutation and metamorphism have been successfully employed to create and proliferate new malware instances from existing malicious code. These techniques modify a program's structure without altering its original functionality, so new samples lack the structural and behavioral patterns stored in the knowledge bases of malware identification systems, which hinders their detection. Previous research on metamorphic malware detection falls into two categories: identification through code signature matching and detection based on classification models. Code signature matching yields lower false positive rates than classification models, because such structures are resilient to the effects of metamorphism and allow better discrimination among instances; however, the time complexity of the matching algorithms prevents their use in real detection systems. Classification-based detection, on the other hand, has lower algorithmic complexity, but the models' generalization capacity suffers from the variety of patterns that metamorphism techniques can produce. To overcome these limitations, this work presents methods for identifying metamorphic malware by matching annotated data dependency graphs extracted from known malware and from suspicious instances at analysis time. To cope with the complexity of the comparison algorithms and make the methods usable in real detection systems, the graph databases were indexed using machine learning algorithms, resulting in multiclass classification models that discriminate among malware families based on structural features of the graphs.
Experimental results with a prototype of the proposed methods, using a database of 40,785 graphs extracted from 4,530 malware instances, showed detection times below 150 seconds for all instances and model creation times below 10 minutes, as well as higher average accuracy than most of the 56 commercial malware detection systems evaluated.
|
64 |
Video Data Collection for Continuous Identity Assurance
Venkatesan, Janani 27 June 2016
Frequently monitoring the identity of a person connected to a secure system is an important component of a cyber-security system. Identity Assurance (IA) mechanisms, which continuously confirm and verify users' identity after the initial authentication process, ensure integrity and security. Such systems prevent unauthorized access and eliminate the need for an authorized user to present credentials repeatedly for verification. Very few cyber-security systems deploy such IA modules. These IA modules are typically based on computer vision and machine learning algorithms, which work effectively when trained with representative datasets. This thesis describes our effort to collect a small dataset of multi-view videos of typical work sessions of several subjects, to serve as a resource for other researchers of IA algorithms to evaluate and compare the performance of their algorithms with those of others. We also present a Proof of Concept (POC) face matching algorithm and experimental results with this POC implementation for a subset of the collected dataset.
|
65 |
From surveys to surveillance strategies: a case study of life satisfaction
Yang, Chao 01 May 2015
Social media surveillance is becoming increasingly popular. However, current surveillance methods do not make use of well-respected surveys that were established over many decades in domains outside computer science. In addition, previous social media surveillance methods have not been evaluated sufficiently, especially for surveillance of happiness. These gaps motivated us to develop a general computational methodology for translating a well-known survey into a social media surveillance strategy, so that traditional surveys can be used to broaden social media surveillance. The methodology could bridge domains such as psychology and social science with computer science. We use life satisfaction on social media as a case study to illustrate our survey-to-surveillance methodology. We start with a well-known life satisfaction survey and expand its statements to generate templates. We then use the templates to build queries in our information retrieval system, which retrieves social media posts that can be considered valid responses to the original survey. Filters are used to boost the performance of the retrieval system.
To evaluate our surveillance method, we developed a novel method for building a gold standard dataset. Instead of labeling every data instance, as is traditionally done, we ask human workers to "find" as many of the positives as possible in the dataset; the rest are assumed to be negatives. We used this method to build the gold standard dataset for the life satisfaction case study, and built three more gold standard datasets to further demonstrate its value. Using the life satisfaction gold standard dataset, we show that our surveillance method outperforms other popular methods (lexicon-based and machine-learning-based) used by previous researchers.
Using our surveillance method, we carried out a comprehensive analysis of life satisfaction expressions on Twitter. We show the time series and the daily and weekly cycles of life satisfaction on social media, and we also find differences in the characteristics of users with different life satisfaction expressions, including psychosocial features such as anxiety, anger, and depression. In addition, we present the geographic distribution of life satisfaction across the U.S. and in places around the world. This thesis is the first to systematically explore life satisfaction expressions on Twitter, using computational methods derived from an established survey on life satisfaction.
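The survey-to-surveillance pipeline described above (survey statement, then templates, then retrieval queries) can be sketched as a slot-filling expansion followed by a simple phrase match. This is only an illustration of the general idea: the template, the slot values, and both function names are hypothetical, and the thesis's actual templates, retrieval system, and filters are not specified in the abstract.

```python
from itertools import product


def expand_template(template, slots):
    """Fill every combination of slot values into a template string."""
    names = list(slots)
    return [
        template.format(**dict(zip(names, combo)))
        for combo in product(*(slots[name] for name in names))
    ]


def matches_any(post, queries):
    """Return True if the post contains any query phrase (case-insensitive)."""
    text = post.lower()
    return any(query.lower() in text for query in queries)


# Hypothetical template and slot values, for illustration only.
QUERIES = expand_template(
    "I am {feeling} with my {target}",
    {"feeling": ["satisfied", "happy"], "target": ["life"]},
)
```

A plain substring match like this would retrieve many false positives (negations, quotations, jokes), which is presumably the role of the filters the abstract mentions.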
|
66 |
A Methodology of Dataset Generation for Secondary Use of Health Care Big Data / 保健医療ビックデータの二次利用におけるデータセット生成に関する方法論
Iwao, Tomohide 23 March 2020
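Kyoto University / 0048 / Doctorate by coursework (new system) / Doctor of Informatics / Degree No. Kō 22575 / Informatics Doctorate No. 712 / 新制||情||122 (University Library) / Department of Social Informatics, Graduate School of Informatics, Kyoto University / Examiners: Professor Tomohiro Kuroda (chair), Professor Kazuyuki Moriya, Professor Masatoshi Yoshikawa / Conferred under Article 4, Paragraph 1 of the Degree Regulations / DFAM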
|
67 |
Logging, Visualization, and Analysis of Network and Power Data of IoT Devices
Nguyen, Neal Huynh 01 December 2018
There are approximately 23.14 billion IoT (Internet of Things) devices currently in use worldwide. This number is projected to grow to over 75 billion by 2025. Despite their ubiquity, little is known about the security and privacy implications of IoT devices. Several large-scale attacks against IoT devices have already been recorded.
To help address this knowledge gap, we have collected a year's worth of network traffic and power data from 16 common IoT devices. From this data, we show that we can identify different smart speakers, such as the Echo Dot, by analyzing one minute of power data on a shared power line.
|
68 |
Analýza vlivu trénovací datové sady na úspěšnost segmentace / Analysis of training dataset influence on the efficiency of segmentation
Benešovská, Veronika January 2021
Microbial structures are present in every living organism, so it is important to classify them for subsequent research into their origin and function. For this purpose, Bruker s.r.o. is developing the MBT Pathfinder, which automates the transfer of colonies to MALDI plates, where the subsequent analysis of the sample takes place. The colonies to transfer can be selected manually or by an algorithm that performs automatic colony segmentation. This algorithm must be trained on a training set, which has a large influence on its accuracy. This work measures the influence of the dataset on the accuracy of this learning algorithm.
|
69 |
Detekce a klasifikace létajících objektů / Detection and classification of flying objects
Jurečka, Tomáš January 2021
The thesis deals with the detection and classification of flying objects. The work can be divided into three parts. The first part describes the creation of a dataset of flying objects, which is built using reverse image search. The next part surveys algorithms for detection, tracking, and classification; the individual algorithms are then applied and evaluated. The last part presents the design of the hardware components.
|
70 |
Zjednodušení přístupu k propojeným datům pomocí tabulkových pohledů / Simplifying access to linked data using tabular views
Jareš, Antonín January 2021
The goal of this thesis is to design and implement a front-end application that allows users to create and manage custom views for arbitrary linked data endpoints. Such views will be executable against a predefined SPARQL endpoint, and users will be able to retrieve and download the requested data in CSV format. Users will also be able to share these views and store them in Solid Pods. Experienced SPARQL users will be able to customize the query manually. To achieve these goals, the system uses freely available technologies: HTML, JavaScript (namely the React framework), and CSS.
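The step of turning a SPARQL query result into a downloadable CSV table can be sketched as follows. The thesis implements its application in JavaScript with React; the Python below is only a language-neutral illustration, and the function name and sample data are hypothetical. The input layout (`head.vars` and `results.bindings`) follows the W3C SPARQL 1.1 Query Results JSON Format.

```python
import csv
import io


def sparql_json_to_csv(result):
    """Flatten a SPARQL JSON result document into CSV text.

    Variables that are unbound in a given solution become empty cells.
    """
    columns = result["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for binding in result["results"]["bindings"]:
        writer.writerow([binding.get(col, {}).get("value", "") for col in columns])
    return buffer.getvalue()


# Hypothetical result document with one unbound variable in the second row.
SAMPLE = {
    "head": {"vars": ["name", "homepage"]},
    "results": {
        "bindings": [
            {
                "name": {"type": "literal", "value": "Alice"},
                "homepage": {"type": "uri", "value": "http://example.org/alice"},
            },
            {"name": {"type": "literal", "value": "Bob, Jr."}},
        ]
    },
}
```

Many SPARQL endpoints can also return CSV directly through content negotiation, so a client-side conversion like this is mainly useful when the application already works with the JSON form.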
|