61

Explorative Suchstrategien am Beispiel von flickr.com

Wenke, Birgit, Lechner, Ulrike January 2009 (has links)
No description available.
62

Jenseits der Suchmaschinen: Konzeption einer iterativen Informationssuche in Blogs

Franke, Ingmar S., Taranko, Severin, Wessel, Hans January 2009 (has links)
No description available.
63

Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

Eisinger, Daniel 07 October 2013 (has links)
The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or does not exist at all. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists, on the other hand, do not have access to such resources and therefore often do not search patents at all, risking missing up-to-date information that will not appear in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) to improve recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the classification system, which is usually not the case for academic scientists. Consequently, we investigate methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, these assignments should be usable in a similar way to the MeSH annotations in PubMed. Developing a system for this task requires a good understanding of the properties of both classification systems, so we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures and the properties of the terms and classes, respectively, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences are the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of class assignments for the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems for both inexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant to their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, limiting the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search.
For the automated assignment of additional patent classes, we adapt to the patent domain an approach that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. Our evaluation shows good performance of individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user's initial query. Our methods use both the keywords and the class codes that the user enters to retrieve additional relevant keywords and classes, which are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies, and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete queries faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.
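As a minimal illustration of the multi-label setup described above, the sketch below trains one binary maximum-entropy (logistic regression) classifier per IPC class over TF-IDF features. The patent snippets, class codes, and the absence of any filtering step are illustrative assumptions, not details taken from the thesis.

```python
# Hedged sketch: one binary maximum-entropy (logistic regression) classifier per IPC
# class, in the spirit of the per-class classifiers described above.
# The example patents and IPC codes below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

patents = [
    "method for purifying recombinant proteins using affinity chromatography",
    "data processing apparatus for wireless signal transmission",
    "pharmaceutical composition comprising a monoclonal antibody",
]
ipc_codes = [["C07K"], ["H04L"], ["A61K", "C07K"]]  # hypothetical class assignments

binarizer = MultiLabelBinarizer().fit(ipc_codes)
Y = binarizer.transform(ipc_codes)          # multi-label indicator matrix
X = TfidfVectorizer().fit_transform(patents)

# One independent binary classifier per IPC class (multi-label setting).
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Per-class probabilities for the first document; these would be thresholded or
# filtered before suggesting additional classes to a searcher.
print(dict(zip(binarizer.classes_, model.predict_proba(X[:1]).ravel().round(2))))
```

In practice, per-class probability thresholds or the filtering methods proposed in the thesis would be applied before suggesting additional classes to the user.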
64

Search for the Standard Model Higgs boson in the dimuon decay channel with the ATLAS detector

Rudolph, Christian 09 December 2014 (has links)
Die Suche nach dem Higgs-Boson des Standardmodells der Teilchenphysik stellte einen der Hauptgründe für den Bau des Large Hadron Colliders (LHC) dar, des derzeit größten Teilchenphysik-Experiments der Welt. Die vorliegende Arbeit ist gleichfalls von dieser Suche getrieben. Der direkte Zerfall des Higgs-Bosons in Myonen wird untersucht. Dieser Kanal hat mehrere Vorteile. Zum einen ist der Endzustand, bestehend aus zwei Myonen unterschiedlicher Ladung, leicht nachzuweisen und besitzt eine klare Signatur. Weiterhin ist die Massenauflösung hervorragend, sodass eine gegebenenfalls vorhandene Resonanz gleich in ihrer grundlegenden Eigenschaft, ihrer Masse, bestimmt werden kann. Leider ist der Zerfall des Higgs-Bosons in ein Paar von Myonen sehr selten. Lediglich etwa 2 von 10000 erzeugten Higgs-Bosonen zeigen diesen Endzustand. Außerdem existiert mit dem Standardmodellprozess Z/γ∗ → μμ ein Zerfall mit einer sehr ähnlichen Signatur, jedoch um Größenordnungen höherer Eintrittswahrscheinlichkeit. Auf ein entstandenes Higgs-Boson kommen so etwa 1,5 Millionen Z-Bosonen, welche am LHC bei einer Schwerpunktsenergie von 8 TeV produziert werden. In dieser Arbeit werden zwei eng miteinander verwandte Analysen präsentiert. Zum einen handelt es sich hierbei um die Untersuchung des Datensatzes von Proton-Proton-Kollisionen bei einer Schwerpunktsenergie von 8 TeV, aufgezeichnet vom ATLAS-Detektor im Jahre 2012, auch als alleinstehende Analyse bezeichnet. Zum anderen erfolgt die Präsentation der kombinierten Analyse des kompletten Run-I-Datensatzes, welcher aus Aufzeichnungen von Proton-Proton-Kollisionen der Jahre 2011 und 2012 bei Schwerpunktsenergien von 7 TeV bzw. 8 TeV besteht. In beiden Fällen wird die Verteilung der invarianten Myon-Myon-Masse nach einer schmalen Resonanzsignatur auf der kontinuierlichen Untergrundverteilung hin untersucht. Dabei dienen die theoretisch erwartete Massenverteilung sowie die Massenauflösung des ATLAS-Detektors als Grundlage, um analytische Parametrisierungen der Signal- und Untergrundverteilungen zu entwickeln. Auf diese Art wird der Einfluss systematischer Unsicherheiten auf Grund von ungenauer Beschreibung der Spektren in Monte-Carlo-Simulationen verringert. Verbleibende systematische Unsicherheiten auf die Signalakzeptanz werden auf eine neuartige Weise bestimmt. Zusätzlich wird ein bisher einzigartiger Ansatz verfolgt, um die systematische Unsicherheit, die aus der Wahl der Untergrundparametrisierung resultiert, in der kombinierten Analyse zu bestimmen. Zum ersten Mal wird dabei die Methode des scheinbaren Signals auf einem simulierten Untergrunddatensatz auf Generator-Niveau angewendet, was eine Bestimmung des Einflusses des Untergrundmodells auf die Anzahl der ermittelten Signalereignisse mit nie dagewesener Präzision ermöglicht. In keiner der durchgeführten Analysen konnte ein signifikanter Überschuss im invarianten Massenspektrum des Myon-Myon-Systems nachgewiesen werden, sodass obere Ausschlussgrenzen auf die Signalstärke μ = σ/σ(SM) in Abhängigkeit von der Higgs-Boson-Masse gesetzt werden. Dabei sind Stärken von μ ≥ 10,13 bzw. μ ≥ 7,05 mit einem Konfidenzniveau von 95% durch die alleinstehende bzw. kombinierte Analyse ausgeschlossen, jeweils für eine Higgs-Boson-Masse von 125,5 GeV. Die erzielten Ergebnisse werden ebenfalls im Hinblick auf die kürzlich erfolgte Entdeckung des neuen Teilchens interpretiert, dessen Eigenschaften mit den Vorhersagen eines Standardmodell-Higgs-Bosons mit einer Masse von etwa 125,5 GeV kompatibel sind.
Dabei werden obere Grenzen auf das Verzweigungsverhältnis von BR(H → μμ) ≤ 1,3 × 10^−3 und auf die Yukawa-Kopplung des Myons von λμ ≤ 1,6 × 10^−3 gesetzt, jeweils mit einem Konfidenzniveau von 95%. / The search for the Standard Model Higgs boson was one of the key motivations to build the world’s largest particle physics experiment to date, the Large Hadron Collider (LHC). This thesis is equally driven by this search, and it investigates the direct muonic decay of the Higgs boson. The decay into muons has several advantages: it provides a very clear final state with two muons of opposite charge, which can easily be detected. In addition, the muonic final state has an excellent mass resolution, such that an observed resonance can be pinned down in one of its key properties: its mass. Unfortunately, the decay of a Standard Model Higgs boson into a pair of muons is very rare: only about two out of 10000 Higgs bosons are predicted to exhibit this decay. On top of that, the non-resonant Standard Model background arising from the Z/γ∗ → μμ process has a very similar signature while possessing a much higher cross-section. For each produced Higgs boson, approximately 1.5 million Z bosons are produced at the LHC at a centre-of-mass energy of 8 TeV. Two related analyses are presented in this thesis: the investigation of 20.7 fb^−1 of the proton-proton collision dataset recorded by the ATLAS detector in 2012, referred to as the standalone analysis, and the combined analysis, a search in the full Run-I dataset consisting of proton-proton collision data recorded in 2011 and 2012 and corresponding to an integrated luminosity of L = 24.8 fb^−1. In each case, the dimuon invariant mass spectrum is examined for a narrow resonance on top of the continuous background distribution. The dimuon phenomenology and the ATLAS detector performance serve as the foundations to develop analytical models describing the spectra. Using these analytical parametrisations for the signal and background mass distributions, the sensitivity of the analyses to systematic uncertainties due to Monte-Carlo simulation mismodeling is minimised. Residual systematic uncertainties on the signal acceptance are determined in a novel way. In addition, a new approach to assess the systematic uncertainty associated with the choice of the background model is designed for the combined analysis. For the first time, the spurious signal technique is performed on generator-level simulated background samples, which allows for a precise determination of the background fit bias. No statistically significant excess in the dimuon invariant mass spectrum is observed in either analysis, and upper limits are set on the signal strength μ = σ/σ(SM) as a function of the Higgs boson mass. Signal strengths of μ ≥ 10.13 and μ ≥ 7.05 are excluded for a Higgs boson mass of 125.5 GeV with a confidence level of 95% by the standalone and combined analysis, respectively.
In light of the discovery of a particle consistent with the predictions for a Standard Model Higgs boson with a mass of mH = 125.5 GeV, the search results are reinterpreted for this special case, setting upper limits on the Higgs boson branching ratio of BR(H → μμ) ≤ 1.3 × 10^−3 and on the muon Yukawa coupling of λμ ≤ 1.6 × 10^−3, both with a confidence level of 95%. Table of contents: 1. Introduction, 2. Theoretical Foundations, 3. Experimental Setup, 4. Event Simulation, 5. Muon Reconstruction and Identification, 6. Event Selection, 7. Signal and Background Modeling, 8. Systematic Uncertainties, 9. Statistical Methods, 10. Results, 11. Summary and Outlook.
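For orientation, the block below spells out the limit-setting quantities quoted in this abstract, assuming the standard CLs prescription commonly used in LHC Higgs searches; the test statistic q_mu is our notation and an assumption on our part, not a detail taken from the thesis.

```latex
% Signal strength and (assumed) CLs exclusion criterion at 95% confidence level
\mu = \frac{\sigma}{\sigma_{\mathrm{SM}}}, \qquad
\mathrm{CL}_s(\mu) =
  \frac{P\!\left(q_\mu \ge q_\mu^{\mathrm{obs}} \,\middle|\, \mu\, s + b\right)}
       {P\!\left(q_\mu \ge q_\mu^{\mathrm{obs}} \,\middle|\, b\right)},
\qquad \text{$\mu$ is excluded at 95\% CL if } \mathrm{CL}_s(\mu) < 0.05 .
```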
65

Search for Heavy Neutral Higgs Bosons in the tau+tau- Final State in LHC Proton-Proton Collisions at sqrt{s}=13 TeV with the ATLAS Detector

Hauswald, Lorenz 12 May 2017 (has links)
There are experimental and theoretical indications that the Standard Model of particle physics, although tremendously successful, is not sufficient to describe the universe, even at energies well below the Planck scale. One of the most promising new theories to resolve major open questions, the Minimal Supersymmetric Standard Model (MSSM), predicts additional neutral and charged Higgs bosons, among other new particles. For the search for the new heavy neutral Higgs bosons, the decay into two hadronically decaying tau leptons is especially interesting, as in large parts of the search parameter space it has the second-largest branching ratio while allowing for considerably better background rejection than the leading decay into b-quark pairs. This search, based on proton-proton collisions recorded at sqrt(s) = 13 TeV in 2015 and early 2016 by the ATLAS experiment at the Large Hadron Collider at CERN, is presented in this thesis. No significant deviation from the Standard Model expectation is observed, and CLs exclusion limits are determined, both model-independent and in various MSSM benchmark scenarios. The MSSM exclusion limits are significantly stronger than those of previous searches, due to the increased collision energy and improvements in the event selection and background estimation techniques. The upper limit on tan beta at 95% confidence level in the mhmod+ MSSM benchmark scenario ranges from 10 at mA = 300 GeV to 48 at mA = 1.2 TeV.
66

Search for neutral MSSM Higgs bosons in the fully hadronic di-tau decay channel with the ATLAS detector

Wahrmund, Sebastian 23 June 2017 (has links)
The search for additional heavy neutral Higgs bosons predicted in Minimal Supersymmetric Extensions of the Standard Model is presented, using the direct decay channel into two tau leptons which themselves decay hadronically. The study is based on proton-proton collisions recorded in 2011 at a center-of-mass energy of 7 TeV with the ATLAS detector at the Large Hadron Collider at CERN. With a sample size corresponding to an integrated luminosity of 4.5 fb−1, no significant excess above the expected Standard Model background prediction is observed, and CLs exclusion limits at a 95% confidence level are evaluated for values of the CP-odd Higgs boson mass mA between 140 GeV and 800 GeV within the context of the mhmax and mhmod± benchmark scenarios. The results are combined with searches for neutral Higgs bosons performed using proton-proton collisions at a center-of-mass energy of 8 TeV recorded with the ATLAS detector in 2012, with a corresponding integrated luminosity of 19.5 fb−1. The combination improves the exclusion limits by roughly 1 to 3 units in tan β. Within the context of this study, the structure of additional interactions during a single proton-proton collision (the “underlying event”) in di-jet final states is analyzed using collision data at a center-of-mass energy of 7 TeV recorded with the ATLAS detector in 2010, with a corresponding integrated luminosity of 37 pb−1. The contribution of the underlying event is measured up to an energy scale of 800 GeV and compared to the predictions of various models. For several models, significant deviations from the measurements are found, and the results are provided for the optimization of simulation algorithms.
67

Datenzentrierte Bestimmung von Assoziationsregeln in parallelen Datenbankarchitekturen

Legler, Thomas 22 June 2009 (has links)
Die folgende Arbeit befasst sich mit der Alltagstauglichkeit moderner Massendatenverarbeitung, insbesondere mit dem Problem der Assoziationsregelanalyse. Vorhandene Datenmengen wachsen stark an, aber deren Auswertung ist für ungeübte Anwender schwierig. Daher verzichten Unternehmen auf Informationen, welche prinzipiell vorhanden sind. Assoziationsregeln zeigen in diesen Daten Abhängigkeiten zwischen den Elementen eines Datenbestandes, beispielsweise zwischen verkauften Produkten. Diese Regeln können mit Interessantheitsmaßen versehen werden, welche dem Anwender das Erkennen wichtiger Zusammenhänge ermöglichen. Es werden Ansätze gezeigt, dem Nutzer die Auswertung der Daten zu erleichtern. Das betrifft sowohl die robuste Arbeitsweise der Verfahren als auch die einfache Auswertung der Regeln. Die vorgestellten Algorithmen passen sich dabei an die zu verarbeitenden Daten an, was sie von anderen Verfahren unterscheidet. Assoziationsregelsuchen benötigen die Extraktion häufiger Kombinationen (EHK). Hierfür werden Möglichkeiten gezeigt, Lösungsansätze an die Eigenschaften moderner Systeme anzupassen. Als Ansatz werden Verfahren zur Berechnung der häufigsten N Kombinationen erläutert, welche anders als bekannte Ansätze leicht konfigurierbar sind. Moderne Systeme rechnen zudem oft verteilt. Diese Rechnerverbünde können große Datenmengen parallel verarbeiten, benötigen jedoch die Vereinigung lokaler Ergebnisse. Für verteilte Top-N-EHK auf realistischen Partitionierungen werden hierfür Ansätze mit verschiedenen Eigenschaften präsentiert. Aus den häufigen Kombinationen werden Assoziationsregeln gebildet, deren Aufbereitung ebenfalls einfach durchführbar sein soll. In der Literatur wurden viele Maße vorgestellt. Je nach den Anforderungen entspricht jedes von ihnen einer subjektiven Bewertung, allerdings nicht zwingend der des Anwenders. Hierfür wird untersucht, wie mehrere Interessantheitsmaße zu einem globalen Maß vereinigt werden können. Dies findet Regeln, welche nach mehreren Maßen wichtig erscheinen. Der Nutzer kann mit den Vorschlägen sein Suchziel eingrenzen. Ein zweiter Ansatz gruppiert Regeln. Dies erfolgt über die Häufigkeiten der Regelelemente, welche die Grundlage von Interessantheitsmaßen bilden. Die Regeln einer solchen Gruppe sind daher bezüglich vieler Interessantheitsmaße ähnlich und können gemeinsam ausgewertet werden. Dies reduziert den manuellen Aufwand des Nutzers. Diese Arbeit zeigt Möglichkeiten, Assoziationsregelsuchen auf einen breiten Benutzerkreis zu erweitern und neue Anwender zu erreichen. Die Assoziationsregelsuche wird dabei derart vereinfacht, dass sie statt als Spezialanwendung als leicht nutzbares Werkzeug zur Datenanalyse verwendet werden kann. / The importance of data mining is widely acknowledged today. Mining for association rules and frequent patterns is a central activity in data mining. Three main strategies are available for such mining: APRIORI, FP-tree-based approaches like FP-GROWTH, and algorithms based on vertical data structures and depth-first mining strategies like ECLAT and CHARM. Unfortunately, most of these algorithms are only moderately suitable for many “real-world” scenarios because their usability and the special characteristics of the data are two aspects of practical association rule mining that require further work. All mining strategies for frequent patterns use a parameter called minimum support to define a minimum occurrence frequency for searched patterns. This parameter cuts down the number of patterns searched to improve the relevance of the results.
In complex business scenarios, it can be difficult and expensive to define a suitable value for the minimum support because it depends strongly on the particular datasets. Users are often unable to set this parameter for unknown datasets, and unsuitable minimum-support values can extract millions of frequent patterns and generate enormous runtimes. For this reason, it is not feasible to permit ad-hoc data mining by unskilled users. Such users do not have the knowledge and time to define suitable parameters by trial-and-error procedures. Discussions with users of SAP software have revealed great interest in the results of association-rule mining techniques, but most of these users are unable or unwilling to set very technical parameters. Given such user constraints, several studies have addressed the problem of replacing the minimum-support parameter with more intuitive top-n strategies. We have developed an adaptive mining algorithm to give untrained SAP users a tool to analyze their data easily without the need for elaborate data preparation and parameter determination. Previously implemented approaches to distributed frequent-pattern mining were expensive and time-consuming tasks for specialists. In contrast, we propose a method to accelerate and simplify the mining process by using top-n strategies and relaxing some requirements on the results, such as completeness. Unlike such data approximation techniques as sampling, our algorithm always returns exact frequency counts. The only drawback is that the result set may fail to include some of the patterns up to a specific frequency threshold. Another aspect of real-world datasets is the fact that they are often partitioned for shared-nothing architectures, following business-specific parameters like location, fiscal year, or branch office. Users may also want to conduct mining operations spanning data from different partners, even if the local data from the respective partners cannot be integrated at a single location for data security reasons or due to their large volume. Almost every data mining solution is constrained by the need to hide complexity. As far as possible, the solution should offer a simple user interface that hides technical aspects like data distribution and data preparation. Given that BW Accelerator users have such simplicity and distribution requirements, we have developed an adaptive mining algorithm to give unskilled users a tool to analyze their data easily, without the need for complex data preparation or consolidation. For example, Business Intelligence scenarios often partition large data volumes by fiscal year to enable efficient optimizations for the data used in actual workloads. For most mining queries, more than one data partition is of interest, and therefore distribution handling that leaves the data unaffected is necessary. The algorithms presented in this work have been developed to work with data stored in SAP BW. A salient feature of SAP BW Accelerator is that it is implemented as a distributed landscape that sits on top of a large number of shared-nothing blade servers. Its main task is to execute OLAP queries that require fast aggregation of many millions of rows of data. Therefore, the distribution of data over the dedicated storage is optimized for such workloads. Data mining scenarios use the same data from storage, but reporting takes precedence over data mining, and hence the data cannot be redistributed without massive costs.
Distribution by special data semantics or user-defined selections can produce many partitions and very different partition sizes. The handling of such real-world distributions for frequent-pattern mining is an important task, but it conflicts with the requirement of balanced partitions.
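A minimal sketch of the top-n idea on partitioned data follows: exact pair frequencies are counted per partition and merged to obtain the globally most frequent N combinations. The toy transactions, the restriction to item pairs, and the naive full merge are assumptions for illustration; the thesis proposes communication-efficient merge strategies for realistic partitionings.

```python
# Hedged sketch of a top-N frequent-pair search over horizontally partitioned
# transaction data, in the spirit of the top-n strategy described above.
# The partitioning, N, and the merge rule are illustrative, not SAP BW internals.
from collections import Counter
from itertools import combinations

partitions = [  # e.g. data split by fiscal year or branch office
    [["bread", "butter"], ["bread", "milk", "butter"]],
    [["milk", "bread"], ["beer", "chips"], ["bread", "butter"]],
]

def local_pair_counts(transactions):
    """Exact pair frequencies within one partition."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(set(items)), 2))
    return counts

# Merge the exact local counts and keep the globally most frequent N pairs.
N = 3
merged = Counter()
for part in partitions:
    merged.update(local_pair_counts(part))
print(merged.most_common(N))
```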
68

GoWeb: Semantic Search and Browsing for the Life Sciences

Dietze, Heiko 20 October 2010 (has links)
Searching is a fundamental task to support research. Current search engines are keyword-based. Semantic technologies promise a next generation of semantic search engines that will be able to answer questions. Current approaches either apply natural language processing to unstructured text or assume the existence of structured statements over which they can reason. This work provides a system for combining classical keyword-based search engines with semantic annotation. Conventional search results are annotated using a customized annotation algorithm that takes the textual properties and requirements such as speed and scalability into account. The biomedical background knowledge consists of the Gene Ontology, the Medical Subject Headings, and related entities such as protein/gene names and person names. Together they provide the relevant semantic context for a search engine for the life sciences. We develop the system GoWeb for semantic web search and evaluate it using three benchmarks. It is shown that GoWeb is able to aid question answering with success rates of up to 79%. Furthermore, the system also includes semantic hyperlinks that enable semantic browsing of the knowledge space. The semantic hyperlinks facilitate the use of the eScience infrastructure, even for complex workflows of composed web services. To complement the web search of GoWeb, other data sources and more specialized information needs are tested in different prototypes, including patent and intranet search. Semantic search is applicable to these usage scenarios, but the developed systems also show the limits of the semantic approach: the size, applicability, and completeness of the integrated ontologies, as well as technical issues of text extraction and metadata gathering. Additionally, semantic indexing is implemented as an alternative approach to semantic search and evaluated with a question-answering benchmark. A semantic index can help to answer questions and address some limitations of GoWeb. Still, the maintenance and optimization of such an index remain a challenge, whereas GoWeb provides a straightforward system.
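As a rough sketch of the annotation step, the snippet below matches ontology term labels in a search-result snippet by dictionary lookup. The miniature term dictionary, identifiers, and example snippet are invented for illustration; the real GoWeb annotator handles speed, scalability, and ambiguous labels far more carefully.

```python
# Hedged sketch of dictionary-based annotation of search-result snippets with
# ontology terms, loosely following the keyword-search-plus-annotation idea above.
import re

ontology = {  # hypothetical mini-vocabulary: label -> identifier
    "apoptosis": "GO:0006915",
    "protein kinase activity": "GO:0004672",
    "breast neoplasms": "MeSH:D001943",
}

def annotate(snippet):
    """Return (label, identifier, character span) for every label found in the text."""
    hits = []
    for label, ident in ontology.items():
        for m in re.finditer(re.escape(label), snippet, flags=re.IGNORECASE):
            hits.append((label, ident, m.span()))
    return hits

snippet = "TP53 regulates apoptosis and is frequently mutated in breast neoplasms."
print(annotate(snippet))
```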
69

GoPubMed: Ontology-based literature search for the life sciences

Doms, Andreas 06 January 2009 (has links)
Background: Most of our biomedical knowledge is only accessible through texts. The biomedical literature grows exponentially, and PubMed comprises over 18,000,000 literature abstracts. Recently, much effort has been put into the creation of biomedical ontologies which capture biomedical facts. The exploitation of ontologies to explore the scientific literature is a new area of research. Motivation: When people search, they have questions in mind. Answering questions in a domain requires knowledge of the terminology of that domain. Classical search engines do not provide background knowledge for the presentation of search results. Ontology-annotated structured databases allow for data mining. The hypothesis is that ontology-annotated literature databases allow for text mining. The central problem is to associate scientific publications with ontological concepts. This is a prerequisite for ontology-based literature search. The question then is how to answer biomedical questions using ontologies and a literature corpus. Finally, the task is to automate bibliometric analyses on a corpus of scientific publications. Approach: Recent joint efforts on automatically extracting information from free text showed that the applied methods are complementary. The idea is to employ the rich terminological and relational information stored in biomedical ontologies to mark up biomedical text documents. Based on established semantic links between documents and ontology concepts, the goal is to answer biomedical questions on a corpus of documents. The fully annotated literature corpus allows, for the first time, the automatic generation of bibliometric analyses for ontological concepts, authors, and institutions. Results: This work includes a novel annotation framework for free texts with ontological concepts. The framework allows recognition pattern rules to be generated from the terminological and relational information in an ontology. Maximum entropy models can be trained to distinguish the meaning of ambiguous concept labels. The framework was used to develop an annotation pipeline for PubMed abstracts with 27,863 Gene Ontology concepts. The evaluation of the recognition performance yielded a precision of 79.9% and a recall of 72.7%, improving on the previously used algorithm by 25.7% in F-measure. The evaluation was done on a manually created (by the original authors) curation corpus of 689 PubMed abstracts with 18,356 curations of concepts. Methods to reason over large amounts of documents with ontologies were developed. The ability to answer questions with the online system was shown on a set of biomedical questions from the TREC Genomics Track 2006 benchmark. This work includes the first ontology-based, large-scale, online, up-to-date bibliometric analysis for topics in molecular biology represented by GO concepts. The automatic bibliometric analysis is in line with existing, but often outdated, manual analyses. Outlook: A number of promising continuations of this work have been spun off. A freely available online search engine has a growing user community. A spin-off company, funded by the High-Tech Gründerfonds, commercializes the new ontology-based search paradigm. Several offshoots of GoPubMed, including GoWeb (general web search), Go3R (search in replacement, reduction, and refinement methods for animal experiments), and GoGene (search in gene/protein databases), are being developed.
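As a quick worked check of the evaluation figures quoted above, the reported precision and recall combine to an F-measure of roughly 76% under the usual F1 definition; this is our computation for orientation, not a number taken from the thesis.

```python
# Precision and recall as reported above combine to F1 ~ 0.761 under the standard
# harmonic-mean definition of the F-measure.
precision, recall = 0.799, 0.727
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # ~0.761
```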
70

Secure and Efficient Comparisons between Untrusted Parties

Beck, Martin 11 September 2018 (has links)
A vast number of online services are based on users contributing their personal information. Examples are manifold, including social networks, electronic commerce, sharing websites, lodging platforms, and genealogy. In all cases user privacy depends on a collective trust in all involved intermediaries, such as service providers, operators, administrators, or even help desk staff. A single adversarial party in the whole chain of trust voids user privacy. Even more, the number of intermediaries is ever growing. Thus, user privacy must be preserved at every time and stage, independent of the intrinsic goals of any involved party. Furthermore, next to these new services, traditional offline analytic systems are being replaced by online services run in large data centers. Centralized processing of electronic medical records, genomic data, or other health-related information is anticipated due to advances in medical research, better analytic results based on large amounts of medical information, and lowered costs. In these scenarios privacy is of utmost concern due to the large amount of personal information contained within the centralized data. We focus on the challenge of privacy-preserving processing of genomic data, specifically comparing genomic sequences. The problem that arises is how to efficiently compare private sequences of two parties while preserving the confidentiality of the compared data. It follows that the privacy of the data owner must be preserved, which means that as little information as possible must be leaked to any party participating in the comparison. Leakage can happen at several points during a comparison. The secured inputs for the comparing party might leak some information about the original input, or the output might leak information about the inputs. In the latter case, results of several comparisons can be combined to infer information about the confidential input of the party under observation. Genomic sequences serve as a use case, but the proposed solutions are more general and can be applied to the generic field of privacy-preserving comparison of sequences. The solution should be efficient such that performing a comparison yields runtimes linear in the length of the input sequences, producing acceptable costs for a typical use case. To tackle the problem of efficient, privacy-preserving sequence comparisons, we propose a framework consisting of three main parts. a) The basic protocol presents an efficient sequence comparison algorithm, which transforms a sequence into a set representation, allowing distance measures over input sequences to be approximated by distance measures over sets. The sets are then represented by an efficient data structure, the Bloom filter, which allows the evaluation of certain set operations without storing the actual elements of the possibly large set. This representation yields low distortion for comparing similar sequences. Operations on the set representation are carried out using efficient, partially homomorphic cryptographic systems to ensure data confidentiality of the inputs. The output can be adjusted to return either the actual approximated distance or the result of an in-range check of the approximated distance. b) Building upon this efficient basic protocol, we introduce the first mechanism to reduce the success of inference attacks by detecting and rejecting similar queries in a privacy-preserving way. This is achieved by generating generalized commitments for inputs.
This generalization is done by treating inputs as messages received from a noisy channel, to which error correction from coding theory is applied. In this way, similar inputs are defined as inputs whose generalized forms have a Hamming distance below a certain predefined threshold. We present a protocol to perform a zero-knowledge proof to assess whether the generalized input is indeed a generalization of the actual input. Furthermore, we generalize a very efficient inference attack on privacy-preserving sequence comparison protocols and use it to evaluate our inference-control mechanism. c) The third part of the framework lightens the computational load of the client taking part in the comparison protocol by presenting a compression mechanism for partially homomorphic cryptographic schemes. It reduces the transmission and storage overhead induced by the semantically secure homomorphic encryption schemes, as well as the encryption latency. The compression is achieved by constructing an asymmetric stream cipher such that the generated ciphertext can be converted into a ciphertext of an associated homomorphic encryption scheme without revealing any information about the plaintext. This is the first compression scheme available for partially homomorphic encryption schemes. Compression of ciphertexts of fully homomorphic encryption schemes is several orders of magnitude slower at the conversion from the transmission ciphertext to the homomorphically encrypted ciphertext. Indeed, our compression scheme achieves optimal conversion performance. It further allows keystreams to be generated offline and thus supports offloading to trusted devices. In this way, transmission, storage, and power efficiency are improved. We give security proofs for all relevant parts of the proposed protocols and algorithms to evaluate their security. A performance evaluation of the core components demonstrates the practicability of our proposed solutions, including a theoretical analysis and practical experiments to show the accuracy as well as the efficiency of the approximations and probabilistic algorithms. Several variations and configurations to detect similar inputs are studied in an in-depth discussion of the inference-control mechanism. A human mitochondrial genome database is used for the practical evaluation to compare genomic sequences and detect similar inputs as described by the use case. In summary, we show that it is indeed possible to construct an efficient and privacy-preserving comparison of (genomic) sequences while being able to control the amount of information that leaves the comparison. To the best of our knowledge, we also contribute to the field by proposing the first efficient privacy-preserving inference detection and control mechanism, as well as the first ciphertext compression system for partially homomorphic cryptographic systems.
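A minimal sketch of the set representation in part (a) follows: sequences are mapped to q-gram sets, stored in Bloom filters, and a Dice-style similarity is approximated from the bit vectors alone. The filter parameters, the q-gram length, and the plain (unencrypted) evaluation are illustrative assumptions; the thesis carries out these operations under partially homomorphic encryption, which is omitted here.

```python
# Hedged sketch of the basic idea in part (a): sequence -> q-gram set -> Bloom filter,
# then approximate a set-based similarity from the filters. Parameters are illustrative.
import hashlib

M, K, Q = 256, 3, 3  # filter size in bits, number of hash functions, q-gram length

def qgrams(seq, q=Q):
    return {seq[i:i + q] for i in range(len(seq) - q + 1)}

def bloom(items, m=M, k=K):
    bits = [0] * m
    for item in items:
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
    return bits

def dice_from_filters(a, b):
    """Approximate Dice similarity of the underlying sets from the bit vectors."""
    inter = sum(x & y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

s1, s2 = "ACGTACGTGACC", "ACGTACGTGATC"
print(dice_from_filters(bloom(qgrams(s1)), bloom(qgrams(s2))))
```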
