1

SWordNet: Inferring Semantically Related Words from Software Context

Yang, Jinqiu January 2013 (has links)
Code search is an integral part of software development and program comprehension. The difficulty of code search lies in the inability to guess the exact words used in the code. Therefore, it is crucial for keyword-based code search to expand queries with semantically related words, e.g., synonyms and abbreviations, to increase search effectiveness. However, relying on resources such as English dictionaries and WordNet to obtain semantically related words in software is of limited value, because many words that are semantically related in software are not semantically related in English; conversely, many words that are semantically related in English are not semantically related in software. This thesis proposes a simple and general technique to automatically infer semantically related words (referred to as rPairs) in software by leveraging the context of words in comments and code. In addition, we propose a ranking algorithm on the rPair results and study cross-project rPairs on two sets of software with similar functionality, i.e., media browsers and operating systems. We achieve a reasonable accuracy in nine large and popular code bases written in C and Java. Our further evaluation against the state of the art shows that our technique achieves higher precision and recall. In addition, the proposed ranking algorithm improves the rPair extraction accuracy by bringing correct rPairs to the top of the list. Our cross-project study successfully discovers overlapping rPairs among projects of similar functionality and finds that cross-project rPairs are more likely to be correct than project-specific rPairs. Since cross-project rPairs are highly likely to be general for software of the same type, the discovered overlapping rPairs can benefit other projects of the same type that have not been analyzed.
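The context-based inference described above can be illustrated with a toy sketch: if two words appear in otherwise identical token sequences, they become candidate rPairs. The tokenization, full-line context window, and support-count ranking below are illustrative assumptions, not the thesis's exact algorithm:

```python
# Toy sketch of context-based rPair inference: words that fill the same
# slot in otherwise identical token sequences become candidate pairs,
# ranked by how many distinct contexts support them. (Assumed simplification.)
from collections import defaultdict
from itertools import combinations

def candidate_rpairs(lines, min_support=2):
    slot_words = defaultdict(set)   # one-word-masked context -> words seen there
    pair_counts = defaultdict(int)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            context = tuple(tokens[:i]) + ("<SLOT>",) + tuple(tokens[i + 1:])
            slot_words[context].add(word)
    for words in slot_words.values():
        for a, b in combinations(sorted(words), 2):
            pair_counts[(a, b)] += 1
    ranked = sorted(pair_counts.items(), key=lambda kv: -kv[1])
    return [(pair, n) for pair, n in ranked if n >= min_support]

lines = [
    "close the file handle",
    "close the file descriptor",
    "open the file handle",
    "open the file descriptor",
]
print(candidate_rpairs(lines))
# [(('close', 'open'), 2), (('descriptor', 'handle'), 2)]
```

Note how merely interchangeable words (close/open) surface alongside true synonyms (handle/descriptor); this is exactly why the thesis pairs extraction with a ranking step that pushes correct rPairs to the top.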
2

An Investigation into Code Search Engines: The State of the Art Versus Developer Expectations

Li, Shuangyi 15 July 2022 (has links)
As essential software development tools, code search engines are expected to provide superior accuracy, usability, and performance. However, prior research has neither (1) summarized, categorized, and compared representative code search engines, nor (2) analyzed the actual expectations that developers have for code search engines. This missing knowledge can empower developers to fully benefit from search engines, academic researchers to uncover promising research directions, and industry practitioners to properly marshal their efforts. This thesis fills the aforementioned gaps by drawing a comprehensive picture of code search engines, including their definition, standard processes, existing solutions, common alternatives, and developers' perspectives. We first study the state of the art in code search engines by analyzing academic papers, industry releases, and open-source projects. We then survey more than 100 software developers to ascertain their usage of and preferences for code search engines. Finally, we juxtapose the results of our study and survey to synthesize a call for action for researchers and industry practitioners to better meet the demands software developers make on code search engines. We present the first comprehensive overview of state-of-the-art code search engines by categorizing and comparing them based on their respective search strategies, applicability, and performance. Our user survey revealed a surprising lack of awareness of code search engines among many developers, along with a strong preference for using general-purpose search engines (e.g., Google) or code repositories (e.g., GitHub) to search for code. Our results also clearly identify typical usage scenarios and sought-after properties of code search engines. Our findings can guide software developers in selecting the code search engines most suitable for their programming pursuits, suggest new research directions for researchers, and help programming tool builders create effective code search engine solutions. / Master of Science / When developing software, programmers rely on source code search engines to find code snippets related to the programming task at hand. Given their importance for software development, source code search engines have become the focus of numerous research and industry projects. However, researchers and developers remain largely unaware of each other's efforts and expectations. As a consequence, developers find themselves struggling to determine which engine would best fit their needs, while researchers remain unaware of what developers expect from search engines. This thesis addresses this problem via a three-pronged approach: (1) it provides a systematic review of the research literature and major engines; (2) it analyzes the results of surveying software developers about their experiences with and expectations for code search engines; (3) it presents actionable insights that can guide future research and industry efforts in code search engines to better meet the needs of software developers.
3

A METHOD FOR FINDING BETTER SPACE-TIME CODES FOR MIMO CHANNELS

Panagos, Adam G., Kosbar, Kurt 10 1900 (has links)
ITC/USA 2005 Conference Proceedings / The Forty-First Annual International Telemetering Conference and Technical Exhibition / October 24-27, 2005 / Riviera Hotel & Convention Center, Las Vegas, Nevada / Multiple-input, multiple-output (MIMO) communication systems can have dramatically higher throughput than single-input, single-output systems. Unfortunately, it can be difficult to find the space-time codes these systems need to achieve their potential. Previously published results located good codes by minimizing the maximum correlation between transmitted signals. This paper shows how this min-max method may produce sub-optimal codes. A new method, which ranks codes based on the union bound of pairwise error probabilities, is presented. This new technique can identify superior MIMO codes, providing higher system throughput without increasing the transmitted power or bandwidth requirements.
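As a rough numerical illustration of ranking codebooks by a union bound of pairwise error probabilities, the sketch below uses the classical Chernoff-style rank-and-determinant bound for quasi-static Rayleigh fading; the paper's exact bound and search procedure may differ:

```python
# Hedged sketch: score each candidate space-time codebook by the union bound
#   P(error) <= sum over pairs of PEP(X -> X'),
# with the Chernoff-style bound PEP <= det(I + (SNR/4) D D^H)^(-n_rx),
# D = X - X' (the rank-and-determinant criterion; an assumption here).
import numpy as np
from itertools import combinations

def union_bound(codebook, snr=10.0, n_rx=2):
    """Sum of PEP Chernoff bounds over all codeword pairs (lower is better)."""
    n_tx = codebook[0].shape[0]
    total = 0.0
    for X, Xp in combinations(codebook, 2):
        D = X - Xp
        A = np.eye(n_tx) + (snr / 4.0) * D @ D.conj().T
        total += np.linalg.det(A).real ** (-n_rx)
    return total

rng = np.random.default_rng(0)
# Two random candidate codebooks: 4 codewords, 2 tx antennas, 2 symbol periods.
books = [[rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
          for _ in range(4)] for _ in range(2)]
scores = [union_bound(b) for b in books]
print("better codebook:", int(np.argmin(scores)), scores)
```

A min-max correlation search would instead compare codebooks by their single worst pair, which is how it can prefer a codebook whose overall error performance is worse.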
4

Source code search for automatic bug localization

Shayan Ali A Akbar (9761117) 14 December 2020 (has links)
This dissertation advances the state of the art in information retrieval (IR) based automatic bug localization for large software systems. We present techniques from three generations of IR-based bug localization and compare their performance on our large and diverse bug localization dataset: the Bugzbook dataset. The three generations span over fifteen years of research in mining software repositories for bug localization and include: (1) the generation of simple bag-of-words (BoW) based techniques; (2) the generation in which software-centric information, such as bug and code change histories as well as structured information embedded in bug reports and code files, is exploited to improve retrieval; and (3) the third and most recent generation, in which order and semantic relationships between terms are modeled to improve the performance of bug localization systems. The dissertation also presents a novel technique called SCOR (Source Code Retrieval with Semantics and Order), which combines Markov Random Field (MRF) based term-term ordering dependencies with semantic word vectors obtained from neural-network-based word embedding algorithms, such as word2vec, to better localize bugs in code files. The results presented in this dissertation show that while term-term ordering and semantic relationships significantly improve performance when modeled separately in retrieval systems, the best retrieval precision is obtained when they are modeled together in a single retrieval system. We also show that the semantic representations of software terms learned by training the word embedding algorithm on a corpus of software repositories can be used to perform search in new software code repositories not present in the training corpus.
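The core idea of combining term ordering with embedding-based semantics can be sketched as a weighted sum of three evidence sources. The weights, the toy embedding table, and the adjacent-pair matching below are illustrative stand-ins; SCOR itself is a proper MRF retrieval model with word2vec vectors trained on large software corpora:

```python
# Illustrative SCOR-style scoring: exact term matches + ordered adjacent-pair
# matches (sequential-dependence style) + embedding-based semantic matches.
# EMB is a tiny stand-in for word2vec vectors; weights are assumptions.
import numpy as np

EMB = {
    "read": np.array([1.0, 0.1]), "load": np.array([0.9, 0.2]),
    "file": np.array([0.1, 1.0]), "buffer": np.array([0.2, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score(query, doc, w=(1.0, 0.8, 0.5)):
    q, d = query.split(), doc.split()
    unigram = sum(t in d for t in q)                       # bag-of-words evidence
    bigrams = {(d[i], d[i + 1]) for i in range(len(d) - 1)}
    ordered = sum((q[i], q[i + 1]) in bigrams for i in range(len(q) - 1))
    semantic = sum(                                        # best embedding match
        max((cos(EMB[t], EMB[s]) for s in d if s in EMB), default=0.0)
        for t in q if t in EMB
    )
    return w[0] * unigram + w[1] * ordered + w[2] * semantic

print(score("read file", "load buffer into memory"))   # semantic evidence only
print(score("read file", "read file from disk"))       # all three components
```

The dissertation's finding corresponds to the observation that dropping either the `ordered` or the `semantic` component loses ranking signal that the other cannot recover.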
5

Evaluation and Implementation of Code Search using Transformers to Enhance Developer Productivity

Fredrikson, Sara, Månsson, Clara January 2023 (has links)
With the rapid advancements in the fields of Natural Language Processing and Artificial Intelligence, several aspects of their use cases and impact on productivity remain largely unexplored. Many recent machine learning models are based on an architecture called Transformers that allows for faster computation and for more context to be preserved. At the same time, tech companies face the dilemma of how to navigate their code bases, which span millions of lines of code. The aim of this thesis is to investigate whether the implementation and fine-tuning of a Transformers-based model can be utilised to improve the code search process in a tech company, leading to improvements in developer productivity. Specifically, the thesis evaluates the effectiveness of such an implementation from a productivity perspective in terms of velocity, quality, and satisfaction. The research uses a mixed-methods design consisting of two distinct methodologies as well as analyses of quantitative and qualitative data. To assess the level of accuracy that can be obtained by optimising a Transformers-based model on internal data, an evaluative experiment with various internal datasets was conducted. The second methodology was a usability test, investigating potential impacts on velocity, quality, and satisfaction by testing a contextual code-search prototype with developers. Data from the tests was analysed through heat-map, trade-off, and template analyses. Results indicate that a Transformers-based model can be optimised for code search on internal data and has the potential to improve code search in terms of velocity, quality, and satisfaction.
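The thesis's model and internal data are not public, but the kind of embedding-based code search it evaluates can be sketched with a public pretrained code encoder. The model choice (microsoft/codebert-base) and mean pooling are assumptions here; fine-tuning would further train such an encoder on internal (query, code) pairs:

```python
# Hedged sketch of Transformer-based code search: embed queries and code
# snippets with a pretrained encoder and rank snippets by cosine similarity.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    vec = (out * mask).sum(1) / mask.sum(1)          # mean pooling over tokens
    return torch.nn.functional.normalize(vec, dim=1)

snippets = ["def add(a, b): return a + b",
            "def read_file(path): return open(path).read()"]
q = embed(["sum two numbers"])
c = embed(snippets)
best = (q @ c.T).argmax().item()                      # cosine ranking
print(snippets[best])
```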
6

Transformer-based Multistage Architectures for Code Search

González Lopez, Angel Luis January 2021 (has links)
Code search is one of the most common tasks for developers. The open-source software movement and the rise of social media have made this process easier, thanks to the vast public software repositories available to everyone and the Q&A sites where individuals can resolve their doubts. However, for poorly documented code that is difficult to search in a repository, or for private enterprise frameworks that are not publicly available and thus have no Q&A community to answer questions, searching for code snippets to resolve doubts or learn how to use an API becomes very complicated. To address this problem, this thesis studies the use of natural language in code retrieval. In particular, it studies transformer-based models, such as Bidirectional Encoder Representations from Transformers (BERT), which are currently the state of the art in natural language processing but present high latency in information retrieval tasks. This project therefore proposes a multi-stage architecture that seeks to maintain the performance of standard BERT-based models while reducing the high latency usually associated with this type of framework. Experiments show that this architecture outperforms previous non-BERT-based models by +0.17 on the Top 1 (or Recall@1) metric and reduces latency, with inference times 5% of those of standard BERT models.
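The latency argument can be made concrete with a schematic two-stage pipeline: a cheap lexical retriever prunes the corpus so the expensive Transformer scorer only sees k candidates. `bert_score` below is a deliberate placeholder for the fine-tuned BERT ranker, not the thesis's actual model:

```python
# Schematic multi-stage retrieval: stage 1 scores the whole corpus cheaply,
# stage 2 re-ranks only the top-k survivors with the expensive model, which
# is where the latency savings come from.
def lexical_score(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def bert_score(query, doc):
    # Placeholder for a fine-tuned BERT relevance model; a lexical stand-in
    # keeps this sketch self-contained and runnable.
    return lexical_score(query, doc)

def search(query, corpus, k=10):
    candidates = sorted(corpus, key=lambda d: -lexical_score(query, d))[:k]
    return max(candidates, key=lambda d: bert_score(query, d))

corpus = ["parse a json string into a dict", "open a tcp socket to a host"]
print(search("parse json string", corpus, k=1))
```

With a corpus of N documents, the Transformer runs k times instead of N, which matches the reported inference times of roughly 5% of a standard full-BERT ranker when k is a small fraction of N.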
7

Multi-modal Neural Representations for Semantic Code Search

Gu, Jian January 2020 (has links)
In recent decades, various software systems have gradually become the basis of our society. Programmers search existing code snippets from time to time in their daily work. It would therefore be beneficial and meaningful to have better solutions for the task of semantic code search, which is to find the most semantically relevant code snippets for a given query. Our approach is to introduce tree representations through multi-modal learning. The core idea is to enrich the semantic information of code snippets by preparing data of different modalities while ignoring syntactic information. We design a novel tree structure named the Simplified Semantic Tree and extract RootPath representations from it. We use the RootPath representation to complement the conventional sequential representation, namely the token sequence of the code snippet. Our multi-modal model receives a code-query pair as input and computes a similarity score as output, following the pseudo-siamese architecture. For each pair, besides the ready-made code sequence and query sequence, we extract one extra tree sequence from the Simplified Semantic Tree. There are three encoders in our model, which respectively encode these three sequences as vectors of the same length. We then combine the code vector with the tree vector into one joint vector, still of the same length, as the multi-modal representation of the code snippet. We introduce a triplet loss to ensure that the vectors of code and query in the same pair are close in the shared vector space. We conduct experiments on a large-scale multi-language corpus, with comparisons against strong baseline models on specified performance metrics. Among the baseline models, the simplest, Neural Bag-of-Words, performs most satisfyingly, indicating that syntactic information is likely to distract complex models from critical semantic information. Results show that our multi-modal representation approach performs better, surpassing the baseline models by far in most cases. The key to our multi-modal model is that it focuses entirely on semantic information and learns from data of multiple modalities.
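A condensed sketch of the pseudo-siamese setup: three encoders produce equal-length vectors, the code and tree vectors are fused into one joint vector, and a triplet loss pulls matching code-query pairs together. The LSTM encoders and sum fusion below are illustrative assumptions; the thesis's encoder internals may differ:

```python
# Sketch of the three-encoder, joint-vector, triplet-loss training step.
import torch
import torch.nn as nn

DIM = 128
code_enc, tree_enc, query_enc = (nn.LSTM(64, DIM, batch_first=True)
                                 for _ in range(3))

def encode(encoder, seq):                # seq: (batch, time, 64)
    _, (h, _) = encoder(seq)
    return h[-1]                         # final hidden state: (batch, DIM)

def joint_code_vec(code_seq, tree_seq):
    # Fuse code and tree vectors into one joint vector of the same length
    # (elementwise sum is an assumed fusion rule for illustration).
    return encode(code_enc, code_seq) + encode(tree_enc, tree_seq)

loss_fn = nn.TripletMarginLoss(margin=1.0)
code = joint_code_vec(torch.randn(8, 20, 64), torch.randn(8, 30, 64))
query_pos = encode(query_enc, torch.randn(8, 10, 64))
query_neg = encode(query_enc, torch.randn(8, 10, 64))
loss = loss_fn(code, query_pos, query_neg)   # anchor, positive, negative
loss.backward()
```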
8

On Developing Better Techniques for Self-Supervised Learning of Code Representations

Maes, Lucas 07 1900 (has links)
Code representations learned by deep learning models are a crucial component for certain software engineering applications such as code search or clone detection. The performance of these applications depends on the quality of the representations learned by the models. In fact, low-noise representations containing highly abstract information, such as functional semantics, facilitate the resolution of these tasks. Indeed, code search requires understanding the objectives of code snippets in order to compare them with a natural language query, while clone detection requires determining whether two code snippets have the same functional semantics. The ability of models to learn representations containing such abstract information is therefore crucial to the successful resolution of these tasks. However, it is still difficult for code models to learn abstract representations that are independent of syntax, such as functional semantics. This thesis is therefore dedicated to developing better techniques for learning code representations via self-supervised learning. More specifically, we focus on two central tasks in software engineering automation that require a minimum understanding of functional semantics, namely code search and type-4 clone detection. This work proposes different approaches at different degrees of training. The first, pre-training, consists of learning generic code representations that can be adapted to any problem. The second, fine-tuning, modifies the learned representations for a specific problem. First, we propose a new pre-training algorithm for code models using a regularized non-contrastive method adapted from VICReg [14], enabling the learning of generic representations. Second, we propose a new fine-tuning objective for code models using knowledge distillation from a set of already fine-tuned models, called teachers, into a student model, allowing it to learn more abstract representations. The aim of these contributions is not only to improve code representations and maximize the performance of machine learning models for code, but also to determine the best degree of training to adopt for this purpose. The experimental results and analyses carried out in this thesis are preliminary and do not permit definitive conclusions. Nevertheless, it is worth highlighting that the second contribution outperforms the classical fine-tuning method for code search. Moreover, the approaches described suggest innovative and unconventional research directions.
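The VICReg-style objective the thesis adapts can be written compactly: an invariance term matching two views of the same code, a variance term keeping each embedding dimension spread out, and a covariance term decorrelating dimensions. The coefficients below follow the original VICReg paper and are assumptions with respect to this thesis:

```python
# Compact VICReg-style loss over two embedded views of the same code batch.
import torch
import torch.nn.functional as F

def vicreg_loss(z1, z2, sim=25.0, var=25.0, cov=1.0):
    n, d = z1.shape
    inv = F.mse_loss(z1, z2)                              # invariance term
    def variance(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return F.relu(1.0 - std).mean()                   # hinge at std = 1
    def covariance(z):
        zc = z - z.mean(dim=0)
        c = (zc.T @ zc) / (n - 1)
        off = c - torch.diag(torch.diag(c))
        return (off ** 2).sum() / d                       # off-diagonal penalty
    return (sim * inv + var * (variance(z1) + variance(z2))
            + cov * (covariance(z1) + covariance(z2)))

# Two "views" of the same code batch, e.g. two augmentations or dropout passes.
z1, z2 = torch.randn(32, 16), torch.randn(32, 16)
print(vicreg_loss(z1, z2))
```

Being non-contrastive, this objective needs no negative pairs: the variance and covariance regularizers are what prevent the representations from collapsing.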
