  • About
  • The Global ETD Search service is a free service for researchers to find electronic theses and dissertations. This service is provided by the Networked Digital Library of Theses and Dissertations.
    Our metadata is collected from universities around the world. If you manage a university/consortium/country archive and want to be added, details can be found on the NDLTD website.
101

Développement d'algorithmes pour la fonction NCTR - Application des calculs parallèles sur les processeurs GPU / Algorithm development for NCTR function - Parallel Computing application on GPU cards

Boulay, Thomas 22 October 2013
The main subject of this thesis is the study of algorithms for non-cooperative target recognition (NCTR). The goal is to recognize targets within the "fighter" class using the range profile. Four algorithms are studied: one based on the k-nearest neighbours (KNN) algorithm, one on probabilistic methods, and two on fuzzy logic. A major constraint on NCTR algorithms is to control the error rate while maximizing the success rate. We show that the first two algorithms cannot satisfy this constraint, whereas the two algorithms based on fuzzy logic do. For the first of the fuzzy-logic algorithms, this comes at the expense of the success rate, in particular on real data. The second version, however, greatly increases the success rate while keeping the error rate under control. Its principle is to characterize class membership range bin by range bin, notably by introducing data acquired in an anechoic chamber. We also propose a procedure for adapting anechoic-chamber data acquired for one class to other target classes. The second major constraint on NCTR algorithms is real-time operation. An in-depth study of a GPU parallelization of the KNN-based algorithm was carried out at the beginning of the thesis. It identified the key points to take into account when parallelizing NCTR algorithms on GPUs. Its conclusions will make it possible to parallelize future NCTR algorithms efficiently on GPUs, including those proposed in this thesis.
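The trade-off between error rate and success rate described in this abstract can be illustrated with a k-nearest-neighbour classifier that declines to answer when the vote is not decisive. This is only a minimal sketch under assumptions, not the algorithm of the thesis: the function name `knn_with_reject`, the `min_agreement` threshold, and the toy three-bin "range profiles" are all hypothetical.

```python
from collections import Counter
import math

def knn_with_reject(train, query, k=3, min_agreement=1.0):
    """Majority vote among the k nearest training profiles.

    Returns None (a rejection) when the winning label holds less than
    `min_agreement` of the votes. Names and thresholds are illustrative.
    """
    # Keep the k training samples closest to the query (Euclidean distance).
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    # Rejecting ambiguous cases lowers the error rate, at the cost of
    # answering (and therefore succeeding) less often.
    return label if count / k >= min_agreement else None

# Hypothetical toy data: three range bins per profile, two target classes.
train = [
    ((1.0, 0.2, 0.1), "A"),
    ((0.9, 0.3, 0.1), "A"),
    ((1.1, 0.1, 0.2), "A"),
    ((0.1, 0.9, 1.0), "B"),
    ((0.2, 1.0, 0.9), "B"),
    ((0.1, 1.1, 1.1), "B"),
]

confident = knn_with_reject(train, (1.0, 0.2, 0.15))   # unanimous vote
ambiguous = knn_with_reject(train, (0.55, 0.6, 0.55))  # split vote, rejected
lenient = knn_with_reject(train, (0.55, 0.6, 0.55), min_agreement=0.5)
```

Lowering `min_agreement` makes the classifier answer more often, which is one simple way to trade error-rate control against success rate.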
102

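整合文件探勘與類神經網路預測模型之研究 -以財經事件線索預測台灣股市為例 / A study integrating document mining with a neural network prediction model: predicting the Taiwan stock market from financial event cues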

歐智民 Unknown Date
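With globalization and advances in information technology, the media now spread information far faster than before, so news events related to the stock market have increased in both volume and frequency and, in turn, affect the market. Today's investors are mostly familiar with traditional investment concepts, follow macroeconomic trends and indicators, and analyze price charts to predict closing prices. Beyond this, picking out from a large volume of news the key events that can support investment decisions is a skill that still needs to be developed, and it is the part investors are least familiar with; this study therefore explores it.   The study uses the 2009 financial news of the Liberty Times online edition (自由時報電子報; 5,767 articles) as its data source. The articles are clustered with a kNN technique based on document distance, and a time-interval concept is applied to improve the timeliness of the clusters. Using a category lexicon, the clustering results are then classified as positive, neutral, or negative news events and fed, together with quantitative stock market data (trading volume, closing price, and 3-day closing price), into a back-propagation neural network prediction model. Trading data for semiconductor stocks were obtained from the Taiwan Economic Journal (台灣經濟新報) and split into training and test sets of 168 and 83 trading days, respectively. The prediction model was built through the network's iterative learning process and compared with the original prediction model.   The results show, first, that the category lexicon can be built from closing-price returns and by filtering terms by frequency of occurrence, so that investors can reduce the amount of information in news documents through clustering and classification. Second, adding the classified news events to the back-propagation neural network model significantly improves both the directional correctness and the accuracy of next-day closing-price predictions, according to statistical significance tests at the 95% and 99% confidence levels. In other words, adding news events to the prediction model helps predict the next day's closing price. Finally, the study points out some directions for future research.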
103

在高度分散式環境下對高維度資料建立索引 / Indexing high-dimensional data in highly distributed environments

黃齡葦, Huang, Ling Wei Unknown Date
With data growing rapidly, large-scale, scalable, highly distributed database services are becoming a trend. In such an environment, a good index plays an important role in making queries efficient. Many database applications, such as time series, biological, image, music, and video databases, handle high-dimensional data. In these applications the k-nearest-neighbours (KNN) query, which finds the objects most similar to a given query object, is one of the most frequent operations, but it is costly in time and computation when the data are both high-dimensional and distributed. As in a conventional DBMS, indexes can improve query performance, but they cannot be deployed directly in highly distributed systems because the environment is more complex. To support KNN queries on high-dimensional data efficiently in a highly distributed environment, we propose an indexing strategy based on reference points [2,13], RP-CAN (Reference Point-Content Addressable Network). RP-CAN combines the routing protocol of CAN [14] with reference-point indexing. We implemented the proposed RP-CAN index and compared it experimentally with RT-CAN [1]; RP-CAN handled KNN queries on high-dimensional data more efficiently than RT-CAN.
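The general idea behind reference-point indexing, which the abstract above builds on, can be sketched on a single machine: precompute each point's distance to a reference point, then use the triangle inequality to skip points that cannot be among the k nearest. This is an illustrative sketch only; it does not model CAN routing or the actual RP-CAN design, and the names `build_index` and `knn_query`, the single reference point, and the grid data are assumptions.

```python
import heapq
import math

def build_index(points, ref):
    """Pair every point with its precomputed distance to the reference point."""
    return [(math.dist(p, ref), p) for p in points]

def knn_query(index, ref, query, k):
    """Exact k-NN search (k >= 1) with triangle-inequality pruning.

    Returns (neighbours ordered by distance, number of full distance
    computations that were actually needed).
    """
    d_qr = math.dist(query, ref)
    best = []      # max-heap of the k best so far, stored as (-distance, point)
    computed = 0
    # |d(q, r) - d(p, r)| is a lower bound on d(q, p); visit candidates in
    # order of increasing lower bound.
    for d_pr, p in sorted(index, key=lambda entry: abs(d_qr - entry[0])):
        lower_bound = abs(d_qr - d_pr)
        if len(best) == k and lower_bound >= -best[0][0]:
            break  # no remaining point can beat the current k-th neighbour
        d = math.dist(query, p)
        computed += 1
        if len(best) < k:
            heapq.heappush(best, (-d, p))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, p))
    neighbours = [p for _, p in sorted((-neg_d, p) for neg_d, p in best)]
    return neighbours, computed

# Hypothetical data: a 10 x 10 grid of 2-D points, reference point at the origin.
points = [(float(i), float(j)) for i in range(10) for j in range(10)]
ref = (0.0, 0.0)
index = build_index(points, ref)
neighbours, computed = knn_query(index, ref, (7.1, 2.2), k=3)
```

Pruning of this kind tends to weaken as dimensionality grows, which is one reason practical schemes usually combine several reference points; a single reference point is used here only to keep the sketch short.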
104

A Document Similarity Measure and Its Applications

Gan, Zih-Dian 07 September 2011
In this paper, we propose a novel similarity measure for document data processing and apply it to text classification and clustering. For two documents, the proposed measure takes three cases into account: (a) the feature considered appears in both documents, (b) the feature considered appears in only one document, and (c) the feature considered appears in neither document. In the first case, we set a lower bound and decrease the similarity according to the difference between the feature values of the two documents. In the second case, we assign a fixed value regardless of the magnitude of the feature value. In the last case, the feature is given no effect on the similarity. We apply the measure to the similarity-based single-label classifier k-NN and the multi-label classifier ML-KNN, and extend it to measure the similarity between a document and a set of documents for clustering with a k-means-like algorithm, in order to compare its effectiveness with that of other measures. Experimental results show that the proposed measure works more effectively than the others.
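The three cases above can be made concrete with a small sketch. The abstract does not give the actual formula, so everything numeric here is an assumption chosen only to respect the stated properties: the exponential decay, the `lower_bound` and `one_sided` parameters, and the function name `three_case_similarity` are hypothetical.

```python
import math

def three_case_similarity(doc_a, doc_b, lower_bound=0.5, one_sided=-1.0):
    """Illustrative three-case similarity between two sparse documents.

    doc_a, doc_b: dicts mapping feature -> non-negative weight.
    The per-feature scores are assumptions, not the published measure.
    """
    total, counted = 0.0, 0
    for feature in set(doc_a) | set(doc_b):
        a = doc_a.get(feature, 0.0)
        b = doc_b.get(feature, 0.0)
        if a == 0 and b == 0:
            continue  # case (c): absent from both documents, no effect
        if a > 0 and b > 0:
            # case (a): 1.0 for equal values, decaying toward `lower_bound`
            # as the two feature values move apart
            total += lower_bound + (1.0 - lower_bound) * math.exp(-abs(a - b))
        else:
            # case (b): fixed value, whatever the magnitude of the feature
            total += one_sided
        counted += 1
    return total / counted if counted else 0.0

identical = three_case_similarity({"x": 1.0, "y": 2.0}, {"x": 1.0, "y": 2.0})
disjoint = three_case_similarity({"x": 1.0}, {"y": 1.0})
mixed_small = three_case_similarity({"x": 1.0, "y": 2.0}, {"x": 1.0})
mixed_large = three_case_similarity({"x": 1.0, "y": 200.0}, {"x": 1.0})
```

In this sketch `mixed_small` and `mixed_large` are equal, which mirrors the statement that a feature present in only one document contributes a fixed value irrespective of its magnitude.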
105

Detection and counting of Powered Two Wheelers in traffic using a single-plane Laser Scanner

Prabhakar, Yadu 10 October 2013
The safety of Powered Two Wheelers (PTWs) is important for public authorities and road administrators around the world. Recent official figures show that PTWs are estimated to represent only 2% of total traffic but 30% of total deaths on French roads. However, as these estimates are obtained by simply counting registered number plates, they do not give a true picture of the PTWs on the road at any given moment. This dissertation is part of the METRAMOTO project; it is applied technical research and deals with two problems: the detection of PTWs and the use of a laser scanner to count PTWs in traffic. Traffic generally contains random vehicles of unknown nature and behaviour, such as their speed and their interaction with other road users. Several technologies can measure traffic, for example radars, cameras, and magnetometers, but because PTWs are small, they often move between lanes at quite a high speed compared with vehicles in the adjacent lanes. This makes them difficult to detect. The proposed solution is composed of the following parts. A configuration for installing the laser scanner on the road is chosen, and a data coherence method is introduced so that the system can detect the road verges and its own height above the road surface. This is validated with a simulator. The raw data obtained are then pre-processed and transformed into the spatio-temporal domain. Following this, an extraction algorithm called the Last Line Check (LLC) method is proposed. Once extracted, the object is classified using one of two classifiers, either the Support Vector Machine (SVM) or the k-Nearest Neighbour (KNN). Finally, the results given by the two classifiers are compared and presented.
The proposed solution is a prototype intended to be integrated into a real-time system that can be installed on a highway to detect, extract, classify, and count PTWs in real time under all traffic conditions (traffic at normal speeds, dense traffic, and even traffic jams).
106

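Approches variationnelles statistiques spatio-temporelles pour l'analyse quantitative de la perfusion myocardique en IRM / Spatio-temporal statistical variational approaches for the quantitative analysis of myocardial perfusion in MRI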

Hamrouni-Chtourou, Sameh 11 July 2012
Quantitative analysis of myocardial perfusion, i.e. the estimation of segmental perfusion indices and their comparison with normative values, is a major issue for the screening, treatment, and follow-up of ischemic cardiomyopathies, which are among the leading causes of death in Western countries. Over the last decade, perfusion magnetic resonance imaging (p-MRI) has become the modality of choice for non-invasive dynamic exploration of cardiac perfusion. p-MRI acquires time series of short-axis cardiac images at several slice levels along the long axis of the heart during the transit of a vascular contrast agent through the cardiac cavities and muscle. The resulting p-MRI exams show strong non-linear contrast variations and cardio-respiratory motion artifacts. Under these conditions, quantitative analysis of myocardial perfusion faces the complex problems of registering and segmenting non-rigid cardiac structures in p-MRI exams. This thesis aims to automate the quantitative analysis of myocardial perfusion by developing an unsupervised computer-aided diagnosis tool dedicated to first-pass cardiac perfusion MRI, comprising four processing steps: (1) automatic selection of a region of interest centred on the heart; (2) non-rigid compensation of cardio-respiratory motion over the whole exam; (3) segmentation of the cardiac contours; (4) quantification of myocardial perfusion. Our answers to the challenges identified at each step revolve around a common idea: exploiting the information carried by the transit kinetics of the contrast agent in the tissues to discriminate anatomical structures and guide the data registration process. The latter is the central work of this thesis.
Non-rigid image registration methods based on the optimization of information measures are a reference in medical imaging. Their usual scope is the alignment of image pairs by statistical matching of luminance distributions, handled through their marginal and joint probability densities, which are estimated by kernel methods. Effective for joint densities with well-separated classes or reducible to simple mixtures, these approaches reach their limits for non-linear mixtures, where pixel luminance proves too crude an attribute to allow a discriminating statistical decision, and for mono-modal data with non-linear variations and multi-modal data. This thesis introduces a generic mathematical model for multi-attribute/multi-view information-theoretic registration that addresses the identified challenges: (i) simultaneous alignment of the whole analysed p-MRI exam using an atlas, natural or synthetic, in which the heart is motionless, with pixel-wise enhancement curves as a dense set of primitives; and (ii) the ability to integrate composite image primitives, spatial or spatio-temporal, of high dimension. This model, available in the classical Shannon framework and in the generalized Ali-Silvey framework, is based on new k-nearest-neighbour geometric estimators of information measures that are consistent in arbitrary dimension. We study their variational optimization by deriving analytical expressions of their gradients over finite- and infinite-dimensional spaces of smooth spatial transformations, and by proposing efficient numerical and algorithmic gradient-descent schemes.
This general-purpose model is then instantiated for the targeted medical setting, and its performance, notably in terms of accuracy and robustness, is evaluated within an experimental protocol that is both qualitative and quantitative.
107

Processamento e propriedades do sistema ferroelétrico (Li,K,Na)(Nb,Ta)O3 dopado com CuO / Processing and properties of the CuO-doped (Li,K,Na)(Nb,Ta)O3 ferroelectric system

Zapata, Angélica Maria Mazuera 09 March 2015
Financiadora de Estudos e Projetos / The search for new lead-free piezoelectric materials has been a major goal of many scientists in recent years. The main motivation is to replace the widely used lead zirconate titanate (PZT) based ceramics, because lead is highly toxic. Potassium sodium niobate based ceramics have shown high piezoelectric coefficients and a morphotropic phase boundary close to the composition (K0.5Na0.5)NbO3 (KNN), similar to that found in lead zirconate titanate. However, preparing highly dense KNN-based ceramics is extremely difficult. In this work, the structural, mechanical, and electrical properties of lead-free ferroelectric ceramics with compositions Li0.03(K0.5Na0.5)0.97Nb0.8Ta0.2O3 + x wt% CuO (x = 0, 2, and 3.5) were studied. All compositions, sintered at 1050ºC for 2 hours, had high density, approximately 95% of the theoretical value. Rietveld refinement of the X-ray diffraction patterns showed a mixture of orthorhombic Bmm2 and tetragonal P4mm phases for all compositions; in compositions with high CuO content, however, the tetragonal phase is the majority phase. Dielectric and dynamic mechanical analysis (DMA) measurements showed two polymorphic phase transitions with increasing temperature. Both transitions are diffuse and can be related to the transformation of the orthorhombic phase fraction into the tetragonal phase, and to the transformation of the tetragonal ferroelectric phase into a cubic paraelectric one. The origin of the difference between the temperatures at which the two techniques, dielectric and mechanical, detect the diffuse phase transition is discussed. The ceramic with 2 wt% CuO is electrically softer than the other compositions and has the highest piezoelectric coefficient d31. This work also examined the possibility of using high CuO contents to promote liquid-phase formation in order to obtain and extract single-crystal seeds that can be used for texturing KNN-based ceramics. The ceramic Li0.03(Na0.5K0.5)0.97Ta0.2Nb0.8O3 + x wt% CuO with x = 16, sintered at 1090ºC for 2 hours, is a perfect candidate for extracting grains that may be used as seeds. Furthermore, ceramics with x = 13, sintered at 1110ºC for 2 hours, showed partial melting of the material, which caused the growth of highly oriented grains. This material can practically be considered a single crystal and, with a proper cutting procedure, the desired single-crystal seeds can be obtained. The method proposed in this work for obtaining single-crystal seeds is simple and novel.
108

Detection and counting of Powered Two Wheelers in traffic using a single-plane Laser Scanner / Détection de deux roues motorisées par télémètre laser à balayage

Prabhakar, Yadu 10 October 2013
The safety of powered two-wheelers (PTWs) is a key issue for public authorities and road managers. Although road casualties overall have fallen markedly since 2002, the relative share of accidents involving PTWs tends to increase. The following figures sum this up: PTWs account for about 2% of traffic and 30% of road deaths. The PTW fleet has been growing for several years, yet data and information are lacking on this mode of transport and on the interactions of PTWs with other road users and with the road infrastructure. This applied research work was carried out within the ANR project METRAMOTO and can be divided into two parts: the detection of PTWs and the detection of road objects by laser scanner. Road traffic in general contains vehicles of unknown nature and behaviour, for example in their speeds, trajectories, and interactions with other road users. Despite several technologies for measuring traffic, such as radars or electromagnetic loops, PTWs are hard to detect because their small size lets them travel at high speed, even between lanes. The method developed comprises several parts. First, an optimal configuration is chosen for installing the laser scanner on the road. Then a matching method is proposed to find the height and the edges of the road. The installation choice is validated with a simulator. The raw data are pre-processed and transformed into the spatio-temporal domain. After this pre-processing step, an extraction method called Last Line Check (LLC) is applied. Once a vehicle has been extracted, it is classified with an SVM and a KNN. A counter then counts the classified vehicles. Finally, the performance of the two classifiers is compared. The proposed solution is a prototype and can be integrated into a system installed on a road with arbitrary traffic (dense, free-flowing, congested) to detect, classify, and count PTWs in real time.
109

Switching hybrid recommender system to aid the knowledge seekers

Backlund, Alexander January 2020
In our daily lives, time is of the essence. People do not have time to browse through hundreds of thousands of digital items every day to find the right one. This is where a recommender system shines. Tigerhall is a company that distributes podcasts, ebooks, and events to subscribers. It is expanding its digital content warehouse, which leaves users with more data to filter. To make it easier for users to find the right podcast or the most exciting ebook or event, a recommender system has been implemented. A recommender system can be implemented in many different ways. Content-based filtering methods focus on information about the items and try to find relevant items based on it. Collaborative filtering methods instead use information about what a user has previously consumed, in correlation with what other users have consumed, to find relevant items. In this project, a hybrid recommender system that uses a k-nearest neighbors algorithm alongside a matrix factorization algorithm has been implemented. The k-nearest neighbors algorithm performed well despite the sparse data, while the matrix factorization algorithm performed worse. The matrix factorization algorithm did perform well when the user had consumed plenty of items.
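A switching hybrid of the kind named in the title can be sketched as a thin wrapper that picks one of two recommenders per user. The switching rule below (history size against a threshold) is an assumption suggested by the reported behaviour of the two algorithms, not the criterion confirmed by the thesis; `make_switching_recommender`, `min_items_for_mf`, and the stub recommenders are hypothetical.

```python
from typing import Callable, Dict, List, Set

Recommender = Callable[[str], List[str]]

def make_switching_recommender(
    history: Dict[str, Set[str]],
    knn_recommender: Recommender,
    mf_recommender: Recommender,
    min_items_for_mf: int = 5,
) -> Recommender:
    """Return a recommender that switches on the size of the user's history."""
    def recommend(user: str) -> List[str]:
        consumed = history.get(user, set())
        # Assumed rule: matrix factorization only once the user has
        # consumed enough items, k-nearest neighbours otherwise.
        if len(consumed) >= min_items_for_mf:
            chosen = mf_recommender
        else:
            chosen = knn_recommender
        # Never recommend something the user has already consumed.
        return [item for item in chosen(user) if item not in consumed]
    return recommend

# Stub recommenders standing in for the real models (hypothetical output).
def knn_stub(user: str) -> List[str]:
    return ["ebook-1", "pod-3", "event-2"]

def mf_stub(user: str) -> List[str]:
    return ["pod-1", "ebook-9", "event-7"]

history = {
    "new_user": {"ebook-1"},
    "heavy_user": {"pod-1", "pod-2", "ebook-1", "event-1", "ebook-2"},
}
recommend = make_switching_recommender(history, knn_stub, mf_stub)
for_new = recommend("new_user")
for_heavy = recommend("heavy_user")
```

Passing the two models in as plain callables keeps the switching logic independent of how either model is trained.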
110

Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting

Marin Rodenas, Alfonso January 2011
Nowadays email is commonly used by citizens to communicate with their government. Among the emails received, governments deal with common queries and subjects that handling officers have to answer manually. Automatic classification of incoming emails can increase communication efficiency by reducing the delay between a query and its response. This thesis is part of the IMAIL project, which aims to provide an automatic answering solution to the Swedish Social Insurance Agency (SSIA) ("Försäkringskassan" in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails on different automatic classifiers. The features extracted from the emails also depend on the preprocessing carried out beforehand. Compound splitting, lemmatization, stop word removal, Part-of-Speech tagging, and n-grams are the processes applied to the data set. Classification is performed using Support Vector Machines, k-Nearest Neighbors, and Naive Bayes. Precision, recall, and F-measure are used to analyze and compare the results. In the results obtained in this thesis, SVM provides the best classification, with an F-measure of 0.787. However, Naive Bayes provides a better classification than SVM for most of the email categories. Thus, it cannot be concluded whether SVM classifies better than Naive Bayes. Furthermore, a comparison with Dalianis et al. (2011) is made. The results obtained with this approach outperform the earlier ones: SVM achieved an F-measure of 0.858 when using PoS tagging on the original emails, which improves on the 0.83 obtained in Dalianis et al. (2011) by almost 3 percentage points. In this case, SVM was clearly better than Naive Bayes.
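For readers unfamiliar with the evaluation measures named in this abstract, the sketch below computes precision, recall, and F-measure (F1) for one category from true and predicted labels. The category names and the label lists are hypothetical and do not come from the SSIA data.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the category `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical email categories and predictions.
y_true = ["sick", "sick", "pension", "sick", "pension", "pension"]
y_pred = ["sick", "pension", "pension", "sick", "sick", "pension"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive="sick")
```

A single overall score across categories is usually obtained by averaging such per-category values; which averaging the thesis uses is not stated in the abstract.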
