201 |
AI Pinpoints Sustainability Priorities where Surveys Can't : Towards Sustainable Public Procurement with Unsupervised Text Classification / AI hittar hållbarhetsprioriteringar där enkäter går bet : Mot hållbara offentliga upphandlingar med oövervakad textklassificering
Nordstrand, Mattias January 2024 (has links)
There are many sustainability issues related to products, services, and business processes. For example, the production, use, and disposal of IT equipment all have a sustainability impact, so buying more sustainable IT equipment can make a difference. More sustainable IT equipment can be acquired by selecting equipment with a sustainability certification such as TCO Certified, which makes sustainable purchasing easier and is particularly useful in public procurement. Public procurement is complex because it must guarantee objectivity and transparency. Transparency also means that many public procurement documents are openly available and can be analyzed. We hypothesized that the sustainability focuses in these documents (what the text is about) reflect the sustainability priorities of professional buyers (which are only indirectly observable). With this link, we investigated differences in sustainability priorities using a machine learning model that predicts sustainability focuses in public procurement documents. Using a large language model, we automatically extracted sustainability focuses in procurement documents from the e-procurement platform TED (Tenders Electronic Daily), thereby measuring the sustainability focus of countries all over the globe. Through interviews with experts, we saw several indications that this method is a good way of pinpointing sustainability priorities. We provide maps of sustainability focuses around the world (in section 4.12) and analyze the results in depth. One interesting finding is that countries generally do not prioritize an issue more when the issue is a larger concern for them; counterintuitively, they prioritize an issue more when it is of lesser concern. One example is circularity focus, which we note is generally lower in countries with worse waste management. To our knowledge, sustainability focuses in procurement documents have not been analyzed on this scale before. We believe these novel results can lead the way to a better understanding of sustainability priorities around the world.
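The abstract does not name the model or the label taxonomy used. As a minimal sketch of how sustainability focuses could be extracted from procurement text without labelled training data, the following assumes the Hugging Face transformers zero-shot pipeline; the label set and example sentence are hypothetical, not taken from the thesis.

    # Zero-shot labelling of procurement text; labels are hypothetical.
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    labels = ["circularity", "energy efficiency", "hazardous substances",
              "socially responsible manufacturing"]

    text = ("Laptops must be TCO Certified and contain at least 30% "
            "post-consumer recycled plastic.")

    # multi_label=True scores each label independently, so one document
    # can carry several sustainability focuses at once.
    result = classifier(text, candidate_labels=labels, multi_label=True)
    for label, score in zip(result["labels"], result["scores"]):
        print(f"{label}: {score:.2f}")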
|
202 |
Malicious Intent Detection Framework for Social Networks
Fausak, Andrew Raymond 05 1900 (has links)
Many, if not all, people have online social accounts (OSAs) on an online community (OC) such as Facebook (Meta), Twitter (X), Instagram (Meta), Mastodon, or Nostr. OCs enable quick and easy sharing of information with friends, family, and wider online communities. There is also a dark side to OCs, where users with malicious intent join OC platforms to pursue criminal activities such as spreading fake news/information, cyberbullying, propaganda, phishing, stealing, and unjust enrichment. These activities are especially concerning when they harm minors. Detection and mitigation are needed to protect OCs and stop these criminals from harming others. Many solutions exist; however, they typically focus on a single category of malicious intent detection rather than offering an all-encompassing solution. To answer this challenge, we propose the first steps of a framework for analyzing and identifying malicious intent in OCs that we refer to as the malicious intent detection framework (MIDF). MIDF is an extensible proof of concept that uses machine learning techniques to enable detection and mitigation. The framework is first used to detect malicious users using solely relationships and can then be leveraged to create a suite of malicious-intent vector detection models, including phishing, propaganda, scams, cyberbullying, racism, spam, and bots, for open-source online social networks such as Mastodon and Nostr.
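The abstract states that MIDF first detects malicious users using solely relationships. A minimal sketch of one way relationship-only detection could work, assuming networkx and scikit-learn on synthetic data; the graph, features, and labels are illustrative stand-ins, not MIDF's actual design.

    # Relationship-only malicious-account detection on a synthetic graph.
    import networkx as nx
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    G = nx.barabasi_albert_graph(200, 3)    # stand-in follower graph
    y = np.random.randint(0, 2, 200)        # stand-in malicious/benign labels

    def node_features(G, n):
        # Purely structural features: no message content is used.
        return [G.degree(n),
                nx.clustering(G, n),
                nx.average_neighbor_degree(G, nodes=[n])[n]]

    X = np.array([node_features(G, n) for n in G.nodes()])
    clf = RandomForestClassifier(random_state=0).fit(X, y)
    print(clf.predict(X[:5]))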
|
203 |
[en] TEXT CATEGORIZATION: CASE STUDY: PATENT'S APPLICATION DOCUMENTS IN PORTUGUESE / [pt] CATEGORIZAÇÃO DE TEXTOS: ESTUDO DE CASO: DOCUMENTOS DE PEDIDOS DE PATENTE NO IDIOMA PORTUGUÊS
NEIDE DE OLIVEIRA GOMES 08 January 2015 (has links)
[en] Nowadays, text categorizers built with machine learning techniques have achieved good results, making automatic text categorization viable. The purpose of this study was to define several models for the categorization of patent applications in the Portuguese language. For this setting, a committee composed of six models was proposed, using a variety of techniques. The text base consisted of 1157 abstracts of patent applications filed at INPI by national applicants, distributed across several categories. Among the models proposed for the processing step of text categorization, we highlight the one developed for Method 01, the k-Nearest-Neighbor (k-NN) model, which is also used for patent categorization in the English language. For the other models, methods were selected that are not traditional in the patent environment: four models use algorithms in which the categories are represented by centroid vectors, and one model explores the High Order Bit (HOB) technique together with the k-NN algorithm, with k set to all the training documents. For the pre-processing step, two techniques were implemented: Porter's stemming algorithm and the StemmerPortuguese algorithm, both with modifications of the originals. Pre-processing also included stopword removal and the treatment of compound terms. For the indexing step, the main technique used was the term-weighting scheme of modified term frequency times inverse document frequency (TF-IDF). The similarity or distance measures used were: cosine, Jaccard, DICE, Similarity Measure, and HOB. Results were obtained using relevance prediction and ranking techniques. Among the methods implemented in this work, we highlight the traditional k-NN, which obtained good results, although it demands much computational time.
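For orientation, a minimal sketch of the TF-IDF / k-NN / cosine combination the abstract describes, assuming scikit-learn; the toy abstracts and categories are hypothetical, and the pre-processing the thesis uses (Portuguese stemming, stopword removal, compound terms) is omitted.

    # k-NN over TF-IDF vectors with cosine distance, as in Method 01.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    docs = ["processo para producao de um polimero ...",
            "circuito eletrico para regulacao de tensao ...",
            "composicao farmaceutica compreendendo ..."]
    categories = ["quimica", "eletricidade", "quimica"]

    model = make_pipeline(
        TfidfVectorizer(),                  # stemming/stopwords would go here
        KNeighborsClassifier(n_neighbors=1, metric="cosine"),
    )
    model.fit(docs, categories)
    print(model.predict(["catalisador para polimerizacao ..."]))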
|
204 |
Tuning of machine learning algorithms for automatic bug assignment
Artchounin, Daniel January 2017 (has links)
In software development projects, bug triage consists mainly of assigning bug reports to software developers or teams (depending on the project). The partial or total automation of this task would have a positive economic impact on many software projects. This thesis introduces a systematic four-step method to find some of the best configurations of several machine learning algorithms for the automatic bug assignment problem. The four steps are used, respectively, to select a combination of pre-processing techniques, a bug report representation, and a potential feature selection technique, and to tune several classifiers. The method has been applied to three software projects: 66 066 bug reports of a proprietary project, 24 450 bug reports of Eclipse JDT, and 30 358 bug reports of Mozilla Firefox. 619 configurations have been applied and compared on each of these three projects. In production, using the approach introduced in this work on the bug reports of the proprietary project would have increased the accuracy by up to 16.64 percentage points.
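The 619 configurations themselves are not listed in the abstract. As a minimal sketch of the four-step idea (representation, feature selection, and classifier hyper-parameters searched jointly), assuming scikit-learn; the toy reports, teams, and grid values are hypothetical.

    # Joint search over representation, feature selection and classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    reports = ["crash when saving file", "login button unresponsive",
               "crash on startup", "ui freezes on click",
               "save dialog crashes app", "button style broken"]
    teams = ["core", "ui", "core", "ui", "core", "ui"]

    pipe = Pipeline([("vec", TfidfVectorizer()),
                     ("sel", SelectKBest(chi2)),
                     ("clf", LogisticRegression(max_iter=1000))])

    grid = {"vec__ngram_range": [(1, 1), (1, 2)],   # representation
            "sel__k": [5, "all"],                   # feature selection
            "clf__C": [0.1, 1.0, 10.0]}             # classifier tuning

    search = GridSearchCV(pipe, grid, scoring="accuracy", cv=2)
    search.fit(reports, teams)
    print(search.best_params_)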
|
205 |
The research on Chinese text multi-label classification / Avancée en classification multi-labels de textes en langue chinoise / 中文文本多标签分类研究
Wei, Zhihua 07 May 2010 (has links)
Text classification (TC) is an important field in information technology with many valuable applications. Faced with a sea of information resources, the objects of TC have become more complicated and diverse, and research into effective, practical TC technology is fairly challenging. More and more researchers consider multi-label TC better suited to many applications. Building on a large body of algorithms for single-label and multi-label TC, this thesis analyses the difficulties and problems in multi-label TC and in Chinese text representation. Aiming at the high dimensionality of the feature space, the sparse distribution of text representations, and the poor performance of multi-label classifiers, this thesis puts forward corresponding algorithms from different angles.

Focusing on the "curse of dimensionality" that arises when Chinese texts are represented using n-grams, a two-step feature selection algorithm is constructed. The method combines filtering rare features within a class with selecting discriminative features across classes. Moreover, the proper value of n, the feature-weighting strategy, and the correlation among features are discussed on the basis of a variety of experiments, contributing some useful conclusions to the research on n-gram representation of Chinese texts.

In view of a disadvantage of the Latent Dirichlet Allocation (LDA) model, namely that it arbitrarily revises the variable in the smoothing process, a new smoothing strategy based on Tolerance Rough Sets (TRS) is put forward. It first constructs tolerance classes over the global vocabulary and then assigns a value to each out-of-vocabulary (OOV) word in each class according to its tolerance class.

To improve the performance of multi-label classifiers and reduce computational complexity, a new TC method based on the LDA model is applied to Chinese text representation. It statistically extracts topics from texts and then represents each text by its topic vector. The method shows competitive performance on both English and Chinese corpora.

To further enhance classifier performance in multi-label TC, a compound classification framework is proposed. It partitions the text space by computing upper and lower approximations, decomposing a multi-label TC problem into several single-label TC problems and several multi-label TC problems with fewer labels than the original. That is, an unknown text is classified by a single-label classifier when it falls into the lower approximation space of some class; otherwise, it is classified by the corresponding multi-label classifier.

An application system, TJ-MLWC (Tongji Multi-label Web Classifier), was designed. It can directly take results returned by search engines and classify them in real time using an improved Naïve Bayes classifier, which makes browsing more convenient: users can immediately locate the texts they are interested in according to the class information given by TJ-MLWC.
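As a minimal sketch of representing texts by LDA topic vectors and feeding them to a multi-label classifier, assuming scikit-learn; the documents, labels, and the one-vs-rest Naïve Bayes setup are hypothetical stand-ins for the thesis's actual models.

    # Texts represented by LDA topic vectors, then multi-label classification.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import GaussianNB

    docs = ["stock market rises on tech earnings",
            "new vaccine trial shows promise",
            "tech giant acquires health startup"]
    Y = np.array([[1, 0], [0, 1], [1, 1]])   # label columns: finance, health

    # Topic proportions replace the high-dimensional sparse term vectors.
    counts = CountVectorizer().fit_transform(docs)
    topics = LatentDirichletAllocation(n_components=2,
                                       random_state=0).fit_transform(counts)

    clf = OneVsRestClassifier(GaussianNB()).fit(topics, Y)
    print(clf.predict(topics))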
|