1 |
Developing and Evaluating Methods for Mitigating Sample Selection Bias in Machine Learning
Pelayo Ramirez, Lourdes. Unknown Date
No description available.
|
2 |
Learning under differing training and test distributions
Bickel, Steffen. January 2008
One of the main problems in machine learning is to train a predictive model from training data and to make predictions on test data. Most predictive models are constructed under the assumption that the training data is governed by the exact same distribution which the model will later be exposed to. In practice, control over the data collection process is often imperfect. A typical scenario is when labels are collected by questionnaires and one does not have access to the test population. For example, parts of the test population are underrepresented in the survey, out of reach, or do not return the questionnaire. In many applications training data from the test distribution are scarce because they are difficult to obtain or very expensive. Data from auxiliary sources drawn from similar distributions are often cheaply available.
This thesis centers on learning under differing training and test distributions and covers several problem settings with different assumptions about the relationship between the two distributions, including multi-task learning and learning under covariate shift and sample selection bias. Several new models are derived that directly characterize the divergence between training and test distributions, without the intermediate step of estimating the two distributions separately. The integral part of these models is a set of rescaling weights that match the rescaled or resampled training distribution to the test distribution. Integrated models are studied in which only one optimization problem needs to be solved for learning under differing distributions. With a two-step approximation to the integrated models, almost any supervised learning algorithm can be adapted to biased training data.
In case studies on spam filtering, HIV therapy screening, targeted advertising, and other applications, the performance of the new models is compared to state-of-the-art reference methods.
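The rescaling-weight idea in this abstract can be illustrated with a small sketch. It is not the thesis's integrated model: it assumes a one-dimensional feature, a hand-rolled logistic-regression discriminator between training and test inputs, and synthetic Gaussian data, and all function and variable names are hypothetical. The discriminator's odds p(test | x) / p(train | x) are used as weights for the training examples, so neither distribution is estimated separately.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_discriminator(train_x, test_x, steps=300, lr=0.5):
    """Fit p(test | x) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    data = [(x, 0.0) for x in train_x] + [(x, 1.0) for x in test_x]
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, is_test in data:
            err = is_test - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b


def rescaling_weights(train_x, test_x):
    """Weight each training point by the odds p(test | x) / p(train | x)."""
    a, b = fit_discriminator(train_x, test_x)
    prior_ratio = len(train_x) / len(test_x)  # corrects for unequal sample sizes
    # For a logistic model the odds are exp(a + b * x).
    return [prior_ratio * math.exp(a + b * x) for x in train_x]


# Synthetic example (assumed data): the training sample is shifted relative to the test sample.
rng = random.Random(0)
train_x = [rng.gauss(0.0, 1.0) for _ in range(500)]  # biased training sample
test_x = [rng.gauss(1.0, 1.0) for _ in range(500)]   # target population
weights = rescaling_weights(train_x, test_x)

train_mean = sum(train_x) / len(train_x)
test_mean = sum(test_x) / len(test_x)
weighted_mean = sum(w * x for w, x in zip(weights, train_x)) / sum(weights)
print(train_mean, weighted_mean, test_mean)
```

In a two-step setting, weights of this kind would then be passed to any learner that accepts per-example weights; here the reweighted training mean merely stands in for that second step.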
|
3 |
Estimation of the Mincerian wage model addressing its specification and different econometric issues
Bhatti, Sajjad Haider. 03 December 2012
In this doctoral thesis, we estimate Mincer's (1974) semi-logarithmic wage function on French and Pakistani labour force data. This model is a standard tool for estimating the relationship between earnings or wages and their contributing factors. Despite its wide and extensive use, simple estimation of the Mincerian model is biased because of several econometric problems. The main sources of bias noted in the literature are endogeneity of schooling, measurement error, and sample selectivity. We tackle the endogeneity and measurement error biases with an instrumental variables two-stage least squares approach, for which we propose two new instrumental variables. The first is defined as "the average years of schooling in the family of the concerned individual" and the second as "the average years of schooling in the country, of particular age group, of particular gender, at the particular time when an individual had joined the labour force". Schooling is found to be endogenous for both countries. Comparing the two instruments, we select the second as the more appropriate. We apply the Heckman (1979) two-step procedure to eliminate possible sample selection bias, which is found to be significantly positive for both countries. This means that, in both countries, people who decided not to participate in the labour force as wage workers would have earned less than participants if they had decided to work as wage earners. We estimate a specification that tackles the endogeneity and sample selectivity problems together, because such studies are relatively scarce in the literature worldwide and absent for France and Pakistan in particular. The differences in coefficients demonstrate the value of this specification.
We also estimate the model semi-parametrically but, contrary to the general norm in the context of the Mincerian model, our semi-parametric estimation contains a non-parametric component from the first-stage schooling equation instead of a non-parametric component from the selection equation. For both countries, we find the parametric model to be more appropriate. We find the errors to be heteroscedastic for the data from both countries and therefore apply adaptive estimation to control the adverse effects of heteroscedasticity. Comparing the simple and adaptive estimations, we prefer the adaptive specification of the parametric model for both countries. Finally, we apply quantile regression to the model selected from the mean regression. Quantile regression reveals that the explanatory factors have different effects in different parts of the wage distribution in the two countries. For both Pakistan and France, this would be the first study to correct for both sample selectivity and endogeneity in a single specification within a quantile regression framework.
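The Heckman (1979) two-step procedure mentioned in this abstract adds a correction regressor, the inverse Mills ratio, to the wage equation estimated on participants. The sketch below shows only that term and checks it by simulation; it does not fit the first-step probit, and the index value and all names are hypothetical rather than taken from the thesis. A positive coefficient on this regressor in the second step would correspond to the positive selection effect the abstract reports.

```python
import math
import random


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def inverse_mills(z):
    """lambda(z) = pdf(z) / cdf(z), the correction regressor in Heckman's second step."""
    return normal_pdf(z) / normal_cdf(z)


# Check by simulation: if a person participates when index + u > 0 with u ~ N(0, 1),
# the mean of u among participants equals inverse_mills(index).
rng = random.Random(1)
index = 0.3  # hypothetical first-step probit index for one group of people
draws = [rng.gauss(0.0, 1.0) for _ in range(200_000)]
selected = [u for u in draws if index + u > 0]
simulated = sum(selected) / len(selected)
print(inverse_mills(index), simulated)
```

In a full two-step estimation, the index would come from a probit model of labour force participation, and `inverse_mills(index)` would be computed per person and included as an extra column in the wage regression.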
|
4 |
Three Essays on Evaluating the Impact of Natural Resource Management Programs
De los Santos Montero, Luis Alberto. 17 November 2017
No description available.
|
5 |
Estimation of the Mincerian wage model addressing its specification and different econometric issues
Bhatti, Sajjad Haider. 03 December 2012
In this thesis, our analytical framework rests on estimating the earnings function proposed by Mincer (1974). The aim is to revisit the specification of this model, paying attention to the associated estimation problems. A further aim is to compare the French and Pakistani labour markets using a more robust specification. [...] However, according to a large literature, simple estimation of the Mincer model is biased because of several problems. [...] In this thesis, two new instrumental variables are proposed in an IV2SLS-type application. [...] According to the analysis carried out in this thesis, the second instrumental variable appears to be the more appropriate, since it has a weaker direct effect on the response variable than the first proposed instrumental variable. Moreover, the definition of this instrumental variable is more robust than that of the first. [...] To eliminate another potential source of bias in estimating the Mincer model, namely selection bias, the classic two-step correction method proposed by Heckman (1979) was applied. With this method, the selection bias was found to be positive and statistically significant for both countries. [...] In the literature on estimating the Mincer model, we noted that very few studies correct both sources of bias simultaneously, and no study of this kind exists for France or Pakistan. [...] In response, we estimate here a single specification that simultaneously corrects sample selection bias and the endogeneity bias of education. We also noted, again from the literature, that the robustness of the assumptions of the linear model used to estimate the Mincer model has rarely been discussed and tested. [...]
We therefore formally tested the validity of the homoscedasticity assumption by applying White's (1980) test. [...] To avoid the effects of heteroscedastic errors on the estimation process, we carried out an adaptive estimation of the Mincer model. [...] Based on the overall performance of the parametric and semi-parametric models, we found that, for France, both forms of estimation appear well specified. To keep estimation simple, the parametric model was selected as the most appropriate for the French data. For Pakistan, we concluded that the semi-parametric model produces results for some of the variables that disagree with the general consensus in Pakistan and with the international literature. [...] Hence, as for the French data, we also chose the parametric model for the Pakistani data as the more robust for estimating the effects of the different explanatory factors on the wage determination process.
For both countries, after comparing the simple and adaptive versions of the parametric and semi-parametric models, we found that the parametric model in its adaptive specification performs better for estimating the effects of the different factors contributing to the wage determination process. Finally, we estimated the Mincer model in the parametric form chosen from these estimations, as more appropriate than the semi-parametric form, both in the mean regression analysis and in the quantile regression model. [...] The quantile regression method revealed that most of the explanatory variables influence wage earnings differently across the different parts of the wage distribution, for both labour markets considered.
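The instrumental variables two-stage least squares approach named in this abstract can be sketched on synthetic data. This is an illustration under stated assumptions, not the thesis's specification: there is a single instrument, a single endogenous regressor, an unobserved "ability" term that creates the endogeneity, and a made-up true return to schooling; every name and number below is hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope of y on x (regression with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def two_stage_least_squares(z, x, y):
    """Stage 1: project x on the instrument z. Stage 2: regress y on the fitted x."""
    first_stage = slope(z, x)
    mz, mx = mean(z), mean(x)
    x_hat = [mx + first_stage * (zi - mz) for zi in z]
    return slope(x_hat, y)


rng = random.Random(2)
TRUE_RETURN = 0.08  # hypothetical return to one year of schooling
n = 5000
ability = [rng.gauss(0, 1) for _ in range(n)]     # unobserved, affects schooling and wages
instrument = [rng.gauss(0, 1) for _ in range(n)]  # stand-in for an average-schooling instrument
schooling = [12 + zi + ai + rng.gauss(0, 1) for zi, ai in zip(instrument, ability)]
log_wage = [1 + TRUE_RETURN * s + 0.3 * a + rng.gauss(0, 0.3)
            for s, a in zip(schooling, ability)]

ols_estimate = slope(schooling, log_wage)  # biased upward by the omitted ability term
iv_estimate = two_stage_least_squares(instrument, schooling, log_wage)
print(ols_estimate, iv_estimate)
```

The instrument here is valid by construction because it affects wages only through schooling; the abstract's remark that the preferred instrument has a weak direct effect on the response concerns exactly this condition.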
|
6 |
An Analysis of Gender Wage Underpayment and Differential with Censoring: A Combination of the Stochastic Frontier Approach with Copula Methods
劉洪禎. Unknown Date
This paper uses the "Manpower Utilization Survey" database for Taiwan, compiled by the Directorate General of Budget, Accounting & Statistics, Executive Yuan, ROC (DGBAS), for ROC years 94, 96, 98, 100, and 102 (2005, 2007, 2009, 2011, and 2013), to study gender wage differentials and underpayment. The copula method is used to derive the copula density function and the joint probability density function of the composite errors, yielding a stochastic frontier model with copula methods that corrects for sample selection in the labor market. Wage equations are then estimated separately for men and women to evaluate each worker's wage efficiency and to decompose the average wage differential between men and women into several components. Male and female workers are each classified along 9 dimensions (age, work experience, occupation, industry, education, firm size, employment status, marital status, and work location), and wage efficiency is compared across the groups within each dimension.
The empirical results show that for firm size, employment status, and work location, the trend in wage efficiency within each gender is broadly the same whether or not sample selection is taken into account. For the remaining 6 dimensions, however, wage efficiency changes substantially once sample selection is corrected for. With the correction, most of the findings differ from what the earlier literature would predict. This may be because earlier studies of wage efficiency mostly ignored sample selection, excluding people without jobs entirely, so that their regression results apply only to the employed.
The paper also proposes, within the stochastic frontier framework, a new way to measure gender discrimination: the difference in wage inefficiency between men and women is treated as a form of discrimination. Measured this way, the wage differential in Taiwan's labor market in these 5 years can be attributed almost entirely to gender discrimination. This indicates that, even though the wage differential between men and women has narrowed over time, gender stereotypes persist in Taiwan's labor market and produce clear gender discrimination.
|
7 |
The Effect of Tax Deferred Exchange on Stock Return and Dividend in U.S. REITs Property Transaction
劉依涵 (Yi-Han Liu). Unknown Date
This study examines the tax-deferred exchanges made by publicly listed U.S. Real Estate Investment Trusts (REITs) from 2003 to 2006, using asset sell-off transactions as a comparison, to explore the effects of tax-deferred exchanges on stock returns and dividend distribution. The results show that tax-deferred exchanges have a negative announcement effect on stock returns, whereas sell-offs have a positive and significant announcement effect. The difference is explained by the earnings distribution requirement specific to REITs: to remain exempt from tax, U.S. REITs must pay out 90% of taxable earnings to shareholders as dividends each year. Unlike a sell-off, a tax-deferred exchange brings no cash inflow that could raise dividends, so it affects shareholders' future dividend income and shareholders do not react positively. However, the negative effect is weakened because shareholders take into account the efficiency gains from asset reallocation and because REITs usually pay more in dividends than the law requires. Further analysis of transaction type and dividend distribution supports the hypothesis that dividends paid after a tax-deferred exchange are lower than those paid after a direct sale.
Besides examining shareholders' reactions to transaction announcements, this study also surveys the relationship between the cash flows of different types of property transaction and REIT dividends, in order to clarify the factors that lead REITs to choose one transaction type over another and the effects on stock returns and dividends.
|