
Compositional analysis for an unbiased measure of soil aggregation

Parent, Leon E.; de Almeida, Cinara X.; Hernandes, Amanda; Egozcue, Juan J.; Gulser, Coskun; Bolinder, Martin A.; Katterer, Thomas; Andren, Olof; Parent, Serge E.; Anctil, Francois; Centurion, Jose F.; Natale, William
Source: Elsevier B.V. Publisher: Elsevier B.V.
Type: Other Format: 123-131
Language: Portuguese
Search relevance: 26.29%
Soil aggregation is an index of soil structure measured by mean weight diameter (MWD) or scaling factors often interpreted as fragmentation fractal dimensions (D_f). However, the MWD provides a biased estimate of soil aggregation due to spurious correlations among aggregate-size fractions and scale-dependency. The scale-invariant D_f is based on weak assumptions to allow particle counts, is sensitive to the selection of the fractal domain, and may frequently exceed a value of 3, implying that D_f is a biased estimate of aggregation. Aggregation indices based on mass may be computed without bias using compositional analysis techniques. Our objective was to elaborate compositional indices of soil aggregation and to compare them to MWD and D_f using a published dataset describing the effect of 7 cropping systems on aggregation. Six aggregate-size fractions were arranged into a sequence of D − 1 balances of building blocks that portray the process of soil aggregation. Isometric log-ratios (ilrs) are scale-invariant and orthogonal log contrasts or balances that possess the Euclidean geometry necessary to compute a distance between any two aggregation states, known as the Aitchison distance (A(x,y)). Close correlations (r>0.98) were observed between MWD...
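The balance and distance computations mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard definition of an ilr balance between two groups of parts (a scaled log-ratio of their geometric means) and uses a hypothetical sequential binary partition `SBP` over six fractions; the partition actually used by the authors is not given in the excerpt.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def ilr_balances(composition, partition):
    """Compute one ilr balance per (numerator_indices, denominator_indices) pair."""
    balances = []
    for plus, minus in partition:
        r, s = len(plus), len(minus)
        g_plus = geometric_mean([composition[i] for i in plus])
        g_minus = geometric_mean([composition[i] for i in minus])
        # Normalizing coefficient that makes the balances orthonormal coordinates.
        balances.append(math.sqrt(r * s / (r + s)) * math.log(g_plus / g_minus))
    return balances

def aitchison_distance(x, y, partition):
    """Euclidean distance between the ilr coordinates of two compositions."""
    bx = ilr_balances(x, partition)
    by = ilr_balances(y, partition)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(bx, by)))

# Hypothetical sequential binary partition for six fractions (indices 0..5),
# giving D - 1 = 5 balances. This is an assumption for illustration only.
SBP = [
    ([0, 1, 2], [3, 4, 5]),
    ([0], [1, 2]),
    ([1], [2]),
    ([3], [4, 5]),
    ([4], [5]),
]

# Made-up mass fractions for two aggregation states.
state_a = [0.10, 0.15, 0.20, 0.25, 0.20, 0.10]
state_b = [0.05, 0.10, 0.15, 0.30, 0.25, 0.15]
distance = aitchison_distance(state_a, state_b, SBP)
```

Because the balances are log-ratios, multiplying every fraction of a composition by the same constant leaves them unchanged, which is the scale invariance the abstract refers to.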

A rank aggregation framework for video multimodal geocoding

Li, Lin Tzy; Pedronette, Daniel Carlos Guimarães; Almeida, Jurandy; Penatti, Otávio A.B.; Calumby, Rodrigo Tripodi; Torres, Ricardo da Silva
Source: Universidade Estadual Paulista Publisher: Universidade Estadual Paulista
Type: Journal Article Format: 1-37
Language: Portuguese
Search relevance: 36.12%
This paper proposes a rank aggregation framework for video multimodal geocoding. Textual and visual descriptions associated with videos are used to define ranked lists. These ranked lists are later combined, and the resulting ranked list is used to define appropriate locations for videos. An architecture that implements the proposed framework is designed. In this architecture, there are specific modules for each modality (e.g., textual and visual) that can be developed and evolved independently. Another component is a data fusion module responsible for seamlessly combining the ranked lists defined for each modality. We have validated the proposed framework in the context of the MediaEval 2012 Placing Task, whose objective is to automatically assign geographical coordinates to videos. The results show how our multimodal approach improves the geocoding results when compared to methods that rely on a single modality (either textual or visual descriptors). We also show that the proposed multimodal approach yields results comparable to the best submissions to the Placing Task in 2012 using no extra information besides the available development/training data. Another contribution of this work is the proposal of a new effectiveness evaluation measure. The proposed measure is based on distance scores that summarize how effective a designed/tested approach is...

On biclusters aggregation and its benefits for enumerative solutions (Agregação de biclusters e seus benefícios para soluções enumerativas)

Saullo Haniell Galvão de Oliveira
Source: Biblioteca Digital da Unicamp Publisher: Biblioteca Digital da Unicamp
Type: Master's Dissertation Format: application/pdf
Published on 27/02/2015
Language: Portuguese
Search relevance: 26.12%
Biclustering involves the simultaneous clustering of objects and their attributes, defining local models of the relationship between objects and their attributes. Like clustering, biclustering has a wide range of applications, from support for recommender systems to the analysis of gene expression data. Initially, several heuristics were proposed to find biclusters in a numerical dataset. However, such heuristics have some drawbacks, such as failing to find relevant biclusters in the dataset and failing to maximize the volume of the biclusters found. Enumerative algorithms are a recent proposal, especially for numerical datasets, whose solution is a set of maximal and non-redundant biclusters. However, the ability to enumerate biclusters has brought one more challenging scenario: in noisy datasets, each original bicluster fragments into several other biclusters with a high degree of overlap, which prevents a direct analysis of the results obtained. This fragmentation occurs regardless of the chosen definition of internal coherence in the bicluster, and is more closely related to the noise level itself. Seeking to reverse this fragmentation...

Potential aggregation prone regions in biotherapeutics: A survey of commercial monoclonal antibodies

Wang, Xiaoling; Das, Tapan K; Singh, Satish K; Kumar, Sandeep
Source: Landes Bioscience Publisher: Landes Bioscience
Type: Journal Article
Published in 2009
Language: Portuguese
Search relevance: 26.3%
Aggregation of a biotherapeutic is of significant concern, and judicious process and formulation development is required to minimize aggregate levels in the final product. Aggregation of a protein in solution is driven by intrinsic and extrinsic factors. In this work we have focused on aggregation as an intrinsic property of the molecule. We have studied the sequences and Fab structures of commercial and non-commercial antibodies for their vulnerability towards aggregation by using sequence-based computational tools to identify potential aggregation-prone motifs or regions. The mAbs in our dataset contain 2 to 8 aggregation-prone motifs per heavy and light chain pair. Some of these motifs are located in variable domains, primarily in CDRs. Most aggregation-prone motifs are rich in β-branched aliphatic and aromatic residues. Hydroxyl-containing Ser/Thr residues are also found in several aggregation-prone motifs while charged residues are rare. The motifs found in light chain CDR3 are glutamine (Q)/asparagine (N) rich. These motifs are similar to the reported aggregation promoting regions found in prion and amyloidogenic proteins that are also rich in Q/N, aliphatic and aromatic residues. The implication is that one possible mechanism for aggregation of mAbs may be through formation of cross-β structures and fibrils. Mapping on the available Fab-receptor/antigen complex structures reveals that these motifs in CDRs might also contribute significantly towards receptor/antigen binding. Our analysis identifies the opportunity and tools for simultaneous optimization of the therapeutic protein sequence for potency and specificity while reducing vulnerability towards aggregation.

Discrimination of soluble and aggregation-prone proteins based on sequence information

Fang, Yaping; Fang, Jianwen
Source: PubMed Publisher: PubMed
Type: Journal Article
Language: Portuguese
Search relevance: 26.03%
Understanding the factors governing protein solubility is key to grasping the mechanisms of protein solubility and may provide insight into protein aggregation and misfolding-related diseases such as Alzheimer’s disease. In this work, we attempt to identify factors important to protein solubility using feature selection. Firstly, we calculate 1438 features including physicochemical properties and statistics for each protein. A Random Forest algorithm is used to select the most informative and minimal subset of features based on their predictive performance. A predictive model is built based on 17 selected features. Compared with previous models, our model achieves better performance with a sensitivity of 0.82, specificity 0.85, ACC 0.84, AUC 0.91 and MCC 0.67. Furthermore, a model using a redundancy-reduced dataset (sequence identity <= 30%) achieves the same performance as the model without redundancy reduction. Our results provide not only a reliable model for predicting protein solubility but also a list of features important to protein solubility. The predictive model is implemented as a freely available web application at http://shark.abl.ku.edu/ProS/.
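For readers unfamiliar with the performance measures quoted above, the following sketch shows how sensitivity, specificity, accuracy (ACC) and the Matthews correlation coefficient (MCC) are derived from a binary confusion matrix. The counts are hypothetical and are not taken from the paper; AUC is omitted because it needs ranked prediction scores rather than counts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }

# Hypothetical counts, chosen only for illustration.
metrics = binary_metrics(tp=82, fp=15, tn=85, fn=18)
```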

Ordinal response prediction using bootstrap aggregation, with application to a high-throughput methylation dataset

Archer, K. J.; Mas, V. R.
Source: PubMed Publisher: PubMed
Type: Journal Article
Published on 20/12/2009
Language: Portuguese
Search relevance: 26.29%
Many investigators conducting translational research are performing high-throughput genomic experiments and then developing multigenic classifiers using the resulting high-dimensional dataset. In a large number of applications, the class to be predicted may be inherently ordinal. Examples of ordinal outcomes include Tumor-Node-Metastasis (TNM) stage (I, II, III, IV); drug toxicity evaluated as none, mild, moderate, or severe; and response to treatment classified as complete response, partial response, stable disease, or progressive disease. While one can apply nominal response classification methods to ordinal response data, in so doing some information is lost that may improve the predictive performance of the classifier. This study examined the effectiveness of alternative ordinal splitting functions combined with bootstrap aggregation for classifying an ordinal response. We demonstrate that the ordinal impurity and ordered twoing methods have desirable properties for classifying ordinal response data and both perform well in comparison to other previously described methods. Developing a multigenic classifier is a common goal for microarray studies, and therefore application of the ordinal ensemble methods is demonstrated on a high-throughput methylation dataset.
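Bootstrap aggregation (bagging) itself can be sketched in a few lines. The example below is only an illustration of the general idea: it resamples the training rows with replacement, fits a toy nearest-centroid learner on each resample, and combines predictions by majority vote. The base learner, the function names and the data are assumptions; the ordinal impurity and ordered twoing splitting functions studied in the paper are not implemented here.

```python
import random
from collections import Counter

def fit_nearest_centroid(rows):
    """Fit a toy base learner on (feature, ordinal_label) rows."""
    sums, counts = {}, {}
    for x, label in rows:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in counts}

    def predict(x):
        # Closest class centroid wins; ties go to the lower ordinal label.
        return min(centroids, key=lambda label: (abs(x - centroids[label]), label))

    return predict

def bagged_predict(rows, x, n_models=25, seed=0):
    """Majority vote over base learners trained on bootstrap resamples."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        votes[fit_nearest_centroid(sample)(x)] += 1
    best = max(votes.values())
    return min(label for label, count in votes.items() if count == best)

# Made-up one-feature data with ordinal labels 1 < 2 < 3.
rows = [(0.0, 1), (0.5, 1), (1.0, 1), (1.5, 1),
        (10.0, 2), (10.5, 2), (11.0, 2), (11.5, 2),
        (20.0, 3), (20.5, 3), (21.0, 3), (21.5, 3)]
predicted_stage = bagged_predict(rows, 10.4)
```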

PASTA 2.0: an improved server for protein aggregation prediction

Walsh, Ian; Seno, Flavio; Tosatto, Silvio C.E.; Trovato, Antonio
Source: Oxford University Press Publisher: Oxford University Press
Type: Journal Article
Language: Portuguese
Search relevance: 26.12%
The formation of amyloid aggregates upon protein misfolding is related to several devastating degenerative diseases. The propensities of different protein sequences to aggregate into amyloids, how they are enhanced by pathogenic mutations, the presence of aggregation hot spots stabilizing pathological interactions, the establishing of cross-amyloid interactions between co-aggregating proteins, all rely at the molecular level on the stability of the amyloid cross-beta structure. Our redesigned server, PASTA 2.0, provides a versatile platform where all of these different features can be easily predicted on a genomic scale given input sequences. The server provides other pieces of information, such as intrinsic disorder and secondary structure predictions, that complement the aggregation data. The PASTA 2.0 energy function evaluates the stability of putative cross-beta pairings between different sequence stretches. It was re-derived on a larger dataset of globular protein domains. The resulting algorithm was benchmarked on comprehensive peptide and protein test sets, leading to improved, state-of-the-art results with more amyloid forming regions correctly detected at high specificity. The PASTA 2.0 server can be accessed at http://protein.bio.unipd.it/pasta2/.

Field normalization at different aggregation levels

Crespo, Juan A.; Herranz, Neus; Li, Yunrong; Ruiz-Castillo, Javier
Source: Carlos III University of Madrid Publisher: Carlos III University of Madrid
Type: info:eu-repo/semantics/draft; info:eu-repo/semantics/workingPaper Format: application/pdf; text/plain
Published in 12/2012
Language: Portuguese
Search relevance: 26.03%
This paper studies the impact of differences in citation practices using the model introduced in Crespo et al. (2012) according to which the number of citations received by an article depends on its underlying scientific influence and the field to which it belongs. Using a dataset of about 4.4 million articles published in 1998-2003 with a five-year citation window, the main results are the following four. Firstly, we estimate a set of exchange rates (ERs) to express the citation counts of articles in a wide quantile interval into the equivalent counts in the all-sciences case. For example, in the fractional case we find that in 187 out of 219 sub-fields the ERs are reliable in the sense that the coefficient of variation is smaller than or equal to 0.10. ERs are estimated over the [660, 978] interval that, on average, covers about 62% of all citations. Secondly, in the fractional case the normalization of the raw data using the ERs (or the sub-field mean citations) as normalization factors reduces the importance of the differences in citation practices from 18% to 3.8% (3.4%) of overall citation inequality. Thirdly, the results in the fractional case are essentially replicated when we adopt the multiplicative approach. Fourthly...

An alternative to field-normalization in the aggregation of heterogeneous scientific fields

Perianes-Rodríguez, Antonio; Ruiz-Castillo, Javier
Source: Carlos III University of Madrid Publisher: Carlos III University of Madrid
Type: info:eu-repo/semantics/draft; info:eu-repo/semantics/workingPaper
Published in 04/2015
Language: Portuguese
Search relevance: 26.03%
A possible solution to the problem of aggregating heterogeneous fields in the all-sciences case relies on the normalization of the raw citations received by all publications. In this paper, we study an alternative solution that does not require any citation normalization. Provided one uses size- and scale-independent indicators, the citation impact of any research unit can be calculated as the average (weighted by the publication output) of the citation impact that the unit achieves in all fields. The two alternatives are confronted when the research output of the 500 universities in the 2013 edition of the CWTS Leiden Ranking is evaluated using two citation impact indicators with very different properties. We use a large Web of Science dataset consisting of 3.6 million articles published in the 2005-2008 period, and a classification system distinguishing between 5,119 clusters. The main two findings are as follows. Firstly, differences in production and citation practices between the 3,332 clusters with more than 250 publications account for 22.5% of the overall citation inequality. After the standard field-normalization procedure where cluster mean citations are used as normalization factors, this figure is reduced to 4.3%. Secondly...
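The aggregation rule described in this abstract, an average of per-field impact weighted by publication output, can be written as a short sketch. The function name and the numbers are hypothetical; the specific indicators used in the paper are not reproduced.

```python
def aggregate_impact(per_field):
    """Publication-weighted average of a unit's citation impact across fields.

    per_field: list of (n_publications, impact_in_field) pairs, where the
    impact indicator is assumed to be size- and scale-independent.
    """
    total_pubs = sum(n for n, _ in per_field)
    if total_pubs == 0:
        raise ValueError("the unit has no publications")
    return sum(n * impact for n, impact in per_field) / total_pubs

# Hypothetical unit: 10 papers in a field where its impact is 0.2,
# and 30 papers in a field where its impact is 0.4.
overall = aggregate_impact([(10, 0.2), (30, 0.4)])
```

No citation count is rescaled at any point; only the per-field indicator values are combined, which is what distinguishes this route from field normalization.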

Epidaurus: aggregation and integration analysis of prostate cancer epigenome

Wang, Liguo; Huang, Haojie; Dougherty, Gregory; Zhao, Yu; Hossain, Asif; Kocher, Jean-Pierre A.
Source: Oxford University Press Publisher: Oxford University Press
Type: Journal Article
Language: Portuguese
Search relevance: 26.2%
Integrative analyses of epigenetic data promise a deeper understanding of the epigenome. Epidaurus is a bioinformatics tool used to effectively reveal inter-dataset relevance and differences through data aggregation, integration and visualization. In this study, we demonstrated the utility of Epidaurus in validating hypotheses and generating novel biological insights. In particular, we described the use of Epidaurus to (i) integrate epigenetic data from prostate cancer cell lines to validate the activation function of EZH2 in castration-resistant prostate cancer and to (ii) study the mechanism of androgen receptor (AR) binding deregulation induced by the knockdown of FOXA1. We found that EZH2's noncanonical activation function was reaffirmed by its association with active histone markers and the lack of association with repressive markers. More importantly, we revealed that the binding of AR was selectively reprogramed to promoter regions, leading to the up-regulation of hundreds of cancer-associated genes including EGFR. The prebuilt epigenetic dataset from commonly used cell lines (LNCaP, VCaP, LNCaP-Abl, MCF7, GM12878, K562, HeLa-S3, A549, HePG2) makes Epidaurus a useful online resource for epigenetic research. As standalone software...

Large-scale analysis of macromolecular crowding effects on protein aggregation using a reconstituted cell-free translation system

Niwa, Tatsuya; Sugimoto, Ryota; Watanabe, Lisa; Nakamura, Shugo; Ueda, Takuya; Taguchi, Hideki
Source: Frontiers Media S.A. Publisher: Frontiers Media S.A.
Type: Journal Article
Published on 08/10/2015
Language: Portuguese
Search relevance: 26.12%
Proteins must fold into their native structures in the crowded cellular environment, to perform their functions. Although such macromolecular crowding has been considered to affect the folding properties of proteins, large-scale experimental data have so far been lacking. Here, we individually translated 142 Escherichia coli cytoplasmic proteins using a reconstituted cell-free translation system in the presence of macromolecular crowding reagents (MCRs), Ficoll 70 or dextran 70, and evaluated the aggregation propensities of 142 proteins. The results showed that the MCR effects varied depending on the proteins, although the degree of these effects was modest. Statistical analyses suggested that structural parameters were involved in the effects of the MCRs. Our dataset provides a valuable resource to understand protein folding and aggregation inside cells.

SIFlore, a dataset of geographical distribution of vascular plants covering five centuries of knowledge in France: Results of a collaborative project coordinated by the Federation of the National Botanical Conservatories

Just, Anaïs; Gourvil, Johan; Millet, Jérôme; Boullet, Vincent; Milon, Thomas; Mandon, Isabelle; Dutrève, Bruno
Source: Pensoft Publishers Publisher: Pensoft Publishers
Type: Journal Article
Published on 29/09/2015
Language: Portuguese
Search relevance: 26.13%
More than 20 years ago, the French Muséum National d’Histoire Naturelle (MNHN, Secretariat of the Fauna and Flora) published the first part of an atlas of the flora of France at a 20 km spatial resolution, accounting for 645 taxa (Dupont 1990). Since then, at the national level, there has not been any work on this scale relating to flora distribution, despite the obvious need for a better understanding. In 2011, in response to this need, the Federation des Conservatoires Botaniques Nationaux (FCBN, http://www.fcbn.fr) launched an ambitious collaborative project involving eleven national botanical conservatories of France. The project aims to establish a formal procedure and standardized system for data hosting, aggregation and publication for four areas: flora, fungi, vegetation and habitats. In 2014, the first phase of the project led to the development of the national flora dataset: SIFlore. As it includes about 21 million records of flora occurrences, this is currently the most comprehensive dataset on the distribution of vascular plants (Tracheophyta) in the French territory. SIFlore contains information for about 15,454 plant taxa occurrences (indigenous and alien taxa) in metropolitan France and Reunion Island, from 1545 until 2014. The data records were originally collated from inventories...

Extending Ripley’s K-Function to Quantify Aggregation in 2-D Grayscale Images

Amgad, Mohamed; Itoh, Anri; Tsui, Marco Man Kin
Source: Public Library of Science Publisher: Public Library of Science
Type: Journal Article
Published on 04/12/2015
Language: Portuguese
Search relevance: 26.03%
In this work, we describe the extension of Ripley’s K-function to allow for overlapping events at very high event densities. We show that problematic edge effects introduce significant bias to the function at very high densities and small radii, and propose a simple correction method that successfully restores the function’s centralization. Using simulations of homogeneous Poisson distributions of events, as well as simulations of event clustering under different conditions, we investigate various aspects of the function, including its shape-dependence and correspondence between true cluster radius and radius at which the K-function is maximized. Furthermore, we validate the utility of the function in quantifying clustering in 2-D grayscale images using three modalities: (i) Simulations of particle clustering; (ii) Experimental co-expression of soluble and diffuse protein at varying ratios; (iii) Quantifying chromatin clustering in the nuclei of wt and crwn1 crwn2 mutant Arabidopsis plant cells, using a previously-published image dataset. Overall, our work shows that Ripley’s K-function is a valid abstract statistical measure whose utility extends beyond the quantification of clustering of non-overlapping events. Potential benefits of this work include the quantification of protein and chromatin aggregation in fluorescent microscopic images. Furthermore...
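As background for this entry, the classical Ripley's K estimator for a point pattern can be sketched as follows. This is the textbook form for non-overlapping point events, without any edge correction; it is not the grayscale extension or the correction method proposed in the paper, and the function name and sample points are assumptions.

```python
import math

def ripley_k(points, radius, area):
    """Naive estimate K(r) = area / (n * (n - 1)) * #{ordered pairs with d <= r}."""
    n = len(points)
    if n < 2:
        raise ValueError("at least two points are required")
    close_pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(points[i], points[j]) <= radius:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))

# Made-up pattern in a 6 x 6 window: three nearby points and one distant point.
pattern = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
k_at_1 = ripley_k(pattern, radius=1.0, area=36.0)
```

Under complete spatial randomness K(r) is approximately pi * r**2, so values well above that level at a given radius suggest clustering at that scale.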

PF-OLA: A High-Performance Framework for Parallel On-Line Aggregation

Qin, Chengjie; Rusu, Florin
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Language: Portuguese
Search relevance: 26.17%
Online aggregation provides estimates of the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets. In this paper we introduce the first framework for parallel online aggregation in which the estimation incurs virtually no overhead on top of the actual execution. We define a generic interface to express any estimation model that completely abstracts the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive $8\text{TB}$ TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead. Comment: 36 pages
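The basic idea of online aggregation can be illustrated with a single-threaded sketch: scan the rows in random order, keep a running mean and variance, and report an estimate of the final SUM with a confidence half-width after every batch. The estimator below is a plain normal-approximation bound with a finite population correction, chosen for illustration; it is not the parallel estimator introduced in the paper, and all names are hypothetical.

```python
import math
import random

def online_sum_estimates(values, report_every, z=1.96, seed=0):
    """Yield (rows_seen, sum_estimate, half_width) while scanning in random order."""
    order = list(values)
    random.Random(seed).shuffle(order)   # random order makes each prefix a random sample
    total_rows = len(order)
    mean, m2 = 0.0, 0.0                  # Welford running mean and squared deviations
    for k, v in enumerate(order, start=1):
        delta = v - mean
        mean += delta / k
        m2 += delta * (v - mean)
        if k % report_every == 0 or k == total_rows:
            variance = m2 / (k - 1) if k > 1 else 0.0
            fpc = (total_rows - k) / total_rows   # shrinks to 0 once all rows are seen
            half_width = z * total_rows * math.sqrt(variance / k * fpc)
            yield k, total_rows * mean, half_width

# Made-up data: the exact answer is 500500.
data = list(range(1, 1001))
progress = list(online_sum_estimates(data, report_every=100))
```

A caller would stop consuming the generator as soon as the half-width is small enough for its purposes, which is the early-termination behaviour the abstract describes.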

An Aggregation Method for Sparse Logistic Regression

Liu, Zhe
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Language: Portuguese
Search relevance: 26.17%
$L_1$ regularized logistic regression has now become a workhorse of data mining and bioinformatics: it is widely used for many classification problems, particularly ones with many features. However, $L_1$ regularization typically selects too many features, so false positives are unavoidable. In this paper, we demonstrate and analyze an aggregation method for sparse logistic regression in high dimensions. This approach linearly combines the estimators from a suitable set of logistic models with different underlying sparsity patterns and can balance predictive ability and model interpretability. The numerical performance of our proposed aggregation method is then investigated using simulation studies. We also analyze a published genome-wide case-control dataset to further evaluate the usefulness of the aggregation method in multilocus association mapping.

Empirical Studies on Symbolic Aggregation Approximation Under Statistical Perspectives for Knowledge Discovery in Time Series

Song, Wei; Wang, Zhiguang; Ye, Yangdong; Fan, Ming
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 08/06/2015
Language: Portuguese
Search relevance: 26.03%
Symbolic Aggregation approXimation (SAX) has been the de facto standard representation method for knowledge discovery in time series on a number of tasks and applications. So far, very little work has been done in empirically investigating the intrinsic properties and statistical mechanics of SAX words. In this paper, we applied several statistical measurements and proposed a new statistical measurement, i.e. information embedding cost (IEC), to analyze the statistical behaviors of the symbolic dynamics. Our experiments on the benchmark datasets and the clinical signals demonstrate that SAX can always reduce the complexity while preserving the core information embedded in the original time series with significant embedding efficiency. Our proposed IEC score provides a priori information to determine whether SAX is adequate for a specific dataset, and it can be generalized to evaluate other symbolic representations. Our work provides an analytical framework with several statistical tools to analyze, evaluate and further improve the symbolic dynamics for knowledge discovery in time series. Comment: 7 pages, 6 figures. Accepted by FSKD 2015
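For context, a SAX word is usually built in three steps: z-normalize the series, average it over equal-width segments (piecewise aggregate approximation), and map each segment mean to a letter using breakpoints that split the standard normal distribution into equiprobable regions. The sketch below follows that common description; it assumes the series length is divisible by the number of segments and does not implement the IEC score proposed in the paper.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax_word(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length n_segments."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(v - mu) / sigma for v in series]
    if len(z) % n_segments:
        raise ValueError("this sketch assumes len(series) is divisible by n_segments")
    width = len(z) // n_segments
    # Piecewise aggregate approximation: one mean per segment.
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]
    # Breakpoints cutting N(0, 1) into len(alphabet) equiprobable regions.
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)

# Made-up rising series of eight points reduced to four symbols.
word = sax_word([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4)
```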

Learning to Rank Academic Experts in the DBLP Dataset

Moreira, Catarina; Martins, Bruno; Calado, Pável
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 21/01/2015
Language: Portuguese
Search relevance: 26.2%
Expert finding is an information retrieval task that is concerned with the search for the most knowledgeable people with respect to a specific topic, and the search is based on documents that describe people's activities. The task involves taking a user query as input and returning a list of people who are sorted by their level of expertise with respect to the user query. Despite recent interest in the area, the current state-of-the-art techniques lack principled approaches for optimally combining different sources of evidence. This article proposes two frameworks for combining multiple estimators of expertise. These estimators are derived from textual contents, from the graph structure of the citation patterns for the community of experts, and from profile information about the experts. More specifically, this article explores the use of supervised learning-to-rank methods, as well as rank aggregation approaches, for combining all of the estimators of expertise. Several supervised learning algorithms, which are representative of the pointwise, pairwise and listwise approaches, were tested, and various state-of-the-art data fusion techniques were also explored for the rank aggregation framework. Experiments that were performed on a dataset of academic publications from the Computer Science domain attest to the adequacy of the proposed approaches. Comment: Expert Systems...

Using Rank Aggregation for Expert Search in Academic Digital Libraries

Moreira, Catarina; Martins, Bruno; Calado, Pável
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 21/01/2015
Language: Portuguese
Search relevance: 26.12%
The task of expert finding has been getting increasing attention in the information retrieval literature. However, the current state of the art is still lacking in principled approaches for combining different sources of evidence. This paper explores the usage of unsupervised rank aggregation methods as a principled approach for combining multiple estimators of expertise, derived from the textual contents, from the graph structure of the citation patterns for the community of experts, and from profile information about the experts. We specifically experimented with two unsupervised rank aggregation approaches well known in the information retrieval literature, namely CombSUM and CombMNZ. Experiments made over a dataset of academic publications for the area of Computer Science attest to the adequacy of these methods. Comment: In Simpósio de Informática, INForum, Portugal, 2011
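The two fusion rules named here are simple to state: CombSUM adds up the scores an item receives across the input lists, and CombMNZ multiplies that sum by the number of lists in which the item has a non-zero score. The sketch below assumes the scores have already been normalized to a common range; the estimator names and score values are invented for illustration.

```python
from collections import defaultdict

def comb_sum(score_lists):
    """CombSUM: sum each item's (already normalized) scores over all lists."""
    totals = defaultdict(float)
    for scores in score_lists:
        for item, score in scores.items():
            totals[item] += score
    return dict(totals)

def comb_mnz(score_lists):
    """CombMNZ: CombSUM times the number of lists giving the item a non-zero score."""
    totals = comb_sum(score_lists)
    hits = defaultdict(int)
    for scores in score_lists:
        for item, score in scores.items():
            if score > 0:
                hits[item] += 1
    return {item: totals[item] * hits[item] for item in totals}

def ranking(fused):
    """Items by decreasing fused score, with names as a deterministic tie-break."""
    return sorted(fused, key=lambda item: (-fused[item], item))

# Hypothetical normalized scores from three expertise estimators.
text_scores = {"alice": 0.9, "bob": 0.5}
graph_scores = {"bob": 0.5, "carol": 0.8}
profile_scores = {"alice": 0.25, "bob": 0.25}
lists = [text_scores, graph_scores, profile_scores]
fused_ranking = ranking(comb_mnz(lists))
```

CombMNZ rewards candidates supported by several sources of evidence, so a candidate with moderate scores in every list can outrank one with a single high score.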

A Graph Traversal Based Approach to Answer Non-Aggregation Questions Over DBpedia

Zhu, Chenhao; Ren, Kan; Liu, Xuan; Wang, Haofen; Tian, Yiding; Yu, Yong
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 16/10/2015
Language: Portuguese
Search relevance: 26.29%
We present a question answering system over DBpedia, filling the gap between user information needs expressed in natural language and a structured query interface expressed in SPARQL over the underlying knowledge base (KB). Given the KB, our goal is to comprehend a natural language query and provide corresponding accurate answers. Focusing on non-aggregation questions, in this paper we construct a subgraph of the knowledge base from the detected entities and propose a graph traversal method to solve both the semantic item mapping problem and the disambiguation problem jointly. Compared with existing work, we simplify the process of query intention understanding and pay more attention to answer path ranking. We evaluate our method on a non-aggregation question dataset and further on a complete dataset. Experimental results show that our method achieves the best performance compared with several state-of-the-art systems. Comment: In the proceedings of the 5th Joint International Semantic Technology (JIST2015)

Pantheon: A Dataset for the Study of Global Cultural Production

Yu, Amy Zhao; Ronen, Shahar; Hu, Kevin; Lu, Tiffany; Hidalgo, César A.
Source: Cornell University Publisher: Cornell University
Type: Journal Article
Published on 25/02/2015
Language: Portuguese
Search relevance: 26.19%
We present the Pantheon 1.0 dataset: a manually curated dataset of individuals that have transcended linguistic, temporal, and geographic boundaries. The Pantheon 1.0 dataset includes the 11,341 biographies present in more than 25 languages in Wikipedia and is enriched with: (i) manually curated demographic information (place of birth, date of birth, and gender), (ii) a cultural domain classification categorizing each biography at three levels of aggregation (i.e. Arts/Fine Arts/Painting), and (iii) measures of global visibility (fame) including the number of languages in which a biography is present in Wikipedia, the monthly page-views received by a biography (2008-2013), and a global visibility metric we name the Historical Popularity Index (HPI). We validate our measures of global visibility (HPI and Wikipedia language editions) using external measures of accomplishment in several cultural domains: Tennis, Swimming, Car Racing, and Chess. In all of these cases we find that measures of accomplishments and fame (HPI) correlate with an $R^2 \geq 50\%$, suggesting that measures of global fame are appropriate proxies for measures of accomplishment.