Universidade Federal de Viçosa
Viçosa, MG. Brasil
A Software in the Area of Genetics and Experimental Statistics
Departamento de Biologia Geral
Viçosa, MG. 36570-00
To breed genetically superior plants, the selected individuals must simultaneously unite a series of properties to produce a comparatively higher yield and to meet consumer demands. A way to increase the chances of success of a breeding program is to perform reliable experiments, generating a great volume of experimental data. Based on an adequate processing of these data, genetic parameters can be estimated and biological phenomena interpreted. In this phase of result analysis and interpretation, appropriate software systems and computer resources are of utmost importance.
The development of software in the field of plant Genetics and Breeding is crucial due to the scarcity of such resources available to the scientific community. The availability of such tools would supply the increasing demand of users in numerous research institutions who deal with an enormous volume of data, requiring adequate ways of processing to accurately estimate statistical and biological parameters.
Particularly in the case of plant genetics, it is noted that the intensive breeding of many species and the complexity of the most important traits require the use of increasingly accurate selection criteria. In all breeding stages, breeders must use information that is expressed in parameters of the biometric models, which are usually available in the output of most scientifically oriented software systems.
The GENES software runs on IBM-compatible PCs and requires the Windows operating system. Some configuration settings are required, such as a screen resolution of 1024 x 768 (large fonts) and the use of the point as the decimal symbol. The package comprises 257 executable projects and 131 text documents in RTF format, occupies about 285 MB, and is available in English, Spanish and Portuguese.
Application of the program
An application of the program Genes usually includes the following steps:
a. Examples of data files: Examples of data files to be processed by Genes are available. They are particularly useful in initial studies because the user learns both how the application operates and how the statistical and biometrical techniques work. Each procedure is represented by an icon that opens a file containing an illustrative example of that procedure, with a complete description of all its parameters for immediate data analysis.
b. Supplying data for processing: The procedures generally share a common sequence of data analysis. Basically, the user provides the name of the file containing the data to be processed, information about the parameters (number of variables, treatments, blocks etc.), the names of the variables (optional), and then prints or saves the results. Data files should preferably be in .txt format, but they can also be imported from Excel spreadsheets.
The data are supplied in a file laid out as a spreadsheet, in which each column represents a characteristic to be analyzed and each row an experimental observation. Sometimes, the first columns are reserved for classificatory variables or descriptor effects, e.g., treatments, blocks, years, locations etc.
c. Parameter Description: For each procedure, the user must provide specific information on the data file that will be used in processing. For example, to perform an analysis of variance, the user should provide the number of variables to be analyzed, the number of treatments and the number of blocks or replications. Other procedures request other information, but the control buttons on the right side of the screen are common to all procedures. These buttons represent:
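The layout described above can be illustrated with a minimal Python sketch. It is not part of Genes; the function name `parse_data`, the parameter `n_class_cols` and the sample values are hypothetical, and the file is assumed to be whitespace-delimited with the point as decimal symbol.

```python
def parse_data(text, n_class_cols=0):
    """Split a whitespace-delimited data block into classificatory and trait columns."""
    factors, traits = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        # The first n_class_cols fields describe treatments, blocks, etc.
        factors.append(fields[:n_class_cols])
        # The remaining fields are the measured characteristics
        traits.append([float(value) for value in fields[n_class_cols:]])
    return factors, traits


# Hypothetical example: treatment, block, then two measured traits per row
example = """\
1 1 10.5 3.2
1 2 11.0 3.4
2 1  9.8 2.9
2 2 10.1 3.0
"""
factors, traits = parse_data(example, n_class_cols=2)
print(factors[0], traits[0])
```

Keeping the classificatory columns as text and converting only the trait columns to numbers mirrors the distinction the paragraph draws between descriptor effects and analyzed characteristics.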
Return: ends operations on the screen of parameter identification.
Read Data: reads the data file, considering all rows and columns. This option is useful to identify gaps in the structure of the data file. Statistics indicating the average, maximum and minimum of each variable help to detect possible typos. An error in data reading inevitably leads to errors in data processing, so the user must apply the necessary corrections, according to the specification of each procedure, to ensure correct data reading and analysis.
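The kind of check performed by Read Data can be sketched in a few lines of Python. This is only an illustration under assumed names (`column_summary` is hypothetical) and invented data, not the routine used by Genes.

```python
def column_summary(rows):
    """Return (mean, minimum, maximum) for each column of a rectangular table."""
    n_cols = len(rows[0])
    # A row with a different number of values signals a gap in the file structure
    for i, row in enumerate(rows, start=1):
        if len(row) != n_cols:
            raise ValueError(f"row {i} has {len(row)} values, expected {n_cols}")
    summary = []
    for j in range(n_cols):
        column = [row[j] for row in rows]
        summary.append((sum(column) / len(column), min(column), max(column)))
    return summary


# Invented data: 98.0 is a deliberate typing error for 9.8
rows = [[10.5, 3.2], [11.0, 3.4], [98.0, 2.9], [10.1, 3.0]]
for j, (mean, lowest, highest) in enumerate(column_summary(rows), start=1):
    print(f"column {j}: mean={mean:.2f} min={lowest} max={highest}")
```

A maximum far from the other values, as in the first column here, is the sort of signal that points to a typo before any analysis is run.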
d. Definition of names of variables: After providing the information in the procedure of parameter identification, the user can name the variables analyzed. If the variables are not named, the program will apply the description: X1, X2, … , Xn.
e. Result output: Results are displayed in the program's own editor. The output file can also be exported to Word, allowing the use of all features of that editor. In this case, the results are presented in a file in Courier New 8 font, with customized heading and page numbering. Results can also be exported to Excel or WordPad, and diagrams and figures to Excel or MS Paint.
The Genes software system contains analysis modules that involve several procedures of biometric analysis, as described below.
Biometrics
Biometrics is the application of statistics to the biological field and is essential for planning, assessing and interpreting all data obtained in biological research. User demand is growing in various research institutions in the biological area, especially for genetic studies, which deal with large data volumes. This requires adequate processing to ensure accurate estimation and interpretation of the statistical and biological parameters, yet there is a market gap for software that meets this demand. In this context, the program GENES was developed to cover mainly the area of biometrics, with numerous procedures for adequate data processing. The following procedures are available in this module:
a. Genotype x Environment Interaction: stratification analysis, dissimilarity and correlations between environments.
b. Stability and Adaptability: analysis by methods based on ANOVA (traditional; Plaisted and Peterson, 1959; Wricke, 1965; and Annicchiarico, 1992), regression (Eberhart and Russell, 1966; Finlay and Wilkinson, 1963; and Tai, 1971), bi-segmented regression (Verma, Chahal and Murty, 1978; Silva and Barreto, 1985; and Cruz, Torres and Vencovsky, 1989), nonparametric analysis (Huehn, 1990; visual analysis; and Lin and Binns, 1988), factor analysis, and principal components or centroids.
c. Genetic gains from selection – Indices: calculation of gains by selection between families (univariate and indices), considering direct and indirect selection, the classic index of Smith, 1936 and Hazel, 1943, the index based on the sum of ranks of Mulamba and Mock, 1978, the base index of Williams, 1962, the multiplicative index of Subandi et al., 1973, the weight-free index of Elston, 1963, the index based on the desired gains of Pesek and Baker, 1969 and the genotype-ideotype distance index. Calculation of gains by selection between families by univariate methods or by the following restricted indices: classic index of Smith, 1936, and Hazel, 1943, of Kempthorne and Nordskog, 1959, of Tallis, 1962, of James, 1968, and of Cunningham et al., 1970. Calculation of gain by selection among families considering collinearity indices, indices of gains by selection among and within families, in balanced and unbalanced experiments, by mass and stratified selection among and within families. Visual selection analysis, multi-environment selection and prediction of gains by within-family selection without information on individual plants within a plot.
d. Diallel Analysis: Analysis of balanced diallels (methodologies of Griffing, 1956, Gardner and Eberhart, 1966, Hayman, 1954, and Cockerham and Weir, 1977, tests among hybrids and reciprocal crosses, prediction of composites and hybrids and of family indices), joint diallel analysis (of balanced diallels of Griffing, 1956, of Gardner and Eberhart, 1966, and of partial and circulant diallels), partial diallels (by the methodologies of Geraldi and Miranda Filho, 1988, of Miranda Filho and Geraldi, 1984, of Kempthorne, 1966, of Viana et al., 1999, and prediction of triple and double hybrids). Analysis of circulant, partial circulant and unbalanced diallels.
e. Analysis of Segregating and non-segregating generations: joint scaling test (P1, P2, F1, F2 with optional inclusion of BC1 and BC2), analysis of experiments of segregating lines and parents in alternating rows and analysis of plants in generation Ft and the derived Ft+1 lines.
f. Repeatability: Analysis of original or classified data.
g. Combined selection: analysis of experiments of families with balanced and unbalanced data. Analysis of genetic design proposed by Comstock and Robinson (1948).
h. Genetic and Environmental Progress.
i. Core Collection.
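As an illustration of two of the stability methods listed in item b, the following Python sketch computes the regression coefficient of Eberhart and Russell (1966) and the ecovalence of Wricke (1965) from a table of genotype means per environment. It is a minimal sketch with invented data, not the Genes implementation; the function name `stability_stats` is hypothetical, and the ecovalence is computed on means, so it is not multiplied by the number of replications.

```python
def stability_stats(means):
    """means[i][j] = mean of genotype i in environment j.

    Returns (slopes, ecovalences), one value per genotype.
    """
    g, e = len(means), len(means[0])
    grand = sum(sum(row) for row in means) / (g * e)
    gen_mean = [sum(row) / e for row in means]
    env_mean = [sum(means[i][j] for i in range(g)) / g for j in range(e)]
    # Environmental index: deviation of each environment mean from the grand mean
    index = [m - grand for m in env_mean]
    ss_index = sum(v * v for v in index)
    # Eberhart and Russell: slope of the genotype means regressed on the index
    slopes = [
        sum(means[i][j] * index[j] for j in range(e)) / ss_index
        for i in range(g)
    ]
    # Wricke: each genotype's contribution to the G x E interaction sum of squares
    ecovalences = [
        sum((means[i][j] - gen_mean[i] - env_mean[j] + grand) ** 2 for j in range(e))
        for i in range(g)
    ]
    return slopes, ecovalences


# Invented means for 3 genotypes in 3 environments
table = [[4.0, 5.0, 6.0], [3.0, 5.0, 7.0], [5.0, 5.0, 5.0]]
slopes, ecovalences = stability_stats(table)
print(slopes, ecovalences)
```

In this reading, a slope near 1 with a small ecovalence indicates a genotype that follows the average response to environmental improvement, while larger ecovalences flag genotypes that contribute most to the interaction.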
Multivariate analysis
The term multivariate analysis covers a large number of methods and techniques that simultaneously use all variables in the analysis, interpretation and processing of the data set from a biological phenomenon under study. The mathematical complexity typical of multivariate methods has hindered the transfer of the underlying stochastic fundamentals and principles to researchers. However, the key part, statistical inference, has been made accessible by well-constructed software with a user-friendly interface. In the program Genes, the scientist will find the following:
a) Analysis of structural simplification: Principal Components and Canonical Variable Analysis.
b) Association Analysis: Path analysis, Canonical Correlations and Factor analysis.
c) Analysis of diversity: Discriminant Analysis (by the method proposed by Anderson or based on principal components). Measures of Dissimilarity: based on continuous, multicategoric or binary phenotypic variables. Analysis of molecular data from dominant or codominant markers; cluster analysis: Tocher optimization method, hierarchical, graphic dispersion and 2D and 3D projection. Identification of more and less similar accessions. Importance of traits: by principal components, by Mahalanobis’ generalized distance or by canonical variable analysis.
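One of the simplest dissimilarity measures for continuous variables is the Euclidean distance computed on standardized data. The Python sketch below builds the distance matrix and identifies the most and least similar pairs of accessions. It is a minimal illustration under assumed names (`standardized_euclidean`, `extreme_pairs`) and invented data, and it assumes every variable shows some variation; the source does not state which standardization Genes applies.

```python
import math


def standardized_euclidean(data):
    """data[i][k] = value of variable k for accession i; returns the distance matrix."""
    n, v = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(v)]
    # Sample standard deviation of each variable (assumed to be non-zero)
    sds = [
        math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / (n - 1))
        for k in range(v)
    ]
    z = [[(row[k] - means[k]) / sds[k] for k in range(v)] for row in data]
    return [
        [math.sqrt(sum((z[i][k] - z[j][k]) ** 2 for k in range(v))) for j in range(n)]
        for i in range(n)
    ]


def extreme_pairs(dist):
    """Return ((d, i, j) of the closest pair, (d, i, j) of the most distant pair)."""
    pairs = [
        (dist[i][j], i, j)
        for i in range(len(dist))
        for j in range(i + 1, len(dist))
    ]
    return min(pairs), max(pairs)


# Invented data: 4 accessions measured for 2 variables
accessions = [[1.0, 10.0], [2.0, 18.0], [6.0, 55.0], [7.0, 90.0]]
closest, farthest = extreme_pairs(standardized_euclidean(accessions))
print(closest, farthest)
```

Standardizing first prevents the variable with the largest measurement scale from dominating the distances.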
Simulation
One of the major contributions of computing is that phenomena can be studied by simulating a complex situation in which parameters and constraints are established, so that the effect of certain controllable factors can be conveniently studied. Simulation is defined as a way of imitating the behavior of a real system by computational resources to study its functioning under alternative conditions, using logical models that describe the natural system as well as possible.
Simulations are highly useful in genetic studies in various contexts, including studies of populations, of the individual or of the genome itself. They require researchers to develop appropriate biological models that represent the phenomena of interest as faithfully as possible, and programmers to develop suitable processing procedures according to the parameters and constraints, so that the influence of certain factors can be assessed.
Genes contains the following procedures: Simulation of Experiments, Simulation of Samples (p populations and v variables), Optimal Number of Families, Optimal Number of Plants (random or predefined sampling) and Optimal Number of Replications or Optimal Sample Size.
Genetic diversity
Studies on diversity can be directed to plant breeding, evolutionary associations, conservation and management of plant material, among other purposes. In each case, an adequate methodology and appropriate information are required. The data of measured units, plants, accessions or taxa can be phenotypic or genotypic. Phenotypic information is derived from the evaluation of characteristics with continuous or discrete distributions, of which the latter can be multicategoric or binary. Genotypic data are obtained from molecular markers or DNA sequencing. Markers can be dominant or codominant and diallelic or multiallelic. All these situations are addressed in the application Genes through the following approaches:
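To make the idea of simulating an experiment concrete, the Python sketch below generates data for a randomized block experiment. The source does not describe the model used by Genes, so the additive model (treatment mean plus a normally distributed block effect plus a normally distributed error) and all names, such as `simulate_rcbd`, are assumptions made for illustration.

```python
import random


def simulate_rcbd(treatment_means, n_blocks, block_sd, error_sd, seed=None):
    """Simulate one observation per treatment and block.

    Assumed model: y = treatment mean + block effect + random error,
    with block effects and errors drawn from normal distributions.
    Returns a list of (treatment, block, value) tuples.
    """
    rng = random.Random(seed)  # a seed makes the simulation reproducible
    block_effects = [rng.gauss(0.0, block_sd) for _ in range(n_blocks)]
    rows = []
    for t, mu in enumerate(treatment_means, start=1):
        for b, effect in enumerate(block_effects, start=1):
            rows.append((t, b, mu + effect + rng.gauss(0.0, error_sd)))
    return rows


# Invented parameters: 3 treatments, 4 blocks
for treatment, block, value in simulate_rcbd([10.0, 12.0, 14.0], 4, 1.0, 0.5, seed=1):
    print(treatment, block, round(value, 2))
```

Repeating such a simulation with different numbers of blocks or families is one way to study how sample size affects the precision of estimates, which is the purpose of the optimal-number procedures listed above.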
a. Diversity between accessions: based on continuous, multi-categoric, binary phenotypic variables, and analysis of data of dominant and codominant (multi-allelic) markers.
b. Diversity between populations: calculation of Nei’s genetic identity (1972) and the following distances: Euclidean, of Rogers, Angular, of Goldstein et al. (1985) and of Hedrick.
c. Diversity within populations: calculation of the inbreeding coefficient and heterozygosity, Shannon-Wiener index and heterozygosity from binary data.
d. Diversity among and within populations: descriptive analysis, Nei’s diversity index (1973), Wright’s fixation index (two alleles or multiple alleles), analysis of heterozygosity of Weir (1996). Analysis of Contingency Table, ANOVA of allelic frequency (F, f and q), AMOVA of Excoffier et al. (1992) and analysis of binary data.
e. Discriminant Analysis: discriminant analysis of Anderson, analysis based on principal components or on k-nearest neighbors. Discriminant analyses from the dissimilarity matrices.
f. Cluster analysis: using the following methods: Tocher optimization and hierarchical methods, graphic dispersion, 2D and 3D projection and analysis of more and less similar accessions. Matrices of Dissimilarities: calculation of the correlation and sum between elements of matrices of dissimilarity. Importance of traits: considering phenotypic quantitative characters or molecular information, by means of MANOVA.
g. Optimization: Analysis of the optimal number of binary or multi-allelic markers for the study of genetic variance. Simulation: simulation of populations, crossings and population samples, under the effect of divergent selection or genetic drift.
h. Relationship coefficient and Hardy-Weinberg Equilibrium: Population analysis based on the information of codominant diallelic or multi-allelic markers. Analysis of Gametic Disequilibrium.
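As an example of the between-population measures listed in item b, Nei's (1972) genetic identity can be computed from allele frequencies as I = Jxy / sqrt(Jx Jy), with the standard genetic distance D = -ln(I). The Python sketch below is a minimal illustration with invented frequencies; the function name `nei_identity` and the data layout are assumptions, not the Genes interface.

```python
import math


def nei_identity(freqs_x, freqs_y):
    """Nei's (1972) genetic identity and standard distance between two populations.

    freqs_x[l][a] = frequency of allele a at locus l in population X
    (freqs_y uses the same loci and the same allele order).
    """
    n_loci = len(freqs_x)
    # Average over loci of the probability that two sampled alleles are identical
    j_x = sum(sum(p * p for p in locus) for locus in freqs_x) / n_loci
    j_y = sum(sum(q * q for q in locus) for locus in freqs_y) / n_loci
    j_xy = sum(
        sum(p * q for p, q in zip(locus_x, locus_y))
        for locus_x, locus_y in zip(freqs_x, freqs_y)
    ) / n_loci
    identity = j_xy / math.sqrt(j_x * j_y)
    # With no shared alleles the identity is 0 and the distance is infinite
    distance = math.inf if identity == 0 else -math.log(identity)
    return identity, distance


# Invented frequencies for two loci with two alleles each
pop_x = [[0.7, 0.3], [0.5, 0.5]]
pop_y = [[0.4, 0.6], [0.5, 0.5]]
identity, distance = nei_identity(pop_x, pop_y)
print(round(identity, 4), round(distance, 4))
```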
Experimental statistics
This module contains procedures based on statistical models with wide application in various areas of research and in undergraduate and graduate teaching. The importance of statistical analysis lies in the probabilistic assessment of a hypothesis formulated on the basis of extensive studies and of analyses of the research results. In statistics, parameter estimates related to the data are presented and interpreted per se, or hypotheses are tested and results are associated with probability values by means of statistical tests. Usually, the choice of a particular inferential statistic is directed by the study question. The software Genes offers the following procedures for statistical analyses:
a. Descriptive Statistics, Normality Test and Stand Correction Methods
b. Analysis of Variance: analysis of completely randomized designs and schemes, of experiments with regular and non-regular treatments, in randomized blocks, factorial and split plots. Analysis of origin/progeny/plant, simple and triple lattices and hierarchical models.
c. Regressions: simple linear, non-linear, multiple and polynomial, response surface and 3D graphical analysis.
d. Correlations: calculation of genetic, partial, canonical, Pearson and Spearman correlations. Path analysis (involving 1 or 2 chains) and path analysis under collinearity.
e. Comparison of Means: Tests of Tukey, Duncan, Scheffé and Scott and Knott, Tukey test with variable number of replications, Dunnett, t test, Tocher, and chi-square test to evaluate hypotheses, heterogeneity and factorial linkage.
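The analysis of variance of a randomized block experiment, one of the procedures in item b, reduces to a few sums of squares. The Python sketch below shows the calculation for one observation per treatment and block; it is an illustration with invented data, not the Genes routine, and `rcbd_anova` is a hypothetical name. The probability value of F is omitted because it requires the F distribution, which is not available in the Python standard library.

```python
def rcbd_anova(y):
    """ANOVA for randomized blocks; y[i][j] = observation of treatment i in block j."""
    t, b = len(y), len(y[0])
    total = sum(sum(row) for row in y)
    correction = total ** 2 / (t * b)
    ss_total = sum(v * v for row in y for v in row) - correction
    ss_treat = sum(sum(row) ** 2 for row in y) / b - correction
    ss_block = sum(sum(y[i][j] for i in range(t)) ** 2 for j in range(b)) / t - correction
    ss_error = ss_total - ss_treat - ss_block
    df_treat, df_block = t - 1, b - 1
    df_error = df_treat * df_block
    ms_treat = ss_treat / df_treat
    ms_error = ss_error / df_error
    return {
        "ss_treat": ss_treat,
        "ss_block": ss_block,
        "ss_error": ss_error,
        "df_treat": df_treat,
        "df_block": df_block,
        "df_error": df_error,
        "ms_treat": ms_treat,
        "ms_error": ms_error,
        "f_treat": ms_treat / ms_error,  # tests the treatment effect
    }


# Invented data: 3 treatments (rows) in 2 blocks (columns)
result = rcbd_anova([[10.0, 12.0], [14.0, 16.0], [18.0, 23.0]])
print(result["f_treat"])
```

The error mean square obtained here is also the quantity that mean comparison tests such as Tukey's use as their estimate of experimental error.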
Matrices
The study of matrices is fundamental because matrix algebra is an important tool for calculations and parameter estimation. It is widely applied in estimation methods and model fitting, such as least squares and maximum likelihood, and in different matrix analyses. The following procedures are available in Genes:
a. Diagnosis of multicollinearity
b. Algebra of matrices
c. Solution of the system
d. Solution of the system
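The source does not specify which systems are solved, so the following Python sketch assumes a system of linear equations A x = b, such as the normal equations that arise in least squares estimation. It uses Gaussian elimination with partial pivoting; the function name `solve_linear_system` and the tolerance are hypothetical choices, and the example values are invented.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | b], copied so the inputs are left unchanged
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Choose the row with the largest pivot to limit rounding error
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


print(solve_linear_system([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

A pivot close to zero, which raises the error above, is also a symptom of the multicollinearity that item a is designed to diagnose.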
Integration with other software
Currently, the software GENES has 205 executable projects involving the modules of experimental statistics, biometrics, multivariate analysis, genetic diversity, simulation, and matrices. Each procedure has a particular data set for which an appropriate biometric template is prepared, allowing the user to process data and to generate and properly interpret results of the studied phenomenon. However, additional analyses may be required, or a different way of carrying out the same kind of study may need to be evaluated. In this case, the user would readily try a new analysis option, provided no effort is required to learn how to access an alternative or supplementary program. The user of GENES has direct access to other applications such as:
Microsoft Word: designed to receive output results and emit reports
Microsoft Excel: designed to receive outputs or results for complementary analysis, in particular graphical analysis.
Microsoft Paint: designed to receive figures, images, and diagrams resulting from the analysis to which, as the researcher sees fit, graphical resources can be applied to improve the aesthetics of the result.
Free software environment R: For each procedure available in GENES, the user finds a set of instructions with the appropriate settings so that the data can be accessed and processed in R, as the researcher requires. R has been increasingly adopted by universities and companies around the world. At present, the acquisition costs of statistical packages with similar or even lower analysis capacity are very high, especially for the predominantly small and medium businesses in our country. The inclusion of this facility in Genes is therefore yet another contribution to the use of R, intended to break barriers and facilitate the construction of diagrams and high-quality data analyses at no cost and with the same reliability as other software.
Matlab®: GENES generates scripts, that is, sets of statements or commands that solve a particular type of problem from a data set within the Matlab program. Matlab is an interactive system whose basic data element is an array that does not require dimensioning. This allows many numerical problems to be solved in a fraction of the time needed to write a similar program in Fortran, Basic or C. Moreover, solutions are expressed almost exactly as they are written mathematically. Each script consists of a set of methods, organized and documented in one or more parts of a process, so that errors can be identified and corrected by debugging the script if necessary.
The author gratefully acknowledges FAPEMIG, CAPES and CNPq for their financial support.
References concerning the software system
CRUZ, C. D. Programa Genes - Análise multivariada e simulação. 1. ed. Viçosa, MG: Editora UFV, 2006. v. 1. 175 p.
CRUZ, C. D. Programa Genes - Biometria. 1. ed. Viçosa, MG: Editora UFV, 2006. v. 1. 382 p.
CRUZ, C. D. Programa Genes - Diversidade Genética. 1. ed. Viçosa, MG: Editora UFV, 2008. v. 1. 278 p.
CRUZ, C. D. Programa Genes - Estatística Experimental e Matrizes. 1. ed. Viçosa, MG: Editora UFV, 2006. v. 1. 285 p.