Statistical Analysis of Experimental Data

The statistical analysis of experimental data is an important requirement for an engineer. In the context of this project we were required to develop a Java applet, since converted to a "Java Web Start" (JNLP) application, which computes statistical information based on an input data file consisting of (x, y) observation pairs. The following notes present a brief introduction (background) to the statistical topics that the application computes.

Java app

To process your own data samples, store them as (x, y) pairs in a text (.txt) file and process it by opening the File menu and choosing "Open". The file must consist of rows of data, each row holding two columns separated by space(s). Each row represents one (x, y) sample: the first column contains the "x" value and the second the "y" value. Click on this icon to download the "statistics.jnlp" (JNLP) file, then accept and double-click this "JNLP" file. This should start "Java Web Start" and you should be able to launch the "Java" application shortly. A "Java Web Start" (JNLP) file lets a user download and launch a "Java" program by clicking on the downloaded "JNLP" file. With browsers becoming more restrictive about running "Java applets", the initially developed "Java applet" was converted to a "Java Web Start" (JNLP) application: instead of being launched from "html", it is now launched with "JNLP", independently of a browser, as it does not rely on a browser (Java) plugin.

Prerequisites: a) install "Java JDK/JRE".
b) starting with Java 7 Update 51, an application that does not conform to the latest security practices - such as an application not signed with a certificate from a "trusted certificate authority", an application digitally signed by an unknown publisher (as in our case), a "JAR" file lacking the "Permissions" manifest attribute, or an application signed with an expired certificate - can still be authorized to run by adding the site (URL) that hosts it to the "Exception Site List" of the "Java Control Panel / Security tab". The "Java Control Panel" can be launched e.g. through the "Windows Start menu / Java program listing / Configure Java" or by searching for the "javacpl.exe" executable. Have a look at https://java.com/en/download/faq/exception_sitelist.xml for more information. So in our case, the "Exception Site List" of the "Java Control Panel" should be updated with the following "URL": "http://aristotelis-metsinis.github.io/".

Introduction

The progress of science, and specifically of engineering science, has been closely related to successful experimentation. An experiment can be classified as deterministic or random (non-deterministic). The former class comprises experiments where the observable consistently yields the same value each time the experiment takes place under the same conditions. An experiment that consists of calculating the capacitance, inductance and resistance elements of an analog filter, when we apply a specific voltage waveform at the input and observe a specific voltage waveform at the output, can be assumed to be deterministic. The latter class consists of experiments where a different outcome may be observed each time the experiment is performed, although the conditions are the same. The experiment of sending a particular string of e.g. 1000 bits over a link and observing the bits received in error at the far end can be assumed to be a non-deterministic experiment.
The interpretation of such experiments can be developed by means of statistical methods. Suppose now that an engineer wishes to discover the lifetime of devices of a specific type working under real conditions. Ideally, the experimenter would like to take every such device that exists and record its lifetime. This collection of results would then represent the population of lifetimes. However, experimental limitations usually prevent values for the entire population from being known. The best that can be hoped for is that a collection of observations (a sample) taken from the population will enable good estimates (e.g. mean, variance, etc.) of the unknown population to be determined. When the observations are collected in such a way that each population value has an equal chance of being included, the sample is said to be random, and the estimates deduced from the sample are called sample estimates (e.g. sample mean, sample variance, sample standard deviation, etc.). Specifically, numbers computed from a data set to help us estimate its relative frequency histogram are called numerical descriptive measures. Those measures fall into three categories: measures of central tendency, measures of variation, and measures of relative standing.
Numerical descriptive measures computed from sample data are called statistics, whereas those computed from the population are called parameters. However, one might make a somewhat arbitrary distinction between data analysis procedures that are model independent (descriptive statistics, e.g. mean, variance, correlation, etc.) and those that are model dependent (e.g. least squares fits, etc.).

Sample mean

Perhaps the most common measure of central tendency is the sample (arithmetic) mean, defined as follows: the sample mean of a set of "n" observations x_1, x_2, ..., x_n is the average of the observations:

x̄ = (1/n) · Σ_{i=1..n} x_i    (1).
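Equation (1) can be sketched in a few lines of Java; the class and method names below are illustrative only and not taken from the applet's source:

```java
// Minimal sketch of the sample mean, equation (1).
public class SampleMean {
    /** Returns the arithmetic mean of the observations: (1/n) * sum(x_i). */
    public static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    public static void main(String[] args) {
        double[] sample = {2.0, 4.0, 6.0, 8.0};
        System.out.println("mean = " + mean(sample)); // prints "mean = 5.0"
    }
}
```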
Sample standard deviation

The most commonly used measures of data variation are the sample variance and the sample standard deviation, defined as follows: the sample variance of a set of "n" observations is defined to be

s² = (1/(n-1)) · Σ_{i=1..n} (x_i - x̄)²    (2).
The sample standard deviation of "n" observations is equal to the square root of the sample variance:

s = √(s²)    (3).

If a data set has an approximately mound-shaped (e.g. Gaussian) relative frequency distribution, then the following rules of thumb may be used to describe the data set: approximately 68% of the observations lie within one standard deviation of the mean (x̄ ± s), approximately 95% within two standard deviations (x̄ ± 2s), and essentially all (about 99.7%) within three standard deviations (x̄ ± 3s).
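Equations (2) and (3) translate directly into Java; again, the class and method names are illustrative, not the applet's:

```java
// Minimal sketch of the sample variance (2) and standard deviation (3).
public class SampleDispersion {
    /** Sample variance (2): squared deviations about the mean, divided by (n - 1). */
    public static double variance(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        return ss / (x.length - 1);
    }

    /** Sample standard deviation (3): square root of the sample variance. */
    public static double stdDev(double[] x) {
        return Math.sqrt(variance(x));
    }

    public static void main(String[] args) {
        double[] sample = {2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0};
        System.out.println("s^2 = " + variance(sample));
        System.out.println("s   = " + stdDev(sample));
    }
}
```

Note the "(n - 1)" denominator, discussed below.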
There is a long discussion about why the denominator in (2) is "(n-1)" instead of "n"; however, it is beyond the scope of this tutorial to present it. On the other hand, it might be worth mentioning that it can be shown that dividing by "(n-1)" gives a better estimate of the population variance, being an unbiased estimator. As the mean depends on the first moment of the data, so the variance and the standard deviation depend on the second moment. It is not uncommon to be dealing with a distribution whose second moment does not exist (i.e. is infinite). In that case the variance as well as the standard deviation are useless as measures of the data's width around its central value: the values obtained by (2) & (3) will not converge with an increased number of points, nor show any consistency from data set to data set drawn from the same distribution. Higher moments, involving higher powers of the input data, are almost always less robust than lower moments.

Skewness & Kurtosis

The third moment of the data is referred to as skewness and the fourth moment as kurtosis. In particular, they are defined as follows: the skewness of a set of "n" observations is defined to be

skewness = (1/n) · Σ_{i=1..n} [ (x_i - x̄) / s ]³    (4).

The kurtosis of a set of "n" observations is defined to be

kurtosis = (1/n) · Σ_{i=1..n} [ (x_i - x̄) / s ]⁴ - 3    (5),

where the "-3" term makes the kurtosis of a Gaussian distribution equal to zero.
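A minimal sketch of equations (4) and (5) follows; the normalization by the sample standard deviation "s" and the "-3" excess term are a common convention and an assumption here, and the class and method names are illustrative:

```java
// Sketch of the skewness (4) and kurtosis (5) of a data set.
public class HigherMoments {
    static double mean(double[] x) {
        double m = 0.0;
        for (double v : x) m += v;
        return m / x.length;
    }

    static double stdDev(double[] x) {
        double m = mean(x), ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return Math.sqrt(ss / (x.length - 1));
    }

    /** Skewness (4): (1/n) * sum(((x_i - mean)/s)^3); zero for symmetric data. */
    public static double skewness(double[] x) {
        double m = mean(x), s = stdDev(x), sum = 0.0;
        for (double v : x) sum += Math.pow((v - m) / s, 3);
        return sum / x.length;
    }

    /** Kurtosis (5): (1/n) * sum(((x_i - mean)/s)^4) - 3; zero for a Gaussian. */
    public static double kurtosis(double[] x) {
        double m = mean(x), s = stdDev(x), sum = 0.0;
        for (double v : x) sum += Math.pow((v - m) / s, 4);
        return sum / x.length - 3.0;
    }

    public static void main(String[] args) {
        double[] sample = {1.0, 2.0, 3.0, 4.0, 5.0};
        System.out.println("skewness = " + skewness(sample));
        System.out.println("kurtosis = " + kurtosis(sample));
    }
}
```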
Least squares method

One of the most important applications of statistics involves estimating the mean value of a dependent (response) variable "y", or predicting some future value of "y", based on the knowledge of a set of related independent (in algebraic rather than probabilistic terms) variables x_1, x_2, ..., x_k. The object is to develop a prediction equation (or model) that expresses "y" as a function of the independent variables, enabling the prediction of "y" for specific values of the independent variables. The models used to relate a dependent variable "y" to the independent variables are called regression models, expressing the mean value of "y" for given values of the independent variables as a linear function of a set of known parameters. In the context of this project, we introduce the simple linear regression model, which relates "y" to a single independent variable "x", and in particular we fit this model to a set of data using the method of least squares. A simple linear regression model makes the assumption that the mean value of "y" for a given value of "x" graphs as a straight line, and that points deviate about this line of means by a random amount ε:

y = β₀ + β₁·x + ε    (6),

where β₀ is the point where the line intercepts the y-axis and β₁ is the slope of the line. In order to fit a simple linear regression model to a set of data, we must find estimators for the unknown parameters β₀, β₁ of the line of means, making the following assumptions about the random error ε: its mean is zero, i.e. E(ε) = 0; its variance σ² is constant for all values of "x"; it is normally (Gaussian) distributed; and the errors associated with distinct observations are independent.
In actual practice, the assumptions need not hold exactly in order for the least squares estimators and test statistics (e.g. chi-squared) to possess the measure of reliability that we would expect from a regression analysis. In order to choose the "best fitting" line for a set of data, we shall estimate β₀ and β₁ by using the method of least squares. This method has the significant property of producing the only line for which the sum of squares of the deviations is minimum, although many lines exist for which the sum of the deviations (errors) is equal to zero. To find the least squares line for a set of data, let us assume that we have a sample of "n" data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). The straight-line model for the response "y" in terms of "x" is given by (6). The line of means is given by

E(y) = β₀ + β₁·x    (7)

and the fitted line is represented as

ŷ = b₀ + b₁·x    (8),

where ŷ is an estimator of "E(y)" and a predictor of some future value of "y", obtained by substituting "x" into (8), whereas b₀, b₁ are estimators of β₀ and β₁ respectively. The prediction equation (8) is called the least squares line if the quantities b₀, b₁ make the sum of squares of the deviations of "y" about the predicted values ŷ, over all of the "n" data values, minimum. It is easily proved (by setting the two partial derivatives of the sum of squares of deviations with respect to b₀ and b₁ equal to zero respectively and solving the resulting linear system) that:

b₁ = SS_xy / SS_xx = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ) / Σ_{i=1..n} (x_i - x̄)²   and   b₀ = ȳ - b₁·x̄.
The method of least squares fits a straight line through any set of points, even when the relationship between the variables is not linear. Concluding this section, it might be worth mentioning a number of properties of the least squares method: the sum of the deviations (residuals) of the observed values from the fitted line equals zero; the sum of the squares of those deviations is smaller than for any other straight line; and the least squares line always passes through the point (x̄, ȳ).
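The closed-form estimators above can be sketched as follows; class and method names are illustrative, not the applet's:

```java
// Sketch of the simple linear regression fit: intercept b0 and slope b1
// minimizing the sum of squared deviations of y about the fitted line.
public class LeastSquares {
    /** Returns {b0, b1}, where b1 = SSxy / SSxx and b0 = ybar - b1 * xbar. */
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xbar = 0.0, ybar = 0.0;
        for (int i = 0; i < n; i++) {
            xbar += x[i];
            ybar += y[i];
        }
        xbar /= n;
        ybar /= n;
        double ssxy = 0.0, ssxx = 0.0;
        for (int i = 0; i < n; i++) {
            ssxy += (x[i] - xbar) * (y[i] - ybar);
            ssxx += (x[i] - xbar) * (x[i] - xbar);
        }
        double b1 = ssxy / ssxx;
        double b0 = ybar - b1 * xbar;
        return new double[]{b0, b1};
    }

    public static void main(String[] args) {
        // Points lying exactly on y = 1 + 2x recover the coefficients.
        double[] x = {0.0, 1.0, 2.0, 3.0};
        double[] y = {1.0, 3.0, 5.0, 7.0};
        double[] b = fit(x, y);
        System.out.println("b0 = " + b[0] + ", b1 = " + b[1]);
    }
}
```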
Chi-squared statistics

As mentioned, the assumptions made in the previous section for the least squares estimators need not hold exactly, and test statistics like the chi-squared statistic may be used to estimate the goodness of fit of the data to the model, e.g. the least squares model. We may assume that the chi-squared statistic between the least squares line and the data points is computed by the following relation:

χ² = Σ_{i=1..n} (y_i - ŷ_i)²    (20).

In order to estimate the significance of the chi-squared statistic, we shall apply the incomplete gamma function P(a, x2), where "a" is the degrees of freedom, equal to [n - number of constraints] (the usual case is number of constraints = 1), and "x2" is the chi-squared statistic. The incomplete gamma function is defined by

P(a, x) = (1/Γ(a)) · ∫_{0..x} e^(-t) · t^(a-1) dt,

where

Γ(a) = ∫_{0..∞} t^(a-1) · e^(-t) dt

is the gamma function, having the following properties: P(a, 0) = 0 and P(a, ∞) = 1.
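A sketch of the chi-squared statistic and of the regularized incomplete gamma function P(a, x) follows. The incomplete gamma function is evaluated here by its series expansion, which converges quickly for x ≲ a + 1 (a continued-fraction evaluation would be preferred for larger x); the log-gamma routine uses the well-known Lanczos approximation. Class and method names are illustrative, and the unit weighting of the residuals in the chi-squared sum is an assumption:

```java
// Sketch: chi-squared statistic (20) and incomplete gamma P(a, x).
public class GoodnessOfFit {
    /** Chi-squared (20): sum of squared residuals between data y and fit yhat,
        assuming unit measurement errors on each point. */
    public static double chiSquared(double[] y, double[] yhat) {
        double chi2 = 0.0;
        for (int i = 0; i < y.length; i++) {
            double d = y[i] - yhat[i];
            chi2 += d * d;
        }
        return chi2;
    }

    /** ln Gamma(a) via the Lanczos approximation. */
    public static double logGamma(double xx) {
        double[] cof = {76.18009172947146, -86.50532032941677, 24.01409824083091,
                        -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5};
        double x = xx, y = xx;
        double tmp = x + 5.5;
        tmp -= (x + 0.5) * Math.log(tmp);
        double ser = 1.000000000190015;
        for (int j = 0; j <= 5; j++) ser += cof[j] / ++y;
        return -tmp + Math.log(2.5066282746310005 * ser / x);
    }

    /** Regularized lower incomplete gamma P(a, x), by series expansion.
        Best suited to x < a + 1; P(a, 0) = 0 and P(a, infinity) = 1. */
    public static double gammaP(double a, double x) {
        if (x <= 0.0) return 0.0;
        double ap = a, sum = 1.0 / a, del = sum;
        for (int n = 0; n < 500; n++) {
            ap += 1.0;
            del *= x / ap;
            sum += del;
            if (Math.abs(del) < Math.abs(sum) * 1e-12) break;
        }
        return sum * Math.exp(-x + a * Math.log(x) - logGamma(a));
    }

    public static void main(String[] args) {
        // P(1, x) = 1 - e^(-x), a handy sanity check.
        System.out.println("P(1, 1) = " + gammaP(1.0, 1.0));
    }
}
```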
Except for the special case where "a" is an integer, we cannot obtain a closed form for the integral of the gamma density function. Consequently, the cumulative distribution function must be obtained using approximation procedures.