Data Science with Java – Part 4: Testing hypotheses with the inference package

To test whether the observed data are consistent with a given hypothesis, we can take advantage of the inference package of Apache Commons Math.

Considering the tests included in the package is a good opportunity to learn more about statistics and probability theory.

Let´s consider the following binomial test about flipping a coin:

		BinomialTest binomialTest = new BinomialTest();

		double nullHypothesis = 0.5; //fair coin
		int numberOfSuccesses = 9; //number of heads (biased coin)
		
		//Two sided = Represents a two-sided test. H0: p = p0, H1: p ≠ p0.
		AlternativeHypothesis alternativeHypothesis = AlternativeHypothesis.TWO_SIDED;
		int numberOfTrials = 10;

		// Returns the observed significance level, or p-value, associated with
		// a Binomial test.
		double significanceLevel = binomialTest.binomialTest(numberOfTrials, numberOfSuccesses, nullHypothesis,
				alternativeHypothesis);

		double alpha = 0.03; //significance level of the test
		
		// Returns whether the null hypothesis can be rejected at the given
		// significance level alpha.
		//true if significanceLevel (the p-value) < alpha
		boolean rejected = binomialTest.binomialTest(numberOfTrials, numberOfSuccesses, nullHypothesis,
				alternativeHypothesis, alpha);

		System.out.println("The significance level is " + significanceLevel);
		System.out.println("Can we reject the null hypothesis?" + rejected);

 

The result that we get is:

The significance level is 0.021484375000000003
Can we reject the null hypothesis?true

The observed significance level (the p-value) is lower than the chosen alpha of 0.03, which means that we can reject the null hypothesis that the coin is fair.
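
To see where a value of about 0.0215 comes from, the calculation can be sketched in plain Java without the library. This is only an illustration under one common definition of the two-sided p-value (the total probability of all outcomes that are at most as likely as the observed one); it is not claimed to be the algorithm that Commons Math uses internally, and the class and method names are made up for this example.

```java
public class BinomialPValueSketch {

    // Binomial coefficient C(n, k), built up step by step in double precision.
    public static double choose(int n, int k) {
        double result = 1.0;
        for (int i = 1; i <= k; i++) {
            result = result * (n - k + i) / i;
        }
        return result;
    }

    // Probability of exactly k successes in n trials with success probability p.
    public static double pmf(int n, int k, double p) {
        return choose(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
    }

    // Two-sided p-value: sum of the probabilities of every outcome that is
    // at most as likely as the observed one under the null hypothesis.
    public static double twoSidedPValue(int n, int observed, double p) {
        // small relative tolerance so that equal probabilities are not lost to rounding
        double threshold = pmf(n, observed, p) * (1 + 1e-9);
        double sum = 0.0;
        for (int k = 0; k <= n; k++) {
            double probability = pmf(n, k, p);
            if (probability <= threshold) {
                sum += probability;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // 9 heads in 10 flips of a supposedly fair coin
        System.out.println("p-value: " + twoSidedPValue(10, 9, 0.5));
    }
}
```

For 9 heads out of 10, the outcomes 0, 1, 9 and 10 heads are the ones that count, which gives (1 + 10 + 10 + 1) / 1024 = 22 / 1024.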

In the next posts I will write about the ChiSquare and KolmogorovSmirnov tests too. Stay tuned! 🙂

Data science with Java – Part 3: Statistics with Apache Commons Math library

Although some statistical analysis can be performed with plain Java 8 code (thanks to lambda expressions and the Stream API), a lot more can be achieved with fewer lines of code using libraries such as Google Guava or the Apache Commons Mathematics Library.

I am a big fan of the Apache Foundation, so I will discard Google Guava for now.
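
For comparison, here is a minimal sketch of what the plain Java 8 approach might look like for the mean and the median. The class name and the sample values are hypothetical, and the median method assumes a non-empty array.

```java
import java.util.stream.DoubleStream;

public class StreamStatisticsSketch {

    // Arithmetic mean via the Stream API; NaN for an empty array.
    public static double mean(double[] data) {
        return DoubleStream.of(data).average().orElse(Double.NaN);
    }

    // Median of a non-empty array: middle element of the sorted values,
    // or the average of the two middle elements for an even length.
    public static double median(double[] data) {
        double[] sorted = DoubleStream.of(data).sorted().toArray();
        int middle = sorted.length / 2;
        if (sorted.length % 2 == 0) {
            return (sorted[middle - 1] + sorted[middle]) / 2.0;
        }
        return sorted[middle];
    }

    public static void main(String[] args) {
        double[] testData = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up sample values
        System.out.println("The mean is " + mean(testData));
        System.out.println("The median is " + median(testData));
    }
}
```

This works, but anything beyond such basics (percentiles, variance, hypothesis tests) has to be written by hand, which is where a library pays off.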

The commons library offers a couple of options for each statistical function.

You can use the class DescriptiveStatistics, passing the array of doubles to its constructor:

DescriptiveStatistics descriptiveStatistics = new DescriptiveStatistics(testData);
out.println("\nThe mean is " + descriptiveStatistics.getMean());
out.println("The standard deviation is " + descriptiveStatistics.getStandardDeviation());
out.println("The median is " + descriptiveStatistics.getPercentile(50));

or use the classes Mean, Median, etc.

public static double getMean(double[] testData) {
    Mean mean = new Mean();
    return mean.evaluate(testData);
}

The StandardDeviation can be constructed using the sample formula (Bessel´s bias correction) by setting the constructor parameter to “true”:

private static double getPopulationStandardDeviation(double[] testData) {
// population formula, not bias corrected (divides by n)
    StandardDeviation sdPopulation = new StandardDeviation(false);
    return sdPopulation.evaluate(testData);
}

private static double getBiasCorrectedStandardDeviation(double[] testData) {
// bias corrected estimation ( n − 1 instead of n in the formula)
    StandardDeviation sdSample = new StandardDeviation(true);
    return sdSample.evaluate(testData);
}
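
The difference between the two formulas is only the divisor. The following library-free sketch makes that explicit; the class and method names are invented for illustration and the code assumes at least two data points.

```java
public class StandardDeviationSketch {

    // Standard deviation of the data. With biasCorrected = true the sum of
    // squared deviations is divided by n - 1 (sample formula), otherwise by n
    // (population formula).
    public static double standardDeviation(double[] data, boolean biasCorrected) {
        int n = data.length;

        double sum = 0.0;
        for (double value : data) {
            sum += value;
        }
        double mean = sum / n;

        double sumOfSquares = 0.0;
        for (double value : data) {
            double deviation = value - mean;
            sumOfSquares += deviation * deviation;
        }

        int divisor = biasCorrected ? n - 1 : n;
        return Math.sqrt(sumOfSquares / divisor);
    }

    public static void main(String[] args) {
        double[] testData = {2, 4, 4, 4, 5, 5, 7, 9}; // made-up sample values
        System.out.println("Population formula: " + standardDeviation(testData, false));
        System.out.println("Sample formula:     " + standardDeviation(testData, true));
    }
}
```

The bias-corrected value is always slightly larger, and the gap shrinks as the number of data points grows.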

 

Data Science with Java – Part 2: CSV data into charts

A nice Java library called opencsv lets you read the content of a CSV file, which you can then turn into charts.

Let´s consider, for example, unemployment in Germany since the reunification. We will use a CSV file with four columns: the year and the number of unemployed people in Germany as a whole, in West Germany and in East Germany.

1991,2602203,1596457,1005745
1992,2978570,1699273,1279297
1993,3419141,2149465,1269676
1994,3698057,2426276,1271781
1995,3611921,2427083,1184838
1996,3965064,2646442,1318622
1997,4384456,2870021,1514435
1998,4280630,2751535,1529095
1999,4100499,2604720,1495779
2000,3889695,2380987,1508707
2001,3852564,2320500,1532064
2002,4061345,2498392,1562953
2003,4376795,2753181,1623614
2004,4381281,2782759,1598522
2005,4860909,3246755,1614154
2006,4487305,3007158,1480146
2007,3760586,2475528,1285058
2008,3258954,2138778,1120175
2009,3414992,2314215,1100777
2010,3238965,2227473,1011492
2011,2976488,2026545,949943
2012,2897126,1999918,897209
2013,2950338,2080342,869995
2014,2898388,2074553,823835
2015,2794664,2020503,774162
2016,2690975,1978672,712303
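
Each row is just four comma-separated values, so a single line can be taken apart with a few lines of plain Java. The sketch below (class and method names are hypothetical) uses three of the rows above and finds the year with the highest total; for real files a CSV library such as opencsv is the more robust choice, since it also deals with quoting and escaping.

```java
public class UnemploymentRowsSketch {

    // Returns the year (first column) of the row whose total for Germany
    // (second column) is the largest. Assumes well-formed rows.
    public static String yearWithHighestTotal(String[] rows) {
        String bestYear = null;
        int bestTotal = Integer.MIN_VALUE;
        for (String row : rows) {
            String[] columns = row.split(",");
            int total = Integer.parseInt(columns[1].trim());
            if (total > bestTotal) {
                bestTotal = total;
                bestYear = columns[0].trim();
            }
        }
        return bestYear;
    }

    public static void main(String[] args) {
        // three rows copied from the data above
        String[] rows = {
            "1991,2602203,1596457,1005745",
            "2005,4860909,3246755,1614154",
            "2016,2690975,1978672,712303"
        };
        System.out.println("Year with the highest total: " + yearWithHighestTotal(rows));
    }
}
```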

We can plot it as a line chart using just JavaFX and the opencsv library:

package de.datascience.charts;

import java.io.FileReader;

import com.opencsv.CSVReader;

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.chart.CategoryAxis;
import javafx.scene.chart.LineChart;
import javafx.scene.chart.NumberAxis;
import javafx.scene.chart.XYChart;
import javafx.stage.Stage;

public class UnemploymentGermany extends Application {

	@Override
	public void start(Stage stage) throws Exception {
		stage.setTitle("Index Chart Sample");
		// lower bound, upper bound, tick unit (a tick unit of 1 would mean millions of tick marks)
		final NumberAxis yAxis = new NumberAxis(0, 5000000, 500000);
		final CategoryAxis xAxis = new CategoryAxis();

		final LineChart<String, Number> lineChart = new LineChart<>(xAxis, yAxis);
		yAxis.setLabel("People without job");
		xAxis.setLabel("year");
		lineChart.setTitle("Unemployment in Germany");

		XYChart.Series<String, Number> series = new XYChart.Series<>();
		XYChart.Series<String, Number> seriesWest = new XYChart.Series<>();
		XYChart.Series<String, Number> seriesEast = new XYChart.Series<>();
		
		series.setName("Germany");
		seriesWest.setName("West Germany");
		seriesEast.setName("East Germany");
		
		try (CSVReader dataReader = new CSVReader(new FileReader("docs/unemployment_germany.csv"))) {
			String[] nextLine;
			while ((nextLine = dataReader.readNext()) != null) {
				// columns: year, Germany, West Germany, East Germany
				String year = nextLine[0];
				int unemployed = Integer.parseInt(nextLine[1]);
				series.getData().add(new XYChart.Data<String, Number>(year, unemployed));
				int unemployedWest = Integer.parseInt(nextLine[2]);
				seriesWest.getData().add(new XYChart.Data<String, Number>(year, unemployedWest));
				int unemployedEast = Integer.parseInt(nextLine[3]);
				seriesEast.getData().add(new XYChart.Data<String, Number>(year, unemployedEast));
			}
		}

		lineChart.getData().addAll(series, seriesWest, seriesEast);
		Scene scene = new Scene(lineChart, 500, 400);
		stage.setScene(scene);
		stage.show();
	}

	public static void main(String[] args) {
		launch(args);
	}
}

The output will be the following:

Data Science with Java – Part 1: bar charts with FX

This year some books about using Java for Data science have been released and I am very happy about it!!! It doesn´t have to be Python at any cost.

Let´s dive into this new Java adventure. 🙂

Some basic visualization can be achieved with the JavaFX chart classes, which can be found in the “javafx.scene.chart” package.

The following code creates a bar chart of the shares of expenditures in 4 countries by category:

package de.datascience.charts;

import javafx.application.Application;
import javafx.scene.Scene;
import javafx.scene.chart.BarChart;
import javafx.scene.chart.CategoryAxis;
import javafx.scene.chart.NumberAxis;
import javafx.scene.chart.XYChart;
import javafx.stage.Stage;

public class ExpendituresShares extends Application {

    final static String FOOD = "Food";
    final static String HOUSING = "Housing";
    final static String TRANSPORTATION = "Transportation";
    final static String HEALTHCARE = "Health care";
    final static String CLOTHING = "Clothing";
    
    final static String USA="U.S.A.";
    final static String UK="United Kingdom";
    final static String CANADA="Canada";
    final static String JAPAN="Japan";

    final CategoryAxis xAxis = new CategoryAxis();
    final NumberAxis yAxis = new NumberAxis();

    final XYChart.Series<String, Number> usaSeries = new XYChart.Series<>();
    final XYChart.Series<String, Number> canadaSeries2 = new XYChart.Series<>();
    final XYChart.Series<String, Number> ukSeries = new XYChart.Series<>();
    final XYChart.Series<String, Number> japanSeries = new XYChart.Series<>();

    public void simpleBarChartByCountry(Stage stage) {
        stage.setTitle("Bar Chart");
        final BarChart<String, Number> barChart
                = new BarChart<>(xAxis, yAxis);
        barChart.setTitle("Shares of expenditures by Country");
        xAxis.setLabel("Category");
        yAxis.setLabel("Percentage");

        usaSeries.setName(USA);
        addDataItem(usaSeries, FOOD, 14);
        addDataItem(usaSeries, HOUSING, 26);
        addDataItem(usaSeries, TRANSPORTATION, 17);
        addDataItem(usaSeries, HEALTHCARE, 8);
        addDataItem(usaSeries, CLOTHING, 4);

        canadaSeries2.setName(CANADA);
        addDataItem(canadaSeries2, FOOD, 15);
        addDataItem(canadaSeries2, HOUSING, 21);
        addDataItem(canadaSeries2, TRANSPORTATION, 20);
        addDataItem(canadaSeries2, HEALTHCARE, 7);
        addDataItem(canadaSeries2, CLOTHING, 6);

        ukSeries.setName(UK);
        addDataItem(ukSeries, FOOD, 20);
        addDataItem(ukSeries, HOUSING, 24);
        addDataItem(ukSeries, TRANSPORTATION, 15);
        addDataItem(ukSeries, HEALTHCARE, 2);
        addDataItem(ukSeries, CLOTHING, 6);
        
        japanSeries.setName(JAPAN);
        addDataItem(japanSeries, FOOD, 23);
        addDataItem(japanSeries, HOUSING, 22);
        addDataItem(japanSeries, TRANSPORTATION, 10);
        addDataItem(japanSeries, HEALTHCARE, 4);
        addDataItem(japanSeries, CLOTHING, 4);

        Scene scene = new Scene(barChart, 800, 600);
        barChart.getData().addAll(usaSeries, canadaSeries2, ukSeries, japanSeries);
        stage.setScene(scene);
        stage.show();
    }

    public void addDataItem(XYChart.Series<String, Number> series,
            String x, Number y) {
        series.getData().add(new XYChart.Data<>(x, y));
    }

    @Override
    public void start(Stage stage) {
        simpleBarChartByCountry(stage);
    }

    public static void main(String[] args) {
        launch(args);
    }

}

If you run the main you should see the following window:

Python : Basic statistics with the numpy module

The numpy module features some useful functions for statistics, like “mean()” and “median()”:

https://docs.scipy.org/doc/numpy/reference/routines.statistics.html

For example let´s consider a 2D array with age and height of some people and print out some statistics:

#! /usr/bin/env python
import numpy as np

#age, height in meters
person = [[11,1.56],[4, 0.80], [44, 1.88], [23, 1.68], [55, 1.74]]

np_person = np.array(person)

print(np_person)

age = np_person[:,0]

height = np_person[:,1]
 
#average
print("average age: " + str(np.mean(age)))
print("average height: " + str(np.mean(height)))

#the standard deviation is also rounded to two decimals only.
std_height= round(np.std(height),2)

print("standard deviation of the height: "+ str(std_height))

#correlation matrix (2x2): the off-diagonal entries are the age/height correlation
corr = np.corrcoef(np_person[:,0], np_person[:,1])
print("Correlation: " + str(corr))

The code can also be found on GitHub:
https://github.com/lauraliparulo/python-scripts/blob/master/statistics/person_stats.py

Python scripting with Linux: which shebang?

If you want to run a Python script directly from the Linux shell, like any other executable, you need to add the shebang line: “#! /usr/bin/env python”

It must be the first line of the file.

The shebang allows you to run the script like any other script. Assuming the file is named “script.py”, one of the many ways to run it is:

> ./script.py

The file must be made executable first (sudo is not needed for a file you own):

> chmod +x script.py

Assuming we want to print a homogeneous array created with the numpy module, the script needs to import the module too:

#! /usr/bin/env python

import numpy as np

array1 = np.array([1,2,3,4])

print(array1)

print(type(array1))

It will print the following lines (the second one is in the Python 2 format; Python 3 prints <class 'numpy.ndarray'> instead):

[1 2 3 4]
<type 'numpy.ndarray'>

 

 

Files and directories in Java: the NIO.2 Approach

In the previous post I told you about the old-fashioned way to handle files in Java.
In this article we will focus on the newer NIO.2 API.
Instead of the File class we will use the “Path” interface.
Files is a utility class whose static methods create files or directories; they take a Path as argument:

	Path filePath=Paths.get("/home/laura/demo.txt");
	Files.createFile(filePath);
	
	Path dirPath=Paths.get("/home/laura/newdir");
	Files.createDirectory(dirPath);
	
	Path nestedDirPath=Paths.get("/home/laura/newdir3/newdir4");
	Files.createDirectories(nestedDirPath);

The “createDirectories” method creates the whole directory structure, including any missing parent directories.

With NIO.2, copying or deleting a file or directory has become a matter of one line of code, thanks to the very straightforward methods “copy”, “delete”, etc. (a directory must be empty before it can be deleted).
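
As a rough sketch of these one-liners, the following class creates a file inside a temporary directory, copies it, deletes the original and then cleans up. The class name, the method name and the file names are assumptions made for this example; only the “Files” methods themselves come from the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NioCopyDeleteSketch {

    // Returns true if, after copying and deleting, only the copy is left.
    public static boolean copyThenDelete() {
        try {
            Path tempDir = Files.createTempDirectory("nio-demo");
            Path nestedDir = Files.createDirectories(tempDir.resolve("a").resolve("b"));
            Path source = Files.createFile(nestedDir.resolve("demo.txt"));
            Path target = tempDir.resolve("demo-copy.txt");

            // copy and delete are one line each
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            Files.delete(source);

            boolean onlyCopyLeft = Files.exists(target) && Files.notExists(source);

            // clean up: directories can only be deleted once they are empty
            Files.delete(target);
            Files.delete(nestedDir);             // .../a/b
            Files.delete(nestedDir.getParent()); // .../a
            Files.delete(tempDir);
            return onlyCopyLeft;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Only the copy is left? " + copyThenDelete());
    }
}
```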

Many things can be achieved with the utility classes Files and Paths.

Java IO: Handling character files efficiently

Let´s take a look at the classic IO file handling in Java.
First let´s start with the “File” class.
The File class is not used to read or write data. It´s used to create or delete files and directories, for searching and working with paths.
So it´s not used for the files or directories content.

The File class constructor does not create anything on the hard drive. As you can see in the following snippet, you need to use the createNewFile method for it:

		File file = new File("/home/laura/demo.txt");
		System.out.println("Does the file exist? "+file.exists());
		boolean created = file.createNewFile();
		System.out.println("File created? "+created);
		System.out.println("Does the file exist? "+file.exists());

The createNewFile method returns “false” if the file already exists.

If the goal is handling text files, you don´t need any of the byte-oriented Stream classes. All you need is a reader and a writer. The basic option is to use the plain “FileReader” and “FileWriter” classes:

		FileWriter fileWriter= new FileWriter(file);
		fileWriter.write("hello world. this is a text file.");
		fileWriter.flush();
		fileWriter.close();
		
		FileReader fileReader= new FileReader(file);
		char[] in = new char[100];		
		int size = fileReader.read(in);
	
		System.out.println("Characters read:"+size);
		
		// print only the characters that were actually read, not the whole array
		for(int i = 0; i < size; i++){
			System.out.print(in[i]);
		}
	
		fileReader.close();

The FileReader method “read” returns the number of characters read.

Notice that the code calls the method “flush” after writing the text into the file. It forces any buffered data out to the file. Right before “close” it is strictly speaking redundant, because closing a writer flushes it first, but it matters whenever you keep the writer open and want the data to be written immediately.

So far so good. But this simple way shown above is not the most elegant solution out there, because we have been using an array (that has a fixed size)!

A much better approach is wrapping it up using the classes BufferedReader and BufferedWriter:

		FileWriter fileWriter= new FileWriter(file);
		BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);
		
		bufferedWriter.write("hello world. this is a text file.");
		bufferedWriter.newLine();
		bufferedWriter.write("You are using the buffered reader now!");			
		bufferedWriter.close();
		
		FileReader fileReader= new FileReader(file);
		BufferedReader bufferedReader = new BufferedReader(fileReader);
		
		String data;			
		while((data=bufferedReader.readLine())!=null){
			System.out.println(data);		
		}
		
		bufferedReader.close();

Notice that the buffered reader also has a method called “readLine”, which you wouldn´t get with a plain FileReader.
Once you close a writer it cannot be reopened. You get an IOException if you use it after closing it!
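
Since forgetting to close a reader or writer is easy, the same round trip can also be written with try-with-resources (available since Java 7), which closes the resources automatically. The following is a minimal sketch; the class name, the method name and the use of a temporary file are choices made for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class BufferedRoundTripSketch {

    // Writes the given lines to a temporary file and reads them back.
    public static List<String> writeAndReadBack(String... lines) {
        try {
            File file = File.createTempFile("demo", ".txt");
            file.deleteOnExit();

            // the writer is flushed and closed automatically at the end of the block
            try (BufferedWriter writer = new BufferedWriter(new FileWriter(file))) {
                for (String line : lines) {
                    writer.write(line);
                    writer.newLine();
                }
            }

            List<String> result = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String data;
                while ((data = reader.readLine()) != null) {
                    result.add(data);
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : writeAndReadBack("hello world. this is a text file.", "second line")) {
            System.out.println(line);
        }
    }
}
```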

Debugging and Testing in Java : enabling assertions

For testing and debugging purposes you can enable the assertions evaluation with the VM parameter “-ea” ( or “-enableassertions” if you prefer the whole thing).

The assertions remain in the code and are just ignored at runtime if you don´t enable them. You can think of them as an aid in case of need.
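
A well-known idiom to find out at runtime whether assertions are currently enabled relies on exactly this behaviour: an assignment placed inside an assert statement is executed only when assertions are evaluated. A small sketch (the class and method names are invented for this example):

```java
public class AssertionStatusSketch {

    // Returns true only if the JVM evaluates assertions for this class.
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        // intentional side effect: the assignment runs only when assertions are enabled
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("Assertions enabled? " + assertionsEnabled());
    }
}
```

Running it once with “-ea” and once without shows the two possible answers.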

The following method throws an AssertionError (when assertions are enabled) if the given parameter is not equal to the String “Laura”:

         private void testFirstName(String firstName){
                 assert(firstName.equals("Laura"));

               System.out.println("The given first name is Laura");
         }

A string message can be displayed in the stack trace, by adding a second expression to the assertion, like :

	private void testFirstName(String firstName){
		assert(firstName.equals("Laura")) : "the name "+ firstName+ " is not Laura!";
		
		System.out.println("The given first name is Laura");
	}

The output in this second case would be something like:

Exception in thread "main" java.lang.AssertionError: the name Anna is not Laura!
at com.demo.AssertionsDemo.testFirstName(AssertionsDemo.java:15)
at com.demo.AssertionsDemo.main(AssertionsDemo.java:8)

Since Java 1.4 “assert” has become a keyword.

Assertions can be disabled with the VM option “-da” or “-disableassertions”. Although the assertions are disabled by default, the manual disabling might make sense if you don´t want to enable the assertions for all the classes.

For example, if you want to run a Java jar file (for example “demo.jar”) with assertions enabled, but disabled for one of its classes (e.g. “com.demo.Demo”), you need to use the following parameters (VM options go before “-jar”):

> java -ea -da:com.demo.Demo -jar demo.jar

To disable them for the whole “com.demo” package (including subpackages):

> java -ea -da:com.demo... -jar demo.jar

So entering VM parameters like “-ea:<class>” or “-da:<package>...” allows you to select which assertions to evaluate.

Ubuntu 14.04: PPA key not found on keyserver

When running “apt-get update”, Ubuntu complains about missing PPA keys.

In my case I was trying to install LibreOffice and the missing key was 83FBA1751378B444.

I couldn´t retrieve the key by running:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 83FBA1751378B444

The message that I got in the console was:
gpg: requesting key 83FBA1751378B444 from hkp server keyserver.ubuntu.com
gpgkeys: key 83FBA1751378B444 not found on keyserver
gpg: no valid OpenPGP data found.
gpg: Total number processed: 0

The solution for me was running:

 sudo launchpad-getkeys 

and then I could finally run “apt-get update”.