Sunday, July 26, 2015

Java: How to read files quickly

Recently I worked on a project in which a graph with nodes and edges had to be built from text files. The biggest file was too big for Eclipse to even open: it contained about 2,970,000 lines. Each line began with 'N' (node) or 'E' (edge) and carried additional fields such as identifiers. My first approach was really slow and took about 17 seconds to read the big file, so I was forced to look for faster ways to read files.
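For illustration, a node line and an edge line might have looked roughly like this (the exact fields were project-specific; the values here are made up):

N 42 49.8728 8.6512
E 42 1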
Here we go:

Surprisingly, the MappedByteBuffer was slower than the BufferedReader: on average it took 1.873 seconds to read the large file.

MappedByteBuffer and StringTokenizer:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.Charset;
import java.util.StringTokenizer;

...

FileInputStream file = null;

final String charsetName = "UTF-8";

try {

   // Path is a String field holding the file name
   file = new FileInputStream(Path);
   FileChannel ch = file.getChannel();
   // map the whole file into memory in one go
   MappedByteBuffer mbb = ch.map(MapMode.READ_ONLY, 0L, ch.size());

   // look up the charset once instead of once per iteration
   final Charset charset = Charset.forName(charsetName);

   while (mbb.hasRemaining()) {

      // decode() consumes the remaining bytes, so this outer loop runs only once
      CharBuffer cb = charset.decode(mbb);
      String text = cb.toString();

      // strTokenizer is a static field, shared with getNodeFromLine()/getEdgeFromLine()
      strTokenizer = new StringTokenizer(text);

      while (strTokenizer.hasMoreTokens()) {

         String nextToken = strTokenizer.nextToken();
         // put in here your logic
      }

   }

   file.close();

} catch (FileNotFoundException e) {
   e.printStackTrace();
} catch (IOException e) {
   e.printStackTrace();
}
}


The BufferedReader was the fastest solution. It took about 1.451 seconds.

BufferedReader and StringTokenizer:
FileInputStream file = null;
InputStreamReader ir = null;
BufferedReader br = null;

try {

   file = new FileInputStream(Path);
   ir = new InputStreamReader(file);
   br = new BufferedReader(ir);

   String line = null;

   while ((line = br.readLine()) != null) {

      // strTokenizer is a static field, shared with getNodeFromLine()/getEdgeFromLine()
      strTokenizer = new StringTokenizer(line);

      while (strTokenizer.hasMoreTokens()) {

         String nextToken = strTokenizer.nextToken();
         // put in here your logic
      }

   }

   // closing the outermost reader also closes the wrapped reader and stream
   br.close();

} catch (FileNotFoundException e) {
   e.printStackTrace();
} catch (IOException e) {
   e.printStackTrace();
}
}
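By the way, since Java 7 you can let try-with-resources take care of the closing. A minimal sketch of the same BufferedReader loop (functionally equivalent; not benchmarked separately):

try (BufferedReader reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(Path), "UTF-8"))) {

   String line;
   while ((line = reader.readLine()) != null) {
      strTokenizer = new StringTokenizer(line);
      while (strTokenizer.hasMoreTokens()) {
         String nextToken = strTokenizer.nextToken();
         // put in here your logic
      }
   }
} catch (IOException e) {
   e.printStackTrace();
}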
The java.util.Scanner solution was the slowest one: it took about 17.127 seconds.

Scanner:
try {

    // scannerFile is a static field, shared with getNodeFromLine()/getEdgeFromLine()
    scannerFile = new Scanner(new FileInputStream(Path));
    scannerFile.useLocale(Locale.US);

    while (scannerFile.hasNext()) {
       // put in here your logic (it has to consume tokens, e.g. via getNodeFromLine())
    }

} catch (FileNotFoundException e) {
   e.printStackTrace();
} finally {
   if (scannerFile != null) {
      // closing the Scanner also closes the underlying FileInputStream
      scannerFile.close();
   }
}
}

private static Node getNodeFromLine() {
  
   final long item1 = scannerFile.nextLong();
   final double item2 = scannerFile.nextDouble();
   ...
}

private static Edge getEdgeFromLine() {

   final long item1 = scannerFile.nextLong();
   final boolean item2 = getBooleanFromInt(scannerFile.nextInt());
   ...
}
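getBooleanFromInt isn't shown here; a minimal sketch, assuming the file encodes booleans as 0/1:

private static boolean getBooleanFromInt(final int value) {
   // assumption: the file stores booleans as 0 (false) / 1 (true)
   return value != 0;
}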



The StringTokenizer solutions shared the same Node and Edge methods:
private static Node getNodeFromLine() {
  
   final long item1 = Long.parseLong(strTokenizer.nextToken());
   final double item2 = Double.parseDouble(strTokenizer.nextToken());
   ...
}

private static Edge getEdgeFromLine() {

   final long item1 = Long.parseLong(strTokenizer.nextToken());
   final boolean item2 = getBooleanFromString(strTokenizer.nextToken());
   ...
}
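getBooleanFromString is the string-based counterpart (again assuming a 0/1 encoding), and the "put in here your logic" placeholder essentially dispatches on the line's first token inside the read loop. Roughly like this (the nodes and edges collections are hypothetical):

private static boolean getBooleanFromString(final String value) {
   // assumption: the file stores booleans as "0" (false) / "1" (true)
   return !"0".equals(value);
}

...

strTokenizer = new StringTokenizer(line);
final String type = strTokenizer.nextToken();

if ("N".equals(type)) {
   nodes.add(getNodeFromLine());   // hypothetical collection of nodes
} else if ("E".equals(type)) {
   edges.add(getEdgeFromLine());   // hypothetical collection of edges
}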

I stripped out the actual graph-building logic to focus on the differences between the implementations.

You can easily see that in this case the Scanner wasn't fast enough, while the MappedByteBuffer and BufferedReader solutions were roughly nine to twelve times faster (17.127 s vs. 1.873 s and 1.451 s). I chose the StringTokenizer because it is supposed to be faster than String.split(), but I didn't test it (have a look here: http://stackoverflow.com/questions/691184/scanner-vs-stringtokenizer-vs-string-split).
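If you want to check the StringTokenizer vs. String.split() claim for your own data, a quick-and-dirty harness like this gives a first impression (the sample line is made up; for reliable numbers use a real benchmark harness such as JMH, since JIT warm-up distorts naive loops):

final String sample = "N 42 49.8728 8.6512";
final int runs = 1000000;

long start = System.nanoTime();
for (int i = 0; i < runs; i++) {
   StringTokenizer st = new StringTokenizer(sample);
   while (st.hasMoreTokens()) {
      st.nextToken();
   }
}
System.out.println("StringTokenizer: " + (System.nanoTime() - start) / 1000000 + " ms");

start = System.nanoTime();
for (int i = 0; i < runs; i++) {
   String[] tokens = sample.split(" ");
}
System.out.println("String.split():  " + (System.nanoTime() - start) / 1000000 + " ms");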

I hope this gives your implementation a performance boost!