About
A lexer, also known as a tokenizer or scanner, performs lexical analysis: it splits input text into tokens. This page lists the ways to do that in Java.
API
Scanner
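java.util.Scanner is a simple text scanner that can parse primitive types and strings using regular expressions.
Scanner:
- parses primitive data types
- is very flexible (the delimiter is a regular expression)
- but does not return an array of strings: tokens are read one at a time.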
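// requires: import java.util.Scanner;
// The delimiter is the word "fish" surrounded by optional whitespace.
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt()); // 1
System.out.println(s.nextInt()); // 2
System.out.println(s.next());    // red
System.out.println(s.next());    // blue
s.close();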
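Because Scanner hands back tokens one at a time, a lexer-like loop has to collect them itself. The following is a minimal sketch of that idea, not a definitive implementation: the class name `ScannerTokens`, the method `tokenize`, and the `INT(...)` / `WORD(...)` labels are hypothetical names chosen for illustration. It uses the default whitespace delimiter and `hasNextInt()` to decide how each token is read.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerTokens {

    // Hypothetical helper: labels each whitespace-separated token
    // as INT(...) or WORD(...) and collects the labels in a list.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        try (Scanner s = new Scanner(input)) {
            while (s.hasNext()) {
                if (s.hasNextInt()) {
                    // The next token parses as an int: read it as a primitive.
                    tokens.add("INT(" + s.nextInt() + ")");
                } else {
                    // Otherwise read it as a plain string.
                    tokens.add("WORD(" + s.next() + ")");
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [INT(1), WORD(fish), INT(2), WORD(fish), WORD(red), WORD(fish)]
        System.out.println(tokenize("1 fish 2 fish red fish"));
    }
}
```

The try-with-resources block closes the Scanner, which replaces the explicit `s.close()` call.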
String and Pattern
String.split() and Pattern.split() split the whole input on a single regular expression and return an array of strings.
The delimiter cannot be changed halfway through depending on a particular token.
String[] result = "this is a test".split("\\s");
for (int x = 0; x < result.length; x++) {
    System.out.println(result[x]);
}
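The snippet above only shows String.split(). As a rough sketch of the Pattern.split() variant, the example below precompiles the pattern once and reuses it. The class name `SplitDemo` and the helper `splitOnWhitespace` are hypothetical names. It also uses `\\s+` instead of `\\s`, because splitting on a single whitespace character yields empty strings when the input contains consecutive spaces.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitDemo {

    // Compiled once so the regex is not recompiled on every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: splits on runs of one or more whitespace characters.
    static String[] splitOnWhitespace(String input) {
        return WHITESPACE.split(input);
    }

    public static void main(String[] args) {
        String input = "this  is a test"; // two spaces after "this"

        // Single-character delimiter: prints [this, , is, a, test]
        System.out.println(Arrays.toString(input.split("\\s")));

        // Run-of-whitespace delimiter: prints [this, is, a, test]
        System.out.println(Arrays.toString(splitOnWhitespace(input)));
    }
}
```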
StringTokenizer
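StringTokenizer is essentially designed for pulling out tokens separated by a fixed set of delimiter characters (whitespace by default). It is a legacy class: its Javadoc discourages use in new code and recommends String.split() or the java.util.regex package instead.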
StringTokenizer st = new StringTokenizer("this is a test");
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}
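The example above relies on the default whitespace delimiters. As a small sketch under stated assumptions, the three-argument constructor accepts a custom set of delimiter characters and a flag that controls whether the delimiters themselves are returned as tokens. The class name `StringTokenizerDemo` and the helper `tokenize` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StringTokenizerDemo {

    // Hypothetical helper: collects all tokens into a list.
    // Each character of delimiterChars is a separate delimiter.
    static List<String> tokenize(String input, String delimiterChars, boolean returnDelims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delimiterChars, returnDelims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Delimiters dropped: prints [a, b, c]
        System.out.println(tokenize("a+b-c", "+-", false));

        // Delimiters returned as one-character tokens: prints [a, +, b, -, c]
        System.out.println(tokenize("a+b-c", "+-", true));
    }
}
```

Note that "+-" is treated as the two delimiter characters `+` and `-`, not as the substring "+-".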
Utility
- OpenNLP - its Sentence Detector performs sentence splitting
- Stanford Parser - gets the dependency relations between words