How do I tokenize a text with a StreamTokenizer?
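Instantiate a StreamTokenizer, pass it a Reader instance, and loop through the available tokens with nextToken. This method returns an integer that identifies the type of token that was read; the same value is stored in the ttype field. These are the possibilities:
- TT_WORD: a word was read; its text is in the sval field.
- TT_NUMBER: a number was read; its value is in the nval field (a double).
- TT_EOL: the end of a line was reached (only returned after eolIsSignificant(true)).
- TT_EOF: the end of the stream was reached.
- Any other value: an ordinary character, returned as its character code, or a quote character, in which case the quoted string is in sval.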
This simple example shows you how to read in a text file and print out its tokens.
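A minimal sketch of such a program follows. The class name TokenDump, the method describeTokens and the output labels are illustrative assumptions, not part of the StreamTokenizer API; only the StreamTokenizer calls themselves come from java.io.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used for illustration only.
class TokenDump {
    // Reads every token from the reader and describes it as one line of text.
    static List<String> describeTokens(Reader reader) {
        StreamTokenizer st = new StreamTokenizer(reader);
        List<String> lines = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        lines.add("WORD: " + st.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        lines.add("NUMBER: " + st.nval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        lines.add("EOL");
                        break;
                    default:
                        // Quote characters fill sval; ordinary characters do not.
                        lines.add(st.sval != null
                                ? "QUOTED: " + st.sval
                                : "CHAR: " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Pass a file name as the first argument; otherwise a built-in sample is used.
        try (Reader reader = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("width = 42 'two words'")) {
            describeTokens(reader).forEach(System.out::println);
        }
    }
}
```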
If we run it on the following text file:
it produces the following result.
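Notice that /*, /, // and whitespace seem to be left out! In addition, anything that comes after a / is left out too! The reason for this is that StreamTokenizer has an initial setup:
- The characters A-Z, a-z and \u00A0 through \u00FF are word characters.
- The characters \u0000 through \u0020 are whitespace.
- / is a comment character.
- The single quote ' and the double quote " are quote characters.
- Numbers are parsed.
- Ends of lines are treated as whitespace, not as separate tokens.
- C-style (/* */) and C++-style (//) comments are not recognized.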
You can customize the StreamTokenizer in a number of ways:
1. wordChars(int lo, int hi)
The lo and hi parameters specify the Unicode range of characters that should be treated as part of a word. You can call this method several times to include several ranges. Try this after you have instantiated the StreamTokenizer:
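A small sketch, assuming you want the underscore to count as part of a word. The class name WordCharsDemo and the sample input are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class WordCharsDemo {
    // Returns the word tokens of the text, with '_' allowed inside words.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.wordChars('_', '_'); // a one-character range: just the underscore
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("user_name max_len")); // [user_name, max_len]
    }
}
```

Without the wordChars call, the underscore would come back as an ordinary character between the words user and name.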
2. whitespaceChars(int lo, int hi)
The lo and hi parameters specify the Unicode range of characters that should be treated as whitespace. You can call this method several times to include several ranges. Try this:
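A small sketch, assuming a list whose items are separated by commas or semicolons. The class name WhitespaceCharsDemo and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class WhitespaceCharsDemo {
    // Returns the word tokens, treating ',' and ';' as separators.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.whitespaceChars(',', ','); // first range: the comma
        st.whitespaceChars(';', ';'); // second call adds another range
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("red,green;blue")); // [red, green, blue]
    }
}
```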
3. ordinaryChars(int lo, int hi)
The lo and hi parameters specify the Unicode range of characters that should be treated as ordinary characters, meaning they are not part of a word, a number, whitespace, a quote or a comment. Each one is returned by nextToken as a single character. There's a variation on this method, ordinaryChar(int ch), that takes only one parameter. Try this:
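A small sketch, assuming you want -, . and / (which are adjacent in the character table) returned as single-character tokens instead of being read as number parts or as a comment start. The class name OrdinaryCharsDemo and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class OrdinaryCharsDemo {
    // Returns all tokens as strings, with '-', '.' and '/' made ordinary.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChars('-', '/'); // covers '-', '.' and '/'
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval));
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a/b-c")); // [a, /, b, -, c]
    }
}
```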
4. commentChar(int ch)
Specifies that the value ch should be treated as a comment character, meaning the character plus the rest of the line is ignored. Try this:
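A small sketch, assuming shell-style comments that start with #. The class name CommentCharDemo and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class CommentCharDemo {
    // Returns the word tokens, skipping everything from '#' to the end of the line.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.commentChar('#');
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("alpha # ignored\nbeta")); // [alpha, beta]
    }
}
```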
5. quoteChar(int ch)
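Tells the tokenizer that all characters between a matching pair of the delimiter ch are treated as a string constant. nextToken then returns ch as the token type and puts the string, without the delimiters, in sval. Try this: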
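A small sketch, assuming the backtick should act as an additional quote character. The class name QuoteCharDemo, the "quoted: " label and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class QuoteCharDemo {
    // Returns words as-is and backtick-quoted strings with a "quoted: " prefix.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.quoteChar('`');
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else if (st.ttype == '`') {
                    // For a quoted string, ttype is the quote character itself.
                    out.add("quoted: " + st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("name `John Smith` age"));
    }
}
```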
6. parseNumbers()
This tells the tokenizer that the characters 0 to 9, the period and the minus sign should be recognized as part of a TT_NUMBER token, if one can be constructed. By default, parseNumbers is set. You can have . and - treated differently, but then you have to reset them with ordinaryChar, optionally followed by wordChars if they should count as word characters.
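A small sketch of the difference, assuming an input such as 5-3 where the minus sign is meant as an operator. The class name ParseNumbersDemo, the minusIsOrdinary flag and the output labels are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class ParseNumbersDemo {
    // Tokenizes the text, optionally taking '-' out of number parsing.
    static List<String> tokens(String text, boolean minusIsOrdinary) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (minusIsOrdinary) {
            st.ordinaryChar('-'); // '-' is now returned as its own token
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUM " + st.nval);
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);
                } else {
                    out.add("CHAR " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("5-3", false)); // [NUM 5.0, NUM -3.0]
        System.out.println(tokens("5-3", true));  // [NUM 5.0, CHAR -, NUM 3.0]
    }
}
```

With the default setup the minus sign is absorbed into the second number; after ordinaryChar('-') it comes back as a separate token.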
7. eolIsSignificant(boolean b)
If b is true, TT_EOL is returned whenever an end of line is encountered. Otherwise, line ends are treated as whitespace and only separate tokens. Try this:
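A small sketch, assuming you want to see where each line ends. The class name EolDemo, the "EOL" marker string and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class EolDemo {
    // Returns the word tokens, with "EOL" inserted at every line end.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    out.add("EOL");
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b\nc")); // [a, b, EOL, c]
    }
}
```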
8. slashStarComments(boolean b)
If b is true, all characters between /* and */ are ignored (C-style comments).
9. slashSlashComments(boolean b)
If b is true, // is recognized as the start of a comment and the rest of the line is ignored (C++-style comments).
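A small sketch that combines both comment flags. It assumes you also want a lone / returned as a token, which is why / is first made ordinary; in the default setup a single / is itself a comment character. The class name CommentFlagsDemo and the sample input are illustrative only.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class CommentFlagsDemo {
    // Tokenizes the text, skipping /* ... */ and // comments but keeping a lone '/'.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChar('/');        // a single '/' no longer starts a comment
        st.slashStarComments(true);  // skip /* ... */
        st.slashSlashComments(true); // skip // to the end of the line
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("x / y /* gone */ z // tail")); // [x, /, y, z]
    }
}
```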
10. lowerCaseMode(boolean lc)
If lc is true, all word tokens are lowercased when returned.
11. pushBack()
"Pushes" the last token that was returned back on the stream. The next time nextToken is invoked, it returns that same token again.
Then there's the method lineno() that you may call at any time to get the current line number.
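The last three features (lowerCaseMode, pushing a token back with pushBack, and reading the line number with lineno) can be sketched as follows. The class name TokenizerExtras, its method names and the sample inputs are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of the original answer.
class TokenizerExtras {
    // lowerCaseMode(true): word tokens come back lowercased.
    static List<String> lowerCased(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.lowerCaseMode(true);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) out.add(st.sval);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // pushBack(): the first word is returned twice, then the second word.
    // Assumes the text starts with at least two words.
    static List<String> readTwice(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            st.nextToken();
            out.add(st.sval);   // first word
            st.pushBack();
            st.nextToken();
            out.add(st.sval);   // first word again
            st.nextToken();
            out.add(st.sval);   // second word
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // lineno(): tags each word with the line it was read from.
    static List<String> wordsWithLines(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval + "@" + st.lineno());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lowerCased("Hello WORLD"));            // [hello, world]
        System.out.println(readTwice("alpha beta"));              // [alpha, alpha, beta]
        System.out.println(wordsWithLines("one\ntwo\n\nfour"));   // [one@1, two@2, four@4]
    }
}
```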
Author of answer: Joris Van den Bogaert