How do I tokenize a text with a StreamTokenizer?

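Instantiate a StreamTokenizer with a Reader instance and loop through the available tokens with nextToken. This method returns an int that identifies the type of token that was read (the same value is stored in the ttype field). These are the possibilities: StreamTokenizer.TT_WORD (a word, available in sval), StreamTokenizer.TT_NUMBER (a number, available in nval), StreamTokenizer.TT_EOL (an end of line, returned only when eolIsSignificant is set), StreamTokenizer.TT_EOF (the end of the input), the quote character itself when a quoted string was read (its body is in sval), or the value of the character itself for any other ordinary character.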
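The token types listed above can be told apart with a switch on ttype. The following is a minimal sketch, not the original listing; the class name TokenKinds, the method kinds, and the labels it returns are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TokenKinds {
    // Names the kind of every token that nextToken() reports for the given text.
    static List<String> kinds(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> kinds = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:   kinds.add("WORD");   break; // text is in st.sval
                    case StreamTokenizer.TT_NUMBER: kinds.add("NUMBER"); break; // value is in st.nval
                    case StreamTokenizer.TT_EOL:    kinds.add("EOL");    break; // only with eolIsSignificant(true)
                    case '"':
                    case '\'':                      kinds.add("QUOTED"); break; // body is in st.sval
                    default:                        kinds.add("CHAR");   break; // ttype is the character itself
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return kinds;
    }

    public static void main(String[] args) {
        System.out.println(kinds("total = 42 + \"forty two\""));
    }
}
```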

 



This simple example shows you how to read in a text file and print out its tokens.
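A minimal sketch of such a program could look like the following. It is not the author's original listing: the class name TokenDump, the tokens helper, and the "word:", "number:", "string:" and "char:" prefixes are assumptions, and the file path is taken from the first command-line argument.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

class TokenDump {
    // Reads every token from the reader and renders each one as a printable string.
    static List<String> tokens(Reader in) throws IOException {
        StreamTokenizer st = new StreamTokenizer(in);
        List<String> out = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            if (st.ttype == StreamTokenizer.TT_WORD) {
                out.add("word: " + st.sval);
            } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                out.add("number: " + st.nval);
            } else if (st.ttype == '"' || st.ttype == '\'') {
                out.add("string: " + st.sval);        // body of a quoted string
            } else {
                out.add("char: " + (char) st.ttype);  // any other single character
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the text file to tokenize.
        try (Reader in = new BufferedReader(new FileReader(args[0]))) {
            for (String token : tokens(in)) {
                System.out.println(token);
            }
        }
    }
}
```

Wrapping the FileReader in a BufferedReader is a design choice rather than a requirement; StreamTokenizer accepts any Reader.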



If we run it on a text file, the output shows that /*, /, // and whitespace seem to be left out, and that anything that comes after a / is left out too. The reason is that a StreamTokenizer starts out with the following setup: the letters A to Z and a to z and the characters \u00A0 to \u00FF are word characters, the characters \u0000 to \u0020 are whitespace, / is a comment character, ' and " are quote characters, numbers are parsed, ends of lines are treated as whitespace, and C and C++ style comments are not recognized.
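The initial setup can be expressed with the tokenizer's own configuration methods. The sketch below clears the syntax table with resetSyntax and re-applies the documented defaults by hand, then tokenizes the same text both ways for comparison. The class name DefaultSetup and its helpers are assumptions for illustration, not part of the API.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DefaultSetup {
    // Clears the syntax table, then re-applies the documented defaults by hand.
    static void applyDefaults(StreamTokenizer st) {
        st.resetSyntax();              // every character becomes ordinary
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars(0xA0, 0xFF);
        st.whitespaceChars(0, ' ');    // \u0000 through \u0020
        st.commentChar('/');
        st.quoteChar('"');
        st.quoteChar('\'');
        st.parseNumbers();
    }

    // Tokenizes text with the built-in setup, or with the hand-applied one.
    static List<String> tokens(String text, boolean reapply) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (reapply) {
            applyDefaults(st);
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '"' || st.ttype == '\'') {
                    out.add(st.sval);
                } else if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(String.valueOf(st.nval));
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "size = 10 'a b' / dropped\nnext";
        System.out.println(tokens(text, false));
        System.out.println(tokens(text, true));
    }
}
```

Both calls in main should print the same list, and the text after the / is missing from it because / starts a comment by default.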

 


You can customize the StreamTokenizer in a number of ways:

1. wordChars(int lo, int hi)

The lo and hi parameters specify the inclusive range of character codes that you would like to see treated as part of a word. You can call this method several times to include several ranges.
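As an illustration, the sketch below adds the underscore to the word characters so that identifiers such as max_len come back as a single token. The class name WordCharsDemo and the method words are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WordCharsDemo {
    // Tokenizes text, optionally treating '_' as a word character.
    static List<String> words(String text, boolean underscoreInWords) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (underscoreInWords) {
            st.wordChars('_', '_'); // a range of exactly one character
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("max_len", false)); // prints [max, _, len]
        System.out.println(words("max_len", true));  // prints [max_len]
    }
}
```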

 


2. whitespaceChars(int lo, int hi)

The lo and hi parameters specify the inclusive range of character codes that you would like to see treated as whitespace, which separates tokens and is never returned itself. You can call this method several times to include several ranges.
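One possible use is splitting simple comma-separated text by declaring the comma to be whitespace. This is only a sketch (it does not handle quoting or empty fields), and the class name CommaSplit and the method fields are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CommaSplit {
    // Returns the words of a comma-separated line, with ',' acting as whitespace.
    static List<String> fields(String line) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(line));
        st.whitespaceChars(',', ','); // commas now only separate tokens
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fields("red,green,blue")); // prints [red, green, blue]
    }
}
```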

 


3. ordinaryChars(int lo, int hi)

The lo and hi parameters specify the inclusive range of character codes that you would like to see treated as ordinary characters, meaning they are not part of a word, number, whitespace, comment, or quoted string. nextToken returns each of them as a single-character token whose type is the character itself. There is a variation on this method, ordinaryChar(int ch), that takes a single character.
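For example, making the characters from * to / ordinary (that is *, +, the comma, -, the period and /) turns each arithmetic operator into its own token, and / no longer starts a comment. The sketch below assumes an input without numbers, because - and the period lose their numeric meaning as well; the class name OrdinaryCharsDemo is an assumption for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class OrdinaryCharsDemo {
    // Tokenizes text, optionally making the characters '*' through '/' ordinary.
    static List<String> tokens(String text, boolean operatorsOrdinary) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (operatorsOrdinary) {
            st.ordinaryChars('*', '/'); // covers * + , - . /
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a*b / c-d", false)); // prints [a, *, b]
        System.out.println(tokens("a*b / c-d", true));  // prints [a, *, b, /, c, -, d]
    }
}
```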

 


4. commentChar(int ch)

Specifies that ch starts a single-line comment, meaning the character and everything after it up to the end of the line are ignored.
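A typical case is shell-style comments that begin with #. The following sketch makes # a comment character; the class name HashComments and the sample text are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class HashComments {
    // Returns the words in text, skipping everything from '#' to the end of each line.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.commentChar('#');
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [name, esus, lang, java]
        System.out.println(words("name esus # site\nlang java"));
    }
}
```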

 


5. quoteChar(int ch)

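Tells the tokenizer that ch delimits string constants: all characters between a matching pair of ch characters are treated as one string. For such a token, nextToken returns ch as the token type and the body of the string is available in sval.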
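As a sketch, the code below declares | as an additional quote character, so that text between two | characters comes back as one token even though it contains a space. The class name QuoteCharDemo and the choice of | are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QuoteCharDemo {
    // Tokenizes text with '|' acting as a string delimiter.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.quoteChar('|');
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD || st.ttype == '|') {
                    out.add(st.sval); // a word, or the body of a |...| string
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("say |hello world| now")); // prints [say, hello world, now]
    }
}
```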

 


6. parseNumbers()

This tells the tokenizer that the digits 0 to 9, the period, and the minus sign should be recognized as parts of a TT_NUMBER token whenever a number can be constructed from them. By default, parseNumbers is set. You can have . and - treated otherwise, but then you have to call ordinaryChar on them, followed by wordChars if they should become part of words.
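The sketch below shows both sides: reading TT_NUMBER tokens through nval with the default setup, and making the minus sign part of words by calling ordinaryChar followed by wordChars. The class name NumbersDemo and its two methods are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class NumbersDemo {
    // Collects the numeric tokens of text using the default number parsing.
    static List<Double> numbers(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<Double> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add(st.nval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Tokenizes text, optionally turning '-' from a numeric character into a word character.
    static List<String> tokens(String text, boolean hyphenInWords) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        if (hyphenInWords) {
            st.ordinaryChar('-');   // drop the numeric meaning first
            st.wordChars('-', '-'); // then make it part of words
        }
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(numbers("-3.5 apples 42"));  // prints [-3.5, 42.0]
        System.out.println(tokens("well-known", false)); // prints [well, -, known]
        System.out.println(tokens("well-known", true));  // prints [well-known]
    }
}
```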

7. eolIsSignificant(boolean b)

If b is set, TT_EOL will be returned whenever an end of line is encountered. Otherwise, ends of lines are treated as whitespace and only separate tokens.
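A short sketch of the effect follows. The class name EolDemo and the <EOL> marker it emits for each TT_EOL token are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class EolDemo {
    // Tokenizes text and marks every reported end of line with "<EOL>".
    static List<String> tokens(String text, boolean significant) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(significant);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_EOL) {
                    out.add("<EOL>");
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("a b\nc", true));  // prints [a, b, <EOL>, c]
        System.out.println(tokens("a b\nc", false)); // prints [a, b, c]
    }
}
```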

 


8. slashStarComments(boolean b)

If b is set, all characters between /* and */ are ignored (C style comments).

9. slashSlashComments(boolean b)

If b is set, // is recognized as the start of a comment and the rest of the line is ignored (C++ style comments).
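The two comment switches can be combined. The sketch below also calls ordinaryChar('/') first, on the assumption that a lone / should come back as a token rather than start a comment, which is what the default setup would do. The class name SlashComments and the sample text are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SlashComments {
    // Tokenizes text, skipping /* ... */ and // comments but keeping a lone '/'.
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.ordinaryChar('/');         // a single / is no longer a comment character
        st.slashStarComments(true);   // skip /* ... */
        st.slashSlashComments(true);  // skip // to the end of the line
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                } else {
                    out.add(String.valueOf((char) st.ttype));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, b, c, /, d]
        System.out.println(tokens("a /* hidden */ b // tail\nc / d"));
    }
}
```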

10. lowerCaseMode(boolean lc)

If lc is set, all word tokens are converted to lowercase before they are returned in sval.
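A brief sketch of lowercase mode, with the class name LowerCaseDemo assumed for illustration:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class LowerCaseDemo {
    // Returns the word tokens of text with lowercase mode switched on.
    static List<String> words(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.lowerCaseMode(true);
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello WORLD")); // prints [hello, world]
    }
}
```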

11. pushBack()

"Pushes" the last token that was returned back onto the stream. The next call to nextToken returns that same token again, with ttype, sval and nval unchanged.

There is also a lineno() method that you can call at any time to get the current line number.
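Both pushBack and lineno can be seen in a small sketch. The class name PushBackDemo, its two methods, and the word@line output format are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class PushBackDemo {
    // Reads a token, pushes it back, reads it again, then reads the following token.
    static String peekTwice(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        try {
            st.nextToken();
            String first = st.sval;
            st.pushBack();          // the next nextToken() returns the same token
            st.nextToken();
            String again = st.sval;
            st.nextToken();
            String next = st.sval;
            return first + "," + again + "," + next;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pairs every word with the line number reported by lineno() right after reading it.
    static List<String> wordsWithLines(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add(st.sval + "@" + st.lineno());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(peekTwice("alpha beta"));       // prints alpha,alpha,beta
        System.out.println(wordsWithLines("a\nb\n\nc"));   // prints [a@1, b@2, c@4]
    }
}
```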


Further Information
Author of answer: Joris Van den Bogaert


Copyright © 2000-2003 Esus.com - All Rights Reserved 