[hfs-user] Char set for HFS volumes

Mark Day mday@apple.com
Fri, 1 Mar 2002 08:02:59 -0800

On Friday, March 1, 2002, at 12:48  AM, Biswaroop Banerjee wrote:

>   Can anybody tell me which char set is understood in
>   HFS volumes. For e.g. in DOS only A-Z, 0-9 and _ are
>   the valid characters.
>   So, what is for HFS.

Names on HFS are at most 31 bytes (27 bytes for volume names) and can 
consist of any byte value except ASCII colon (":").  Note: that means a 
zero byte *is* valid, which can make things difficult for 
implementations that use zero-terminated C-style strings.
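
As a rough illustration of these rules, here is a minimal Python sketch 
of a name check.  The function name and constants are made up for this 
example and are not taken from any Apple source.  Rejecting empty names 
is my own assumption; the rest only encodes the limits stated above.

```python
HFS_MAX_FILE_NAME = 31    # bytes, limit for file and folder names
HFS_MAX_VOLUME_NAME = 27  # bytes, limit for volume names

def is_valid_hfs_name(name: bytes, is_volume: bool = False) -> bool:
    """Hypothetical helper: check a raw HFS name against the stated rules."""
    limit = HFS_MAX_VOLUME_NAME if is_volume else HFS_MAX_FILE_NAME
    # Assumption: an empty name is rejected. The length is counted in
    # bytes, not characters.
    if not 1 <= len(name) <= limit:
        return False
    # Any byte value is allowed except the ASCII colon (0x3A).
    return b":" not in name

print(is_valid_hfs_name(b"Read\x00Me"))   # zero byte is allowed
print(is_valid_hfs_name(b"bad:name"))     # colon is not
```

Because the name is handled as a bytes object with an explicit length, 
the embedded zero byte causes no trouble here, unlike with a 
zero-terminated string.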

Above I said bytes, not characters.  To support localizations to many 
languages, Mac OS supports a variety of character set encodings.  Some 
of those encodings use two bytes to represent a single character.  That 
means a file name made up entirely of two-byte characters can hold at 
most 15 of them, which would occupy 30 bytes.
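
To make the byte-versus-character distinction concrete, here is a small 
Python sketch.  It uses Python's "shift_jis" codec purely as a stand-in 
for a two-byte Macintosh encoding; that choice is an assumption made for 
illustration, since Python ships no codec for the Mac OS Japanese 
encoding itself.

```python
# 15 two-byte characters fill 30 of the 31 available bytes.
name = "\u3042" * 15                 # hiragana "a", repeated 15 times
encoded = name.encode("shift_jis")   # stand-in two-byte encoding
print(len(name), len(encoded))

# A 16th two-byte character would need 32 bytes and no longer fits.
too_long = ("\u3042" * 16).encode("shift_jis")
print(len(too_long) <= 31)
```

The first line printed is "15 30" and the second is "False".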

Offhand, I don't know if or where the various encodings are described.  
There may be documentation on Apple's developer web site.

Remember that HFS is case insensitive.  The definition of what 
characters are "upper case" or "lower case" is based on the MacRoman 
encoding.  MacRoman is similar to ISO Latin 1, though the byte values 
above 0x7F differ.  Take a look at the Darwin sources for code that does 
a case insensitive string compare using MacRoman (it will be called as 
part of the B-tree key comparison function for the catalog B-tree).
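
For a feel of what such a compare does, here is a deliberately 
simplified Python sketch.  It is not the Darwin routine: it leans on 
Python's "mac_roman" codec and Unicode lower-casing, whereas the real 
code may well use its own tables and ordering.  The function name is 
hypothetical.

```python
def macroman_names_equal(a: bytes, b: bytes) -> bool:
    """Hypothetical helper: approximate a case-insensitive MacRoman compare."""
    # Interpret the raw name bytes as MacRoman, then fold case with
    # Unicode rules. This only approximates the on-disk comparison.
    return a.decode("mac_roman").lower() == b.decode("mac_roman").lower()

print(macroman_names_equal(b"ReadMe", b"README"))
# 0x83 is an upper case E-acute and 0x8E a lower case e-acute in MacRoman.
print(macroman_names_equal(b"\x83t\x8e", b"\x8eT\x83"))
```

Both calls print "True".  Treat this only as a way to experiment; an 
implementation that has to interoperate should follow the Darwin code.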

>   Again, for writing into a HFS volume for creating a CD image can we 
> go for UNICODE .

I would advise against that.  While you can store just about any byte 
sequence (as long as it doesn't contain an ASCII colon), storing Unicode 
(e.g., UTF-8 or UTF-16) would make for garbage-looking filenames when 
viewed on a Macintosh.
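
Here is a short Python sketch of the effect, assuming the Macintosh 
interprets the stored bytes as MacRoman (other script systems would 
produce different garbage).

```python
original = "caf\u00e9"              # "cafe" with an acute accent on the e
stored = original.encode("utf-8")   # what a UTF-8 writer would put on disk
shown = stored.decode("mac_roman")  # how a MacRoman reader interprets it
print(len(original), len(stored))
print(shown)
```

The accented letter becomes two bytes in UTF-8 (0xC3 0xA9), and each of 
those is read as a separate MacRoman character, so the name shows up as 
"caf" followed by a square root sign and a copyright sign.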

>   The HFS volumes contain data in "Big Endian " format.
>   Can anybody tell me what are the fields which has to be
>   filled in Big Endian format.

Everything is big endian.  That even includes file names.  So, Macintosh 
encodings that use two bytes per character will store those two bytes in 
big endian form on HFS.  And the two bytes of each UTF-16 code unit are 
stored in big endian form on HFS Plus.
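
As a small illustration of big endian byte order, here is a Python 
sketch.  The integer values are arbitrary examples, not real HFS field 
contents.

```python
import struct

# ">" selects big endian: the most significant byte is written first.
field16 = struct.pack(">H", 0x1234)       # a 16-bit field
field32 = struct.pack(">I", 0x12345678)   # a 32-bit field
print(field16.hex(), field32.hex())

# An HFS Plus style name: each UTF-16 code unit with its high byte first.
name = "A\u00e9".encode("utf-16-be")
print(name.hex())
```

This prints "1234 12345678" and then "004100e9".  On a little endian 
host, writing such fields straight from memory would reverse the bytes, 
so each one needs an explicit conversion.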