2.13 Defining A Code Conversion Facet


File stream buffers are responsible for the transport of characters to and from an external device. In many cases, the character encoding used inside your program differs from the encoding used on the device. Hence the file stream buffer has to convert characters from one encoding to another each time it reads from or writes to the external device. (The section on internationalization in this User's Guide gives a detailed discussion of character encodings and explains a couple of typical code conversions. If you are not familiar with code conversions, we recommend you read about them before delving into the details of implementing one, which are explained in this section.)

A code conversion is not performed by the file stream buffer itself. This task is encapsulated in a code conversion facet. Each time the file stream buffer has to convert characters, it consults its locale's code conversion facet for the actual conversion. For this reason, file stream buffers and code conversion facets have to work together closely, and the file stream buffer depends on its locale's code conversion facet.

This clear separation of responsibilities enables you to change a file stream's behavior substantially, without touching the file stream class itself. All you have to do is provide a special code conversion facet. In doing so, you turn an ordinary file stream into one that converts, say, EBCDIC files on a mainframe's file system into a stream of ASCII characters for internal processing.

However, the task of implementing a code conversion facet requires a thorough understanding of the way file stream buffers and code conversion facets interact. In this section, we will use two examples to explain the principles of this interaction.

Before we move on to the examples, let's go through an overview of the different kinds of code conversions. As we will see later on, different types of code conversions require different kinds of implementations.

2.13.1 Categories of Code Conversions

Code conversions fall into various categories depending on the properties of the character encodings involved. There are:

Constant-size conversions, and
Multibyte conversions, which again fall into the categories of:
- State-independent conversions, and
- State-dependent conversions.

Constant-size conversions are between character encodings where all characters are of equal size. All single- or wide-character encodings are examples of such character encodings. Each single character stands for itself and can be recognized and translated independently of its context. Conversions between ASCII and EBCDIC, or Unicode and ISO10646, are examples of constant-size conversions.

Multibyte conversions involve multibyte encodings. In multibyte encodings, characters have varying size. Some multibyte characters consist of two or more bytes, while others are represented by just one byte.

There is a substantial difference between code conversions involving state-dependent character encodings, and conversions between state-independent encodings. (Again, see this User's Guide section on internationalization for further details.)

State-dependent multibyte conversions involve one character encoding that is state-dependent. In state-dependent character encodings, character sequences can have different meanings depending on the current context. State-dependent encodings typically have modes and escape sequences that allow switching between modes. An example of a state-dependent character conversion is the conversion between the state-dependent JIS encoding for Japanese characters and the Unicode wide-character encoding.

State-independent multibyte conversions do not have modes. A sequence of characters can always be interpreted independently of its context. An example of a state-independent multibyte conversion is the conversion between EUC, which is a state-independent multibyte encoding, and Unicode.
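
To make the state-independent case concrete, here is a minimal sketch that counts the characters in an EUC-JP byte sequence. It assumes the usual EUC-JP layout (one byte for ASCII, lead byte 0x8E for a two-byte character, lead byte 0x8F for a three-byte character, two bytes otherwise) and does no validation of trailing bytes. The function names are hypothetical and not part of the library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the EUC-JP character that starts
// with lead byte `b`. The answer depends on this byte alone, never on what
// came before it, which is what makes the encoding state-independent.
inline std::size_t eucjp_char_length(unsigned char b)
{
    if (b < 0x80)  return 1;   // ASCII range: one byte
    if (b == 0x8E) return 2;   // single shift 2: half-width katakana
    if (b == 0x8F) return 3;   // single shift 3: three-byte character
    return 2;                  // otherwise treated as a two-byte character
}

// Counts the complete characters in an EUC-JP byte sequence. A truncated
// character at the end of the sequence is not counted.
inline std::size_t eucjp_count_chars(const std::string& bytes)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        std::size_t len =
            eucjp_char_length(static_cast<unsigned char>(bytes[i]));
        if (i + len > bytes.size())
            break;             // incomplete multibyte character
        i += len;
        ++count;
    }
    return count;
}
```

Because the length of each character follows from its lead byte, the count can be resumed at any character boundary without remembering anything about the preceding text.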

2.13.2 Example 1 -- Defining a Tiny Character Code Conversion (ASCII and EBCDIC)

As an example of how file stream buffers and code conversion facets collaborate, we would now like to implement a code conversion facet that can translate text files encoded in EBCDIC into character streams encoded in ASCII. The conversion between ASCII characters and EBCDIC characters is a constant-size code conversion where each character is represented by one byte. Hence the conversion can be done on a character-by-character basis.

To implement and use an ASCII-EBCDIC code conversion facet, we will:

Derive a new facet type from the standard code conversion facet type codecvt.
Specialize the new facet type for the character type char.
Implement the member functions that are used by the file buffer.
Imbue a file stream's buffer with a locale that carries an ASCII-EBCDIC code conversion facet.

The following sections will explain these steps in detail.

2.13.2.1 Derive a New Facet Type

Here is the new code conversion facet type AsciiEbcdicConversion:

template <class internT, class externT, class stateT>
class AsciiEbcdicConversion
: public codecvt<internT, externT, stateT>
{
};

It is empty because we will specialize the class template for the character type char.

2.13.2.2 Specialize the New Facet Type and Implement the Member Functions

Each code conversion facet has two main member functions, in() and out():

Function in() is responsible for the conversion done on reading from the external device; and
Function out() is responsible for the conversion necessary for writing to the external device.

The other member functions of a code conversion facet used by a file stream buffer are:

The function always_noconv(), which returns true if no conversion is performed by the facet. This is because file stream buffers might want to bypass the code conversion facet when no conversion is necessary; e.g., when the external encoding is identical to the internal. Our facet obviously will perform a conversion and does not want to be bypassed, so always_noconv() will return false in our example.
The function encoding(), which provides information about the type of conversion; i.e., whether it is state-dependent or constant-size, etc. In our example, the conversion is constant-size. For a constant-size conversion, encoding() is supposed to return the fixed number of external characters needed to produce one internal character, which is 1 here because each EBCDIC byte on the device corresponds to exactly one ASCII character in the file buffer.

All public member functions of a facet call the respective, protected virtual member function, named do_...(). Here is the declaration of the specialized facet type:

template <>
class AsciiEbcdicConversion<char, char, mbstate_t>
: public codecvt<char, char, mbstate_t>
{
protected:
 
 result do_in(mbstate_t& state
              ,const char* from, const char* from_end, const char*& from_next
              ,char* to        , char* to_limit      , char*& to_next) const;
 
 result do_out(mbstate_t& state
              ,const char* from, const char* from_end, const char*& from_next
              ,char* to        , char* to_limit      , char*& to_next) const;

 bool do_always_noconv() const throw()
 { return false; }
 
 int do_encoding() const throw()
 { return  1; }
 
};

For the sake of brevity, we implement only those functions used by Rogue Wave's implementation of file stream buffers. If you want to provide a code conversion facet that is more widely usable, you would also have to implement the functions do_length() and do_max_length().

The implementation of the functions do_in() and do_out() is straightforward. Each of the functions translates a sequence of characters in the range [from,from_end) into the corresponding sequence, which is written to the range [to,to_limit). On return, the pointers from_next and to_next point one beyond the last character successfully converted in the source and destination ranges, respectively. In principle, you can do whatever you want, or whatever it takes, in these functions. However, for effective communication with the file stream buffer, it is important to indicate success or failure properly.
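
As an illustration of what such an implementation might look like, here is a minimal sketch of a facet with both functions filled in. It is not the guide's own implementation: the class name AsciiEbcdicSketch, the range table, and the helper ebcdic_to_ascii() are hypothetical, and the table covers only the space, the digits, and the unaccented letters at their EBCDIC code page 037 positions. Any other character is reported as an error.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Assumption: a deliberately partial mapping (space, digits, letters) using
// EBCDIC code page 037 positions. A real facet needs all 256 codes.
struct EbcdicRange { unsigned char ebcdic; char ascii; int count; };
const EbcdicRange kRanges[] = {
    {0x40, ' ', 1}, {0xF0, '0', 10},
    {0xC1, 'A', 9}, {0xD1, 'J', 9}, {0xE2, 'S', 8},
    {0x81, 'a', 9}, {0x91, 'j', 9}, {0xA2, 's', 8}
};

inline bool to_ascii(unsigned char e, char& a)
{
    for (const EbcdicRange& r : kRanges)
        if (e >= r.ebcdic && e < r.ebcdic + r.count) {
            a = static_cast<char>(r.ascii + (e - r.ebcdic));
            return true;
        }
    return false;
}

inline bool to_ebcdic(char a, char& e)
{
    for (const EbcdicRange& r : kRanges)
        if (a >= r.ascii && a < r.ascii + r.count) {
            e = static_cast<char>(r.ebcdic + (a - r.ascii));
            return true;
        }
    return false;
}

class AsciiEbcdicSketch : public std::codecvt<char, char, std::mbstate_t>
{
protected:
    // External (EBCDIC) to internal (ASCII), used when reading.
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 char* to, char* to_limit, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_limit) return partial;   // no room left
            if (!to_ascii(static_cast<unsigned char>(*from_next), *to_next))
                return error;                          // unmapped byte
            ++from_next;
            ++to_next;
        }
        return ok;
    }

    // Internal (ASCII) to external (EBCDIC), used when writing.
    result do_out(std::mbstate_t&,
                  const char* from, const char* from_end, const char*& from_next,
                  char* to, char* to_limit, char*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_next != from_end) {
            if (to_next == to_limit) return partial;
            if (!to_ebcdic(*from_next, *to_next)) return error;
            ++from_next;
            ++to_next;
        }
        return ok;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 1; }
};

// Hypothetical helper: runs in() directly on a byte string. The result code
// is ignored here; conversion simply stops at the first unmapped byte.
inline std::string ebcdic_to_ascii(const std::string& in)
{
    AsciiEbcdicSketch cvt;
    std::mbstate_t state = std::mbstate_t();
    std::string out(in.size(), '\0');
    const char* from_next = 0;
    char* to_next = 0;
    cvt.in(state, in.data(), in.data() + in.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Calling the public in() function directly, as the helper does, is a convenient way to exercise a facet before it is attached to a file stream buffer.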

2.13.2.3 Use the New Code Conversion Facet

Here is an example of how the new code conversion facet can be used:

fstream inout("/tmp/fil");                                    //1
// The facet is allocated with new because the locale takes ownership of it.
locale cvtloc(locale(), new AsciiEbcdicConversion<char,char,mbstate_t>);
inout.rdbuf()->pubimbue(cvtloc);                              //2
cout << inout.rdbuf();                                        //3

//1	When a file stream is created, a snapshot of the current global locale is attached as the default locale. Remember that a stream has two locale objects: one used for formatting numeric items, and a second used by the stream's buffer for code conversions.
//2	Here the stream buffer's locale is replaced by a copy of the global locale that has an ASCII-EBCDIC code conversion facet.
//3	The content of the EBCDIC file "/tmp/fil" is read, automatically converted to ASCII, and written to cout.
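
One detail worth keeping in mind when installing a facet: a facet constructed with a reference count argument of 0, which is the default, is owned by the locale it is installed in and is deleted when the last locale holding it is destroyed. The sketch below makes this visible with a destructor counter. CountingCodecvt, destroyed_facets, and facets_destroyed_by_locale() are hypothetical names used only for this illustration.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Counts destructor calls so that the ownership rule can be observed.
static int destroyed_facets = 0;

// Hypothetical stand-in for a user-defined code conversion facet.
class CountingCodecvt : public std::codecvt<char, char, std::mbstate_t>
{
public:
    // refs == 0 (the default) hands ownership to the locale.
    explicit CountingCodecvt(std::size_t refs = 0)
        : std::codecvt<char, char, std::mbstate_t>(refs) {}

    ~CountingCodecvt() { ++destroyed_facets; }
};

// Installs a heap-allocated facet in a short-lived locale and returns how
// many facets were destroyed once that locale went out of scope.
inline int facets_destroyed_by_locale()
{
    destroyed_facets = 0;
    {
        std::locale cvtloc(std::locale(), new CountingCodecvt);
        // cvtloc now owns the facet; user code must not delete it.
    }
    return destroyed_facets;
}
```

For this reason a facet that is handed to a locale is normally allocated with new, or constructed with a non-zero reference count if the program wants to manage its lifetime itself.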

2.13.3 Error Indication in Code Conversion Facets

Since file stream buffers depend on their locale's code conversion facet, it is important to understand how they communicate. On writing to the external device, the file stream buffer hands over the content of its internal character buffer, partially or entirely, to the code conversion facet; i.e., to its out() function. It expects to receive a converted character sequence that it can write to the external device. The reverse takes place, using the in() function, on reading from the external file.

For the file stream buffer and the code conversion facet to work together effectively, the two main functions in() and out() must indicate error situations in the way the file stream buffer expects.

There are four possible return codes for the functions in() and out():

ok, which should obviously be returned when the conversion went fine.
partial, which should be returned when the code conversion reaches the end of the input sequence [from,from_end) before a new character can be created. The file stream buffer's reaction to partial is to provide more characters and call the code conversion facet again, in order to successfully complete the conversion.
error, which indicates a violation of the conversion rules; i.e., the character sequence to be converted does not obey the expected rules and thus cannot be recognized and converted. In this situation, the file stream buffer stops doing anything, and the file stream eventually sets its state to badbit and throws an exception if appropriate.
noconv, which is returned if no conversion was needed.
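
The following sketch shows some of these return codes in action. It uses a hypothetical toy facet, TwoByteSketch, in which every internal character is stored externally as exactly two bytes, high byte first. This encoding is an assumption made for illustration and is unrelated to the examples in this section. The helpers run_in() and classic_char_in() are likewise hypothetical.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>

// Hypothetical toy facet: two external bytes per internal character.
class TwoByteSketch : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_in(std::mbstate_t&,
                 const char* from, const char* from_end, const char*& from_next,
                 wchar_t* to, wchar_t* to_limit, wchar_t*& to_next) const override
    {
        from_next = from;
        to_next = to;
        while (from_end - from_next >= 2 && to_next != to_limit) {
            unsigned char hi = static_cast<unsigned char>(from_next[0]);
            unsigned char lo = static_cast<unsigned char>(from_next[1]);
            *to_next++ = static_cast<wchar_t>((hi << 8) | lo);
            from_next += 2;
        }
        // Leftover input (half a character, or no room in the output) means
        // the caller has to come back with more input or more space.
        return from_next == from_end ? ok : partial;
    }

    bool do_always_noconv() const noexcept override { return false; }
    int do_encoding() const noexcept override { return 2; }
};

struct InResult {
    std::codecvt_base::result code;
    std::size_t consumed;   // external bytes converted
    std::size_t produced;   // internal characters created
};

// Converts `n` bytes with the toy facet and reports what happened.
inline InResult run_in(const char* bytes, std::size_t n)
{
    TwoByteSketch cvt;
    std::mbstate_t state = std::mbstate_t();
    wchar_t out[8];
    const char* from_next = 0;
    wchar_t* to_next = 0;
    InResult r;
    r.code = cvt.in(state, bytes, bytes + n, from_next, out, out + 8, to_next);
    r.consumed = static_cast<std::size_t>(from_next - bytes);
    r.produced = static_cast<std::size_t>(to_next - out);
    return r;
}

// The standard char-to-char facet performs no conversion at all.
inline std::codecvt_base::result classic_char_in()
{
    typedef std::codecvt<char, char, std::mbstate_t> Cvt;
    const Cvt& cvt = std::use_facet<Cvt>(std::locale::classic());
    std::mbstate_t state = std::mbstate_t();
    const char src[] = "abc";
    char dst[3];
    const char* from_next = 0;
    char* to_next = 0;
    return cvt.in(state, src, src + 3, from_next, dst, dst + 3, to_next);
}
```

With three input bytes, run_in() reports partial after consuming two bytes and producing one character; the caller is expected to supply the missing byte and call in() again, starting at from_next.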

2.13.4 Example 2 -- Defining a Multibyte Character Code Conversion (JIS and Unicode)

Let us consider the example of a state-dependent code conversion. As mentioned previously, this type of conversion would occur between JIS, which is a state-dependent multibyte encoding for Japanese characters, and Unicode, which is a wide-character encoding. As usual, we assume that the external device uses multibyte encoding, and the internal processing uses wide-character encoding.

Here is what you have to do to implement and use a state-dependent code conversion facet:

Define a new conversion state type if necessary.
Define a new character traits type if necessary, or instantiate the character traits template with the new state type.
Define the code conversion facet.
Instantiate new stream types using the new character traits type.
Imbue a file stream's buffer with a locale that carries the new code conversion facet.

These steps are explained in detail in the following sections.

2.13.4.1 Define a New Conversion State Type

While parsing or creating a sequence of multibyte characters in a state-dependent multibyte encoding, the code conversion facet has to maintain a conversion state. This state is by default of type mbstate_t, which is the implementation-dependent state type defined by the C library. If this type does not suffice to keep track of the conversion state, you have to provide your own conversion state type. We will see how this is done in the code below, but please note first that the new state type must have the following member functions:

A constructor. The argument 0 has the special meaning of creating a conversion state object that represents the initial conversion state;
Copy constructor and assignment;
Comparison for equality and inequality.

Now here is the sketch of a new conversion state type:

class JISstate_t {
public:
    JISstate_t( int state=0 )
    : state_(state) { ; }

    JISstate_t(const JISstate_t& state)
    : state_(state.state_) { ; }

    JISstate_t& operator=(const JISstate_t& state)
    {
        if ( &state != this )
            state_= state.state_;
        return *this;
    }

    JISstate_t& operator=(const int state)
    {
        state_= state;
        return *this;
    }

    bool operator==(const JISstate_t& state) const
    {
        return ( state_ == state.state_ );
    }

    bool operator!=(const JISstate_t& state) const
    {
        return ( !(state_ == state.state_) );
    }

private:
    int state_;
};

2.13.4.2 Define a New Character Traits Type

The conversion state type is part of the character traits. Hence, with a new conversion state type, you need a new character traits type.

Rogue Wave's implementation of the Standard C++ Library has a non-standard extension to the standard character traits class template char_traits. The extension is an additional template parameter for the conversion state type. For this reason, you can create a new character traits type by instantiating the character traits with your new conversion state type:

char_traits<wchar_t, JISstate_t>

However, if you do not want to rely on a non-standard and thus non-portable feature of the library, you have to define a new character traits type and redefine the necessary types:

struct JIS_char_traits: public char_traits<wchar_t> 
{
        typedef JISstate_t                state_type;
        typedef fpos<state_type>          pos_type;
        // streamoff is the offset type the standard char_traits<wchar_t> uses
        typedef streamoff                 off_type;
};
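
A quick way to confirm that a traits type like this really carries the new state type is a handful of compile-time checks. The sketch below is self-contained, so it repeats a reduced stand-in for the state type; it uses std::streamoff for off_type, which is what the standard char_traits<wchar_t> itself uses.

```cpp
#include <ios>
#include <string>
#include <type_traits>

// Reduced stand-in for the conversion state type sketched earlier.
class JISstate_t {
public:
    JISstate_t(int state = 0) : state_(state) {}
    bool operator==(const JISstate_t& other) const { return state_ == other.state_; }
    bool operator!=(const JISstate_t& other) const { return state_ != other.state_; }
private:
    int state_;
};

struct JIS_char_traits : public std::char_traits<wchar_t>
{
    typedef JISstate_t            state_type;
    typedef std::fpos<state_type> pos_type;
    typedef std::streamoff        off_type;
};

// The character type is inherited unchanged from char_traits<wchar_t> ...
static_assert(std::is_same<JIS_char_traits::char_type, wchar_t>::value,
              "char_type should still be wchar_t");
// ... while the state type is replaced by the user-defined one.
static_assert(std::is_same<JIS_char_traits::state_type, JISstate_t>::value,
              "state_type should be JISstate_t");
static_assert(!std::is_same<JIS_char_traits::state_type,
                            std::char_traits<wchar_t>::state_type>::value,
              "state_type should differ from the default");
```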

2.13.4.3 Define the Code Conversion Facet

Just as in the first example, you have to define the actual code conversion facet. The steps are basically the same as before, too: define a new class template for the new code conversion type and specialize it. The code would look like this:

template <class internT, class externT, class stateT>
class UnicodeJISConversion
: public codecvt<internT, externT, stateT>
{
};

template <>
class UnicodeJISConversion<wchar_t, char, JISstate_t>
: public codecvt<wchar_t, char, JISstate_t>
{
protected:
 
 result do_in(JISstate_t& state,
              const char*  from,
              const char*  from_end,
              const char*& from_next,
              wchar_t*     to, 
              wchar_t*     to_limit,
              wchar_t*&    to_next) const;

 result do_out(JISstate_t& state,
               const wchar_t*  from,
               const wchar_t*  from_end,
               const wchar_t*& from_next,
               char*           to,
               char*           to_limit, 
               char*&          to_next) const;

 bool do_always_noconv() const throw()
 { return false; }
 
 int do_encoding() const throw()
 { return -1; }
 
};

In this case, the function do_encoding() has to return -1, which identifies the code conversion as state-dependent. Again, the functions in() and out() have to conform to the error indication policy explained under class codecvt in the Class Reference.

The distinguishing characteristic of a state-dependent conversion is that the conversion state argument to in() and out() is used for communication between the file stream buffer and the code conversion facet. The file stream buffer is responsible for creating, maintaining, and deleting the conversion state. At the beginning, the file stream buffer creates a conversion state object that represents the initial conversion state and hands it over to the code conversion facet. The facet modifies it according to the conversion it performs. The file stream buffer receives it and stores it between two subsequent code conversions.
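
The role of the conversion state can be illustrated without a full facet. The sketch below counts characters in a byte sequence that switches modes with escape sequences modeled on ISO-2022-JP (ESC ( B for one-byte ASCII, ESC $ B for two-byte characters). The caller owns the state and passes it in again for the next chunk, much as the file stream buffer does with in() and out(). The names JisMode and count_jis_chars() are hypothetical, and the sketch assumes that escape sequences and two-byte characters are never split across chunks.

```cpp
#include <cstddef>
#include <string>

// Hypothetical conversion modes; mode_ascii is the initial state.
enum JisMode { mode_ascii = 0, mode_kanji = 1 };

// Counts the characters in `chunk`, starting in mode `state`. On return,
// `state` holds the mode reached at the end of the chunk, so that the caller
// can hand it back in for the next chunk.
inline std::size_t count_jis_chars(const std::string& chunk, JisMode& state)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < chunk.size()) {
        if (chunk[i] == '\x1B' && i + 2 < chunk.size()) {
            // Escape sequence: switches the mode, produces no character.
            if (chunk[i + 1] == '$' && chunk[i + 2] == 'B') state = mode_kanji;
            if (chunk[i + 1] == '(' && chunk[i + 2] == 'B') state = mode_ascii;
            i += 3;
        } else {
            // The same bytes are one or two characters depending on the mode.
            i += (state == mode_kanji) ? 2 : 1;
            ++count;
        }
    }
    return count;
}
```

The two bytes "F|" count as two characters when the state is mode_ascii and as one character when an earlier chunk has left the state in mode_kanji, which is why the state has to be stored between two subsequent conversions.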

2.13.4.4 Use the New Code Conversion Facet

Here is an example of how the new code conversion facet can be used:

typedef basic_fstream<wchar_t,JIS_char_traits> JIS_fstream;   //1
JIS_fstream inout("/tmp/fil");
// The facet is allocated with new because the locale takes ownership of it.
locale cvtloc(locale(), new UnicodeJISConversion<wchar_t,char,JISstate_t>);
inout.rdbuf()->pubimbue(cvtloc);                              //2
wcout << inout.rdbuf();                                       //3

//1	Our Unicode-JIS code conversion needs a conversion state type different from the default type mbstate_t. Since the conversion state type is contained in the character traits, we have to create a new file stream type. Instead of JIS_char_traits, we could have taken advantage of the non-standard extension to the character traits template and used char_traits<wchar_t,JISstate_t>.
//2	Here the stream buffer's locale is replaced by a copy of the global locale that has a Unicode-JIS code conversion facet.
//3	The content of the JIS encoded file "/tmp/fil" is read, automatically converted to Unicode, and written to wcout.
