Globalization Overview
Informatica can process data in different languages. Some languages require single-byte data, while other languages require multibyte data. To process data correctly in Informatica, you must set up the following items:
- Locale. The locale specifies the language, territory, character set encoding, and collation order. Informatica requires that the locale settings on machines that access Informatica applications be compatible with the code pages in the domain. You may need to change the locale settings.
- Data movement mode. The PowerCenter Integration Service can process single-byte or multibyte data and write it to targets. Use the ASCII data movement mode to process single-byte data. Use the Unicode data movement mode to process multibyte data.
- Code pages. A code page contains the encoding that specifies the characters in a set of one or more languages. You select a code page based on the type of character data you want to process. For accurate data movement, the code pages for Informatica and the environment components must be compatible. You use code pages to distinguish between US-ASCII (7-bit ASCII), ISO 8859-1 (Latin-1, an 8-bit superset of ASCII), and multibyte characters.
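The distinction between 7-bit ASCII, ISO 8859-1, and data that needs a wider code page can be illustrated with a minimal Python sketch. It uses Python's built-in codecs as a stand-in for code pages and is not part of Informatica. The function name `classify_text` and its labels are hypothetical, chosen only for this illustration.

```python
def classify_text(text):
    """Return the narrowest of the code page classes named above that can hold `text`.

    Hypothetical helper: Python's "ascii" and "latin-1" codecs stand in for
    the US-ASCII and ISO 8859-1 code pages.
    """
    for label, codec in (("US-ASCII", "ascii"), ("ISO 8859-1", "latin-1")):
        try:
            text.encode(codec)
            return label
        except UnicodeEncodeError:
            continue  # Try the next, wider code page.
    # Neither single-byte code page fits. The data needs another code page,
    # for example a multibyte one such as UTF-8.
    return "beyond ISO 8859-1"


samples = [
    "Hello",         # Plain English text
    "Caf\u00e9",     # "Café", with U+00E9
    "\u6771\u4eac",  # "東京" (Tokyo)
]
for sample in samples:
    print(sample, "->", classify_text(sample))
```

Data in the first class fits a single-byte code page. Data in the last class may still fit a different single-byte code page, such as one for Central European languages, so the sketch only shows where the two code pages named above stop being sufficient.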
To ensure data passes accurately through your environment, the following components must work together:
- Domain configuration database code page
- Administrator tool locale settings and code page
- PowerCenter Integration Service data movement mode
- Code page for each PowerCenter Integration Service process
- PowerCenter Client code page
- PowerCenter repository code page
- Source and target database code pages
- Metadata Manager repository code page
You can configure the PowerCenter Integration Service for relaxed code page validation. Relaxed validation removes restrictions on source and target code pages.
Unicode
The Unicode Standard is the work of the Unicode Consortium, an international body that promotes the interchange of data in all languages. The Unicode Standard is designed to support any language, no matter how many bytes each character in that language may require. Currently, it supports all common languages and provides limited support for other less common languages. The Unicode Consortium continually extends the Unicode Standard with new characters and scripts. For more information about the Unicode Standard, see http://www.unicode.org.
The Unicode Standard defines multiple encoding forms. Informatica uses the following Unicode character sets and encodings:
- UCS-2 (Universal Character Set, double-byte). A character set in which each character uses two bytes.
- UTF-8 (Unicode Transformation Format). An encoding format in which each character uses from one to four bytes.
- UTF-16 (Unicode Transformation Format). An encoding format in which each character uses two or four bytes.
- UTF-32 (Unicode Transformation Format). An encoding format in which each character uses four bytes.
- GB18030. A Unicode encoding format defined by the Chinese government in which each character uses one, two, or four bytes.
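The byte widths listed above can be checked with a short Python sketch. It relies only on Python's standard codecs and is independent of Informatica. The helper name `byte_widths` is hypothetical. The little-endian variants of UTF-16 and UTF-32 are used so that no byte order mark is added to the count. UCS-2 is omitted because Python has no separate UCS-2 codec. For characters in the Basic Multilingual Plane, UCS-2 produces the same two bytes as UTF-16.

```python
# Python codec names standing in for the encodings listed above.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le", "gb18030")


def byte_widths(ch):
    """Return the number of bytes that one character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


samples = [
    "A",           # U+0041, Latin capital letter A
    "\u4e2d",      # U+4E2D, the Chinese character "中"
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]
for ch in samples:
    print("U+%04X" % ord(ch), byte_widths(ch))
```

The letter A takes one byte in UTF-8 and GB18030 but four bytes in UTF-32. The character outside the Basic Multilingual Plane takes four bytes in every encoding shown, which is why it cannot be stored in a single two-byte UCS-2 unit.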
Informatica is a Unicode application. The PowerCenter Client, PowerCenter Integration Service, and Data Integration Service use UCS-2 internally. The PowerCenter Client converts user input from any language to UCS-2 and converts it from UCS-2 before writing to the PowerCenter repository. The PowerCenter Integration Service and Data Integration Service convert source data to UCS-2 before processing and convert it from UCS-2 after processing. The PowerCenter repository, Model repository, PowerCenter Integration Service, and Data Integration Service support UTF-8. You can use Informatica to process data in any language.
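The convert-in, process, convert-out pattern described above can be sketched in Python. This is an illustration under stated assumptions, not Informatica's implementation. The function names `to_ucs2` and `from_ucs2` are hypothetical, the little-endian byte order is an assumption, and Python's `utf-16-le` codec stands in for UCS-2 after a check that every character fits in one two-byte unit.

```python
def to_ucs2(raw, source_code_page):
    """Decode source bytes and re-encode them as two bytes per character.

    Hypothetical helper. Little-endian byte order is an assumption.
    """
    text = raw.decode(source_code_page)
    if any(ord(ch) > 0xFFFF for ch in text):
        # UCS-2 has no surrogate pairs, so these characters do not fit.
        raise ValueError("character outside the Basic Multilingual Plane")
    return text.encode("utf-16-le")


def from_ucs2(ucs2, target_code_page):
    """Convert the internal two-byte form to the code page of the target."""
    return ucs2.decode("utf-16-le").encode(target_code_page)


# Source data arrives in a Japanese code page and leaves as UTF-8.
source = "\u6771\u4eac".encode("shift_jis")  # "東京" (Tokyo)
internal = to_ucs2(source, "shift_jis")
target = from_ucs2(internal, "utf-8")
print(len(source), len(internal), len(target))
```

The same two characters occupy a different number of bytes at each stage, but the text itself is unchanged because both the source and target code pages can represent it.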
Working with a Unicode PowerCenter Repository
The PowerCenter repository code page is the code page of the data in the PowerCenter repository. You choose the PowerCenter repository code page when you create or upgrade a PowerCenter repository. When the PowerCenter repository database code page is UTF-8, you can create a PowerCenter repository using the UTF-8 code page.
The domain configuration database uses the UTF-8 code page. If you need to store metadata in multiple languages, such as Chinese, Japanese, and Arabic, you must use the UTF-8 code page for all services in that domain.
The Service Manager synchronizes the list of users in the domain with the list of users and groups in each application service. If a user name in the domain contains characters that the code page of an application service does not recognize, the characters do not convert correctly and inconsistencies occur.
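The kind of mismatch described above can be detected ahead of time with a small check. The following Python sketch is an illustration only. The function name `unrecognized_characters`, the sample user names, and the choice of ISO 8859-1 as the application service code page are all assumptions, and Python's codecs stand in for the product's code pages.

```python
def unrecognized_characters(name, code_page):
    """Return the characters in `name` that `code_page` cannot represent."""
    bad = []
    for ch in name:
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Hypothetical domain user names checked against an assumed service code page.
service_code_page = "latin-1"
users = ["jsmith", "m\u00fcller", "\u7530\u4e2d"]  # "jsmith", "müller", "田中"
for user in users:
    problems = unrecognized_characters(user, service_code_page)
    status = "ok" if not problems else "cannot represent " + "".join(problems)
    print(user, "->", status)

# A lossy conversion replaces each unrecognized character with "?".
print("\u7530\u4e2d".encode(service_code_page, errors="replace"))
```

The last line shows why such names become inconsistent: after a lossy conversion, two different user names can collapse to the same string of question marks.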
Use the following guidelines when you use UTF-8 as the PowerCenter repository code page:
- The PowerCenter repository database code page must be UTF-8.
- The PowerCenter repository code page must be a superset of the PowerCenter Client and PowerCenter Integration Service process code pages.
- You can input any character in the UCS-2 character set. For example, you can store German, Chinese, and English metadata in a UTF-8 enabled PowerCenter repository.
- Install languages and fonts on the PowerCenter Client machine. If you are using a UTF-8 PowerCenter repository, you may want to enable the PowerCenter Client machines to display multiple languages. By default, the PowerCenter Clients display text in the language set in the system locale. Use the Regional Options tool in the Control Panel to add language groups to the PowerCenter Client machines.
- You can use the Windows Input Method Editor (IME) to enter multibyte characters from any language without having to run the version of Windows specific for that language.
- Choose a code page for a PowerCenter Integration Service process that can process all PowerCenter repository metadata correctly. The code page of the PowerCenter Integration Service process must be a subset of the PowerCenter repository code page. If the PowerCenter Integration Service has multiple service processes, ensure that the code pages for all PowerCenter Integration Service processes are subsets of the PowerCenter repository code page. If you are running the PowerCenter Integration Service process on Windows, the code page for the PowerCenter Integration Service process must be the same as the code page for the system or user locale. If you are running the PowerCenter Integration Service process on UNIX, use the UTF-8 code page for the PowerCenter Integration Service process.
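The subset and superset relationships in these guidelines can be illustrated by comparing the characters that two code pages can represent. The following Python sketch is an approximation under stated assumptions. It uses Python's codec tables, which may differ from the compatibility rules that Informatica applies, and it only considers the Basic Multilingual Plane, which matches the UCS-2 range. The function names `bmp_repertoire` and `is_subset` are hypothetical.

```python
def bmp_repertoire(code_page):
    """Return the set of Basic Multilingual Plane characters a code page can encode."""
    chars = set()
    for code_point in range(0x10000):
        if 0xD800 <= code_point <= 0xDFFF:
            continue  # Surrogate code points are not characters.
        ch = chr(code_point)
        try:
            ch.encode(code_page)
        except UnicodeEncodeError:
            continue  # This code page cannot represent the character.
        chars.add(ch)
    return chars


def is_subset(process_code_page, repository_code_page):
    """Return True if every character of the first code page fits in the second."""
    return bmp_repertoire(process_code_page) <= bmp_repertoire(repository_code_page)


# Compare some candidate process code pages against a UTF-8 repository code page.
for candidate in ("ascii", "latin-1", "shift_jis"):
    print(candidate, "is a subset of utf-8:", is_subset(candidate, "utf-8"))

# The reverse direction fails: UTF-8 is not a subset of ISO 8859-1.
print("utf-8 is a subset of latin-1:", is_subset("utf-8", "latin-1"))
```

Because UTF-8 can represent every character in the range checked, any process code page passes the test against a UTF-8 repository code page, which is consistent with the guideline that the repository code page must be a superset.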