Network Working Group M. Crispin
Internet-Draft University of Washington
Document: internet-drafts/draft-crispin-collation-unicasemap-00.txt
December 2006
i;unicode-casemap - Simple Unicode Collation Algorithm
Status of this Memo
By submitting this Internet-Draft, each author represents that
any applicable patent or other IPR claims of which he or she is
aware have been or will be disclosed, and any of which he or she
becomes aware will be disclosed, in accordance with Section 6 of
BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
A revised version of this document will be submitted to the RFC
editor as an Informational Document for the Internet Community.
A revised version of this draft document will be submitted to the RFC
editor as a Proposed Standard for the Internet Community. Discussion
and suggestions for improvement are requested, and should be sent to
ietf-imapext@IMC.ORG. This document will expire before 5 June 2007.
Distribution of this memo is unlimited.
Abstract
This document describes "i;unicode-casemap", a simple
case-insensitive collation for Unicode strings. It provides
equality, substring and ordering operations.
Introduction
The "i;ascii-casemap" collation described in [COMPARATOR] is quite
simple to implement and provides case-independent comparisons for the
26 Latin alphabetics. It is specified as the default and/or baseline
comparator in some application protocols, e.g., [IMAP-SORT].
It is possible, with a modest extension, to provide a more
sophisticated collation with greater multilingual applicability than
"i;ascii-casemap".
This collation, "i;unicode-casemap", is intended to be an alternative
to, and preferred over, "i;ascii-casemap". It does not replace the
"i;basic" collation described in [BASIC].
1. Unicode Casemap Collation Description
The "i;unicode-casemap" collation is a simple collation which
operates on Unicode strings and treats characters case-insensitively.
It provides equality, substring and ordering operations. All input
is valid.
For the equality and ordering operations, each input string is
prepared by converting it to "titlecased canonicalized UTF-8" as
follows on a per-character basis:
(1) If the string is in a non-Unicode character set, the codepoint
is converted from that character set to the associated
codepoint in Unicode.
(2) If the codepoint has a titlecase property in UnicodeData.txt
(this is normally the same as the uppercase property) the
codepoint is converted to the titlecased codepoint.
(3) If the codepoint has a decomposition property in
UnicodeData.txt the codepoint is converted to the decomposed
codepoints.
(4) The resulting codepoint(s) is/are appended to the titlecased
canonicalized UTF-8 string.
The resulting two titlecased canonicalized UTF-8 strings are then
treated as in i;octet for equality and ordering.
Care should be taken when using OS-supplied functions to implement
this collation as it is not locale sensitive. Functions such as
strcasecmp and toupper are sometimes locale sensitive and may
inconsistently casemap letters.
The i;unicode-casemap collation is well suited to to use with many
Internet protocols and computer languages. Use with natural language
is often inappropriate; even though the collation apparently supports
languages such as Swahili and English, in real-world use it tends to
mis-sort a number of types of string:
o people and place names containing scripts that are not collated
according to "alphabetical order".
o words with characters that have diacriticals. However,
i;unicode-casemap generally does a better job than i;ascii-casemap
for most (but not all) languages. For example, German umlaut
letters will sort correctly, but some Scandinavian letters will
not.
o names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
in English),
o strings containing other non-letter symbols; e.g., euro and pound
sterling symbols, quotation marks other than '"', dashes/hyphens,
etc.
2. Unicode Casemap Collation Registration
i;unicode-casemap
Unicode Casemap
equality order substring
RFC XXXX
IETF
mrc@cac.washington.edu
3. Security Considerations
Collations will normally be used with UTF-8 strings. Thus the
security considerations for [UTF-8], [STRINGPREP] and
[UNICODE-SECURITY] also apply and are normative to this
specification.
4. IANA Considerations
The i;unicode-casemap collation should be added to the registry of
collations defined in [COMPARATOR]
5. Normative References
The following documents are normative to this document:
[COMPARATOR] Newman, C., "Internet Appplication Protocol
Collation Registry", Work in Progress.
[STRINGPREP] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Strings ("stringprep")",
RFC 3454, December 2002.
[UTF-8] Yergeau, F., "UTF-8, a transformation format
of ISO 10646", STD 63, RFC 3629, November 2003.
[UNICODE-SECURITY] Davis, M. and M. Suignard, "Unicode Security
Considerations", February 2006,
.
6. Informative References:
[BASIC] Newman, C., Duerst, M., and Gulbrandsen, A.,
"i;basic - the Unicode Collation Algorithm",
Work in Progress.
[IMAP-SORT] Crispin, M. "Internet Message Access Protocol -
SORT and THREAD Extensions", Work in Progress.
Appendices
Author's Address
Mark R. Crispin
Networks and Distributed Computing
University of Washington
4545 15th Avenue NE
Seattle, WA 98105-4527
Phone: +1 (206) 543-5762
EMail: MRC@CAC.Washington.EDU
Full Copyright Statement
Copyright (C) The Internet Society (2006).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
ipr@ietf.org.
Acknowledgement
Funding for the RFC Editor function is currently provided by the
Internet Society.