MySQL collation 问题

在从 5.0.x 向 5.1.x 升级的过程中，使用 mysqldump 备份某张表之后，向 5.1.x 中导入时却遇到重复数据的错误：

ERROR 1062 (23000) at line 65: Duplicate entry '1003-' for key 'uk_cat_name'

这张表的schema是这样的：

CREATE TABLE `tag` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET ucs2 NOT NULL DEFAULT '',
  `count` int(11) NOT NULL DEFAULT '0',
  `cat_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `uk_cat_name` (`cat_id`,`name`),
  KEY `idx_cat_count` (`cat_id`,`count`)
) ENGINE=InnoDB AUTO_INCREMENT=2545965 DEFAULT CHARSET=latin1

分析 mysqldump 生成的 SQL 之后，发现数据本身并没有问题，并且向原有的 5.0.x 中导入也没有问题，那么问题应该是出在向 5.1.x 导入的过程中，但仍不明白其原因，于是去掉了 UNIQUE KEY constraint，导入数据后，发现的确有重复的数据：

mysql> select name, cat_id, count(1) as cnt from tag group by cat_id, name having cnt>1;
+--------------+--------+-----+
| name         | cat_id | cnt |
+--------------+--------+-----+
| Günter_Gras  |   1001 |   2 |
| Strauß       |   1001 |   2 |
| Suskind      |   1001 |   2 |
| sas          |   1002 |   2 |
| Weise        |   1002 |   2 |
| R.Strauß     |   1003 |   2 |
| Strauß       |   1003 |   2 |
+--------------+--------+-----+
7 rows in set (2.30 sec)

那就很奇怪了，使用 mysqldump 导出之前，数据是满足 UNIQUE KEY 约束的，并没有重复数据，导出的 SQL 也是正确的，为什么再次导入时反倒出现了重复数据呢？翻了几篇文档之后有了一些眉目，问题应该是出在 5.1.x 的 collation 上，有两个 “bug” tickets 描述了我遇到的情况：

那么 collation 是什么呢？collation 是在特定字符集内用于比较（排序）字符的规则，比如A和B谁在前谁在后，要不要区分大小写等等，不同的字符集有多种不同的比较（排序）规则，比如我们常见的 utf8_general_ci 就是一种大小写不敏感的规则， Wikipedia 和 MySQL 文档有详细的描述。

在我遇到问题的这个场景下，相关字段的定义是 `name` varchar(255) CHARACTER SET ucs2 NOT NULL DEFAULT ‘’，而 ucs2 这个字符集的默认 collation 是 ucs2_general_ci：

mysql> show character set like 'ucs2';
+---------+---------------+-------------------+--------+
| Charset | Description   | Default collation | Maxlen |
+---------+---------------+-------------------+--------+
| ucs2    | UCS-2 Unicode | ucs2_general_ci   |      2 |
+---------+---------------+-------------------+--------+
1 row in set (0.00 sec)

5.0.x 时没有问题，而 5.1.x 下有上面的 “bug” 中提到的问题，亦即导致我遇到的 duplicate key 问题的原因。

那么 ucs2 这个字符集的collation有哪些呢？

mysql> show collation like 'ucs2%';
+--------------------+---------+-----+---------+----------+---------+
| Collation          | Charset | Id  | Default | Compiled | Sortlen |
+--------------------+---------+-----+---------+----------+---------+
| ucs2_general_ci    | ucs2    |  35 | Yes     | Yes      |       1 |
| ucs2_bin           | ucs2    |  90 |         | Yes      |       1 |
| ucs2_unicode_ci    | ucs2    | 128 |         | Yes      |       8 |
| ucs2_icelandic_ci  | ucs2    | 129 |         | Yes      |       8 |
| ucs2_latvian_ci    | ucs2    | 130 |         | Yes      |       8 |
| ucs2_romanian_ci   | ucs2    | 131 |         | Yes      |       8 |
| ucs2_slovenian_ci  | ucs2    | 132 |         | Yes      |       8 |
| ucs2_polish_ci     | ucs2    | 133 |         | Yes      |       8 |
| ucs2_estonian_ci   | ucs2    | 134 |         | Yes      |       8 |
| ucs2_spanish_ci    | ucs2    | 135 |         | Yes      |       8 |
| ucs2_swedish_ci    | ucs2    | 136 |         | Yes      |       8 |
| ucs2_turkish_ci    | ucs2    | 137 |         | Yes      |       8 |
| ucs2_czech_ci      | ucs2    | 138 |         | Yes      |       8 |
| ucs2_danish_ci     | ucs2    | 139 |         | Yes      |       8 |
| ucs2_lithuanian_ci | ucs2    | 140 |         | Yes      |       8 |
| ucs2_slovak_ci     | ucs2    | 141 |         | Yes      |       8 |
| ucs2_spanish2_ci   | ucs2    | 142 |         | Yes      |       8 |
| ucs2_roman_ci      | ucs2    | 143 |         | Yes      |       8 |
| ucs2_persian_ci    | ucs2    | 144 |         | Yes      |       8 |
| ucs2_esperanto_ci  | ucs2    | 145 |         | Yes      |       8 |
| ucs2_hungarian_ci  | ucs2    | 146 |         | Yes      |       8 |
+--------------------+---------+-----+---------+----------+---------+
21 rows in set (0.00 sec)

上面可以看到，有针对不同语言的排序规则，在我的场景下，并没有此需求，那么我尝试一下 ucs2_bin，按照二进制排序：

mysql> alter table tag modify `name` varchar(255) CHARACTER SET ucs2 collate ucs2_bin not null default '';
Query OK, 2436674 rows affected (56.40 sec)
Records: 2436674  Duplicates: 0  Warnings: 0

mysql> show create table tag\G
*************************** 1. row ***************************
       Table: tag
Create Table: CREATE TABLE `tag` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET ucs2 COLLATE ucs2_bin NOT NULL DEFAULT '',
  `count` int(11) NOT NULL DEFAULT '0',
  `cat_id` mediumint(8) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `uk_cat_name` (`cat_id`,`name`),
  KEY `idx_cat_count` (`cat_id`,`count`)
) ENGINE=InnoDB AUTO_INCREMENT=2545965 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> select name, cat_id, count(1) as cnt from tag group by cat_id, name having cnt>1;
Empty set (2.22 sec)

可以看到，使用 ucs2_bin collation 之后，如预期的结果一样，原本不同的字符串被正确地区分开了。

对于如何选择合适的 collation，官方文档有描述，也有一些讨论，比如 What is the best collation to use for MySQL (with PHP)。