Recently I was asked a question: Why am I getting utf8mb4_0900_ai_ci as the default collation in MySQL 8, despite setting the server to use utf8mb4_general_ci?
With the upgrade to MySQL 8, many users, like my friend, face confusion around character sets and collations, especially when the MySQL 5.7 default collation (e.g., utf8mb4_general_ci) is replaced by utf8mb4_0900_ai_ci. This post will explain why this happens and guide you through resolving such issues when upgrading to MySQL 8, focusing on utf8mb4 and its default collation behavior.
First, let’s check the current character set and collation settings in the MySQL server:
mysql [localhost:8032] {msandbox} (test) > show global variables like '%character%';
+--------------------------+--------------------------------------------+
| Variable_name | Value |
+--------------------------+--------------------------------------------+
| character_set_client | utf8mb4 |
| character_set_connection | utf8mb4 |
| character_set_database | utf8mb4 |
| character_set_filesystem | binary |
| character_set_results | utf8mb4 |
| character_set_server | utf8mb4 |
| character_set_system | utf8mb3 |
| character_sets_dir | /opt/percona_server/8.0.32/share/charsets/ |
+--------------------------+--------------------------------------------+
8 rows in set (0.01 sec)
mysql [localhost:8032] {msandbox} (test) > show global variables like '%collat%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_general_ci |
| collation_server | utf8mb4_general_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
+-------------------------------+--------------------+
4 rows in set (0.01 sec)
Let’s create a table without specifying a collation explicitly:
mysql [localhost:8032] {msandbox} (test) > create table cs (id int not null auto_increment primary key, val varchar(40)) ;
Query OK, 0 rows affected (0.05 sec)
mysql [localhost:8032] {msandbox} (test) > show create table cs\G
* 1. row *
Table: cs
Create Table: CREATE TABLE cs (
id int NOT NULL AUTO_INCREMENT,val varchar(40) DEFAULT NULL,PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
We have utf8mb4_900_ai_ci and not utf8mb4_general_ci as a coalition. Why?
Is it because default_collation_for_utf8mb4 is still configured as utf8mb4_0900_ai_ci?
Let’s change default_collation_for_utf8mb4 to utf8mb4_general_ci. Ensured that this also changed at the current session.
mysql [localhost:8032] {msandbox} (test) > SET PERSIST default_collation_for_utf8mb4='utf8mb4_general_ci';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql [localhost:8032] {msandbox} (test) > show warnings;
+---------+------+--------------------------------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------------------------------------------------------------------------+
| Warning | 1681 | Updating 'default_collation_for_utf8mb4' is deprecated. It will be made read-only in a future release. |
+---------+------+--------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql [localhost:8032] {msandbox} (test) > show global variables like '%collat%';
+-------------------------------+--------------------+
| Variable_name | Value |
+-------------------------------+--------------------+
| collation_connection | utf8mb4_general_ci |
| collation_database | utf8mb4_general_ci |
| collation_server | utf8mb4_general_ci |
| default_collation_for_utf8mb4 | utf8mb4_general_ci |
+-------------------------------+--------------------+
4 rows in set (0.00 sec)
mysql [localhost:8032] {msandbox} (test) > create table csagain (id int not null auto_increment primary key, val varchar(40)) ;
Query OK, 0 rows affected (0.03 sec)
mysql [localhost:8032] {msandbox} (test) > show create table csagain\G
* 1. row *
Table: csagain
Create Table: CREATE TABLE csagain (
id int NOT NULL AUTO_INCREMENT,val varchar(40) DEFAULT NULL,PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
Well the table collation is still shown as utf8mb4_0900_ai_ci!
Now, I hope you noted the warning above – what we did is deprecated! There’s more to it. The default_collation_for_utf8mb4 system variable is for internal use by MySQL Replication only. What we did is not correct. It is there to support the replication from older versions of MySQL (5.7) to MySQL 8 and assist in avoiding replication errors.
Refer: https://dev.mysql.com/doc/refman/8.4/en/server-system-variables.html#sysvar_default_collation_for_utf8mb4
Move on, do you think this could be fixed with SET NAMES? Let’s try
mysql [localhost:8032] {msandbox} (test) > SET NAMES 'utf8mb4' COLLATE 'utf8mb4_general_ci';
Query OK, 0 rows affected (0.00 sec)
mysql [localhost:8032] {msandbox} (test) > drop table csagain;
Query OK, 0 rows affected (0.02 sec)
mysql [localhost:8032] {msandbox} (test) > create table csagain (id int not null auto_increment primary key, val varchar(40)) ;
Query OK, 0 rows affected (0.03 sec)
mysql [localhost:8032] {msandbox} (test) > show create table csagain\G
* 1. row *
Table: csagain
Create Table: CREATE TABLE csagain (
id int NOT NULL AUTO_INCREMENT,val varchar(40) DEFAULT NULL,PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.01 sec)
No luck! So why are we having collation as utf8mb4_0900_ai_ci and not utf8mb4_general_ci even after adjusting and ensuring the configuration remains utf8mb4_general_ci.
I’d like to express my gratitude to two individuals I deeply admire and learn from – Alok (The PoP) and Guptaji. Their relentless dedication to expanding their technical expertise, while also helping fellow consultants grow along the way, has been truly inspiring. Thank you both for your continuous efforts and for always being a source of knowledge and support.
promised-to-be-paid promotion
The Collation Realization
Every database has a database character set and a database collation. When a database is created without explicitly mentioning the character set and collation, those values get inherited by the server character set and server collation default values.
Thus when we do examine the character set for current database we realize the reasoning:
mysql [localhost:8032] {msandbox} (test) > show create database test;
+----------+--------------------------------------------------------------------------------------------------------------------------------+
| Database | Create Database |
+----------+--------------------------------------------------------------------------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+----------+--------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
If we create a new database, it will inherit the current settings, which is utf8mb4_general_ci.
mysql [localhost:8032] {msandbox} (test) > show create database guptaji;
+----------+-----------------------------------------------------------------------------------------------------------------------------------+
| Database | Create Database |
+----------+-----------------------------------------------------------------------------------------------------------------------------------+
| guptaji | CREATE DATABASE `guptaji` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+----------+-----------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
The answer to our riddle was to ALTER the old database to have an updated character set
mysql [localhost:8032] {msandbox} (guptaji) > alter database test COLLATE utf8mb4_general_ci;
Query OK, 1 row affected (0.02 sec)
mysql [localhost:8032] {msandbox} (guptaji) > show create database test;
+----------+--------------------------------------------------------------------------------------------------------------------------------+
| Database | Create Database |
+----------+--------------------------------------------------------------------------------------------------------------------------------+
| test | CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+----------+--------------------------------------------------------------------------------------------------------------------------------+
Now if you create a table without explicitly mentioning the CHARSET or COLLATION, it will be inherited from here. That said, do not forget that you can always explicitly mention collate utf8mb4_general_ci.
mysql [localhost:8032] {msandbox} (test) > drop table csagain;
Query OK, 0 rows affected (0.04 sec)
create table csagain (id int not null auto_increment primary key, val varchar(40)) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
Query OK, 0 rows affected (0.05 sec)
mysql [localhost:8032] {msandbox} (test) > show create table csagain\G
* 1. row *
Table: csagain
Create Table: CREATE TABLE csagain (
id int NOT NULL AUTO_INCREMENT,val varchar(40) COLLATE utf8mb4_general_ci DEFAULT NULL,PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci
1 row in set (0.01 sec)
utf8mb4_0900_ai_ci vs utf8mb4_general_ci
Here’s a quick comparison between utf8mb4_general_ci and utf8mb4_0900_ai_ci collations in MySQL 8, and when you might use each one
Feature | utf8mb4_general_ci | utf8mb4_0900_ai_ci |
Compatibility | Compatible with earlier MySQL versions (5.7 and earlier) | Designed for MySQL 8.0 and later |
Character Comparison | Basic comparison, ignoring accents | More advanced comparison, considers accents and some linguistic rules |
Performance | Faster for basic comparisons | Slightly slower due to more complex rules |
Language Support | Good for general use, but limited for specific languages | Improved support for multiple languages and scripts |
Unicode Support | Supports most Unicode characters | Enhanced Unicode support, including newer characters |
Handling of Special Characters | Basic handling | More sophisticated handling of special characters and emojis |
Case Sensitivity | Case-insensitive | Case-insensitive |
Use Case | Suitable for general applications and legacy systems | Recommended for modern applications requiring precise sorting and comparison |
Implementation | Simple and straightforward | More complex, but adheres to modern Unicode standards |
utf8mb4_general_ci is a simpler, faster collation suitable for general use but may not handle certain linguistic nuances.
utf8mb4_0900_ai_ci offers better support for internationalization and modern Unicode standards, making it preferable for applications requiring precise sorting and character handling.
Choosing between the two depends on the specific needs of your application, particularly regarding language support and performance requirements. If asked my opinion in general, I’d go with default and latest.
Conclusion
This appeared to be a simple question from a friend but I think it had more into it than a simple answer. I hope my tests will help someone. I am going to conclude with quick actionable items for when you’re dealing with such confusion follow these steps to investigate.
Check your current collation settings using SHOW VARIABLES LIKE ‘%collation%’.
Alter the database collation with ALTER DATABASE <dbname> COLLATE <desired_collation>.
Set default collation for new databases by configuring default_collation_for_utf8mb4 in the MySQL server settings (although note that this is deprecated in future MySQL versions).
Explicitly specify collation when creating tables if needed.