Say goodbye to the torturous mysql sub-bank table! 100 million-level data has no inductance expansion, and the back-end needs to understand the NewSQL selection
During the recent interview exchange, the interviewer asked me TiDB landing practice grasp the depth.
In the 2014 PHP back-end research and development, I have deeply cultivated mysql high concurrency, big data table architecture evolution and performance optimization, from stand-alone tuning, master-slave copy, to sharding-jdbc/mycat The whole set of manual sub-bank sub-tables has been deeply implemented.
Although I have been paying attention to mainstream Newsql distributed solutions such as TiDB in my daily work, and have also contacted many times in the online architecture review and technical research, there has been no special purpose.Systematic arrangement into standard selection framework + landing boundary document. Taking this interview opportunity, I combined with many years of mysql large-scale actual combat experience and mature production cases in the industry, and then compared the official structure and best practice, I have completed the precipitation of this share:
Say goodbye to the full-dimensional selection reference of manual sub-bank sub-tables and 100 million-level data without inductive expansion, at the same time, clarify the classical questions of the long debate in the project:
After using TiDB on a large scale in the business, how Elasticsearch should choose and perform their own duties, which scenarios are reserved, and which scenarios can be simplified, and thoroughly explain the real architectural boundaries of RDBMS and the retrieval engine.
The whole content is based on the practical perspective of senior PHP+MySQL, taking into account the interview response standard + the real basis of enterprise production.
As an old programmer with many years of development experience, the most pit I have stepped on in these years is the expansion problem of the MySQL database. From the initial stand-alone Mysql business, to the query and write jitter that appeared after the single table data exceeded 10 million, to the last resort Hand sharding-jdbc and mycat do manual sub-databases, the whole process can be called ‘operation and maintenance nightmare’ – your own rules Segmentation keys, debugging routing rules, cross-sharded JOIN, and cross-slicing transactions are everywhere. The PHP business layer is also compatible with complex paging, sorting and aggregation logic, and the small and medium-sized teams cannot bear the weight of this architecture at all.
I believe that many backends like me have the same soul torture: is there a database that is compatible with the mysql protocol, ph P is seamless, and we can easily support the data volume of 100 million and 100 billion without us being concerned about the sub-database and sub-table, and can also ensure high availability and strong consistency? The answer is yes – the NewSQL distributed database is born to solve this pain point. In today’s article, I will combine my own development experience to take a detailed inventory of the mainstream free manual sharding MySQL distributed database (including open source + Public cloud), at the same time answer the question that everyone is most concerned about: After using TiDB, does Elasticsearch (ES) still need it? Provide the most practical reference for subsequent production environment database selection.
1. First review: those ‘fatal pain points’ of mysql manual sub-database sub-table
Before we talk about Newsql, let’s sort out the evolution of the traditional MySQL architecture, and let everyone understand why we urgently need a ‘slicing-free’ distributed database.
1. The natural limitations of stand-alone MySQL: once the amount of data in a single table exceeds 10 million, there will be obvious performance bottlenecks – the index is bloated and the query is slow, and after writing is concurrently increased In the event of jitter, the expansion can only rely on the vertical upgrade of the server (plus CPU, memory, and disk), but this method has a clear upper limit and cannot support the billion-level data and high concurrency scenarios.
- The sharding rules are all manual planning: which fields are selected as the sharding keys, how to split the data, and how to configure the routing rules, they all need to be debugged by themselves.
- Cross-slicing operations are extremely difficult to achieve: cross-slicing joins, cross-slice transactions are almost ‘metaphysics’, either give up strong consistency, or write complex business logic to bottom out, and it is easy to have the problem of data inconsistencies;
- Scale and migration is painful: when the data continues to grow and need to add new shards, it is necessary to manually migrate data, adjust routing configuration, and even modify business code, which is prone to service interruptions or data loss during this period;
- Double the cost of operation and maintenance: It is necessary to arrange personnel to maintain middleware and sharded clusters to monitor the load of each shard and to check the abnormality of sharding. Small and medium-sized teams do not have enough manpower support at all.
For our back-end, the most painful thing is: the original familiar MySQL syntax requires a lot of adaptation after sub-database and sub-table, and the ORM usage of Laravel, Yii2 and other frameworks should also be adjusted, and the business development efficiency has been greatly reduced. The core value of Newsql is to help us get rid of these troubles –Compatible with mysql protocol, automatic sharding, manual maintenance-free, the business layer is completely insensitive, and the billion-level data can easily hold it.
2. Mainstream free manual sharding Class MySQL distributed database full inventory (open source + public cloud, production available)
At present, the hand-free sharding and MySQL-compatible distributed database on the market are mainly divided into two categories: open source self-research version (suitable for privatized deployment, self-built computer room) and public cloud hosting version (suitable for enterprises that do not want to operate and maintain, out of the box). Let’s disassemble them one by one, focusing on analyzing the core highlights of each database, suitable for scenarios and shortcomings, so that everyone can choose according to their own business situation.
(1) Open source self-development version (privatized deployment/self-built computer room is the first choice)
This kind of database can be deployed on its own server, and it controls data and operation and maintenance independently. It is suitable for enterprises with high data security requirements and self-built computer rooms.
① TiDB (PingCap) [Key recommendation, most suitable for PHP ecosystem]
This is also the database I was asked during the interview, and it is also the most suitable for the NewSQL database of PHP backend and the easiest to land. As the open source distributed Newsql database self-developed by PingCap, its core positioning is to ‘replace mysql sub-table sub-table, compatible with the MySQL ecosystem, and support the high concurrency of 100 million-level data’.
Core highlights (focus on, fit PHP development scenarios):
- MySQL compatibility is full: 100% compatible with MySQL 5.7 protocol, syntax and connection driver, the PHP project (Laravel, Yii2, ThinkPHP) can be directly connected, no need to modify any SQL statement and business code, smooth migration without pressure;
- Automatic sharding, completely liberating the hands: the bottom layer will automatically shard the data according to ‘region’ (default 96MB), and the data is evenly scattered to multiple TikV sections No need to manually plan the sharding keys, no need to manually migrate data, the data has risen to 100 million, one billion, and only a new TikV node needs to be added, and the expansion is completely insensitive;
- High availability + strong transaction: 3 dungeon deployment based on RAFT protocol, automatic switching after a single node failure, zero data loss, no service interruption, support cross Disaster recovery of the computer room; at the same time, it supports distributed strong transaction ACID, which solves the pain points of cross-slice transactions after sub-database and sub-table, and is suitable for core businesses such as order and payment;
- HTAP is integrated, taking into account transaction and analysis: TiKV (line storage engine) is responsible for OLTP (online transactions, such as user ordering, payment), Tiflash (column storage engine, optional ) is responsible for OLAP (online analysis, such as background reports, data statistics), real-time data synchronization, no need to build a separate data warehouse for ETL, a set of database to complete transaction + analysis;
- Separation of storage and calculation, flexible expansion: TiDB Server (calculation layer) and TiKV (storage layer) are separated, if you want to improve the concurrency capability, you can add a new TiDB node; if you want to store more data, add a new TikV node, the expansion is flexible, and the cost is controllable.
Suitable for scenarios:
- Internet scenarios: large tables such as orders, users, bills, running water, etc.
- PHP old project migration: I don’t want to modify the business code, I want to get rid of the trouble of sub-database and sub-table, and smoothly replace the traditional mysql;
- Small and medium-sized teams: The learning cost is low, and a local command can start the test cluster, and the difficulty of operation and maintenance is much lower than the plan of sub-database sub-table + middleware.
Short board:
- Small-scale data deployment is slightly heavy: compared to single-machine MySQL, TiDB needs to deploy three core components of TiDB, TiKV, and PD. The server resource occupancy is slightly higher, which is suitable for scenarios with a large amount of data, and the small data volume scenarios (less than one million levels in a single table) are more efficient to use stand-alone MySQL;
- Very low latency queries are not as good as native MySQL: for simple, microsecond-level latency requests (such as simple primary key queries), TiDB The latency is slightly higher than that of stand-alone MySQL, but for most PHP services (such as web interface, background management), it fully meets the needs.
② OceanBase open source version (self-developed by Ant Group)
OceanBase is a distributed database developed by Ant Group. The core advantage is ‘financial stability’. Alipay’s core accounting system and Double Eleven peak are supported by it. The open source version can be used for free, which is suitable for scenarios with high stability requirements.
Core highlights:
- Financial level high availability: After years of actual combat verification by Alipay, it supports multi-activity in different places, automatic failure switching, zero loss of data, and meets the compliance requirements of the financial industry;
- Compatibility: Compatible with the MySQL protocol, and support some Oracle syntax, suitable for enterprises that need to migrate from Oracle to the MySQL ecosystem;
- High resource utilization: support automatic partitioning, baseline merge, and high data compression. Under the same data volume, the storage cost is lower than that of TiDB, and the utilization rate of server resources is higher;
- Strong consistency: Distributed transactions support ACID, which is suitable for core financial scenarios such as accounting and payment.
Suitable for scenarios: finance, payment, core accounting systems, enterprises with extremely high requirements for data consistency and stability.
Short board:
- The learning curve is steep: the architecture is more complex than TiDB, and the operation and maintenance are difficult, which requires special operation and maintenance personnel to maintain, and the landing cost of small and medium-sized teams is high;
- The community ecology is not as good as TiDB: the domestic community activity is lower than that of TiDB. When encountering problems, the documents and solutions are not as rich as TiDB;
- PHP ecological adaptation is not as good as TiDB: Although it is compatible with the MySQL protocol, it is less than TiDB in the adaptation of the PHP framework and community cases, and additional debugging may be required when landing.
③ PolarDB-X open source version (self-developed by Ali)
PolarDB-X is Ali’s native distributed MySQL database, which is similarly positioned to TiDB. The core advantage is ‘Ali ecological adaptation’.
Core highlights:
- MySQL is highly compatible: seamlessly compatible with MySQL syntax and protocols, PHP projects can be directly connected, and the migration cost is low;
- Automatic sharding + global secondary index: the bottom-level automatic sharding, supports global secondary index, and solves the pain points of index query after sharding;
- Alibaba Ecological Linkage: It is highly linked with other products of Alibaba Cloud (such as RDS, OSS, and DataWorks), which is suitable for enterprises that have already used Alibaba Eco.
Suitable for scenarios: Ali ecological heavy users need to be privately deployed, and a distributed database that is compatible with MySQL.
Shortcomings: The activity of the community is lower than that of TiDB, the function update speed of the open source version is not as good as that of TiDB, and the advantages of non-Aliba ecology enterprises are not obvious.
(2) Public cloud hosting version (no operation and maintenance, out of the box, the first choice for enterprise production)
For most enterprises (especially small and medium-sized teams), the operation and maintenance cost of self-built distributed database clusters is too high. Efficient and more worry-free choice – manufacturers are responsible for the underlying deployment, operation and maintenance, expansion, and troubleshooting. We only need to focus on business development and use it out of the box.
1. Alibaba Cloud PolarDB
Alibaba Cloud’s distributed MySQL hosted version, based on the PolarDB-X open source version, is optimized.
Core advantages: cloud hosting, the bottom sharding details are completely shielded, and the business layer is completely insensitive; support elastic expansion, automatic backup, fault self-healing; and Alibaba Cloud RDS , ECS, SLS and other products are seamlessly linked, which is suitable for the implementation of PHP projects on Alibaba Cloud; there are mature solutions in scenarios such as spikes, orders, and massive flowing water.
2. Tencent Cloud TDSQL MySQL Edition
The core advantages of Tencent Cloud’s self-developed MySQL are ‘financial-grade stability’ and ‘traditional project adaptation’, which are more implemented in the fields of finance, government and enterprise.
Core advantages: automatic sharding, distributed strong transactions, multi-aware remote activities; compatible with the MySQL protocol, low cost of PHP project migration; Perfect monitoring, backup, and operation and maintenance tools are suitable for enterprises who do not want to invest too much operation and maintenance manpower; support localization and adaptation, suitable for government and enterprise projects.
3. HUAWEI CLOUD GAUSSDB(for MySQL)
The core advantages of Huawei Cloud’s distributed MySQL are ‘separation of storage and calculation’ and ‘support of large-scale data’, which are suitable for scenarios with a large amount of data and a very high concurrency.
Core advantages: support PB-level data storage, automatic sharding, flexible expansion; compatible with MySQL protocol, PHP seamless connection; provide financial-grade high availability, suitable for core business implementation; Huawei Cloud has strong ecological linkage, suitable for enterprises using Huawei Cloud.
4. AWS Aurora MySQL Distributed Extension
The core advantages of Amazon Cloud’s distributed MySQL are ‘Overseas Deployment’ and ‘Native MySQL Compatibility’, which is suitable for overseas business and need to connect with the AWS ecosystem.
Core advantages: fully compatible with native MySQL, PHP projects can be directly connected; support elastic expansion, automatic bottom-level sharding; overseas nodes are widely covered, suitable for overseas deployment services; provide complete monitoring and operation and maintenance tools.
(3) Production selection comparison table (direct comparison, rapid decision-making)
In order to facilitate everyone’s quick selection, I have compiled a comparison table, covering the core dimension, and the production environment can directly refer to:
| Database | attribution | mysql compatibility | automatic sharding | distributed transaction | Operation and maintenance difficulty | Typical Best Scene |
|---|---|---|---|---|---|---|
| TiDB (open source) | Open source built | Extremely high (100% compatible 5.7) | Fully automatic | Strong ACID | Medium (suitable for small and medium-sized teams) | Internet/PHP project migration, alternative database sub-tables, 100 million-level structured data |
| OceanBase (open source) | Open source built | High (compatible with MySQL + part of Oracle) | Fully automatic | Financial level strong | High (suitable for professional operation and maintenance) | Accounting/Payment/Banking Core, Financial Grade High Availability Scenario |
| PolarDB-X (open source) | Open source built | extremely high | Fully automatic | Strength | Medium | Ali Ecological Self-built, compatible with MySQL distributed scenarios |
| Alibaba Cloud PolarDB | Public Cloud Hosting | extremely high | Manufacturer’s Shield Details | Strength | Very low (manufacturer operation and maintenance) | Alibaba Cloud system, do not want operation and maintenance, 100 million data scenarios |
| Tencent Cloud TDSql MySQL version | Public Cloud Hosting | extremely high | Manufacturer’s Shield Details | Strength | Extremely low | Tencent Cloud System, Financial/Government and Enterprise Projects, Traditional Project Migration |
| mycat+mysql (traditional scheme) | middleware combination | Depend on sharding rules | manual planning | weak, too many pits | extremely high | The old historical project has no choice but to continue its life, and the new project is not recommended |
In one sentence, the selection conclusion (the key points, interview + production can be used):
- PHP traditional web/Internet business, want to low-cost migration, self-study and landing, interview bonus points → TiDB (open source version) is the best;
- Core scenarios such as financial accounting and payment, pursuing extreme stability → OceanBase (open source version) or Tencent Cloud TDSql;
- The whole Alibaba cloud system, do not want to invest in operation and maintenance manpower, and land directly on the cloud → Alibaba Cloud PolarDB;
- Overseas business, need to connect with AWS ecology → AWS Aurora MySQL;
- Old historical projects cannot directly migrate NewSQL → Temporarily use mycat+mysql to continue life, new projects are preferred for newsql.
3. Core Q&A: After using TiDB, can Elasticsearch (es) be used?
This is one of the most tangled issues I am considering TiDB – in many PHP projects, we will use MySQL to store business data, use E s do full-text search (such as article search, product search), if tidb is used, can ES be killed and the operation and maintenance cost of a set of clusters is reduced? After in-depth understanding and practical verification, the answer is very clear:TiDB cannot completely replace ES, the two are positioned differently, whether ES needs to be reserved, it depends on whether your business needs full-text retrieval.
(1) First make it clear: the core positioning of TiDB and ES is completely different
Let’s first clarify the core strengths of the two, and then we will know why we can’t replace each other.
1. The core strengths of Elasticsearch (ES): full-text search + unstructured data processing
The essence of ES is ‘retrieval engine’, not a database, its core value is ‘quick search’, especially full-text word segmentation, which is irreplaceable by TiDB:
- Full-text word segmentation search: Support word segmentation in Chinese, English and other languages, such as ‘PHP distributed database’, which can be divided into ‘PHP’, ‘Distributed’ and ‘Database’. Three keywords, the user searches any keyword, which can quickly match the relevant content, and can also achieve highlighting and weight sorting (for example, the content with high matching degree is ranked in the front);
- Log retrieval: unstructured/semi-structured data such as system log, operation log, error log, etc., ES can quickly index and retrieve, which is convenient for us to troubleshoot problems (such as quickly locate an error log according to keywords);
- Multi-dimensional free screening: support free screening of multiple fields and multi-conditions, such as product search, you can filter according to multiple dimensions such as price, classification, sales, and evaluation, and the search speed is extremely fast;
- Unstructured data processing: Support the retrieval of unstructured data such as text and pictures (requires plug-ins), which is a completely incapable of TIDB.
2. The core strengths of TiDB: structured transaction data + standard SQL operation
The essence of TiDB is ‘distributed relational database’, and the core value is ‘store and manage structured transaction data’, instead of MySQL, not ES:
- Structured data storage: suitable for storing structured data such as orders, users, balances, and running water to ensure strong consistency and transactionality of data;
- Standard SQL operation: support JOIN, transaction, DDL, paging, sorting and other SQL operations supported by MySQL, and the PHP business layer can directly operate in a familiar way;
- 100 million-level data support: automatic fragmentation, easy support for the storage and query of 10000000-level and one billion-level structured data;
- Basic analysis ability: Through the TIFLASH column storage engine, you can do simple big data statistical analysis (such as monthly order volume, user activity statistics), but cannot achieve full-text search like ES.
(2) Production and landing conclusion: Judgment by scene, do not need to blindly reserve or delete ES
Combined with the common scenarios of PHP projects, we can judge in two situations, avoid wasting resources, and avoid stepping on pits:
✅ Scenarios that can be offline/deployed without ES (TiDB is enough)
If your business only has ‘structured data query’ and does not have full-text retrieval requirements, then ES can do without deployment, and TiDB+TIFlash can handle all requirements:
- User basic information query: query according to the user ID, mobile phone number, user name and other precise conditions, the index query speed of TiDB is fully satisfied;
- Order and bill query: query according to the order number, user ID, time range and other conditions, support pagination, sorting, and simple aggregation (such as counting the total number of orders for a user);
- In the background of ordinary reports: such as sales reports, user reports, through the TIFLASH column storage engine, it can quickly complete statistical analysis, and there is no need to synchronize data to ES or data warehouse;
- Precise condition query: all scenarios that do not require Chinese word segmentation and do not need fuzzy retrieval are competent, and the performance is better than ES
❌ Never replace the ES scene (must keep)
As long as your business has the following requirements, ES cannot be saved, and TiDB cannot be replaced:
- Full-text word segmentation search: such as blog articles, product titles/details, fuzzy search of news content, keyword search, need for word segmentation, highlighting, weight sorting;
- Log retrieval: centralized retrieval and troubleshooting of system logs, operation logs, and error logs, especially massive log scenarios (such as tens of millions of logs per day);
- Multi-dimensional free screening: such as e-commerce product search, users can freely screen according to multiple dimensions such as price, classification, sales volume, evaluation, and brand, and require fast search speed;
- Unstructured text retrieval: such as the search and analysis of unstructured texts such as user reviews and article content.
(3) compromise plan: small and medium-sized teams to save money skills (milk search does not use ES)
If your business has only a ‘light search’ requirement (such as simple keyword fuzzy queries, without complex word segmentation and weight sorting), and do not want to maintain a bulky ES cluster, you can consider the following compromises, which are lighter and easier to maintain than ES:
- TiDB+ MySQL native full-text index: TiDB is compatible with MySQL’s full-text indexing function, for simple Chinese word segmentation, modulus Paste query (such as username, title fuzzy search), can be used, but the performance and functions are far inferior to ES, suitable for mild needs;
- Access light retrieval engine: such as MeiliSearch and TypeSense, these search engines are much lighter than ES, simple deployment, low operation and maintenance cost, support basic word segmentation, highlighting, fuzzy search, and suitable for the light retrieval needs of small and medium-sized teams;
- Note: If it is a heavy search demand (such as e-commerce product search, massive article search), or keep ES honestly to avoid affecting the user experience.
4. Combined with 14 years of PHP experience: the best route of architecture evolution (production and implementation)
Combined with my PHP development experience over the years, I recommend a set of ‘minimal, efficient and easy to implement’ architectural evolution route, which is suitable for most PHP projects, especially those migrated from the traditional MySQL sub-table sub-table:
(1) Old architecture (painful version, it is not recommended to continue to use)
PHP + stand-alone MySQL → Data explosion → Master-slave copying bottleneck → Mycat/Sharding-JDBC manual sub-table sub-table → Change the code, change the SQL, countless cross-block pits, and explode the operation and maintenance cost
The problem with this architecture is: complex operation and maintenance, low business development efficiency, and difficulty in expanding capacity.
(2) New minimalist architecture (recommended, the first choice for production and landing)
According to whether there are retrieval needs for the business, there are two solutions, and both can completely get rid of the trouble of manual sub-database sub-tables:
Option 1: No retrieval requirements (such as background management system, payment system)
PHP + TiDB → Get all the needs
- You don’t need to care about the sub-database and sub-table, the data can run smoothly when it reaches 100 million;
- PHP code and SQL statements do not need to be changed at all, smooth migration;
- TIFLASH is responsible for background report statistics, and does not need to build additional data warehouses;
- The operation and maintenance is simple, only need to maintain the TiDB cluster, and the score library is divided into table + middleware and 10 times worry-free.
Option 2: There are retrieval needs (such as e-commerce, blog, content platform)
PHP + TiDB (core business data) + lightweight search engine/ES (retrieval requirements)
- TiDB: store core structured data such as orders, users, and balances to ensure transaction and consistency;
- Lightweight retrieval engine (MeiliSearch): to deal with mild search requirements (such as simple keyword search), low operation and maintenance cost;
- ES (optional): to deal with heavy search requirements (such as full-text search, log retrieval of commodities), only deploy necessary nodes, and reduce operation and maintenance costs;
- Data synchronization: Through the binlog of TiDB, the data that needs to be retrieved is synchronized to the retrieval engine to realize real-time data update.
The advantages of this architecture: the core business is stable, the retrieval needs are met, the operation and maintenance costs are controllable, and it is perfectly adapted to the PHP ecosystem, and the small and medium-sized teams can also easily land.
5. Summary: The ultimate idea of PHP back-end database selection
As a 14-year-old PHP programmer, after experiencing the torture of the MySQL sub-table and dividing the table, I have a clearer understanding of the selection of the database, and I will share it with you here, and also the end of this blog:
1. For PHP projects,TiDB is currently the most suitable open source NewSQL database— Compatible with MySQL, automatic sharding, manual maintenance, and PHP zero transformation, can completely solve the pain points of sub-repository and sub-table, and support the high concurrency of 100 million-level data, whether it is self-study, interview or production and landing, it is the best solution;
2. The public cloud hosting version (such as Alibaba Cloud PolarDB, Tencent Cloud TDSql), suitable for enterprises who do not want to put into operation and maintenance manpower, out of the box, with guaranteed stability, is the first choice for enterprise production;
3. TiDB and ES are not the relationship of ‘choose one’, but ‘complementary’ relationship – TiDB solves the storage, transaction and basic analysis of structured data, ES solves full-text retrieval and unstructured data processing, and whether ES needs to be reserved depends on the retrieval needs of the business;
4. New projects try to avoid manual sub-database sub-table, preferentially choose the Newsql database, reduce the architectural complexity and operation and maintenance cost;
5. The core principles of database selection:Meet business needs, reduce development and operation and maintenance costs, ensure stability and scalability, you don’t have to blindly pursue ‘tall’, which is the best for your own business.
Finally, I hope this article can help the back-end developers who have been tortured by mysql sub-databases like me, and I hope everyone can be quick Quickly grasp the core knowledge points of NewSQL, avoid detours in interviews and production selection, and completely say goodbye to the pain of manual sub-repository sub-tables!