mysql InnoDB crash and recover

[摘要] 因为机器突然断电，mysql innodb crash 了。数据文件损坏导致mysql进程起不来，几经折腾，终于把服务起来了，但是解决过程并不完美，下面还是把处理经过记录一下。

Mysql版本：5.6.36，
操作系统：CentOS release 6.8

先贴出来mysql.log 日志的报错情况：

information that should help you find out what is causing the crash.
2018-01-05 08:19:44 13811 [Note] Plugin 'FEDERATED' is disabled.
2018-01-05 08:19:44 13811 [Note] InnoDB: Using atomics to ref count buffer pool pages
2018-01-05 08:19:44 13811 [Note] InnoDB: The InnoDB memory heap is disabled
2018-01-05 08:19:44 13811 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2018-01-05 08:19:44 13811 [Note] InnoDB: Memory barrier is not used
2018-01-05 08:19:44 13811 [Note] InnoDB: Compressed tables use zlib 1.2.3
2018-01-05 08:19:44 13811 [Note] InnoDB: Using Linux native AIO
2018-01-05 08:19:44 13811 [Note] InnoDB: Using CPU crc32 instructions
2018-01-05 08:19:44 13811 [Note] InnoDB: Initializing buffer pool, size = 8.0G
2018-01-05 08:19:44 13811 [Note] InnoDB: Completed initialization of buffer pool
2018-01-05 08:19:44 13811 [Note] InnoDB: Highest supported file format is Barracuda.
2018-01-05 08:19:44 13811 [Note] InnoDB: Log scan progressed past the checkpoint lsn 229983134766
2018-01-05 08:19:44 13811 [Note] InnoDB: Database was not shutdown normally!
2018-01-05 08:19:44 13811 [Note] InnoDB: Starting crash recovery.
2018-01-05 08:19:44 13811 [Note] InnoDB: Reading tablespace information from the .ibd files...
2018-01-05 08:19:44 13811 [Note] InnoDB: Restoring possible half-written data pages
2018-01-05 08:19:44 13811 [Note] InnoDB: from the doublewrite buffer...
InnoDB: Doing recovery: scanned up to log sequence number 229983251745
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 691.
InnoDB: You may have to recover from a backup.
2018-01-05 08:19:45 7f3dd3ed1720 InnoDB: Page dump in ascii and hex (16384 bytes):

InnoDB: End of page dump
2018-01-05 08:19:45 7f3dd3ed1720 InnoDB: uncompressed page, 
stored checksum in field1 1575536212, calculated checksums for field1: crc32 1100691290, 
innodb 1575536212, none 3735928559, stored checksum in field2 2238042318, calculated checksums for field2: 
crc32 1100691290, innodb 3402961953, none 3735928559, page LSN 53 2349971314, low 4 bytes of LSN at page end 2349884494, 
page number (if stored to page already) 691, space id (if created with >= MySQL-4.1.1 and stored already) 0
InnoDB: Page may be an update undo log page
InnoDB: Database page corruption on disk or a failed
InnoDB: file read of page 691.
InnoDB: You may have to recover from a backup.
InnoDB: It is also possible that your operating
InnoDB: system has corrupted its own file cache
InnoDB: and rebooting your computer removes the
InnoDB: error.
InnoDB: If the corrupt page is an index page
InnoDB: you can also try to fix the corruption
InnoDB: by dumping, dropping, and reimporting
InnoDB: the corrupt table. You can use CHECK
InnoDB: TABLE to scan your table for corruption.
InnoDB: See also http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
InnoDB: Ending processing because of a corrupt database page.
2018-01-05 08:19:45 7f3dd3ed1720  
InnoDB: Assertion failure in thread 139903525263136 in file buf0buf.cc line 4295
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
00:19:45 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=0
max_threads=700
thread_count=0
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 285949 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
8388608 + 131072 * 700
8388608 + 91750400 = 97792


Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8e5d85]
/usr/sbin/mysqld(handle_fatal_signal+0x494)[0x6692b4]
/lib64/libpthread.so.0[0x35ea00f790]
/lib64/libc.so.6(gsignal+0x35)[0x35e9c32625]
/lib64/libc.so.6(abort+0x175)[0x35e9c33e05]
/usr/sbin/mysqld[0xa4fc37]
/usr/sbin/mysqld[0xa65454]
/usr/sbin/mysqld[0xa6589b]
/usr/sbin/mysqld[0xa53851]
/usr/sbin/mysqld[0xa1dadb]
/usr/sbin/mysqld[0xa12a45]
/usr/sbin/mysqld[0xa12dc6]
/usr/sbin/mysqld[0xa1527e]
/usr/sbin/mysqld[0x9ff480]
/usr/sbin/mysqld[0x942b5d]
/usr/sbin/mysqld(_Z24ha_initialize_handlertonP13st_plugin_int+0x48)[0x5ac828]
/usr/sbin/mysqld[0x6f3171]
/usr/sbin/mysqld(_Z11plugin_initPiPPci+0xbb6)[0x6f6fd6]
/usr/sbin/mysqld[0x59ea28]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x42d)[0x5a3cdd]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x35e9c1ed5d]
/usr/sbin/mysqld[0x595349]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

根据http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html的提示，需要修复数据库和表。
由于mysql现在仍无法启动，所以需要增加如下：
[mysqld]
innodb_force_recovery = 4

关于innodb_force_recovery的几个值介绍：

As a safety measure, InnoDB prevents INSERT, UPDATE, or DELETE operations when innodb_force_recovery is greater than 0.

1 (SRV_FORCE_IGNORE_CORRUPT)

Lets the server run even if it detects a corrupt page. Tries to make SELECT * FROM tbl_name jump over corrupt index records and pages, which helps in dumping tables.

2 (SRV_FORCE_NO_BACKGROUND)

Prevents the master thread and any purge threads from running. If a crash would occur during the purge operation, this recovery value prevents it.

3 (SRV_FORCE_NO_TRX_UNDO)

Does not run transaction rollbacks after crash recovery.

4 (SRV_FORCE_NO_IBUF_MERGE)

Prevents insert buffer merge operations. If they would cause a crash, does not do them. 
Does not calculate table statistics. This value can permanently corrupt data files. After using this value, be prepared to drop and recreate all secondary indexes.

5 (SRV_FORCE_NO_UNDO_LOG_SCAN)

Does not look at undo logs when starting the database: InnoDB treats even incomplete transactions as committed. This value can permanently corrupt data files.

6 (SRV_FORCE_NO_LOG_REDO)

Does not do the redo log roll-forward in connection with recovery. This value can permanently corrupt data files. Leaves database pages in an obsolete state, 
which in turn may introduce more corruption into B-trees and other database structures.

在my.cnf中添加完 innodb_force_recovery 配置后，mysql可以启动了，然后尝试修复所有的数据库：

mysqlcheck -u root -p --auto-repair --check --all-databases
common_schema.metadata                             OK
common_schema.numbers                              OK
mysql.columns_priv                                 OK
mysql.db                                           OK
mysql.event                                        OK
mysql.func                                         OK
mysql.general_log                                  OK
mysql.help_category                                OK
mysql.help_keyword                                 OK

全部修复完成，去掉innodb_force_recovery，再重启mysql，仍然报错。

使用xtrabackup的操作中发现了这样一个提示：
xtrabackup: Error: failed to read page after 10 retries. File ./ibdata1 seems to be corrupted.

于是尝试在 innodb_force_recovery = 4 的情况下，使用xtrabackup备份一下数据库，然后重建mysql。
这个过程中 innobackupex 备份是成功的，重建mysql也是成功的，重新apply-log和copy-back也是成功，但是恢复之后的数据库仍然无法启动，跟之前的情况类似，已经很无语了。

无奈之下选择使用了innobackupex最近的一次备份进行恢复，完成后重启mysql服务成功！
但是问题是最近的备份距现在已经过去了20个小时，innodb crash 这段时间的更新会丢失，由于在innodb_force_recovery = 4的状态下，mysql是可以启动的，所以在原来的数据基础上启动了新的端口：3307，然后手动比较3307和3306两套数据的变化！
唉，太挫了~