Got an error reading communication packets db.Conn(context.Background()) wsarecv: An existing connection was forcibly closed by the remote host

结论：

客户端没有正常关闭数据库连接。

ip1、ip2分别为内网、公网，数据库结构、数据一样，show variables like "%time%"; 结果一样。

ip1有正常返回，ip2返回错误。

[mysql] 2021/11/17 10:52:12 packets.go:37: read tcp 10.128.1.192:57218->ip2:3306: wsarecv: An existing connection was forcibly closed by the remote host.
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xc0000005 code=0x0 addr=0x0 pc=0x2bd207]

package main

import (
	"context"
	"database/sql"
	"fmt"
	_ "github.com/go-sql-driver/mysql"
	"time"
)

func main() {
	//d, s := "mysql", "root:qazwsx@tcp(ip1:3306)/test"
	d, s := "mysql", "root:qazwsx@tcp(ip2:3306)/test"
	db, e := sql.Open(d, s)
	if e != nil {
		fmt.Println(e)
	}
	e = db.Ping()
	if e != nil {
		fmt.Println(e)
	}
	//db.Conn(context.Background())

	// close database after all work is done
	defer db.Close()

	if err := db.Ping(); nil != err {
		fmt.Println(err)
	}

	q := "SELECT ip,ONLINE FROM test WHERE ONLINE IN (1,0)"

	rows, _ := db.Query(q)
	m := map[string]bool{}
	for rows.Next() {
		var IPV4 string
		var b bool
		_ = rows.Scan(&IPV4, &b)
		m[IPV4] = b
	}

	for k, v := range m {
		fmt.Println(k, v)
	}
	for {

	}
}

　ip2中mysql日志：

root@test:/var/log/mysql# tail -f error.log

2021-11-17T02:12:14.708584Z 12987912 [Note] Aborted connection 12987912 to db: 'test' user: 'root' host: '1.234.2.18' (Got an error reading communication packets)
2021-11-17T02:12:51.898377Z 12987913 [Note] Aborted connection 12987913 to db: 'test' user: 'root' host: '1.234.2.18' (Got an error reading communication packets)

MySQL "Got an error reading communication packet" - Percona Database Performance Blog https://www.percona.com/blog/2016/05/16/mysql-got-an-error-reading-communication-packet-errors/

github.com/go-sql-driver/mysql@v1.6.0/packets.go:28


package mysql

import (
   "bytes"
   "crypto/tls"
   "database/sql/driver"
   "encoding/binary"
   "encoding/json"
   "errors"
   "fmt"
   "io"
   "math"
   "time"
)

// Packets documentation:
// http://dev.mysql.com/doc/internals/en/client-server-protocol.html

// Read packet to buffer 'data'
func (mc *mysqlConn) readPacket() ([]byte, error) {
   var prevData []byte
   for {
      // read packet header
      data, err := mc.buf.readNext(4)
      if err != nil {
         if cerr := mc.canceled.Value(); cerr != nil {
            return nil, cerr
         }
         errLog.Print(err)
         mc.Close()
         return nil, ErrInvalidConn
      }

      // packet length [24 bit]
      pktLen := int(uint32(data[0]) | uint32(data[1])<<8 | uint32(data[2])<<16)

      // check packet sync [8 bit]
      if data[3] != mc.sequence {
         if data[3] > mc.sequence {
            return nil, ErrPktSyncMul
         }
         return nil, ErrPktSync
      }
      mc.sequence++

      // packets with length 0 terminate a previous packet which is a
      // multiple of (2^24)-1 bytes long
      if pktLen == 0 {
         // there was no previous packet
         if prevData == nil {
            errLog.Print(ErrMalformPkt)
            mc.Close()
            return nil, ErrInvalidConn
         }

         return prevData, nil
      }

      // read packet body [pktLen bytes]
      data, err = mc.buf.readNext(pktLen)
      if err != nil {
         if cerr := mc.canceled.Value(); cerr != nil {
            return nil, cerr
         }
         errLog.Print(err)
         mc.Close()
         return nil, ErrInvalidConn
      }

      // return data if this was the last packet
      if pktLen < maxPacketSize {
         // zero allocations for non-split packets
         if prevData == nil {
            return data, nil
         }

         return append(prevData, data...), nil
      }

      prevData = append(prevData, data...)
   }
}

发现在ip的情况下，添加代码

db.Conn(context.Background())

注意：

数据库连接的关闭

defer db.Close()

2021/11/18 17:17:50 http: panic serving 10.128.1.55:50472: Error 1040: Too many connections
goroutine 767 [running]:
net/http.(*conn).serve.func1()
/home/u/go/src/net/http/server.go:1801 +0xb9
panic({0xa2c4a0, 0xc000534378})
/home/u/go/src/runtime/panic.go:1047 +0x266
WebSvc/internal/data.MysqlConn()
/home/u/code/WebSvc/internal/data/gflight.go:99 +0xff

则返回正常数据。

Go/src/database/sql/sql.go:1884

// Conn returns a single connection by either opening a new connection
// or returning an existing connection from the connection pool. Conn will
// block until either a connection is returned or ctx is canceled.
// Queries run on the same Conn will be run in the same database session.
//
// Every Conn must be returned to the database pool after use by
// calling Conn.Close.
func (db *DB) Conn(ctx context.Context) (*Conn, error) {
   var dc *driverConn
   var err error
   for i := 0; i < maxBadConnRetries; i++ {
      dc, err = db.conn(ctx, cachedOrNewConn)
      if err != driver.ErrBadConn {
         break
      }
   }
   if err == driver.ErrBadConn {
      dc, err = db.conn(ctx, alwaysNewConn)
   }
   if err != nil {
      return nil, err
   }

   conn := &Conn{
      db: db,
      dc: dc,
   }
   return conn, nil
}

Go/src/database/sql/sql.go:1259

// conn returns a newly-opened or cached *driverConn.
func (db *DB) conn(ctx context.Context, strategy connReuseStrategy) (*driverConn, error) {
   db.mu.Lock()
   if db.closed {
      db.mu.Unlock()
      return nil, errDBClosed
   }
   // Check if the context is expired.
   select {
   default:
   case <-ctx.Done():
      db.mu.Unlock()
      return nil, ctx.Err()
   }
   lifetime := db.maxLifetime

   // Prefer a free connection, if possible.
   numFree := len(db.freeConn)
   if strategy == cachedOrNewConn && numFree > 0 {
      conn := db.freeConn[0]
      copy(db.freeConn, db.freeConn[1:])
      db.freeConn = db.freeConn[:numFree-1]
      conn.inUse = true
      if conn.expired(lifetime) {
         db.maxLifetimeClosed++
         db.mu.Unlock()
         conn.Close()
         return nil, driver.ErrBadConn
      }
      db.mu.Unlock()

      // Reset the session if required.
      if err := conn.resetSession(ctx); err == driver.ErrBadConn {
         conn.Close()
         return nil, driver.ErrBadConn
      }

      return conn, nil
   }

   // Out of free connections or we were asked not to use one. If we're not
   // allowed to open any more connections, make a request and wait.
   if db.maxOpen > 0 && db.numOpen >= db.maxOpen {
      // Make the connRequest channel. It's buffered so that the
      // connectionOpener doesn't block while waiting for the req to be read.
      req := make(chan connRequest, 1)
      reqKey := db.nextRequestKeyLocked()
      db.connRequests[reqKey] = req
      db.waitCount++
      db.mu.Unlock()

      waitStart := nowFunc()

      // Timeout the connection request with the context.
      select {
      case <-ctx.Done():
         // Remove the connection request and ensure no value has been sent
         // on it after removing.
         db.mu.Lock()
         delete(db.connRequests, reqKey)
         db.mu.Unlock()

         atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

         select {
         default:
         case ret, ok := <-req:
            if ok && ret.conn != nil {
               db.putConn(ret.conn, ret.err, false)
            }
         }
         return nil, ctx.Err()
      case ret, ok := <-req:
         atomic.AddInt64(&db.waitDuration, int64(time.Since(waitStart)))

         if !ok {
            return nil, errDBClosed
         }
         // Only check if the connection is expired if the strategy is cachedOrNewConns.
         // If we require a new connection, just re-use the connection without looking
         // at the expiry time. If it is expired, it will be checked when it is placed
         // back into the connection pool.
         // This prioritizes giving a valid connection to a client over the exact connection
         // lifetime, which could expire exactly after this point anyway.
         if strategy == cachedOrNewConn && ret.err == nil && ret.conn.expired(lifetime) {
            db.mu.Lock()
            db.maxLifetimeClosed++
            db.mu.Unlock()
            ret.conn.Close()
            return nil, driver.ErrBadConn
         }
         if ret.conn == nil {
            return nil, ret.err
         }

         // Reset the session if required.
         if err := ret.conn.resetSession(ctx); err == driver.ErrBadConn {
            ret.conn.Close()
            return nil, driver.ErrBadConn
         }
         return ret.conn, ret.err
      }
   }

   db.numOpen++ // optimistically
   db.mu.Unlock()
   ci, err := db.connector.Connect(ctx)
   if err != nil {
      db.mu.Lock()
      db.numOpen-- // correct for earlier optimism
      db.maybeOpenNewConnections()
      db.mu.Unlock()
      return nil, err
   }
   db.mu.Lock()
   dc := &driverConn{
      db:         db,
      createdAt:  nowFunc(),
      returnedAt: nowFunc(),
      ci:         ci,
      inUse:      true,
   }
   db.addDepLocked(dc, dc)
   db.mu.Unlock()
   return dc, nil
}

In this blog post, we’ll discuss the possible reasons for MySQL “Got an error reading communication packet” errors and how to address them.

In Percona’s managed services, we often receive customer questions on communication failure errors. So let’s discuss possible reasons for this error and how to remedy it.

MySQL Communication Errors

First of all, whenever a communication error occurs, it increments the status counter for either Aborted_clients or Aborted_connects, which describe the number of connections that were aborted because the client died without closing the connection properly and the number of failed attempts to connect to MySQL server (respectively). The possible reasons for both errors are numerous (see the Aborted_clients increments or Aborted_connects increments sections in the MySQL manual).

In the case of log_warnings, MySQL also writes this information to the error log (shown below):

1 2	[Warning] Aborted connection 305628 to db: 'db' user: 'dbuser' host: 'hostname' (Got an error reading communication packets) [Warning] Aborted connection 305627 to db: 'db' user: 'dbuser' host: 'hostname' (Got an error reading communication packets)

In this case, MySQL increments the status counter for Aborted_clients, which could mean:

The client connected successfully but terminated improperly (and may relate to not closing the connection properly)
The client slept for longer than the defined wait_timeout or interactive_timeout seconds (which ends up causing the connection to sleep for wait_timeout seconds and then the connection gets forcibly closed by the MySQL server)
The client terminated abnormally or exceeded the max_allowed_packet for queries

The above is not an all-inclusive list. Now, let’s identify what is causing this problem and how to remedy it.

Fixing MySQL Communication Errors

To be honest, aborted connection errors are not easy to diagnose. But in my experience, it’s related to network/firewall issues most of the time. We usually investigate those issues with the help of Percona toolkit scripts, i.e. pt-summary / pt-mysql-summary / pt-stalk. The outputs from those scripts can be very helpful.

Some of the reasons for aborted connection errors can be:

A high rate of connections sleeping inside MySQL for hundred of seconds is one of the symptoms that applications aren’t closing connections after doing work, and instead relying on the wait_timeout to close them. I strongly recommend changing the application logic to properly close connections at the end of an operation.
Check to make sure the value of max_allowed_packet is high enough, and that your clients are not receiving a “packet too large” message. This situation aborts the connection without properly closing it.
Another possibility is TIME_WAIT. I’ve noticed many TIME_WAIT notifications from the netstat, so I would recommend confirming the connections are well managed to close on the application side.
Make sure the transactions are committed (begin and commit) properly so that once the application is “done” with the connection it is left in a clean state.
You should ensure that client applications do not abort connections. For example, if PHP has option max_execution_time set to 5 seconds, increasing connect_timeout would not help because PHP will kill the script. Other programming languages and environments can have similar safety options.
Another cause for delay in connections is DNS problems. Check if you have skip-name-resolve enabled and if hosts are authenticated against their IP address instead of their hostname.
One way to find out where your application is misbehaving is to add some logging to your code that will save the application actions along with the MySQL connection ID. With that, you can correlate it to the connection number from the error lines. Enable the Audit log plugin, which logs connections and query activity, and check the Percona Audit Log Plugin as soon as you hit a connection abort error. You can check for the audit log to identify which query is the culprit. If you can’t use the Audit plugin for some reason, you can consider using the MySQL general log – however, this can be risky on a loaded server. You should enable the general log for at least a few minutes. While it puts a heavy burden on the server, errors happen fairly often so you should be able to collect the data before the log grows too large. I recommend enabling the general log with an -f tail, then disable the general log when you see the next warning in the log. Once you find the query from the aborted connection, identify which piece of your application issues that query and co-relate the queries with portions of your application.
Try increasing the net_read_timeout and net_write_timeout values for MySQL and see if that reduces the number of errors. net_read_timeout is rarely the problem unless you have an extremely poor network. Try tweaking those values, however, because in most cases a query is generated and sent as a single packet to the server, and applications can’t switch to doing something else while leaving the server with a partially received query. There is a very detailed blog post on this topic from our CEO, Peter Zaitsev.

Aborted connections happen because a connection was not closed properly. The server can’t cause aborted connections unless there is a networking problem between the server and the client (like the server is half duplex, and the client is full duplex) – but that is the network causing the problem, not the server. In any case, such problems should show up as errors on the networking interface. To be extra sure, check the ifconfig -a output on the MySQL server to check if there are errors.

Another way to troubleshoot this problem is via tcpdump. You can refer to this blog post on how to track down the source of aborted connections. Look for potential network issues, timeouts and resource issues with MySQL.

I found this blog post useful in explaining how to use tcpdump on busy hosts. It provides help for tracking down the TCP exchange sequence that led to the aborted connection, which can help you figure out why the connection broke.

For network issues, use a ping to calculate the round trip time (RTT) between a machine where mysqld is located and the machine from where the application makes requests. Send a large file (1GB or more) to and from client and server machines, watch the process using tcpdump, then check if an error occurred during transfer. Repeat this test a few times. I also found this from my colleague Marco Tusa useful: Effective way to check network connection.

One other idea I can think of is to capture the netstat -s output along with a timestamp after every N seconds (e.g., 10 seconds so you can relate netstat -s output of BEFORE and AFTER an aborted connection error from the MySQL error log). With the aborted connection error timestamp, you can co-relate it with the netstat sample captured as per a timestamp of netstat, and watch which error counters increased under the TcpExt section of netstat -s.

Along with that, you should also check the network infrastructure sitting between the client and the server for proxies, load balancers, and firewalls that could be causing a problem.

Conclusion:
In addition to diagnosing communication failure errors, you also need to take into account faulty ethernets, hubs, switches, cables, and so forth which can cause this issue as well. You must replace the hardware itself to properly diagnose these issues.