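Python Memory Management and Garbage Collection

More details on garbage collection: https://pythonav.com/wiki/detail/6/88/

Memory Management

The Python interpreter is written in C, and every operation in Python is ultimately carried out by the underlying C code, so understanding the low-level memory management requires reading the Python source code.

1. Two important structs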

include/object.h

#define _PyObject_HEAD_EXTRA            \
    struct _object *_ob_next;           \
    struct _object *_ob_prev;
     
#define PyObject_HEAD       PyObject ob_base;
 
#define PyObject_VAR_HEAD      PyVarObject ob_base;
 
 
typedef struct _object {
    _PyObject_HEAD_EXTRA // used to build the doubly linked list
    Py_ssize_t ob_refcnt; // reference counter
    struct _typeobject *ob_type;   // data type
} PyObject;


typedef struct {
    PyObject ob_base;  // the PyObject header
    Py_ssize_t ob_size; /* Number of items in variable part, i.e. the element count */
} PyVarObject;

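The source above is the foundation of Python's memory management. It contains:

2 structs
- PyObject, which has 3 members.
  - _PyObject_HEAD_EXTRA: the pointers used to build a doubly linked list.
  - ob_refcnt: the reference counter.
  - *ob_type: the data type.
- PyVarObject, which has 4 members (ob_base contributes 3 of them).
  - ob_base: a PyObject struct, i.e. the three members of PyObject.
  - ob_size: the number of elements it holds.
3 macros
- PyObject_HEAD: stands for a PyObject struct.
- PyObject_VAR_HEAD: stands for a PyVarObject struct.
- _PyObject_HEAD_EXTRA: stands for the next and previous pointers used to build the doubly linked list (in CPython this macro expands to these pointers only in builds compiled with Py_TRACE_REFS; otherwise it is empty).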

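When an object of any Python type is created, it is built on the PyObject or PyVarObject struct underneath. In general, an object made of a single value uses PyObject (float), while an object made of multiple elements uses PyVarObject (int, bytes, list, tuple, and the type objects behind user-defined classes), because a multi-element object needs an ob_size field to track its element count. Some containers, such as dict and set, start with PyObject_HEAD and keep their element counts in fields of their own.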
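The header layout described above can be observed from Python itself. The following is a minimal sketch, assuming a standard CPython build (not compiled with Py_TRACE_REFS and not the free-threaded build), where `id(obj)` is the object's memory address and the header starts with `ob_refcnt`. The class names `PyObjectHead` and `PyVarObjectHead` are hypothetical mirrors written for this illustration, not part of any API.

```python
import ctypes


class PyObjectHead(ctypes.Structure):
    # Mirrors PyObject without the _PyObject_HEAD_EXTRA pointers (assumption:
    # the interpreter was not built with Py_TRACE_REFS).
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
    ]


class PyVarObjectHead(ctypes.Structure):
    # Mirrors PyVarObject: a PyObject header followed by ob_size.
    _fields_ = [
        ("ob_base", PyObjectHead),
        ("ob_size", ctypes.c_ssize_t),
    ]


items = [10, 20, 30]
# In CPython, id() returns the address of the object, so the header can be
# read in place. Keep `items` alive for as long as `header` is used.
header = PyVarObjectHead.from_address(id(items))

print(header.ob_size)                             # element count of the list
print(header.ob_base.ob_type == id(type(items)))  # ob_type points at the list type
```

Reading memory this way is only meant for exploration; it depends on interpreter internals and should not be used in real programs.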

include/floatobject.h

include/longobject.h

include/bytesobject.h

include/listobject.h

include/tupleobject.h

include/dictobject.h

include/setobject.h

User-defined classes: include/object.h

Note: Python 3 keeps only the int type, but that int is Python 2's long type. See the official note, PEP 0237: Essentially, long renamed to int. That is, there is only one built-in integral type, named int; but it behaves mostly like the old long type.
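A quick check of this behaviour, as a minimal sketch: in Python 3 a value far beyond the 64-bit range is still a plain `int`, and its storage grows with its magnitude. The exact byte sizes depend on the build, so the code compares sizes rather than printing fixed numbers.

```python
import sys

small = 1
big = 2 ** 100  # well beyond what a 64-bit machine integer can hold

print(type(small) is int, type(big) is int)
# A larger magnitude needs more internal digits, so the object is bigger.
print(sys.getsizeof(big) > sys.getsizeof(small))
```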

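2. Memory management

Taking the float and list types as examples, we trace the execution flow in the Python source to understand the memory management mechanism.

2.1 The float type

Scenario 1: creating a float object

val = 3.14

When a float object is created this way, the source runs the following code in order.

Step 1: allocate memory according to the size the float type requires.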

Objects/obmalloc.c

Notes on the PyMemAllocatorEx methods

Step 2: initialize the type and the reference count in the newly allocated memory.

include/objimpl.h

Objects/object.c

So every time a float object is created, it is added to the doubly linked list named refchain.
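The refchain bookkeeping can be pictured with a small pure-Python model. This is a sketch under assumptions, not CPython code: `Node`, `add_to_all_objects`, `forget_reference`, and `labels` are hypothetical names, and the model assumes new objects are linked in right after a sentinel head node.

```python
class Node:
    """Stand-in for an object header carrying _ob_next and _ob_prev."""

    def __init__(self, label):
        self.label = label
        self._ob_next = self
        self._ob_prev = self


# Sentinel head of the circular doubly linked list (the "refchain").
refchain = Node("refchain")


def add_to_all_objects(node):
    """Link a new node in directly after the sentinel."""
    node._ob_next = refchain._ob_next
    node._ob_prev = refchain
    refchain._ob_next._ob_prev = node
    refchain._ob_next = node


def forget_reference(node):
    """Unlink a node from the list, as happens when an object is destroyed."""
    node._ob_next._ob_prev = node._ob_prev
    node._ob_prev._ob_next = node._ob_next
    node._ob_next = node._ob_prev = None


def labels():
    """Walk the list from the sentinel and collect the labels."""
    result, current = [], refchain._ob_next
    while current is not refchain:
        result.append(current.label)
        current = current._ob_next
    return result


a, b = Node("3.14"), Node("7.8")
add_to_all_objects(a)
add_to_all_objects(b)
print(labels())      # newest object first

forget_reference(a)
print(labels())      # only the remaining object
```

A sentinel node keeps insertion and removal free of special cases for an empty list, which is why the model uses one.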

Scenario 2: referencing a float object

val = 7.8
data = val

This case is simple: creating a new reference to an object increments its reference counter by 1.
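A minimal sketch of this behaviour: assignment binds a second name to the same object rather than copying it, which `is` and `id()` make visible.

```python
val = 7.8
data = val  # a new reference to the same float object, not a copy

print(data is val)
print(id(data) == id(val))
```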

include/object.h

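Scenario 3: destroying a float object

val = 3.14
# Explicitly delete the name
del val

"""
Deleting the name with del drops the reference; once the count reaches 0 the object is destroyed.
When a function finishes, its local variables are destroyed as well, e.g.:
def func():
    val = 2.22

func()
"""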

When an object is destroyed, the following code runs in order:

include/object.h

Objects/object.c

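Step 1: call the float type's tp_dealloc to release the memory.

In principle this step should free the object's memory directly, but float keeps an internal cache, so the actual flow is:

If float's internal cache already holds 100 or more blocks, `del val` really removes the object from memory.
If the cache holds fewer than 100, `del val` does not actually free the object; the object is instead pushed onto a singly linked list called free_list so that later float objects can reuse it.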
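The caching rule above can be modelled in a few lines of Python. This is a sketch of the idea only, not the C implementation: `Block`, `FloatPool`, and `MAX_FREE` are hypothetical names, and the cap of 100 is taken from the description above.

```python
MAX_FREE = 100  # cache limit described in the text


class Block:
    """Stand-in for a float-sized chunk of memory."""

    __slots__ = ("value", "next_free")

    def __init__(self):
        self.value = 0.0
        self.next_free = None


class FloatPool:
    def __init__(self):
        self.free_list = None  # head of the singly linked list of cached blocks
        self.num_free = 0

    def allocate(self, value):
        """Reuse a cached block if one exists, otherwise create a new one."""
        if self.free_list is not None:
            block = self.free_list
            self.free_list = block.next_free
            block.next_free = None
            self.num_free -= 1
        else:
            block = Block()
        block.value = value
        return block

    def release(self, block):
        """Return True if the block was cached, False if it was really freed."""
        if self.num_free >= MAX_FREE:
            return False
        block.next_free = self.free_list
        self.free_list = block
        self.num_free += 1
        return True


pool = FloatPool()
first = pool.allocate(3.14)
print(pool.release(first))   # cached instead of freed
second = pool.allocate(2.71)
print(second is first)       # the cached block was reused
```

Pushing and popping at the head of a singly linked list keeps both operations constant-time, which suits a cache that is hit on every float creation and destruction.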

Objects/floatobject.c

Extension: reading the source to understand what actually happens

Objects/obmalloc.c

Step 2: remove the object from the refchain doubly linked list.

Objects/object.c

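To sum up: when a float object is created, memory is allocated for it, its reference counter is initialized to 1, and it is added to the doubly linked list named refchain. When a new reference to a float object is made, Py_INCREF runs and increments the reference counter by 1. When a float object is destroyed, the number of blocks cached in float's free_list is checked first: if it has reached 100, the memory is really freed; otherwise the object is not destroyed but pushed onto the free_list singly linked list for later objects to reuse. The last step of destruction is removing the object from refchain.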
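The end of this lifecycle can be observed from Python with a weak reference. This is a minimal sketch, assuming CPython's immediate reference counting; float objects do not support weak references, so a hypothetical class `Probe` stands in for the float.

```python
import weakref


class Probe:
    """Hypothetical stand-in object that supports weak references."""


val = Probe()
watcher = weakref.ref(val)    # a weak reference does not keep the object alive
print(watcher() is not None)  # the object is alive

del val                       # the last strong reference is removed
print(watcher() is None)      # the object has been destroyed


def func():
    local = Probe()
    return weakref.ref(local)


# The local variable is released when the function returns.
print(func()() is None)
```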

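Garbage Collection

Python's garbage collection relies mainly on reference counting, with mark-and-sweep and generational collection as supplements.

1. Reference counter

Every object maintains a value that records how many times the object is referenced. When that count drops to 0, Python's garbage collector reclaims the object automatically. In the Python source, this counter is stored in the ob_refcnt field of PyObject.

Getting the reference counter, with a code example: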

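import sys

# Create the string object "武沛齐" in memory; its reference counter is 1
nick_name = '武沛齐'

# Expected 1, actual output 2: passing nick_name to getrefcount as an argument adds a temporary reference, so 2 is printed
# Note: once getrefcount returns, the counter drops by 1 again, so the real value is still 1.
# The exact numbers can be higher in practice, for example because the compiled code also holds a reference to the literal.
print(sys.getrefcount(nick_name))

# real_name now refers to the same string object "武沛齐", so the reference counter goes up by 1 to 2
real_name = nick_name

# Expected 2, actual output 3: passing nick_name as an argument adds a temporary reference, so 3 is printed
# Note: once getrefcount returns, the counter drops by 1 again, so the real value is still 2.
print(sys.getrefcount(nick_name))

# Delete the name real_name, which decrements the counter of the object it referred to
del real_name

# Expected 1, actual output 2: passing nick_name as an argument adds a temporary reference, so 2 is printed.
print(sys.getrefcount(nick_name))


# ############ getrefcount docstring ############
'''
def getrefcount(p_object): # real signature unknown; restored from __doc__
    """
    getrefcount(object) -> integer

    Return the reference count of object.  The count returned is generally
    one higher than you might expect, because it includes the (temporary)
    reference as an argument to getrefcount().
    """
    return 0
'''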

2. Circular references

Reference counting alone handles most of Python's garbage collection, but it has an obvious flaw: circular references.

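#!/usr/bin/env python
# -*- coding:utf-8 -*-
import gc
import objgraph  # third-party package


class Foo(object):
    def __init__(self):
        self.data = None


# Create two objects in memory; each has a reference counter of 1
obj1 = Foo()
obj2 = Foo()

# Make the two objects reference each other, which adds 1 to each counter, so both are now 2
obj1.data = obj2
obj2.data = obj1

# Delete the names, which decrements each counter by 1.
del obj1
del obj2

# Turn off the garbage collector. Python combines reference counting, mark-and-sweep, and generational collection to resolve circular references; disabling it makes it easy to see which objects in memory have not been freed.
gc.disable()

# At this point the circular reference keeps the counters of the two objects above 0, so reference counting alone cannot reclaim them.
# So memory still shows 2 objects of class Foo.
print(objgraph.count('Foo'))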

Note: gc.collect() can be called to trigger garbage collection explicitly.
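The effect of gc.collect() on a cycle can be checked with the standard library alone. The following is a minimal sketch, assuming CPython; `Node` and `make_cycle` are hypothetical names, and weak references are used to watch the objects without keeping them alive.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.data = None


def make_cycle():
    """Create two objects that reference each other, then drop the names."""
    first, second = Node(), Node()
    first.data = second
    second.data = first
    return weakref.ref(first), weakref.ref(second)


gc.disable()  # keep the automatic collector out of the way for the demo
ref_first, ref_second = make_cycle()

# Reference counting alone cannot free the pair: each still holds the other.
alive_before = ref_first() is not None and ref_second() is not None

found = gc.collect()  # returns the number of unreachable objects found

alive_after = ref_first() is not None or ref_second() is not None
gc.enable()

print(alive_before, alive_after, found)
```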

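Circular references leave objects in memory that can never be freed, so memory usage keeps growing and eventually results in a memory leak.

To solve this, Python adds mark-and-sweep and generational collection on top of reference counting.

So there is no longer any need to worry about circular references.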

Reference cycles involving lists, tuples, instances, classes, dictionaries, and functions are found.

Python GC source documentation: http://www.arctrix.com/nas/python/gc/

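3. Mark-and-sweep & generational collection

To handle circular references, Python tracks objects of the types lists, tuples, instances, classes, dictionaries, and functions: each such object is placed on a doubly linked list when it is created, and each object carries next and previous pointers that link it into the list. The PyObject definition below shows pointers of this kind, _ob_next and _ob_prev; note that in CPython the collector's own list pointers are kept in a separate PyGC_Head header placed in front of each tracked object.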

/* Nothing is actually declared to be a PyObject, but every pointer to
 * a Python object can be cast to a PyObject*.  This is inheritance built
 * by hand.  Similarly every pointer to a variable-size Python object can,
 * in addition, be cast to PyVarObject*.
 */
typedef struct _object {
    _PyObject_HEAD_EXTRA // doubly linked list
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
 
typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size;/* Number of items in variable part */
} PyVarObject;
 
 
/* Define pointers to support a doubly-linked list of all live heap objects. */
#define _PyObject_HEAD_EXTRA            \
    struct _object *_ob_next;           \
    struct _object *_ob_prev;

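As objects are created, this doubly linked list keeps growing.

Python runs a garbage collection when the number of object allocations minus deallocations since the last collection exceeds 700.
Python also runs a garbage collection when the code calls gc.collect() explicitly.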

import gc

gc.collect()

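During a collection, the interpreter walks every object on the list. Where circular references exist, it subtracts 1 from the reference counter of each object involved for every reference that comes from inside the cycle. It then splits the objects in two: those whose counter is 0 (collectable) and those whose counter is not 0 (not collectable). All objects with a counter of 0 are reclaimed, and the objects with a non-zero counter are moved to another doubly linked list (the next generation in generational collection).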
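The counting step described above can be illustrated with a simplified model. This is a sketch under assumptions rather than the interpreter's code: `find_unreachable` is a hypothetical function, objects are represented by names, and the model also follows references out of the externally referenced objects so that anything they keep alive survives, a step the description above leaves implicit.

```python
def find_unreachable(refcounts, edges):
    """Return the names of objects that only internal references keep alive.

    refcounts: name -> total reference count of the object
    edges:     name -> list of tracked objects it references
    """
    # Work on a copy so the real counters are untouched.
    gc_refs = dict(refcounts)

    # Subtract every reference that comes from another tracked object.
    for targets in edges.values():
        for target in targets:
            gc_refs[target] -= 1

    # Anything still above 0 is referenced from outside the tracked set.
    reachable = {name for name, count in gc_refs.items() if count > 0}

    # Objects referenced by a reachable object are reachable too.
    stack = list(reachable)
    while stack:
        current = stack.pop()
        for target in edges.get(current, ()):
            if target not in reachable:
                reachable.add(target)
                stack.append(target)

    return set(refcounts) - reachable


# "a" and "b" reference only each other; "c" is held by a variable and
# references "d".
refcounts = {"a": 1, "b": 1, "c": 1, "d": 1}
edges = {"a": ["b"], "b": ["a"], "c": ["d"], "d": []}

print(sorted(find_unreachable(refcounts, edges)))
```

Here the cycle between `a` and `b` ends up with adjusted counts of 0 and is reported as garbage, while `d` survives even though its adjusted count is also 0, because `c` still refers to it.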

About generational collection (generations):

The GC classifies objects into three generations depending on how many collection sweeps they have survived. New objects are placed in the youngest generation (generation 0). If an object survives a collection it is moved into the next older generation. Since generation 2 is the oldest generation, objects in that generation remain there after a collection. In order to decide when to run, the collector keeps track of the number object allocations and deallocations since the last collection. When the number of allocations minus the number of deallocations exceeds threshold0, collection starts. Initially only generation 0 is examined. If generation 0 has been examined more than threshold1 times since generation 1 has been examined, then generation 1 is examined as well. Similarly, threshold2 controls the number of collections of generation 1 before collecting generation 2.
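The counters the collector keeps for its generations can be inspected with the `gc` module. The sketch below only reads and prints them, since the actual numbers depend on the interpreter version and on what the program has allocated so far.

```python
import gc

# Current collection counts, one entry per generation.
print(gc.get_count())

# Collect only the youngest generation, then run a full collection.
found_young = gc.collect(0)
found_all = gc.collect()  # with no argument, a full collection is run
print(found_young, found_all)

print(gc.get_count())
```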

# The default thresholds are (700, 10, 10); they can also be changed explicitly.
import gc

# Signature: gc.set_threshold(threshold0[, threshold1[, threshold2]])
gc.set_threshold(700, 10, 10)
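When changing thresholds in a real program, it is usually worth reading the current values first and restoring them afterwards. A minimal sketch follows; the values 1000, 15, 15 are arbitrary examples, and the defaults are read at runtime rather than assumed because they can differ between interpreter versions.

```python
import gc

original = gc.get_threshold()
print(original)

# Make generation 0 collections less frequent, then check the new setting.
gc.set_threshold(1000, 15, 15)
print(gc.get_threshold())

# Restore the previous settings so the rest of the program is unaffected.
gc.set_threshold(*original)
```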

Official documentation: https://docs.python.org/3/library/gc.html

References:

http://www.wklken.me/posts/2015/09/29/python-source-gc.html

https://yq.aliyun.com/users/yqzdoezsuvujg/album?spm=a2c4e.11155435.0.0.d07467451AwRxO