CephFS (Jewel): "No space left on device" error when running rm -rf


1. Overview

Cause: the CephFS stray directories are full.

Loosely speaking, the stray directories act like a recycle bin. A stray directory fragment has a limit on the number of entries it can hold, 100,000 by default; entries are purged periodically, but purging is slow, so the stray directories stay full. For the rigorous explanation see https://docs.ceph.com/en/latest/cephfs/dirfrags/
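To confirm the diagnosis, the stray counters in the MDS perf dump can be checked; if num_strays is sitting near the 100,000 limit, this is the problem. (A sketch; mds.node3 here is the active MDS used later in this post, and the exact counter names vary slightly between releases.)

# how many stray entries the MDS currently holds and is purging
$ ceph daemon mds.node3 perf dump mds_cache | grep strays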

Temporary workaround:

ceph tell mds.node01 injectargs --mds_bal_fragment_size_max 1000000

Then persist mds_bal_fragment_size_max in the configuration file as well (a config snippet is sketched after this paragraph).
Of course this carries risk and feels a bit like drinking poison to quench a thirst: as the documentation says, if mds_bal_fragment_size_max is too large it can cause OSD failures. But without restarting the MDS there is not much else to do.
Job done.
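A sketch of the corresponding ceph.conf entry (only the [mds] section is shown; adjust to your own layout):

# /etc/ceph/ceph.conf on the MDS nodes
[mds]
mds_bal_fragment_size_max = 1000000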

Permanent solution

https://cloud.tencent.com/developer/article/1177839
(not tried)

2. Details

Restarting things outright in a production environment is clearly a bad idea, so the approach here is to adjust parameters while monitoring.
Note: treat this only as an experimental reference, and change settings with caution.

Check the current MDS status

$ ceph daemonperf mds.node3
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log---- 
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0   26k 129 |  0    0    0 |  0    0    0 |  0    0  6.2k   0 | 31  136    0 
  0   26k 129 |  0    0    0 |  0    0    0 |  0    0  6.2k   0 | 31  136    0 

Here node3 is the name of your active MDS.
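If you are not sure which MDS is currently active, the MDS map shows it (output format differs a bit between releases):

$ ceph mds stat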

Modify the parameters

$ ceph daemon mds.node3 config show | grep mds_bal_fragment_size_max
    "mds_bal_fragment_size_max": "100000",

# modify mds_bal_fragment_size_max
$ ceph tell mds.node01 injectargs --mds_bal_fragment_size_max 1000000
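
After injecting, the same config show query can confirm the new value took effect (assuming node01 is the daemon you injected into and its admin socket is reachable on this host):

# should now report "mds_bal_fragment_size_max": "1000000"
$ ceph daemon mds.node01 config show | grep mds_bal_fragment_size_max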

Settings can also be queried with "ceph --show-config".
Other settings can be changed in the same way (this helps a little, but is not recommended):

ceph tell mds.node03 injectargs --filer_max_purge_ops 10000
ceph tell mds.node03 injectargs --mds_max_purge_files 256

After the change, look at the MDS status again and the purge rate (the purg column) rises slightly.
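To watch the effect over time rather than eyeballing daemonperf, a small loop over the stray counters works too (a sketch, reusing the mds.node3 admin socket from above):

# print the stray counters every 5 seconds
$ while true; do ceph daemon mds.node3 perf dump mds_cache | grep strays; sleep 5; done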

Rationale

From the source file StrayManager.cc:

/*
 * Note that there are compromises to how throttling
 * is implemented here, in the interests of simplicity:
 *  * If insufficient ops are available to execute
 *    the next item on the queue, we block even if
 *    there are items further down the queue requiring
 *    fewer ops which might be executable
 *  * The ops considered "in use" by a purge will be
 *    an overestimate over the period of execution, as
 *    we count filer_max_purge_ops and ops for old backtraces
 *    as in use throughout, even though towards the end
 *    of the purge the actual ops in flight will be
 *    lower.
 *  * The ops limit may be exceeded if the number of ops
 *    required by a single inode is greater than the
 *    limit, for example directories with very many
 *    fragments.
 */
 /*......*/
  // Work out a limit based on n_pgs / n_mdss, multiplied by the user's
  // preference for how many ops per PG
  max_purge_ops = uint64_t(((double)pg_count / (double)mds_count) *
			   g_conf->mds_max_purge_ops_per_pg);

  // User may also specify a hard limit, apply this if so.
  if (g_conf->mds_max_purge_ops) {
    max_purge_ops = MIN(max_purge_ops, g_conf->mds_max_purge_ops);
  } 
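
As a rough illustration of that formula (the numbers are invented, not taken from this cluster): with pg_count = 512, a single active MDS, and mds_max_purge_ops_per_pg = 0.5, the limit works out to 512 / 1 * 0.5 = 256 purge ops, and mds_max_purge_ops (if set) only caps it further. This is one reason purging tends to stay slow on clusters with few PGs.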

How many files can a single xfs directory hold?

This raises a side question (unrelated to the error above): how many subdirectories can be created inside a single directory on Linux?
ext3 is said to allow only a little over 30,000.
ext4 and xfs are said to be effectively unlimited in theory; see https://www.cnblogs.com/bluestorm/p/11587064.html
The filesystem the Ceph OSDs are installed on is xfs, so xfs is what gets tested here.
Since the script takes a very long time to run, one million is enough for now.

#!/bin/bash
# Create directories named 1..1000005 inside ./tmp and stop at the first failure
mkdir -p tmp
cd tmp || exit 1
i=1
while [ "$i" -lt 1000006 ]
do
    if ! mkdir "$i"; then
        echo "cannot make dir $i"
        exit 1
    fi
    ((i++))
done

Result:

[root@node1 tmp]# ls | wc -l
1000005

In other words, a single xfs directory can hold at least a million entries.

3. References

https://www.cnblogs.com/hlc-123/p/10700725.html
https://www.cnblogs.com/zphj1987/p/13575384.html
http://blog.wjin.org/posts/cephfs-remove-file-error.html
https://cloud.tencent.com/developer/article/1177839
https://docs.ceph.com/en/latest/cephfs/mds-config-ref/#
https://cloud.tencent.com/developer/article/1181883
https://tracker.ceph.com/issues/15920?journal_id=71023
