Notes on bugs hit while using tensorflow-ranking

tensorflow-ranking bugs

1 Assigning to a global variable inside a metric function

Error:

TypeError: An op outside of the function building code is being passed
a "Graph" tensor. It is possible to have Graph tensors
leak out of the function building context by including a
tf.init_scope in your function building code.
For example, the following function will fail:
  @tf.function
  def has_init_scope():
    my_constant = tf.constant(1.)
    with tf.init_scope():
      added = my_constant * 2
The graph tensor has name: add:0

Code that raises the error:

top_one_time = 0

def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]

    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))

    global top_one_time
    top_one_time += num_true

    return top_one_time

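Scenario:

Assigning to a global variable inside a metric function

Troubleshooting steps:

Isolated this statement by changing one thing at a time.
First guess was a TensorFlow framework error. Googling suggested the cause might be that a variable was initialized outside init_scope and then used inside init_scope.
One suggested fix is to add the following statement, which disables tf's eager mode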
```
tf.compat.v1.disable_eager_execution()
```

Trying this produced a new error

Error:

tensorflow.python.framework.errors_impl.FailedPreconditionError: 3 root error(s) found.
  (0) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
	 [[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
	 [[gt/Squeeze/_283]]
  (1) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
	 [[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
	 [[loss/gt_loss/pairwise_logistic_loss/weighted_loss/num_present/broadcast_weights/assert_broadcastable/is_valid_shape/else/_291/has_valid_nonscalar_shape/then/_1005/has_invalid_dims/ExpandDims_1/_371]]
  (2) Failed precondition: Error while reading resource variable metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/total/N10tensorflow3VarE does not exist.
	 [[{{node metrics/gt_mean_reciprocal_rank/mean_reciprocal_rank/mean/value/ReadVariableOp}}]]
0 successful operations.
0 derived errors ignored.

The fix I found by searching was:

from tensorflow.python.keras.backend import set_session
from tensorflow.python.keras.models import load_model

tf_config = some_custom_config
sess = tf.Session(config=tf_config)
graph = tf.get_default_graph()

# IMPORTANT: models have to be loaded AFTER SETTING THE SESSION for keras! 
# Otherwise, their weights will be unavailable in the threads after the session there has been set
set_session(sess)
model = load_model(...)

# and then in each request (i.e. in each thread):
global sess
global graph
with graph.as_default():
    set_session(sess)
    model.predict(...)

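Tried it; it had no effect

So I went back to the original version to look for a way in, tried changes based on the suspected cause, and rewrote the code as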

def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]

    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))

    return num_true

The error is gone

Further exploration: changed the code to

top_one_time = 0

def top_one_accuracy(y_true, y_pred):
    max_idx_gt = tf.argsort(y_true)[:, -1]
    max_idx_pred = tf.argsort(y_pred)[:, -1]

    judge = tf.equal(max_idx_gt, max_idx_pred)
    num_true = tf.reduce_sum(tf.cast(judge, tf.int32))

    global top_one_time
    top_one_time += num_true

    return num_true

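Still raises the error, so the trigger is the assignment to the global variable, not the value that is returned

2 Using the functions in tensorflow-ranking.metrics directly as metric functions

Error: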
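If the goal is a count that accumulates across batches, one option is to keep the state in variables owned by a `tf.keras.metrics.Metric` subclass instead of in a Python global. The following is a minimal sketch, not the fix used in these notes; the class name `TopOneAccuracy` and the weight names `hits` and `total` are made up for illustration, and it reports a running ratio rather than a raw count.

```python
import tensorflow as tf


class TopOneAccuracy(tf.keras.metrics.Metric):
    """Running fraction of lists whose top-scored item is the top-labeled item."""

    def __init__(self, name="top_one_accuracy", **kwargs):
        super().__init__(name=name, **kwargs)
        # State lives in variables owned by the metric, not in a Python global.
        self.hits = self.add_weight(name="hits", initializer="zeros", dtype="float32")
        self.total = self.add_weight(name="total", initializer="zeros", dtype="float32")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Same top-1 comparison as in the notes: index of the largest value per row.
        max_idx_gt = tf.argsort(y_true)[:, -1]
        max_idx_pred = tf.argsort(y_pred)[:, -1]
        judge = tf.cast(tf.equal(max_idx_gt, max_idx_pred), tf.float32)
        self.hits.assign_add(tf.reduce_sum(judge))
        self.total.assign_add(tf.cast(tf.shape(y_true)[0], tf.float32))

    def result(self):
        # divide_no_nan returns 0 before any batch has been seen.
        return tf.math.divide_no_nan(self.hits, self.total)
```

It would be passed to Keras as an instance, for example `model.compile(..., metrics=[TopOneAccuracy()])`.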

ValueError: tf.function-decorated function tried to create variables on non-first call.

Code that raises the error:

model.compile(metrics=[tfr.metrics.normalized_discounted_cumulative_gain, tfr.metrics.mean_reciprocal_rank])

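Scenario:

Using the functions in tensorflow-ranking.metrics directly as metrics

Troubleshooting steps:

Isolated this statement by changing one thing at a time.
First guess was a TensorFlow framework error. Googling suggested the cause might be incorrect use of the @tf.function decorator, but I was not using it.
So I started reading the tf-ranking source

Source: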

def normalized_discounted_cumulative_gain(
    labels,
    predictions,
    weights=None,
    topn=None,
    name=None,
    gain_fn=_DEFAULT_GAIN_FN,
    rank_discount_fn=_DEFAULT_RANK_DISCOUNT_FN):
  """Computes normalized discounted cumulative gain (NDCG).

  Args:
    labels: A `Tensor` of the same shape as `predictions`.
    predictions: A `Tensor` with shape [batch_size, list_size]. Each value is
      the ranking score of the corresponding example.
    weights: A `Tensor` of the same shape of predictions or [batch_size, 1]. The
      former case is per-example and the latter case is per-list.
    topn: A cutoff for how many examples to consider for this metric.
    name: A string used as the name for this metric.
    gain_fn: (function) Transforms labels. Note that this implementation of
      NDCG assumes that this function is *increasing* as a function of its
      imput.
    rank_discount_fn: (function) The rank discount function. Note that this
      implementation of NDCG assumes that this function is *decreasing* as a
      function of its imput.

  Returns:
    A metric for the weighted normalized discounted cumulative gain of the
    batch.
  """
  metric = metrics_impl.NDCGMetric(name, topn, gain_fn, rank_discount_fn)
  with tf.compat.v1.name_scope(metric.name,
                               'normalized_discounted_cumulative_gain',
                               (labels, predictions, weights)):
    per_list_ndcg, per_list_weights = metric.compute(labels, predictions,
                                                     weights)
  return tf.compat.v1.metrics.mean(per_list_ndcg, per_list_weights)

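Every call to this function creates a new metrics_impl.NDCGMetric object, which may cause some code to run outside of initialization and therefore fail (suspected cause)

So I wrote my own function as a replacement. It creates the metrics_impl.NDCGMetric object once up front, and each call to the function then calls its compute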

ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)

def metric_ndcg_topn(y_true, y_pred):
    return ndcg_topn.compute(y_true, y_pred, None)

Calling code:

model.compile(metrics=metric_ndcg_topn)

The error is gone

Further exploration. The following code still raises the error, which suggests the problem lies in tf.compat.v1.metrics.mean

ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)

def metric_ndcg_topn(y_true, y_pred):
    per_list_ndcg, per_list_weights = ndcg_topn.compute(y_true, y_pred, None)
    return tf.compat.v1.metrics.mean(per_list_ndcg, per_list_weights)

Further exploration. The following code raises no error, but the output is wrong

ndcg_topn = tfr.metrics.metrics_impl.NDCGMetric('ndcg_topn', app.transform_param_config.n)
mean = tf.keras.metrics.Mean()

def metric_ndcg_topn(y_true, y_pred):
    per_list_ndcg, per_list_weights = ndcg_topn.compute(y_true, y_pred, None)
    return mean(per_list_ndcg, per_list_weights)
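A possible way to get a single scalar per batch without creating any variables is to compute the weighted mean directly and let Keras do the averaging across batches, which it does for plain metric functions. This is a sketch under that assumption, not something tested in these notes; the helper name `weighted_batch_mean` is made up for illustration.

```python
import tensorflow as tf


def weighted_batch_mean(values, weights):
    """Stateless weighted mean over one batch: sum(values * weights) / sum(weights)."""
    values = tf.cast(values, tf.float32)
    # Broadcast the weights to the shape of the values before summing them.
    weights = tf.ones_like(values) * tf.cast(weights, tf.float32)
    weighted_sum = tf.reduce_sum(values * weights)
    # divide_no_nan returns 0 when every weight in the batch is 0.
    return tf.math.divide_no_nan(weighted_sum, tf.reduce_sum(weights))
```

With the `ndcg_topn` object defined above, the metric function would then return `weighted_batch_mean(*ndcg_topn.compute(y_true, y_pred, None))`. Because nothing here holds state, no variable is created on later calls.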