[nginx] Analysis of an async_mode_nginx 100% CPU deadlock

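Unfortunately, I only managed to narrow the problem down to a fairly small scope and work out the root cause; I did not find the boundary conditions needed to reproduce it, nor a solution.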

Hi all, I am seeing much the same problem with the latest software versions:
async_nginx: 0.4.5
openssl: 1.1.1k
qatengine: 0.6.4
qatdriver: 1.7.l.4.13.0.9

To reproduce it, I use these config values in nginx.conf:
default_algorithms CIPHERS
qat_poll_mode heuristic

I debugged async_nginx and found an infinite loop, which I think is the cause.

1. In ngx_http_do_read_client_request_body(), nginx enters the for(;;) loop [line:288] and never breaks out,
because recv() [line:343] always returns NGX_AGAIN and c->read->ready is always 1.
Going deeper into recv(), the NGX_AGAIN is returned by ngx_ssl_handle_recv() [line:2546] because the async job is paused.
2. When the async context is swapped, another infinite loop occurs, in qat_chained_ciphers_do_cipher() [line:1554],
because the read() [qat_pause_job(), line:279] always returns EAGAIN.
3. As far as I know, qat_crypto_callbackFn() is called by qat_engine_poll(). I think the loops above occur because the callback qat_crypto_callbackFn() never gets any CPU time to run, so the paused async job is never woken up.
I then checked the polling logic in async_nginx and found point 4, described below.
4. In ngx_ssl_engine_qat_heuristic_poll(), the values of the six num_* variables never grow, so qat_engine_poll() never gets a chance to execute.
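
The dependency described in points 1 to 4 can be sketched as a small C model. This is only an illustration under stated assumptions, not code from nginx or QAT_Engine: every name in it (spin_model, model_recv, and so on) is hypothetical, and it is generous in that it gives the heuristic a chance to poll on every retry. Even so, the paused job can only finish if the poll runs, and the poll only runs if the counter reaches the threshold.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_AGAIN (-2)   /* hypothetical stand-in for NGX_AGAIN */

/* Hypothetical model state; not a structure from nginx or QAT_Engine. */
typedef struct {
    int  in_flight;   /* what the heuristic sees; stuck at 0 in the report */
    int  threshold;   /* stand-in for a heuristic poll threshold */
    bool job_done;    /* set only by the completion callback */
} spin_model;

/* Stand-in for qat_engine_poll(): the only place the callback can fire. */
static void model_engine_poll(spin_model *m)
{
    m->job_done = true;
}

/* Stand-in for the heuristic: poll only if enough requests look in flight. */
static void model_heuristic_poll(spin_model *m)
{
    if (m->in_flight >= m->threshold) {
        model_engine_poll(m);
    }
}

/* Stand-in for recv(): keeps returning "again" while the job is paused. */
static int model_recv(const spin_model *m)
{
    return m->job_done ? 0 : MODEL_AGAIN;
}

/* Retries like the for(;;) loop, but bounded so the model terminates.
 * Returns the iteration at which recv stopped returning "again",
 * or -1 if it never did within max_iters. */
static int model_read_loop(spin_model *m, int max_iters)
{
    assert(max_iters >= 0);
    for (int i = 0; i < max_iters; i++) {
        if (model_recv(m) != MODEL_AGAIN) {
            return i;
        }
        model_heuristic_poll(m);
    }
    return -1;
}
```

With in_flight at 0 and a positive threshold, model_read_loop() exhausts its iteration budget and returns -1, which corresponds to the 100% CPU spin. With a threshold of 0 the first retry already triggers the poll and the loop exits.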

When I change the engine config in nginx.conf, the issue disappears, so I can work around it. The config is as follows:
qat_heuristic_poll_asym_threshold = 0
qat_heuristic_poll_sym_threshold = 0

This looks like a logical deadlock: nginx waits for QAT to update the counters, but the counters can only be updated if nginx releases some CPU time.
Or maybe the following code does not account for SSL connections that stay idle for a long time?
if (*num_asym_requests_in_flight + *num_kdf_requests_in_flight
+ *num_cipher_requests_in_flight + *num_asym_mb_items_in_queue
+ *num_kdf_mb_items_in_queue + *num_sym_mb_items_in_queue
>= (int) *ngx_ssl_active) {

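To make the concern about idle connections concrete, here is a minimal sketch of the quoted condition as a standalone predicate. The struct, the function name and the threshold fallback are assumptions made for illustration: the fallback is inferred only from the fact that setting both thresholds to 0 works around the problem, and it has not been checked against the async_mode_nginx source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the six num_* counters named in the report. */
typedef struct {
    int asym_in_flight;
    int kdf_in_flight;
    int cipher_in_flight;
    int asym_mb_queued;
    int kdf_mb_queued;
    int sym_mb_queued;
} counters_model;

/* Returns true when this model would call the engine poll. */
static bool model_should_poll(const counters_model *c, int active_conns,
                              int asym_threshold, int sym_threshold)
{
    assert(c != NULL);

    int total = c->asym_in_flight + c->kdf_in_flight + c->cipher_in_flight
              + c->asym_mb_queued + c->kdf_mb_queued + c->sym_mb_queued;

    /* The quoted condition: every active connection has a request pending. */
    if (total >= active_conns) {
        return true;
    }

    /* Assumed fallback: poll once a per-type threshold is reached.
     * A threshold of 0 makes this branch always true. */
    return c->asym_in_flight >= asym_threshold
        || c->cipher_in_flight >= sym_threshold;
}
```

In this model, many active but idle connections combined with counters at 0 and positive thresholds give false on every call, so the poll never runs. Setting both thresholds to 0 gives true regardless of the counters, which matches the reported workaround.
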
Does anyone have any idea about this?

For details, see: https://github.com/intel/QAT_Engine/issues/181

Original post: https://www.cnblogs.com/hugetong/p/14922073.html