主动的PostgreSQL数据库(S)性能扫描仪
#devops #postgres #database #monitoring

深夜。电话响。系统中的关键情况。 CPU达到百分之一百。迫切需要找到解决方案。

我认为这种情况对许多工程师都是熟悉的。

在关键任务系统上,重要的是要积极地防止系统的关键参数达到直接影响性能,稳定性和可靠性的最大值。

喜欢在医学上,我们想在出现第一症状后立即找到治愈。

数据库是现代系统的重要关键要素。

此博客是关于监视PostgreSQL数据库的。

它描述了一种方法,该方法可以轻松识别使用系统效率低下的查询,帮助找到绩效问题的根本原因,并帮助了解典型的工作负载模式和性能瓶颈。发现的模式和查询可以改进,系统将有效,有弹性地工作。

主动的PostgreSQL数据库(S)性能扫描仪是一个连接到数据库并运行一组探针的脚本,可以根据需要进行扩展。所有探针都是对数据库的查询,这些查询是由结构统一的。

它包括:

  • 阈值值,
  • 检查的描述,
  • 与哪个问题关联的问题,
  • 有关如何解决问题的建议
  • SQL查询要执行检查
  • 如果需要更多证据
  • ,则附加可选的SQL查询

如果某些检查超过阈值值,则将以以下标准形式生成相应的报告:

  • 检查的描述
  • datetime
  • 环境详细信息
  • 问题
  • 有关此问题的详细信息
  • 其他证据
  • 建议

脚本具有以下结构:

  • 执行探针(MainProcessor)的功能
  • 检查输入参数(帮助函数)的功能
  • 设置查询将被剪切的字符数。如果查询太长,则可以使输出可读是有用的。
  • 填充环境详细信息。
  • 检查PostgreSQL版本。如果应根据数据库引擎的版本执行不同的查询/检查,这很有用。
  • 检查PG_STAT_STATETENTS扩展已启用。脚本正在使用它。
  • 可扩展的探针集。

监视脚本可以根据选中的PostgreSQL数据库的版本运行不同类型的查询。当数据库元数据结构取决于版本时,这很有用。

如果需要监视几个PostgreSQL数据库,则可以在循环中运行主动的PostgreSQL数据库。

该脚本的基本版本包含可用于监视的示例检查。它包括与连接利用相关的探针,长期非最佳查询,查询的高CPU利用率等。可以将其扩展到任何其他指标,并指示监视和检查很重要。

如何运行主动的PostgreSQL DB性能扫描仪:

proactive_pg_db_performance_scanner.sh -h db_host -p 5432 -U postgres -d postgres

输出的示例:

Check in the pg_stat_statements DB queries that take more than 5000 ms
DateTime: 20230105_112233
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: Long-running queries
Details:
 userid | dbid  |       db_name        | total_time | calls | mean  |             query                                            | chk 
--------+-------+----------------------+------------+-------+-------+----------------------------------------------------------------------
  11111 | 11112 |    my_database_1     |  55555.00  |     1 | 55555 | select * from my_table where a='12345'                       | vwv 
  11111 | 11112 |    my_database_1     |  33333.00  |     1 | 33333 | update my_table set a='12345'                                | vwv 
  11111 | 11112 |    my_database_1     |  11111.00  |     1 | 11111 | delete from my_table where a='12345'                         | vwv 
(3 rows)
Recommendation: Check why the query/queries take so much time. It may be a heavy non-optimized query. Maybe it's an unusual application pattern.
Check the queries that occupy more than 15 % of a CPU
DateTime: 20230106_115523
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: Query/queries that utilize significant portion of CPU
Details:
 userid | dbid  |       db_name       |  total_time  |  calls  |    mean  | cpu_portion_pctg |                      query             | chk 
--------+-------+---------------------+--------------+---------+----------+------------------+----------------------------------------+-----
  11111 | 11112 |    my_database_1    | 888799911.12 | 9999999 |    88.88 |            80.00 | select * from my_table where a='12345' | wvw
  11111 | 11112 |    my_database_1    |     99999.99 |       1 | 99999.99 |            20.00 | update my_table set a='12345'          | wvw
(2 rows)
Recommendation: Check why the query/queries take a significant portion of the CPU. Maybe it takes significant time. Maybe it's running too frequently. Try to analyze why this DB query takes a significant part of the CPU.
The query/queries that allocates/allocate a significant number of connection slots (Threshold=300)
DateTime: 20230106_120551
Environment: Host:db_host; Port:5432; DB_Username:postgres; DB_Name: postgres
Issue: The most of connection slots are occupied by single query
Details:
 pctg  |        query                                          | num_of_allocated_connection_slots_by_the_query | tot_allocated_slots | chk 
-------+-------------------------------------------------------+------------------------------------------------+---------------------+-----
 55.50 | select * from my_table where a='12345'                |                                            555 |                1000 | wvw
 33.30 | update my_table set a='12345'                         |                                            333 |                1000 | wvw
(2 rows)
Recommendation: Check why a single pattern of queries allocates so many connection slots. It may be application logic, or an unusual application pattern issue. 

以下是主动的Postgresql DB性能扫描仪的源代码。

#!/bin/bash

##########################################################
#
# Proactive PostgreSQL DB Performance Scanner
#
# Purpose: Connect to the PostgreSQL DB instance
#          and run a set of queries
#          to find problematic performance patterns
# 
# Provide the output in the format:
#
# DateTime:
# Environment:
# Issue:
# Evidence:
# Recommendation:
#
# Date: 04-Jan-2023
#
# Author: Dmitry
#
###########################################################

mainProcessor()
{

inpHost="${1}"
inpPort="${2}"
inpDBusername="${3}"
inpDBname="${4}"
sql_query="${5}"
sql_query_extra="${6}"
probe="${7}"
the_environment="${8}"
issue="${9}"
recommendation="${10}"
the_line="${11}"

if [ -z "$inpHost" ] || [ -z "$inpPort" ] || [ -z "$inpDBusername" ] || [ -z "$inpDBname" ]
then
  echo "Error: not populated parameters!"
  exit 3
fi

answer=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -c "$sql_query")

nRowsReturned=`echo $answer | grep wvw | wc -l`

if [ -z "$sql_query_extra" ]
then
      evidence=""
else
      evidence=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -c "$sql_query_extra")
fi 

if [ "$nRowsReturned" -gt "0" ]; then

current_datetime=`date +"%Y%m%d_%H%M%S"`
echo " "
echo "$probe"
echo "DateTime: $current_datetime"
echo "Environment: $the_environment"
echo "Issue: $issue" 
echo "Details:"
echo "$answer"

if [ ! -z "$evidence" ]
then
  echo "Evidence:"
  echo "$evidence"
fi

echo "Recommendation: $recommendation"
echo " "
echo "$the_line"

fi

}

helpFunction()
{
   echo ""
   echo "Usage: $0 -h hostname -p port -U db_username -d db_name"
   echo -e "\t-h Postgres hostname"
   echo -e "\t-p Postgers port"
   echo -e "\t-U Postgres DB username"
   echo -e "\t-d Postgres DB name"
   echo -e " "
   exit 1 # Exit script after printing help
}

while getopts "h:p:U:d:" opt
do
   case "$opt" in
      h ) inpHost="$OPTARG" ;;
      p ) inpPort="$OPTARG" ;;
      U ) inpDBusername="$OPTARG" ;;
      d ) inpDBname="$OPTARG" ;;
      ? ) helpFunction ;; # Print helpFunction in case parameter is non-existent
   esac
done

# Print helpFunction in case parameters are empty
if [ -z "$inpHost" ] || [ -z "$inpPort" ] || [ -z "$inpDBusername" ] || [ -z "$inpDBname" ] 
then
   echo "Some or all of the parameters are empty";
   helpFunction
fi

echo " "
echo "Proactive PG DB Performance Scanner"
echo " "

the_line=" === === === === === === === === === === === === === === === "

echo "$the_line"

query_lenght_to_print=2048

the_environment="Host:$inpHost; Port:$inpPort; DB_Username:$inpDBusername; DB_Name: $inpDBname"

DBVersion=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -t -c "select version(); ")

DBVersion_Num=`echo $DBVersion | awk ' { print $2 } '`

# Check that pg_stat_statements is enabled and populated

n_check=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -qtX << EOF
select count(1) from information_schema.tables where table_name = 'pg_stat_statements';
EOF)

n_check=`echo $n_check | xargs`

if [ "$n_check" -eq "0" ]; then
    echo "The pg_stat_statements table does not exist. Please enable pg_stat_statements to be populated with recs."
    echo " "
    exit 1
fi

n_rows_pg_stat_statements=$(psql -h $inpHost -p $inpPort -U $inpDBusername -d $inpDBname -qtX << EOF
select count(1) n_rows_pg_stat_statements
  from pg_stat_statements
EOF)

n_rows_pg_stat_statements=`echo $n_rows_pg_stat_statements | xargs`

if [ "$n_rows_pg_stat_statements" -eq "0" ]; then
    echo "The pg_stat_statements table is empty. Please enable pg_stat_statements. It should be populated with recs."
    echo " "
    exit 1
fi

#################################################################################################

############### probe 1

n_threshold=10

probe="Check the databases having more than $n_threshold active connections"
issue="It were found databases with the high number of active connections"
recommendation="Check why Customers open so many active connections. It may be wrong configuration or unusual application pattern."

sql_query="
select datname, count(1) num_of_active_connections, 'wvw' chk
from pg_stat_activity
where datname!='' and state!='idle'
group by datname
having count(1)>$n_threshold
order by 2 desc
"

sql_query_extra="
select datname, state, client_addr, client_hostname, substr(query, 1, $query_lenght_to_print) query
from pg_stat_activity
where state!='idle' and datname in (
select datname
from
 (
  select datname, count(1) num_of_active_sessions
    from pg_stat_activity
   where state!='idle' and datname!=''
   group by 1
  having count(1)>0
 ) M
)
order by 1, 5
"

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

############### probe 2

n_threshold=30

probe="Check DB queries that take more than $n_threshold seconds"
issue="Long-running queries"
recommendation="Check why the query/queries take so much time. It maybe it's heavy non-optimized query. Maybe it's unusual application pattern. "

sql_query="
select
    now()-query_start as runtime,
    pid as process_id,
    datname as db_name,
    client_addr,
    client_hostname,
    substr(query, 1, $query_lenght_to_print) query,
    'wvw' chk
from pg_stat_activity
where state!='idle' and datname!=''
and now() - query_start > '$n_threshold seconds'::interval
order by 1 desc;
"

sql_query_extra=""

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"


############### probe 3

n_threshold=2000

probe="Check in the pg_stat_statements DB queries that take more than $n_threshold ms"
issue="Long-running queries"
recommendation="Check why the query/queries take so much time. It may be it is a heavy non-optimized query. Maybe it's an unusual application pattern."

if [[ $DBVersion_Num == "11"*  || $DBVersion_Num == "12"* ]] ; then

sql_query="
SELECT
        pss.userid,
        pss.dbid,
        pd.datname as db_name,
        round(pss.total_time::numeric, 2) as total_time,
        pss.calls,
        round(pss.mean_time::numeric, 0) as mean,
        substr(pss.query, 1, $query_lenght_to_print) query,
        'wvw' chk
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
and round(pss.mean_time::numeric, 0) > $n_threshold
ORDER BY round(pss.mean_time::numeric, 0) desc
LIMIT 30;
"

else

sql_query="
SELECT
        pss.userid,
        pss.dbid,
        pd.datname as db_name,
        round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) as total_time,
        pss.calls,
        round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) as mean,
        substr(pss.query, 1, $query_lenght_to_print) query,
        'wvw' chk
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
and round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) > $n_threshold
ORDER BY round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 0) desc
LIMIT 30;
"

fi

sql_query_extra=""

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

############### probe 4

n_threshold=10

probe="Check the queries that occupy more than $n_threshold % of a CPU"
issue="Query/queries that utilize significant portion of CPU"
recommendation="Check why the query/queries take a significant portion of the CPU. Maybe it takes significant time. Maybe it's running too frequently. Try to analyze why this DB query takes a significant part of the CPU."

if [[ $DBVersion_Num == "11"*  || $DBVersion_Num == "12"* ]] ; then

sql_query="
select M.*, 'wvw' chk
from
(SELECT
        pss.userid,
        pss.dbid,
        pd.datname as db_name,
        round(pss.total_time::numeric, 2) as total_time,
        pss.calls,
        round(pss.mean_time::numeric, 2) as mean,
        round((100 * pss.total_time / sum(pss.total_time::numeric) OVER ())::numeric, 2) as cpu_portion_pctg,
        substr(pss.query, 1, $query_lenght_to_print) query
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
ORDER BY pss.total_time
DESC LIMIT 30) M
where cpu_portion_pctg > $n_threshold;
"

else

sql_query="
select M.*, 'wvw' chk
from
(SELECT
        pss.userid,
        pss.dbid,
        pd.datname as db_name,
        round((pss.total_exec_time + pss.total_plan_time)::numeric, 2) as total_time,
        pss.calls,
        round((pss.mean_exec_time+pss.mean_plan_time)::numeric, 2) as mean,
        round((100 * (pss.total_exec_time + pss.total_plan_time) / sum((pss.total_exec_time + pss.total_plan_time)::numeric) OVER ())::numeric, 2) as cpu_portion_pctg,
        substr(pss.query, 1, $query_lenght_to_print) query
FROM pg_stat_statements pss, pg_database pd
WHERE pd.oid=pss.dbid
ORDER BY (pss.total_exec_time + pss.total_plan_time)
DESC LIMIT 30) M
where cpu_portion_pctg > $n_threshold;
"

fi

sql_query_extra=""

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

############### probe 5

n_threshold=1000

probe="Check DB queries that run more than $n_threshold times per second"
issue="Too frequent DB queries"
recommendation="Check why the query/queries run so frequent. Maybe it's pointing to some abnormal pattern. "

sql_query="
select M.*, 'wvw' chk
from
(with
a as (select dbid, queryid, query, calls s from pg_stat_statements),
b as (select dbid, queryid, query, calls s from pg_stat_statements, pg_sleep(1))
select
        pd.datname as db_name,
        substr(a.query, 1, $query_lenght_to_print) as the_query,
        sum(b.s-a.s) as runs_per_second
from a, b, pg_database pd
where
  a.dbid= b.dbid
and
  a.queryid = b.queryid
and
  pd.oid=a.dbid
and
  pd.datname not in ('postgres')
group by 1, 2
having sum(b.s-a.s) > $n_threshold
order by 3 desc) M;
"

sql_query_extra=""

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

############### probe 6

n_threshold=5

probe="Actual connections to Max connections ratio (Threshold=$n_threshold)"
issue="Too high ratio of actual connections to max connections"
recommendation="Check that there is enough connection slots."

sql_query="
select  a connection_slots_occupied,
        b max_connections,
        round((a.actual_connections::float/nullif(b.max_connections::float,0))::numeric*100, 2) the_ratio,
        'wvw' chk
        from
        (select count(1) as actual_connections from pg_stat_activity) a,
        (select setting as max_connections from pg_settings where name='max_connections') b
 where round((a.actual_connections::float/nullif(b.max_connections::float,0))::numeric*100, 2) > $n_threshold;
"

sql_query_extra="
select datname, substr(query, 1, $query_lenght_to_print) query, count(1) num_of_allocated_connection_slots
  from pg_stat_activity
  group by 1, 2
  having count(1) > 5
  order by 3 desc;
"

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

############### probe 7

n_threshold=5

probe="The query/queries that allocates/allocate the most connection slots (Threshold=$n_threshold)"
issue="The most of connection slots are occupied by single query"
recommendation="It maybe configuration issue. It looks suspicious. because single query occupies the most connection slots of the DB instance"

sql_query="
select
  round((M.num_of_allocated_connection_slots_by_the_query::float/nullif(M.tot_allocated_slots::float,0))::numeric*100, 2) pctg,
  M.*
from
(select
      substr(query, 1, $query_lenght_to_print) query,
      count(1) num_of_allocated_connection_slots_by_the_query,
      (select count(1) as n from pg_stat_activity) tot_allocated_slots,
      'wvw' chk
  from
      pg_stat_activity
group by 1, 3
having count(1) > $n_threshold
order by 2 desc) M;
"

sql_query_extra=""

mainProcessor "$inpHost" "$inpPort" "$inpDBusername" "$inpDBname" "$sql_query" "$sql_query_extra" "$probe" \
              "$the_environment" "$issue" "$recommendation" "$the_line"

echo " "

######################################
#
# End
#
######################################