Yugabytedb中的可扩展界数不同-DEV365 开发者社区

对于Oracle数据库，以下问题是隐式的：
发票具有状态（付费/未付款）和客户端（客户端_id），我们想迅速计数，有多少客户有未付费发票，在有更多信息时停止计数：

安迪·赛耶

@andrewsayer_

长周末的Oracle Performance测验。您将如何使这个快速：用未付发票计算独特的客户，停止计数为99，期望1m无薪发票。基本sql：从（从status status ='novaid'的发票中选择“ client_id”的选择count（*）

17:05 pm-2023年4月8日

要高效，我们不想为所有付费的发票索引，而只想索引未付费的发票。而且我们不想扫描所有未付发票来计算不同的客户。

oracle 数据库缺乏一些容易有效地执行此操作的功能：没有部分索引，没有索引跳过扫描（有跳过扫描，但仅用于其他情况）。解决方案可以使用程序方法进行跳过扫描，无论是从PL/SQL代码或带有子句的SQL递归。并使用虚拟列为未索引子集的虚拟列返回null。

postgresql 为部分索引提供了简单的声明：

create index invoice_unpaid on invoice ( client_id asc )
 where status='unpaid';

但是，PostgreSQL社区建议使用递归进行有效的索引扫描：https://wiki.postgresql.org/wiki/Loose_indexscan

使用 yugabytedb 您可以将问题写成一个简单的SQL，描述了业务规则：

select count(*) from (
  select distinct client_id 
  from invoice 
  where status='unpaid'
  limit 99
) as clients_unpaid

上面的索引，这足以获得对我们的查询的高性能和可扩展的响应。

对1亿发票进行测试

这是我如何创建表和索引来测试它的方式：

create extension if not exists pgcrypto;
create table invoice (
 primary key (invoice_uid)
 , invoice_uid uuid default gen_random_uuid()
 , client_id bigint check ( client_id>0 )
 , status text check ( status in ('paid','unpaid')  ) default 'unpaid'
);

insert into invoice (client_id, status)
select client_id , case when n>10 then 'paid' else 'unpaid' end as status
 from generate_series(1,1000) as client_id
    , generate_series(1,10000) as n
;

create index invoice_unpaid on invoice ( client_id asc ) 
 where status='unpaid';

请注意，在现实生活中，我可能已经将主键定义为（client_id，Invoice_UID），因为它可以为每个客户端提供快速访问模式，而无需辅助索引。在这里，我想展示它如何与辅助索引一起使用。无论如何，插入行时没有开销。在一个多动物区域中，维护索引最多需要两毫秒来为其添加索引条目。与NOSQL数据库相反，SQL的美丽：您可以添加新的访问模式，而无需更改现有的模式和代码。

我还通过运行9次插入来添加了更多发票。我在一个小的yugabytedb上这样做了3个节点群：

43写OPS/S是从SQL处理层（YSQL）发送到分布式交易存储（DOCDB）的行。在此3x4VCPU群集中，这是大约27000个插入物，复制因子rf = 3（即使一个可用性区域下降，也可以保证应用程序连续性）：

我有来自1000个不同客户的1亿张发票，每个客户有100个无薪发票：

yugabyte=> \timing on
Timing is on.

yugabyte=> select count(*) from invoice;

   count
-----------
 100000000
(1 row)

Time: 37560.658 ms (00:37.561)

yugabyte=> select count(*) from invoice where status='unpaid';

 count
--------
 100000
(1 row)

Time: 469.696 ms

yugabyte=> select count(distinct client_id)
           from invoice where status='unpaid' ;

 count
-------
  1000
(1 row)

Time: 470.734 ms

计算所有行不到一分钟。多亏了部分索引，计算未付费的索引的时间不到500毫秒，从中获得不同的client_id。

当限制99个不同的值时，这已经很好，甚至更好（使用PostgreSql limit或fetch first 99 rows only）：

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice 
            where status='unpaid' 
            limit 99
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 81.335 ms

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            fetch first 99 rows only
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 74.347 ms

鉴于我正在远程连接（瑞士到爱尔兰的连接约30毫秒），这是大约50毫秒的执行时间。我可以通过解释分析进行检查。

我从执行计划中看到的那样，与混合动力扫描的不同下降尚未发生：

yugabyte=> explain (costs off, analyze on, dist on)
            select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            fetch first 99 rows only
           ) as clients_unpaid;

                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=47.162..47.162 rows=1 loops=1)
   ->  Limit (actual time=1.541..47.147 rows=99 loops=1)
         ->  Unique (actual time=1.540..47.137 rows=99 loops=1)
               ->  Index Only Scan using invoice_unpaid on invoice (actual time=1.539..46.530 rows=9801 loops=1)
                     Heap Fetches: 0
                     Storage Index Read Requests: 11
                     Storage Index Execution Time: 43.999 ms
 Planning Time: 0.086 ms
 Execution Time: 47.201 ms
 Storage Read Requests: 11
 Storage Write Requests: 0
 Storage Execution Time: 43.999 ms
 Peak Memory Usage: 24 kB
(13 rows)

Time: 77.360 ms

我们应该添加有关下降的注释，但是数字说明：rows=9801仅从索引返回。如果在扫描过程中完成了独特的情况，我应该从这里看到99行，因为限制也被推下来。除此之外，只有一个Read Requests将返回99行

当前（我正在yugabytedb 2.17.2中进行测试，并为此打开了#16771），为了进行优化，我需要为该范围添加一个明确的启动，这很容易，因为我知道较低的范围client_id的界限：

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice 
            where status='unpaid' 
            and client_id>0 -- guaranteed by check constraint
            limit 99
           ) as clients_unpaid;
 count
-------
    99
(1 row)

Time: 31.505 ms

yugabyte=> select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid' 
            and client_id>0 -- guaranteed by check constraint
            fetch first 99 rows only
           ) as clients_unpaid;

 count
-------
    99
(1 row)

Time: 30.983 ms

我可以从执行计划中检查：

yugabyte=> explain (costs off, analyze on, dist on)
            select count(*) from (
            select distinct client_id 
            from invoice
            where status='unpaid'
            and client_id>0  -- guaranteed by check constraint
            fetch first 99 rows only
           ) as clients_unpaid;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate (actual time=2.025..2.025 rows=1 loops=1)
   ->  Limit (actual time=1.971..2.014 rows=99 loops=1)
         ->  Unique (actual time=1.969..2.005 rows=99 loops=1)
               ->  Index Only Scan using invoice_unpaid on invoice (actual time=1.967..1.987 rows=99 loops=1)
                     Index Cond: (client_id > 0)
                     Heap Fetches: 0
                     Storage Index Read Requests: 1
                     Storage Index Execution Time: 2.000 ms
 Planning Time: 0.096 ms
 Execution Time: 2.068 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 2.000 ms
 Peak Memory Usage: 24 kB
(14 rows)

Time: 31.029 ms

现在，只有索引扫描仅返回rows=99，证明了独特和极限都被推入一个索引扫描（Storage Index Read Requests: 1）。我得到的响应时间为31毫秒，因为我远程连接。来自瑞士和我的数据库位于爱尔兰（AWS EU-West-1），但Execution Time是2毫秒，这是该申请在同一地区运行时的响应时间。

单数响应时间很棒，但更重要的是它的可扩展性。在这里，执行时间不取决于表中的行总数（1亿），也不取决于未付费发票的总数（1万），也不取决于客户端数量（1000），而仅取决于查询：带有未付发票的99个客户。 具有与结果成正比的响应时间是为用户提供可预测性能的最佳方法。您不需要NOSQL数据库。您甚至不必更改数据模型。当然，只需在使用正确的SQL数据库中为您的用例添加正确的索引即可。 Yugabytedb分布于SQL，PostgreSQL兼容和开源。