Scheduling impact of introducing nova placement (by quqi99)

Author: Zhang Hua  Published: 2020-09-17
Copyright: this article may be freely reproduced, but any reproduction must credit the original source and author with a hyperlink, together with this copyright notice

nova cell v2

nova cell v2 splits the nova DB into three databases (nova, nova_api and nova_cell0; instance data is stored only in the cell that owns it, while shared data lives in the nova_api DB). Three tables in nova_api (nova_api.host_mappings, nova_api.instance_mappings and nova_api.cell_mappings) make it possible to go straight from an instance to its cell_id and from there to that cell's DB and MQ connection info, so nova-api can operate on the cell's DB and MQ directly. This lets nova-compute scale horizontally to many more physical nodes; in addition, the API node no longer needs the old nova-cells service, only the nova-api and nova-scheduler services are required.

mysql> pager less -S
PAGER set to 'less -S'
mysql> show tables;
mysql> select instance_uuid,cell_id from instance_mappings;
+--------------------------------------+---------+
| instance_uuid                        | cell_id |
+--------------------------------------+---------+
| 4039ed4e-d0a1-46ba-99a5-68bc84421b42 |       2 |

mysql> select * from host_mappings;
+---------------------+------------+----+---------+-------------------------------------+
| created_at          | updated_at | id | cell_id | host                                |
+---------------------+------------+----+---------+-------------------------------------+
| 2020-09-16 06:18:26 | NULL       |  1 |       2 | juju-3ba760-ceilometer-15.cloud.sts |

mysql> select transport_url,name,database_connection from cell_mappings;
+----------------------------------------------------------------------------------------------------------+-------+-----------------------------------------------------------------------------+
| transport_url                                                                                            | name  | database_connection                                                         |
+----------------------------------------------------------------------------------------------------------+-------+-----------------------------------------------------------------------------+
| none:///                                                                                                 | cell0 | mysql+pymysql://nova:4Hjrdj5yMTkG6V9nxNpqrfVdhtJ5Tnww@10.5.0.103/nova_cell0 |
| rabbit://nova:wSz5LjscfBqKnhVWKBZnrXdwS5Kz6TByz9jKfm2xKHbCRYPPSbcnqFwPTnCp8VpP@10.5.0.199:5672/openstack | cell1 | mysql+pymysql://nova:4Hjrdj5yMTkG6V9nxNpqrfVdhtJ5Tnww@10.5.0.103/nova       |

So when you hit an error like the one below, it usually means the instance has no record in nova_api.instance_mappings yet; map the instances into their cell and retry the delete:

openstack server delete 2ebf1b2d-f679-4265-9c4b-71420dace71a
No server with a name or ID of 2ebf1b2d-f679-4265-9c4b-71420dace71a
sudo nova-manage cell_v2 list_cells
sudo nova-manage cell_v2 map_instances --cell_uuid <cell-id-from-above>
openstack server delete 2ebf1b2d-f679-4265-9c4b-71420dace71a

In addition, newly added compute hosts must be discovered and mapped into a cell before the scheduler can use them:

sudo nova-manage cell_v2 discover_hosts --verbose
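
Rather than running discover_hosts by hand after every new compute node, nova-scheduler can do the discovery periodically. A minimal nova.conf sketch (the 300-second interval is an arbitrary choice):

[scheduler]
# let nova-scheduler map newly added compute hosts into cells every 5 minutes
discover_hosts_in_cells_interval = 300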

nova placement API

The nova placement API was introduced in Newton; nova-scheduler calls placement-api during scheduling. It is mainly used to track the Inventory and Usage of Resource Providers (compute node, external storage pool, external IP allocation pool, etc.). Since Pike, the Placement API must be enabled to assist the nova-scheduler service with compute-node scheduling, replacing the earlier RAMFilter, CoreFilter and DiskFilter. The main concepts are:

  • Resource Class: a kind of resource. The placement API implements three standard resource classes by default (DISK_GB, MEMORY_MB, VCPU) and also offers an interface for custom resource classes (see the sketch right after this list).
  • Resource Provider: the object that actually provides resources, e.g. a compute node or a storage pool.
  • Inventory: the inventory of resources a resource provider owns, e.g. a compute node's vCPU, Disk and RAM inventories.
  • Resource Allocation: the allocation state, a mapping between Resource Class, Resource Provider and Consumer that records how much of a given resource class a consumer has used.
  • Provider Aggregate: a grouping of resource providers, similar to the HostAggregate concept.
  • Traits: characteristics of a resource provider. A trait describes a feature of the provider; it cannot be consumed, but some workflows need this information. For example, marking that an available disk is an SSD helps the scheduler better match an instance boot request.
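
Custom resource classes must carry the CUSTOM_ prefix; a minimal sketch with python3-osc-placement (CUSTOM_FPGA and $RP_UUID are hypothetical placeholders):

# resource class API needs placement microversion 1.2 or later
openstack --os-placement-api-version 1.2 resource class create CUSTOM_FPGA
openstack resource provider inventory set $RP_UUID --resource CUSTOM_FPGA=4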
# Note: when deleting a record from the compute_nodes table, the uuid field in the resource_providers table must be updated at the same time
# That said, the compute_nodes record does not need to be deleted; the resource-update thread should automatically refresh any dirty usage in it (eg: pinned_cpus)
mysql> select * from placement.resource_providers;
+---------------------+---------------------+----+--------------------------------------+-------------------------------------+------------+----------+------------------+--------------------+
| created_at          | updated_at          | id | uuid                                 | name                                | generation | can_host | root_provider_id | parent_provider_id |
+---------------------+---------------------+----+--------------------------------------+-------------------------------------+------------+----------+------------------+--------------------+
| 2020-09-16 06:18:15 | 2020-09-16 10:19:33 |  1 | a7081054-ee03-44b8-ae21-f20e0535cfc1 | juju-3ba760-ceilometer-15.cloud.sts |         19 |     NULL |                1 |               NULL |

# for the field resource_class_id, 0 means VCPU, 1 means MEMORY_MB, 2 means DISK_GB
mysql> select * from placement.inventories;
+---------------------+------------+----+----------------------+-------------------+-------+----------+----------+----------+-----------+------------------+
| created_at          | updated_at | id | resource_provider_id | resource_class_id | total | reserved | min_unit | max_unit | step_size | allocation_ratio |
+---------------------+------------+----+----------------------+-------------------+-------+----------+----------+----------+-----------+------------------+
| 2020-09-16 06:18:15 | NULL       |  1 |                    1 |                 0 |     2 |        0 |        1 |        2 |         1 |               16 |
| 2020-09-16 06:18:15 | NULL       |  2 |                    1 |                 1 |  3944 |      512 |        1 |     3944 |         1 |              1.5 |
| 2020-09-16 06:18:15 | NULL       |  3 |                    1 |                 2 |    38 |        0 |        1 |       38 |         1 |                1 |
mysql> select * from placement.allocations;
+---------------------+------------+----+----------------------+--------------------------------------+-------------------+------+
| created_at          | updated_at | id | resource_provider_id | consumer_id                          | resource_class_id | used |
+---------------------+------------+----+----------------------+--------------------------------------+-------------------+------+
| 2020-09-16 08:45:44 | NULL       | 16 |                    1 | 64cb10fd-246f-4864-b06d-687d59c47c2c |                 2 |    1 |
| 2020-09-16 08:45:44 | NULL       | 17 |                    1 | 64cb10fd-246f-4864-b06d-687d59c47c2c |                 1 |   64 |
| 2020-09-16 08:45:44 | NULL       | 18 |                    1 | 64cb10fd-246f-4864-b06d-687d59c47c2c |                 0 |    1 |

Placement CLI

sudo apt install python3-osc-placement -y

$ openstack resource provider list
+--------------------------------------+-------------------------------------+------------+
| uuid                                 | name                                | generation |
+--------------------------------------+-------------------------------------+------------+
| a7081054-ee03-44b8-ae21-f20e0535cfc1 | juju-3ba760-ceilometer-15.cloud.sts |         19 |
+--------------------------------------+-------------------------------------+------------+

$  openstack resource provider inventory list a7081054-ee03-44b8-ae21-f20e0535cfc1
+----------------+------------------+----------+----------+----------+-----------+-------+
| resource_class | allocation_ratio | min_unit | max_unit | reserved | step_size | total |
+----------------+------------------+----------+----------+----------+-----------+-------+
| VCPU           |             16.0 |        1 |        2 |        0 |         1 |     2 |
| MEMORY_MB      |              1.5 |        1 |     3944 |      512 |         1 |  3944 |
| DISK_GB        |              1.0 |        1 |       38 |        0 |         1 |    38 |
+----------------+------------------+----------+----------+----------+-----------+-------+

$ openstack resource provider usage show a7081054-ee03-44b8-ae21-f20e0535cfc1
+----------------+-------+
| resource_class | usage |
+----------------+-------+
| VCPU           |     3 |
| MEMORY_MB      |   192 |
| DISK_GB        |     3 |
+----------------+-------+
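
The query nova-scheduler itself sends to placement can be reproduced with the allocation-candidates API; a sketch (the resource amounts are arbitrary, and this command needs placement microversion 1.10 or later):

$ openstack --os-placement-api-version 1.10 allocation candidate list \
    --resource VCPU=1 --resource MEMORY_MB=256 --resource DISK_GB=1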

Some more notes on Traits

If Inventory and Allocation help a ResourceProvider manage quantities, Traits help it manage qualitative characteristics. For example, a user may ask for 80G of disk for an instance (a quantity) but also require that it be SSD (a characteristic), so the storage ResourceProvider needs to be marked as SSD or not. A trait is similar to a tag (https://github.com/openstack/os-traits).
The resource_provider_traits table therefore links the resource_providers table with traits:
mysql> select * from resource_provider_traits where resource_provider_id = 2;
+---------------------+------------+----------+----------------------+
| created_at          | updated_at | trait_id | resource_provider_id |
+---------------------+------------+----------+----------------------+
| 2020-10-26 12:30:09 | NULL       |       59 |                    2 |

mysql> select * from traits;
+---------------------+------------+-----+---------------------------------------+
| created_at          | updated_at | id  | name                                  |
+---------------------+------------+-----+---------------------------------------+
| 2020-10-26 12:27:34 | NULL       |  59 | COMPUTE_DEVICE_TAGGING                |
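
The same trait data is visible through the CLI; a quick sketch with python3-osc-placement ($RP_UUID is a placeholder for the provider uuid, and the traits API needs microversion 1.6 or later):

$ openstack --os-placement-api-version 1.6 trait list | grep COMPUTE_DEVICE_TAGGING
$ openstack --os-placement-api-version 1.6 resource provider trait list $RP_UUID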

How is all of this used together? For example, to model an NFS share that is shared by a group of compute nodes:
1, The cloud deployer creates an aggregate representing all the compute nodes in row 1, racks 6 through 10:
openstack aggregate create r1rck0610
AGG_UUID=$(openstack aggregate show r1rck0610 -f value -c uuid)   # the uuid column needs compute API >= 2.41
# for all compute nodes in the system that are in racks 6-10 in row 1...
openstack aggregate add host r1rck0610 $HOSTNAME

2, The cloud deployer creates a ResourceProvider representing the NFS share and associates it with the aggregate:
RP_UUID=$(openstack resource provider create "/mnt/nfs/row1racks0610/" -f value -c uuid)
openstack --os-placement-api-version 1.1 resource provider aggregate set \
    --aggregate $AGG_UUID $RP_UUID

3, The cloud deployer updates the resource provider’s capacity of shared disk:
openstack resource provider inventory set $RP_UUID \
    --resource DISK_GB=100000 \
    --resource DISK_GB:reserved=1000 \
    --resource DISK_GB:min_unit=50 \
    --resource DISK_GB:max_unit=10000 \
    --resource DISK_GB:step_size=10 \
    --resource DISK_GB:allocation_ratio=1.0

4, The cloud deployer adds the SSD trait (STORAGE_DISK_SSD is the standard os-traits name; note that 'trait set' replaces the provider's whole trait list):
openstack --os-placement-api-version 1.6 resource provider trait set \
    --trait STORAGE_DISK_SSD $RP_UUID

5, Scheduling based on traits - https://docs.openstack.org/ironic/queens/install/configure-nova-flavors.html
openstack --os-baremetal-api-version 1.37 baremetal node add trait \
  $NODE_UUID CUSTOM_TRAIT1 HW_CPU_X86_VMX
nova flavor-key my-baremetal-flavor set trait:CUSTOM_TRAIT1=required
nova flavor-key my-baremetal-flavor set trait:HW_CPU_X86_VMX=required

one bug

For example, in the scenario described in https://bugs.launchpad.net/nova/+bug/1679750, an instance is created on hostA and then hostA dies. Deleting the instance at that point cannot go through nova-compute (it is dead), so the corresponding records in the allocations table are never removed. If hostA later comes back up, nova-compute's init_host -> _complete_partial_deletion handles the partially deleted instance, and the periodic resource update removes the stale allocations via:

pre_start_hook -> update_available_resource -> nova/compute/manager.py#update_available_resource_for_node -> update_available_resource -> _update_available_resource -> _remove_deleted_instances_allocations

Also, when a nova-compute node is removed, deleting its service and compute_nodes records does not delete the placement resource provider or the host mapping records.
A dead nova-compute obviously cannot clean up after itself, so the fix was to make nova-api also delete the records in the allocations table when it deletes instances -
https://review.opendev.org/#/c/580498/
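
Until that fix is available, leftover allocations can also be cleaned up by hand with osc-placement; a sketch keyed on the consumer uuid, which is simply the deleted instance's uuid (here reusing the consumer_id from the allocations table above):

openstack resource provider allocation show 64cb10fd-246f-4864-b06d-687d59c47c2c
openstack resource provider allocation delete 64cb10fd-246f-4864-b06d-687d59c47c2c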

debug

NOTE: the tables resource_providers, inventories and allocations are in the placement db rather than nova_api
select * from nova_api.host_mappings;
select * from nova_api.cell_mappings;
select * from placement.resource_providers where name like '%xxx%';
select * from nova.compute_nodes where host like '%bagon%' or hypervisor_hostname like '%xxx%';
select * from placement.inventories where resource_provider_id in (select id from placement.resource_providers where name like '%xxx%');
select * from placement.allocations where resource_provider_id in (select id from placement.resource_providers where name like '%xxx%') order by consumer_id,resource_provider_id,resource_class_id;
select uuid, host, node, vcpus, memory_mb, vm_state, power_state, task_state, root_gb, ephemeral_gb, cell_name,deleted from nova.instances where uuid in (select consumer_id from placement.allocations where resource_provider_id in (select id from placement.resource_providers where name like '%xxx%')) order by uuid;

2020-12-30 update - another bug

"select numa_topology from nova.compute_nodes where hypervisor_hostname=‘cloud3.xxx.com’\G"显示cell0上的pinned_cpus将所有CPU全用完了导致nova-schedule无法继续调度报“Filter NUMATopologyFilter returned 0 hosts"这种错。
下面代码分析显示周期性的update_available_resource本来是可以自动修改数据库记录的。

pre_start_hook -> update_available_resource -> _update_available_resource -> _update_usage_from_instances -> _update_usage_from_instance -> _update_usage -> numa_usage_from_instance_numa

host_cell.pinned_cpus from the database is taken as the initial value of pinned_cpus. Note carefully, though, that host_cell.pinned_cpus is not read straight from the database: the self._copy_resources(cn, resources) call below effectively keeps host_cell.pinned_cpus empty forever.

def _init_compute_node(self, context, resources):
    ...
    if nodename in self.compute_nodes:
        cn = self.compute_nodes[nodename]
        self._copy_resources(cn, resources)
        self._setup_pci_tracker(context, cn, resources)
        return False

The free flag decides whether CPUs are added to or removed from pinned_cpus.

./nova/virt/hardware.py#numa_usage_from_instance_numa
def numa_usage_from_instance_numa(host_topology, instance_topology, free=False):
    ...
    for host_cell in host_topology.cells:
        new_cell = objects.NUMACell(
            id=host_cell.id,
            cpuset=shared_cpus,
            pcpuset=dedicated_cpus,
            memory=host_cell.memory,
            cpu_usage=0,
            memory_usage=0,
            mempages=host_cell.mempages,
            pinned_cpus=host_cell.pinned_cpus,
            siblings=host_cell.siblings)
        ...
        if free:
            if (instance_cell.cpu_thread_policy ==
                    fields.CPUThreadAllocationPolicy.ISOLATE):
                new_cell.unpin_cpus_with_siblings(pinned_cpus)
            else:
                new_cell.unpin_cpus(pinned_cpus)

free is determined by "free = sign == -1" (look closely: the operator on the right is a double-equals comparison, while the one on the left is a single-equals assignment).

def _update_usage(self, usage, nodename, sign=1):
    ...
    free = sign == -1
    cn.numa_topology = hardware.numa_usage_from_instance_numa(
        host_numa_topology, instance_numa_topology, free)._to_json()

def _update_usage_from_instance():
    is_new_instance = uuid not in self.tracked_instances
    is_removed_instance = not is_new_instance and (is_removed or
        instance['vm_state'] in vm_states.ALLOW_RESOURCE_REMOVAL)
    if is_new_instance:
        self.tracked_instances.add(uuid)
        sign = 1
    if is_removed_instance:
        self.tracked_instances.remove(uuid)
        sign = -1
    ...
    self._update_usage(self._get_usage_dict(instance, instance),
                       nodename, sign=sign)

So as long as update_available_resource runs, the dirty record will be corrected; since it was not corrected, update_available_resource must not have been running. The logs showed that placement was being reached via an endpoint starting with http instead of https, which made the placement API unavailable; update_available_resource then failed when calling into placement, and as a result it never ran again after 2020-10-26. See https://bugs.launchpad.net/charm-nova-compute/+bug/1826382

2020-10-26 15:43:34.459 1393 WARNING keystoneauth.discover [req-5dcdc394-2784-40d2-984c-54fe261f36f0 - - - - -] Failed to contact the endpoint at http://placement-int.xxx.com:8778 for discovery. Fallback to using that endpoint as the base url.
2020-10-26 15:43:34.463 1393 ERROR nova.compute.manager [req-5dcdc394-2784-40d2-984c-54fe261f36f0 - - - - -] Could not retrieve compute node resource provider 8bd4062b-84c7-4aab-ade7-31dc01695878 and therefore unable to error out any instances stuck in BUILDING state. Error: Failed to retrieve allocations for resource provider 8bd4062b-84c7-4aab-ade7-31dc01695878:
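
When chasing this kind of failure, it helps to first confirm what the keystone catalog actually advertises for placement; a sketch:

openstack catalog show placement
openstack endpoint list --service placement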

For setting up a NUMA test environment see https://blog.csdn.net/quqi99/article/details/51993512. One thing to note: defining isolcpus in grub does not stop nova from using those CPUs; nova has its own vcpu_pin_set option for that, sketched below.
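
A minimal nova.conf sketch of that option (the CPU range is arbitrary; on Train and later, [compute] cpu_dedicated_set supersedes vcpu_pin_set):

[DEFAULT]
# expose only host CPUs 2-7 to guest vCPUs, keeping 0-1 for the host itself
vcpu_pin_set = 2-7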

