In this post I will be exploring the current state of quality of service (QoS) in OpenStack. I will be looking at both what is possible now and what is on the horizon and targeted for the Havana release. Note that I am truly only intimately familiar with Glance and thus part of the intention of this post is to gather information from the community. Please let me know what I have missed, what I have gotten incorrect, and what else might be out there.
The term quality of service traditionally refers to a user's reservation, or guarantee, of a certain amount of network bandwidth. Instead of letting current network traffic and TCP flow control and back-off algorithms dictate the rate of a user's transfer across a network, the user requests N bits/second over a period of time. If the request is granted, the user can expect to have that amount of bandwidth at their disposal. It is quite similar to resource reservation.
When considering quality of service in OpenStack we really should look beyond networks and at all of the resources on which there is contention, the most important of which are:
Let us take a look at QoS in some of the prominent OpenStack components.
While quotas are quite different from QoS, they do have some overlapping concepts and thus will be discussed here briefly. A quota is a set maximum amount of a resource that a user is allowed to use. This does not necessarily mean that the user is guaranteed that much of the given resource; it only means that this is the most they can have. That said, quotas can sometimes be manipulated to provide a type of QoS (for example, set a bandwidth quota to 50% of your network resources per user and then only allow two users at a time).
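The reason the 50% / two-user trick works is simple arithmetic: a quota only behaves like a guarantee when the quotas of all admitted users fit within the total capacity at the same time. Here is a minimal sketch of that check. The function name and the Mbit/s figures are made up for illustration and do not come from any OpenStack component.

```python
def quota_is_guarantee(capacity_mbps, quota_mbps, max_users):
    """Return True if a per-user quota doubles as a guarantee.

    Hypothetical helper: the quota is only a guarantee when every admitted
    user can consume their full quota simultaneously without exceeding the
    total capacity.
    """
    return quota_mbps * max_users <= capacity_mbps


# Assumed numbers: a 1000 Mbit/s link with a 50% quota per user.
link = 1000
quota = link * 0.5

print(quota_is_guarantee(link, quota, max_users=2))  # two users fit exactly
print(quota_is_guarantee(link, quota, max_users=3))  # a third user breaks it
```

With two users the quotas add up to exactly the link capacity, so each user can count on their share. Admit a third and the quota goes back to being only an upper bound.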
Currently there is an effort in the Keystone community to add centralized quota management for all OpenStack components to Keystone. Keystone will provide management interfaces to the quota information. When a user attempts to use a resource, the OpenStack component involved will query Keystone for that resource's quota. Enforcement of the quota will be done by that OpenStack service, not by Keystone.
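To make the split of responsibilities concrete, here is a rough sketch of the lookup-then-enforce flow. Everything in it is hypothetical: `QuotaStore` stands in for the role proposed for Keystone, `Service` stands in for any OpenStack component, and none of these names are taken from the actual Keystone design.

```python
class QuotaExceeded(Exception):
    """Raised by the service when a request would exceed the user's quota."""


class QuotaStore:
    """Hypothetical stand-in for the central quota store (Keystone's role)."""

    def __init__(self, quotas):
        # Maps (user, resource) to the maximum amount allowed.
        self._quotas = dict(quotas)

    def get_quota(self, user, resource):
        # Returns None when no quota has been set for this pair.
        return self._quotas.get((user, resource))


class Service:
    """Hypothetical stand-in for a service that enforces quotas itself."""

    def __init__(self, store):
        self._store = store
        self._usage = {}

    def allocate(self, user, resource, amount):
        # Step 1: ask the central store what the limit is.
        limit = self._store.get_quota(user, resource)
        used = self._usage.get((user, resource), 0)
        # Step 2: enforcement happens here, in the service, not in the store.
        if limit is not None and used + amount > limit:
            raise QuotaExceeded(f"{user} would exceed {resource} quota {limit}")
        self._usage[(user, resource)] = used + amount
        return used + amount


store = QuotaStore({("alice", "images"): 2})
service = Service(store)
service.allocate("alice", "images", 1)
service.allocate("alice", "images", 1)
try:
    service.allocate("alice", "images", 1)
except QuotaExceeded as exc:
    print("rejected:", exc)
```

The store only answers questions about limits. Usage tracking and rejection both live in the service, which is why each component has to be modified before it can take part.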
The design for quota management in Keystone seems fairly complete and is described here. The implementation does not appear to be targeted for the Havana release, but hopefully we will see it some time in the I cycle. Note that once this is in Keystone the other OpenStack components must be modified to use it, so it will likely be some time before this is available across OpenStack.
Glance is the image registry and delivery component of OpenStack. The main resources that it uses are network bandwidth when uploading/downloading images and the storage capacity of backend storage systems (like Swift and GlusterFS). A user of Glance may wish to get a guarantee from the server that when they start uploading or downloading an image the server will deliver N bits/second. In order to achieve this, Glance not only has to reserve bandwidth on the worker's NIC and the local network, but it also has to get a similar QoS guarantee from the storage system which houses its data (Swift, GlusterFS, etc).
Glance provides no first-class QoS features. There is no way at all for a client to negotiate or discover the amount of bandwidth which can be dedicated to them. Even working around this with outside OS-level services is unlikely to succeed. The main problem is reserving the end-to-end path (from the network all the way through to the storage system).
In my opinion the solution to adding QoS to Glance is to get Glance out of the image delivery business. Efforts are well underway (and should be available in the Havana release) to expose the underlying physical locations of a given image (things like http:// and swift://). In this way the user can negotiate directly with the storage system for some level of QoS, or they can use Staccato to handle the transfer for them.
QoS for Cinder appears to be underway for the Havana release. Users of Cinder can ask for a specific volume type. Part of that volume type is a string that defines the QoS of the volume IO (fast, normal, or slow). Backends that can handle all of the demands of the volume type become candidates for scheduling.
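The candidate selection described above amounts to a filter over backends. Here is a minimal sketch of the idea, not the Cinder scheduler itself: the backend names, the `protocol` key, and the data layout are all assumptions made up for illustration.

```python
def candidate_backends(volume_type, backends):
    """Return the names of backends that satisfy every demand of the type.

    Hypothetical layout: `volume_type` maps a demand name to the requested
    value, and each backend maps a capability name to the set of values it
    can provide.
    """
    return [
        name
        for name, capabilities in backends.items()
        if all(
            value in capabilities.get(key, set())
            for key, value in volume_type.items()
        )
    ]


# Assumed example backends.
backends = {
    "ssd-array": {"qos": {"fast", "normal", "slow"}, "protocol": {"iscsi"}},
    "sata-pool": {"qos": {"normal", "slow"}, "protocol": {"iscsi"}},
}

print(candidate_backends({"qos": "fast", "protocol": "iscsi"}, backends))
print(candidate_backends({"qos": "slow", "protocol": "iscsi"}, backends))
```

A request for a "fast" volume leaves only the backend that advertises that level, while a "slow" request can be scheduled on either one.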
More information about QoS in cinder can be found in the following links:
Neutron (formerly known as Quantum) provides network connectivity as a service. A blueprint for QoS in Neutron can be found here and additional information can be found here.
This effort is targeted for the Havana release. In the presence of Neutron plugins that support QoS (Cisco, Nicira, ?) this will allow users to reserve network bandwidth.
In Nova all of the resources in the above list are used. User VMs necessarily use some amount of CPU, memory, IO, and network resources. Users truly interested in a guaranteed level of quality of service need a way to pin all of those resources. An effort for this in Nova is documented here with this blueprint.
While this effort appears to be what is needed in Nova, it is unfortunately quite old and currently marked as obsolete. However, the effort seems to have found new life recently, as shown by this email exchange. A definition of work can be found here with the blueprint here.
This effort will operate similarly to how Cinder is proposing QoS. A set of strings will be defined: High (1 vCPU per CPU), Normal (2 vCPUs per CPU), Low (4 vCPUs per CPU). This type string would then be added as part of the instance type when requesting a new VM instance. Memory commitment is not addressed in this effort, nor are network and disk IO (however those are best handled by Neutron and Cinder respectively).
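To see what the three levels would mean in practice, here is a small sketch that turns the level string into numbers. The mapping comes straight from the proposal above. The function names and the 8-CPU host are assumptions for illustration and are not part of Nova.

```python
# vCPUs allowed per physical CPU for each proposed QoS level.
VCPUS_PER_CPU = {"high": 1, "normal": 2, "low": 4}


def vcpu_capacity(physical_cpus, qos_level):
    """How many vCPUs a host could offer at the given QoS level."""
    return physical_cpus * VCPUS_PER_CPU[qos_level.lower()]


def worst_case_cpu_share(qos_level):
    """Fraction of a physical CPU a vCPU gets if all its neighbours are busy."""
    return 1 / VCPUS_PER_CPU[qos_level.lower()]


# Assumed host with 8 physical CPUs.
for level in ("High", "Normal", "Low"):
    print(level, vcpu_capacity(8, level), worst_case_cpu_share(level))
```

The trade-off is plain: each step down in level doubles how many vCPUs fit on the host and halves the share of a physical CPU that any one of them can count on.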
Unfortunately nothing seems to be scheduled for Havana.
Currently in Nova there is the following configuration option:
# cpu_allocation_ratio=16.0
This sets the ratio of virtual CPUs to physical CPUs. If this value is set to 1.0 then the user will know that the number of CPUs in its requested instance type maps to full system CPUs. Similarly there is:
# ram_allocation_ratio=1.5
which does the same thing for RAM. While these do give a notion of QoS to the user, they are too coarse-grained (they apply to every instance alike) and can be inefficient when considering users that do not need or want such QoS.
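As a rough illustration of what these two ratios imply for a single host, the sketch below multiplies physical capacity by each ratio. The default values are the ones shown above. The function names and the host size are made up for illustration, and the real scheduler accounting may differ in its details.

```python
def schedulable_vcpus(physical_cpus, cpu_allocation_ratio=16.0):
    """Number of vCPUs that may be placed on a host at the given ratio."""
    return int(physical_cpus * cpu_allocation_ratio)


def schedulable_ram_mb(physical_ram_mb, ram_allocation_ratio=1.5):
    """Amount of RAM (in MB) that may be promised to instances."""
    return int(physical_ram_mb * ram_allocation_ratio)


# Assumed host: 8 physical CPUs and 32 GB of RAM.
print(schedulable_vcpus(8))            # with the sample ratio of 16.0
print(schedulable_vcpus(8, 1.0))       # no overcommit: one vCPU per CPU
print(schedulable_ram_mb(32 * 1024))   # with the sample ratio of 1.5
```

Setting a ratio to 1.0 gives every instance a dedicated share, but it does so for every instance, including those that would have been happy with less.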
Swift does not have any explicit QoS options. However it does have a rate limiting middleware which provides a sort of quota on bandwidth for users. How to set these values can be found here.