Methods and apparatus for reconfiguring hosts in provider network environments in which hosts are evaluated to determine if steps of a full rebuild can be skipped. The hosts may implement slots of different types for virtual machines (VMs). Upon detecting that slots of a particular type are needed, a host that implements slots of another type may be selected for reconfiguration. The host may be evaluated to determine if one or more steps of a full rebuild can be skipped. The host may then be reconfigured to implement slots of the target type according to results of the evaluation. In at least some reconfigurations, at least one step of a full rebuild procedure is not performed for the respective host. Results of previous reconfigurations may be fed back into the evaluation process and used as one of the criteria for determining if steps can be skipped.
|
7. A method, comprising:
performing, by one or more devices on a provider network:
monitoring demand for two or more types of slots maintained in two or more logical pools of host devices with available slots on the provider network and a number of available slots on the host devices in the respective pools, each pool including host devices that implement slots of a respective one of the two or more slot types;
determining that additional slots of a particular one of the two or more slot types are needed in a respective one of the two or more pools;
selecting one or more host devices from another pool that includes host devices that implement slots of another slot type to be moved to the pool that includes host devices that implement slots of the particular slot type; and
reconfiguring the selected host devices according to respective rebuild strategies to implement slots of the particular type, wherein, in said reconfiguring at least one of the selected host devices, at least one of a plurality of steps in a full rebuild procedure for host devices is not performed for the selected host device.
18. One or more non-transitory computer-readable storage media storing program instructions that when executed on or across one or more processors cause the one or more processors to:
monitor demand for two or more types of slots maintained in two or more logical pools of host devices with available slots on the provider network and a number of available slots on the host devices in the respective pools, each pool including host devices that implement slots of a respective one of the two or more slot types;
determine that additional slots of a particular one of the two or more slot types are needed in a respective one of the two or more pools;
select one or more host devices from another pool that includes host devices that implement slots of another slot type to be moved to the pool that includes host devices that implement slots of the particular slot type; and
reconfigure the selected host devices according to respective rebuild strategies to implement slots of the particular type, wherein, in said reconfiguring at least one of the selected host devices, at least one of a plurality of steps in a full rebuild procedure for host devices is not performed for the selected host device.
1. A system, comprising:
a processor coupled to a memory, the memory including instructions that upon execution cause the system to:
maintain two or more logical pools of host devices with available slots on a provider network, each pool including host devices that implement slots of a respective one of two or more slot types;
monitor demand for the two or more types of slots maintained in the respective pools and a number of available slots on the host devices in the respective pools;
upon determining that demand is high for slots of a particular slot type in a respective pool or that the number of available slots of the particular slot type in the respective pool is below a threshold:
select one or more host devices from another pool that includes host devices that implement slots of another slot type to be moved to the pool that includes host devices that implement slots of the particular slot type; and
cause respective rebuild strategies to be executed for the selected host devices, wherein the rebuild strategy for a selected host device when executed reconfigures the selected host device to implement slots of the particular slot type; and
wherein the rebuild strategy for at least one of the selected host devices when executed does not perform at least one of a plurality of steps in a full rebuild procedure for the selected host device.
2. The system as recited in
select a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluate a respective rebuild strategy for the candidate host device to determine if the candidate host device can be rebuilt within a time constraint for providing the additional slots of the particular slot type and with an acceptable level of risk according to one or more risk constraints;
if the candidate host device can be rebuilt within the time constraint and with the acceptable level of risk, select the candidate host device to be reconfigured to implement slots of the particular type; and
if the candidate host device cannot be rebuilt within the time constraint and with the acceptable level of risk, select and evaluate another host device from the pool of host devices that implement slots of the other type as a candidate host device.
3. The system as recited in
determine a rebuild strategy for a host device that satisfies a time constraint and a risk constraint for reconfiguring the host device to implement slots of the particular type;
select a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluate host status information for the candidate host device to determine if the candidate host device can be rebuilt according to the determined rebuild strategy;
if the candidate host device can be rebuilt according to the determined rebuild strategy, select the candidate host device to be reconfigured to implement slots of the particular type according to the determined rebuild strategy; and
if the candidate host device cannot be rebuilt according to the determined rebuild strategy, select and evaluate another host device from the pool as a candidate host device.
4. The system as recited in
evaluate information about hardware components of the host device to determine if a hardware vetting step of the full rebuild procedure can be skipped for the host device;
evaluate information about one or more disks of the host device to determine if a disk cleaning step of the full rebuild procedure can be skipped for the host device;
evaluate information about software on the host device to determine if a software install step of the full rebuild procedure can be skipped for the host device;
evaluate the information about the software on the host device to determine if a software update step of the full rebuild procedure to update installed software for the host device can be skipped for the host device; and
determine if a reboot step of the full rebuild procedure can be skipped, wherein the reboot step of the full rebuild procedure can be skipped if other steps that are to be performed when reconfiguring the host device do not require that the host device be rebooted.
5. The system as recited in
6. The system as recited in
8. The method as recited in
selecting a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluating a respective rebuild strategy for the candidate host device to determine if the candidate host device can be rebuilt within a time constraint for providing the additional slots of the particular slot type and with an acceptable level of risk according to one or more risk constraints;
if the candidate host device can be rebuilt within the time constraint and with the acceptable level of risk, selecting the candidate host device to be reconfigured to implement slots of the particular type; and
if the candidate host device cannot be rebuilt within the time constraint and with the acceptable level of risk, selecting and evaluating another host device from the pool of host devices that implement slots of the other type as a candidate host device.
9. The method as recited in
determining a rebuild strategy for a host device that satisfies a time constraint and a risk constraint for reconfiguring the host device to implement slots of the particular type;
selecting a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluating the host status information for the candidate host device to determine if the candidate host device can be rebuilt according to the determined rebuild strategy;
if the candidate host device can be rebuilt according to the determined rebuild strategy, selecting the candidate host device to be reconfigured to implement slots of the particular type according to the determined rebuild strategy; and
if the candidate host device cannot be rebuilt according to the determined rebuild strategy, selecting and evaluating another host device from the pool of host devices that implement slots of the other type as a candidate host device.
10. The method as recited in
determining if a vetting step of the full rebuild procedure to test hardware components of the selected host device can be skipped;
determining if a disk cleaning step of the full rebuild procedure to wipe data from and repartition one or more disks of the selected host device can be skipped;
determining if a software install step of the full rebuild procedure to perform a clean install of software for the selected host device can be skipped;
determining if a software update step of the full rebuild procedure to update installed software for the selected host device can be skipped; or
determining if a reboot step of the full rebuild procedure can be skipped, wherein the reboot step of the full rebuild procedure can be skipped if the other steps that are to be performed when reconfiguring the host device do not require that the host device be rebooted.
11. The method as recited in
health information for the hardware components of the host device used in determining if the vetting step of the full rebuild procedure can be skipped;
health information for the one or more disks of the host device used in determining if the disk cleaning step of the full rebuild procedure can be skipped;
information about data currently stored on the one or more disks of the host device used in determining if the disk cleaning step of the full rebuild procedure can be skipped; or
information about software on the host device used in determining if the software install step of the full rebuild procedure can be skipped and in determining if the software update step of the full rebuild procedure can be skipped.
12. The method as recited in
13. The method as recited in
determining a time constraint for when the additional slots of the particular slot type are needed based at least in part on the number of available slots on the host devices in the respective pool;
determining an acceptable level of risk for not performing one or more steps of the full rebuild procedure when reconfiguring host devices to be moved to the pool that includes host devices that implement slots of the particular slot type based at least in part on the determined time constraint; and
wherein selecting one or more host devices from another pool that includes host devices that implement slots of another slot type to be moved to the pool that includes host devices that implement slots of the particular slot type is performed based at least in part on the determined time constraint and determined acceptable level of risk.
14. The method as recited in
increasing the time constraint and accepting a higher level of risk if the number of available slots on the host devices in the respective pool is below a threshold; and
relaxing the time constraint and accepting a lower level of risk if the number of available slots on the host devices in the respective pool is above the threshold.
15. The method as recited in
16. The method as recited in
17. The method as recited in
19. The one or more non-transitory computer-readable storage media as recited in
select a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluate a respective rebuild strategy for the candidate host device to determine if the candidate host device can be rebuilt within a time constraint for providing the additional slots of the particular slot type and with an acceptable level of risk according to one or more risk constraints;
if the candidate host device can be rebuilt within the time constraint and with the acceptable level of risk, select the candidate host device to be reconfigured to implement slots of the particular type; and
if the candidate host device cannot be rebuilt within the time constraint and with the acceptable level of risk, select and evaluate another host device from the pool of host devices that implement slots of the other type as a candidate host device.
20. The one or more non-transitory computer-readable storage media as recited in
determine a rebuild strategy for a host device that satisfies a time constraint and a risk constraint for reconfiguring the host device to implement slots of the particular type;
select a host device from the pool of host devices that implements slots of the other type as a candidate host device;
evaluate the host status information for the candidate host device to determine if the candidate host device can be rebuilt according to the determined rebuild strategy;
if the candidate host device can be rebuilt according to the determined rebuild strategy, select the candidate host device to be reconfigured to implement slots of the particular type according to the determined rebuild strategy; and
if the candidate host device cannot be rebuilt according to the determined rebuild strategy, select and evaluate another host device from the pool of host devices that implement slots of the other type as a candidate host device.
|
This application is a continuation of U.S. patent application Ser. No. 15/472,097, filed Mar. 28, 2017, which is hereby incorporated by reference herein in its entirety.
Many companies and other organizations operate computer networks that interconnect numerous computer systems to support their operations, such as with the computer systems being co-located (e.g., as part of a local network) or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, data centers housing significant numbers of interconnected computer systems have become commonplace, such as private data centers that are operated by and on behalf of a single organization, and public data centers that are operated by entities as businesses to provide computing resources to customers. Some public data center operators provide network access, power, and secure installation facilities for hardware owned by various customers, while other public data center operators provide “full service” facilities that also include hardware resources made available for use by their customers.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.
Various embodiments of methods and apparatus for reconfiguring host devices in provider network environments are described.
A hypervisor, or virtual machine monitor (VMM) 244, on the host 240 presents the VMs 248A-248n on the respective host 240 with a virtual platform and monitors the execution of the VMs 248A-248n on the host 240. Each VM 248 on a host 240 may be provisioned with a given amount of resources (memory space, storage (e.g., disk) space, computation (e.g., CPU) resources, etc.) as provided by the respective slot 246 on the respective host 240. Each VM 248 may be provided with one or more IP addresses; the VMM 244 on a respective host 240 may be aware of the IP addresses of the VMs 248A-248n on the host 240. The VMM 244 and VMs 248A-248n may be executed by components of the host 240, for example processor(s) and memory of the host 240, represented in
Referring again to
The provider network 100 may provide one or more services 104 implemented by computer systems comprising one or more computing devices on the provider network that provide APIs via which clients 190 may request slots 146 of the different types for their respective provider network implementations, for example for their private networks 110 on the provider network 100, via an intermediate network 170 such as the Internet. Once a client 190 acquires a slot 146, a VM 148 may be installed in the slot 146 as a resource instance 118, or simply instance, in the client's private network 110 according to the client's requirements. When a provider network service 104 receives a request for a slot 146 of a particular type from a client 190 (or from some other requestor, such as another provider network service), the provider network service 104 may send a request for an available slot 146 of that type to the pool management service 106. The pool management service 106 locates an available slot 146 on a host 140 in the pool of hosts that provide slots 146 of that type and notifies the requesting service 104 identifying the available slot 146. A VM 148 as specified by the client may then be installed in the slot 146 and configured as a resource instance 118 in the client's private network 110.
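The slot request flow described above can be illustrated with a minimal sketch. The class and method names (`Host`, `PoolManager`, `acquire_slot`) are hypothetical and are not part of the described embodiments; the sketch assumes each pool is an in-memory list of hosts keyed by slot type.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Host:
    host_id: str
    slot_type: str
    free_slots: int


@dataclass
class PoolManager:
    # Maps a slot type to the logical pool of hosts that implement that type.
    pools: Dict[str, List[Host]] = field(default_factory=dict)

    def acquire_slot(self, slot_type: str) -> Optional[str]:
        """Return the ID of a host with a free slot of this type, or None."""
        for host in self.pools.get(slot_type, []):
            if host.free_slots > 0:
                host.free_slots -= 1  # reserve the slot for the requester
                return host.host_id
        return None  # the pool for this slot type has no available slot


manager = PoolManager({"small": [Host("h1", "small", 0), Host("h2", "small", 2)]})
chosen = manager.acquire_slot("small")
```

In this sketch a `None` result corresponds to the situation in which the pool for the requested slot type has no available slots, which is the case that motivates moving hosts between pools.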
The pool management service 106 may monitor demand on the different types (e.g., different sizes) of slots 146 maintained in the respective pools, as well as the number of available slots 146 provided by hosts 140 in the pools, and may move hosts 140 between pools if needed. Conventionally, to move a host from one pool to another pool, a full rebuild procedure is run on the host 140 to reconfigure the host 140 with slots 146 of the target type.
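As one possible illustration of the monitoring described above, the following sketch flags pools that are short on slots and lists other pools that could give up a host. The `PoolStatus` fields, the `pending_requests` demand signal, and the `min_available` threshold are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolStatus:
    slot_type: str
    available_slots: int
    pending_requests: int  # hypothetical demand signal


def pools_needing_slots(statuses: List[PoolStatus], min_available: int = 4) -> List[str]:
    """Return slot types whose pools are below threshold or out-demanded."""
    short = []
    for s in statuses:
        if s.available_slots < min_available or s.pending_requests > s.available_slots:
            short.append(s.slot_type)
    return short


def donor_pools(statuses: List[PoolStatus], needed_type: str,
                min_available: int = 4) -> List[str]:
    """Return other slot types, best stocked first, that can spare a host."""
    others = [
        s for s in statuses
        if s.slot_type != needed_type
        and s.available_slots - s.pending_requests >= min_available
    ]
    others.sort(key=lambda s: s.available_slots, reverse=True)
    return [s.slot_type for s in others]


statuses = [
    PoolStatus("small", 2, 1),
    PoolStatus("medium", 20, 3),
    PoolStatus("large", 10, 0),
]
```

With these example values, `pools_needing_slots(statuses)` returns `["small"]` and `donor_pools(statuses, "small")` returns `["medium", "large"]`.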
The full rebuild procedure may take several hours, for example from 3 hours up to 10 or so hours in some cases. Thus, in some cases there may be a long delay between the time a client requests a slot and the time a VM is provisioned to the client's private network.
Embodiments of methods for reconfiguring hosts in a provider network are described in which hosts that are selected to be moved to another pool are evaluated to determine if one or more of the steps in the full rebuild procedure can be skipped when reconfiguring the hosts. If it is determined that one or more of the steps can be skipped for the host, then the host can be quickly reconfigured to implement slots of a different type by performing only the necessary step(s) and skipping (not performing) at least one step that is normally performed during a full rebuild procedure. The methods for reconfiguring hosts may thus, in at least some cases, reduce the delay between the time a client requests a slot and the time a VM is provisioned to the client's private network from several hours to a few minutes, for example ten minutes, five minutes, or one minute, depending on the number of steps that are not performed.
The provider network 300 may provide one or more services 304 implemented by computer systems comprising one or more computing devices on the provider network that provide APIs via which clients 390 may request slots of the different types for their respective private networks 310 on the provider network 300. Once a client 390 acquires a slot, a VM may be instantiated in the slot and configured as an instance 318 in the client's private network 310 according to the client's requirements. When a provider network service 304 receives a request for a slot of a particular type from a client 390, the provider network service 304 may send a request for an available slot of that type to the pool management service 306. The pool management service 306 locates an available slot on a host 340 in the pool 350 that provides slots of that type and notifies the requesting service 304 identifying the available slot. A VM as specified by the client may then be instantiated in the slot and provided as a resource instance 318 in the client's private network 310.
In
In
In some embodiments, the pool management service 404 may monitor results of execution of rebuild strategies on hosts 440; results information may, for example, be stored in pool management data 420. In some embodiments, the results of the previous executions of rebuild strategies on the hosts 440 may be provided as feedback to the rebuild strategy process 410 and used in determining the rebuild strategy for host 440A1.
The evaluation of host 440A1 by the rebuild strategy process 410 may determine that one or more of the steps in a full rebuild procedure that is typically used on the provider network 400 to reconfigure hosts (e.g., as illustrated in
At 1315, if the rebuild strategy indicates that hardware vetting is needed for the host, then hardware vetting may be performed as indicated at 1320. In hardware vetting, the hardware configuration of the host is identified and sent to a vetting process. The vetting process performs hardware vetting workflows (memory and disk checks, firmware version checks, etc.) to determine if the host hardware is sound. At 1315, if the rebuild strategy indicates that hardware vetting is not needed for the host, then element 1320 is skipped (not performed).
At 1325, if the rebuild strategy indicates that clean disks are needed on the host, then as indicated at 1330 the disk(s) on the host may be wiped to delete any data on the disk(s), and the disk(s) may be repartitioned. At 1325, if the rebuild strategy indicates that clean disks are not needed, then element 1330 is skipped.
At 1335, if the rebuild strategy indicates that a software install is needed on the host, then as indicated at 1340 a new image for the host execution environment/VMM and base software may be installed on the host. At 1335, if the rebuild strategy indicates that a software install is not needed, then element 1340 is skipped.
At 1345, if the rebuild strategy indicates that software and/or firmware updates are required on the host, then as indicated at 1360 the updates are installed. At 1345, if the rebuild strategy indicates that software and firmware updates are not required, then element 1360 is skipped.
At 1365, if the rebuild strategy indicates that the host needs to be rebooted, then as indicated at 1370 the host is rebooted to finalize and verify the installation. At 1365, if the rebuild strategy indicates that the host does not need to be rebooted, then element 1370 is skipped.
As indicated at 1380, the rebuilt host is registered with the control plane of the provider network to notify the control plane that the host's slots are available. VMs may then be launched in the host's slots, for example from machine images maintained by the provider network.
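The conditional flow of elements 1315 through 1380 can be sketched as follows. This is an illustrative sketch only: the `RebuildStrategy` fields, the `run_rebuild` function, and the `actions` mapping of step names to callables are hypothetical names, and the real step implementations are assumed to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RebuildStrategy:
    # Each flag says whether the corresponding step is needed for this host.
    vet_hardware: bool = True      # elements 1315/1320
    clean_disks: bool = True       # elements 1325/1330
    install_software: bool = True  # elements 1335/1340
    apply_updates: bool = True     # elements 1345/1360
    reboot: bool = True            # elements 1365/1370


def run_rebuild(strategy: RebuildStrategy,
                actions: Dict[str, Callable[[], None]]) -> List[str]:
    """Run the needed steps in order and return the names of steps performed."""
    steps = [
        ("vet_hardware", strategy.vet_hardware),
        ("clean_disks", strategy.clean_disks),
        ("install_software", strategy.install_software),
        ("apply_updates", strategy.apply_updates),
        ("reboot", strategy.reboot),
    ]
    performed = []
    for name, needed in steps:
        if needed:  # a step that is not needed is skipped (not performed)
            actions[name]()
            performed.append(name)
    # Element 1380: registration with the control plane always happens.
    actions["register"]()
    performed.append("register")
    return performed


log: List[str] = []
names = ["vet_hardware", "clean_disks", "install_software",
         "apply_updates", "reboot", "register"]
actions = {n: (lambda n=n: log.append(n)) for n in names}
fast = RebuildStrategy(vet_hardware=False, clean_disks=False,
                       install_software=False, reboot=False)
done = run_rebuild(fast, actions)
```

A `RebuildStrategy()` with all defaults corresponds to the full rebuild procedure; the `fast` strategy above performs only the update step and the registration.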
The following describes example factors that may be considered and methods that may be performed at element 1510 of
In some embodiments, the time factors may affect the risk thresholds; for example, the rebuild strategy process may accept a higher level of risk in skipping one or more of the steps or processes of the steps if the slots are needed immediately or as soon as possible. In some embodiments, the pool health information may affect the time and/or risk factors. For example, if the pool of hosts to which the host is to be moved is critically low, then the additional slots provided by the host may be needed immediately or as soon as possible, and a higher level of risk may be acceptable to meet the time constraint. Conversely, if the pool of hosts to which the host is to be moved is in relatively good shape, then the time constraint may be relaxed, and a less risky but longer rebuild process may be selected.
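The relationship between pool health, the time constraint, and the acceptable level of risk can be sketched as a simple mapping. The watermarks, time limits, and risk scores below are hypothetical values chosen for illustration; the description does not specify particular numbers or a particular risk scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebuildConstraints:
    max_minutes: int  # time allowed for the rebuild
    max_risk: float   # acceptable risk score, assumed to lie in [0, 1]


def constraints_for_pool(available_slots: int,
                         low_watermark: int = 5,
                         healthy_watermark: int = 20) -> RebuildConstraints:
    """Tighter time limits go with higher acceptable risk, and vice versa."""
    if available_slots < low_watermark:
        # Pool is critically low: slots are needed as soon as possible.
        return RebuildConstraints(max_minutes=10, max_risk=0.6)
    if available_slots < healthy_watermark:
        return RebuildConstraints(max_minutes=60, max_risk=0.3)
    # Pool is in good shape: a longer, less risky rebuild is acceptable.
    return RebuildConstraints(max_minutes=600, max_risk=0.05)
```

For example, `constraints_for_pool(2)` yields a short time limit with a high acceptable risk, while `constraints_for_pool(50)` yields a relaxed time limit with a low acceptable risk.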
In some embodiments, other factors or inputs may affect the time and/or risk factors, and may also affect whether certain steps can be skipped or should be performed. For example, in some embodiments, a user interface to a provider network service may be provided to provider network customers that allows the customers to specify the type and number of slots needed. The user interface may also allow the customers to specify a time constraint (e.g., immediately, as soon as possible, within an hour, or no constraint) and/or an acceptable level of risk that the customers are willing to take; customer inputs from the user interface may be provided to the rebuild strategy process as time and/or risk factors. Thus, a customer may indicate when requesting one or more slots of a particular type that the slots are needed immediately or as soon as possible, and/or may indicate a level of risk that is acceptable in order to get the slots as soon as possible. As another example, client information may include information about client(s) that currently have or have had VMs executing on slots of the target host, and/or information about client(s) to which slots of the target type are to be provided after the rebuild. For example, information about a client may indicate that a client's VMs that are to execute in slots after the rebuild implement critical applications and are not tolerant to failures, and therefore the client may be risk-averse. As another example, information about a client that has had VMs executing in slots of the target host may indicate that the client's applications implemented by the VMs handle sensitive data, and therefore the disk cleaning step should be performed.
In some embodiments, feedback from previous rebuilds based on rebuild strategies may affect time and/or risk factors, and may also affect whether certain steps can be skipped or should be performed. For example, if skipping certain steps or certain processes in steps in rebuilds has resulted in a significant number of failures or other problems on hosts and/or complaints from customers with VMs executing on the hosts, then the risk level for skipping those steps or processes may be raised. Conversely, if skipping certain steps or certain processes in steps in rebuilds has not resulted in failures or other problems on hosts and/or complaints from customers with VMs executing on the hosts, then the risk level for skipping those steps or processes may be lowered.
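One way such feedback could be tracked is sketched below, under the assumption that the risk level for skipping a step is estimated as a smoothed failure rate over previous rebuilds in which that step was skipped. The `SkipFeedback` class and its prior counts are hypothetical; the description does not prescribe a particular formula.

```python
from collections import defaultdict
from typing import Iterable


class SkipFeedback:
    """Tracks outcomes of rebuilds in which particular steps were skipped."""

    def __init__(self, prior_failures: int = 1, prior_rebuilds: int = 20):
        # Hypothetical prior so that a step with no history has a nonzero risk.
        self._prior_failures = prior_failures
        self._prior_rebuilds = prior_rebuilds
        self._failures = defaultdict(int)
        self._rebuilds = defaultdict(int)

    def record(self, skipped_steps: Iterable[str], failed: bool) -> None:
        """Record one rebuild outcome against every step that was skipped."""
        for step in skipped_steps:
            self._rebuilds[step] += 1
            if failed:
                self._failures[step] += 1

    def risk(self, step: str) -> float:
        """Smoothed failure rate observed when this step was skipped."""
        failures = self._failures[step] + self._prior_failures
        rebuilds = self._rebuilds[step] + self._prior_rebuilds
        return failures / rebuilds


feedback = SkipFeedback()
baseline = feedback.risk("clean_disks")
for _ in range(5):
    feedback.record(["clean_disks"], failed=True)
for _ in range(20):
    feedback.record(["vet_hardware"], failed=False)
```

In this sketch, repeated failures after skipping disk cleaning raise `feedback.risk("clean_disks")` above the baseline, while a run of successful rebuilds that skipped hardware vetting lowers `feedback.risk("vet_hardware")` below it.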
At the hardware vetting step of the full rebuild process, one or more hardware vetting workflows (memory and disk checks and stress tests, firmware version checks, etc.) may be performed to determine if the host hardware is sound. In some embodiments, in evaluating whether the workflows of the hardware vetting step can be skipped, the rebuild strategy process may look at status/health information for the host to determine how recently the vetting workflows were performed for the hardware components. If the vetting workflows have been performed within an acceptable time period (e.g., within the last week, or within the last two weeks), then the rebuild strategy process may decide that one or more of the hardware vetting workflows can be skipped for the host. In some embodiments, the rebuild strategy process may also look at the status/health information for the host to determine if any of the hardware components have been experiencing problems that generate errors. If a hardware component has not been generating any errors, or if the number of errors is below an acceptable threshold, then the hardware vetting workflow for that component may be skipped. The rebuild strategy process may also check the firmware version to make sure that the firmware is up to date, or at least at an acceptable level with no pending critical firmware update, and may decide that a firmware update is thus not necessary at this time and can be skipped.
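The hardware vetting evaluation can be sketched as follows. As a conservative assumption that the description does not mandate, this sketch skips a component's workflow only if the component was vetted recently and its error count does not exceed the threshold. The `ComponentHealth` fields and the default limits are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class ComponentHealth:
    name: str
    last_vetted: datetime
    recent_errors: int


def vetting_workflows_needed(components: List[ComponentHealth],
                             now: datetime,
                             max_age: timedelta = timedelta(days=14),
                             max_errors: int = 0) -> List[str]:
    """Return the components whose vetting workflow should still be run."""
    needed = []
    for c in components:
        stale = now - c.last_vetted > max_age   # vetted too long ago
        noisy = c.recent_errors > max_errors    # generating too many errors
        if stale or noisy:
            needed.append(c.name)
    return needed


now = datetime(2024, 1, 31)
components = [
    ComponentHealth("memory", datetime(2024, 1, 25), 0),
    ComponentHealth("disk", datetime(2024, 1, 1), 0),
    ComponentHealth("nic", datetime(2024, 1, 30), 3),
]
```

With these example values, `vetting_workflows_needed(components, now)` returns `["disk", "nic"]`: the memory workflow is skipped, the disk was vetted too long ago, and the network interface has been generating errors. An empty result means the entire hardware vetting step can be skipped.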
At the disk cleaning step of the full rebuild process, the on-host disk(s) are wiped and repartitioned. In some embodiments, in evaluating whether the steps of the disk cleaning step can be skipped for a host, the rebuild strategy process may look at status/health information for the host to determine health of the disk(s) and current partitioning to determine if the disk health and partitioning are acceptable; if they are, then the disk cleaning step may be skipped. In some embodiments, the rebuild strategy process may also look at information about client(s) that have had VMs executing in slots of the target host to determine whether the clients' applications implemented by the VMs handled sensitive data. If the disk(s) may include clients' sensitive data, the disk cleaning step should be performed; otherwise, the disk cleaning step may be skipped.
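The disk cleaning decision can be expressed as a small predicate. The `DiskState` fields are hypothetical names for the information described above; how each flag is derived from host status and client information is outside the scope of this sketch.

```python
from dataclasses import dataclass


@dataclass
class DiskState:
    healthy: bool
    partitioning_ok: bool
    may_hold_sensitive_data: bool


def disk_cleaning_needed(disk: DiskState) -> bool:
    """Return True if the disk(s) should be wiped and repartitioned."""
    if disk.may_hold_sensitive_data:
        # Possible sensitive client data always forces the cleaning step.
        return True
    # Otherwise clean only if disk health or partitioning is unacceptable.
    return not (disk.healthy and disk.partitioning_ok)
```

Note that the sensitive-data check takes precedence in this sketch: a healthy, correctly partitioned disk is still cleaned if it may hold clients' sensitive data.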
At the software installation step of the full rebuild process, a new image for the host execution environment/VMM is installed, and base software for the host is installed. In some embodiments, in evaluating whether the software installs of the software installation step can be skipped for a host, the rebuild strategy process may look at status/health information and the base rebuild requirements (e.g., the type of slots that are currently implemented on the host, and the type of slots that are to be implemented on the host during the rebuild process) to determine if the software installation step, or one or more installs of the step, may be skipped. For example, the rebuild strategy process may look at the versions of the currently installed software components to determine if the software components are sufficiently up-to-date and support the type of slots that are to be implemented on the host during the rebuild process. If so, at least part of the software installation step may be skipped. As another example, the rebuild strategy process may look at health information for the host to determine if the software has been executing for a period without generating errors; if the software has been generating errors or is otherwise suspect, the software installation step may need to be performed, and otherwise may be skipped.
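The software installation evaluation can be sketched as shown below. The sketch assumes dotted numeric version strings and a known set of slot types supported by the installed software; the function names, parameters, and component names are hypothetical.

```python
from typing import Dict, Set, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '1.10.1' into (1, 10, 1)."""
    return tuple(int(part) for part in text.split("."))


def software_install_needed(installed: Dict[str, str],
                            required: Dict[str, str],
                            supported_slot_types: Set[str],
                            target_slot_type: str,
                            recent_errors: int) -> bool:
    """Return True if a clean software install should be performed."""
    if recent_errors > 0:
        return True  # software is suspect; reinstall
    if target_slot_type not in supported_slot_types:
        return True  # installed software cannot implement the target slots
    for name, minimum in required.items():
        current = installed.get(name)
        if current is None or parse_version(current) < parse_version(minimum):
            return True  # component is missing or not sufficiently up to date
    return False


installed = {"vmm": "4.2.0", "agent": "1.10.1"}
required = {"vmm": "4.1.0", "agent": "1.9.0"}
skip_install = not software_install_needed(
    installed, required, {"small", "large"}, "large", recent_errors=0)
```

Versions are compared as integer tuples rather than as strings so that, for example, `1.10.1` is correctly treated as newer than `1.9.0`.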
At the update step of the full rebuild process, any pending updates for the host software and/or firmware are installed. In some embodiments, in evaluating whether the updates of the update step can be skipped for a host, the rebuild strategy process may look at current software and/or firmware versions on the host (or software and/or firmware versions of software that is to be installed on the host, if the software install step is to be performed) to determine if there are any pending critical updates or necessary updates to support the type of slots that are to be implemented on the host during the rebuild process. The rebuild strategy process may decide to skip any updates that are not critical or necessary for the rebuild.
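The update filtering described above can be sketched as follows. The `PendingUpdate` record and its fields are hypothetical; the sketch assumes each pending update is already labeled as critical or not and lists the slot types that require it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class PendingUpdate:
    name: str
    critical: bool
    required_for_slot_types: FrozenSet[str] = frozenset()


def updates_to_apply(pending: List[PendingUpdate],
                     target_slot_type: str) -> List[str]:
    """Keep only updates that are critical or needed for the target slots."""
    return [
        u.name for u in pending
        if u.critical or target_slot_type in u.required_for_slot_types
    ]


pending = [
    PendingUpdate("firmware-security-fix", critical=True),
    PendingUpdate("vmm-large-slot-support", critical=False,
                  required_for_slot_types=frozenset({"large"})),
    PendingUpdate("agent-cosmetic", critical=False),
]
```

With these example values, `updates_to_apply(pending, "large")` keeps the first two updates and skips the third. If the returned list is empty, the update step as a whole can be skipped.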
At the reboot step of the full rebuild process, the host is rebooted to finalize and verify the installation. In some embodiments, in evaluating whether the reboot step can be skipped for a host, the rebuild strategy process may examine what it has determined is to be performed or is to be skipped in the rebuild strategy for the host to determine if the reboot can be skipped. For example, some firmware updates, software installs, and software updates may require a reboot, while others may not.
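The evaluation of the software installation, update, and reboot steps described above can be sketched as follows. This is a minimal illustration only; the `HostStatus` and `PendingUpdate` types, their field names, and the `install_needs_reboot` parameter are hypothetical and are not defined in this description.

```python
from dataclasses import dataclass, field


@dataclass
class PendingUpdate:
    # Hypothetical record: "required" means critical, or necessary to
    # support the target slot type.
    required: bool
    needs_reboot: bool


@dataclass
class HostStatus:
    # Hypothetical summary of the status/health information for a host.
    software_supports_target: bool
    software_up_to_date: bool
    recent_software_errors: int
    pending_updates: list = field(default_factory=list)


def late_steps_to_perform(status, install_needs_reboot=True):
    """Return which of the last three rebuild steps should be performed."""
    steps = []

    # Software install can be skipped only if the installed software is
    # current, supports the target slot type, and has run without errors.
    software_ok = (
        status.software_supports_target
        and status.software_up_to_date
        and status.recent_software_errors == 0
    )
    if not software_ok:
        steps.append("software_install")

    # Only critical or necessary updates are applied; the rest are skipped.
    required = [u for u in status.pending_updates if u.required]
    if required:
        steps.append("update")

    # The reboot decision depends on what the strategy already includes.
    reboot = ("software_install" in steps and install_needs_reboot) or any(
        u.needs_reboot for u in required
    )
    if reboot:
        steps.append("reboot")
    return steps
```

In this sketch, a healthy host with only optional pending updates yields an empty list, so all three steps are skipped.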
At 1520, if the rebuild strategy process determines that one or more steps of the full rebuild procedure can be skipped, then as indicated at 1530, the rebuild strategy process may direct the rebuild agent on the host (or a host rebuild process executing on a device external to the host) to perform only the rebuild steps that were determined to be necessary, for example by providing a rebuild strategy that indicates the steps that are to be performed and/or the steps that can be skipped. At 1520, if the rebuild strategy process determines that the steps in the full rebuild procedure need to be performed and thus should not be skipped for this host, then at 1540 at least some of the rebuild criteria may be evaluated to determine if a different host should be selected for reconfiguration. For example, if a time factor indicates that slots of the target type are needed as soon as possible, but the evaluation of the host indicates that skipping one or more steps of the rebuild process for this host is above a risk threshold, then at 1540 the method may return to element 1500 to select and evaluate a different host. At 1540, if it is decided to not select another host but instead to proceed with a rebuild of the currently selected host, then as indicated at 1550 the pool management service may direct the agent on the host (or a host rebuild process executing on a device external to the host) to perform a full rebuild procedure, for example as indicated in
As indicated at 1724, the host status information for the candidate host may be evaluated to determine if a rebuild strategy can be executed for the host that meets time constraints and risk constraints. The host selection process may generate or obtain a rebuild strategy for the candidate host. For example, in some embodiments, the host status information for the candidate host may include health information for the host (e.g., how long the host has been running in the current configuration, etc.) and/or for components of the host (e.g., memory, storage, etc.); the health information for the host may be evaluated to determine which steps in a full rebuild procedure can be skipped for this host.
At 1724, after generating or obtaining a rebuild strategy for the candidate host, the host selection process may evaluate the rebuild strategy according to time constraints and risk constraints to determine if the candidate host is an acceptable candidate for rebuilding. The host selection process may be aware of how long a given rebuild strategy should take (e.g., seconds, minutes, hours). The time constraints may be obtained from the pool monitoring process, and may indicate how soon slots of the target type are needed. For example, a time constraint for a request for slots received from the pool monitoring process may indicate that the slots are needed immediately or as soon as possible, within 1 minute, within 5 minutes, within an hour, or that there is no time constraint (e.g., provide the slots whenever possible). The risk constraints may indicate levels or thresholds of risk that are acceptable in skipping one or more of the steps of the full rebuild procedure. A host may be determined as an acceptable candidate if the host can be rebuilt according to the strategy within the time constraints and with an acceptable level of risk. In some embodiments, if the current rebuild strategy does not allow the candidate host to be rebuilt within the given time constraints, then one or more steps may be eliminated from the rebuild strategy if the step(s) can be skipped with an acceptable level of risk. For example, there may be an acceptable risk threshold for skipping hardware vetting, an acceptable risk threshold for skipping disk cleaning, and so on. Thus, there may be a trade-off between risk and time; a higher level of risk may be acceptable if the slots are needed as soon as possible or immediately. Conversely, more time may be needed if a host cannot be rebuilt within the time constraints without assuming too much risk.
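The trade-off between time and risk described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed method: the step dictionaries, the numeric `skip_risk` scores, the per-step `risk_thresholds` mapping, and the lowest-risk-first elimination order are all hypothetical choices.

```python
def fit_strategy(steps, time_limit_s, risk_thresholds):
    """Trim a rebuild strategy to fit a time constraint.

    steps: list of dicts with hypothetical keys "name", "duration_s",
           and "skip_risk" (estimated risk of skipping the step).
    time_limit_s: seconds until slots are needed, or None for no constraint.
    risk_thresholds: step name -> highest acceptable skip risk; a step
           without an entry is treated as not skippable.
    Returns the names of the steps to perform, or None if the host is not
    an acceptable candidate.
    """
    total = sum(s["duration_s"] for s in steps)
    skipped = set()

    if time_limit_s is not None:
        # Steps whose skip risk is within the acceptable threshold,
        # considered from lowest risk to highest.
        skippable = sorted(
            (s for s in steps
             if s["skip_risk"] <= risk_thresholds.get(s["name"], -1.0)),
            key=lambda s: s["skip_risk"],
        )
        for s in skippable:
            if total <= time_limit_s:
                break
            skipped.add(s["name"])
            total -= s["duration_s"]
        if total > time_limit_s:
            return None  # cannot meet the deadline at acceptable risk

    return [s["name"] for s in steps if s["name"] not in skipped]
```

A return value of `None` corresponds to the case in which either more time must be allowed or a different candidate host must be evaluated.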
The risk constraints may be relaxed or tightened according to the time constraints. For example, if the slots are needed immediately or as soon as possible, a higher level of risk may be acceptable in skipping one or more of the steps. As another example, if the slots are needed whenever possible with no time constraint, then only a low level of risk, or no risk, may be acceptable in skipping one or more of the steps, and a full rebuild procedure may thus need to be performed for this host. Other factors may be considered when determining an acceptable level of risk. For example, security concerns for a client's data stored on the candidate host may indicate that the host's disks should be wiped and repartitioned. As another example, the health and history of particular host hardware components (e.g., memory, disks, processors, etc.) may indicate that skipping vetting for the component(s) carries a high level of risk, and thus that the hardware should be vetted (e.g., if a hardware component has generated a significant number of errors over a time period), or that vetting can be safely skipped with very low risk (e.g., if the hardware components have been performing well for a time period without a significant number of errors). As another example, if the host status information indicates that software and/or firmware on the host device does not have any pending critical updates, then a software and/or firmware update step of the full rebuild procedure may be safely skipped for this host with low risk. However, if the host status information indicates that software and/or firmware on the host does have pending critical updates, then a software and/or firmware update step of the full rebuild procedure should be performed for this host.
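One way to picture how the acceptable risk may vary with urgency, and how other factors may force particular steps, is the sketch below. The linear mapping, the numeric defaults (`base`, `ceiling`, `relaxed_after_s`), and the host dictionary keys are assumptions made for illustration; the description does not specify a formula or a data format.

```python
def acceptable_risk(time_limit_s, base=0.05, ceiling=0.5, relaxed_after_s=3600):
    """Map a time constraint to an acceptable skip-risk level.

    None means "whenever possible": no risk is accepted, so a full
    rebuild is implied. A limit of 0 means "immediately": the highest
    risk level (ceiling) is accepted. All numbers are hypothetical.
    """
    if time_limit_s is None:
        return 0.0
    if time_limit_s >= relaxed_after_s:
        return base
    urgency = 1.0 - time_limit_s / relaxed_after_s
    return base + (ceiling - base) * urgency


def forced_steps(host):
    """Steps that should be performed regardless of urgency.

    host: dict with hypothetical keys describing the other factors
    named in the text (data security, component history, updates).
    """
    forced = set()
    if host.get("client_data_security_concern"):
        forced.add("disk_wipe")
    if host.get("component_errors", 0) > host.get("error_limit", 0):
        forced.add("hardware_vetting")
    if host.get("pending_critical_updates", 0) > 0:
        forced.add("update")
    return forced
```

Under these assumed defaults, a request with no time constraint accepts no risk, while an immediate request accepts the highest configured level.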
At 1726, if the rebuild strategy for the host can be performed within the time constraints at an acceptable level of risk, then the method may proceed to element 1740 of
Note that the pool monitoring process may request that more than one host be added to the target pool, or may request some number of slots that would require more than one host to be added to the target pool. In this case, the method of
As indicated at 1800, a rebuild agent on a host (or alternatively a host rebuild process executing on a device external to the host) receives and validates a rebuild strategy that was determined by a pool management service as described herein. At 1810, depending on the rebuild type, the agent executes a rebuild procedure for the host. If the rebuild type is resize, no reboot, then as indicated at 1820 the rebuild procedure resizes the host's slots, but does not reboot the host. If the rebuild type is resize, reboot, then as indicated at 1830 the rebuild procedure resizes the host's slots and reboots the host. If the rebuild type is resize, reboot with updates, then as indicated at 1840 the rebuild procedure resizes the host's slots, applies the necessary updates to software and/or firmware of the host, and reboots the host. If the rebuild type is full rebuild, then as indicated at 1850 a full rebuild procedure, for example as shown in
Example Provider Network Environment
This section describes example provider network environments in which embodiments of the methods and apparatus described in reference to
Conventionally, the provider network 4000, via the virtualization services 4010, may allow a client of the service provider (e.g., a client that operates client network 4050A) to dynamically associate at least some public IP addresses 4014 assigned or allocated to the client with particular resource instances 4012 assigned to the client. The provider network 4000 may also allow the client to remap a public IP address 4014, previously mapped to one virtualized computing resource instance 4012 allocated to the client, to another virtualized computing resource instance 4012 that is also allocated to the client. Using the virtualized computing resource instances 4012 and public IP addresses 4014 provided by the service provider, a client of the service provider such as the operator of client network 4050A may, for example, implement client-specific applications and present the client's applications on an intermediate network 4040, such as the Internet. Other network entities 4020 on the intermediate network 4040 may then generate traffic to a destination public IP address 4014 published by the client network 4050A; the traffic is routed to the service provider data center, and at the data center is routed, via a network substrate, to the private IP address 4016 of the virtualized computing resource instance 4012 currently mapped to the destination public IP address 4014. Similarly, response traffic from the virtualized computing resource instance 4012 may be routed via the network substrate back onto the intermediate network 4040 to the source entity 4020.
Private IP addresses, as used herein, refer to the internal network addresses of resource instances in a provider network. Private IP addresses are only routable within the provider network. Network traffic originating outside the provider network is not directly routed to private IP addresses; instead, the traffic uses public IP addresses that are mapped to the resource instances. The provider network may include networking devices or appliances that provide network address translation (NAT) or similar functionality to perform the mapping from public IP addresses to private IP addresses and vice versa.
Public IP addresses, as used herein, are Internet routable network addresses that are assigned to resource instances, either by the service provider or by the client. Traffic routed to a public IP address is translated, for example via 1:1 network address translation (NAT), and forwarded to the respective private IP address of a resource instance.
Some public IP addresses may be assigned by the provider network infrastructure to particular resource instances; these public IP addresses may be referred to as standard public IP addresses, or simply standard IP addresses. In some embodiments, the mapping of a standard IP address to a private IP address of a resource instance is the default launch configuration for all resource instance types.
At least some public IP addresses may be allocated to or obtained by clients of the provider network 4000; a client may then assign their allocated public IP addresses to particular resource instances allocated to the client. These public IP addresses may be referred to as client public IP addresses, or simply client IP addresses. Instead of being assigned by the provider network 4000 to resource instances as in the case of standard IP addresses, client IP addresses may be assigned to resource instances by the clients, for example via an API provided by the service provider. Unlike standard IP addresses, client IP addresses are allocated to client accounts and can be remapped to other resource instances by the respective clients as necessary or desired. A client IP address is associated with a client's account, not a particular resource instance, and the client controls that IP address until the client chooses to release it. Unlike conventional static IP addresses, client IP addresses allow the client to mask resource instance or availability zone failures by remapping the client's public IP addresses to any resource instance associated with the client's account. The client IP addresses, for example, enable a client to engineer around problems with the client's resource instances or software by remapping client IP addresses to replacement resource instances.
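The account-level ownership and remapping of client IP addresses can be illustrated with the following sketch. The `ClientAddressMap` class and its method names are hypothetical and do not correspond to any actual provider API; the sketch also omits the check that the target private address belongs to a resource instance of the same account.

```python
class ClientAddressMap:
    """Toy model of client public IP addresses (hypothetical, not a real API)."""

    def __init__(self):
        self._owner = {}   # public IP -> client account
        self._target = {}  # public IP -> private IP of a resource instance

    def allocate(self, public_ip, account):
        # The address is associated with the account, not with an instance.
        if public_ip in self._owner:
            raise ValueError("address already allocated")
        self._owner[public_ip] = account

    def map(self, public_ip, account, private_ip):
        # Mapping and remapping are the same operation: the owning client
        # may point the address at any of its instances at any time.
        if self._owner.get(public_ip) != account:
            raise PermissionError("address is not allocated to this account")
        self._target[public_ip] = private_ip

    def translate(self, public_ip):
        # 1:1 translation from public to private address (None if unmapped).
        return self._target.get(public_ip)

    def release(self, public_ip, account):
        # The client controls the address until it chooses to release it.
        if self._owner.get(public_ip) != account:
            raise PermissionError("address is not allocated to this account")
        del self._owner[public_ip]
        self._target.pop(public_ip, None)
```

For example, after a resource instance fails, the client could call `map` again with the private address of a replacement instance, and traffic to the same public address would then be translated to the replacement.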
In some embodiments, the IP tunneling technology may map IP overlay addresses (public IP addresses) to substrate IP addresses (private IP addresses), encapsulate the packets in a tunnel between the two namespaces, and deliver the packet to the correct endpoint via the tunnel, where the encapsulation is stripped from the packet. In
Referring to
In addition, a network such as the provider data center 4100 network (which is sometimes referred to as an autonomous system (AS)) may use the mapping service technology, IP tunneling technology, and routing service technology to route packets from the VMs 4124 to Internet destinations, and from Internet sources to the VMs 4124. Note that an external gateway protocol (EGP) or border gateway protocol (BGP) is typically used for Internet routing between sources and destinations on the Internet.
The data center 4100 network may implement IP tunneling technology, mapping service technology, and a routing service technology to route traffic to and from virtualized resources, for example to route packets from the VMs 4124 on hosts 4120 in data center 4100 to Internet destinations, and from Internet sources to the VMs 4124. Internet sources and destinations may, for example, include computing systems 4170 connected to the intermediate network 4140 and computing systems 4152 connected to local networks 4150 that connect to the intermediate network 4140 (e.g., via edge router(s) 4114 that connect the network 4150 to Internet transit providers). The provider data center 4100 network may also route packets between resources in data center 4100, for example from a VM 4124 on a host 4120 in data center 4100 to other VMs 4124 on the same host or on other hosts 4120 in data center 4100.
A service provider that provides data center 4100 may also provide additional data center(s) 4160 that include hardware virtualization technology similar to data center 4100 and that may also be connected to intermediate network 4140. Packets may be forwarded from data center 4100 to other data centers 4160, for example from a VM 4124 on a host 4120 in data center 4100 to another VM on another host in another, similar data center 4160, and vice versa.
While the above describes hardware virtualization technology that enables multiple operating systems to run concurrently on host computers as virtual machines (VMs) on the hosts, where the VMs may be instantiated on slots on hosts that are rented or leased to clients of the network provider, the hardware virtualization technology may also be used to provide other computing resources, for example storage resources 4118, as virtualized resources to clients of a network provider in a similar manner.
Provider network 4200 may provide a client network 4250, for example coupled to intermediate network 4240 via local network 4256, the ability to implement virtual computing systems 4292 via hardware virtualization service 4220 coupled to intermediate network 4240 and to provider network 4200. In some embodiments, hardware virtualization service 4220 may provide one or more APIs 4202, for example a web services interface, via which a client network 4250 may access functionality provided by the hardware virtualization service 4220, for example via a console 4294. In some embodiments, at the provider network 4200, each virtual computing system 4292 at client network 4250 may correspond to a computation resource 4224 that is leased, rented, or otherwise provided to client network 4250.
From an instance of a virtual computing system 4292 and/or another client device 4290 or console 4294, the client may access the functionality of storage virtualization service 4210, for example via one or more APIs 4202, to access data from and store data to a virtual data store 4216 provided by the provider network 4200. In some embodiments, a virtualized data store gateway (not shown) may be provided at the client network 4250 that may locally cache at least some data, for example frequently accessed or critical data, and that may communicate with virtualized data store service 4210 via one or more communications channels to upload new or modified data from a local cache so that the primary store of data (virtualized data store 4216) is maintained. In some embodiments, a user, via a virtual computing system 4292 and/or on another client device 4290, may mount and access virtual data store 4216 volumes, which appear to the user as local virtualized storage 4298.
While not shown in
A client's virtual network 4360 may be connected to a client network 4350 via a private communications channel 4342. A private communications channel 4342 may, for example, be a tunnel implemented according to a network tunneling technology or some other technology over an intermediate network 4340. The intermediate network may, for example, be a shared network or a public network such as the Internet. Alternatively, a private communications channel 4342 may be implemented over a direct, dedicated connection between virtual network 4360 and client network 4350.
A public network may be broadly defined as a network that provides open access to and interconnectivity among a plurality of entities. The Internet, or World Wide Web (WWW), is an example of a public network. A shared network may be broadly defined as a network to which access is limited to two or more entities, in contrast to a public network to which access is not generally limited. A shared network may, for example, include one or more local area networks (LANs) and/or data center networks, or two or more LANs or data center networks that are interconnected to form a wide area network (WAN). Examples of shared networks may include, but are not limited to, corporate networks and other enterprise networks. A shared network may be anywhere in scope from a network that covers a local area to a global network. Note that a shared network may share at least some network infrastructure with a public network, and that a shared network may be coupled to one or more other networks, which may include a public network, with controlled access between the other network(s) and the shared network. A shared network may also be viewed as a private network, in contrast to a public network such as the Internet. In some embodiments, either a shared network or a public network may serve as an intermediate network between a provider network and a client network.
To establish a virtual network 4360 for a client on provider network 4300, one or more resource instances (e.g., VMs 4324A and 4324B and storage 4318A and 4318B) may be allocated to the virtual network 4360. Note that other resource instances (e.g., storage 4318C and VMs 4324C) may remain available on the provider network 4300 for other client usage. A range of public IP addresses may also be allocated to the virtual network 4360. In addition, one or more networking devices (routers, switches, etc.) of the provider network 4300 may be allocated to the virtual network 4360. A private communications channel 4342 may be established between a private gateway 4362 at virtual network 4360 and a gateway 4356 at client network 4350.
In some embodiments, in addition to, or instead of, a private gateway 4362, virtual network 4360 may include a public gateway 4364 that enables resources within virtual network 4360 to communicate directly with entities (e.g., network entity 4344) via intermediate network 4340, and vice versa, instead of or in addition to via private communications channel 4342.
Virtual network 4360 may be, but is not necessarily, subdivided into two or more subnetworks, or subnets, 4370. For example, in implementations that include both a private gateway 4362 and a public gateway 4364, a virtual network 4360 may be subdivided into a subnet 4370A that includes resources (VMs 4324A and storage 4318A, in this example) reachable through private gateway 4362, and a subnet 4370B that includes resources (VMs 4324B and storage 4318B, in this example) reachable through public gateway 4364.
The client may assign particular client public IP addresses to particular resource instances in virtual network 4360. A network entity 4344 on intermediate network 4340 may then send traffic to a public IP address published by the client; the traffic is routed, by the provider network 4300, to the associated resource instance. Return traffic from the resource instance is routed, by the provider network 4300, back to the network entity 4344 over intermediate network 4340. Note that routing traffic between a resource instance and a network entity 4344 may require network address translation to translate between the public IP address and the private IP address of the resource instance.
Some embodiments may allow a client to remap public IP addresses in a client's virtual network 4360 as illustrated in
While
Illustrative System
In some embodiments, a system that implements a portion or all of the methods and apparatus for reconfiguring host devices in provider network environments as described herein may include a general-purpose computer system that includes or is configured to access one or more computer-accessible media, such as computer system 5000 illustrated in
In various embodiments, computer system 5000 may be a uniprocessor system including one processor 5010, or a multiprocessor system including several processors 5010 (e.g., two, four, eight, or another suitable number). Processors 5010 may be any suitable processors capable of executing instructions. For example, in various embodiments, processors 5010 may be general-purpose or embedded processors implementing any of a variety of instruction set architectures (ISAs), such as the x86, PowerPC, SPARC, or MIPS ISAs, or any other suitable ISA. In multiprocessor systems, each of processors 5010 may commonly, but not necessarily, implement the same ISA.
System memory 5020 may be configured to store instructions and data accessible by processor(s) 5010. In various embodiments, system memory 5020 may be implemented using any suitable memory technology, such as static random access memory (SRAM), synchronous dynamic RAM (SDRAM), nonvolatile/Flash-type memory, or any other type of memory. In the illustrated embodiment, program instructions and data implementing one or more desired functions, such as those methods, techniques, and data described above for reconfiguring host devices in provider network environments, are shown stored within system memory 5020 as code 5025 and data 5026.
In one embodiment, I/O interface 5030 may be configured to coordinate I/O traffic between processor 5010, system memory 5020, and any peripheral devices in the device, including network interface 5040 or other peripheral interfaces. In some embodiments, I/O interface 5030 may perform any necessary protocol, timing or other data transformations to convert data signals from one component (e.g., system memory 5020) into a format suitable for use by another component (e.g., processor 5010). In some embodiments, I/O interface 5030 may include support for devices attached through various types of peripheral buses, such as a variant of the Peripheral Component Interconnect (PCI) bus standard or the Universal Serial Bus (USB) standard, for example. In some embodiments, the function of I/O interface 5030 may be split into two or more separate components, such as a north bridge and a south bridge, for example. Also, in some embodiments some or all of the functionality of I/O interface 5030, such as an interface to system memory 5020, may be incorporated directly into processor 5010.
Network interface 5040 may be configured to allow data to be exchanged between computer system 5000 and other devices 5060 attached to a network or networks 5050, such as other computer systems or devices as illustrated in
In some embodiments, system memory 5020 may be one embodiment of a computer-accessible medium configured to store program instructions and data as described above for
Various embodiments may further include receiving, sending or storing instructions and/or data implemented in accordance with the foregoing description upon a computer-accessible medium. Generally speaking, a computer-accessible medium may include storage media or memory media such as magnetic or optical media, e.g., disk or DVD/CD-ROM, volatile or non-volatile media such as RAM (e.g., SDRAM, DDR, RDRAM, SRAM, etc.), ROM, etc., as well as transmission media or signals such as electrical, electromagnetic, or digital signals, conveyed via a communication medium such as a network and/or a wireless link.
The various methods as illustrated in the Figures and described herein represent exemplary embodiments of methods. The methods may be implemented in software, hardware, or a combination thereof. The order of the methods may be changed, and various elements may be added, reordered, combined, omitted, modified, etc.
Various modifications and changes may be made as would be obvious to a person skilled in the art having the benefit of this disclosure. It is intended to embrace all such modifications and changes and, accordingly, the above description is to be regarded in an illustrative rather than a restrictive sense.
Gupta, Diwakar, Jagannathan, Srinivasan, Carson, Duane Todd, Mullen, Jonathan Welter
Executed on | Assignor | Assignee | Conveyance | Frame | Reel | Doc |
Aug 02 2019 | Amazon Technologies, Inc. | (assignment on the face of the patent) | / |