
Understanding initial assessment data requirements

Data collection can take a significant amount of time and can easily become a blocker when it is unclear what data is needed and when it is needed. The key is to strike a balance between collecting too little and too much data for the outcomes of this stage. To focus on the data and the fidelity level required for this early stage of portfolio assessment, adopt an iterative approach to data collection.

Data sources and data requirements

The first step is to identify your sources of data. Start by identifying the key stakeholders within your organization who can fulfill the data requirements. These are typically members of the service management, operations, capacity planning, monitoring, and support teams, and the application owners. Establish working sessions with members of these groups. Communicate the data requirements, and obtain a list of tools and existing documentation that can provide the data.

To guide these conversations, use the following set of questions:

  • How accurate and up to date is the current infrastructure and application inventory? For example, for the company configuration management database (CMDB), do we already know where the gaps are?

  • Do we have active tools and processes that keep the CMDB (or equivalent) updated? If so, how frequently is it updated? What is the latest refresh date?

  • Does the current inventory, such as the CMDB, contain application-to-infrastructure mapping? Is each infrastructure asset associated with an application? Is each application mapped to infrastructure?

  • Does the inventory contain a catalog of licenses and licensing agreements for each product?

  • Does the inventory contain dependency data? Note the existence of communication data such as server to server, application to application, application or server to database.

  • What other tools are available in the environment that can provide application and infrastructure information? Note the existence of performance, monitoring, and management tools that can be used as a source of data.

  • What are the different locations, such as data centers, that host our applications and infrastructure?

After these questions have been answered, list your identified sources of data. Then assign a level of fidelity, or level of trust, to each of them. Data validated recently (within 30 days) from active programmatic sources, such as tools, has the highest level of fidelity. Static data is considered lower fidelity and less trusted. Examples of static data include documents, workbooks, manually updated CMDBs, and any other dataset that is not programmatically maintained or whose last refresh date is older than 60 days.

The data fidelity levels in the following table are provided as examples. We recommend that you assess the requirements of your organization in terms of maximum tolerance to assumptions and associated risk to determine what is an appropriate level of fidelity. In the table, institutional knowledge refers to any information about applications and infrastructure that is not documented.

| Data sources | Fidelity level | Portfolio coverage | Comments |
| --- | --- | --- | --- |
| Institutional knowledge | Low: up to 25% accurate data; 75% assumed values or data older than 150 days | Low | Scarce, focused on critical applications |
| Knowledge base | Medium-low: 35-40% accurate data; 60-65% assumed values or data 120-150 days old | Medium | Manually maintained, inconsistent levels of detail |
| CMDB | Medium: ~50% accurate data; ~50% assumed values or data 90-120 days old | Medium | Contains data from mixed sources, several data gaps |
| VMware vCenter exports | Medium-high: 75-80% accurate data; 20-25% assumed values or data 60-90 days old | High | Covers 90% of the virtualized estate |
| Application performance monitoring | High: mostly accurate data; ~5% assumed values or data 0-60 days old | Low | Limited to critical production systems (covers 15% of the application portfolio) |
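To illustrate how such a rubric might be applied, the following Python sketch assigns a fidelity label to a data source based on its last refresh date and the estimated share of validated (non-assumed) values. The bands and names are assumptions loosely derived from the example table above, not a prescribed formula; tune them to your organization's tolerance for assumptions and risk.

```python
from datetime import date

# Illustrative bands loosely derived from the example table above; adjust to
# your organization's tolerance for assumptions and associated risk.
FIDELITY_BANDS = [
    # (maximum age in days, minimum share of accurate data, fidelity label)
    (60, 0.95, "High"),
    (90, 0.75, "Medium-high"),
    (120, 0.50, "Medium"),
    (150, 0.35, "Medium-low"),
]

def classify_fidelity(last_refresh: date, accurate_share: float,
                      as_of: date | None = None) -> str:
    """Return a fidelity label for a data source.

    last_refresh   -- date the source was last validated or refreshed
    accurate_share -- estimated fraction of validated (non-assumed) values
    """
    as_of = as_of or date.today()
    age_days = (as_of - last_refresh).days
    for max_age, min_accuracy, label in FIDELITY_BANDS:
        if age_days <= max_age and accurate_share >= min_accuracy:
            return label
    return "Low"

# Example: a manually updated CMDB refreshed 100 days ago with ~50% validated data
print(classify_fidelity(date(2024, 1, 10), 0.50, as_of=date(2024, 4, 19)))  # "Medium"
```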

The following tables specify the required and optional data attributes for each asset class (applications, infrastructure, networks, and migration), the specific activity (inventory or business case), and the recommended data fidelity for this stage of assessment. The tables use the following abbreviations:

  • R, for required

  • (D), for directional business case, required for total cost of ownership (TCO) comparisons and directional business cases

  • (F), for full directional business case, required for TCO comparison and directional business cases that include migration and modernization costs

  • O, for optional

  • N/A, for not applicable

Applications

| Attribute name | Description | Inventory and prioritization | Business case | Recommended fidelity level (minimum) |
| --- | --- | --- | --- | --- |
| Unique identifier | For example, application ID. Typically available in existing CMDBs or other internal inventories and control systems. Consider creating unique IDs whenever these are not defined in your organization. | R | R (D) | High |
| Application name | Name by which this application is known to your organization. Include the commercial off-the-shelf (COTS) vendor and product name when applicable. | R | R (D) | Medium-high |
| Is COTS? | Yes or No, to denote whether this is a commercial application or an internal development | R | R (D) | Medium-high |
| COTS product and version | Commercial software product name and version | R | R (D) | Medium |
| Description | Primary application function and context | R | O | Medium |
| Criticality | For example, strategic or revenue-generating application, or supporting a critical function | R | O | Medium-high |
| Type | For example, database, customer relationship management (CRM), web application, multimedia, IT shared service | R | O | Medium |
| Environment | For example, production, pre-production, development, test, sandbox | R | R (D) | Medium-high |
| Compliance and regulatory | Frameworks applicable to the workload (for example, HIPAA, SOX, PCI-DSS, ISO, SOC, FedRAMP) and regulatory requirements | R | R (D) | Medium-high |
| Dependencies | Upstream and downstream dependencies on internal and external applications or services, plus non-technical dependencies such as operational elements (for example, maintenance cycles) | O | O | Medium-low |
| Infrastructure mapping | Mapping to the physical and/or virtual assets that make up the application | O | O | Medium |
| License | Commodity software license type (for example, Microsoft SQL Server Enterprise) | O | R | Medium-high |
| Cost | Costs for software license, software operations, and maintenance | N/A | O | Medium |
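To make these attributes concrete, the following sketch shows one possible shape for an application inventory record. The field names are illustrative assumptions, not a prescribed schema; map them to whatever your CMDB or inventory tooling actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class ApplicationRecord:
    """Illustrative application inventory record; field names are examples only."""
    app_id: str                        # Unique identifier (required)
    name: str                          # Application name (required)
    is_cots: bool                      # Commercial product or internal development
    environment: str                   # production, pre-production, development, ...
    compliance: list[str] = field(default_factory=list)  # e.g., ["PCI-DSS"]
    cots_product: str = ""             # Commercial product name and version
    description: str = ""              # Primary function and context
    criticality: str = ""              # e.g., "revenue-generating"
    app_type: str = ""                 # e.g., "CRM", "web application"
    dependencies: list[str] = field(default_factory=list)    # optional at this stage
    infrastructure: list[str] = field(default_factory=list)  # mapped server IDs (optional)
    license: str = ""                  # required for the business case
    annual_cost: float | None = None   # optional, business case only

# Example record with only the attributes required for inventory and prioritization
portal = ApplicationRecord(
    app_id="APP-0042",
    name="Customer Portal",
    is_cots=False,
    environment="production",
    compliance=["PCI-DSS"],
)
```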

Infrastructure

| Attribute name | Description | Inventory and prioritization | Business case | Recommended fidelity level (minimum) |
| --- | --- | --- | --- | --- |
| Unique identifier | For example, server ID. Typically available in existing CMDBs or other internal inventories and control systems. Consider creating unique IDs whenever these are not defined in your organization. | R | R | High |
| Network name | Asset name in the network (for example, hostname) | R | O | Medium-high |
| DNS name (fully qualified domain name, or FQDN) | DNS name | O | O | Medium |
| IP address and netmask | Internal and/or public IP addresses | R | O | Medium-high |
| Asset type | Physical or virtual server, hypervisor, container, device, database instance, and so on | R | R | Medium-high |
| Product name | Commercial vendor and product name (for example, VMware ESXi, IBM Power Systems, Exadata) | R | R | Medium |
| Operating system | For example, RHEL 8, Windows Server 2019, AIX 6.1 | R | R | Medium-high |
| Configuration | Allocated CPU, number of cores, threads per core, total memory, storage, network cards | R | R | Medium-high |
| Utilization | CPU, memory, and storage peak and average; database instance throughput | R | O | Medium-high |
| License | Commodity license type (for example, RHEL Standard) | R | R | Medium |
| Is shared infrastructure? | Yes or No, to denote infrastructure that provides shared services such as authentication providers, monitoring systems, backup services, and similar services | R | R (D) | Medium |
| Application mapping | Applications or application components that run on this infrastructure | O | O | Medium |
| Cost | Fully loaded costs for bare-metal servers, including hardware, maintenance, operations, storage (SAN, NAS, object), operating system license, share of rack space, and data center overheads | N/A | O | Medium-high |

Networks

| Attribute name | Description | Inventory and prioritization | Business case | Recommended fidelity level (minimum) |
| --- | --- | --- | --- | --- |
| Size of pipe (Mb/s), redundancy (Y/N) | Current WAN link specifications (for example, 1000 Mb/s redundant) | O | R | Medium |
| Link utilization | Peak and average utilization, outbound data transfer (GB/month) | O | R | Medium |
| Latency (ms) | Current latency between connected locations | O | O | Medium |
| Cost | Current cost per month | N/A | O | Medium |

Migration

| Attribute name | Description | Inventory and prioritization | Business case | Recommended fidelity level (minimum) |
| --- | --- | --- | --- | --- |
| Rehost | Customer and partner effort for each workload (person-days), customer and partner cost rates per day, tool cost, number of workloads | N/A | R (F) | Medium-high |
| Replatform | Customer and partner effort for each workload (person-days), customer and partner cost rates per day, number of workloads | N/A | R (F) | Medium-high |
| Refactor | Customer and partner effort for each workload (person-days), customer and partner cost rates per day, number of workloads | N/A | O | Medium-high |
| Retire | Number of servers, average decommission cost | N/A | O | Medium-high |
| Landing zone | Re-use existing (Y/N), list of AWS Regions needed, cost | N/A | R (F) | Medium-high |
| People and change | Number of staff to train in cloud operations and development, cost of training per person, cost of training time per person | N/A | R (F) | Medium-high |
| Duration | Duration of in-scope workload migration (months) | O | R (F) | Medium-high |
| Parallel cost | Time frame and rate at which as-is costs can be removed during migration | N/A | O | Medium-high |
| Parallel cost | Time frame and rate at which AWS products and services, and other infrastructure costs, are introduced during migration | N/A | O | Medium-high |

Evaluating the need for discovery tooling

Does your organization need discovery tooling? Portfolio assessment requires high-confidence, up-to-date data about applications and infrastructure. Initial stages of portfolio assessment can use assumptions to fill data gaps.

However, as progress is made, high-fidelity data enables the creation of successful migration plans and the correct estimation of target infrastructure to reduce cost and maximize benefits. It also reduces risk by enabling implementations that consider dependencies and avoid migration pitfalls. The primary use case for discovery tooling in cloud migration programs is to reduce risk and increase confidence levels in data through the following:

  • Automated or programmatic data collection, resulting in validated, highly trusted data

  • Acceleration of the rate at which data is obtained, improving project speed and reducing costs

  • Increased levels of data completeness, including communication data and dependencies not typically available in CMDBs

  • Obtaining insights such as automated application identification, TCO analysis, projected run rates, and optimization recommendations

  • High-confidence migration wave planning

When there is uncertainty about whether systems exist in a given location, most discovery tools can scan network subnets and discover those systems that respond to ping or Simple Network Management Protocol (SNMP) requests. Note that not all network or systems configurations will allow ping or SNMP traffic. Discuss these options with your network and technical teams.
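As a minimal illustration of this kind of sweep, the following Python sketch sends a single ICMP echo request to every address in a subnet and reports the responders. The subnet is hypothetical, the flags assume a Linux-style ping command, and real discovery tools combine this with SNMP and other methods; agree on any scanning activity with your network and security teams first.

```python
import ipaddress
import subprocess

def ping_sweep(cidr: str) -> list[str]:
    """Return the hosts in a subnet that answer a single ICMP echo request.

    Flags assume a Linux-style ping (-c count, -W timeout in seconds);
    adjust them for other platforms.
    """
    responders = []
    for host in ipaddress.ip_network(cidr, strict=False).hosts():
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "1", str(host)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode == 0:
            responders.append(str(host))
    return responders

# Hypothetical subnet; scan only ranges agreed with your network team.
print(ping_sweep("10.0.42.0/28"))
```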

Further stages of application portfolio assessment and migration heavily rely on accurate dependency-mapping information. Dependency mapping provides an understanding of the infrastructure and configuration that will be required in AWS (such as security groups, instance types, account placement, and network routing). It also helps with grouping applications that must move at the same time (such as applications that must communicate over low latency networks). In addition, dependency mapping provides information for evolving the business case.
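One common way to turn communication data into candidate move groups is to treat observed server-to-server (or application-to-application) connections as an undirected graph and take its connected components. The following sketch shows the idea with illustrative host names; it is a simplification, because real wave planning also weighs business constraints, latency sensitivity, and shared services.

```python
from collections import defaultdict

def move_groups(dependencies: list[tuple[str, str]]) -> list[set[str]]:
    """Group assets that communicate, directly or transitively, into move groups.

    dependencies -- observed (source, target) communication pairs, for example
                    from a discovery tool's dependency-mapping export.
    """
    graph = defaultdict(set)
    for src, dst in dependencies:
        graph[src].add(dst)
        graph[dst].add(src)

    groups, seen = [], set()
    for node in graph:
        if node in seen:
            continue
        group, stack = set(), [node]   # depth-first traversal of one component
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(graph[current] - group)
        seen |= group
        groups.append(group)
    return groups

# Illustrative data: the CRM servers and a reporting service form one move group.
deps = [("crm-web-01", "crm-db-01"), ("report-svc", "crm-db-01"), ("hr-app-01", "hr-db-01")]
print(move_groups(deps))  # [{crm-web-01, crm-db-01, report-svc}, {hr-app-01, hr-db-01}]
```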

When deciding on a discovery tool, it is important to consider all stages of the assessment process and to anticipate data requirements. Data gaps have the potential to become blockers, so it is key to anticipate them by analyzing future data requirements and data sources. Experience in the field shows that most stalled migration projects have a limited dataset in which the applications in scope, the associated infrastructure, and their dependencies are not clearly identified. This lack of clarity can lead to incorrect metrics, poor decisions, and delays. Obtaining up-to-date data is the first step toward a successful migration project.

How to select a discovery tool?

Several discovery tools on the market provide different features and capabilities. Consider your requirements, and decide on the most appropriate option for your organization. The most common factors to weigh when deciding on a discovery tool for migrations are the following:

Security

  • What is the authentication method to access the tool data repository or analytics engines?

  • Who can access the data, and what are the security controls to access the tool?

  • How does the tool collect data? Does it need dedicated credentials?

  • What credentials and access level does the tool need to access my systems and obtain data?

  • How is data transferred between the tool components?

  • Does the tool support data encryption at rest and in transit?

  • Is data centralized in a single component inside or outside of my environment?

  • What are the network and firewall requirements?

Ensure that security teams are involved in early conversations about discovery tooling.

Data sovereignty

  • Where is the data stored and processed?

  • Does the tool use a software as a service (SaaS) model?

  • Can it retain all data within the boundaries of my environment?

  • Can data be screened before it leaves the boundaries of my organization?

Consider your organization's needs in terms of data residency requirements.

Architecture

  • What infrastructure is required and what are the different components?

  • Is more than one architecture available?

  • Does the tool support installing components in air-gapped security zones?

Performance

  • What is the impact of data collection on my systems?

Compatibility and scope

  • Does the tool support all or most of my products and versions? Review the tool documentation to verify supported platforms against the current information about your scope.

  • Are most of my operating systems supported for data collection? If you don't know your operating system versions, try to narrow the list of discovery tools to those with the widest range of supported systems.

Collection methods

  • Does the tool require installing an agent on each targeted system?

  • Does it support agentless deployments?

  • Do agent-based and agentless collection methods provide the same features?

  • What is the collection process?

Features

  • What features are available?

  • Can it calculate total cost of ownership (TCO) and estimated AWS Cloud run rate?

  • Does it support migration planning?

  • Does it measure performance?

  • Can it recommend target AWS infrastructure?

  • Does it perform dependency mapping?

  • What level of dependency mapping does it provide?

Consider tools with strong application and infrastructure dependency-mapping functions and those that can infer applications from communication patterns.

Cost

  • What is the licensing model?

  • How much does the licensing cost?

  • Is the pricing for each server? Is it tiered pricing?

  • Are there any options with limited features that can be licensed on-demand?

Discovery tools are typically used throughout the entire lifecycle of a migration project. If your budget is limited, consider licensing the tool for at least 6 months. However, the absence of discovery tooling typically leads to higher manual effort and internal costs.

Support model

  • What levels of support are provided by default?

  • Is any support plan available?

  • What are the incident response times?

Professional services

  • Does the vendor offer professional services to analyze discovery outputs?

  • Can they cover the elements of this guide?

  • Are there any discounts or bundles for tooling + services?

Recommended features for the discovery tool

To avoid provisioning and combining data from multiple tools over time, a discovery tool should cover the following minimum features:

  • Software – The discovery tool should be able to identify running processes and installed software.

  • Dependency mapping – It should be able to collect network connection information and build inbound and outbound dependency maps of the servers and running applications. Also, the discovery tool should be able to infer applications from groups of infrastructure based on communication patterns.

  • Profile and configuration discovery – It should be able to report the infrastructure profile such as CPU family (for example, x86, PowerPC), the number of CPU cores, memory size, number of disks and size, and network interfaces.

  • Network storage discovery – It should be able to detect and profile network shares from network-attached storage (NAS).

  • Performance – It should be able to report peak and average utilization of CPU, memory, disk, and network.

  • Gap analysis – It should be able to provide insights on data quantity and fidelity (see the sketch after this list).

  • Network scanning – It should be able to scan network subnets and discover unknown infrastructure assets.

  • Reporting – It should be able to provide collection and analysis status.
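As an example of the kind of gap analysis and reporting described above, the following sketch computes how complete an exported inventory is for a handful of required attributes. The column names and file name are assumptions; align them with the attribute tables earlier in this section and with your tool's actual export format.

```python
import csv
from collections import Counter

# Attributes required for inventory and prioritization (see the tables above);
# adjust the names to match the columns of your actual export.
REQUIRED = ["unique_id", "name", "asset_type", "operating_system", "configuration"]

def completeness_report(inventory_csv: str) -> dict[str, float]:
    """Return the share of inventory rows that have a value for each required attribute."""
    filled, total = Counter(), 0
    with open(inventory_csv, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            for attr in REQUIRED:
                if row.get(attr, "").strip():
                    filled[attr] += 1
    return {attr: (filled[attr] / total if total else 0.0) for attr in REQUIRED}

# Example: flag attributes below an 80% completeness threshold as data gaps.
for attr, share in completeness_report("inventory_export.csv").items():
    status = "OK" if share >= 0.8 else "GAP"
    print(f"{attr}: {share:.0%} complete [{status}]")
```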

Additional features to consider

  • TCO analysis to provide a cost comparison between current on-premises cost and projected AWS cost.

  • Licensing analysis and optimization recommendations for Microsoft SQL Server and Oracle systems in rehost and replatform scenarios.

  • Migration strategy recommendation (for example, can the discovery tool make default migration strategy, or R type, recommendations based on the current technology?)

  • Inventory export (to CSV or a similar format)

  • Right-sizing recommendation (for example, can it map a recommended target AWS infrastructure?)

  • Dependency visualization (for example, can dependency mapping be visualized in a graphical mode?)

  • Architectural view (for example, can architectural diagrams be automatically produced?)

  • API access (for example, can it be programmatically accessed to refresh data in your CMDB? See the sketch after this list.)

  • Application prioritization (Can it assign weight or relevance to application and infrastructure attributes to create prioritization criteria for migration?)

  • Wave planning (for example, recommended groups of applications and the ability to create migration wave plans)

  • Migration cost estimation (estimation of effort to migrate)
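To illustrate the API-access item above, the following sketch pulls a server inventory from a discovery tool's REST API and upserts each record into a CMDB. The endpoints, authentication scheme, and field names are entirely hypothetical; consult the API documentation of your actual discovery tool and CMDB.

```python
import requests

# Hypothetical endpoints for illustration only.
DISCOVERY_API = "https://discovery.example.com/api/v1/servers"
CMDB_API = "https://cmdb.example.com/api/v1/ci"

def refresh_cmdb(discovery_token: str, cmdb_token: str) -> int:
    """Pull the latest server inventory from a discovery tool and upsert it into a CMDB."""
    servers = requests.get(
        DISCOVERY_API,
        headers={"Authorization": f"Bearer {discovery_token}"},
        timeout=30,
    ).json()

    updated = 0
    for server in servers:
        payload = {                      # illustrative field names
            "unique_id": server["id"],
            "hostname": server["hostname"],
            "os": server["operating_system"],
            "last_seen": server["last_seen"],
        }
        response = requests.put(
            f"{CMDB_API}/{payload['unique_id']}",
            json=payload,
            headers={"Authorization": f"Bearer {cmdb_token}"},
            timeout=30,
        )
        response.raise_for_status()
        updated += 1
    return updated
```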

Deployment considerations

After you have selected and procured a discovery tool, consider the following questions to drive conversations with the teams responsible for deploying the tool in your organization:

  • Are servers or applications operated by a third party? This could dictate the teams to involve and processes to follow.

  • What is the high-level process for gaining approval to deploy discovery tools?

  • What is the main authentication process to access systems such as servers, containers, storage, and databases? Are server credentials local or centralized? What is the process to obtain credentials? Credentials will be required to collect data from your systems (for example, containers, virtual or physical servers, hypervisors, and databases). Obtaining credentials for the discovery tool to connect to each asset can be challenging, especially when these assets are not centralized.

  • What is the outline of the network security zones? Are network diagrams available?

  • What is the process for requesting firewall rules in the data centers?

  • What are the current support service-level agreements (SLAs) in relation to data center operations (discovery tool installation, firewall requests)?