Timing optimization techniques for RTL based designs on XC7Z007S-1CLG225C MiniZed board

svnagendra2003

4 Sep 2023

The timing optimization techniques described here are common to RTL-based designs on the various Xilinx FPGA platforms, including the XC7Z007S-1CLG225C device on the MiniZed board.

Timing closure consists of the design meeting all timing requirements. It is easier to reach timing closure if you have the right HDL and constraints for synthesis. In addition, it is important to iterate through the synthesis stages with improved HDL, constraints, and synthesis options, as shown in the following figure.

To successfully close timing, follow these general guidelines:
- When initially not meeting timing, evaluate timing throughout the flow.
- Focus on worst negative slack (WNS) of each clock as the main way to improve total negative slack (TNS).
- Review large worst hold slack (WHS) violations (below -1 ns) to identify missing or inappropriate constraints.
- Revisit the trade-offs between design choices, constraints, and target architecture.
- Know how to use the tool options and Xilinx® design constraints (XDC).
- Be aware that the tools do not try to further improve timing (additional margin) after timing is met.

The following sections provide recommendations for reviewing the completeness and correctness of the timing constraints using methodology design rule checks (DRCs) and baselining, identifying the timing violation root causes, and addressing the violations using common techniques.

Checking for Valid Constraints:

Review the Check Timing section of the Timing Summary report to quickly assess the timing constraints coverage, including the following:
- All active clock pins are reached by a clock definition.
- All active path endpoints have a requirement with respect to a defined clock (setup/hold/recovery/removal).
- All active input ports have an input delay constraint.
- All active output ports have an output delay constraint.
- Timing exceptions are correctly specified.

In addition to check_timing, the Methodology report (TIMING and XDC checks) flags timing constraints that can lead to inaccurate timing analysis and possible hardware malfunction. You must carefully review and address all reported issues.

Checking for Positive Timing Slacks:

The following timing metrics reflect the design timing score. To meet timing, every worst slack must be positive and every total slack must be zero.
- Setup/Recovery (max delay analysis): WNS > 0 ns and TNS = 0 ns
- Hold/Removal (min delay analysis): WHS > 0 ns and THS = 0 ns
- Pulse Width: WPWS > 0 ns and TPWS = 0 ns

Understanding Timing Reports:

The Timing Summary report provides high-level information on the timing characteristics of the design compared to the constraints provided. Review the timing summary numbers during signoff:
Total Negative Slack (TNS)

The sum of the setup/recovery violations for each endpoint in the entire design or for a particular clock domain. The worst setup/recovery slack is the worst negative slack (WNS).

Total Hold Slack (THS)

The sum of the hold/removal violations for each endpoint in the entire design or for a particular clock domain. The worst hold/removal slack is the worst hold slack (WHS).

Total Pulse Width Slack (TPWS)

The sum of the violations for each clock pin in the entire design or a particular clock domain for the following checks:
- Minimum low pulse width
- Minimum high pulse width
- Minimum period
- Maximum period
- Maximum skew (between two clock pins of the same leaf cell)

Worst Pulse Width Slack (WPWS)

The worst slack for all pulse width, period, or skew checks on any given clock pin.

The Total Slack (TNS, THS, or TPWS) reflects only the violations in the design. When all timing checks are met, the Total Slack is zero.
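As a rough illustration of how the worst and total slack figures relate, the sketch below computes both from a list of per-endpoint slacks. The function name and the slack values are made up for this example and are not taken from any Vivado report; slacks are in picoseconds so the arithmetic stays exact.

```python
def summarize_slacks(endpoint_slacks_ps):
    """Return (worst_slack, total_slack) for per-endpoint slacks in ps.

    The worst slack is simply the minimum. The total slack sums only the
    violating (negative) endpoints, so it is zero when every check is met.
    """
    worst = min(endpoint_slacks_ps)
    total = sum(s for s in endpoint_slacks_ps if s < 0)
    return worst, total


# Hypothetical setup slacks for two versions of a design.
failing = [412, -150, 1030, -25]   # two endpoints violate
passing = [412, 150, 1030]         # every endpoint meets timing

wns_fail, tns_fail = summarize_slacks(failing)
wns_pass, tns_pass = summarize_slacks(passing)

print("failing: WNS =", wns_fail, "ps, TNS =", tns_fail, "ps")
print("passing: WNS =", wns_pass, "ps, TNS =", tns_pass, "ps")
```

The same calculation applies to WHS/THS and WPWS/TPWS, using hold or pulse width slacks as the input list.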

The timing path report provides detailed information on how the slack is computed on any logical path for any timing check. In a fully constrained design, each path has one or several requirements that must all be met in order for the associated logic to function reliably.

The main checks covered by WNS, TNS, WHS, and THS are derived from the sequential cell functional requirements:

Setup time

The time by which new, stable data must be available before the next active clock edge in order to be safely captured.

Hold requirement

The amount of time the data must remain stable after an active clock edge to avoid capturing an undesired value.

Recovery time

The minimum time required between the time the asynchronous reset signal has toggled to its inactive state and the next active clock edge.

Removal time

The minimum time after an active clock edge before the asynchronous reset signal can be safely toggled to its inactive state.

A simple example is a path between two flip-flops that are connected to the same clock net.

After a timing clock is defined on the clock net, the timing analysis performs both setup and hold checks at the data pin of the destination flip-flop under the most pessimistic, but reasonable, operating conditions. The data transfer from the source flip-flop to the destination flip-flop occurs safely when both setup and hold slacks are positive.
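To make the setup and hold checks on this flip-flop to flip-flop path concrete, here is a simplified sketch of the slack arithmetic. It assumes a single clock, a one-cycle setup requirement, and a hold check against the same launch edge. All field names and delay values are hypothetical, and real timing analysis uses separate slow and fast corner delays plus pessimism removal, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass
class FlopToFlopPath:
    # All values in picoseconds; the names are illustrative only.
    period_ps: int           # clock period (one-cycle setup requirement)
    src_clk_delay_ps: int    # clock arrival time at the source flip-flop
    dst_clk_delay_ps: int    # clock arrival time at the destination flip-flop
    clk_to_q_ps: int         # source flip-flop clock-to-output delay
    data_delay_max_ps: int   # slowest data path delay (used for setup)
    data_delay_min_ps: int   # fastest data path delay (used for hold)
    setup_ps: int            # destination setup time
    hold_ps: int             # destination hold requirement
    uncertainty_ps: int = 0  # clock uncertainty (jitter and so on)


def setup_slack(p):
    """Required time minus arrival time for the max delay check."""
    arrival = p.src_clk_delay_ps + p.clk_to_q_ps + p.data_delay_max_ps
    required = p.period_ps + p.dst_clk_delay_ps - p.setup_ps - p.uncertainty_ps
    return required - arrival


def hold_slack(p):
    """Arrival time minus required time for the min delay check."""
    arrival = p.src_clk_delay_ps + p.clk_to_q_ps + p.data_delay_min_ps
    required = p.dst_clk_delay_ps + p.hold_ps + p.uncertainty_ps
    return arrival - required


# Hypothetical path on a 100 MHz clock (10000 ps period).
path = FlopToFlopPath(
    period_ps=10000, src_clk_delay_ps=2000, dst_clk_delay_ps=2100,
    clk_to_q_ps=450, data_delay_max_ps=6500, data_delay_min_ps=300,
    setup_ps=60, hold_ps=120, uncertainty_ps=35,
)

print("setup slack:", setup_slack(path), "ps")
print("hold slack:", hold_slack(path), "ps")
```

With these example numbers both slacks are positive, so the transfer would be considered safe under the assumptions above.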

Checking That Your Design is Properly Constrained

Before looking at the timing results to see if there are any violations, be sure that every synchronous endpoint in your design is properly constrained.

Run check_timing to identify unconstrained paths. You can run this command as a standalone command, but it is also part of report_timing_summary. In addition, report_timing_summary includes an Unconstrained Paths section where N logical paths without timing requirements are listed by the already defined source or destination timing clock. N is controlled by the -max_paths option.

After the design is fully constrained, run the report_methodology command and review the TIMING and XDC checks to identify non-optimal constraints, which are likely to make timing analysis less accurate and lead to timing margin variations in hardware. To identify and correct unrealistic target clock frequencies or setup path requirements, use the report_qor_assessment command.

Fixing Issues Flagged by check_timing

The check_timing Tcl command reports anything that is missing or wrong in the timing definition. When reviewing and fixing the issues flagged by check_timing, focus on the most important checks first. The checks are listed below from most important to least important.

No Clock and Unconstrained Internal Endpoints

These checks allow you to determine whether the internal paths in the design are completely constrained. You must ensure that the number of unconstrained internal endpoints is zero as part of the Static Timing Analysis signoff quality review.

Zero unconstrained internal endpoints indicate that all internal paths are constrained for timing analysis. However, the correct value of the constraints is not yet guaranteed.

Generated Clocks

Generated clocks are a normal part of a design. However, if a generated clock is derived from a master clock that is not part of the same clock tree, this can cause a serious problem. The timing engine cannot properly calculate the generated clock tree delay. This results in erroneous slack computation. In the worst case situation, the design meets timing according to the reports but does not work in hardware.

Loops and Latch Loops

A good design does not have any combinational loops, because timing loops are broken by the timing engine. The broken paths are not reported during timing analysis or evaluated during implementation. This can lead to incorrect behavior in hardware, even if the overall timing requirements are met.

No Input/Output Delays and Partial Input/Output Delays

All I/O ports must be properly constrained.

Multiple Clocks

Multiple clocks are usually acceptable. AMD recommends that you ensure that these clocks are expected to propagate on the same clock tree. You must also verify that the path requirements between these clocks do not introduce tighter requirements than needed for the design to be functional in hardware.

If this is the case, you must use set_clock_groups or set_false_path between these clocks on these paths. Any time that you use timing exceptions, you must ensure that they affect only the intended paths.

Fixing Issues Flagged by report_methodology

The report_methodology command reports additional constraints and timing analysis issues, which you must carefully review before and after running the place and route tools. This section describes the main XDC and TIMING categories of checks, along with their relative impact on timing closure and hardware stability. You must focus on resolving the checks that impact timing closure first.

Methodology DRCs with Impact on Timing Closure

The DRCs shown in the following table flag design and timing constraint combinations that increase the stress on implementation tools, leading to impossible or inconsistent timing closure. These DRCs usually point to missing clock domain crossing (CDC) constraints, inappropriate clock trees, or inconsistent timing exception coverage due to logic replication. They must be addressed with highest priority.

Baselining the Design for timing closure:

Reviewing Clock Relationships:

You can view the relationship between clocks using the report_clock_interaction Tcl command. The report shows a matrix of source clocks and destination clocks. The color in each cell indicates the type of interaction between clocks, including any existing constraints between them. The following figure shows a sample clock interaction report.

Analyzing and Resolving Timing Violations

Reducing Net Delay Caused by Congestion

Device congestion can make timing closure difficult if the critical paths are placed inside or next to a congested area, or if the device utilization is high and the placed design is difficult to route. In many cases, congestion significantly increases the router runtime. If a path shows routed delays that are longer than expected, analyze the congestion of the design and identify the best congestion alleviation technique.

Congestion Area and Level Definition

AMD device routing architecture comprises interconnect resources of various lengths in each direction: North, South, East, and West. A congested area is reported as the smallest square that covers adjacent interconnect tiles (INT_XnYm) or CLB tiles (CLE_M_XnYm) where interconnect resource utilization in a specific direction is close to or over 100%. The congestion level is the positive integer which corresponds to the side length of the square. The following figure shows the relative size of congestion areas on an AMD device versus clock regions.

Interconnect Congestion Level in the Device Window

The Interconnect Congestion Level metric highlights the largest contiguous area in which routing resources are overused. By default, this metric is based on estimation, which is similar to the congestion level after initial routing. Actual routing can also be displayed if routing exists. After placement or after routing, you can display this congestion metric by right-clicking in the Device window and selecting Metric > Interconnect Congestion Level.

The Interconnect Congestion Level metric provides a quick visual overview of any congestion hotspots in the device. The following figure shows a placed design with several congested areas. This metric is based on the current interconnect demand and availability with a threshold of 0.9 (that is, 90% routing usage). The range is 0.1 to 0.9.

You can visualize congestion based on:
- Direction: North, South, East, West, Vertical, Horizontal
- Type: Short, Long, Global
- Style: Estimated, Routed, Mixed

Example of Congestion per CLB in the Device Window

Reducing Clock Skew:

To meet requirements such as high fanout clocks, short propagation delays, and low clock skew, AMD devices use dedicated routing resources to support the most common clocking schemes. Clock skew can severely reduce timing budget on high frequency clocks. Clock skew can also add excessive stress on implementation tools to meet both setup and hold when the device utilization is high.

The clock skew is typically less than 300 ps for intra-clock timing paths and less than 500 ps for timing paths between balanced synchronous clocks. When crossing resource columns, clock skew shows more variation, which is reflected in the timing slack and optimized by the implementation tools. For timing paths between unbalanced clock trees or with no common node, clock skew can be several nanoseconds, making timing closure almost impossible.

To reduce clock skew:
1. Review all clock relationships to ensure that only synchronous clock paths are timed and optimized.
2. Review the clock tree topologies and placement of timing paths impacted by higher clock skew than expected, as described in the following sections.
3. Identify the possible clock skew reduction techniques, as described in the following sections.

Using Intra-Clock Timing Paths:

Timing paths with the same source and destination clocks that are driven by the same clock buffer typically exhibit very low skew. This is because the common node is located on the dedicated clock network, close to the leaf clock pins, as shown in the following figure.

Limiting Synchronous Clock Domain Crossing Paths:

Timing paths between synchronous clocks driven by separate clock buffers exhibit higher skew, because the common node is located before the clock buffers. That is, the common node is farther from the leaf clock pins, resulting in higher pessimism in the timing analysis. The clock skew is even worse for timing paths between unbalanced clock trees due to the delay difference between the source and destination clock paths. Although positive skew helps with meeting setup time, it hurts hold time closure, and vice versa.
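The opposite effect of skew on setup and hold can be shown with a small sketch. It assumes skew is defined as destination clock delay minus source clock delay, and that the slacks passed in were computed for zero skew. The function name and the numbers are hypothetical and ignore clock uncertainty and pessimism removal.

```python
def slacks_with_skew(setup_slack_ps, hold_slack_ps, skew_ps):
    """Shift zero-skew slacks by a given clock skew (all values in ps).

    Positive skew means the destination clock arrives later than the
    source clock: the capture edge moves later, which adds setup margin
    but removes the same amount of hold margin.
    """
    return setup_slack_ps + skew_ps, hold_slack_ps - skew_ps


# Hypothetical path with 200 ps setup slack and 150 ps hold slack at zero skew.
base_setup, base_hold = 200, 150

positive = slacks_with_skew(base_setup, base_hold, 300)
negative = slacks_with_skew(base_setup, base_hold, -300)

print("skew +300 ps: setup, hold =", positive)
print("skew -300 ps: setup, hold =", negative)
```

In this example, +300 ps of skew turns a passing hold check into a violation, while -300 ps of skew does the same to the setup check, which is why large skew in either direction makes closing both checks harder.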

In the following figure, three clocks have several intra and inter clock paths. The common node of the two clocks driven by the MMCM is located at the output of the MMCM (red markers). The common node of the paths between the MMCM input clock and MMCM output clocks is located on the net before the MMCM (blue marker). For the paths between the MMCM input clock and MMCM output clocks, the clock skew can be especially high depending on the clkin_buf BUFGCE location and the MMCM compensation mode.