Program to count the cycles of the A8 cortex: v0.6


This post is a translation of “Programme pour compter les cycles du cortex A8: v0.6”.

A more complex cycle counter

The way NEON works led me to rewrite the engine of the cycle counter. The simulation is now closer to reality, although there are some questions I have not found answers to, on which I had to speculate.

The new engine is composed of two distinct steps:

  • Parsing the code, which matches each line against regular expressions to identify the instructions
  • Calculating the time taken by the code
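To make the first step concrete, here is a minimal Python sketch of a regex-driven parser. It is an illustration under assumptions, not the tool's code: the three patterns, the `Instr` record and `parse` are hypothetical names, the input is assumed to be free of comments, and a real parser needs much more specific patterns (one per instruction form) so that the timing step knows each instruction's operands.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns, tried in order; the tool's real expressions differ.
PATTERNS = [
    ("label", re.compile(r"^\s*(\w+):\s*$")),
    ("neon", re.compile(r"^\s*(v\w+)(?:\.\w+)*(?:\s+(.*))?$", re.IGNORECASE)),
    ("arm", re.compile(r"^\s*(\w+)(?:\s+(.*))?$")),
]

@dataclass
class Instr:
    kind: str      # "label", "neon" or "arm"
    mnemonic: str  # mnemonic, or the label name for labels
    operands: str  # raw operand text (empty for labels)

def parse(source: str) -> list[Instr]:
    """Step 1: classify each non-blank line (comment-free input assumed)."""
    program = []
    for line in source.splitlines():
        if not line.strip():
            continue
        for kind, pattern in PATTERNS:
            m = pattern.match(line)
            if m:
                operands = "" if kind == "label" else (m.group(2) or "").strip()
                program.append(Instr(kind, m.group(1).lower(), operands))
                break
        else:
            raise ValueError(f"unrecognised line: {line!r}")
    return program

for instr in parse("loop:\n  vadd.i32 q0, q1, q2\n  subs r2, r2, #1\n  bne loop\n"):
    print(instr)
```

The second step would then walk this list and assign cycles to each entry.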

Running the ARM and NEON units in parallel raised some problems. In a loop where the NEON instructions take longer than the ARM code, the first iteration is not representative of how the loop really behaves: the ARM sends 2 instructions per cycle to the NEON queue, regardless of how long they will actually take to execute. As a result, the cycle counter cannot compute the correct time taken by the NEON code.
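The problem can be reproduced with a toy model. In this Python sketch, `simulate`, `queue_size` and the fixed per-instruction costs are hypothetical; only the rate of 2 instructions per cycle comes from the description above. The ARM pushes NEON instructions into a FIFO as long as it has room, and the NEON executes them one at a time.

```python
from collections import deque

def simulate(iterations, neon_costs, queue_size=16, issue_width=2):
    """Hypothetical ARM -> NEON queue model.

    Each cycle the ARM pushes up to issue_width NEON instructions into a FIFO
    of queue_size entries (stalling when it is full); the NEON unit executes
    one queued instruction at a time, each occupying it for its cost in cycles.
    Returns (cycle at which each iteration was fully issued, total cycles).
    """
    program = [cost for _ in range(iterations) for cost in neon_costs]
    per_iteration = len(neon_costs)
    queue = deque()
    issued = 0
    neon_busy = 0  # cycles left on the instruction being executed
    issue_done = []
    cycle = 0
    while issued < len(program) or queue or neon_busy:
        cycle += 1
        # ARM side: issue while there is room in the queue
        for _ in range(issue_width):
            if issued < len(program) and len(queue) < queue_size:
                queue.append(program[issued])
                issued += 1
                if issued % per_iteration == 0:
                    issue_done.append(cycle)
        # NEON side: start the next instruction when the unit is free
        if neon_busy == 0 and queue:
            neon_busy = queue.popleft()
        if neon_busy:
            neon_busy -= 1
    return issue_done, cycle

done, total = simulate(10, [4, 4], queue_size=4)
print(done[:3], done[-1] - done[-2], total)  # → [1, 2, 6] 8 80
```

With two NEON instructions of 4 cycles each and a 4-entry queue, the ARM has issued the first iteration by cycle 1, but the loop settles at one iteration every 8 cycles and the NEON only finishes at cycle 80. Timing the first iteration from the ARM side therefore badly underestimates the loop.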

During the parsing stage, I took the opportunity to detect loops. I call a “loop” every “label – branch” pair where the label is declared before the branch instruction.
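A minimal Python sketch of this loop detection, assuming one label or instruction per line; `find_loops` and the two patterns are hypothetical. A label is remembered when it is seen, so a branch only forms a loop when its target was already declared; forward branches are ignored, and `bl`/`bx` are not treated as loop branches.

```python
import re

LABEL = re.compile(r"^\s*(\w+):")
# "b" with an optional condition code, followed by a label name
BRANCH = re.compile(
    r"^\s*b(?:eq|ne|cs|hs|cc|lo|mi|pl|vs|vc|hi|ls|ge|lt|gt|le|al)?\s+(\w+)\s*$",
    re.IGNORECASE,
)

def find_loops(lines):
    """Return (label_line, branch_line) pairs where the label precedes the branch."""
    labels = {}
    loops = []
    for i, line in enumerate(lines):
        m = LABEL.match(line)
        if m:
            labels[m.group(1)] = i
            continue
        m = BRANCH.match(line)
        if m and m.group(1) in labels:  # backward branch only
            loops.append((labels[m.group(1)], i))
    return loops

src = ["start:", "  mov r0, #0", "loop:", "  subs r1, r1, #1",
       "  bne loop", "  beq end", "end:", "  b start"]
print(find_loops(src))  # → [(2, 4), (0, 7)]
```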
I then added to the cycle counter some “virtual” wait cycles (i.e. cycles that do not actually exist in the Cortex) before labels and before branch instructions, whose purpose is to wait until the pending NEON and VPf queues are empty.
These “virtual” wait cycles are reported with the message: “Wait for NEON & VPf Queue are empty”.
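The virtual wait itself reduces to a maximum. This Python sketch assumes the engine records, for each queue, the cycle at which it will be drained; `State`, `neon_empty_at`, `vpf_empty_at` and `virtual_wait` are hypothetical names, and the function would be called before each label and before each loop branch.

```python
from dataclasses import dataclass

@dataclass
class State:
    cycle: int          # current ARM cycle
    neon_empty_at: int  # cycle at which the NEON queue will be drained
    vpf_empty_at: int   # cycle at which the VPf queue will be drained

def virtual_wait(state: State) -> int:
    """Advance the ARM cycle until both queues are empty; return the wait."""
    wait = max(0,
               state.neon_empty_at - state.cycle,
               state.vpf_empty_at - state.cycle)
    if wait:
        print(f"{wait} cycle(s): Wait for NEON & VPf Queue are empty")
    state.cycle += wait
    return wait

s = State(cycle=10, neon_empty_at=25, vpf_empty_at=18)
virtual_wait(s)  # waits 15 cycles, s.cycle becomes 25
```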

In practice, this gives a correct result in most cases.
Until the cycle counter is able to simulate loops (perhaps in v0.8), I have not found a better way to show the time taken by NEON code.

The new engine handles the parallelism of NEON instructions. For now, I have not enabled the parallelism of memory instructions, as I have not done additional testing yet.

Finally, since I have not had time to run further tests on VPf, I have assumed that VPf works much like NEON (which is probably wrong).
The only differences I have applied are:

  • VPf is not pipelined and cannot execute 2 instructions in the same cycle.
  • The VPf instruction queue can hold only a single instruction.
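The two VPf rules can be written down as a small timeline computation. This Python sketch is hypothetical (`vpf_timeline` and the `ready` cycles are illustrative), and it assumes the ARM offers one VPf instruction per cycle unless told otherwise.

```python
def vpf_timeline(costs, ready=None):
    """Hypothetical VPf model following the two rules above.

    - Not pipelined: an instruction occupies VPf for its whole cost, so the
      next one cannot start before the previous one ends.
    - One-entry queue: the ARM cannot hand over an instruction while another
      one is still waiting in the queue.
    costs[i] is the execution time of instruction i; ready[i] is the cycle at
    which the ARM would like to issue it (default: one per cycle).
    Returns a list of (issue, start, end) cycles.
    """
    if ready is None:
        ready = list(range(len(costs)))
    timeline = []
    prev_issue, prev_start, prev_end = -1, -1, 0
    for cost, want in zip(costs, ready):
        # In-order issue; the single queue slot is free only once the
        # previously queued instruction has started executing.
        issue = max(want, prev_issue + 1, prev_start)
        # Not pipelined: wait until the unit has finished the previous one.
        start = max(issue, prev_end)
        end = start + cost
        timeline.append((issue, start, end))
        prev_issue, prev_start, prev_end = issue, start, end
    return timeline

print(vpf_timeline([5, 5, 5]))  # → [(0, 0, 5), (1, 5, 10), (5, 10, 15)]
```

For three 5-cycle instructions, the second one waits in the queue, so the ARM cannot hand over the third one before cycle 5.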

There were also some minor fixes and changes, listed below:

  • Added a new column display
  • Fixed a bug where the TST, TEQ, CMP and CMN instructions did not lock the flags register
  • Fixed a bug where the LDR (and STR) instruction did not check the availability of the address register
  • Added the ADR instruction
  • Added the missing variants of VCEQ, VCGE, VCGT, VCLE, VCLT
  • Added the VPADD, VPADDL, VPADAL instructions
  • Added a large number of variants of the VLDx and VSTx instructions
  • Support for both memory alignment syntaxes in VLDx and VSTx ([r0:128] and [r0, :128])
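Accepting both alignment syntaxes comes down to making the comma optional in the address pattern. A minimal Python sketch follows; `ADDR` and `parse_address` are hypothetical, and the pattern only covers a base register with an optional 64/128/256-bit alignment and an optional `!` writeback (it does not parse a post-index register).

```python
import re

# Base register, optional ":64/:128/:256" alignment with or without a
# preceding comma, then an optional "!" writeback.
ADDR = re.compile(r"\[\s*(r\d+|sp)\s*(?:,?\s*:\s*(64|128|256))?\s*\](!?)",
                  re.IGNORECASE)

def parse_address(operands):
    """Return (base register, alignment in bits or None, writeback flag)."""
    m = ADDR.search(operands)
    if m is None:
        raise ValueError(f"no address operand in {operands!r}")
    alignment = int(m.group(2)) if m.group(2) else None
    return m.group(1).lower(), alignment, m.group(3) == "!"

print(parse_address("{d0-d1}, [r0:128]!"))  # → ('r0', 128, True)
print(parse_address("{d2}, [r1, :64]"))     # → ('r1', 64, False)
```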

Note: Although the regular expressions have not changed, some previously fixed problems may reappear.
Feel free to report any problems you encounter.
