15

Program to count the cycles of the A8 cortex: v0.7

May

This post is a translation of “Programme pour compter les cycles du cortex A8: v0.7”.

The latest version of the cycle counter is online.

I now have a clear objective: integrating the cycle counter into an assembly editor (ebola), so I have changed the output format considerably.
The default rendering, and now the only one, is the “source” rendering.
The cycle counter only adds information; it never adds or removes a line of code.

Output format

All the information is contained in the 21 characters that precede the instruction.
This string contains:

  • the core unit (a for ARM, n for NEON, v for VFP)
  • the execution cycle (limited to 4 characters)
  • the pipeline used
  • the instruction cycle timing
  • bubbles not caused by a register dependency (for example, waiting for pipeline 0 to execute a multiplication)
  • the last register that caused an instruction stall

For example:

a.6-0    2c p0 r4:3  mla r4, r6, r7, r4

MLA is an ARM instruction that executes on cycle 6 in pipeline 0.
The instruction runs in 2 cycles.
It had to wait for pipeline 0.
Finally, the instruction could not execute earlier because r4 was not available. This register generated 3 bubbles in one or the other of the pipelines.
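
As a rough illustration of how an editor could read this prefix back, here is a minimal Python sketch. It is not the tool’s own code: the function name `parse_annotated_line`, the dictionary keys, and the assumption that the fields after the first one are separated by spaces are all hypothetical, inferred only from the single example line above.

```python
import re

PREFIX_WIDTH = 21  # the post states that the annotation occupies 21 characters

# Leading field, e.g. "a.6-0": core unit, execution cycle, pipeline used.
HEAD_RE = re.compile(r"(?P<unit>[anv])\.(?P<cycle>\d{1,4})-(?P<pipeline>\d)")


def parse_annotated_line(line):
    """Split one annotated line into its fields (hypothetical helper)."""
    prefix, instruction = line[:PREFIX_WIDTH], line[PREFIX_WIDTH:].strip()
    head = HEAD_RE.match(prefix)
    if head is None:
        return None  # not an annotated instruction line
    fields = {
        "unit": head["unit"],
        "cycle": int(head["cycle"]),
        "pipeline": int(head["pipeline"]),
        "timing": None,          # instruction cycle timing, e.g. "2c"
        "pipeline_wait": None,   # bubble caused by waiting for a pipeline, e.g. "p0"
        "stall_register": None,  # last register that caused a stall, e.g. "r4:3"
        "stall_bubbles": 0,
        "instruction": instruction,
    }
    # Assumption: the remaining fields are whitespace-separated tokens.
    for token in prefix[head.end():].split():
        if re.fullmatch(r"\d+c", token):
            fields["timing"] = int(token[:-1])
        elif re.fullmatch(r"p\d", token):
            fields["pipeline_wait"] = int(token[1:])
        else:
            stall = re.fullmatch(r"(\w+):(\d+)", token)
            if stall:
                fields["stall_register"] = stall[1]
                fields["stall_bubbles"] = int(stall[2])
    return fields


# Rebuild the example line with explicit padding so the prefix is 21 characters.
sample = "a.6-0".ljust(9) + "2c p0 r4:3".ljust(12) + "mla r4, r6, r7, r4"
result = parse_annotated_line(sample)
print(result["cycle"], result["timing"], result["stall_register"])
```

Because the prefix has a fixed width, the instruction text can always be recovered unchanged by dropping the first 21 characters.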

The full output mode no longer exists, so the view is not as detailed as before. In a future version, I’ll display the missing information in a popup.

New features

This new version of the cycle counter can run loops.
That is, when the program encounters a branch instruction whose target address precedes the branch itself, it actually takes the branch.

Why, and how does it work?
When a program enters a loop, the pipeline state may differ at each entry into the loop during the first iterations.
This state should converge to a fixed state (or a repeating sequence of states), so the first iterations are not representative of how the loop really runs.
The new cycle counter executes the loop until it finds a converged state (i.e., one that repeats several times).

For an ARM loop, convergence is quite fast. For a loop containing NEON instructions, it may take longer.

In the end, the result is better because it ignores the first iterations of the loop and is based on the most representative iteration.
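
One way to picture this convergence search is as cycle detection on the pipeline state at loop entry. The Python sketch below is only an illustration under stated assumptions, not the tool’s code: `run_until_converged`, `step`, and the toy integer state are hypothetical, and it stops at the first repeated state instead of waiting for several repetitions as the cycle counter does.

```python
def run_until_converged(step, initial_state, max_iterations=1000):
    """Iterate a loop body until the state at loop entry repeats.

    step: hypothetical deterministic function mapping the state at one
          loop entry to the state at the next loop entry.
    Returns (first_steady_iteration, period).
    """
    seen = {}  # state -> iteration at which it was first observed
    state = initial_state
    for iteration in range(max_iterations):
        if state in seen:
            first = seen[state]
            return first, iteration - first
        seen[state] = iteration
        state = step(state)
    raise RuntimeError("no convergence within max_iterations")


# Toy model: the state is the number of stall cycles pending at loop entry.
# It shrinks by 2 per iteration until it settles at 1.
first, period = run_until_converged(lambda stalls: max(stalls - 2, 1), 5)
print(first, period)  # 2 1
```

In this toy run, iterations 0 and 1 are warm-up and iteration 2 is the representative one. A period greater than 1 would correspond to a loop that alternates between several entry states.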

In addition, each time the cycle counter detects a loop, it resets the cycle number to 0. This makes it easy to read the time taken by one iteration: it is the cycle on which the branch executes. Obviously, the cycle counter cannot know how many times the loop should really be executed, so it runs the loop until convergence is reached. Concretely, this means that if you have a loop like

for(i=0 ; i<4 ;i++)
{
}

It is possible that the cycle counter executes the loop many more times than the expected 4 iterations.
Finally, the execution cycles reported for NEON instructions may be surprising. Remember that because of the NEON instruction queue, the instructions of a given iteration are likely to actually execute during the following iterations.

Other changes

  • Some instructions were added.
  • NEON memory access instructions can now be issued in parallel with a computation instruction.
  • Regular expressions are now stricter. For example, the nonexistent register r25 can no longer be used.
  • Some checks on the validity of immediate values have been added.
  • The writeback cycle is now supported.
  • Some interactions between different core units were added (VMOV r0, d0[1] for example).
  • I merged the NEON and VFP queues. I don’t know if it was a good idea, but it simplified the handling of interactions without changing the cycle counts.
  • Finally, VFP instructions are not correctly handled in this version. This problem will be fixed in the next release.
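
To illustrate the stricter register check, here is a minimal Python sketch of a pattern that rejects nonexistent registers such as r25. It is an assumption about how such a check could look, not the regular expression used by the cycle counter; `REGISTER_RE` and `is_valid_register` are hypothetical names, and aliases such as sp, lr, and pc are deliberately left out.

```python
import re

# Register names that exist on a Cortex A8 (ARMv7 with NEON and VFPv3).
REGISTER_RE = re.compile(
    r"""
    r(?:1[0-5]|[0-9])        # ARM core registers r0-r15
  | s(?:3[01]|[12]?[0-9])    # VFP single-precision registers s0-s31
  | d(?:3[01]|[12]?[0-9])    # NEON/VFP doubleword registers d0-d31
  | q(?:1[0-5]|[0-9])        # NEON quadword registers q0-q15
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_valid_register(name):
    """Return True only if the whole name is an existing register."""
    return REGISTER_RE.fullmatch(name.strip()) is not None


print(is_valid_register("r4"), is_valid_register("r25"))  # True False
```

Using `fullmatch` rather than a prefix match is what makes the check restrictive: r25 starts with the valid name r2, but the trailing digit makes the whole name fail.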

Feel free to send me a better translation of this post at pulsar[at]webshaker.net
You can also send me a translation to another language if you want.


16 Responses to “Program to count the cycles of the A8 cortex: v0.7”

  1. fadden says:

    Section 16.8 in the ARM Cortex-A8 Technical Reference Manual:

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/Babeghic.html

    There is a code example annotated with cycle and pipeline info. I pasted the ARM sample into the tool and it doesn’t quite agree.

  2. Etienne SOBOLE says:

    Thank you fadden.
    I did not think of testing this code! I will!

    But as I said in the post:
    “Finally, VFP instructions are not correctly handled in this version. This problem will be fixed in the next release.”

    I’m looking for VFP sample code. If you have other code, send it to me. I need VFP code to check the validity of v0.8!

  3. Igor says:

    First, thank you for a great tool! I improved the performance of one of my asm functions almost 2x just by changing the instruction order!

    I found a small bug: the following two lines of code crash the tool:
    vmov r9, s18
    vcvt.f32.s32 d9, d9

    Crash message:
    Fatal error: Allowed memory size of 314572800 bytes exhausted (tried to allocate 718457 bytes) in /opt/web/clients/w/ws-php5/pulsar.webshaker.net/public_html/test/ccc8/pipeline-engine.php on line 2366

  4. Igor says:

    P.S. Another pair of instructions causes a similar crash:
    vmov r9,s0
    vldr s0,[r9]

  5. Anton says:

    The tool does not recognize this instruction:
    vmov.i8 q10, #0x0f [Unrecognized instruction]

  6. Etienne SOBOLE says:

    That’s right

    I’ve a couple of bugs to fix.
    I will do that soon.

  7. Martin says:

    Hi Etienne! Thanks for developing Pulsar! I have one feature request: The vext command should dual issue with ALU instructions. Here is an example: http://pulsar.webshaker.net/ccc/sample-b0328271 Thanks!

  8. Etienne SOBOLE says:

    Hi, and thank you!
    I’m now working on a simpler version of the cycle counter!
    The Cortex A9 (and probably the A15) can’t execute 2 NEON instructions in the same cycle, so the latest version (not online for the moment) no longer handles the NEON dual-issue capability of the Cortex A8.

    To be clear, I will not fix this bug. Sorry!

  9. Martin says:

    Hi, thanks for your answer. It’s a pity that A8 is not supported anymore. Did you consider having two versions of the cycle counter online? One for Cortex-A8 and one for Cortex-A9? Or is Cortex-A8 too difficult to support (I think you did a really great job, though)?

  10. AvLadder says:

    Hi, I am trying to reduce the cycles spent in a function with your tool.
    It took about 180 cycles before, and about 100 cycles after optimizing with this tool.
    But on the real processor I get a 5% regression in my function’s speed. Why could that be?

  11. Etienne SOBOLE says:

    That could come from many things!

    1 – Cycles are counted for the Cortex A8. With a Cortex A9 you can see some differences because of its out-of-order mechanisms.
    2 – The cycle counter does not model cache misses. Most of the time you will hit a limit set by the memory bandwidth.

    If you make consecutive memory read accesses, you can try using the PLD instruction.

    Note: the PLD instruction has to be used with an offset.
    For example PLD [r0, #0x40]. Do not use too small an offset!

    Note: do not use the PLD instruction for buffers you write to! Use it only for reads!

  12. AvLadder says:

    Thanks for your answer.
    1 – I have a Cortex A8 for testing now, but my app will also run on the Cortex A9 in the future.
    2 – Do you know how I can count cache misses? I read about the System Control Coprocessor and the System Performance Monitor in Cortex processors, but I don’t know how to access them on an Android device in user mode.

    Yes, I make consecutive memory read accesses; my test was on a SAD (sum of absolute differences) function over a 16×16 matrix. But I do it using NEON instructions. Can PLD help me in this case?
    Also, there is a note on arm.com that PLD is a hint instruction and that its implementation is optional. Do you know how often it isn’t implemented?

  13. Etienne SOBOLE says:

    I do not really know how to count cache misses or the time they take.

    Can you put your code in the cycle counter and give me the permalink? I could check whether I can find an optimisation for you!

  14. AvLadder says:

    Yes, I think I can do that, but by private message. Can you give me your email or another contact?

  15. Ramanand Mandayam says:

    Hi

    First of all, thanks for such a wonderful job.

    I had one question: When I try to count cycles for my sample program I see “Unrecognized instruction” for this kind of instruction:

    vminq.u8 q2, q1, q0

  16. Etienne SOBOLE says:

    The cycle counter is no longer supported.
    I will release the code one day when I have time.
