Program to count the cycles of the A8 cortex: v0.8


Here is version 0.8 of the cycle counter.

This post is a translation of “Programme pour compter les cycles du cortex A8: v08”.

This new version brings quite a few changes.

  • Interactions between NEON and VFP are now handled
  • The cycle counter now checks the validity of immediate values
  • You can now assign an inline variable to a register, so that the cycle counter can work with inline C code
  • Every piece of code you analyze returns a permalink you can use (in a newsgroup, for example) to share the code
  • Correction (or addition) of several hundreds of rules

Regarding the hundreds of corrected rules: in version 0.7 I added a log that records the instructions the cycle counter does not recognize. I should have done it much sooner; it would have saved me a lot of time.
I realized that a large number of instructions were missing (mainly VMOV, VCVT and memory accesses).
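
As an illustration of the immediate-value check mentioned in the change list, here is a minimal Python sketch. It is not the cycle counter's own code, which is not published. It assumes the classic ARM-mode (A32) data-processing rule only: an immediate is valid when it is an 8-bit constant rotated right by an even amount. The function name `is_arm_immediate` is hypothetical, and Thumb-2 encodings follow different rules that are not covered here.

```python
def is_arm_immediate(value):
    """Return True if value fits the A32 'modified immediate' form:
    an 8-bit constant rotated right by an even amount (0, 2, ..., 30)."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Rotating left by `rot` undoes a rotate-right by `rot`.
        undone = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if undone < 0x100:
            return True
    return False


print(is_arm_immediate(0xFF000000))  # 0xFF rotated right by 8
print(is_arm_immediate(0x101))       # needs a 9-bit window
```

A value such as 0x102 is rejected even though its set bits span only 8 positions, because the rotation needed to align them would be odd.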

Otherwise, the cycle counter is used on average 45 times per day. Not so bad!
The cycle counter is still at the same place.

Version 0.81

June 19, 2011.

  • Patch instructions VLDR and VSTR
  • Update cycle table for STR instruction
  • Improve immediate value parser
  • Add instructions SETEND, ISB, CLREX, SMI and SMC
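
The comments below explain that this update added hexadecimal immediates and the :upper16: and :lower16: keywords, which select the two 16-bit halves of a 32-bit value for a MOVW/MOVT pair. A minimal Python sketch of that idea follows; the helper names `parse_immediate`, `split_halves` and `movw_movt` are hypothetical and are not taken from the cycle counter.

```python
def parse_immediate(text):
    """Parse an immediate such as '#42' or '#0x2A' into an integer."""
    # Base 0 lets int() accept both decimal and 0x-prefixed hexadecimal.
    return int(text.strip().lstrip("#"), 0)


def split_halves(value):
    """Return (lower16, upper16) of a 32-bit value."""
    value &= 0xFFFFFFFF
    return value & 0xFFFF, value >> 16


def movw_movt(reg, value):
    """Build the MOVW/MOVT pair that loads a full 32-bit constant."""
    low, high = split_halves(value)
    return ["movw %s, #0x%04x" % (reg, low),
            "movt %s, #0x%04x" % (reg, high)]


for line in movw_movt("r0", parse_immediate("#0x12345678")):
    print(line)
```

MOVW writes the lower half and clears the upper half, so it has to come first; MOVT then fills in the upper half without touching the lower one.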

Version 0.82

June 21, 2011.

  • Add instructions SVC, SWI, BKPT, CDP, CDP2, MCR, MCR2, MRC, MRC2, RFE, RFEDA, RFEDB, RFEIA and RFEIB. These instructions have been added for parsing purposes only. Their cycle count has been set to 1.

Version 0.83

June 28, 2011.

  • Add instructions RRX, SEL, MLS, SMMUL, SMMLA, SMMLS.
  • Patch instructions RSC, MCR.
  • Patch instructions PUSH, POP (1 register).
  • Patch instructions VLDM, VSTM.
  • Update cycle table for STM and LDM instructions.

Feel free to send me a better translation of this post at pulsar[at]webshaker.net
You can also send me a translation to another language if you want.


20 Responses to “Program to count the cycles of the A8 cortex: v0.8”

  1. drk says:

    VLDR/VSTR: they seem to crash it with a "not implemented function" error !
    STR with shift is actually dual cycle, not single ( a.44-1 1c STR R1, [R2,R3,LSR#8] )
    also, as most disassemblers unite the MOVW/MOVT pair, it would be nice to either expand it to two opcodes, or generate an error if the MOV imm is >16 bits !

    Having some more info available would be nice (like, for each register, the cycle/state in which it's read/written .. maybe as a popup ?). Also, using registers in CAPS (like R2, R1) works fine, except they aren’t colored for stalls !

    Awesome work btw :)

  2. Etienne SOBOLE says:

    Thanks for making me work on a Sunday ;)

    You were correct on all points:
    - VLDR / VSTR now work (a callback function was missing)
    - The STR rules have been updated.
    - Immediate values are now correctly recognized (I had forgotten the hexadecimal format)
    - MOV can’t be replaced by MOVW / MOVT, but immediate values for MOV should now work. I added the :upper16: and :lower16: keywords.

    Thanks for the report.

  3. drk says:

    Yay, now i can abuse your server even more ~ :p. Btw, is this script open source or something ? could be a nice base to build upon … … !

  4. drk says:

    Oh also, while at it, i think SVC/SWI are still missing too :p

  5. drk says:

    Je veux also dit, que c’est tre bon a trouve de *cough* useful tools *cough* en Francais. Come-ca je peux *practice* a lire ! (Yay for those delf exams all those years ago ..)

  6. Etienne SOBOLE says:

    I see that your French is nearly as good as my English. ;)
    But I understood anyway !!!

    I’ll add the SVC and SWI instructions.
    But for the moment the cycle counter is not an open source program. Maybe one day, I don’t know.

  7. RUBO says:

    Can you please check the latency between these two insns?
    In mov, r0 is available at stage E1.
    In add, r0 and r1 are both available at stage E1. => latency = 1, in both cases.

  8. RUBO says:

    I’m sorry, I made a mistake.
    In mov, r1 (not r0) is available at stage E1.

  9. Etienne SOBOLE says:

    Well, Rubo.
    You were right on the second example.

    The 2 instructions can’t execute in the same cycle.
    I have updated the cycle counter.

    Thank you.

  10. Dung says:

    Please check the case below:
    vld2.32 {d0-d1}, [r1]
    vld2.8 {d0[0],d1[0]}, [r1]
    According to the specs, a vld2 that loads to one lane requires its destination registers at N1. So the second vld2 should start at cycle 4.

  11. Dung says:

    Please check the instructions below, in the floating point group:

    In the specs, they all require their source registers at stage N2.
    However, in your database (Excel file), they require their source registers at stage N1. So the cycle counter doesn’t give the expected result.

  12. Dung says:

    Please check the 2 instructions below:
    vqdmlal.s16 q1, d0, d1
    vqdmlal.s16 q1, d0, d1[0]
    According to the Cortex-A8 specs, the second vqdmlal (with scalar) requires its destination registers at N3. So the second vqdmlal should start at cycle 5 instead of 2.

  13. Dung says:

    I can’t find scheduling information for some instructions in the specs (“cortex_a8_r3p2_trm.pdf”). For example:

    Can you explain how you found the scheduling information for these instructions? If you couldn’t find any, how did you handle these instructions?

  14. Dung says:

    vrsqrts.f32 d0, d1, d2
    vrsra.s16 q0, q1, #1
    vrsqrts releases d0 at N9. vrsra requires q0 at N3. So I think a penalty cycle should occur here.

  15. Etienne SOBOLE says:

    Hi Rubo, sorry for my late response, but I was on holiday !!!

    vld2.32 {d0-d1}, [r1]
    vld2.8 {d0[0],d1[0]}, [r1]
    You’re right. There is a big problem here. It will not be easy to make this case work correctly !

    You’re right again. I’ve updated the Excel file.

    This is due to the vqdmlal shortcut. The documentation’s explanation of shortcuts suggests that there is an extra latency cycle, but in fact there is not. If you run the real test, you’ll see that this is not the case.

    There is no cycle information in the Cortex A8 documentation. You’re right.
    I do not remember where I found this information, but it is in the Cortex A9 documentation, and NEON seems to be the same on Cortex A8 and Cortex A9.

    Hmm. It’s due to a mistake in the Excel file. It’s updated now.

    Thank you for your help !

  16. Dung says:

    vqdmlal.s16 q1, d0, d1
    vqdmlal.s16 q1, d0, d1[0]

    vqdmlal is just like vqdmla: it requires its destination registers at N3. So the 2 instructions above should incur some penalty cycles. Please check.

  17. Dung says:

    vrecpe.u32 d1, d1
    vrecpe.u32 q1, q0
    “vrecpe.u32 q1, q0” requires d0 at N2 and d1 at N3. So it should start at cycle 4 instead of 5. I guess the cycle counter assumes that “vrecpe.u32 q1, q0” requires d1 at N2. Please check.

  18. Dung says:

    “vrecpe.u32 q1, q0” takes 2 cycles because of the quad word register, but the cycle counter says only 1 cycle. Please check.

  19. nguns says:

    How to get the link for a sample that we tried on the calculator? like you guys are getting it as http://pulsar…/ccc/sample... how can I get such links for my calculations?

  20. nguns says:

    Sorry for the trouble, got the answer. Thanks.
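
Several of the comments above reason about stalls from two numbers: the pipeline stage at which an instruction produces its result (for example N9) and the stage at which the next instruction needs that register (for example N3). The Python sketch below works through that arithmetic under a deliberately simplified model, which is an assumption of this example and not the cycle counter's actual rule set: the result is ready at the end of the producing stage, the source is needed at the start of the consuming stage, and at most one instruction issues per cycle. The function name `min_issue_gap` is hypothetical.

```python
def min_issue_gap(result_stage, source_stage):
    """Minimum number of cycles between issuing the producer and issuing
    the dependent consumer, under the simplified model described above.

    result_stage: stage number at whose end the producer's result is ready.
    source_stage: stage number at whose start the consumer reads it.
    """
    # The consumer's source stage must come strictly after the producer's
    # result stage; back-to-back issue corresponds to a gap of 1.
    return max(1, result_stage - source_stage + 1)


# Numbers from comment 14: result released at N9, source required at N3.
gap = min_issue_gap(9, 3)
print("issue gap:", gap, "cycles, i.e.", gap - 1, "stall cycles")
```

As comment 15 points out for the vqdmlal shortcut, behaviour measured on real hardware can differ from what this kind of table arithmetic predicts.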

