element14 Community
My First Assembly Program on a PI Zero2W

scottiebabe over 3 years ago

I recently stumbled across this great ebook:


Available at: https://personal.utdallas.edu/~pervin/RPiA/RPiA.pdf

Following along the yellow brick road, I was able to get a basic program up and running. It takes two numbers as user input and calculates their sum. Thrilling, I know! Well, kind of: the addition is done in double-precision floating point on the VFP of the Cortex-A53. I should note that I am still using a 32-bit OS on the Pi; it will be interesting to see what changes when I move to the new 64-bit OS.

.data
msg1: .asciz "Calculating A + B on the VFP\nInput A: "
msg2: .asciz "\nInput B: "
msg3: .asciz "\nResult =: %f\n"
scan_pattern : .asciz "%lf"

.balign 4
return: .word 0

.balign 8
d1: .double 0.0
d2: .double 0.0

.text

.global main /* entry point must be global */
.func main /* ’main’ is a function */
main: /* This is main */
   ldr r1, =return
   str lr, [r1]

   ldr r0, =msg1
   bl printf

   ldr r0, =scan_pattern @ r0 <- &scan_pattern
   ldr r1, =d1
   bl scanf

   ldr r0, =msg2
   bl printf

   ldr r0, =scan_pattern @ r0 <- &scan_pattern
   ldr r1, =d2
   bl scanf

   ldr r1, =d1               @ d1 and d2 sit next to each other in .data,
   vldmia.f64 r1, {d0-d1}    @ so one vldmia loads both: d0 <- d1, d1 <- d2


   vadd.f64 d0, d0, d1

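   @ printf's %f expects a double. Variadic arguments are never passed in
   @ VFP registers, so under the AAPCS the double goes in the even/odd core
   @ register pair r2:r3; r0 holds the format string and r1 is skipped.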
   ldr r0, =msg3
   vmov r2,r3,d0
   bl printf
   
   mov r0, #0                @ exit status 0 (otherwise main returns printf's result)
   ldr r1, =return
   ldr lr, [r1]
   bx lr

/* External */
.global printf
.global scanf
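For readers more at home in C, here is a rough sketch of the same logic as the program above. It is not the author's code: the function name `add_two_inputs` is hypothetical, and reading from a `FILE *` instead of calling `scanf` directly is an assumption made so the logic can be exercised without a keyboard. On a 32-bit hard-float (armhf) Raspberry Pi OS build, gcc would typically turn the `a + b` line into a single `vadd.f64`.

```c
#include <stdio.h>

/* Hypothetical helper: prompts on `out`, reads two doubles from `in` using
   the same "%lf" pattern the assembly passes to scanf, then prints and
   stores their sum. Returns 0 on success, -1 if a value could not be read. */
int add_two_inputs(FILE *in, FILE *out, double *sum)
{
    double a = 0.0, b = 0.0;   /* play the role of the d1/d2 words in .data */

    fputs("Calculating A + B on the VFP\nInput A: ", out);
    if (fscanf(in, "%lf", &a) != 1)
        return -1;

    fputs("\nInput B: ", out);
    if (fscanf(in, "%lf", &b) != 1)
        return -1;

    *sum = a + b;              /* typically a single vadd.f64 on armhf */
    fprintf(out, "\nResult =: %f\n", *sum);
    return 0;
}
```

Called as `add_two_inputs(stdin, stdout, &s)` it behaves like the assembly version, except that it also checks the value scanf returns.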

The makefile to assemble and link the programs:

# Makefile
all: first fploop
fploop: fploop.o
	gcc -g -o $@ $+
fploop.o : fploop.s
	as -g -mfpu=vfpv2 -o $@ $<
first: first.o
	gcc -g -o $@ $+
first.o : first.s
	as -g -mfpu=vfpv2 -o $@ $<
clean:
	rm -vf first fploop *.o

Testing it out:


Amazing! lol

I also experimented with measuring how long some of the VFP instructions take. This program runs a tight loop 1 billion times around a VFP instruction (the candidate instructions are commented out inside the loop):

.data
msg1: .asciz "Looping: (%d) times...\n"
msg2: .asciz "\nResult =: %3.1e\n"
scan_pattern : .asciz "%lf"

.balign 4
return: .word 0
loopcount: .word 1000000000


.balign 8
d1: .double 1.0
d2: .double 1.0

.text

.global main /* entry point must be global */
.func main /* ’main’ is a function */
main: /* This is main */
   ldr r1, =return
   str lr, [r1]

   ldr r0, =msg1
   ldr r1, =loopcount
   ldr r1, [r1]
   bl printf

   ldr r1, =d1
   vldmia.f64 r1, {d0-d1}

   ldr r1, =loopcount
   ldr r1, [r1]
loop:
//   vadd.f64 d2, d0, d1
//   vadd.f64 d0, d0, d1
//   vmla.f64 d2, d0, d1
   subs r1, r1, #1
   bne loop

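   ldr r0, =msg2
   vmov r2,r3,d0             @ double argument in r2:r3, as in the first program
   mov r1,#0                 @ r1 is not used by this printf call
   bl printf

   mov r0, #0                @ exit status 0
   ldr r1, =return
   ldr lr, [r1]
   bx lr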

/* External */
.global printf
.global scanf

With only the loop itself (the subs and the conditional branch), it runs at the full system clock rate, roughly one iteration per clock cycle. Pretty snazzy!


(With the loop body empty, d0 is never modified, so it still holds its initial value of 1.0, loaded from d1.)
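The same measurement idea can be sketched in C for comparison. This is only an approximation of the assembly experiment: the function name `time_dependent_adds` is hypothetical, `clock()` measures CPU time rather than counting cycles, and the compiler chooses the exact instructions, so the loop will not be a bare `subs`/`bne` plus one `vadd.f64`. The `step` value is `volatile` so every add has to happen at run time, at the cost of an extra load per iteration.

```c
#include <time.h>

/* Hypothetical helper: runs `n` iterations of a dependent double add,
   acc = acc + step, the C counterpart of "vadd.f64 d0, d0, d1".
   Stores the final accumulator in *result and returns the elapsed CPU
   time in seconds as measured by clock(). */
double time_dependent_adds(long n, double *result)
{
    volatile double step = 1.0;  /* reloaded every iteration, so no add is folded away */
    double acc = 1.0;            /* starts at 1.0 like d0 in the assembly */

    clock_t start = clock();
    for (long i = 0; i < n; i++)
        acc = acc + step;        /* each add depends on the previous result */
    clock_t end = clock();

    *result = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

`time_dependent_adds(1000000000L, &r)` would be the analogue of the 1 billion iteration run; `long` is 32 bits on 32-bit ARM, which still holds 1e9.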

Adding one VFP instruction at a time to the loop, then reassembling and rerunning, gives the following run times:

vadd.f64 d2, d0, d1 => 2s

vadd.f64 d0, d0, d1 => 4s

vmla.f64 d2, d1, d0 => 4s
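Converting those run times into clock cycles makes them easier to compare. Assuming the 1 GHz clock mentioned in the reply below and the 1 billion loop iterations, cycles per iteration = (seconds × clock rate) / iterations, so each second of run time is about one cycle per iteration. On that reading the independent `vadd.f64 d2, d0, d1` costs about 2 cycles per iteration, while `vadd.f64 d0, d0, d1`, where each add must wait for the previous result, costs about 4, so that figure presumably reflects the add's latency rather than its throughput. The helper name is hypothetical; it is just the arithmetic.

```c
/* Hypothetical helper: converts a measured run time into clock cycles per
   loop iteration, (seconds * clock_hz) / iterations. */
double cycles_per_iteration(double seconds, double clock_hz, double iterations)
{
    return seconds * clock_hz / iterations;
}
```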

That is a lot of processing power for a $15 development board.

    scottiebabe over 3 years ago

    The Cortex-A53 on the Zero 2 W is also a superscalar bad-****. Here I add a few more integer instructions, which end up running in parallel.


    In 4 s, a single core completes 8 billion instructions while clocked at only 1 GHz.
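As a quick check of that figure: at 1 GHz, 4 s is 4 billion clock cycles, so 8 billion instructions works out to 2 instructions per cycle. That matches the Cortex-A53 being a dual-issue in-order core, able to issue up to two instructions per cycle. The helper name below is hypothetical; it is only the arithmetic.

```c
/* Hypothetical helper: average instructions completed per clock cycle,
   instructions / (seconds * clock_hz). */
double instructions_per_cycle(double instructions, double seconds, double clock_hz)
{
    return instructions / (seconds * clock_hz);
}
```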
