Friday, 14th May, 2010

x86_64 Assembler, Part 2

So, once again I set forth into the wilderness of asm programming, with very little to help me, bar a 64-bit Intel Core 2 Duo (I think… it was a uni box), a few friends busying themselves with work beside me, and the help of those on IRC. It was going to be a long night.

Having already overcome the main hurdles of working out how to print an answer to the screen, how to get it to assemble and link, and the basics of GCC, I just had to get two loops working nicely together. No biggy. Problem 6, possibly one of the easiest out of the first 10.

Original Problem 6 Here

The first part was simply about getting the sum 1^2 + 2^2 + … + 100^2.

sum_squares:
  ; Begin the calculation, rax = 1^2 + 2^2 + ... rbx^2
	
  ; First, save the registers to the stack.
  push rbx
  push rcx
  push rdx
  push rsi
  ; Next, set the initial values
  mov rcx, 0 
  mov rsi, 1
  mov rsi, rax 
	
loop_sum_squares:
  mul rax  ; rax = rax^2
  add rcx, rax  ; Add it to the accumulator
  inc rsi  ; Increment the loop counter
  mov rax, rsi ; Prepare for the next iteration
  cmp rsi, rbx  ; See if we have to do the next iteration
  je sum_sq_ret  ; We reached the limit - go to the exit sequence.
  jl loop_sum_squares ; Not reached it yet, keep looping.
	
sum_sq_ret:
  ; Exit sequence.
  mov rax,rcx  ; Put the result in rcx.
  ; Restore the registers to their original state before leaving.
  pop rsi
  pop rdx
  pop rcx
  pop rbx
  ret

The sum_squares is simply a driver function, ensuring that the accumulator is 0, saving the registers and so forth. loop_sum_squares is where it all happens, rsi is the loop counter, keeping track of where we are, and rax is where the squaring is done. Once it’s squared, it’s added to the accumulator in rcx. It then checks rsi and rbx to see if we’ve reached our limit. If we have, jump to the exit procedure. If not, continue looping.

The code for (1 + 2 + … + 100)^2 is vastly similar, relying on a counter and an accumulator.

	
sum_to:
  ; Save the registers
  push rcx
  push rdx
  ; Initial accumulator value
  mov rcx, 0
	
loop_sum_to:
  add rcx, rax  ; acc = acc + current
  inc rax   ; current++
  cmp rax, rbx
  jl loop_sum_to ; if rax < rbx, continue looping
  je sum_to_ret ; else go to the return sequence
	
sum_to_ret:
  mov rax, rcx ; put the result in the right place
  ; restore the registers
  pop rcx 
  pop rdx
  mul rax ; Square the sum.
  ret ; return

Now all that I needed was to call these and find the difference between them. Not hard at all.

section .text
    global _start
	
section .data
   msg db '%lld', 10, 0
	
extern printf
extern _exit
	
_start:
  mov rax, 1 ; 1^2 + 2^2 + ... +101^2
  mov rbx, 101 ; limit
  call sum_squares
  push rax  ; save the result for later
  mov rax, 1;
  call sum_to ; rax = (1 + 2 + 3 + ... + 100)^2
  pop rbx  ; rbx = 1^2 + 2^2 + ... + 100^2
  sub rax, rbx
  jmp write_result
	
write_result:
   ; write the result to the screen
   mov rsi, rax
   mov rdi, msg
   mov rax, 0
   call printf
   jmp exit
      
exit:
  ; Call the exit system call.
  push  42
  call _exit

Simple, really. Mathematically simpler than problem 5, actually. And a brief comparison between this implementation and the Python one:

dst502@milan:~/Code/asm$ time ./euler6
XXXXXXXX
	
real	0m0.002s
user	0m0.000s
sys	0m0.002s
dst502@milan:~/Code/asm$ time python ./euler6.py
XXXXXXXX
	
real	0m0.064s
user	0m0.047s
sys	0m0.004s

Clearly, the asm is faster. But bear in mind, it took me around 3 days to write the asm versions of problem 5 and problem 6, whereas the Python versions took barely 10 minutes between them.

Done! :-D