Some ARM Cortex-M cores implement a DWT (Data Watchpoint and Trace) unit, and that unit has a nice feature: a register that counts execution cycles. The DWT is present on most Cortex-M3, M4 and M7 devices, including for example the NXP Kinetis and LPC devices.

DWT Cycle Count Register (Source: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0489d/BABJFFGJ.html)
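
To make this more concrete, here is a minimal sketch of how the cycle counter could be switched on and read. It assumes the standard ARMv7-M register addresses (DEMCR at 0xE000EDFC, DWT_CTRL at 0xE0001000, DWT_CYCCNT at 0xE0001004). The names `dwt_regs_t`, `cycle_counter_present`, `cycle_counter_start` and `cycles_elapsed` are made up for illustration and are not part of any vendor library. The functions take the registers as pointers so the logic can also be tried on a host with fake registers.

```c
#include <stdint.h>

/* Bit positions as defined by the ARMv7-M architecture. */
#define DEMCR_TRCENA        (1UL << 24) /* DEMCR: enable the DWT/ITM blocks    */
#define DWT_CTRL_CYCCNTENA  (1UL << 0)  /* DWT_CTRL: enable the cycle counter  */
#define DWT_CTRL_NOCYCCNT   (1UL << 25) /* DWT_CTRL: set if no cycle counter   */

/* Hypothetical minimal view of the first two DWT registers. */
typedef struct {
    volatile uint32_t CTRL;   /* DWT_CTRL,   0xE0001000 */
    volatile uint32_t CYCCNT; /* DWT_CYCCNT, 0xE0001004 */
} dwt_regs_t;

/* Only meaningful on a real Cortex-M3/M4/M7 target. */
#define DWT_REGS   ((dwt_regs_t *)0xE0001000UL)
#define DEMCR_REG  ((volatile uint32_t *)0xE000EDFCUL)

/* Returns nonzero if the DWT reports that a cycle counter is implemented. */
int cycle_counter_present(const dwt_regs_t *dwt)
{
    return (dwt->CTRL & DWT_CTRL_NOCYCCNT) == 0;
}

/* Enable trace, clear the counter, then start counting. */
void cycle_counter_start(volatile uint32_t *demcr, dwt_regs_t *dwt)
{
    *demcr |= DEMCR_TRCENA;
    dwt->CYCCNT = 0;
    dwt->CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Difference of two counter readings; unsigned arithmetic handles one wrap-around. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On a target this would be used as `cycle_counter_start(DEMCR_REG, DWT_REGS);`, then reading `DWT_REGS->CYCCNT` before and after the code to be measured. With the CMSIS-Core headers the same registers are available as `CoreDebug->DEMCR`, `DWT->CTRL` and `DWT->CYCCNT`, so the raw addresses are not needed there.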