Brain-inspired computing based on emerging nonvolatile memory devices (II) - Xin Zhang, PhD
In traditional von Neumann computer architectures, the well-known memory wall problem, i.e., the cost of moving data between the microprocessor and off-chip memory, has become the bottleneck of the entire system. The problem is exacerbated when training and testing large-scale neural networks, which require large amounts of data to be moved and processed. Because neuro-inspired learning algorithms rely heavily on large-scale matrix operations, computing paradigms that exploit fine-grained parallelism directly on the chip are attractive. One promising solution is the neuro-inspired architecture, which performs distributed computing in neurons and stores weights locally in synapses. The figure below shows a revolutionary shift in the computing paradigm from a computation-centric (von Neumann) architecture to a data-centric (neuro-inspired) architecture.
A revolutionary shift from a computation-centric (von Neumann) architecture to a data-centric (neuro-inspired) architecture: (a) von Neumann architecture; (b) neuro-inspired architecture.
Neurons are simple computing units (applying a nonlinear activation or threshold function), and synapses are local memories; the two are massively interconnected through communication channels. The ultimate goal of hardware implementations of neuro-inspired computing is to complement (rather than replace) today's mainstream von Neumann architectures for application-specific intelligent tasks such as image/speech recognition and autonomous driving.
1. Introduction and classification of neuromorphic hardware design methods
So far, various hardware platforms with partial parallelism have been developed to implement neuro-inspired learning algorithms. In general, there are two approaches to neuromorphic hardware design, distinguished by how information is encoded. The first approach stays with digital (non-spiking) implementations of machine/deep learning or artificial neural networks (ANNs), while taking inspiration from the nervous system to maximize parallel or distributed computing. In a digital implementation, neuron values are encoded by binary bits, by the number of pulses, or by voltage levels. As off-the-shelf technologies, graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) have been widely used for hardware acceleration of machine/deep learning.
To further improve energy efficiency, CMOS (complementary metal-oxide-semiconductor) application-specific integrated circuit (ASIC) accelerators have been prototyped. For example, Google uses its custom Tensor Processing Unit (TPU) platform to accelerate the complex intelligent computing tasks behind AlphaGo (Figure 3). The goal of the digital (non-spiking) approach is to improve computational efficiency in terms of throughput per unit power, measured in TOPS/W (tera, i.e., 10^12, operations per second per watt).
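As a rough illustration of the TOPS/W metric, the following Python sketch converts a throughput and a power figure into TOPS/W and into the equivalent energy per operation. The function name and the sample numbers are hypothetical and chosen only to make the arithmetic easy to follow; they do not describe any chip discussed here.

```python
def tops_per_watt(ops_per_second, power_watts):
    """Return energy efficiency in tera-operations per second per watt."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return ops_per_second / 1e12 / power_watts


# Hypothetical accelerator: 4e12 operations per second at 2 W.
efficiency = tops_per_watt(4e12, 2.0)

# 1 TOPS/W corresponds to 1e12 operations per joule, i.e. 1 pJ per operation,
# so the energy per operation in picojoules is the reciprocal of TOPS/W.
picojoules_per_op = 1.0 / efficiency

print(f"{efficiency:.1f} TOPS/W, {picojoules_per_op:.2f} pJ per operation")
```

Reading the metric as energy per operation is often more convenient when comparing designs that run at very different clock frequencies.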
The second approach uses the spiking behavior of spiking neural networks (SNNs) to mimic biological neural networks more closely. In the spiking approach, the value of a neuron is encoded by spike timing (for example, the interval between spikes) or even by the actual waveform shape of the spike. Examples include custom-designed CMOS-based neuromorphic chips (e.g., BrainScaleS from Heidelberg University and TrueNorth from IBM). The BrainScaleS platform is based on the HICANN chip at the 180-nm node, which uses analog neurons similar to the leaky integrate-and-fire model and digital synapses made of 4-bit 6-transistor static random-access memory (SRAM) cells, with a 4-bit digital-to-analog converter (DAC) interfacing between the digital synapses and the analog neurons. A die consists of 512 neurons and 100,000 synapses, and a wafer consists of ~200,000 neurons and ~40 million synapses. BrainScaleS can run 10,000 times faster than biological real time (~kHz), but at a power consumption of 500 W per wafer. TrueNorth chips use digital neurons and digital synapses made of 1-bit transposable 8-transistor SRAM cells. In particular, a TrueNorth chip integrates 4,096 neurosynaptic cores with 1 million digital neurons and 256 million SRAM synapses, fabricated at the 28-nm node. The TrueNorth chip demonstrates a power consumption of 70 mW while performing a real-time (30 fps) object recognition task at a very low clock frequency (~kHz).
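The leaky integrate-and-fire behavior referred to above can be sketched in a few lines of Python. This is a minimal discrete-time software illustration, not the analog circuit used in BrainScaleS; the function name and all parameter values (time step, time constant, threshold, input drive) are assumptions picked for readability.

```python
def simulate_lif(input_current, dt=1e-3, tau=20e-3, v_rest=0.0,
                 v_threshold=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire neuron with forward Euler steps.

    input_current: list of input drive values (arbitrary units), one per step.
    Returns (membrane potential trace, list of step indices with a spike).
    """
    v = v_rest
    trace, spike_steps = [], []
    for step, current in enumerate(input_current):
        # The leak pulls v back toward rest; the input drives it upward.
        v += dt / tau * (-(v - v_rest) + current)
        if v >= v_threshold:
            # Threshold crossing: record a spike and reset the potential.
            spike_steps.append(step)
            v = v_reset
        trace.append(v)
    return trace, spike_steps


# A constant drive above threshold gives regular spiking;
# a drive below threshold never reaches the threshold.
_, spikes_strong = simulate_lif([1.5] * 200)
_, spikes_weak = simulate_lif([0.8] * 200)
print(len(spikes_strong), len(spikes_weak))
```

In this toy model the information is carried by when the spikes occur: a stronger input shortens the interval between spikes.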
Table 1 summarizes the categories of different design approaches to hardware implementations for neuro-inspired computing. The categories are loosely classified by how information is encoded and by the technology choice for the hardware platform. Neuron values can be encoded with either a level representation (binary bits, number of pulses, or voltage levels) or a spike representation, while synapses can be binary or multilevel (analog).
Table 1. Categories of different design options for neuro-inspired computing hardware implementations (representative prototypes listed)
Level representation:
  Off-the-shelf technology: GPUs
  CMOS ASIC: CNN accelerators
  Emerging synaptic devices (analog synapses): UCSB's 12×12 crossbar array; UMich's 32×32 crossbar array; Tsinghua's 128×8 1T1R RRAM (resistive random-access memory) array; IBM's 500×661 1T1R PCM (phase-change memory) array; UCSB's 785×128 floating-gate transistor array
  Emerging synaptic devices (binary synapses): ASU/Tsinghua's 16 Mb 1T1R RRAM macro
Spike representation:
  Off-the-shelf technology: SpiNNaker
  CMOS ASIC: BrainScaleS (analog neurons); TrueNorth (digital neurons)
  Emerging synaptic devices: IBM's 256×256 1T1R PCM array with STDP (spike-timing-dependent plasticity) neuron circuit
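The two encoding families in Table 1 can be made concrete with a toy example. The sketch below encodes the same small integer as a binary word, as a pulse count, and as a spike train whose inter-spike interval shrinks as the value grows. All function names, the 4-bit width, the 15-slot window, and the interval mapping are illustrative assumptions, not the encodings used by any of the listed prototypes.

```python
def to_binary_bits(value, n_bits=4):
    """Level representation: fixed-width binary word, most significant bit first."""
    if not 0 <= value < 2 ** n_bits:
        raise ValueError("value does not fit in n_bits")
    return [(value >> shift) & 1 for shift in reversed(range(n_bits))]


def to_pulse_count(value, n_slots=15):
    """Level representation: the value equals the number of pulses in a window."""
    if not 0 <= value <= n_slots:
        raise ValueError("value does not fit in the pulse window")
    return [1] * value + [0] * (n_slots - value)


def to_spike_times(value, n_spikes=3, base_interval_ms=16.0):
    """Spike representation: a larger value gives a shorter inter-spike interval.

    The mapping interval = base / (value + 1) is an arbitrary illustrative choice.
    """
    interval_ms = base_interval_ms / (value + 1)
    return [interval_ms * (k + 1) for k in range(n_spikes)]


value = 7
bits = to_binary_bits(value)
pulses = to_pulse_count(value)
spike_times = to_spike_times(value)
print(bits, sum(pulses), spike_times)
```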
2. Offline and online training
There are two ways to train a neural network: offline (ex-situ) training and online (in-situ) training. Offline training means that the network is trained in software, the trained weights are loaded onto the synaptic array of the neuromorphic hardware through one-time programming, and the hardware then performs only inference or classification. For example, TrueNorth supports only offline training (the weights must be trained in advance and loaded into the SRAM synaptic array). Such an inference engine can therefore be used only on edge devices with a model predefined in the cloud; it cannot adapt to changing input data or learn new features at run time. Online training, on the other hand, means that the weights are trained on the neuromorphic hardware itself at run time. Accelerating training on neuromorphic hardware is a more challenging task.
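The difference between the two flows can be sketched in software. In the Python example below, `SynapseArray` is a hypothetical stand-in for a hardware synaptic array, and a simple perceptron rule on a toy OR task stands in for real weight training; neither is taken from TrueNorth or any other platform mentioned here. The offline array is programmed once and then refuses updates, while the online array updates its own weights at run time.

```python
def predict(weights, inputs):
    """Threshold neuron: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0


def train_step(weights, inputs, label, learning_rate=0.1):
    """One perceptron-rule update, used here as a stand-in for weight training."""
    error = label - predict(weights, inputs)
    return [w + learning_rate * error * x for w, x in zip(weights, inputs)]


class SynapseArray:
    """Hypothetical software stand-in for a hardware synaptic array."""

    def __init__(self, n_inputs, supports_online_training):
        self.weights = [0.0] * n_inputs
        self.supports_online_training = supports_online_training

    def program(self, weights):
        """One-time programming of pre-trained weights (offline flow)."""
        self.weights = list(weights)

    def infer(self, inputs):
        return predict(self.weights, inputs)

    def learn(self, inputs, label):
        """Weight update at run time (online flow only)."""
        if not self.supports_online_training:
            raise RuntimeError("inference-only array: weights are fixed")
        self.weights = train_step(self.weights, inputs, label)


# Toy task: logical OR; the third input is a constant bias term.
DATA = [([0, 0, 1], 0), ([0, 1, 1], 1), ([1, 0, 1], 1), ([1, 1, 1], 1)]

# Offline (ex-situ): train in software, then program the array once.
software_weights = [0.0, 0.0, 0.0]
for _ in range(10):
    for inputs, label in DATA:
        software_weights = train_step(software_weights, inputs, label)
offline_array = SynapseArray(3, supports_online_training=False)
offline_array.program(software_weights)

# Online (in-situ): the array itself updates its weights at run time.
online_array = SynapseArray(3, supports_online_training=True)
for _ in range(10):
    for inputs, label in DATA:
        online_array.learn(inputs, label)

print([offline_array.infer(x) for x, _ in DATA])
print([online_array.infer(x) for x, _ in DATA])
```

Both arrays end up classifying the toy data correctly; the difference is that only the online array could keep adapting if the input data changed after deployment.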
The weight update rules for machine/deep learning and for spiking neural networks are different. In machine/deep learning, layer-by-layer backpropagation (with stochastic gradient descent) is usually used to minimize a cost function that measures the error between the prediction and the true label; it is therefore a supervised, global training method. In contrast, spiking networks often use local synaptic plasticity (that is, plasticity between adjacent neurons) in an unsupervised manner. An important biologically plausible learning rule is spike-timing-dependent plasticity (STDP). The STDP rule states that if the presynaptic neuron fires earlier than the postsynaptic neuron, the conductance (weight) of the synapse increases, and vice versa. The closer the firing times of the two neurons, the greater the change in weight. However, how to use this STDP rule (unsupervised and local to two adjacent neurons) to update an entire neural network effectively remains to be explored.
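A pair-based STDP update can be sketched as follows, using the common convention that a presynaptic spike shortly before a postsynaptic spike strengthens the synapse and the reverse order weakens it. The exponential window, the function name, and all amplitudes and time constants are assumptions for illustration; real devices and circuits implement a variety of window shapes.

```python
import math


def stdp_delta_w(t_pre_ms, t_post_ms, a_plus=0.01, a_minus=0.012,
                 tau_plus_ms=20.0, tau_minus_ms=20.0):
    """Weight change for one pre/post spike pair under an exponential STDP window."""
    dt_ms = t_post_ms - t_pre_ms
    if dt_ms > 0:
        # Pre fires before post: potentiation, larger for smaller gaps.
        return a_plus * math.exp(-dt_ms / tau_plus_ms)
    if dt_ms < 0:
        # Post fires before pre: depression, larger for smaller gaps.
        return -a_minus * math.exp(dt_ms / tau_minus_ms)
    return 0.0


# Hypothetical spike pairs (pre time, post time) in milliseconds.
pairs = [(0.0, 5.0), (0.0, 20.0), (5.0, 0.0), (20.0, 0.0)]
for t_pre, t_post in pairs:
    print(t_pre, t_post, round(stdp_delta_w(t_pre, t_post), 5))
```

The update depends only on the spike times of the two neurons that share the synapse, which is what makes the rule local and unsupervised.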
So far, in solving practical classification problems (such as image/speech recognition), the learning accuracy of machine/deep learning with backpropagation is significantly better than that of spiking neural networks with STDP learning. As a result, we currently focus more on the machine/deep learning design perspective (rather than on spiking neural networks).
S. A. McKee, “Reflections on the memory wall,” in Proc. Conf. Comput. Front., 2004, p. 162.
 C.-S. Poon and K. Zhou, “Neuromorphic silicon neurons and large-scale neural networks: Challenges and opportunities,” Front. Neurosci., vol. 5, no. 108, pp. 1–3, 2011.
S. Chetlur et al., “cuDNN: Efficient primitives for deep learning,” Computer Science, Oct. 2014.
 G. Lacey, G. W. Taylor, and S. Areibi, “Deep learning on FPGAs: Past, present, and future,” 2016.
G. Desoli et al., “A 2.9 TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 238–239.
 N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in Proc. ACM/IEEE Int. Symp. Comput. Architecture (ISCA), 2017.
N. Jouppi, Google Supercharges Machine Learning Tasks With TPU Custom Chip, 2016. [Online]. Available: https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html
 J. Schemmel, D. Bruderle, A. Grubl, M. Hock, K. Meier, and S. Millner, “A waferscale neuromorphic hardware system for large-scale neural modeling,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2010, pp. 1947–1950.
 S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, “The SpiNNaker project,” Proc. IEEE, vol. 102, no. 5, pp. 652–665, May 2014.
S. Schmitt et al., “Neuromorphic hardware in the loop: Training a deep spiking network on the BrainScaleS wafer-scale system,” in Proc. Int. Joint Conf. Neural Netw. (IJCNN), 2017.
P. A. Merolla et al., “A million spiking-neuron integrated circuit with a scalable communication network and interface,” Science, vol. 345, no. 6197, pp. 668–673, Aug. 2014.
 M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature, vol. 521, pp. 61–64, May 2015.
 P. M. Sheridan, F. Cai, C. Du, W. Ma, Z. Zhang, and W. D. Lu, “Sparse coding with memristor networks,” Nature Nanotechnol., vol. 12, pp. 784–789, May 2017.
 P. Yao et al., “Face classification using electronic synapses,” Nature Commun., vol. 8, p. 15199, Feb. 2017.
 S. Kim et al., “NVM neuromorphic core with 64k-cell (256-by-256) phase change memory synaptic array with on-chip neuron circuits for continuous in-situ learning,” in IEDM Tech. Dig., 2015.
 X. Guo et al., “Fast, energy-efficient, robust, and reproducible mixed-signal neuromorphic classifier based on embedded NOR flash memory technology,” in IEDM Tech. Dig., 2017.
S. Yu et al., “Binary neural network with 16 Mb RRAM macro chip for classification and online training,” in IEDM Tech. Dig., Dec. 2016.
C. Zamarreño-Ramos, L. A. Camuñas-Mesa, J. A. Pérez-Carrasco, T. Masquelier, T. Serrano-Gotarredona, and B. Linares-Barranco, “On spike-timing-dependent plasticity, memristive devices, and building a self-learning visual cortex,” Front. Neurosci., vol. 5, no. 26, pp. 1–22, 2011.